git.ipfire.org Git - thirdparty/kernel/stable.git/log

btrfs: enhance error messages for delalloc range failure

When running emulated write error tests like generic/475, we can hit
error messages like this:

BTRFS error (device dm-12 state EA): run_delalloc_nocow failed, root=596 inode=264 start=1605632 len=73728: -5
BTRFS error (device dm-12 state EA): failed to run delalloc range, root=596 ino=264 folio=1605632 submit_bitmap=0-7 start=1605632 len=73728: -5

Which is normally buried by direct IO error messages.

However above error messages are not enough to determine which is the
real range that caused the error.
Considering we can have multiple different extents in one delalloc
range (e.g. some COW extents along with some NOCOW extents), just
outputting the error at the end of run_delalloc_nocow() is not enough.

To enhance the error messages:

- Remove the rate limit on the existing error messages
  In the generic/475 example, most error messages are from direct IO,
  not really from the delalloc range.
  Considering how useful the delalloc range error messages are, we don't
  want they to be rate limited.

- Add extra @cur_offset output for cow_file_range()
- Add extra variable output for run_delalloc_nocow()
  This is especially important for run_delalloc_nocow(), as there
  are extra error paths where we can hit error without into
  nocow_one_range() nor fallback_to_cow().

- Add an error message for nocow_one_range()
  That's the missing part.
  For fallback_to_cow(), we have error message from cow_file_range()
  already.

- Constify the @len and @end local variables for nocow_one_range()
  This makes it much easier to make sure @len and @end are not modified
  at runtime.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rework error handling of run_delalloc_nocow()

Currently the error handling of run_delalloc_nocow() modifies
@cur_offset to handle different parts of the delalloc range.

However the error handling can always be split into 3 parts:

1) The range with ordered extents allocated (OE cleanup)
   We have to cleanup the ordered extents and unlock the folios.

2) The range that have been cleaned up (skip)
   For now it's only for the range of fallback_to_cow().

   We should not touch it at all.

3) The range that is not yet touched (untouched)
   We have to unlock the folios and clear any reserved space.

This 3 ranges split has the same principle as cow_file_range(), however
the NOCOW/COW handling makes the above 3 range split much more complex:

a) Failure immediately after a successful OE allocation
   Thus no @cow_start nor @cow_end set.

   start         cur_offset               end
   | OE cleanup  |       untouched        |

b) Failure after hitting a COW range but before calling
   fallback_to_cow()

   start        cow_start    cur_offset   end
   | OE Cleanup |       untouched         |

c) Failure to call fallback_to_cow()

   start        cow_start    cow_end      end
   | OE Cleanup |    skip    |  untouched |

Instead of modifying @cur_offset, do proper range calculation for
OE-cleanup and untouched ranges using above 3 cases with proper range
charts.

This avoid updating @cur_offset, as it will an extra output for debug
purposes later, and explain the behavior easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add mount option for ref_tracker

The ref_tracker infrastructure aids debugging but is not enabled by
default as it has a performance impact. Add mount option 'ref_tracker'
so it can be selectively enabled on a filesystem. Currently it track
references of 'delayed inodes'.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print leaked references in kill_all_delayed_nodes()

We are seeing soft lockups in kill_all_delayed_nodes due to a
delayed_node with a lingering reference count of 1. Printing at this
point will reveal the guilty stack trace. If the delayed_node has no
references there should be no output.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: implement ref_tracker for delayed_nodes

Add ref_tracker infrastructure for struct btrfs_delayed_node.

It is a response to the largest btrfs related crash in our fleet.  We're
seeing soft lockups in btrfs_kill_all_delayed_nodes() that seem to be a
result of delayed_nodes not being released properly.

A ref_tracker object is allocated on reference count increases and freed
on reference count decreases. The ref_tracker object stores a stack
trace of where it is allocated. The ref_tracker_dir object is embedded
in btrfs_delayed_node and keeps track of all current and some old/freed
ref_tracker objects. When a leak is detected we can print the stack
traces for all ref_trackers that have not yet been freed.

Here is a common example of taking a reference to a delayed_node and
freeing it with ref_tracker.

    struct btrfs_ref_tracker tracker;
    struct btrfs_delayed_node *node;

    node = btrfs_get_delayed_node(inode, &tracker);
    // use delayed_node...
    btrfs_release_delayed_node(node, &tracker);

There are two special cases where the delayed_node reference is "long
lived", meaning that the thread that takes the reference and the thread
that releases the reference are different. The 'inode_cache_tracker'
tracks the delayed_node stored in btrfs_inode. The 'node_list_tracker'
tracks the delayed_node stored in the btrfs_delayed_root
node_list/prepare_list. These trackers are embedded in the
btrfs_delayed_node.

btrfs_ref_tracker and btrfs_ref_tracker_dir are wrappers that either
compile to the corresponding ref_tracker structs or empty structs
depending on CONFIG_BTRFS_DEBUG. There are also btrfs wrappers for
the ref_tracker API.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: convert several int parameters to bool

We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping. Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move ref-verify under CONFIG_BTRFS_DEBUG

Remove CONFIG_BTRFS_FS_REF_VERIFY Kconfig and add it as part of
CONFIG_BTRFS_DEBUG. This should not be impactful to the performance
of debug. The struct btrfs_ref takes an additional u64, btrfs_fs_info
takes an additional spinlock_t and rb_root. All of the ref_verify logic
is still protected by a mount option.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use PTR_ERR_OR_ZERO() to simplify code inbtrfs_control_ioctl()

Use the standard error pointer macro to simplify the code.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify support block size check

Currently we manually check the block size against 3 different values:

- 4K
- PAGE_SIZE
- MIN_BLOCKSIZE

Those 3 values can match or differ from each other.  This makes it
pretty complex to output the supported block sizes.

Considering we're going to add block size > page size support soon, this
can make the support block size sysfs attribute much harder to
implement.

To make it easier, factor out a helper, btrfs_supported_blocksize() to
do a simple check for the block size.

Then utilize it in the two locations:

- btrfs_validate_super()
  This is very straightforward

- supported_sectorsizes_show()
  Iterate through all valid block sizes, and only output supported ones.

  This is to make future full range block sizes support much easier.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use blocksize to check if compression is making things larger

[BEHAVIOR DIFFERENCE BETWEEN COMPRESSION ALGOS]
Currently LZO compression algorithm will check if we're making the
compressed data larger after compressing more than 2 blocks.

But zlib and zstd do the same checks after compressing more than 8192
bytes.

This is not a big deal, but since we're already supporting larger block
size (e.g. 64K block size if page size is also 64K), this check is not
suitable for all block sizes.

For example, if our page and block size are both 16KiB, and after the
first block compressed using zlib, the resulted compressed data is
slightly larger than 16KiB, we will immediately abort the compression.

This makes zstd and zlib compression algorithms to behave slightly
different from LZO, which only aborts after compressing two blocks.

[ENHANCEMENT]
To unify the behavior, only abort the compression after compressing at
least two blocks.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass btrfs_inode pointer directly into btrfs_compress_folios()

For the 3 supported compression algorithms, two of them (zstd and zlib)
are already grabbing the btrfs inode for error messages.

It's more common to pass btrfs_inode and grab the address space from it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zoned: refine extent allocator hint selection

The hint block group selection in the extent allocator is wrong in the
first place, as it can select the dedicated data relocation block group for
the normal data allocation.

Since we separated the normal data space_info and the data relocation
space_info, we can easily identify a block group is for data relocation or
not. Do not choose it for the normal data allocation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: try to search for data csums in commit root

If you run a workload with:

- a cgroup that does tons of parallel data reading, with a working set
  much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
  with no memory limit
(see full code listing at the bottom for a reproducer)

Then what quickly occurs is:

- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
  memory pressure

The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.

As a result of this, we achieve an unpleasant priority inversion. We
have:

- a high degree of contention on the csum root node (and other upper
  nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
  (running delayed_refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.

To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):

====== big-read.c ======
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <errno.h>

  #define BUF_SZ (128 * (1 << 10UL))

  int read_once(int fd, size_t sz) {
   char buf[BUF_SZ];
   size_t rd = 0;
   int ret = 0;

   while (rd < sz) {
   ret = read(fd, buf, BUF_SZ);
   if (ret < 0) {
   if (errno == EINTR)
   continue;
   fprintf(stderr, "read failed: %d\n", errno);
   return -errno;
   } else if (ret == 0) {
   break;
   } else {
   rd += ret;
   }
   }
   return rd;
  }

  int read_loop(char *fname) {
   int fd;
   struct stat st;
   size_t sz = 0;
   int ret;

   while (1) {
   fd = open(fname, O_RDONLY);
   if (fd == -1) {
   perror("open");
   return 1;
   }
   if (!sz) {
   if (!fstat(fd, &st)) {
   sz = st.st_size;
   } else {
   perror("stat");
   return 1;
   }
   }

                  ret = read_once(fd, sz);
   close(fd);
   }
  }

  int main(int argc, char *argv[]) {
   int fd;
   struct stat st;
   off_t sz;
   char *buf;
   int ret;

   if (argc != 2) {
   fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
   return 1;
   }

   return read_loop(argv[1]);
  }

====== repro.sh ======
  #!/usr/bin/env bash

  SCRIPT=$(readlink -f "$0")
  DIR=$(dirname "$SCRIPT")

  dev=$1
  mnt=$2
  shift
  shift

  CG_ROOT=/sys/fs/cgroup
  BAD_CG=$CG_ROOT/bad-nbr
  GOOD_CG=$CG_ROOT/good-nbr
  NR_BIGGOS=1
  NR_LITTLE=10
  NR_VICTIMS=32
  NR_VILLAINS=512

  START_SEC=$(date +%s)

  _elapsed() {
   echo "elapsed: $(($(date +%s) - $START_SEC))"
  }

  _stats() {
   local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

   echo "================"
   date
   _elapsed
   cat $sysfs/commit_stats
   cat $BAD_CG/memory.pressure
  }

  _setup_cgs() {
   echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
   mkdir -p $GOOD_CG
   mkdir -p $BAD_CG
   echo max > $BAD_CG/memory.max
   # memory.high much less than the working set will cause heavy reclaim
   echo $((1 << 30)) > $BAD_CG/memory.high

   # victims get a subset of villain CPUs
   echo 0 > $GOOD_CG/cpuset.cpus
   echo 0,1,2,3 > $BAD_CG/cpuset.cpus
  }

  _kill_cg() {
   local cg=$1
   local attempts=0
   echo "kill cgroup $cg"
   [ -f $cg/cgroup.procs ] || return
   while true; do
   attempts=$((attempts + 1))
   echo 1 > $cg/cgroup.kill
   sleep 1
   procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
   [ $procs -eq 0 ] && break
   done
   rmdir $cg
   echo "killed cgroup $cg in $attempts attempts"
  }

  _biggo_vol() {
   echo $mnt/biggo_vol.$1
  }

  _biggo_file() {
   echo $(_biggo_vol $1)/biggo
  }

  _subvoled_biggos() {
   total_sz=$((10 << 30))
   per_sz=$((total_sz / $NR_VILLAINS))
   dd_count=$((per_sz >> 20))
   echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
   for i in $(seq $NR_VILLAINS)
   do
   btrfs subvol create $(_biggo_vol $i) &>/dev/null
   dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
   done
   echo "done creating subvols."
  }

  _setup() {
   [ -f .done ] && rm .done
   findmnt -n $dev && exit 1
        if [ -f .re-mkfs ]; then
mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
else
echo "touch .re-mkfs to populate the test fs"
fi

   mount -o noatime $dev $mnt || exit 3
   [ -f .re-mkfs ] && _subvoled_biggos
   _setup_cgs
  }

  _my_cleanup() {
   echo "CLEANUP!"
   _kill_cg $BAD_CG
   _kill_cg $GOOD_CG
   sleep 1
   umount $mnt
  }

  _bad_exit() {
   _err "Unexpected Exit! $?"
   _stats
   exit $?
  }

  trap _my_cleanup EXIT
  trap _bad_exit INT TERM

  _setup

  # Use a lot of page cache reading the big file
  _villain() {
   local i=$1
   echo $BASHPID > $BAD_CG/cgroup.procs
   $DIR/big-read $(_biggo_file $i)
  }

  # Hit del_csum a lot by overwriting lots of small new files
  _victim() {
   echo $BASHPID > $GOOD_CG/cgroup.procs
   i=0;
   while (true)
   do
   local tmp=$mnt/tmp.$i

   dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
   i=$((i+1))
   [ $i -eq $NR_LITTLE ] && i=0
   done
  }

  _one_sync() {
   echo "sync..."
   before=$(date +%s)
   sync
   after=$(date +%s)
   echo "sync done in $((after - before))s"
   _stats
  }

  # sync in a loop
  _sync() {
   echo "start sync loop"
   syncs=0
   echo $BASHPID > $GOOD_CG/cgroup.procs
   while true
   do
   [ -f .done ] && break
   _one_sync
   syncs=$((syncs + 1))
   [ -f .done ] && break
   sleep 10
   done
   if [ $syncs -eq 0 ]; then
   echo "do at least one sync!"
   _one_sync
   fi
   echo "sync loop done."
  }

  _sleep() {
   local time=${1-60}
   local now=$(date +%s)
   local end=$((now + time))
   while [ $now -lt $end ];
   do
   echo "SLEEP: $((end - now))s left. Sleep 10."
   sleep 10
   now=$(date +%s)
   done
  }

  echo "start $NR_VILLAINS villains"
  for i in $(seq $NR_VILLAINS)
  do
   _villain $i &
   disown # get rid of annoying log on kill (done via cgroup anyway)
  done

  echo "start $NR_VICTIMS victims"
  for i in $(seq $NR_VICTIMS)
  do
   _victim &
   disown
  done

  _sync &
  SYNC_PID=$!

  _sleep $1
  _elapsed
  touch .done
  wait $SYNC_PID

  echo "OK"
  exit 0

Without this patch, that reproducer:

- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul  9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
                 btrfs_tree_read_lock_nested
                 btrfs_read_lock_root_node
                 btrfs_search_slot
                 btrfs_lookup_csum
                 btrfs_lookup_bio_sums
                 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
                 btrfs_tree_lock_nested
                 btrfs_lock_root_node
                 btrfs_search_slot
                 btrfs_del_csums
                 __btrfs_run_delayed_refs
                 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.

sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
                 shrink_folio_list
                 shrink_lruvec
                 shrink_node
                 do_try_to_free_pages
                 try_to_free_mem_cgroup_pages
                 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove duplicate inclusion of linux/types.h

In messages.h there's linux/types.h included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22939
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zoned: return error from btrfs_zone_finish_endio()

Now that btrfs_zone_finish_endio_workfn() is directly calling
do_zone_finish() the only caller of btrfs_zone_finish_endio() is
btrfs_finish_one_ordered().

btrfs_finish_one_ordered() already has error handling in-place so
btrfs_zone_finish_endio() can return an error if the block group lookup
fails.

Also as btrfs_zone_finish_endio() already checks for zoned filesystems and
returns early, there's no need to do this in the caller.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zoned: directly call do_zone_finish() from btrfs_zone_finish_endio_workfn()

When btrfs_zone_finish_endio_workfn() is calling btrfs_zone_finish_endio()
it already has a pointer to the block group. Furthermore
btrfs_zone_finish_endio() does additional checks if the block group can be
finished or not.

But in the context of btrfs_zone_finish_endio_workfn() only the actual
call to do_zone_finish() is of interest, as the skipping condition when
there is still room to allocate from the block group cannot be checked.

Directly call do_zone_finish() on the block group.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: collapse unaccount_log_buffer() into clean_log_buffer()

There's one only one caller of unaccount_log_buffer() and both this
function and the caller are short, so move its code into the caller.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use local key variable to pass arguments in replay_one_extent()

Instead of extracting again the disk_bytenr and disk_num_bytes values from
the file extent item to pass to btrfs_qgroup_trace_extent(), use the key
local variable 'ins' which already has those values, reducing the size of
the source code.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: process inline extent earlier in replay_one_extent()

Instead of having an if statement to check for regular and prealloc
extents first, process them in a block, and then following with an else
statement to check for an inline extent, check for an inline extent first,
process it and jump to the 'update_inode' label, allowing us to avoid
having the code for processing regular and prealloc extents inside a
block, reducing the high indentation level by one and making the code
easier to read and avoid line splittings too.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: exit early when replaying hole file extent item from a log tree

At replay_one_extent(), we can jump to the code that updates the file
extent range and updates the inode when processing a file extent item that
represents a hole and we don't have the NO_HOLES feature enabled. This
helps reduce the high indentation level by one in replay_one_extent() and
avoid splitting some lines to make the code easier to read.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: abort transaction where errors happen during log tree replay

In the replay_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
replay_one_buffer() callback and the functions it calls, making it as
fine grained as possible, so that it helps figuring out why failures
happen.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: return real error from read_alloc_one_name() in drop_one_dir_item()

If read_alloc_one_name() we explicitly return -ENOMEM and currently that
is fine since it's the only error read_alloc_one_name() can return for
now. However this is fragile and not future proof, so return instead what
read_alloc_one_name() returned.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use local variable for the transaction handle in replay_one_buffer()

Instead of keep dereferencing the walk_control structure to extract the
transaction handle whenever is needed, do it once by storing it in a local
variable and then use the variable everywhere. This reduces code verbosity
and eliminates the need for some split lines.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: abort transaction in the process_one_buffer() log tree walk callback

In the process_one_buffer() log tree walk callback we return errors to the
log tree walk caller and then the caller aborts the transaction, if we
have one, or turns the fs into error state if we don't have one. While
this reduces code it makes it harder to figure out where exactly an error
came from. So add the transaction aborts after every failure inside the
process_one_buffer() callback, so that it helps figuring out why failures
happen.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: abort transaction on specific error places when walking log tree

We do several things while walking a log tree (for replaying and for
freeing a log tree) like reading extent buffers and cleaning them up,
but we don't immediately abort the transaction, or turn the fs into an
error state, when one of these things fails. Instead we the transaction
abort or turn the fs into error state in the caller of the entry point
function that walks a log tree - walk_log_tree() - which means we don't
get to know exactly where an error came from.

Improve on this by doing a transaction abort / turn fs into error state
after each such failure so that when it happens we have a better
understanding where the failure comes from. This deliberately leaves
the transaction abort / turn fs into error state in the callers of
walk_log_tree() as to ensure we don't get into an inconsistent state in
case we forget to do it deeper in call chain. It also deliberately does
not do it after errors from the calls to the callback defined in
struct walk_control::process_func(), as we will do it later on another
patch.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: replace double boolean parameters of cow_file_range()

The function cow_file_range() has two boolean parameters. Replace it
with a single @flags parameter, with two flags:

- COW_FILE_RANGE_NO_INLINE
- COW_FILE_RANGE_KEEP_LOCKED

And since we're here, also update the comments of cow_file_range() to
replace the old "page" usage with "folio".

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Linux 6.17-rc7

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"Fixes to the Allwinner and Renesas clk drivers:

   - Do the math properly in Allwinner's ccu_mp_recalc_rate() so clk
     rates aren't bogus

   - Fix a clock domain regression on Renesas R-Car M1A, R-Car H1,
     and RZ/A1 by registering the domain after the pmdomain bus is
     registered instead of before"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: sunxi-ng: mp: Fix dual-divider clock rate readback
  clk: renesas: mstp: Add genpd OF provider at postcore_initcall()

Merge tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull a few more btrfs fixes from David Sterba:

- in tree-checker, fix wrong size of check for inode ref item

- in ref-verify, handle combination of mount options that allow
   partially damaged extent tree (reported by syzbot)

- additional validation of compression mount option to catch invalid
   string as level

* tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: reject invalid compression level
  btrfs: ref-verify: handle damaged extent root tree
  btrfs: tree-checker: fix the incorrect inode ref size check

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"One driver fix for a dma error checking thinko"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: mcq: Fix memory allocation checks for SQE and CQE

Merge tag 'firewire-fixes-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fix from Takashi Sakamoto:
"When new structures and events were added to UAPI in v6.5 kernel, the
  required update to the subsystem ABI version returned to userspace
  client was overlooked. The version is now updated"

* tag 'firewire-fixes-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: fix overlooked update of subsystem ABI version

Merge tag 'x86-urgent-2025-09-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
"Fix a SEV-SNP regression when CONFIG_KVM_AMD_SEV is disabled"

* tag 'x86-urgent-2025-09-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Guard sev_evict_cache() with CONFIG_AMD_MEM_ENCRYPT

Merge tag 'sunxi-clk-fixes-for-6.17' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into clk-fixes

Pull an Allwinner clk driver fix from Chen-Yu Tsai:

- One fix for the clock rate readback on the recently added dual
divider clocks

* tag 'sunxi-clk-fixes-for-6.17' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
clk: sunxi-ng: mp: Fix dual-divider clock rate readback

firewire: core: fix overlooked update of subsystem ABI version

In kernel v6.5, several functions were added to the cdev layer. This
required updating the default version of subsystem ABI up to 6, but
this requirement was overlooked.

This commit updates the version accordingly.

Fixes: 6add87e9764d ("firewire: cdev: add new version of ABI to notify time stamp at request/response subaction of transaction#")
Link: https://lore.kernel.org/r/20250920025148.163402-1-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>

Merge tag '6.17-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Two unlink fixes: one for rename and one for deferred close

- Four smbdirect/RDMA fixes: fix buffer leak in negotiate, two fixes
   for races in smbd_destroy, fix offset and length checks in recv_done

* tag '6.17-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: fix smbdirect_recv_io leak in smbd_negotiate() error path
  smb: client: fix file open check in __cifs_unlink()
  smb: client: let smbd_destroy() call disable_work_sync(&info->post_send_credits_work)
  smb: client: use disable[_delayed]_work_sync in smbdirect.c
  smb: client: fix filename matching of deferred files
  smb: client: let recv_done verify data_offset, data_length and remaining_data_length

Merge tag 'iommu-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu fixes from Joerg Roedel:

- Fixes for memory leak and memory corruption bugs on S390 and AMD-Vi

- Race condition fix in AMD-Vi page table code and S390 device attach
   code

- Intel VT-d: Fix alignment checks in __domain_mapping()

- AMD-Vi: Fix potentially incorrect DTE settings when device has
   aliases

* tag 'iommu-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
  iommu/amd/pgtbl: Fix possible race while increase page table level
  iommu/amd: Fix alias device DTE setting
  iommu/s390: Make attach succeed when the device was surprise removed
  iommu/vt-d: Fix __domain_mapping()'s usage of switch_to_super_page()
  iommu/s390: Fix memory corruption when using identity domain
  iommu/amd: Fix ivrs_base memleak in early_amd_iommu_init()

Merge tag 'block-6.17-20250918' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
"A set of fixes for an issue with md array assembly and drbd for
  devices supporting write zeros"

* tag 'block-6.17-20250918' of git://git.kernel.dk/linux:
  drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
  md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter

Merge tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

- Fix for a regression introduced in the io-wq worker creation logic.

- Remove the allocation cache for the msg_ring io_kiocb allocations. I
   have a suspicion that there's a bug there, and since we just fixed
   one in that area, let's just yank the use of that cache entirely.
   It's not that important, and it kills some code.

- Treat a closed ring like task exiting in that any requests that
   trigger post that condition should just get canceled. Doesn't fix any
   real issues, outside of having tasks being able to rely on that
   guarantee.

- Fix for a bug in the network zero-copy notification mechanism, where
   a comparison for matching tctx/ctx for notifications was buggy in
   that it didn't correctly compare with the previous notification.

* tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux:
  io_uring: fix incorrect io_kiocb reference in io_link_skb
  io_uring/msg_ring: kill alloc_cache for io_kiocb allocations
  io_uring: include dying ring in task_work "should cancel" state
  io_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflow

Merge tag 'gpio-fixes-for-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

- fix an ACPI I2C HID driver breakage due to not initializing a
   structure on the stack and passing garbage down to GPIO core

- ignore touchpad wakeup on GPD G1619-05

- fix debouncing configuration when looking up GPIOs in ACPI

* tag 'gpio-fixes-for-v6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpiolib: acpi: initialize acpi_gpio_info struct
  gpiolib: acpi: Ignore touchpad wakeup on GPD G1619-05
  gpiolib: acpi: Program debounce when finding GPIO

Merge tag 'mmc-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC host fixes from Ulf Hansson:

- mvsdio: Fix dma_unmap_sg() nents value

- sdhci: Fix clock management for UHS-II

- sdhci-pci-gli: Fix initialization of UHS-II for GL9767

* tag 'mmc-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-pci-gli: GL9767: Fix initializing the UHS-II interface during a power-on
  mmc: sdhci-uhs2: Fix calling incorrect sdhci_set_clock() function
  mmc: sdhci: Move the code related to setting the clock from sdhci_set_ios_common() into sdhci_set_ios()
  mmc: mvsdio: Fix dma_unmap_sg() nents value

Merge tag 'pmdomain-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm

Pull pmdomain fixes from Ulf Hansson:
"pmdomain core:
   - Restore behaviour for disabling unused PM domains and introduce the
     GENPD_FLAG_NO_STAY_ON configuration bit

  pmdomain providers:
   - renesas: Don't keep unused PM domains powered-on
   - rockchip: Fix regulator dependency with GENPD_FLAG_NO_STAY_ON"

* tag 'pmdomain-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
  pmdomain: renesas: rmobile-sysc: Don't keep unused PM domains powered-on
  pmdomain: renesas: rcar-gen4-sysc: Don't keep unused PM domains powered-on
  pmdomain: renesas: rcar-sysc: Don't keep unused PM domains powered-on
  pmdomain: rockchip: Fix regulator dependency with GENPD_FLAG_NO_STAY_ON
  pmdomain: core: Restore behaviour for disabling unused PM domains
  pmdomain: renesas: rcar-sysc: Make rcar_sysc_onecell_np __initdata

Merge tag 'loongarch-fixes-6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
"Fix some build warnings for RUST-enabled objtool check, align ACPI
  structures for ARCH_STRICT_ALIGN, fix an unreliable stack for live
  patching, add some NULL pointer checkings, and fix some bugs around
  KVM"

* tag 'loongarch-fixes-6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_pch_pic_regs_access()
  LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_sw_status_access()
  LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_regs_access()
  LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_ctrl_access()
  LoongArch: KVM: Fix VM migration failure with PTW enabled
  LoongArch: KVM: Remove unused returns and semicolons
  LoongArch: vDSO: Check kcalloc() result in init_vdso()
  LoongArch: Fix unreliable stack for live patching
  LoongArch: Replace sprintf() with sysfs_emit()
  LoongArch: Check the return value when creating kobj
  LoongArch: Align ACPI structures if ARCH_STRICT_ALIGN enabled
  LoongArch: Update help info of ARCH_STRICT_ALIGN
  LoongArch: Handle jump tables options for RUST
  LoongArch: Make LTO case independent in Makefile
  objtool/LoongArch: Mark special atomic instruction as INSN_BUG type
  objtool/LoongArch: Mark types based on break immediate code

Merge tag 'v6.17-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
"This fixes a NULL pointer dereference in ccp and a couple of bugs in
  the af_alg interface"

* tag 'v6.17-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg
  crypto: af_alg - Set merge to zero early in af_alg_sendmsg
  crypto: ccp - Always pass in an error pointer to __sev_platform_shutdown_locked()

Merge tag 'sound-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of small fixes. The volume became higher than wished, but
  nothing really stands out -- all small, nice and smooth.

  A slightly large change is found in qcom USB-audio offload stuff, but
  this is a regression fix specific to this device, hence it should be
  safe to apply at this late stage.

   - Various small fixes for ASoC Cirrus, Realtek, lpass, Intel and
     Qualcomm drivers

   - ASoC SoundWire fixes

   - A few TAS2781 HD-audio side-codec driver fixes

   - A fix for Qualcomm USB-audio offload breakage

   - Usual a few HD-audio quirks"

* tag 'sound-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (35 commits)
  ALSA: hda/realtek: Fix mute led for HP Laptop 15-dw4xx
  ALSA: hda: intel-dsp-config: Prevent SEGFAULT if ACPI_HANDLE() is NULL
  ALSA: usb: qcom: Fix false-positive address space check
  ASoC: rt5682s: Adjust SAR ADC button mode to fix noise issue
  ASoC: Intel: PTL: Add entry for HDMI-In capture support to non-I2S codec boards.
  ASoC: amd: acp: Fix incorrect retrival of acp_chip_info
  ASoC: Intel: sof_sdw: use PRODUCT_FAMILY for Fatcat series
  ASoC: qcom: sc8280xp: Fix sound card driver name match data for QCS8275
  ALSA: hda/realtek: Fix volume control on Lenovo Thinkbook 13x Gen 4
  ALSA: hda/realtek: Support Lenovo Thinkbook 13x Gen 5
  ALSA: hda: cs35l41: Support Lenovo Thinkbook 13x Gen 5
  ALSA: hda/realtek: Add ALC295 Dell TAS2781 I2C fixup
  ALSA: hda/tas2781: Fix a potential race condition that causes a NULL pointer in case no efi.get_variable exsits
  ASoC: qcom: sc8280xp: Enable DAI format configuration for MI2S interfaces
  ASoC: qcom: q6apm-lpass-dais: Fix missing set_fmt DAI op for I2S
  ASoC: qcom: audioreach: Fix lpaif_type configuration for the I2S interface
  ASoC: Intel: catpt: Expose correct bit depth to userspace
  ALSA: hda/tas2781: Fix the order of TAS2781 calibrated-data
  ASoC: codecs: lpass-wsa-macro: Fix speaker quality distortion
  ASoC: codecs: lpass-rx-macro: Fix playback quality distortion
  ...

Merge tag 'drm-fixes-2025-09-19' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly fixes for drm, it's a bit busier than I'd like on the xe side
  this week, but otherwise amdgpu and some smaller fixes for i915/bridge
  and a revert on docs.

  docs:
   - fix docs build regression

  i915:
   - Honor VESA eDP backlight luminance control capability

  bridge:
   - anx7625: Fix NULL pointer dereference with early IRQ
   - cdns-mhdp8546: Fix missing mutex unlock on error path

  xe:
   - Release kobject for the failure path
   - SRIOV PF: Drop rounddown_pow_of_two fair
   - Remove type casting on hwmon
   - Defer free of NVM auxiliary container to device release
   - Fix a NULL vs IS_ERR
   - Add cleanup action in xe_device_sysfs_init
   - Fix error handling if PXP fails to start
   - Set GuC RCS/CCS yield policy

  amdgpu:
   - GC 11.0.1/4 cleaner shader support
   - DC irq fix
   - OD fix

  amdkfd:
   - S0ix fix"

* tag 'drm-fixes-2025-09-19' of https://gitlab.freedesktop.org/drm/kernel:
  drm/amdgpu: suspend KFD and KGD user queues for S0ix
  drm/amdkfd: add proper handling for S0ix
  drm/xe/guc: Set RCS/CCS yield policy
  drm/xe: Fix error handling if PXP fails to start
  drm/xe/sysfs: Add cleanup action in xe_device_sysfs_init
  drm/amd: Only restore cached manual clock settings in restore if OD enabled
  drm/xe: Fix a NULL vs IS_ERR() in xe_vm_add_compute_exec_queue()
  drm: bridge: cdns-mhdp8546: Fix missing mutex unlock on error path
  drm/i915/backlight: Honor VESA eDP backlight luminance control capability
  drm/amd/display: Allow RX6xxx & RX7700 to invoke amdgpu_irq_get/put
  drm/amdgpu/gfx11: Add Cleaner Shader Support for GFX11.0.1/11.0.4 GPUs
  drm: bridge: anx7625: Fix NULL pointer dereference with early IRQ
  drm/xe: defer free of NVM auxiliary container to device release callback
  drm/xe/hwmon: Remove type casting
  drm/xe/pf: Drop rounddown_pow_of_two fair LMEM limitation
  drm/xe/tile: Release kobject for the failure path
  Revert "drm: Add directive to format code in comment"

io_uring: fix incorrect io_kiocb reference in io_link_skb

In io_link_skb function, there is a bug where prev_notif is incorrectly
assigned using 'nd' instead of 'prev_nd'. This causes the context
validation check to compare the current notification with itself instead
of comparing it with the previous notification.

Fix by using the correct prev_nd parameter when obtaining prev_notif.

Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 6fe4220912d19 ("io_uring/notif: implement notification stacking")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

iommu/amd/pgtbl: Fix possible race while increase page table level

The AMD IOMMU host page table implementation supports dynamic page table levels
(up to 6 levels), starting with a 3-level configuration that expands based on
IOVA address. The kernel maintains a root pointer and current page table level
to enable proper page table walks in alloc_pte()/fetch_pte() operations.

The IOMMU IOVA allocator initially starts with 32-bit address and onces its
exhuasted it switches to 64-bit address (max address is determined based
on IOMMU and device DMA capability). To support larger IOVA, AMD IOMMU
driver increases page table level.

But in unmap path (iommu_v1_unmap_pages()), fetch_pte() reads
pgtable->[root/mode] without lock. So its possible that in exteme corner case,
when increase_address_space() is updating pgtable->[root/mode], fetch_pte()
reads wrong page table level (pgtable->mode). It does compare the value with
level encoded in page table and returns NULL. This will result is
iommu_unmap ops to fail and upper layer may retry/log WARN_ON.

CPU 0                                         CPU 1
------                                       ------
map pages                                    unmap pages
alloc_pte() -> increase_address_space()      iommu_v1_unmap_pages() -> fetch_pte()
  pgtable->root = pte (new root value)
                                             READ pgtable->[mode/root]
       Reads new root, old mode
  Updates mode (pgtable->mode += 1)

Since Page table level updates are infrequent and already synchronized with a
spinlock, implement seqcount to enable lock-free read operations on the read path.

Fixes: 754265bcab7 ("iommu/amd: Fix race in increase_address_space()")
Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Cc: stable@vger.kernel.org
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

Merge tag 'amd-drm-fixes-6.17-2025-09-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.17-2025-09-18:

amdgpu:
- GC 11.0.1/4 cleaner shader support
- DC irq fix
- OD fix

amdkfd:
- S0ix fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250918191428.2553105-1-alexander.deucher@amd.com

Merge tag 'drm-xe-fixes-2025-09-18' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

- Release kobject for the failure path (Shuicheng)
- SRIOV PF: Drop rounddown_pow_of_two fair (Michal)
- Remove type casting on hwmon (Mallesh)
- Defer free of NVM auxiliary container to device release (Nitin)
- Fix a NULL vs IS_ERR (Dan)
- Add cleanup action in xe_device_sysfs_init (Zongyao)
- Fix error handling if PXP fails to start (Daniele)
- Set GuC RCS/CCS yield policy (Daniele)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/aMwL7vxFP1L94IML@intel.com

Merge tag 'drm-misc-fixes-2025-09-18' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

One fix for a documentation warning, a null pointer dereference fix for
anx7625, and a mutex unlock fix for cdns-mhdp8546

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://lore.kernel.org/r/20250918-orthodox-pretty-puma-1ddeea@houat

Merge tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier fixes from Steven Rostedt:

- Fix build in some RISC-V flavours

   Some system calls only are available for the 64bit RISC-V machines.
   #ifdef out the cases of clock_nanosleep and futex in the sleep
   monitor if they are not supported by the architecture.

- Fix wrong cast, obsolete after refactoring

   Use container_of() to get to the rv_monitor structure from the
   enable_monitors_next() 'p' pointer. The assignment worked only
   because the list field used happened to be the first field of the
   structure.

- Remove redundant include files

   Some include files were listed twice. Remove the extra ones and sort
   the includes.

- Fix missing unlock on failure

   There was an error path that exited the rv_register_monitor()
   function without releasing a lock. Change that to goto the lock
   release.

- Add Gabriele Monaco to be Runtime Verifier maintainer

   Gabriele is doing most of the work on RV as well as collecting
   patches. Add him to the maintainers file for Runtime Verification.

* tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Add Gabriele Monaco as maintainer for Runtime Verification
  rv: Fix missing mutex unlock in rv_register_monitor()
  include/linux/rv.h: remove redundant include file
  rv: Fix wrong type cast in enabled_monitors_next()
  rv: Support systems with time64-only syscalls

smb: client: fix smbdirect_recv_io leak in smbd_negotiate() error path

During tests of another unrelated patch I was able to trigger this
error: Objects remaining on __kmem_cache_shutdown()

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix file open check in __cifs_unlink()

Fix the file open check to decide whether or not silly-rename the file
in SMB2+.

Fixes: c5ea3065586d ("smb: client: fix data loss due to broken rename(2)")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: Frank Sorenson <sorenson@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

rv: Add Gabriele Monaco as maintainer for Runtime Verification

Gabriele will start taking over managing the changes to the Runtime
Verification. Make him officially one of the maintainers.

Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250911115744.66ccade3@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

io_uring/msg_ring: kill alloc_cache for io_kiocb allocations

A recent commit:

fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU")

fixed an issue with not deferring freeing of io_kiocb structs that
msg_ring allocates to after the current RCU grace period. But this only
covers requests that don't end up in the allocation cache. If a request
goes into the alloc cache, it can get reused before it is sane to do so.
A recent syzbot report would seem to indicate that there's something
there, however it may very well just be because of the KASAN poisoning
that the alloc_cache handles manually.

Rather than attempt to make the alloc_cache sane for that use case, just
drop the usage of the alloc_cache for msg_ring request payload data.

Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries")
Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/
Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ALSA: hda/realtek: Fix mute led for HP Laptop 15-dw4xx

This laptop uses the ALC236 codec with COEF 0x7 and idx 1 to
control the mute LED. Enable the existing quirk for this device.

Signed-off-by: Praful Adiga <praful.adiga@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

drm/amdgpu: suspend KFD and KGD user queues for S0ix

We need to make sure the user queues are preempted so
GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f8b367e6fa1716cab7cc232b9e3dff29187fc99d)
Cc: stable@vger.kernel.org

drm/amdkfd: add proper handling for S0ix

When in S0i3, the GFX state is retained, so all we need to do
is stop the runlist so GFX can enter gfxoff.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4bfa8609934dbf39bbe6e75b4f971469384b50b1)
Cc: stable@vger.kernel.org

Merge tag 'net-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless. No known regressions at this point.

  Current release - fix to a fix:

   - eth: Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set"

   - wifi: iwlwifi: pcie: fix byte count table for 7000/8000 devices

   - net: clear sk->sk_ino in sk_set_socket(sk, NULL), fix CRIU

  Previous releases - regressions:

   - bonding: set random address only when slaves already exist

   - rxrpc: fix untrusted unsigned subtract

   - eth:
       - ice: fix Rx page leak on multi-buffer frames
       - mlx5: don't return mlx5_link_info table when speed is unknown

  Previous releases - always broken:

   - tls: make sure to abort the stream if headers are bogus

   - tcp: fix null-deref when using TCP-AO with TCP_REPAIR

   - dpll: fix skipping last entry in clock quality level reporting

   - eth: qed: don't collect too many protection override GRC elements,
     fix memory corruption"

* tag 'net-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
  octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp()
  cnic: Fix use-after-free bugs in cnic_delete_task
  devlink rate: Remove unnecessary 'static' from a couple places
  MAINTAINERS: update sundance entry
  net: liquidio: fix overflow in octeon_init_instr_queue()
  net: clear sk->sk_ino in sk_set_socket(sk, NULL)
  Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set"
  selftests: tls: test skb copy under mem pressure and OOB
  tls: make sure to abort the stream if headers are bogus
  selftest: packetdrill: Add tcp_fastopen_server_reset-after-disconnect.pkt.
  tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect().
  octeon_ep: fix VF MAC address lifecycle handling
  selftests: bonding: add vlan over bond testing
  bonding: don't set oif to bond dev when getting NS target destination
  net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer
  net/mlx5e: Add a miss level for ipsec crypto offload
  net/mlx5e: Harden uplink netdev access against device unbind
  MAINTAINERS: make the DPLL entry cover drivers
  doc/netlink: Fix typos in operation attributes
  igc: don't fail igc_probe() on LED setup error
  ...

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"These are mostly Oliver's Arm changes: lock ordering fixes for the
  vGIC, and reverts for a buggy attempt to avoid RCU stalls on large
  VMs.

  Arm:

   - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when
     visiting from an MMU notifier

   - Fixes to the TLB match process and TLB invalidation range for
     managing the VCNR pseudo-TLB

   - Prevent SPE from erroneously profiling guests due to UNKNOWN reset
     values in PMSCR_EL1

   - Fix save/restore of host MDCR_EL2 to account for eagerly
     programming at vcpu_load() on VHE systems

   - Correct lock ordering when dealing with VGIC LPIs, avoiding
     scenarios where an xarray's spinlock was nested with a *raw*
     spinlock

   - Permit stage-2 read permission aborts which are possible in the
     case of NV depending on the guest hypervisor's stage-2 translation

   - Call raw_spin_unlock() instead of the internal spinlock API

   - Fix parameter ordering when assigning VBAR_EL1

   - Reverted a couple of fixes for RCU stalls when destroying a stage-2
     page table.

     There appears to be some nasty refcounting / UAF issues lurking in
     those patches and the band-aid we tried to apply didn't hold.

  s390:

   - mm fixes, including userfaultfd bug fix

  x86:

   - Sync the vTPR from the local APIC to the VMCB even when AVIC is
     active.

     This fixes a bug where host updates to the vTPR, e.g. via
     KVM_SET_LAPIC or emulation of a guest access, are lost and result
     in interrupt delivery issues in the guest"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active
  Revert "KVM: arm64: Split kvm_pgtable_stage2_destroy()"
  Revert "KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables"
  KVM: arm64: vgic: fix incorrect spinlock API usage
  KVM: arm64: Remove stage 2 read fault check
  KVM: arm64: Fix parameter ordering for VBAR_EL1 assignment
  KVM: arm64: nv: Fix incorrect VNCR invalidation range calculation
  KVM: arm64: vgic-v3: Indicate vgic_put_irq() may take LPI xarray lock
  KVM: arm64: vgic-v3: Don't require IRQs be disabled for LPI xarray lock
  KVM: arm64: vgic-v3: Erase LPIs from xarray outside of raw spinlocks
  KVM: arm64: Spin off release helper from vgic_put_irq()
  KVM: arm64: vgic-v3: Use bare refcount for VGIC LPIs
  KVM: arm64: vgic: Drop stale comment on IRQ active state
  KVM: arm64: VHE: Save and restore host MDCR_EL2 value correctly
  KVM: arm64: Initialize PMSCR_EL1 when in VHE
  KVM: arm64: nv: fix VNCR TLB ASID match logic for non-Global entries
  KVM: s390: Fix FOLL_*/FAULT_FLAG_* confusion
  KVM: s390: Fix incorrect usage of mmu_notifier_register()
  KVM: s390: Fix access to unavailable adapter indicator pages during postcopy
  KVM: arm64: Mark freed S2 MMUs as invalid

io_uring: include dying ring in task_work "should cancel" state

When running task_work for an exiting task, rather than perform the
issue retry attempt, the task_work is canceled. However, this isn't
done for a ring that has been closed. This can lead to requests being
successfully completed post the ring being closed, which is somewhat
confusing and surprising to an application.

Rather than just check the task exit state, also include the ring
ref state in deciding whether or not to terminate a given request when
run from task_work.

Cc: stable@vger.kernel.org # 6.1+
Link: https://github.com/axboe/liburing/discussions/1459
Reported-by: Benedek Thaler <thaler@thaler.hu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'platform-drivers-x86-v6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Ilpo Järvinen:
"Fixes and new HW support:

   - amd/pmc: Add MECHREVO Yilong15Pro to spurious_8042 list

   - amd/pmf: Support new ACPI ID AMDI0108

   - asus-wmi: Re-add extra keys to ignore_key_wlan quirk

   - oxpec: Add support for AOKZOE A1X and OneXPlayer X1Pro EVA-02"

* tag 'platform-drivers-x86-v6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: asus-wmi: Re-add extra keys to ignore_key_wlan quirk
  platform/x86/amd/pmf: Support new ACPI ID AMDI0108
  platform/x86: oxpec: Add support for AOKZOE A1X
  platform/x86: oxpec: Add support for OneXPlayer X1Pro EVA-02
  platform/x86/amd/pmc: Add MECHREVO Yilong15Pro to spurious_8042 list

Merge tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux

Pull UML fixes from Johannes Berg:
"A few fixes for UML, which I'd meant to send earlier but then forgot.

  All of them are pretty long-standing issues that are either not really
  happening (the UAF), in rarely used code (the FD buffer issue), or an
  issue only for some host configurations (the executable stack):

   - mark stack not executable to work on more modern systems with
     selinux

   - fix use-after-free in a virtio error path

   - fix stack buffer overflow in external unix socket FD receive
     function"

* tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
  um: Fix FD copy size in os_rcv_fd_msg()
  um: virtio_uml: Fix use-after-free after put_device in probe
  um: Don't mark stack executable

octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp()

The original code relies on cancel_delayed_work() in otx2_ptp_destroy(),
which does not ensure that the delayed work item synctstamp_work has fully
completed if it was already running. This leads to use-after-free scenarios
where otx2_ptp is deallocated by otx2_ptp_destroy(), while synctstamp_work
remains active and attempts to dereference otx2_ptp in otx2_sync_tstamp().
Furthermore, the synctstamp_work is cyclic, the likelihood of triggering
the bug is nonnegligible.

A typical race condition is illustrated below:

CPU 0 (cleanup)           | CPU 1 (delayed work callback)
otx2_remove()             |
  otx2_ptp_destroy()      | otx2_sync_tstamp()
    cancel_delayed_work() |
    kfree(ptp)            |
                          |   ptp = container_of(...); //UAF
                          |   ptp-> //UAF

This is confirmed by a KASAN report:

BUG: KASAN: slab-use-after-free in __run_timer_base.part.0+0x7d7/0x8c0
Write of size 8 at addr ffff88800aa09a18 by task bash/136
...
Call Trace:
<IRQ>
dump_stack_lvl+0x55/0x70
print_report+0xcf/0x610
? __run_timer_base.part.0+0x7d7/0x8c0
kasan_report+0xb8/0xf0
? __run_timer_base.part.0+0x7d7/0x8c0
__run_timer_base.part.0+0x7d7/0x8c0
? __pfx___run_timer_base.part.0+0x10/0x10
? __pfx_read_tsc+0x10/0x10
? ktime_get+0x60/0x140
? lapic_next_event+0x11/0x20
? clockevents_program_event+0x1d4/0x2a0
run_timer_softirq+0xd1/0x190
handle_softirqs+0x16a/0x550
irq_exit_rcu+0xaf/0xe0
sysvec_apic_timer_interrupt+0x70/0x80
</IRQ>
...
Allocated by task 1:
kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30
__kasan_kmalloc+0x7f/0x90
otx2_ptp_init+0xb1/0x860
otx2_probe+0x4eb/0xc30
local_pci_probe+0xdc/0x190
pci_device_probe+0x2fe/0x470
really_probe+0x1ca/0x5c0
__driver_probe_device+0x248/0x310
driver_probe_device+0x44/0x120
__driver_attach+0xd2/0x310
bus_for_each_dev+0xed/0x170
bus_add_driver+0x208/0x500
driver_register+0x132/0x460
do_one_initcall+0x89/0x300
kernel_init_freeable+0x40d/0x720
kernel_init+0x1a/0x150
ret_from_fork+0x10c/0x1a0
ret_from_fork_asm+0x1a/0x30

Freed by task 136:
kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3a/0x60
__kasan_slab_free+0x3f/0x50
kfree+0x137/0x370
otx2_ptp_destroy+0x38/0x80
otx2_remove+0x10d/0x4c0
pci_device_remove+0xa6/0x1d0
device_release_driver_internal+0xf8/0x210
pci_stop_bus_device+0x105/0x150
pci_stop_and_remove_bus_device_locked+0x15/0x30
remove_store+0xcc/0xe0
kernfs_fop_write_iter+0x2c3/0x440
vfs_write+0x871/0xd70
ksys_write+0xee/0x1c0
do_syscall_64+0xac/0x280
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...

Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the delayed work item is properly canceled before the otx2_ptp is
deallocated.

This bug was initially identified through static analysis. To reproduce
and test it, I simulated the OcteonTX2 PCI device in QEMU and introduced
artificial delays within the otx2_sync_tstamp() function to increase the
likelihood of triggering the bug.

Fixes: 2958d17a8984 ("octeontx2-pf: Add support for ptp 1-step mode on CN10K silicon")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cnic: Fix use-after-free bugs in cnic_delete_task

The original code uses cancel_delayed_work() in cnic_cm_stop_bnx2x_hw(),
which does not guarantee that the delayed work item 'delete_task' has
fully completed if it was already running. Additionally, the delayed work
item is cyclic, the flush_workqueue() in cnic_cm_stop_bnx2x_hw() only
blocks and waits for work items that were already queued to the
workqueue prior to its invocation. Any work items submitted after
flush_workqueue() is called are not included in the set of tasks that the
flush operation awaits. This means that after the cyclic work items have
finished executing, a delayed work item may still exist in the workqueue.
This leads to use-after-free scenarios where the cnic_dev is deallocated
by cnic_free_dev(), while delete_task remains active and attempt to
dereference cnic_dev in cnic_delete_task().

A typical race condition is illustrated below:

CPU 0 (cleanup)              | CPU 1 (delayed work callback)
cnic_netdev_event()          |
  cnic_stop_hw()             | cnic_delete_task()
    cnic_cm_stop_bnx2x_hw()  | ...
      cancel_delayed_work()  | /* the queue_delayed_work()
      flush_workqueue()      |    executes after flush_workqueue()*/
                             | queue_delayed_work()
  cnic_free_dev(dev)//free   | cnic_delete_task() //new instance
                             |   dev = cp->dev; //use

Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the cyclic delayed work item is properly canceled and that any
ongoing execution of the work item completes before the cnic_dev is
deallocated. Furthermore, since cancel_delayed_work_sync() uses
__flush_work(work, true) to synchronously wait for any currently
executing instance of the work item to finish, the flush_workqueue()
becomes redundant and should be removed.

This bug was identified through static analysis. To reproduce the issue
and validate the fix, I simulated the cnic PCI device in QEMU and
introduced intentional delays — such as inserting calls to ssleep()
within the cnic_delete_task() function — to increase the likelihood
of triggering the bug.

Fixes: fdf24086f475 ("cnic: Defer iscsi connection cleanup")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink rate: Remove unnecessary 'static' from a couple places

devlink_rate_node_get_by_name() and devlink_rate_nodes_destroy() have a
couple of unnecessary static variables for iterating over devlink rates.

This could lead to races/corruption/unhappiness if two concurrent
operations execute the same function.

Remove 'static' from both. It's amazing this was missed for 4+ years.
While at it, I confirmed there are no more examples of this mistake in
net/ with 1, 2 or 3 levels of indentation.

Fixes: a8ecb93ef03d ("devlink: Introduce rate nodes")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: update sundance entry

Signed-off-by: Denis Kirjanov <kirjanov@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: liquidio: fix overflow in octeon_init_instr_queue()

The expression `(conf->instr_type == 64) << iq_no` can overflow because
`iq_no` may be as high as 64 (`CN23XX_MAX_RINGS_PER_PF`). Casting the
operand to `u64` ensures correct 64-bit arithmetic.

Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: clear sk->sk_ino in sk_set_socket(sk, NULL)

Andrei Vagin reported that blamed commit broke CRIU.

Indeed, while we want to keep sk_uid unchanged when a socket
is cloned, we want to clear sk->sk_ino.

Otherwise, sock_diag might report multiple sockets sharing
the same inode number.

Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL),
called both from sock_orphan() and sk_clone_lock().

Fixes: 5d6b58c932ec ("net: lockless sock_i_ino()")
Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/
Closes: https://github.com/checkpoint-restore/criu/issues/2744
Reported-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20250917135337.1736101-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set"

This reverts commit d24341740fe48add8a227a753e68b6eedf4b385a.
It causes errors when trying to configure QoS, as well as
loss of L2 connectivity (on multi-host devices).

Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/20250910170011.70528106@kernel.org
Fixes: d24341740fe4 ("net/mlx5e: Update and set Xon/Xoff upon port speed set")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_pch_pic_regs_access()

Function copy_from_user() and copy_to_user() may sleep because of page
fault, and they cannot be called in spin_lock hold context. Here move
function calling of copy_from_user() and copy_to_user() out of spinlock
context in function kvm_pch_pic_regs_access().

Otherwise there will be possible warning such as:

BUG: sleeping function called from invalid context at include/linux/uaccess.h:192
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 6292, name: qemu-system-loo
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last  enabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 41 UID: 0 PID: 6292 Comm: qemu-system-loo Tainted: G W 6.17.0-rc3+ #31 PREEMPT(full)
Tainted: [W]=WARN
Stack : 0000000000000076 0000000000000000 9000000004c28264 9000100092ff4000
        9000100092ff7b80 9000100092ff7b88 0000000000000000 9000100092ff7cc8
        9000100092ff7cc0 9000100092ff7cc0 9000100092ff7a00 0000000000000001
        0000000000000001 9000100092ff7b88 947d2f9216a5e8b9 900010008773d880
        00000000ffff8b9f fffffffffffffffe 0000000000000ba1 fffffffffffffffe
        000000000000003e 900000000825a15b 000010007ad38000 9000100092ff7ec0
        0000000000000000 0000000000000000 9000000006f3ac60 9000000007252000
        0000000000000000 00007ff746ff2230 0000000000000053 9000200088a021b0
        0000555556c9d190 0000000000000000 9000000004c2827c 000055556cfb5f40
        00000000000000b0 0000000000000007 0000000000000007 0000000000071c1d
Call Trace:
[<9000000004c2827c>] show_stack+0x5c/0x180
[<9000000004c20fac>] dump_stack_lvl+0x94/0xe4
[<9000000004c99c7c>] __might_resched+0x26c/0x290
[<9000000004f68968>] __might_fault+0x20/0x88
[<ffff800002311de0>] kvm_pch_pic_regs_access.isra.0+0x88/0x380 [kvm]
[<ffff8000022f8514>] kvm_device_ioctl+0x194/0x290 [kvm]
[<900000000506b0d8>] sys_ioctl+0x388/0x1010
[<90000000063ed210>] do_syscall+0xb0/0x2d8
[<9000000004c25ef8>] handle_syscall+0xb8/0x158

Cc: stable@vger.kernel.org
Fixes: d206d95148732 ("LoongArch: KVM: Add PCHPIC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_sw_status_access()

Function copy_from_user() and copy_to_user() may sleep because of page
fault, and they cannot be called in spin_lock hold context. Here move
funtcion calling of copy_from_user() and copy_to_user() out of function
kvm_eiointc_sw_status_access().

Otherwise there will be possible warning such as:

BUG: sleeping function called from invalid context at include/linux/uaccess.h:192
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 6292, name: qemu-system-loo
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last  enabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 41 UID: 0 PID: 6292 Comm: qemu-system-loo Tainted: G W 6.17.0-rc3+ #31 PREEMPT(full)
Tainted: [W]=WARN
Stack : 0000000000000076 0000000000000000 9000000004c28264 9000100092ff4000
        9000100092ff7b80 9000100092ff7b88 0000000000000000 9000100092ff7cc8
        9000100092ff7cc0 9000100092ff7cc0 9000100092ff7a00 0000000000000001
        0000000000000001 9000100092ff7b88 947d2f9216a5e8b9 900010008773d880
        00000000ffff8b9f fffffffffffffffe 0000000000000ba1 fffffffffffffffe
        000000000000003e 900000000825a15b 000010007ad38000 9000100092ff7ec0
        0000000000000000 0000000000000000 9000000006f3ac60 9000000007252000
        0000000000000000 00007ff746ff2230 0000000000000053 9000200088a021b0
        0000555556c9d190 0000000000000000 9000000004c2827c 000055556cfb5f40
        00000000000000b0 0000000000000007 0000000000000007 0000000000071c1d
Call Trace:
[<9000000004c2827c>] show_stack+0x5c/0x180
[<9000000004c20fac>] dump_stack_lvl+0x94/0xe4
[<9000000004c99c7c>] __might_resched+0x26c/0x290
[<9000000004f68968>] __might_fault+0x20/0x88
[<ffff800002311de0>] kvm_eiointc_sw_status_access.isra.0+0x88/0x380 [kvm]
[<ffff8000022f8514>] kvm_device_ioctl+0x194/0x290 [kvm]
[<900000000506b0d8>] sys_ioctl+0x388/0x1010
[<90000000063ed210>] do_syscall+0xb0/0x2d8
[<9000000004c25ef8>] handle_syscall+0xb8/0x158

Cc: stable@vger.kernel.org
Fixes: 1ad7efa552fd5 ("LoongArch: KVM: Add EIOINTC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_regs_access()

Function copy_from_user() and copy_to_user() may sleep because of page
fault, and they cannot be called in spin_lock hold context. Here move
function calling of copy_from_user() and copy_to_user() before spinlock
context in function kvm_eiointc_ctrl_access().

Otherwise there will be possible warning such as:

BUG: sleeping function called from invalid context at include/linux/uaccess.h:192
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 6292, name: qemu-system-loo
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last  enabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 41 UID: 0 PID: 6292 Comm: qemu-system-loo Tainted: G W 6.17.0-rc3+ #31 PREEMPT(full)
Tainted: [W]=WARN
Stack : 0000000000000076 0000000000000000 9000000004c28264 9000100092ff4000
        9000100092ff7b80 9000100092ff7b88 0000000000000000 9000100092ff7cc8
        9000100092ff7cc0 9000100092ff7cc0 9000100092ff7a00 0000000000000001
        0000000000000001 9000100092ff7b88 947d2f9216a5e8b9 900010008773d880
        00000000ffff8b9f fffffffffffffffe 0000000000000ba1 fffffffffffffffe
        000000000000003e 900000000825a15b 000010007ad38000 9000100092ff7ec0
        0000000000000000 0000000000000000 9000000006f3ac60 9000000007252000
        0000000000000000 00007ff746ff2230 0000000000000053 9000200088a021b0
        0000555556c9d190 0000000000000000 9000000004c2827c 000055556cfb5f40
        00000000000000b0 0000000000000007 0000000000000007 0000000000071c1d
Call Trace:
[<9000000004c2827c>] show_stack+0x5c/0x180
[<9000000004c20fac>] dump_stack_lvl+0x94/0xe4
[<9000000004c99c7c>] __might_resched+0x26c/0x290
[<9000000004f68968>] __might_fault+0x20/0x88
[<ffff800002311de0>] kvm_eiointc_regs_access.isra.0+0x88/0x380 [kvm]
[<ffff8000022f8514>] kvm_device_ioctl+0x194/0x290 [kvm]
[<900000000506b0d8>] sys_ioctl+0x388/0x1010
[<90000000063ed210>] do_syscall+0xb0/0x2d8
[<9000000004c25ef8>] handle_syscall+0xb8/0x158

Cc: stable@vger.kernel.org
Fixes: 1ad7efa552fd5 ("LoongArch: KVM: Add EIOINTC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_ctrl_access()

Function copy_from_user() and copy_to_user() may sleep because of page
fault, and they cannot be called in spin_lock hold context. Here move
function calling of copy_from_user() and copy_to_user() before spinlock
context in function kvm_eiointc_ctrl_access().

Otherwise there will be possible warning such as:

BUG: sleeping function called from invalid context at include/linux/uaccess.h:192
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 6292, name: qemu-system-loo
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last  enabled at (0): [<9000000004c4a554>] copy_process+0x90c/0x1d40
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 41 UID: 0 PID: 6292 Comm: qemu-system-loo Tainted: G W 6.17.0-rc3+ #31 PREEMPT(full)
Tainted: [W]=WARN
Stack : 0000000000000076 0000000000000000 9000000004c28264 9000100092ff4000
        9000100092ff7b80 9000100092ff7b88 0000000000000000 9000100092ff7cc8
        9000100092ff7cc0 9000100092ff7cc0 9000100092ff7a00 0000000000000001
        0000000000000001 9000100092ff7b88 947d2f9216a5e8b9 900010008773d880
        00000000ffff8b9f fffffffffffffffe 0000000000000ba1 fffffffffffffffe
        000000000000003e 900000000825a15b 000010007ad38000 9000100092ff7ec0
        0000000000000000 0000000000000000 9000000006f3ac60 9000000007252000
        0000000000000000 00007ff746ff2230 0000000000000053 9000200088a021b0
        0000555556c9d190 0000000000000000 9000000004c2827c 000055556cfb5f40
        00000000000000b0 0000000000000007 0000000000000007 0000000000071c1d
Call Trace:
[<9000000004c2827c>] show_stack+0x5c/0x180
[<9000000004c20fac>] dump_stack_lvl+0x94/0xe4
[<9000000004c99c7c>] __might_resched+0x26c/0x290
[<9000000004f68968>] __might_fault+0x20/0x88
[<ffff800002311de0>] kvm_eiointc_ctrl_access.isra.0+0x88/0x380 [kvm]
[<ffff8000022f8514>] kvm_device_ioctl+0x194/0x290 [kvm]
[<900000000506b0d8>] sys_ioctl+0x388/0x1010
[<90000000063ed210>] do_syscall+0xb0/0x2d8
[<9000000004c25ef8>] handle_syscall+0xb8/0x158

Cc: stable@vger.kernel.org
Fixes: 1ad7efa552fd5 ("LoongArch: KVM: Add EIOINTC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Fix VM migration failure with PTW enabled

With PTW disabled system, bit _PAGE_DIRTY is a HW bit for page writing.
However with PTW enabled system, bit _PAGE_WRITE is also a "HW bit" for
page writing, because hardware synchronizes _PAGE_WRITE to _PAGE_DIRTY
automatically. Previously, _PAGE_WRITE is treated as a SW bit to record
the page writeable attribute for the fast page fault handling in the
secondary MMU, however with PTW enabled machine, this bit is used by HW
already (so setting it will silence the TLB modify exception).

Here define KVM_PAGE_WRITEABLE with the SW bit _PAGE_MODIFIED, so that
it can work on both PTW disabled and enabled machines. And for HW write
bits, both _PAGE_DIRTY and _PAGE_WRITE are set or clear together.

Cc: stable@vger.kernel.org
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Remove unused returns and semicolons

The default branch has already handled all undefined cases, so the final
return statement is redundant. Redundant semicolons are removed, too.

Cc: stable@vger.kernel.org
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: vDSO: Check kcalloc() result in init_vdso()

Add a NULL-pointer check after the kcalloc() call in init_vdso(). If
allocation fails, return -ENOMEM to prevent a possible dereference of
vdso_info.code_mapping.pages when it is NULL.

Cc: stable@vger.kernel.org
Fixes: 2ed119aef60d ("LoongArch: Set correct size for vDSO code mapping")
Signed-off-by: Guangshuo Li <202321181@mail.sdu.edu.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Fix unreliable stack for live patching

When testing the kernel live patching with "modprobe livepatch-sample",
there is a timeout over 15 seconds from "starting patching transition"
to "patching complete". The dmesg command shows "unreliable stack" for
user tasks in debug mode, here is one of the messages:

  livepatch: klp_try_switch_task: bash:1193 has an unreliable stack

The "unreliable stack" is because it can not unwind from do_syscall()
to its previous frame handle_syscall(). It should use fp to find the
original stack top due to secondary stack in do_syscall(), but fp is
not used for some other functions, then fp can not be restored by the
next frame of do_syscall(), so it is necessary to save fp if task is
not current, in order to get the stack top of do_syscall().

Here are the call chains:

  klp_enable_patch()
    klp_try_complete_transition()
      klp_try_switch_task()
        klp_check_and_switch_task()
          klp_check_stack()
            stack_trace_save_tsk_reliable()
              arch_stack_walk_reliable()

When executing "rmmod livepatch-sample", there exists a similar issue.
With this patch, it takes a short time for patching and unpatching.

Before:

  # modprobe livepatch-sample
  # dmesg -T | tail -3
  [Sat Sep  6 11:00:20 2025] livepatch: 'livepatch_sample': starting patching transition
  [Sat Sep  6 11:00:35 2025] livepatch: signaling remaining tasks
  [Sat Sep  6 11:00:36 2025] livepatch: 'livepatch_sample': patching complete

  # echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
  # rmmod livepatch_sample
  rmmod: ERROR: Module livepatch_sample is in use
  # rmmod livepatch_sample
  # dmesg -T | tail -3
  [Sat Sep  6 11:06:05 2025] livepatch: 'livepatch_sample': starting unpatching transition
  [Sat Sep  6 11:06:20 2025] livepatch: signaling remaining tasks
  [Sat Sep  6 11:06:21 2025] livepatch: 'livepatch_sample': unpatching complete

After:

  # modprobe livepatch-sample
  # dmesg -T | tail -2
  [Tue Sep 16 16:19:30 2025] livepatch: 'livepatch_sample': starting patching transition
  [Tue Sep 16 16:19:31 2025] livepatch: 'livepatch_sample': patching complete

  # echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
  # rmmod livepatch_sample
  # dmesg -T | tail -2
  [Tue Sep 16 16:19:36 2025] livepatch: 'livepatch_sample': starting unpatching transition
  [Tue Sep 16 16:19:37 2025] livepatch: 'livepatch_sample': unpatching complete

Cc: stable@vger.kernel.org # v6.9+
Fixes: 199cc14cb4f1 ("LoongArch: Add kernel livepatching support")
Reported-by: Xi Zhang <zhangxi@kylinos.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Replace sprintf() with sysfs_emit()

As Documentation/filesystems/sysfs.rst suggested, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Check the return value when creating kobj

Add a check for the return value of kobject_create_and_add(), to ensure
that the kobj allocation succeeds for later use.

Cc: stable@vger.kernel.org
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Align ACPI structures if ARCH_STRICT_ALIGN enabled

ARCH_STRICT_ALIGN is used for hardware without UAL, now it only control
the -mstrict-align flag. However, ACPI structures are packed by default
so will cause unaligned accesses.

To avoid this, define ACPI_MISALIGNMENT_NOT_SUPPORTED in asm/acenv.h to
align ACPI structures if ARCH_STRICT_ALIGN enabled.

Cc: stable@vger.kernel.org
Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
Suggested-by: Xi Ruoyao <xry111@xry111.site>
Suggested-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Update help info of ARCH_STRICT_ALIGN

Loongson-3A6000 and 3C6000 CPUs also support unaligned memory access, so
the current description is out of date to some extent.

Actually, all of Loongson-3 series processors based on LoongArch support
unaligned memory access, this hardware capability is indicated by the bit
20 (UAL) of CPUCFG1 register, update the help info to reflect the reality.

Cc: stable@vger.kernel.org
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Handle jump tables options for RUST

When compiling with LLVM and CONFIG_RUST is set, there exist objtool
warnings in rust/core.o and rust/kernel.o, like this:

    rust/core.o: warning: objtool:
_RNvXs1_NtNtCs5QSdWC790r4_4core5ascii10ascii_charNtB5_9AsciiCharNtNtB9_3fmt5Debug3fmt+0x54:
sibling call from callable instruction with modified stack frame

For this special case, the related object file shows that there is no
generated relocation section '.rela.discard.tablejump_annotate' for the
table jump instruction jirl, thus objtool can not know that what is the
actual destination address.

If rustc has the option "-Cllvm-args=--loongarch-annotate-tablejump",
pass the option to enable jump tables for objtool, otherwise it should
pass "-Zno-jump-tables" to keep compatibility with older rustc.

How to test:

  $ rustup component add rust-src
  $ make LLVM=1 rustavailable
  $ make ARCH=loongarch LLVM=1 clean defconfig
  $ scripts/config -d MODVERSIONS \
    -e RUST -e SAMPLES -e SAMPLES_RUST \
    -e SAMPLE_RUST_CONFIGFS -e SAMPLE_RUST_MINIMAL \
    -e SAMPLE_RUST_MISC_DEVICE -e SAMPLE_RUST_PRINT \
    -e SAMPLE_RUST_DMA -e SAMPLE_RUST_DRIVER_PCI \
    -e SAMPLE_RUST_DRIVER_PLATFORM -e SAMPLE_RUST_DRIVER_FAUX \
    -e SAMPLE_RUST_DRIVER_AUXILIARY -e SAMPLE_RUST_HOSTPROGS
  $ make ARCH=loongarch LLVM=1 olddefconfig all

Cc: stable@vger.kernel.org
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Closes: https://lore.kernel.org/rust-for-linux/CANiq72mNeCuPkCDrG2db3w=AX+O-zYrfprisDPmRac_qh65Dmg@mail.gmail.com/
Suggested-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Make LTO case independent in Makefile

LTO is not only used for Clang, but maybe also used for Rust, make LTO
case out of CONFIG_CC_HAS_ANNOTATE_TABLEJUMP in Makefile.

This is preparation for later patch, no function changes.

Cc: stable@vger.kernel.org
Suggested-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

objtool/LoongArch: Mark special atomic instruction as INSN_BUG type

When compiling with LLVM and CONFIG_RUST is set, there exists the
following objtool warning:

rust/compiler_builtins.o: warning: objtool: __rust__unordsf2(): unexpected end of section .text.unlikely.

objdump shows that the end of section .text.unlikely is an atomic
instruction:

amswap.w $zero, $ra, $zero

According to the LoongArch Reference Manual, if the amswap.w atomic
memory access instruction has the same register number as rd and rj,
the execution will trigger an Instruction Non-defined Exception, so
mark the above instruction as INSN_BUG type to fix the warning.

Cc: stable@vger.kernel.org
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

objtool/LoongArch: Mark types based on break immediate code

If the break immediate code is 0, it should mark the type as
INSN_TRAP. If the break immediate code is 1, it should mark the
type as INSN_BUG.

While at it, format the code style and add the code comment for nop.

Cc: stable@vger.kernel.org
Suggested-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

btrfs: reject invalid compression level

Inspired by recent changes to compression level parsing in
6db1df415d73fc ("btrfs: accept and ignore compression level for lzo")
it turns out that we do not do any extra validation for compression
level input string, thus allowing things like "compress=lzo:invalid" to
be accepted without warnings.

Although we accept levels that are beyond the supported algorithm
ranges, accepting completely invalid level specification is not correct.

Fix the too loose checks for compression level, by doing proper error
handling of kstrtoint(), so that we will reject not only too large
values (beyond int range) but also completely wrong levels like
"lzo:invalid".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

selftests: tls: test skb copy under mem pressure and OOB

Add a test which triggers mem pressure via OOB writes.

Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250917002814.1743558-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tls: make sure to abort the stream if headers are bogus

Normally we wait for the socket to buffer up the whole record
before we service it. If the socket has a tiny buffer, however,
we read out the data sooner, to prevent connection stalls.
Make sure that we abort the connection when we find out late
that the record is actually invalid. Retrying the parsing is
fine in itself but since we copy some more data each time
before we parse we can overflow the allocated skb space.

Constructing a scenario in which we're under pressure without
enough data in the socket to parse the length upfront is quite
hard. syzbot figured out a way to do this by serving us the header
in small OOB sends, and then filling in the recvbuf with a large
normal send.

Make sure that tls_rx_msg_size() aborts strp, if we reach
an invalid record there's really no way to recover.

Reported-by: Lee Jones <lee@kernel.org>
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250917002814.1743558-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg

Issuing two writes to the same af_alg socket is bogus as the
data will be interleaved in an unpredictable fashion. Furthermore,
concurrent writes may create inconsistencies in the internal
socket state.

Disallow this by adding a new ctx->write field that indiciates
exclusive ownership for writing.

Fixes: 8ff590903d5 ("crypto: algif_skcipher - User-space interface for skcipher operations")
Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

crypto: af_alg - Set merge to zero early in af_alg_sendmsg

If an error causes af_alg_sendmsg to abort, ctx->merge may contain
a garbage value from the previous loop. This may then trigger a
crash on the next entry into af_alg_sendmsg when it attempts to do
a merge that can't be done.

Fix this by setting ctx->merge to zero near the start of the loop.

Fixes: 8ff590903d5 ("crypto: algif_skcipher - User-space interface for skcipher operations")
Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Merge tag 'drm-intel-fixes-2025-09-17' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Honor VESA eDP backlight luminance control capability (Aaron Ma)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://lore.kernel.org/r/aMrPx4FZ66t1Kfe-@linux

Merge tag 'asoc-fix-v6.17-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.17

A pile of fixes that accumilated over the past few -rcs, this is all
driver specifics including a small pile of quirks for new systems.

Merge tag 'mm-hotfixes-stable-2025-09-17-21-10' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"15 hotfixes. 11 are cc:stable and the remainder address post-6.16
  issues or aren't considered necessary for -stable kernels. 13 of these
  fixes are for MM.

  The usual shower of singletons, plus

   - fixes from Hugh to address various misbehaviors in get_user_pages()

   - patches from SeongJae to address a quite severe issue in DAMON

   - another series also from SeongJae which completes some fixes for a
     DAMON startup issue"

* tag 'mm-hotfixes-stable-2025-09-17-21-10' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  zram: fix slot write race condition
  nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/*
  samples/damon/mtier: avoid starting DAMON before initialization
  samples/damon/prcl: avoid starting DAMON before initialization
  samples/damon/wsse: avoid starting DAMON before initialization
  MAINTAINERS: add Lance Yang as a THP reviewer
  MAINTAINERS: add Jann Horn as rmap reviewer
  mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control
  mm/damon/core: introduce damon_call_control->dealloc_on_cancel
  mm: folio_may_be_lru_cached() unless folio_test_large()
  mm: revert "mm: vmscan.c: fix OOM on swap stress test"
  mm: revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
  mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
  mm/gup: check ref_count instead of lru before migration

btrfs: ref-verify: handle damaged extent root tree

Syzbot hits a problem with enabled ref-verify, ignorebadroots and a
fuzzed/damaged extent tree. There's no fallback option like in other
places that can deal with it so disable the whole ref-verify as it is
just a debugging feature.

Reported-by: syzbot+9c3e0cdfbfe351b0bc0e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000001b6052062139be1c@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: fix the incorrect inode ref size check

[BUG]
Inside check_inode_ref(), we need to make sure every structure,
including the btrfs_inode_extref header, is covered by the item. But
our code is incorrectly using "sizeof(iref)", where @iref is just a
pointer.

This means "sizeof(iref)" will always be "sizeof(void *)", which is much
smaller than "sizeof(struct btrfs_inode_extref)".

This will allow some bad inode extrefs to sneak in, defeating tree-checker.

[FIX]
Fix the typo by calling "sizeof(*iref)", which is the same as
"sizeof(struct btrfs_inode_extref)", and will be the correct behavior we
want.

Fixes: 71bf92a9b877 ("btrfs: tree-checker: Add check for INODE_REF")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

smb: client: let smbd_destroy() call disable_work_sync(&info->post_send_credits_work)

In smbd_destroy() we may destroy the memory so we better
wait until post_send_credits_work is no longer pending
and will never be started again.

I actually just hit the case using rxe:

WARNING: CPU: 0 PID: 138 at drivers/infiniband/sw/rxe/rxe_verbs.c:1032 rxe_post_recv+0x1ee/0x480 [rdma_rxe]
...
[ 5305.686979] [    T138]  smbd_post_recv+0x445/0xc10 [cifs]
[ 5305.687135] [    T138]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687149] [    T138]  ? __kasan_check_write+0x14/0x30
[ 5305.687185] [    T138]  ? __pfx_smbd_post_recv+0x10/0x10 [cifs]
[ 5305.687329] [    T138]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 5305.687356] [    T138]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687368] [    T138]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687378] [    T138]  ? _raw_spin_unlock_irqrestore+0x11/0x60
[ 5305.687389] [    T138]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687399] [    T138]  ? get_receive_buffer+0x168/0x210 [cifs]
[ 5305.687555] [    T138]  smbd_post_send_credits+0x382/0x4b0 [cifs]
[ 5305.687701] [    T138]  ? __pfx_smbd_post_send_credits+0x10/0x10 [cifs]
[ 5305.687855] [    T138]  ? __pfx___schedule+0x10/0x10
[ 5305.687865] [    T138]  ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 5305.687875] [    T138]  ? queue_delayed_work_on+0x8e/0xa0
[ 5305.687889] [    T138]  process_one_work+0x629/0xf80
[ 5305.687908] [    T138]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687917] [    T138]  ? __kasan_check_write+0x14/0x30
[ 5305.687933] [    T138]  worker_thread+0x87f/0x1570
...

It means rxe_post_recv was called after rdma_destroy_qp().
This happened because put_receive_buffer() was triggered
by ib_drain_qp() and called:
queue_work(info->workqueue, &info->post_send_credits_work);

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: use disable[_delayed]_work_sync in smbdirect.c

This makes it safer during the disconnect and avoids
requeueing.

It's ok to call disable[delayed_]work[_sync]() more than once.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 050b8c374019 ("smbd: Make upper layer decide when to destroy the transport")
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Fixes: c7398583340a ("CIFS: SMBD: Implement RDMA memory registration")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix filename matching of deferred files

Fix the following case where the client would end up closing both
deferred files (foo.tmp & foo) after unlink(foo) due to strstr() call
in cifs_close_deferred_file_under_dentry():

  fd1 = openat(AT_FDCWD, "foo", O_WRONLY|O_CREAT|O_TRUNC, 0666);
  fd2 = openat(AT_FDCWD, "foo.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666);
  close(fd1);
  close(fd2);
  unlink("foo");

Fixes: e3fc065682eb ("cifs: Deferred close performance improvements")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Cc: Frank Sorenson <sorenson@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: let recv_done verify data_offset, data_length and remaining_data_length

This is inspired by the related server fixes.

Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>