]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
9 days agof2fs: fix to not account invalid blocks in get_left_section_blocks()
Chao Yu [Fri, 28 Nov 2025 09:25:07 +0000 (17:25 +0800)] 
f2fs: fix to not account invalid blocks in get_left_section_blocks()

w/ LFS mode, in get_left_section_blocks(), we should not account the
blocks which were used before and now are invalided, otherwise those
blocks will be counted as freed one in has_curseg_enough_space(), result
in missing to trigger GC in time.

Cc: stable@kernel.org
Fixes: 249ad438e1d9 ("f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode.")
Fixes: bf34c93d2645 ("f2fs: check curseg space before foreground GC")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: support to show curseg.next_blkoff in debugfs
Chao Yu [Fri, 28 Nov 2025 09:25:34 +0000 (17:25 +0800)] 
f2fs: support to show curseg.next_blkoff in debugfs

cat /sys/kernel/debug/f2fs/status

Main area: 17 segs, 17 secs 17 zones
    TYPE           blkoff    segno    secno   zoneno  dirty_seg   full_seg  valid_blk
  - COLD   data:        0        4        4        4          0          0          0
  - WARM   data:        0        7        7        7          0          0          0
  - HOT    data:        1        5        5        5          2          0        512
  - Dir   dnode:        3        0        0        0          1          0          2
  - File  dnode:        0        1        1        1          0          0          0
  - Indir nodes:        0        2        2        2          0          0          0
  - Pinned file:        0       -1       -1       -1
  - ATGC   data:        0       -1       -1       -1

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agodocs: f2fs: wrap ASCII tables in literal blocks to fix LaTeX build
Masaharu Noguchi [Mon, 17 Nov 2025 12:27:54 +0000 (21:27 +0900)] 
docs: f2fs: wrap ASCII tables in literal blocks to fix LaTeX build

Sphinx's LaTeX builder fails when converting the nested ASCII tables in
f2fs.rst, producing the following error:

  "Markup is unsupported in LaTeX: longtable does not support nesting a table."

Wrap the affected ASCII tables in literal code blocks to force Sphinx to
render them verbatim. This prevents nested longtables and fixes the PDF
build failure on Sphinx 8.2.x.

Acked-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Masaharu Noguchi <nogunix@gmail.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: expand scalability of f2fs mount option
Chao Yu [Mon, 17 Nov 2025 12:45:59 +0000 (20:45 +0800)] 
f2fs: expand scalability of f2fs mount option

opt field in structure f2fs_mount_info and opt_mask field in structure
f2fs_fs_context is 32-bits variable, now we're running out of available
bits in them, let's expand them to 64-bits for better scalability.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: change default schedule timeout value
Chao Yu [Wed, 12 Nov 2025 01:47:49 +0000 (09:47 +0800)] 
f2fs: change default schedule timeout value

This patch changes default schedule timeout value from 20ms to 1ms,
in order to give caller more chances to check whether IO or non-IO
congestion condition has already been mitigable.

In addition, default interval of periodical discard submission is
kept to 20ms.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: introduce f2fs_schedule_timeout()
Chao Yu [Wed, 12 Nov 2025 01:47:48 +0000 (09:47 +0800)] 
f2fs: introduce f2fs_schedule_timeout()

In f2fs retry logic, we will call f2fs_io_schedule_timeout() to sleep as
uninterruptible state (waiting for IO) for a while, however, in several
paths below, we are not blocked by IO:
- f2fs_write_single_data_page() return -EAGAIN due to racing on cp_rwsem.
- f2fs_flush_device_cache() failed to submit preflush command.
- __issue_discard_cmd_range() sleeps periodically in between two in batch
discard submissions.

So, in order to reveal state of task more accurate, let's introduce
f2fs_schedule_timeout() and call it in above paths in where we are waiting
for non-IO reasons.

Then we can get real reason of uninterruptible sleep for a thread in
tracepoint, perfetto, etc.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: use memalloc_retry_wait() as much as possible
Chao Yu [Wed, 12 Nov 2025 01:47:47 +0000 (09:47 +0800)] 
f2fs: use memalloc_retry_wait() as much as possible

memalloc_retry_wait() is recommended in memory allocation retry logic,
use it as much as possible.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: add a sysfs entry to show max open zones
Yongpeng Yang [Mon, 10 Nov 2025 08:22:22 +0000 (16:22 +0800)] 
f2fs: add a sysfs entry to show max open zones

This patch adds a sysfs entry showing the max zones that F2FS can write
concurrently.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: wrap all unusable_blocks_per_sec code in CONFIG_BLK_DEV_ZONED
Yongpeng Yang [Mon, 10 Nov 2025 08:22:21 +0000 (16:22 +0800)] 
f2fs: wrap all unusable_blocks_per_sec code in CONFIG_BLK_DEV_ZONED

The usage of unusable_blocks_per_sec is already wrapped by
CONFIG_BLK_DEV_ZONED, except for its declaration and the definitions of
CAP_BLKS_PER_SEC and CAP_SEGS_PER_SEC. This patch ensures that all code
related to unusable_blocks_per_sec is properly wrapped under the
CONFIG_BLK_DEV_ZONED option.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: simplify list initialization in f2fs_recover_fsync_data()
Baolin Liu [Tue, 11 Nov 2025 12:17:28 +0000 (20:17 +0800)] 
f2fs: simplify list initialization in f2fs_recover_fsync_data()

In f2fs_recover_fsync_data(),use LIST_HEAD() to declare and
initialize the list_head in one step instead of using
INIT_LIST_HEAD() separately.

No functional change.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: revert summary entry count from 2048 to 512 in 16kb block support
Daeho Jeong [Tue, 11 Nov 2025 17:52:46 +0000 (09:52 -0800)] 
f2fs: revert summary entry count from 2048 to 512 in 16kb block support

The recent increase in the number of Segment Summary Area (SSA) entries
from 512 to 2048 was an unintentional change in logic of 16kb block
support. This commit corrects the issue.

To better utilize the space available from the erroneous 2048-entry
calculation, we are implementing a solution to share the currently
unused SSA space with neighboring segments. This enhances overall
SSA utilization without impacting the established 8MB segment size.

Fixes: d7e9a9037de2 ("f2fs: Support Block Size == Page Size")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix to detect recoverable inode during dryrun of find_fsync_dnodes()
Chao Yu [Wed, 5 Nov 2025 06:50:23 +0000 (14:50 +0800)] 
f2fs: fix to detect recoverable inode during dryrun of find_fsync_dnodes()

mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync # avoid CP_UMOUNT_FLAG in last f2fs_checkpoint.ckpt_flags
touch /mnt/f2fs/bar
f2fs_io fsync /mnt/f2fs/bar
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
blockdev --setro /dev/vdd
mount /dev/vdd /mnt/f2fs
mount: /mnt/f2fs: WARNING: source write-protected, mounted read-only.

For the case if we create and fsync a new inode before sudden power-cut,
without norecovery or disable_roll_forward mount option, the following
mount will succeed w/o recovering last fsynced inode.

The problem here is that we only check inode_list list after
find_fsync_dnodes() in f2fs_recover_fsync_data() to find out whether
there is recoverable data in the iamge, but there is a missed case, if
last fsynced inode is not existing in last checkpoint, then, we will
fail to get its inode due to nat of inode node is not existing in last
checkpoint, so the inode won't be linked in inode_list.

Let's detect such case in dyrun mode to fix this issue.

After this change, mount will fail as expected below:
mount: /mnt/f2fs: cannot mount /dev/vdd read-only.
       dmesg(1) may have more information after failed mount system call.
demsg:
F2FS-fs (vdd): Need to recover fsync data, but write access unavailable, please try mount w/ disable_roll_forward or norecovery

Cc: stable@kernel.org
Fixes: 6781eabba1bd ("f2fs: give -EINVAL for norecovery and rw mount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix return value of f2fs_recover_fsync_data()
Chao Yu [Wed, 5 Nov 2025 06:50:22 +0000 (14:50 +0800)] 
f2fs: fix return value of f2fs_recover_fsync_data()

With below scripts, it will trigger panic in f2fs:

mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync
echo 111 >> /mnt/f2fs/foo
f2fs_io fsync /mnt/f2fs/foo
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
mount -o ro,norecovery /dev/vdd /mnt/f2fs
or
mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs

F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0
F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f
F2FS-fs (vdd): Stopped filesystem due to reason: 0
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1
Filesystem f2fs get_tree() didn't set fc->root, returned 1
------------[ cut here ]------------
kernel BUG at fs/super.c:1761!
Oops: invalid opcode: 0000 [#1] SMP PTI
CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ #721 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:vfs_get_tree.cold+0x18/0x1a
Call Trace:
 <TASK>
 fc_mount+0x13/0xa0
 path_mount+0x34e/0xc50
 __x64_sys_mount+0x121/0x150
 do_syscall_64+0x84/0x800
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa6cc126cfe

The root cause is we missed to handle error number returned from
f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or
ro,disable_roll_forward mount option, result in returning a positive
error number to vfs_get_tree(), fix it.

Cc: stable@kernel.org
Fixes: 6781eabba1bd ("f2fs: give -EINVAL for norecovery and rw mount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: add fadvise tracepoint
Jaegeuk Kim [Tue, 28 Oct 2025 19:50:11 +0000 (19:50 +0000)] 
f2fs: add fadvise tracepoint

This adds a tracepoint in the fadvise call path.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix age extent cache insertion skip on counter overflow
Xiaole He [Mon, 27 Oct 2025 09:23:41 +0000 (17:23 +0800)] 
f2fs: fix age extent cache insertion skip on counter overflow

The age extent cache uses last_blocks (derived from
allocated_data_blocks) to determine data age. However, there's a
conflict between the deletion
marker (last_blocks=0) and legitimate last_blocks=0 cases when
allocated_data_blocks overflows to 0 after reaching ULLONG_MAX.

In this case, valid extents are incorrectly skipped due to the
"if (!tei->last_blocks)" check in __update_extent_tree_range().

This patch fixes the issue by:
1. Reserving ULLONG_MAX as an invalid/deletion marker
2. Limiting allocated_data_blocks to range [0, ULLONG_MAX-1]
3. Using F2FS_EXTENT_AGE_INVALID for deletion scenarios
4. Adjusting overflow age calculation from ULLONG_MAX to (ULLONG_MAX-1)

Reproducer (using a patched kernel with allocated_data_blocks
initialized to ULLONG_MAX - 3 for quick testing):

Step 1: Mount and check initial state
  # dd if=/dev/zero of=/tmp/test.img bs=1M count=100
  # mkfs.f2fs -f /tmp/test.img
  # mkdir -p /mnt/f2fs_test
  # mount -t f2fs -o loop,age_extent_cache /tmp/test.img /mnt/f2fs_test
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551612 # ULLONG_MAX - 3
  Inner Struct Count: tree: 1(0), node: 0

Step 2: Create files and write data to trigger overflow
  # touch /mnt/f2fs_test/{1,2,3,4}.txt; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551613 # ULLONG_MAX - 2
  Inner Struct Count: tree: 5(0), node: 1

  # dd if=/dev/urandom of=/mnt/f2fs_test/1.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551614 # ULLONG_MAX - 1
  Inner Struct Count: tree: 5(0), node: 2

  # dd if=/dev/urandom of=/mnt/f2fs_test/2.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 18446744073709551615 # ULLONG_MAX
  Inner Struct Count: tree: 5(0), node: 3

  # dd if=/dev/urandom of=/mnt/f2fs_test/3.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 0 # Counter overflowed!
  Inner Struct Count: tree: 5(0), node: 4

Step 3: Trigger the bug - next write should create node but gets skipped
  # dd if=/dev/urandom of=/mnt/f2fs_test/4.txt bs=4K count=1; sync
  # cat /sys/kernel/debug/f2fs/status | grep -A 4 "Block Age"
  Allocated Data Blocks: 1
  Inner Struct Count: tree: 5(0), node: 4

  Expected: node: 5 (new extent node for 4.txt)
  Actual: node: 4 (extent insertion was incorrectly skipped due to
  last_blocks = allocated_data_blocks = 0 in __get_new_block_age)

After this fix, the extent node is correctly inserted and node count
becomes 5 as expected.

Fixes: 71644dff4811 ("f2fs: add block_age-based extent cache")
Cc: stable@kernel.org
Signed-off-by: Xiaole He <hexiaole1994@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: Add sanity checks before unlinking and loading inodes
Nikola Z. Ivanov [Wed, 5 Nov 2025 11:09:43 +0000 (13:09 +0200)] 
f2fs: Add sanity checks before unlinking and loading inodes

Add check for inode->i_nlink == 1 for directories during unlink,
as their value is decremented twice, which can trigger a warning in
drop_nlink. In such case mark the filesystem as corrupted and return
from the function call with the relevant failure return value.

Additionally add the check for i_nlink == 1 in
sanity_check_inode in order to detect on-disk corruption early.

Reported-by: syzbot+c07d47c7bc68f47b9083@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c07d47c7bc68f47b9083
Tested-by: syzbot+c07d47c7bc68f47b9083@syzkaller.appspotmail.com
Signed-off-by: Nikola Z. Ivanov <zlatistiv@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: Rename f2fs_unlink exit label
Nikola Z. Ivanov [Wed, 5 Nov 2025 11:09:42 +0000 (13:09 +0200)] 
f2fs: Rename f2fs_unlink exit label

Rename "fail" label to "out" as it's used as a default
exit path out of f2fs_unlink as well as error path.

Signed-off-by: Nikola Z. Ivanov <zlatistiv@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: ensure minimum trim granularity accounts for all devices
Yongpeng Yang [Fri, 24 Oct 2025 14:37:46 +0000 (22:37 +0800)] 
f2fs: ensure minimum trim granularity accounts for all devices

When F2FS uses multiple block devices, each device may have a
different discard granularity. The minimum trim granularity must be
at least the maximum discard granularity of all devices, excluding
zoned devices. Use max_t instead of the max() macro to compute the
maximum value.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix uninitialized one_time_gc in victim_sel_policy
Xiaole He [Wed, 29 Oct 2025 05:18:07 +0000 (13:18 +0800)] 
f2fs: fix uninitialized one_time_gc in victim_sel_policy

The one_time_gc field in struct victim_sel_policy is conditionally
initialized but unconditionally read, leading to undefined behavior
that triggers UBSAN warnings.

In f2fs_get_victim() at fs/f2fs/gc.c:774, the victim_sel_policy
structure is declared without initialization:

    struct victim_sel_policy p;

The field p.one_time_gc is only assigned when the 'one_time' parameter
is true (line 789):

    if (one_time) {
        p.one_time_gc = one_time;
        ...
    }

However, this field is unconditionally read in subsequent get_gc_cost()
at line 395:

    if (p->one_time_gc && (valid_thresh_ratio < 100) && ...)

When one_time is false, p.one_time_gc contains uninitialized stack
memory. Hence p.one_time_gc is an invalid bool value.

UBSAN detects this invalid bool value:

    UBSAN: invalid-load in fs/f2fs/gc.c:395:7
    load of value 77 is not a valid value for type '_Bool'
    CPU: 3 UID: 0 PID: 1297 Comm: f2fs_gc-252:16 Not tainted 6.18.0-rc3
    #5 PREEMPT(voluntary)
    Hardware name: OpenStack Foundation OpenStack Nova,
    BIOS 1.13.0-1ubuntu1.1 04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x70/0x90
     dump_stack+0x14/0x20
     __ubsan_handle_load_invalid_value+0xb3/0xf0
     ? dl_server_update+0x2e/0x40
     ? update_curr+0x147/0x170
     f2fs_get_victim.cold+0x66/0x134 [f2fs]
     ? sched_balance_newidle+0x2ca/0x470
     ? finish_task_switch.isra.0+0x8d/0x2a0
     f2fs_gc+0x2ba/0x8e0 [f2fs]
     ? _raw_spin_unlock_irqrestore+0x12/0x40
     ? __timer_delete_sync+0x80/0xe0
     ? timer_delete_sync+0x14/0x20
     ? schedule_timeout+0x82/0x100
     gc_thread_func+0x38b/0x860 [f2fs]
     ? gc_thread_func+0x38b/0x860 [f2fs]
     ? __pfx_autoremove_wake_function+0x10/0x10
     kthread+0x10b/0x220
     ? __pfx_gc_thread_func+0x10/0x10 [f2fs]
     ? _raw_spin_unlock_irq+0x12/0x40
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x11a/0x160
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>

This issue is reliably reproducible with the following steps on a
100GB SSD /dev/vdb:

    mkfs.f2fs -f /dev/vdb
    mount /dev/vdb /mnt/f2fs_test
    fio --name=gc --directory=/mnt/f2fs_test --rw=randwrite \
        --bs=4k --size=8G --numjobs=12 --fsync=4 --runtime=10 \
        --time_based
    echo 1 > /sys/fs/f2fs/vdb/gc_urgent

The uninitialized value causes incorrect GC victim selection, leading
to unpredictable garbage collection behavior.

Fix by zero-initializing the entire victim_sel_policy structure to
ensure all fields have defined values.

Fixes: e791d00bd06c ("f2fs: add valid block ratio not to do excessive GC for one time GC")
Cc: stable@kernel.org
Signed-off-by: Xiaole He <hexiaole1994@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix to access i_size w/ i_size_read()
Chao Yu [Wed, 29 Oct 2025 06:31:04 +0000 (14:31 +0800)] 
f2fs: fix to access i_size w/ i_size_read()

It recommends to use i_size_{read,write}() to access and update i_size,
otherwise, we may get wrong tearing value due to high 32-bits value
and low 32-bits value of i_size field are not updated atomically in
32-bits archicture machine.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: ensure node page reads complete before f2fs_put_super() finishes
Jan Prusakowski [Mon, 6 Oct 2025 08:46:15 +0000 (10:46 +0200)] 
f2fs: ensure node page reads complete before f2fs_put_super() finishes

Xfstests generic/335, generic/336 sometimes crash with the following message:

F2FS-fs (dm-0): detect filesystem reference count leak during umount, type: 9, count: 1
------------[ cut here ]------------
kernel BUG at fs/f2fs/super.c:1939!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 609351 Comm: umount Tainted: G        W           6.17.0-rc5-xfstests-g9dd1835ecda5 #1 PREEMPT(none)
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:f2fs_put_super+0x3b3/0x3c0
Call Trace:
 <TASK>
 generic_shutdown_super+0x7e/0x190
 kill_block_super+0x1a/0x40
 kill_f2fs_super+0x9d/0x190
 deactivate_locked_super+0x30/0xb0
 cleanup_mnt+0xba/0x150
 task_work_run+0x5c/0xa0
 exit_to_user_mode_loop+0xb7/0xc0
 do_syscall_64+0x1ae/0x1c0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>
---[ end trace 0000000000000000 ]---

It appears that sometimes it is possible that f2fs_put_super() is called before
all node page reads are completed.
Adding a call to f2fs_wait_on_all_pages() for F2FS_RD_NODE fixes the problem.

Cc: stable@kernel.org
Fixes: 20872584b8c0b ("f2fs: fix to drop all dirty meta/node pages during umount()")
Signed-off-by: Jan Prusakowski <jprusakowski@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: block cache/dio write during f2fs_enable_checkpoint()
Chao Yu [Mon, 27 Oct 2025 06:35:34 +0000 (14:35 +0800)] 
f2fs: block cache/dio write during f2fs_enable_checkpoint()

If there are too many background IOs during f2fs_enable_checkpoint(),
sync_inodes_sb() may be blocked for long time due to it will loop to
write dirty datas which are generated by in parallel write()
continuously.

Let's change as below to resolve this issue:
- hold cp_enable_rwsem write lock to block any cache/dio write
- decrease DEF_ENABLE_INTERVAL from 16 to 5

In addition, dump more logs during f2fs_enable_checkpoint().

Testcase:
1. fill data into filesystem until 90% usage.
2. mount -o remount,checkpoint=disable:10% /data
3. fio --rw=randwrite  --bs=4kb  --size=1GB  --numjobs=10  \
--iodepth=64  --ioengine=psync  --time_based  --runtime=600 \
--directory=/data/fio_dir/ &
4. mount -o remount,checkpoint=enable /data

Before:
F2FS-fs (dm-51): f2fs_enable_checkpoint() finishes, writeback:7232, sync:39793, cp:457

After:
F2FS-fs (dm-51): f2fs_enable_checkpoint end, writeback:5032, lock:0, sync_inode:5552, sync_fs:84

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix to propagate error from f2fs_enable_checkpoint()
Chao Yu [Mon, 27 Oct 2025 06:35:33 +0000 (14:35 +0800)] 
f2fs: fix to propagate error from f2fs_enable_checkpoint()

In order to let userspace detect such error rather than suffering
silent failure.

Fixes: 4354994f097d ("f2fs: checkpoint disabling")
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: change the unlock parameter of f2fs_put_page to bool
Yongpeng Yang [Mon, 27 Oct 2025 12:55:43 +0000 (20:55 +0800)] 
f2fs: change the unlock parameter of f2fs_put_page to bool

Change the type of the unlock parameter of f2fs_put_page to bool.
All callers should consistently pass true or false. No logical change.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: invalidate dentry cache on failed whiteout creation
Deepanshu Kartikey [Mon, 27 Oct 2025 13:06:34 +0000 (18:36 +0530)] 
f2fs: invalidate dentry cache on failed whiteout creation

F2FS can mount filesystems with corrupted directory depth values that
get runtime-clamped to MAX_DIR_HASH_DEPTH. When RENAME_WHITEOUT
operations are performed on such directories, f2fs_rename performs
directory modifications (updating target entry and deleting source
entry) before attempting to add the whiteout entry via f2fs_add_link.

If f2fs_add_link fails due to the corrupted directory structure, the
function returns an error to VFS, but the partial directory
modifications have already been committed to disk. VFS assumes the
entire rename operation failed and does not update the dentry cache,
leaving stale mappings.

In the error path, VFS does not call d_move() to update the dentry
cache. This results in new_dentry still pointing to the old inode
(new_inode) which has already had its i_nlink decremented to zero.
The stale cache causes subsequent operations to incorrectly reference
the freed inode.

This causes subsequent operations to use cached dentry information that
no longer matches the on-disk state. When a second rename targets the
same entry, VFS attempts to decrement i_nlink on the stale inode, which
may already have i_nlink=0, triggering a WARNING in drop_nlink().

Example sequence:
1. First rename (RENAME_WHITEOUT): file2 → file1
   - f2fs updates file1 entry on disk (points to inode 8)
   - f2fs deletes file2 entry on disk
   - f2fs_add_link(whiteout) fails (corrupted directory)
   - Returns error to VFS
   - VFS does not call d_move() due to error
   - VFS cache still has: file1 → inode 7 (stale!)
   - inode 7 has i_nlink=0 (already decremented)

2. Second rename: file3 → file1
   - VFS uses stale cache: file1 → inode 7
   - Tries to drop_nlink on inode 7 (i_nlink already 0)
   - WARNING in drop_nlink()

Fix this by explicitly invalidating old_dentry and new_dentry when
f2fs_add_link fails during whiteout creation. This forces VFS to
refresh from disk on subsequent operations, ensuring cache consistency
even when the rename partially succeeds.

Reproducer:
1. Mount F2FS image with corrupted i_current_depth
2. renameat2(file2, file1, RENAME_WHITEOUT)
3. renameat2(file3, file1, 0)
4. System triggers WARNING in drop_nlink()

Fixes: 7e01e7ad746b ("f2fs: support RENAME_WHITEOUT")
Reported-by: syzbot+632cf32276a9a564188d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=632cf32276a9a564188d
Suggested-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/all/20251022233349.102728-1-kartikey406@gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: use global inline_xattr_slab instead of per-sb slab cache
Chao Yu [Tue, 21 Oct 2025 03:48:56 +0000 (11:48 +0800)] 
f2fs: use global inline_xattr_slab instead of per-sb slab cache

As Hong Yun reported in mailing list:

loop7: detected capacity change from 0 to 131072
------------[ cut here ]------------
kmem_cache of name 'f2fs_xattr_entry-7:7' already exists
WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 kmem_cache_sanity_check mm/slab_common.c:109 [inline]
WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 __kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307
CPU: 0 UID: 0 PID: 24426 Comm: syz.7.1370 Not tainted 6.17.0-rc4 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline]
RIP: 0010:__kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307
Call Trace:
 __kmem_cache_create include/linux/slab.h:353 [inline]
 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2943 [inline]
 f2fs_init_xattr_caches+0xa5/0xe0 fs/f2fs/xattr.c:843
 f2fs_fill_super+0x1645/0x2620 fs/f2fs/super.c:4918
 get_tree_bdev_flags+0x1fb/0x260 fs/super.c:1692
 vfs_get_tree+0x43/0x140 fs/super.c:1815
 do_new_mount+0x201/0x550 fs/namespace.c:3808
 do_mount fs/namespace.c:4136 [inline]
 __do_sys_mount fs/namespace.c:4347 [inline]
 __se_sys_mount+0x298/0x2f0 fs/namespace.c:4324
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x8e/0x3a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The bug can be reproduced w/ below scripts:
- mount /dev/vdb /mnt1
- mount /dev/vdc /mnt2
- umount /mnt1
- mounnt /dev/vdb /mnt1

The reason is if we created two slab caches, named f2fs_xattr_entry-7:3
and f2fs_xattr_entry-7:7, and they have the same slab size. Actually,
slab system will only create one slab cache core structure which has
slab name of "f2fs_xattr_entry-7:3", and two slab caches share the same
structure and cache address.

So, if we destroy f2fs_xattr_entry-7:3 cache w/ cache address, it will
decrease reference count of slab cache, rather than release slab cache
entirely, since there is one more user has referenced the cache.

Then, if we try to create slab cache w/ name "f2fs_xattr_entry-7:3" again,
slab system will find that there is existed cache which has the same name
and trigger the warning.

Let's changes to use global inline_xattr_slab instead of per-sb slab cache
for fixing.

Fixes: a999150f4fe3 ("f2fs: use kmem_cache pool during inline xattr lookups")
Cc: stable@kernel.org
Reported-by: Hong Yun <yhong@link.cuhk.edu.hk>
Tested-by: Hong Yun <yhong@link.cuhk.edu.hk>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix to avoid updating compression context during writeback
Chao Yu [Wed, 22 Oct 2025 03:06:36 +0000 (11:06 +0800)] 
f2fs: fix to avoid updating compression context during writeback

Bai, Shuangpeng <sjb7183@psu.edu> reported a bug as below:

Oops: divide error: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 11441 Comm: syz.0.46 Not tainted 6.17.0 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:f2fs_all_cluster_page_ready+0x106/0x550 fs/f2fs/compress.c:857
Call Trace:
 <TASK>
 f2fs_write_cache_pages fs/f2fs/data.c:3078 [inline]
 __f2fs_write_data_pages fs/f2fs/data.c:3290 [inline]
 f2fs_write_data_pages+0x1c19/0x3600 fs/f2fs/data.c:3317
 do_writepages+0x38e/0x640 mm/page-writeback.c:2634
 filemap_fdatawrite_wbc mm/filemap.c:386 [inline]
 __filemap_fdatawrite_range mm/filemap.c:419 [inline]
 file_write_and_wait_range+0x2ba/0x3e0 mm/filemap.c:794
 f2fs_do_sync_file+0x6e6/0x1b00 fs/f2fs/file.c:294
 generic_write_sync include/linux/fs.h:3043 [inline]
 f2fs_file_write_iter+0x76e/0x2700 fs/f2fs/file.c:5259
 new_sync_write fs/read_write.c:593 [inline]
 vfs_write+0x7e9/0xe00 fs/read_write.c:686
 ksys_write+0x19d/0x2d0 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xf7/0x470 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The bug was triggered w/ below race condition:

fsync setattr ioctl
- f2fs_do_sync_file
 - file_write_and_wait_range
  - f2fs_write_cache_pages
  : inode is non-compressed
  : cc.cluster_size =
    F2FS_I(inode)->i_cluster_size = 0
   - tag_pages_for_writeback
- f2fs_setattr
 - truncate_setsize
 - f2fs_truncate
- f2fs_fileattr_set
 - f2fs_setflags_common
  - set_compress_context
  : F2FS_I(inode)->i_cluster_size = 4
  : set_inode_flag(inode, FI_COMPRESSED_FILE)
   - f2fs_compressed_file
   : return true
   - f2fs_all_cluster_page_ready
   : "pgidx % cc->cluster_size" trigger dividing 0 issue

Let's change as below to fix this issue:
- introduce a new atomic type variable .writeback in structure f2fs_inode_info
to track the number of threads which calling f2fs_write_cache_pages().
- use .i_sem lock to protect .writeback update.
- check .writeback before update compression context in f2fs_setflags_common()
to avoid race w/ ->writepages.

Fixes: 4c8ff7095bef ("f2fs: support data compression")
Cc: stable@kernel.org
Reported-by: Bai, Shuangpeng <sjb7183@psu.edu>
Tested-by: Bai, Shuangpeng <sjb7183@psu.edu>
Closes: https://lore.kernel.org/lkml/44D8F7B3-68AD-425F-9915-65D27591F93F@psu.edu
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix to avoid updating zero-sized extent in extent cache
Chao Yu [Mon, 20 Oct 2025 02:42:12 +0000 (10:42 +0800)] 
f2fs: fix to avoid updating zero-sized extent in extent cache

As syzbot reported:

F2FS-fs (loop0): __update_extent_tree_range: extent len is zero, type: 0, extent [0, 0, 0], age [0, 0]
------------[ cut here ]------------
kernel BUG at fs/f2fs/extent_cache.c:678!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5336 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:__update_extent_tree_range+0x13bc/0x1500 fs/f2fs/extent_cache.c:678
Call Trace:
 <TASK>
 f2fs_update_read_extent_cache_range+0x192/0x3e0 fs/f2fs/extent_cache.c:1085
 f2fs_do_zero_range fs/f2fs/file.c:1657 [inline]
 f2fs_zero_range+0x10c1/0x1580 fs/f2fs/file.c:1737
 f2fs_fallocate+0x583/0x990 fs/f2fs/file.c:2030
 vfs_fallocate+0x669/0x7e0 fs/open.c:342
 ioctl_preallocate fs/ioctl.c:289 [inline]
 file_ioctl+0x611/0x780 fs/ioctl.c:-1
 do_vfs_ioctl+0xb33/0x1430 fs/ioctl.c:576
 __do_sys_ioctl fs/ioctl.c:595 [inline]
 __se_sys_ioctl+0x82/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f07bc58eec9

In error path of f2fs_zero_range(), it may add a zero-sized extent
into extent cache, it should be avoided.

Fixes: 6e9619499f53 ("f2fs: support in batch fzero in dnode page")
Cc: stable@kernel.org
Reported-by: syzbot+24124df3170c3638b35f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/68e5d698.050a0220.256323.0032.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: fix to avoid potential deadlock
Chao Yu [Tue, 14 Oct 2025 11:47:35 +0000 (19:47 +0800)] 
f2fs: fix to avoid potential deadlock

As Jiaming Zhang and syzbot reported, there is potential deadlock in
f2fs as below:

Chain exists of:
  &sbi->cp_rwsem --> fs_reclaim --> sb_internal#2

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(sb_internal#2);
                               lock(fs_reclaim);
                               lock(sb_internal#2);
  rlock(&sbi->cp_rwsem);

 *** DEADLOCK ***

3 locks held by kswapd0/73:
 #0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:7015 [inline]
 #0: ffffffff8e247a40 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0x951/0x2800 mm/vmscan.c:7389
 #1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_trylock_shared fs/super.c:562 [inline]
 #1: ffff8880118400e0 (&type->s_umount_key#50){.+.+}-{4:4}, at: super_cache_scan+0x91/0x4b0 fs/super.c:197
 #2: ffff888011840610 (sb_internal#2){.+.+}-{0:0}, at: f2fs_evict_inode+0x8d9/0x1b60 fs/f2fs/inode.c:890

stack backtrace:
CPU: 0 UID: 0 PID: 73 Comm: kswapd0 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_circular_bug+0x2ee/0x310 kernel/locking/lockdep.c:2043
 check_noncircular+0x134/0x160 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain+0xb9b/0x2140 kernel/locking/lockdep.c:3908
 __lock_acquire+0xab9/0xd20 kernel/locking/lockdep.c:5237
 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5868
 down_read+0x46/0x2e0 kernel/locking/rwsem.c:1537
 f2fs_down_read fs/f2fs/f2fs.h:2278 [inline]
 f2fs_lock_op fs/f2fs/f2fs.h:2357 [inline]
 f2fs_do_truncate_blocks+0x21c/0x10c0 fs/f2fs/file.c:791
 f2fs_truncate_blocks+0x10a/0x300 fs/f2fs/file.c:867
 f2fs_truncate+0x489/0x7c0 fs/f2fs/file.c:925
 f2fs_evict_inode+0x9f2/0x1b60 fs/f2fs/inode.c:897
 evict+0x504/0x9c0 fs/inode.c:810
 f2fs_evict_inode+0x1dc/0x1b60 fs/f2fs/inode.c:853
 evict+0x504/0x9c0 fs/inode.c:810
 dispose_list fs/inode.c:852 [inline]
 prune_icache_sb+0x21b/0x2c0 fs/inode.c:1000
 super_cache_scan+0x39b/0x4b0 fs/super.c:224
 do_shrink_slab+0x6ef/0x1110 mm/shrinker.c:437
 shrink_slab_memcg mm/shrinker.c:550 [inline]
 shrink_slab+0x7ef/0x10d0 mm/shrinker.c:628
 shrink_one+0x28a/0x7c0 mm/vmscan.c:4955
 shrink_many mm/vmscan.c:5016 [inline]
 lru_gen_shrink_node mm/vmscan.c:5094 [inline]
 shrink_node+0x315d/0x3780 mm/vmscan.c:6081
 kswapd_shrink_node mm/vmscan.c:6941 [inline]
 balance_pgdat mm/vmscan.c:7124 [inline]
 kswapd+0x147c/0x2800 mm/vmscan.c:7389
 kthread+0x70e/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

The root cause is deadlock among four locks as below:

kswapd
- fs_reclaim --- Lock A
 - shrink_one
  - evict
   - f2fs_evict_inode
    - sb_start_intwrite --- Lock B

- iput
 - evict
  - f2fs_evict_inode
   - sb_start_intwrite --- Lock B
   - f2fs_truncate
    - f2fs_truncate_blocks
     - f2fs_do_truncate_blocks
      - f2fs_lock_op --- Lock C

ioctl
- f2fs_ioc_commit_atomic_write
 - f2fs_lock_op --- Lock C
  - __f2fs_commit_atomic_write
   - __replace_atomic_write_block
    - f2fs_get_dnode_of_data
     - __get_node_folio
      - f2fs_check_nid_range
       - f2fs_handle_error
        - f2fs_record_errors
         - f2fs_down_write --- Lock D

open
- do_open
 - do_truncate
  - security_inode_need_killpriv
   - f2fs_getxattr
    - lookup_all_xattrs
     - f2fs_handle_error
      - f2fs_record_errors
       - f2fs_down_write --- Lock D
        - f2fs_commit_super
         - read_mapping_folio
          - filemap_alloc_folio_noprof
           - prepare_alloc_pages
            - fs_reclaim_acquire --- Lock A

In order to avoid such deadlock, we need to avoid grabbing sb_lock in
f2fs_handle_error(), so, let's use asynchronous method instead:
- remove f2fs_handle_error() implementation
- rename f2fs_handle_error_async() to f2fs_handle_error()
- spread f2fs_handle_error()

Fixes: 95fa90c9e5a7 ("f2fs: support recording errors into superblock")
Cc: stable@kernel.org
Reported-by: syzbot+14b90e1156b9f6fc1266@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/68eae49b.050a0220.ac43.0001.GAE@google.com
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Closes: https://lore.kernel.org/lkml/CANypQFa-Gy9sD-N35o3PC+FystOWkNuN8pv6S75HLT0ga-Tzgw@mail.gmail.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: use f2fs_filemap_get_folio() to support fault injection
Chao Yu [Tue, 14 Oct 2025 06:27:04 +0000 (14:27 +0800)] 
f2fs: use f2fs_filemap_get_folio() to support fault injection

Use f2fs_filemap_get_folio() instead of __filemap_get_folio() in:
- f2fs_find_data_folio
- f2fs_write_begin
- f2fs_read_merkle_tree_page

So that, we can trigger fault injection in those places.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: use f2fs_filemap_get_folio() instead of f2fs_pagecache_get_page()
Chao Yu [Tue, 14 Oct 2025 06:27:03 +0000 (14:27 +0800)] 
f2fs: use f2fs_filemap_get_folio() instead of f2fs_pagecache_get_page()

Let's use f2fs_filemap_get_folio() instead of f2fs_pagecache_get_page() in
ra_data_block() and move_data_block(), then remove f2fs_pagecache_get_page()
since it has no user.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: convert add_ipu_page() to use folio
Chao Yu [Tue, 14 Oct 2025 06:27:02 +0000 (14:27 +0800)] 
f2fs: convert add_ipu_page() to use folio

No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
9 days agof2fs: clean up w/ bio_add_folio_nofail()
Chao Yu [Tue, 14 Oct 2025 06:27:01 +0000 (14:27 +0800)] 
f2fs: clean up w/ bio_add_folio_nofail()

In add_bio_entry(), adding a page to newly allocated bio should never fail,
let's use bio_add_folio_nofail() instead of bio_add_page() & unnecessary
error handling for cleanup.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
5 weeks agof2fs: Use mapping->gfp_mask to get file cache for writing
Jiucheng Xu [Fri, 10 Oct 2025 10:45:50 +0000 (10:45 +0000)] 
f2fs: Use mapping->gfp_mask to get file cache for writing

On 32-bit architectures, when GFP_NOFS is used, the file cache for write
operations cannot be allocated from the highmem and CMA.

Since mapping->gfp_mask is set to GFP_HIGHUSER_MOVABLE during inode
allocation, using mapping_gfp_mask(mapping) as the GFP flag of getting file
cache for writing is more efficient for 32-bit architectures.

Additionally, use FGP_NOFS to avoid potential deadlock issues caused by
GFP_FS in GFP_HIGHUSER_MOVABLE

Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 weeks agof2fs: use folio_nr_pages() instead of shift operation
Pedro Demarchi Gomes [Sat, 4 Oct 2025 03:12:17 +0000 (00:12 -0300)] 
f2fs: use folio_nr_pages() instead of shift operation

folio_nr_pages() is a faster helper function to get the number of pages when
NR_PAGES_IN_LARGE_FOLIO is enabled.

Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 weeks agof2fs: set default valid_thresh_ratio to 80 for zoned devices
Daeho Jeong [Tue, 7 Oct 2025 16:46:14 +0000 (09:46 -0700)] 
f2fs: set default valid_thresh_ratio to 80 for zoned devices

Zoned storage devices provide marginal over-capacity space, typically
around 10%, for filesystem level storage control.

By utilizing this extra capacity, we can safely reduce the default
'valid_thresh_ratio' to 80. This action helps to significantly prevent
excessive garbage collection (GC) and the resulting power consumption,
as the filesystem becomes less aggressive about cleaning segments
that still hold a high percentage of valid data.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 weeks agof2fs: maintain one time GC mode is enabled during whole zoned GC cycle
Daeho Jeong [Fri, 3 Oct 2025 22:43:08 +0000 (15:43 -0700)] 
f2fs: maintain one time GC mode is enabled during whole zoned GC cycle

The current version missed setting one time GC for normal zoned GC
cycle. So, valid threshold control is not working. Need to fix it to
prevent excessive GC for zoned devices.

Fixes: e791d00bd06c ("f2fs: add valid block ratio not to do excessive GC for one time GC")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 weeks agoMerge tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 24 Oct 2025 19:48:19 +0000 (12:48 -0700)] 
Merge tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

 - Fix dma alignment for PI

 - Fix selinux bogosity with nbd, where sendmsg would get rejected

* tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: require LBA dma_alignment when using PI
  nbd: override creds to kernel when calling sock_{send,recv}msg()

7 weeks agoMerge tag 'io_uring-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 24 Oct 2025 19:44:31 +0000 (12:44 -0700)] 
Merge tag 'io_uring-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Add MAINTAINERS entry for zcrx, mostly so that netdev gets
   automatically CC'ed by default on any changes there too.

 - Fix for the SQPOLL busy vs work time accounting.

   It was using getrusage(), which was both broken from a thread point
   of view (we only care about the SQPOLL thread itself), and vastly
   overkill as only the systime was used. On top of that, also be a bit
   smarter in when it's queried. It used excessive CPU before this
   change. Marked for stable as well.

 - Fix provided ring buffer auto commit for uring_cmd.

 - Fix a few style issues and sparse annotation for a lock.

* tag 'io_uring-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: fix buffer auto-commit for multishot uring_cmd
  io_uring: correct __must_hold annotation in io_install_fixed_file
  io_uring zcrx: add MAINTAINERS entry
  io_uring: Fix code indentation error
  io_uring/sqpoll: be smarter on when to update the stime usage
  io_uring/sqpoll: switch away from getrusage() for CPU accounting
  io_uring: fix incorrect unlikely() usage in io_waitid_prep()

7 weeks agoMerge tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka...
Linus Torvalds [Fri, 24 Oct 2025 19:40:51 +0000 (12:40 -0700)] 
Merge tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fixes from Vlastimil Babka:

 - Two fixes for race conditions in obj_exts allocation (Hao Ge)

 - Fix for slab accounting imbalance due to deferred slab decativation
   (Vlastimil Babka)

* tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  slab: Fix obj_ext mistakenly considered NULL due to race condition
  slab: fix slab accounting imbalance due to defer_deactivate_slab()
  slab: Avoid race on slab->obj_exts in alloc_slab_obj_exts

7 weeks agoMerge tag 'devicetree-fixes-for-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 24 Oct 2025 18:17:38 +0000 (11:17 -0700)] 
Merge tag 'devicetree-fixes-for-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

 - Fix handling of GICv5 ITS MSI properties on platforms with
   'msi-parent' as well as a of_node refcounting fix.

   This is also preparation for further refactoring in 6.19 to use
   common DT parsing of MSI properties.

* tag 'devicetree-fixes-for-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  of/irq: Export of_msi_xlate() for module usage
  of/irq: Fix OF node refcount in of_msi_get_domain()
  of/irq: Add msi-parent check to of_msi_xlate()

7 weeks agoMerge tag 'soc-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Fri, 24 Oct 2025 18:15:17 +0000 (11:15 -0700)] 
Merge tag 'soc-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull SoC fixes from Arnd Bergmann:
 "The main change this time is an update to the MAINTAINERS file,
  listing Krzysztof Kozlowski, Alexandre Belloni, and Linus Walleij as
  additional maintainers for the SoC tree, in order to go back to a
  group maintainership. Drew Fustini joins as an additional reviewer for
  the SoC tree.

  Thanks to all of you for volunteering to help out.

  On the actual bugfixes, we have a few correctness changes for firmware
  drivers (qtee, arm-ffa, scmi) and two devicetree fixes for Raspberry
  Pi"

* tag 'soc-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  soc: officially expand maintainership team
  firmware: arm_scmi: Fix premature SCMI_XFER_FLAG_IS_RAW clearing in raw mode
  firmware: arm_scmi: Skip RAW initialization on failure
  include: trace: Fix inflight count helper on failed initialization
  firmware: arm_scmi: Account for failed debug initialization
  ARM: dts: broadcom: rpi: Switch to V3D firmware clock
  arm64: dts: broadcom: bcm2712: Define VGIC interrupt
  firmware: arm_ffa: Add support for IMPDEF value in the memory access descriptor
  tee: QCOMTEE should depend on ARCH_QCOM
  tee: qcom: return -EFAULT instead of -EINVAL if copy_from_user() fails
  tee: qcom: prevent potential off by one read

7 weeks agoMerge tag 'hwmon-for-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 24 Oct 2025 18:11:35 +0000 (11:11 -0700)] 
Merge tag 'hwmon-for-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:

 - cgbc-hwmon: Add missing NULL check after devm_kzalloc

 - gpd-fan: Fix error handling

 - pmbus/isl68137: Fix child node reference leak

 - pmbus/max34440: Update adpm12160 coefficients to match latest FW

 - sht3x: Fix error handling

* tag 'hwmon-for-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (sht3x) Fix error handling
  hwmon: (cgbc-hwmon) Add missing NULL check after devm_kzalloc()
  hwmon: (pmbus/isl68137) Fix child node reference leak on early return
  hwmon: (gpd-fan) Fix error handling in gpd_fan_probe()
  hwmon: (gpd-fan) Fix return value when platform_get_resource() fails
  hwmon: (pmbus/max34440) Update adpm12160 coeff due to latest FW

7 weeks agoMerge tag 'spi-fix-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brooni...
Linus Torvalds [Fri, 24 Oct 2025 18:01:40 +0000 (11:01 -0700)] 
Merge tag 'spi-fix-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A moderately large collection of device specific changes here, mostly
  fixes but also including a few new quirks and device IDs. This is all
  fairly routine even for the affected devices"

* tag 'spi-fix-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: dt-bindings: spi-rockchip: Add RK3506 compatible
  spi: intel-pci: Add support for Intel Wildcat Lake SPI serial flash
  spi: intel-pci: Add support for Arrow Lake-H SPI serial flash
  spi: intel: Add support for 128M component density
  spi: airoha: fix reading/writing of flashes with more than one plane per lun
  spi: airoha: switch back to non-dma mode in the case of error
  spi: airoha: add support of dual/quad wires spi modes to exec_op() handler
  spi: airoha: return an error for continuous mode dirmap creation cases
  spi: amlogic: fix spifc build error
  spi: cadence-quadspi: Fix pm_runtime unbalance on dma EPROBE_DEFER
  spi: spi-nxp-fspi: limit the clock rate for different sample clock source selection
  spi: spi-nxp-fspi: add extra delay after dll locked
  spi: spi-nxp-fspi: re-config the clock rate when operation require new clock rate
  spi: dw-mmio: add error handling for reset_control_deassert()
  spi: rockchip-sfc: Fix DMA-API usage
  spi: dt-bindings: cadence: add soc-specific compatible strings for zynqmp and versal-net

7 weeks agoMerge tag 'gpio-fixes-for-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 24 Oct 2025 17:45:29 +0000 (10:45 -0700)] 
Merge tag 'gpio-fixes-for-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

 - fix regressions in regmap cache initialization in gpio-104-idio-16
   and gpio-pci-idio-16

 - configure first 16 GPIO lines of the IDIO-16 as fixed outputs

 - fix duplicated IRQ mapping that can lead to an RCU stall in gpio-ljca

 - fix printf formatters passed to dev_err() and make failure to set
   debounce period non fatal

* tag 'gpio-fixes-for-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: ljca: Fix duplicated IRQ mapping
  gpiolib: acpi: Use %pe when passing an error pointer to dev_err()
  gpiolib: acpi: Make set debounce errors non fatal
  gpio: idio-16: Define fixed direction of the GPIO lines
  gpio: regmap: add the .fixed_direction_output configuration parameter
  gpio: pci-idio-16: Define maximum valid register address offset
  gpio: 104-idio-16: Define maximum valid register address offset

7 weeks agosoc: officially expand maintainership team
Arnd Bergmann [Fri, 17 Oct 2025 14:08:24 +0000 (16:08 +0200)] 
soc: officially expand maintainership team

Since Olof moved on from the soc tree maintenance, Arnd has mainly taken
care of the day-to-day activities around the SoC tree by himself, which
is generally not a good setup.

Krzysztof, Linus and Alexandre have volunteered to become co-maintainers
of the SoC tree, with the plan of taking turns to do merges and reviews
to spread the workload. In addition, Drew joins as another reviewer.

Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
7 weeks agoof/irq: Export of_msi_xlate() for module usage
Lorenzo Pieralisi [Tue, 21 Oct 2025 12:41:01 +0000 (14:41 +0200)] 
of/irq: Export of_msi_xlate() for module usage

of_msi_xlate() is required by drivers that can be configured
as modular, export the symbol.

Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Cc: Rob Herring <robh@kernel.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251021124103.198419-4-lpieralisi@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
7 weeks agoslab: Fix obj_ext mistakenly considered NULL due to race condition
Hao Ge [Thu, 23 Oct 2025 14:33:13 +0000 (22:33 +0800)] 
slab: Fix obj_ext mistakenly considered NULL due to race condition

If two competing threads enter alloc_slab_obj_exts(), and the one that
allocates the vector wins the cmpxchg(), the other thread that failed
allocation mistakenly assumes that slab->obj_exts is still empty due to
its own allocation failure. This will then trigger warnings with
CONFIG_MEM_ALLOC_PROFILING_DEBUG checks in the subsequent free path.

Therefore, let's check the result of cmpxchg() to see if marking the
allocation as failed was successful. If it wasn't, check whether the
winning side has succeeded its allocation (it might have been also
marking it as failed) and if yes, return success.

Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Link: https://patch.msgid.link/20251023143313.1327968-1-hao.ge@linux.dev
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
7 weeks agoio_uring: fix buffer auto-commit for multishot uring_cmd
Ming Lei [Fri, 24 Oct 2025 01:34:59 +0000 (09:34 +0800)] 
io_uring: fix buffer auto-commit for multishot uring_cmd

Commit 620a50c92700 ("io_uring: uring_cmd: add multishot support") added
multishot uring_cmd support with explicit buffer upfront commit via
io_uring_mshot_cmd_post_cqe(). However, the buffer selection path in
io_ring_buffer_select() was auto-committing buffers for non-pollable files,
which conflicts with uring_cmd's explicit upfront commit model.

This way consumes the whole selected buffer immediately, and causes
failure on the following buffer selection.

Fix this by checking uring_cmd to identify operations that handle buffer
commit explicitly, and skip auto-commit for these operations.

Cc: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 620a50c92700 ("io_uring: uring_cmd: add multishot support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoMAINTAINERS: add Mark Brown as a linux-next maintainer
Stephen Rothwell [Wed, 22 Oct 2025 05:36:25 +0000 (16:36 +1100)] 
MAINTAINERS: add Mark Brown as a linux-next maintainer

Mark has been kindly helping fill in when I have been unavailable over
the past several years.  He has also put his hand up to take over
linux-next maintenance when I finally decide to stop (which may be some
time yet ;-) ).

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 weeks agoMerge tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Thu, 23 Oct 2025 23:50:25 +0000 (16:50 -0700)] 
Merge tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A couple of fixes for Runtime Verification:

   - A bug caused a kernel panic when reading enabled_monitors was
     reported.

     Change callback functions to always use list_head iterators and by
     doing so, fix the wrong pointer that was leading to the panic.

   - The rtapp/pagefault monitor relies on the MMU to be present
     (pagefaults exist) but that was not enforced via kconfig, leading
     to potential build errors on systems without an MMU.

     Add that kconfig dependency"

* tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Make rtapp/pagefault monitor depends on CONFIG_MMU
  rv: Fully convert enabled_monitors to use list_head as iterator

7 weeks agoMerge tag 'arm-soc/for-6.18/devicetree-arm64-fixes' of https://github.com/Broadcom...
Arnd Bergmann [Thu, 23 Oct 2025 20:30:41 +0000 (22:30 +0200)] 
Merge tag 'arm-soc/for-6.18/devicetree-arm64-fixes' of https://github.com/Broadcom/stblinux into arm/fixes

This pull request contains Broadcom ARM64-based SoCs Device Tree fixes
for 6.18, please pull the following:

- Peter describes the VGIC interrupt line such that KVM can be used on
  Raspberry Pi 5 systems.

* tag 'arm-soc/for-6.18/devicetree-arm64-fixes' of https://github.com/Broadcom/stblinux:
  arm64: dts: broadcom: bcm2712: Define VGIC interrupt

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
7 weeks agoMerge tag 'arm-soc/for-6.18/devicetree-fixes' of https://github.com/Broadcom/stblinux...
Arnd Bergmann [Thu, 23 Oct 2025 20:30:29 +0000 (22:30 +0200)] 
Merge tag 'arm-soc/for-6.18/devicetree-fixes' of https://github.com/Broadcom/stblinux into arm/fixes

This pull request contains Broadcom ARM-based SoCs Device Tree fixes for
6.18, please pull the following:

- Stefan switches the V3D block to use the firmware clock, rather than
  the bare metal clock. This fixes hangs on boot after recent changes to
  the V3D driver clocking went in.

* tag 'arm-soc/for-6.18/devicetree-fixes' of https://github.com/Broadcom/stblinux:
  ARM: dts: broadcom: rpi: Switch to V3D firmware clock

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
7 weeks agoMerge tag 'scmi-fixes-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep...
Arnd Bergmann [Thu, 23 Oct 2025 20:30:01 +0000 (22:30 +0200)] 
Merge tag 'scmi-fixes-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes

Arm SCMI fixes for v6.18

This series contains a set of small, focused fixes that address
robustness and lifecycle issues in the Arm SCMI core and debug support,
ensuring safer handling of debug initialization failures, correct flag
management in raw mode, and consistent inflight counter tracking.

Brief summary:

 - Fix raw xfer flag clearing
 - Skip RAW debug initialization on failure
 - Make inflight counter helpers null-safe, preventing crashes if debug
   initialization fails
 - Account for failed debug initialization globally

There is no functional change for standard SCMI operation, but these
fixes improve stability in debug and raw modes, particularly in error
paths.

* tag 'scmi-fixes-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
  firmware: arm_scmi: Fix premature SCMI_XFER_FLAG_IS_RAW clearing in raw mode
  firmware: arm_scmi: Skip RAW initialization on failure
  include: trace: Fix inflight count helper on failed initialization
  firmware: arm_scmi: Account for failed debug initialization

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
7 weeks agoMerge tag 'ffa-fix-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep...
Arnd Bergmann [Thu, 23 Oct 2025 20:29:39 +0000 (22:29 +0200)] 
Merge tag 'ffa-fix-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes

Arm FF-A fix for v6.18

The FF-A driver was updated to support specification version 1.2 but omitted
support for the 16-byte implementation-defined (IMPDEF) field introduced in
FF-A v1.2 within the Endpoint Memory Access Descriptor (EMAD). This omission
breaks all memory interfaces.

This change updates the EMAD sizing and offset logic to correctly handle the
FF-A v1.2 layout while preserving backward compatibility with older versions.

* tag 'ffa-fix-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
  firmware: arm_ffa: Add support for IMPDEF value in the memory access descriptor

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
7 weeks agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Thu, 23 Oct 2025 19:26:47 +0000 (09:26 -1000)] 
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

 - Do not make a clean PTE dirty in pte_mkwrite()

   The Arm architecture, for backwards compatibility reasons (ARMv8.0
   before in-hardware dirty bit management - DBM), uses the PTE_RDONLY
   bit to mean !dirty while the PTE_WRITE bit means DBM enabled. The
   arm64 pte_mkwrite() simply clears the PTE_RDONLY bit and this
   inadvertently makes the PTE pte_hw_dirty(). Most places making a PTE
   writable also invoke pte_mkdirty() but do_swap_page() does not and we
   end up with dirty, freshly swapped in, writeable pages.

 - Do not warn if the destination page is already MTE-tagged in
   copy_highpage()

   In the majority of the cases, a destination page copied into is
   freshly allocated without the PG_mte_tagged flag set. However, the
   folio migration may be restarted if __folio_migrate_mapping() failed,
   triggering the benign WARN_ON_ONCE().

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: mte: Do not warn if the page is already tagged in copy_highpage()
  arm64, mm: avoid always making PTE dirty in pte_mkwrite()

7 weeks agoMerge tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 23 Oct 2025 17:03:18 +0000 (07:03 -1000)] 
Merge tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from can. Slim pickings, I'm guessing people haven't
  really started testing.

  Current release - new code bugs:

   - eth: mlx5e:
       - psp: avoid 'accel' NULL pointer dereference
       - skip PPHCR register query for FEC histogram if not supported

  Previous releases - regressions:

   - bonding: update the slave array for broadcast mode

   - rtnetlink: re-allow deleting FDB entries in user namespace

   - eth: dpaa2: fix the pointer passed to PTR_ALIGN on Tx path

  Previous releases - always broken:

   - can: drop skb on xmit if device is in listen-only mode

   - gro: clear skb_shinfo(skb)->hwtstamps in napi_reuse_skb()

   - eth: mlx5e
       - RX, fix generating skb from non-linear xdp_buff if program
         trims frags
       - make devcom init failures non-fatal, fix races with IPSec

  Misc:

   - some documentation formatting 'fixes'"

* tag 'net-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net/mlx5: Fix IPsec cleanup over MPV device
  net/mlx5: Refactor devcom to return NULL on failure
  net/mlx5e: Skip PPHCR register query if not supported by the device
  net/mlx5: Add PPHCR to PCAM supported registers mask
  virtio-net: zero unused hash fields
  net: phy: micrel: always set shared->phydev for LAN8814
  vsock: fix lock inversion in vsock_assign_transport()
  ovpn: use datagram_poll_queue for socket readiness in TCP
  espintcp: use datagram_poll_queue for socket readiness
  net: datagram: introduce datagram_poll_queue for custom receive queues
  net: bonding: fix possible peer notify event loss or dup issue
  net: hsr: prevent creation of HSR device with slaves from another netns
  sctp: avoid NULL dereference when chunk data buffer is missing
  ptp: ocp: Fix typo using index 1 instead of i in SMA initialization loop
  net: ravb: Ensure memory write completes before ringing TX doorbell
  net: ravb: Enforce descriptor type ordering
  net: hibmcge: select FIXED_PHY
  net: dlink: use dev_kfree_skb_any instead of dev_kfree_skb
  Documentation: networking: ax25: update the mailing list info.
  net: gro_cells: fix lock imbalance in gro_cells_receive()
  ...

7 weeks agoMerge tag 'acpi-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Thu, 23 Oct 2025 16:53:12 +0000 (06:53 -1000)] 
Merge tag 'acpi-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These fix a fallout of a recent ACPI properties management update and
  work around a compiler bug in ACPICA:

   - Fix a recent coding mistake causing __acpi_node_get_property_reference()
     arguments to be put in an incorrect order (Sunil V L)

   - Work around bogus -Wstringop-overread warning on LoongArch since
     GCC 11 in ACPICA (Xi Ruoyao)"

* tag 'acpi-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPICA: Work around bogus -Wstringop-overread warning since GCC 11
  ACPI: property: Fix argument order in __acpi_node_get_property_reference()

7 weeks agoMerge tag 'pm-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Thu, 23 Oct 2025 16:48:32 +0000 (06:48 -1000)] 
Merge tag 'pm-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These revert a cpuidle menu governor commit leading to a performance
  regression, fix an amd-pstate driver regression introduced recently,
  and fix new conditional guard definitions for runtime PM.

   - Add missing _RET == 0 condition to recently introduced conditional
     guard definitions for runtime PM (Rafael Wysocki)

   - Revert a cpuidle menu governor change that introduced a serious
     performance regression on Chromebooks with Intel Jasper Lake
     processors (Rafael Wysocki)

   - Fix an amd-pstate driver regression leading to EPP=0 after
     hibernation (Mario Limonciello)"

* tag 'pm-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: runtime: Fix conditional guard definitions
  Revert "cpuidle: menu: Avoid discarding useful information"
  cpufreq/amd-pstate: Fix a regression leading to EPP 0 after hibernate

7 weeks agoMerge tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Thu, 23 Oct 2025 16:44:43 +0000 (06:44 -1000)] 
Merge tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - in send, fix duplicated rmdir operations when using extrefs
   (hardlinks), receive can fail with ENOENT

 - fixup of error check when reading extent root in ref-verify and
   damaged roots are allowed by mount option (found by smatch)

 - fix freeing partially initialized fs info (found by syzkaller)

 - fix use-after-free when printing ref_tracking status of delayed
   inodes

* tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree()
  btrfs: fix delayed_node ref_tracker use after free
  btrfs: send: fix duplicated rmdir operations when using extrefs
  btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()

7 weeks agoarm64: mte: Do not warn if the page is already tagged in copy_highpage()
Catalin Marinas [Wed, 22 Oct 2025 10:09:14 +0000 (11:09 +0100)] 
arm64: mte: Do not warn if the page is already tagged in copy_highpage()

The arm64 copy_highpage() assumes that the destination page is newly
allocated and not MTE-tagged (PG_mte_tagged unset) and warns
accordingly. However, following commit 060913999d7a ("mm: migrate:
support poisoned recover from migrate folio"), folio_mc_copy() is called
before __folio_migrate_mapping(). If the latter fails (-EAGAIN), the
copy will be done again to the same destination page. Since
copy_highpage() already set the PG_mte_tagged flag, this second copy
will warn.

Replace the WARN_ON_ONCE(page already tagged) in the arm64
copy_highpage() with a comment.

Reported-by: syzbot+d1974fc28545a3e6218b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/68dda1ae.a00a0220.102ee.0065.GAE@google.com
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: stable@vger.kernel.org # 6.12.x
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
7 weeks agoslab: fix slab accounting imbalance due to defer_deactivate_slab()
Vlastimil Babka [Thu, 23 Oct 2025 12:01:07 +0000 (14:01 +0200)] 
slab: fix slab accounting imbalance due to defer_deactivate_slab()

Since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and
kfree_nolock().") there's a possibility in alloc_single_from_new_slab()
that we discard the newly allocated slab if we can't spin and we fail to
trylock. As a result we don't perform inc_slabs_node() later in the
function. Instead we perform a deferred deactivate_slab() which can
either put the unacounted slab on partial list, or discard it
immediately while performing dec_slabs_node(). Either way will cause an
accounting imbalance.

Fix this by not marking the slab as frozen, and using free_slab()
instead of deactivate_slab() for non-frozen slabs in
free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible
case. By not using discard_slab() we avoid dec_slabs_node().

Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Link: https://patch.msgid.link/20251023-fix-slab-accounting-v2-1-0e62d50986ea@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
7 weeks agoMerge branch 'mlx5-misc-fixes-2025-10-22'
Jakub Kicinski [Thu, 23 Oct 2025 14:14:38 +0000 (07:14 -0700)] 
Merge branch 'mlx5-misc-fixes-2025-10-22'

Tariq Toukan says:

====================
mlx5 misc fixes 2025-10-22

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/1761136182-918470-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet/mlx5: Fix IPsec cleanup over MPV device
Patrisious Haddad [Wed, 22 Oct 2025 12:29:42 +0000 (15:29 +0300)] 
net/mlx5: Fix IPsec cleanup over MPV device

When we do mlx5e_detach_netdev() we eventually disable blocking events
notifier, among those events are IPsec MPV events from IB to core.

So before disabling those blocking events, make sure to also unregister
the devcom device and mark all this device operations as complete,
in order to prevent the other device from using invalid netdev
during future devcom events which could cause the trace below.

BUG: kernel NULL pointer dereference, address: 0000000000000010
PGD 146427067 P4D 146427067 PUD 146488067 PMD 0
Oops: Oops: 0000 [#1] SMP
CPU: 1 UID: 0 PID: 7735 Comm: devlink Tainted: GW 6.12.0-rc6_for_upstream_min_debug_2024_11_08_00_46 #1
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5_devcom_comp_set_ready+0x5/0x40 [mlx5_core]
Code: 00 01 48 83 05 23 32 1e 00 01 41 b8 ed ff ff ff e9 60 ff ff ff 48 83 05 00 32 1e 00 01 eb e3 66 0f 1f 44 00 00 0f 1f 44 00 00 <48> 8b 47 10 48 83 05 5f 32 1e 00 01 48 8b 50 40 48 85 d2 74 05 40
RSP: 0018:ffff88811a5c35f8 EFLAGS: 00010206
RAX: ffff888106e8ab80 RBX: ffff888107d7e200 RCX: ffff88810d6f0a00
RDX: ffff88810d6f0a00 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88811a17e620 R08: 0000000000000040 R09: 0000000000000000
R10: ffff88811a5c3618 R11: 0000000de85d51bd R12: ffff88811a17e600
R13: ffff88810d6f0a00 R14: 0000000000000000 R15: ffff8881034bda80
FS:  00007f27bdf89180(0000) GS:ffff88852c880000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000010f159005 CR4: 0000000000372eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __die+0x20/0x60
 ? page_fault_oops+0x150/0x3e0
 ? exc_page_fault+0x74/0x130
 ? asm_exc_page_fault+0x22/0x30
 ? mlx5_devcom_comp_set_ready+0x5/0x40 [mlx5_core]
 mlx5e_devcom_event_mpv+0x42/0x60 [mlx5_core]
 mlx5_devcom_send_event+0x8c/0x170 [mlx5_core]
 blocking_event+0x17b/0x230 [mlx5_core]
 notifier_call_chain+0x35/0xa0
 blocking_notifier_call_chain+0x3d/0x60
 mlx5_blocking_notifier_call_chain+0x22/0x30 [mlx5_core]
 mlx5_core_mp_event_replay+0x12/0x20 [mlx5_core]
 mlx5_ib_bind_slave_port+0x228/0x2c0 [mlx5_ib]
 mlx5_ib_stage_init_init+0x664/0x9d0 [mlx5_ib]
 ? idr_alloc_cyclic+0x50/0xb0
 ? __kmalloc_cache_noprof+0x167/0x340
 ? __kmalloc_noprof+0x1a7/0x430
 __mlx5_ib_add+0x34/0xd0 [mlx5_ib]
 mlx5r_probe+0xe9/0x310 [mlx5_ib]
 ? kernfs_add_one+0x107/0x150
 ? __mlx5_ib_add+0xd0/0xd0 [mlx5_ib]
 auxiliary_bus_probe+0x3e/0x90
 really_probe+0xc5/0x3a0
 ? driver_probe_device+0x90/0x90
 __driver_probe_device+0x80/0x160
 driver_probe_device+0x1e/0x90
 __device_attach_driver+0x7d/0x100
 bus_for_each_drv+0x80/0xd0
 __device_attach+0xbc/0x1f0
 bus_probe_device+0x86/0xa0
 device_add+0x62d/0x830
 __auxiliary_device_add+0x3b/0xa0
 ? auxiliary_device_init+0x41/0x90
 add_adev+0xd1/0x150 [mlx5_core]
 mlx5_rescan_drivers_locked+0x21c/0x300 [mlx5_core]
 esw_mode_change+0x6c/0xc0 [mlx5_core]
 mlx5_devlink_eswitch_mode_set+0x21e/0x640 [mlx5_core]
 devlink_nl_eswitch_set_doit+0x60/0xe0
 genl_family_rcv_msg_doit+0xd0/0x120
 genl_rcv_msg+0x180/0x2b0
 ? devlink_get_from_attrs_lock+0x170/0x170
 ? devlink_nl_eswitch_get_doit+0x290/0x290
 ? devlink_nl_pre_doit_port_optional+0x50/0x50
 ? genl_family_rcv_msg_dumpit+0xf0/0xf0
 netlink_rcv_skb+0x54/0x100
 genl_rcv+0x24/0x40
 netlink_unicast+0x1fc/0x2d0
 netlink_sendmsg+0x1e4/0x410
 __sock_sendmsg+0x38/0x60
 ? sockfd_lookup_light+0x12/0x60
 __sys_sendto+0x105/0x160
 ? __sys_recvmsg+0x4e/0x90
 __x64_sys_sendto+0x20/0x30
 do_syscall_64+0x4c/0x100
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f27bc91b13a
Code: bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 05 fa 96 2c 00 45 89 c9 4c 63 d1 48 63 ff 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 f3 c3 0f 1f 40 00 41 55 41 54 4d 89 c5 55
RSP: 002b:00007fff369557e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000009c54b10 RCX: 00007f27bc91b13a
RDX: 0000000000000038 RSI: 0000000009c54b10 RDI: 0000000000000006
RBP: 0000000009c54920 R08: 00007f27bd0030e0 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
 </TASK>
Modules linked in: mlx5_vdpa vringh vhost_iotlb vdpa xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi ib_umad scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_fwctl mlx5_ib ib_uverbs ib_core mlx5_core
CR2: 0000000000000010

Fixes: 82f9378c443c ("net/mlx5: Handle IPsec steering upon master unbind/bind")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761136182-918470-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet/mlx5: Refactor devcom to return NULL on failure
Patrisious Haddad [Wed, 22 Oct 2025 12:29:41 +0000 (15:29 +0300)] 
net/mlx5: Refactor devcom to return NULL on failure

Devcom device and component registration isn't always critical to the
functionality of the caller, hence the registration can fail and we can
continue working with an ERR_PTR value saved inside a variable.

In order to avoid that make sure all devcom failures return NULL.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761136182-918470-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet/mlx5e: Skip PPHCR register query if not supported by the device
Alexei Lazar [Wed, 22 Oct 2025 12:29:40 +0000 (15:29 +0300)] 
net/mlx5e: Skip PPHCR register query if not supported by the device

Check the PCAM supported registers mask before querying the PPHCR
register, as it is not supported in older devices.

Fixes: 44907e7c8fd0 ("net/mlx5e: Add logic to read RS-FEC histogram bin ranges from PPHCR")
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761136182-918470-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet/mlx5: Add PPHCR to PCAM supported registers mask
Alexei Lazar [Wed, 22 Oct 2025 12:29:39 +0000 (15:29 +0300)] 
net/mlx5: Add PPHCR to PCAM supported registers mask

Add the PPHCR bit to the port_access_reg_cap_mask field of PCAM
register to indicate that the device supports the PPHCR register
and the RS-FEC histogram feature.

Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761136182-918470-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agovirtio-net: zero unused hash fields
Jason Wang [Wed, 22 Oct 2025 03:44:21 +0000 (11:44 +0800)] 
virtio-net: zero unused hash fields

When GSO tunnel is negotiated virtio_net_hdr_tnl_from_skb() tries to
initialize the tunnel metadata but forget to zero unused rxhash
fields. This may leak information to another side. Fixing this by
zeroing the unused hash fields.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: a2fb4bc4e2a6a ("net: implement virtio helpers to handle UDP GSO tunneling")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20251022034421.70244-1-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phy: micrel: always set shared->phydev for LAN8814
Robert Marko [Tue, 21 Oct 2025 13:20:26 +0000 (15:20 +0200)] 
net: phy: micrel: always set shared->phydev for LAN8814

Currently, during the LAN8814 PTP probe shared->phydev is only set if PTP
clock gets actually set, otherwise the function will return before setting
it.

This is an issue as shared->phydev is unconditionally being used when IRQ
is being handled, especially in lan8814_gpio_process_cap and since it was
not set it will cause a NULL pointer exception and crash the kernel.

So, simply always set shared->phydev to avoid the NULL pointer exception.

Fixes: b3f1a08fcf0d ("net: phy: micrel: Add support for PTP_PF_EXTTS for lan8814")
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20251021132034.983936-1-robert.marko@sartura.hr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agovsock: fix lock inversion in vsock_assign_transport()
Stefano Garzarella [Tue, 21 Oct 2025 12:17:18 +0000 (14:17 +0200)] 
vsock: fix lock inversion in vsock_assign_transport()

Syzbot reported a potential lock inversion deadlock between
vsock_register_mutex and sk_lock-AF_VSOCK when vsock_linger() is called.

The issue was introduced by commit 687aa0c5581b ("vsock: Fix
transport_* TOCTOU") which added vsock_register_mutex locking in
vsock_assign_transport() around the transport->release() call, that can
call vsock_linger(). vsock_assign_transport() can be called with sk_lock
held. vsock_linger() calls sk_wait_event() that temporarily releases and
re-acquires sk_lock. During this window, if another thread hold
vsock_register_mutex while trying to acquire sk_lock, a circular
dependency is created.

Fix this by releasing vsock_register_mutex before calling
transport->release() and vsock_deassign_transport(). This is safe
because we don't need to hold vsock_register_mutex while releasing the
old transport, and we ensure the new transport won't disappear by
obtaining a module reference first via try_module_get().

Reported-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com
Tested-by: syzbot+10e35716f8e4929681fa@syzkaller.appspotmail.com
Fixes: 687aa0c5581b ("vsock: Fix transport_* TOCTOU")
Cc: mhal@rbox.co
Cc: stable@vger.kernel.org
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20251021121718.137668-1-sgarzare@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoMerge branch 'fix-poll-behaviour-for-tcp-based-tunnel-protocols'
Paolo Abeni [Thu, 23 Oct 2025 13:46:10 +0000 (15:46 +0200)] 
Merge branch 'fix-poll-behaviour-for-tcp-based-tunnel-protocols'

Ralf Lici says:

====================
fix poll behaviour for TCP-based tunnel protocols

This patch series introduces a polling function for datagram-style
sockets that operates on custom skb queues, and updates ovpn (the
OpenVPN data-channel offload module) and espintcp (the TCP Encapsulation
of IKE and IPsec Packets implementation) to use it accordingly.

Protocols like the aforementioned one decapsulate packets received over
TCP and deliver userspace-bound data through a separate skb queue, not
the standard sk_receive_queue. Previously, both relied on
datagram_poll(), which would signal readiness based on non-userspace
packets, leading to misleading poll results and unnecessary recv
attempts in userspace.

Patch 1 introduces datagram_poll_queue(), a variant of datagram_poll()
that accepts an explicit receive queue. This builds on the approach
introduced in commit b50b058, which extended other skb-related functions
to support custom queues. Patch 2 and 3 update espintcp_poll() and
ovpn_tcp_poll() respectively to use this helper, ensuring readiness is
only signaled when userspace data is available.

Each patch is self-contained and the ovpn one includes rationale and
lifecycle enforcement where appropriate.
====================

Link: https://patch.msgid.link/20251021100942.195010-1-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoovpn: use datagram_poll_queue for socket readiness in TCP
Ralf Lici [Tue, 21 Oct 2025 10:09:42 +0000 (12:09 +0200)] 
ovpn: use datagram_poll_queue for socket readiness in TCP

openvpn TCP encapsulation uses a custom queue to deliver packets to
userspace. Currently it relies on datagram_poll, which checks
sk_receive_queue, leading to false readiness signals when that queue
contains non-userspace packets.

Switch ovpn_tcp_poll to use datagram_poll_queue with the peer's
user_queue, ensuring poll only signals readiness when userspace data is
actually available. Also refactor ovpn_tcp_poll in order to enforce the
assumption we can make on the lifetime of ovpn_sock and peer.

Fixes: 11851cbd60ea ("ovpn: implement TCP transport")
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251021100942.195010-4-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoespintcp: use datagram_poll_queue for socket readiness
Ralf Lici [Tue, 21 Oct 2025 10:09:41 +0000 (12:09 +0200)] 
espintcp: use datagram_poll_queue for socket readiness

espintcp uses a custom queue (ike_queue) to deliver packets to
userspace. The polling logic relies on datagram_poll, which checks
sk_receive_queue, which can lead to false readiness signals when that
queue contains non-userspace packets.

Switch espintcp_poll to use datagram_poll_queue with ike_queue, ensuring
poll only signals readiness when userspace data is actually available.

Fixes: e27cca96cd68 ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251021100942.195010-3-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agonet: datagram: introduce datagram_poll_queue for custom receive queues
Ralf Lici [Tue, 21 Oct 2025 10:09:40 +0000 (12:09 +0200)] 
net: datagram: introduce datagram_poll_queue for custom receive queues

Some protocols using TCP encapsulation (e.g., espintcp, openvpn) deliver
userspace-bound packets through a custom skb queue rather than the
standard sk_receive_queue.

Introduce datagram_poll_queue that accepts an explicit receive queue,
and convert datagram_poll into a wrapper around datagram_poll_queue.
This allows protocols with custom skb queues to reuse the core polling
logic without relying on sk_receive_queue.

Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Antonio Quartulli <antonio@openvpn.net>
Signed-off-by: Ralf Lici <ralf@mandelbit.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20251021100942.195010-2-ralf@mandelbit.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoio_uring: correct __must_hold annotation in io_install_fixed_file
Alok Tiwari [Thu, 23 Oct 2025 11:55:24 +0000 (04:55 -0700)] 
io_uring: correct __must_hold annotation in io_install_fixed_file

The __must_hold annotation references &req->ctx->uring_lock, but req
is not in scope in io_install_fixed_file. This change updates the
annotation to reference the correct ctx->uring_lock.
improving code clarity.

Fixes: f110ed8498af ("io_uring: split out fixed file installation and removal")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agogpio: ljca: Fix duplicated IRQ mapping
Haotian Zhang [Thu, 23 Oct 2025 07:02:30 +0000 (15:02 +0800)] 
gpio: ljca: Fix duplicated IRQ mapping

The generic_handle_domain_irq() function resolves the hardware IRQ
internally. The driver performed a duplicative mapping by calling
irq_find_mapping() first, which could lead to an RCU stall.

Delete the redundant irq_find_mapping() call and pass the hardware IRQ
directly to generic_handle_domain_irq().

Fixes: c5a4b6fd31e8 ("gpio: Add support for Intel LJCA USB GPIO driver")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Link: https://lore.kernel.org/r/20251023070231.1305-1-vulab@iscas.ac.cn
[Bartosz: remove unused variable]
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
7 weeks agoMerge branch 'acpi-property'
Rafael J. Wysocki [Thu, 23 Oct 2025 11:25:02 +0000 (13:25 +0200)] 
Merge branch 'acpi-property'

Merge an ACPI device properties handling change fixing the order of
__acpi_node_get_property_reference() arguments broken by a recent
update (Sunil V L)

* 'acpi-property':
  ACPI: property: Fix argument order in __acpi_node_get_property_reference()

7 weeks agoMerge branches 'pm-cpuidle' and 'pm-cpufreq'
Rafael J. Wysocki [Thu, 23 Oct 2025 11:12:24 +0000 (13:12 +0200)] 
Merge branches 'pm-cpuidle' and 'pm-cpufreq'

Merge cpuidle and cpufreq fixes for 6.18-rc3:

 - Revert a cpuidle menu governor change that introduced a serious
   performance regression on Chromebooks with Intel Jasper Lake
   processors (Rafael Wysocki)

 - Fix an amd-pstate driver regression leading to EPP=0 after
   hibernation (Mario Limonciello)

* pm-cpuidle:
  Revert "cpuidle: menu: Avoid discarding useful information"

* pm-cpufreq:
  cpufreq/amd-pstate: Fix a regression leading to EPP 0 after hibernate

7 weeks agonet: bonding: fix possible peer notify event loss or dup issue
Tonghao Zhang [Tue, 21 Oct 2025 05:09:33 +0000 (13:09 +0800)] 
net: bonding: fix possible peer notify event loss or dup issue

If the send_peer_notif counter and the peer event notify are not synchronized.
It may cause problems such as the loss or dup of peer notify event.

Before this patch:
- If should_notify_peers is true and the lock for send_peer_notif-- fails, peer
  event may be sent again in next mii_monitor loop, because should_notify_peers
  is still true.
- If should_notify_peers is true and the lock for send_peer_notif-- succeeded,
  but the lock for peer event fails, the peer event will be lost.

This patch locks the RTNL for send_peer_notif, events, and commit simultaneously.

Fixes: 07a4ddec3ce9 ("bonding: add an option to specify a delay between peer notifications")
Cc: Jay Vosburgh <jv@jvosburgh.net>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Vincent Bernat <vincent@bernat.ch>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Link: https://patch.msgid.link/20251021050933.46412-1-tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoMerge tag 'intel-gpio-v6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy...
Bartosz Golaszewski [Thu, 23 Oct 2025 08:06:59 +0000 (10:06 +0200)] 
Merge tag 'intel-gpio-v6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-gpio-intel into gpio/for-current

intel-gpio fixes for v6.18-1

* Make set debounce errors non-fatal in GPIO ACPI case
* Use human readable error when printing a message in GPIO ACPI code

7 weeks agogpiolib: acpi: Use %pe when passing an error pointer to dev_err()
Andy Shevchenko [Thu, 23 Oct 2025 06:39:58 +0000 (08:39 +0200)] 
gpiolib: acpi: Use %pe when passing an error pointer to dev_err()

One of the coccinelle recipe suggests to use %pe when we deal with
an error pointer. Do it so.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/r/202510231350.calxvXIm-lkp@intel.com/
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
7 weeks agogpiolib: acpi: Make set debounce errors non fatal
Hans de Goede [Wed, 22 Oct 2025 13:37:15 +0000 (15:37 +0200)] 
gpiolib: acpi: Make set debounce errors non fatal

Commit 16c07342b542 ("gpiolib: acpi: Program debounce when finding GPIO")
adds a gpio_set_debounce_timeout() call to acpi_find_gpio() and makes
acpi_find_gpio() fail if this fails.

But gpio_set_debounce_timeout() failing is a somewhat normal occurrence,
since not all debounce values are supported on all GPIO/pinctrl chips.

Making this an error for example break getting the card-detect GPIO for
the micro-sd slot found on many Bay Trail tablets, breaking support for
the micro-sd slot on these tablets.

acpi_request_own_gpiod() already treats gpio_set_debounce_timeout()
failures as non-fatal, just warning about them.

Add a acpi_gpio_set_debounce_timeout() helper which wraps
gpio_set_debounce_timeout() and warns on failures and replace both existing
gpio_set_debounce_timeout() calls with the helper.

Since the helper only warns on failures this fixes the card-detect issue.

Fixes: 16c07342b542 ("gpiolib: acpi: Program debounce when finding GPIO")
Cc: stable@vger.kernel.org
Cc: Mario Limonciello <superm1@kernel.org>
Signed-off-by: Hans de Goede <hansg@kernel.org>
Acked-by: Andy Shevchenko <andy@kernel.org>
Link: https://lore.kernel.org/stable/20250920201200.20611-1-hansg%40kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
7 weeks agonet: hsr: prevent creation of HSR device with slaves from another netns
Fernando Fernandez Mancera [Mon, 20 Oct 2025 13:55:33 +0000 (15:55 +0200)] 
net: hsr: prevent creation of HSR device with slaves from another netns

HSR/PRP driver does not handle correctly having slaves/interlink devices
in a different net namespace. Currently, it is possible to create a HSR
link in a different net namespace than the slaves/interlink with the
following command:

 ip link add hsr0 netns hsr-ns type hsr slave1 eth1 slave2 eth2

As there is no use-case on supporting this scenario, enforce that HSR
device link matches netns defined by IFLA_LINK_NETNSID.

The iproute2 command mentioned above will throw the following error:

 Error: hsr: HSR slaves/interlink must be on the same net namespace than HSR link.

Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20251020135533.9373-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agosctp: avoid NULL dereference when chunk data buffer is missing
Alexey Simakov [Tue, 21 Oct 2025 13:00:36 +0000 (16:00 +0300)] 
sctp: avoid NULL dereference when chunk data buffer is missing

chunk->skb pointer is dereferenced in the if-block where it's supposed
to be NULL only.

chunk->skb can only be NULL if chunk->head_skb is not. Check for frag_list
instead and do it just before replacing chunk->skb. We're sure that
otherwise chunk->skb is non-NULL because of outer if() condition.

Fixes: 90017accff61 ("sctp: Add GSO support")
Signed-off-by: Alexey Simakov <bigalex934@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://patch.msgid.link/20251021130034.6333-1-bigalex934@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoptp: ocp: Fix typo using index 1 instead of i in SMA initialization loop
Jiasheng Jiang [Tue, 21 Oct 2025 18:24:56 +0000 (18:24 +0000)] 
ptp: ocp: Fix typo using index 1 instead of i in SMA initialization loop

In ptp_ocp_sma_fb_init(), the code mistakenly used bp->sma[1]
instead of bp->sma[i] inside a for-loop, which caused only SMA[1]
to have its DIRECTION_CAN_CHANGE capability cleared. This led to
inconsistent capability flags across SMA pins.

Fixes: 09eeb3aecc6c ("ptp_ocp: implement DPLL ops")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251021182456.9729-1-jiashengjiangcool@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-ravb-fix-soc-specific-configuration-and-descriptor-handling-issues'
Jakub Kicinski [Thu, 23 Oct 2025 01:08:06 +0000 (18:08 -0700)] 
Merge branch 'net-ravb-fix-soc-specific-configuration-and-descriptor-handling-issues'

Lad Prabhakar says:

====================
net: ravb: Fix SoC-specific configuration and descriptor handling issues [part]

This series addresses several issues in the Renesas Ethernet AVB (ravb)
driver related descriptor ordering.

A potential ordering hazard in descriptor setup could cause
the DMA engine to start prematurely, leading to TX stalls on some
platforms.

The series includes the following changes:

Enforce descriptor type ordering to prevent early DMA start
Ensure proper write ordering of TX descriptor type fields to prevent the
DMA engine from observing an incomplete descriptor chain. This fixes
observed TX stalls on RZ/G2L platforms running RT kernels.

Tested on R/G1x Gen2, RZ/G2x Gen3 and RZ/G2L family hardware.
====================

Link: https://patch.msgid.link/20251017151830.171062-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ravb: Ensure memory write completes before ringing TX doorbell
Lad Prabhakar [Fri, 17 Oct 2025 15:18:30 +0000 (16:18 +0100)] 
net: ravb: Ensure memory write completes before ringing TX doorbell

Add a final dma_wmb() barrier before triggering the transmit request
(TCCR_TSRQ) to ensure all descriptor and buffer writes are visible to
the DMA engine.

According to the hardware manual, a read-back operation is required
before writing to the doorbell register to guarantee completion of
previous writes. Instead of performing a dummy read, a dma_wmb() is
used to both enforce the same ordering semantics on the CPU side and
also to ensure completion of writes.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Cc: stable@vger.kernel.org
Co-developed-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com>
Signed-off-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com>
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251017151830.171062-5-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ravb: Enforce descriptor type ordering
Lad Prabhakar [Fri, 17 Oct 2025 15:18:29 +0000 (16:18 +0100)] 
net: ravb: Enforce descriptor type ordering

Ensure the TX descriptor type fields are published in a safe order so the
DMA engine never begins processing a descriptor chain before all descriptor
fields are fully initialised.

For multi-descriptor transmits the driver writes DT_FEND into the last
descriptor and DT_FSTART into the first. The DMA engine begins processing
when it observes DT_FSTART. Move the dma_wmb() barrier so it executes
immediately after DT_FEND and immediately before writing DT_FSTART
(and before DT_FSINGLE in the single-descriptor case). This guarantees
that all prior CPU writes to the descriptor memory are visible to the
device before DT_FSTART is seen.

This avoids a situation where compiler/CPU reordering could publish
DT_FSTART ahead of DT_FEND or other descriptor fields, allowing the DMA to
start on a partially initialised chain and causing corrupted transmissions
or TX timeouts. Such a failure was observed on RZ/G2L with an RT kernel as
transmit queue timeouts and device resets.

Fixes: 2f45d1902acf ("ravb: minimize TX data copying")
Cc: stable@vger.kernel.org
Co-developed-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com>
Signed-off-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com>
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251017151830.171062-4-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Thu, 23 Oct 2025 01:00:34 +0000 (15:00 -1000)] 
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "All driver fixes. The big change is the storvsc one to rejig the
  hyper-v channel handling to be more efficient for SMP virtual
  machines"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: phy: dt-bindings: Add QMP UFS PHY compatible for Kaanapali
  scsi: ufs: qcom: dt-bindings: Document the Kaanapali UFS controller
  scsi: libfc: Prevent integer overflow in fc_fcp_recv_data()
  scsi: qla4xxx: Fix typos in comments
  scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU

7 weeks agoMerge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Thu, 23 Oct 2025 00:57:35 +0000 (14:57 -1000)] 
Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "17 hotfixes. 12 are cc:stable and 14 are for MM.

  There's a two-patch DAMON series from SeongJae Park which addresses a
  missed check and possible memory leak. Apart from that it's all
  singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  csky: abiv2: adapt to new folio flags field
  mm/damon/core: use damos_commit_quota_goal() for new goal commit
  mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme
  hugetlbfs: move lock assertions after early returns in huge_pmd_unshare()
  vmw_balloon: indicate success when effectively deflating during migration
  mm/damon/core: fix list_add_tail() call on damon_call()
  mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap
  mm: prevent poison consumption when splitting THP
  ocfs2: clear extent cache after moving/defragmenting extents
  mm: don't spin in add_stack_record when gfp flags don't allow
  dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC
  mm/damon/sysfs: dealloc commit test ctx always
  mm/damon/sysfs: catch commit test ctx alloc failure
  hung_task: fix warnings caused by unaligned lock pointers

7 weeks agoio_uring zcrx: add MAINTAINERS entry
David Wei [Tue, 21 Oct 2025 20:29:44 +0000 (13:29 -0700)] 
io_uring zcrx: add MAINTAINERS entry

Same as [1] but also with netdev@ as an additional mailing list.
io_uring zero copy receive is of particular interest to netdev
participants too, given its tight integration to netdev core.

With this updated entry, folks running get_maintainer.pl on patches that
touch io_uring/zcrx.* will know to send it to netdev@ as well.

Note that this doesn't mean all changes require explicit acks from
netdev; this is purely for wider visibility and for other contributors
to know where to send patches.

[1]: https://lore.kernel.org/io-uring/989528e611b51d71fb712691ebfb76d2059ba561.1755461246.git.asml.silence@gmail.com/

Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
[axboe: use correct io_uring tree URL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoio_uring: Fix code indentation error
Ranganath V N [Tue, 21 Oct 2025 17:29:30 +0000 (22:59 +0530)] 
io_uring: Fix code indentation error

Fix the indentation to ensure consistent code style and improve
readability and to fix the errors:
ERROR: code indent should use tabs where possible
+               return io_net_import_vec(req, kmsg, sr->buf, sr->len, ITER_SOURCE);$

ERROR: code indent should use tabs where possible
+^I^I^I           struct io_big_cqe *big_cqe)$

Tested by running the /scripts/checkpatch.pl

Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoio_uring/sqpoll: be smarter on when to update the stime usage
Jens Axboe [Tue, 21 Oct 2025 17:44:39 +0000 (11:44 -0600)] 
io_uring/sqpoll: be smarter on when to update the stime usage

The current approach is a bit naive, and hence calls the time querying
way too often. Only start the "doing work" timer when there's actual
work to do, and then use that information to terminate (and account) the
work time once done. This greatly reduces the frequency of these calls,
when they cannot have changed anyway.

Running a basic random reader that is setup to use SQPOLL, a profile
before this change shows these as the top cycle consumers:

+   32.60%  iou-sqp-1074  [kernel.kallsyms]  [k] thread_group_cputime_adjusted
+   19.97%  iou-sqp-1074  [kernel.kallsyms]  [k] thread_group_cputime
+   12.20%  io_uring      io_uring           [.] submitter_uring_fn
+    4.13%  iou-sqp-1074  [kernel.kallsyms]  [k] getrusage
+    2.45%  iou-sqp-1074  [kernel.kallsyms]  [k] io_submit_sqes
+    2.18%  iou-sqp-1074  [kernel.kallsyms]  [k] __pi_memset_generic
+    2.09%  iou-sqp-1074  [kernel.kallsyms]  [k] cputime_adjust

and after this change, top of profile looks as follows:

+   36.23%  io_uring     io_uring           [.] submitter_uring_fn
+   23.26%  iou-sqp-819  [kernel.kallsyms]  [k] io_sq_thread
+   10.14%  iou-sqp-819  [kernel.kallsyms]  [k] io_sq_tw
+    6.52%  iou-sqp-819  [kernel.kallsyms]  [k] tctx_task_work_run
+    4.82%  iou-sqp-819  [kernel.kallsyms]  [k] nvme_submit_cmds.part.0
+    2.91%  iou-sqp-819  [kernel.kallsyms]  [k] io_submit_sqes
[...]
     0.02%  iou-sqp-819  [kernel.kallsyms]  [k] cputime_adjust

where it's spending the cycles on things that actually matter.

Reported-by: Fengnan Chang <changfengnan@bytedance.com>
Cc: stable@vger.kernel.org
Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoio_uring/sqpoll: switch away from getrusage() for CPU accounting
Jens Axboe [Tue, 21 Oct 2025 13:16:08 +0000 (07:16 -0600)] 
io_uring/sqpoll: switch away from getrusage() for CPU accounting

getrusage() does a lot more than what the SQPOLL accounting needs, the
latter only cares about (and uses) the stime. Rather than do a full
RUSAGE_SELF summation, just query the used stime instead.

Cc: stable@vger.kernel.org
Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: require LBA dma_alignment when using PI
Christoph Hellwig [Wed, 22 Oct 2025 08:33:31 +0000 (10:33 +0200)] 
block: require LBA dma_alignment when using PI

The block layer PI generation / verification code expects the bio_vecs
to have at least LBA size (or more correctly integrity internal)
granularity.  With the direct I/O alignment relaxation in 2022, user
space can now feed bios with less alignment than that, leading to
scribbling outside the PI buffers.  Apparently this wasn't noticed so far
because none of the tests generate such buffers, but since 851c4c96db00
("xfs: implement XFS_IOC_DIOINFO in terms of vfs_getattr"), xfstests
generic/013 by default generates such I/O now that the relaxed alignment
is advertised by the XFS_IOC_DIOINFO ioctl.

Fix this by increasing the required alignment when using PI, although
handling arbitrary alignment in the long run would be even nicer.

Fixes: bf8d08532bc1 ("iomap: add support for dma aligned direct-io")
Fixes: b1a000d3b8ec ("block: relax direct io memory alignment")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoMerge tag 'platform-drivers-x86-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 22 Oct 2025 15:17:32 +0000 (05:17 -1000)] 
Merge tag 'platform-drivers-x86-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Ilpo Järvinen:

 - alienware-wmi-wmax:
     - Fix NULL pointer dereference in sleep handlers
     - Add AWCC support to Dell G15 5530

 - mellanox: mlxbf-pmc: add sysfs_attr_init() to count_clock init

* tag 'platform-drivers-x86-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: alienware-wmi-wmax: Add AWCC support to Dell G15 5530
  MAINTAINERS: add Denis Benato as maintainer for asus notebooks
  platform/mellanox: mlxbf-pmc: add sysfs_attr_init() to count_clock init
  platform/x86: alienware-wmi-wmax: Fix NULL pointer dereference in sleep handlers

7 weeks agoMerge tag 'erofs-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 22 Oct 2025 14:58:00 +0000 (04:58 -1000)] 
Merge tag 'erofs-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "Just three small fixes to address fuzzed images in relatively new
  features, as reported by Robert.

   - Hardening against fuzzed encoded extents

   - Fix infinite loops due to crafted subpage compact indexes

   - Improve z_erofs_extent_lookback()"

* tag 'erofs-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: consolidate z_erofs_extent_lookback()
  erofs: avoid infinite loops due to corrupted subpage compact indexes
  erofs: fix crafted invalid cases for encoded extents

7 weeks agoMerge tag '9p-for-6.18-rc3-v2' of https://github.com/martinetd/linux
Linus Torvalds [Wed, 22 Oct 2025 14:53:34 +0000 (04:53 -1000)] 
Merge tag '9p-for-6.18-rc3-v2' of https://github.com/martinetd/linux

Pull 9pfs fix from Dominique Martinet:
 "Fix 9p cache=mmap regression by revert

  This reverts the problematic commit instead of trying to fix it in a
  rush"

* tag '9p-for-6.18-rc3-v2' of https://github.com/martinetd/linux:
  Revert "fs/9p: Refresh metadata in d_revalidate for uncached mode too"

7 weeks agoof/irq: Fix OF node refcount in of_msi_get_domain()
Lorenzo Pieralisi [Tue, 21 Oct 2025 12:41:00 +0000 (14:41 +0200)] 
of/irq: Fix OF node refcount in of_msi_get_domain()

In of_msi_get_domain() if the iterator loop stops early because an
irq_domain match is detected, an of_node_put() on the iterator node is
needed to keep the OF node refcount in sync.

Add it.

Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Cc: Rob Herring <robh@kernel.org>
Link: https://patch.msgid.link/20251021124103.198419-3-lpieralisi@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
7 weeks agoof/irq: Add msi-parent check to of_msi_xlate()
Lorenzo Pieralisi [Tue, 21 Oct 2025 12:40:59 +0000 (14:40 +0200)] 
of/irq: Add msi-parent check to of_msi_xlate()

In some legacy platforms the MSI controller for a PCI host bridge is
identified by an msi-parent property whose phandle points at an MSI
controller node with no #msi-cells property, that implicitly
means #msi-cells == 0.

For such platforms, mapping a device ID and retrieving the MSI controller
node becomes simply a matter of checking whether in the device hierarchy
there is an msi-parent property pointing at an MSI controller node with
such characteristics.

Add a helper function to of_msi_xlate() to check the msi-parent property in
addition to msi-map and retrieve the MSI controller node (with a 1:1 ID
deviceID-IN<->deviceID-OUT  mapping) to provide support for deviceID
mapping and MSI controller node retrieval for such platforms.

Fixes: 57d72196dfc8 ("irqchip/gic-v5: Add GICv5 ITS support")
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Cc: Sascha Bischoff <sascha.bischoff@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20251021124103.198419-2-lpieralisi@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>