git.ipfire.org Git - thirdparty/kernel/linux.git/log

sockmap: Fix use-after-free in udp_bpf_recvmsg()

syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0]

sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock,
but its processing is lockless.

Thus, sk_msg_recvmsg() must be serialised by callers, otherwise
multiple threads could touch the same sk_msg.

For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock.

Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited
commit removed it.

Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg().

Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not
an option due to copy_page_to_iter() in sk_msg_recvmsg().

[0]:
BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428
Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020

CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xba/0x230 mm/kasan/report.c:482
kasan_report+0x117/0x150 mm/kasan/report.c:595
sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428
udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84
inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891
sock_recvmsg_nosec net/socket.c:1078 [inline]
sock_recvmsg+0x1a8/0x270 net/socket.c:1100
____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812
___sys_recvmsg+0x215/0x590 net/socket.c:2854
do_recvmmsg+0x334/0x800 net/socket.c:2949
__sys_recvmmsg net/socket.c:3023 [inline]
__do_sys_recvmmsg net/socket.c:3046 [inline]
__se_sys_recvmmsg net/socket.c:3039 [inline]
__x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb319f9aeb9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb31ad97028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00007fb31a216090 RCX: 00007fb319f9aeb9
RDX: 0000000000000001 RSI: 0000200000000400 RDI: 0000000000000004
RBP: 00007fb31a008c1f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000040000021 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb31a216128 R14: 00007fb31a216090 R15: 00007ffe21dd0a98
</TASK>

Allocated by task 6019:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780
kmalloc_noprof include/linux/slab.h:957 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
alloc_sk_msg net/core/skmsg.c:510 [inline]
sk_psock_skb_ingress_self+0x60/0x350 net/core/skmsg.c:612
sk_psock_verdict_apply net/core/skmsg.c:1038 [inline]
sk_psock_verdict_recv+0x7d9/0x8d0 net/core/skmsg.c:1236
udp_read_skb+0x73e/0x7e0 net/ipv4/udp.c:2045
sk_psock_verdict_data_ready+0x12d/0x550 net/core/skmsg.c:1257
__udp_enqueue_schedule_skb+0xc54/0x10b0 net/ipv4/udp.c:1789
__udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline]
udp_queue_rcv_one_skb+0xac5/0x19c0 net/ipv4/udp.c:2475
__udp4_lib_mcast_deliver+0xc06/0xcf0 net/ipv4/udp.c:2585
__udp4_lib_rcv+0x10f6/0x2620 net/ipv4/udp.c:2724
ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207
ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241
NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318
dst_input include/net/dst.h:474 [inline]
ip_sublist_rcv_finish+0x221/0x2a0 net/ipv4/ip_input.c:584
ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline]
ip_sublist_rcv+0x5c6/0xa70 net/ipv4/ip_input.c:644
ip_list_rcv+0x3f1/0x450 net/ipv4/ip_input.c:678
__netif_receive_skb_list_ptype net/core/dev.c:6195 [inline]
__netif_receive_skb_list_core+0x7e5/0x810 net/core/dev.c:6242
__netif_receive_skb_list net/core/dev.c:6294 [inline]
netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6385
netif_receive_skb_list+0x54/0x410 net/core/dev.c:6437
xdp_recv_frames net/bpf/test_run.c:269 [inline]
xdp_test_run_batch net/bpf/test_run.c:350 [inline]
bpf_test_run_xdp_live+0x1946/0x1cf0 net/bpf/test_run.c:379
bpf_prog_test_run_xdp+0x81c/0x1160 net/bpf/test_run.c:1396
bpf_prog_test_run+0x2c7/0x340 kernel/bpf/syscall.c:4703
__sys_bpf+0x5cb/0x920 kernel/bpf/syscall.c:6182
__do_sys_bpf kernel/bpf/syscall.c:6274 [inline]
__se_sys_bpf kernel/bpf/syscall.c:6272 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6272
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 6021:
kasan_save_stack mm/kasan/common.c:57 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
poison_slab_object mm/kasan/common.c:253 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
kasan_slab_free include/linux/kasan.h:235 [inline]
slab_free_hook mm/slub.c:2540 [inline]
slab_free mm/slub.c:6674 [inline]
kfree+0x1be/0x650 mm/slub.c:6882
kfree_sk_msg include/linux/skmsg.h:385 [inline]
sk_msg_recvmsg+0xaa8/0xc30 net/core/skmsg.c:483
udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84
inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891
sock_recvmsg_nosec net/socket.c:1078 [inline]
sock_recvmsg+0x1a8/0x270 net/socket.c:1100
____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812
___sys_recvmsg+0x215/0x590 net/socket.c:2854
do_recvmmsg+0x334/0x800 net/socket.c:2949
__sys_recvmmsg net/socket.c:3023 [inline]
__do_sys_recvmmsg net/socket.c:3046 [inline]
__se_sys_recvmmsg net/socket.c:3039 [inline]
__x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy")
Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-5-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, sockmap: keep sk_msg copy state in sync

SK_MSG uses msg->sg.copy as per-scatterlist-entry provenance. Entries
with this bit set are copied before data/data_end are exposed to SK_MSG
BPF programs for direct packet access.

bpf_msg_pull_data(), bpf_msg_push_data(), and bpf_msg_pop_data()
rewrite the sk_msg scatterlist ring by collapsing, splitting, and
shifting entries. These operations move msg->sg.data[] entries, but the
parallel copy bitmap can be left behind on the old slot. A copied entry
can then return to msg->sg.start with its copy bit clear and be exposed
as directly writable packet data.

This corruption path requires an attached SK_MSG BPF program that calls
the mutating helpers; ordinary sockmap/TLS traffic that never runs
push/pop/pull helper sequences is not affected.

Keep msg->sg.copy synchronized with scatterlist entry moves, preserve
the copy bit when an entry is split, clear it when a helper replaces an
entry with a private page, and clear slots vacated by pull-data
compaction.

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages")
Cc: stable@vger.kernel.org
Co-developed-by: Han Guidong <2045gemini@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Han Guidong <2045gemini@gmail.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-4-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data()

When bpf_msg_push_data() splits a scatterlist element into head and
tail, the tail's page offset is advanced by `start` (absolute message
byte offset) instead of `start - offset` (byte position within the
element). This makes rsge.offset overshoot by `offset` bytes, pointing
to the wrong location within the page or beyond its boundary. Consumers
of the corrupted entry either silently read wrong data or trigger an
out-of-bounds access.

BUG: KASAN: slab-use-after-free in bpf_msg_pull_data (net/core/filter.c:2728)
Read of size 32752 at addr ffff8881042f0010 by task poc/130
Call Trace:
  __asan_memcpy (mm/kasan/shadow.c:105)
  bpf_msg_pull_data (net/core/filter.c:2728)
  bpf_prog_run_pin_on_cpu (include/linux/bpf.h:1402)
  sk_psock_msg_verdict (net/core/skmsg.c:934)
  tcp_bpf_send_verdict (net/ipv4/tcp_bpf.c:421)
  sock_sendmsg_nosec (net/socket.c:727)

Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Reported-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data()

When the scatterlist ring is full or nearly full, bpf_msg_push_data()
enters a copy fallback path and computes copy + len for the page
allocation size. Since len comes from BPF with arg3_type = ARG_ANYTHING
and both are u32, a crafted len can wrap the sum to a small value,
causing an undersized allocation followed by an out-of-bounds memcpy.

BUG: unable to handle page fault for address: ffffed104089a402
Oops: Oops: 0000 [#1] SMP KASAN NOPTI
Call Trace:
  __asan_memcpy (mm/kasan/shadow.c:105)
  bpf_msg_push_data (net/core/filter.c:2852 net/core/filter.c:2788)
  bpf_prog_9ed8b5711920a7d7+0x2e/0x36
  sk_psock_msg_verdict (net/core/skmsg.c:934)
  tcp_bpf_sendmsg (net/ipv4/tcp_bpf.c:421 net/ipv4/tcp_bpf.c:584)
  __sys_sendto (net/socket.c:2206)
  do_syscall_64 (arch/x86/entry/syscall_64.c:94)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Add an overflow check before the allocation.

Link: https://lore.kernel.org/all/20260424155913.A19FDC19425@smtp.kernel.org
Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Tested-by: Xiang Mei <xmei5@asu.edu>
Tested-by: Xinyu Ma <mmmxny@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftsets/bpf: Retry map update on helper_fill_hashmap()

helper_fill_hashmap() is used also on parallel and stress map tests.
Those are consistently failing with ENOMEM on kernels built with
PREEMPT_RT if preallocation is disabled. The failure is transient and
only called by the memory cache refill running in a preemptible
irq_work, which can easily stall in case of contention.

Use a retriable update in those cases to handle transient ENOMEM and
make the test more stable also on PREEMPT_RT.
Also fix the sign of the value printed in case of error (strerror()
expects a positive errno while updates return it negative).

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260611150704.95133-1-gmonaco@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'rust-7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/ojeda/linux

Pull Rust updates from Miguel Ojeda:
"This one is big due to the vendoring of the `zerocopy` library, which
  allows us to replace a bunch of `unsafe` code dealing with conversions
  between byte sequences and other types with safe alternatives. More
  details on that below (and in its merge commit).

  Toolchain and infrastructure:

   - Introduce support for the 'zerocopy' library [1][2]:

         Fast, safe, compile error. Pick two.

         Zerocopy makes zero-cost memory manipulation effortless. We write
         `unsafe` so you don't have to.

     It essentially provides derivable traits (e.g. 'FromBytes') and
     macros (e.g. 'transmute!') for safely converting between byte
     sequences and other types. Having such support allows us to remove
     some 'unsafe' code.

     It is among the most downloaded Rust crates and it is also used by
     the Rust compiler itself.

     It is licensed under "BSD-2-Clause OR Apache-2.0 OR MIT".

     The crates are imported essentially as-is (only +2/-3 lines needed
     to be adapted), plus SPDX identifiers. Upstream has since added the
     SPDX identifiers as well as one of the tweaks at my request, thus
     reducing our future diffs on updates -- I keep the details in one
     of our usual live lists [3].

     In total, it is about ~39k lines added, ~32k without counting
     'benches/' which are just for documentation purposes.

     The series includes a few Kbuild and rust-analyzer improvements and
     an example patch using it in Nova, removing one 'unsafe impl'.

     I checked that the codegen of an isolated example function (similar
     to the Nova patch on top) is essentially identical. It also turns
     out that (for that particular case) the 'zerocopy' version, even
     with 'debug-assertions' enabled, has no remaining panics, unlike a
     few in the current code (since the compiler can prove the remaining
     'ub_checks' statically).

     So their "fast, safe" does indeed check out -- at least in that
     case.

   - Support AutoFDO. This allows Rust code to be profiled and optimized
     based on the profile. Tested with Rust Binder: ~13% slower without
     AutoFDO in the binderAddInts benchmark (using an app-launch
     benchmark for the profile).

   - Support Software Tag-Based KASAN.

     In addition, fix KASAN Kconfig by requiring Clang.

   - Add Kconfig options for each existing Rust KUnit test suite, such
     as 'CONFIG_RUST_BITMAP_KUNIT_TEST'.

     They are placed within a new menu, 'CONFIG_RUST_KUNIT_TESTS', in
     the new 'rust/kernel/Kconfig.test' file.

   - Support the upcoming Rust 1.98.0 release (expected 2026-08-20):
     lint cleanups and an unstable flag rename.

   - Disable 'rustdoc' documentation inlining for all prelude items,
     which bloats the generated documentation.

   - Ignore (in Git) and clean (in Kbuild) the (rarely) 'rustc'-generated
     '*.long-type-*.txt' files.

  'kernel' crate:

   - Add new 'bitfield' module with the 'bitfield!' macro (extracted
     from the existing 'register!' one), which declares integer types
     that are split into distinct bit fields of arbitrary length.

     Each field is a 'Bounded' of the appropriate bit width (ensuring
     values are properly validated and avoiding implicit data loss) and
     gets several generated getters and setters (infallible, 'const' and
     fallible) as well as associated constants ('_MASK', '_SHIFT' and
     '_RANGE'). It also supports fields that can be converted from/to
     custom types, either fallibly ('?=>') or infallibly ('=>').

     For instance:

         bitfield! {
             struct Rgb(u16) {
                 15:11 blue;
                 10:5 green;
                 4:0 red;
             }
         }

         // Compile-time checks.
         let color = Rgb::zeroed().with_const_green::<0x1f>();

         assert_eq!(color.green(), 0x1f);
         assert_eq!(color.into_raw(), 0x1f << Rgb::GREEN_SHIFT);

     Add as well documentation and a test suite for it, as usual; and
     update the 'register!' macro to use it.

     It will be maintained by Alexandre Courbot (with Yury Norov as
     reviewer) under a new 'MAINTAINERS' entry: 'RUST [BITFIELD]'.

   - 'ptr' module: rework index projection syntax into keyworded syntax
     and introduce panicking variant.

     The keyword syntax ('build:', 'try:', 'panic:') is more explicit
     and paves the way of perhaps adding more flavors in the future,
     e.g. an 'unsafe' index projection.

     For instance, projections now look like this:

         fn f(p: *const [u8; 32]) -> Result {
             // Ok, within bounds, checked at build time.
             project!(p, [build: 1]);

             // Build error.
             project!(p, [build: 128]);

             // `OutOfBound` runtime error (convertible to `ERANGE`).
             project!(p, [try: 128]);

             // Runtime panic.
             project!(p, [panic: 128]);

             Ok(())
         }

     Update as well the users, which now look like e.g.

         // Pointer to the first entry of the GSP message queue.
         let data = project!(self.0.as_ptr(), .gspq.msgq.data[build: 0]);

   - 'build_assert' module: make the module the home of its macros
     instead of rendering them twice.

   - 'sync' module: add 'UniqueArc::as_ptr()' associated function.

   - 'alloc' module:

       - Fix the 'Vec::reserve()' doctest to properly account for the
         existing vector length in the capacity assertion.

       - Fix an incorrect operator in the 'Vec::extend_with()' 'SAFETY'
         comment; add a doc test demonstrating basic usage and the
         zero-length case.

   - Clean imports across several modules to follow the "kernel
     vertical" import style in order to minimize conflicts.

  'pin-init' crate:

   - User visible changes:

       - Do not generate 'non_snake_case' warnings for identifiers that
         are syntactically just users of a field name. This would allow
         all '#[allow(non_snake_case)]' in nova-core to be removed,
         which Gary will send to the nova tree next cycle.

       - Filter non-cfg attributes out properly in derived structs. This
         improves pin-init compatibility with other derive macros.

       - Insert projection types' where clause properly.

   - Other changes:

       - Bump MSRV to 1.82, plus associated cleanups.

       - Overhaul how init slots are projected. The new approach is
         easier to justify with safety comments.

       - Mark more functions as inline, which should help mitigate the
         super-long symbol name issue due to lack of inlining.

  rust-analyzer:

   - Support '--envs' for passing env vars for crates like 'zerocopy'.

  'MAINTAINERS':

   - Add the following reviewers to the 'RUST' entry:
       - Daniel Almeida
       - Tamir Duberstein
       - Alexandre Courbot
       - Onur Özkan

     They have been involved in the Rust for Linux project for about 7
     collective years and bring expertise across several domains, which
     will be very useful to have around in the future.

     Thanks everyone for stepping up!

  And some other fixes, cleanups and improvements"

Link: https://github.com/google/zerocopy
Link: https://docs.rs/zerocopy
Link: https://github.com/Rust-for-Linux/linux/issues/1239
* tag 'rust-7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/ojeda/linux: (86 commits)
  MAINTAINERS: add Onur Özkan as Rust reviewer
  MAINTAINERS: add Alexandre Courbot as Rust reviewer
  MAINTAINERS: add Tamir Duberstein as Rust reviewer
  MAINTAINERS: add Daniel Almeida as Rust reviewer
  kbuild: rust: clean `zerocopy-derive` in `mrproper`
  rust: make `build_assert` module the home of related macros
  rust: str: clean unused import for Rust >= 1.98
  rust: str: use the "kernel vertical" imports style
  rust: aref: use the "kernel vertical" imports style
  rust: page: use the "kernel vertical" imports style
  gpu: nova-core: firmware: parse `FalconUCodeDescV2` via `zerocopy`
  rust: prelude: add `zerocopy{,_derive}::FromBytes`
  rust: zerocopy-derive: enable support in kbuild
  rust: zerocopy-derive: add `README.md`
  rust: zerocopy-derive: avoid generating non-ASCII identifiers
  rust: zerocopy-derive: add SPDX License Identifiers
  rust: zerocopy-derive: import crate
  rust: zerocopy: enable support in kbuild
  rust: zerocopy: add `README.md`
  rust: zerocopy: remove float `Display` support
  ...

Merge tag 'rcu.release.v7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Uladzislau Rezki:
"Torture test updates:

   - Improve kvm-series.sh script by adding examples in its header
     comment

   - Lazy RCU is more fully tested now by replacing call_rcu_hurry()
     with call_rcu() and doing rcu_barrier() to motivate lazy callbacks
     during a stutter pause

   - Add more synonyms for the "--do-normal" group of torture.sh
     command-line arguments

  Misc changes:

   - Reduce stack usage of nocb_gp_wait() to address frame size warning
     when built with CONFIG_UBSAN_ALIGNMENT

   - The synchronize_rcu() call can detect the flood and latches a
     normal/default path temporary switching to wait_rcu_gp() path

   - Document using rcu_access_pointer() to fetch the old pointer for
     lockless cmpxchg() updates

   - Simplify some RCU code using clamp_val()

   - Fix a kerneldoc header comment typo in srcu_down_read_fast()"

* tag 'rcu.release.v7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux:
  rcu/nocb: reduce stack usage in nocb_gp_wait()
  rcu-tasks: Fix possible boot-time tests failed for the call_rcu_tasks()
  rcu: Latch normal synchronize_rcu() path on flood
  rcu: Document rcu_access_pointer() feeding into cmpxchg()
  rcu: Simplify param_set_next_fqs_jiffies() by applying clamp_val()
  rcu: Simplify rcu_do_batch() by applying clamp()
  checkpatch: Undeprecate rcu_read_lock_trace() and rcu_read_unlock_trace()
  srcu: Fix kerneldoc header comment typo in srcu_down_read_fast()
  torture: Allow "norm" abbreviation for "normal"
  torture: Improve kvm-series.sh header comment
  torture: Add torture_sched_set_normal() for user-specified nice values
  rcutorture: Fully test lazy RCU

Merge tag 'kcsan-20260612-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/melver/linux

Pull KCSAN update from Marco Elver:

- Silence -Wmaybe-uninitialized when calling __kcsan_check_access()

* tag 'kcsan-20260612-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/melver/linux:
kcsan: Silence -Wmaybe-uninitialized when calling __kcsan_check_access()

selftests/bpf: Add test for sleepable lsm_cgroup rejection

Confirm the verifier rejects loading a sleepable BPF_LSM_CGROUP program,
as introduced in commit 5b038319be44 ("bpf: Reject sleepable
BPF_LSM_CGROUP programs at load time").

Signed-off-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260611143549.703914-1-dwindsor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

udf: fix nls leak on udf_fill_super() failure

On all failure exits that go to error_out there we have already moved the
nls reference from uopt->nls_map to sbi->s_nls_map, leaving NULL behind.

Fixes: c4e89cc674ac ("udf: convert to new mount API")
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge remote-tracking branches 'ras/edac-drivers' and 'ras/edac-misc' into edac-updates

* ras/edac-drivers: (21 commits)
  EDAC: Consistently define pci_device_ids using named initializers
  EDAC/igen6: Add Intel Nova Lake-H SoC support
  EDAC/igen6: Make registers for detecting IBECC configurable
  EDAC/imh: Add RRL support for Intel Diamond Rapids server
  EDAC/{skx_common,i10nm}: Prepare RRL for sub-channel granularity
  EDAC/skx_common: Add SubChannel support to ADXL decode
  EDAC/{skx_common,i10nm}: Move RRL handling to common code
  EDAC/{skx_common,i10nm}: Introduce rrl_ctrl_mode
  EDAC/{skx_common,i10nm}: Rename rrl_mode to rrl_source_type
  EDAC/{skx_common,skx,i10nm}: Split skx_set_decode()
  EDAC/{skx_common,i10nm,imh}: Move MC register access helpers to skx_common
  EDAC/{skx_common,skx}: Fix UBSAN shift-out-of-bounds in skx_get_dimm_info
  EDAC/igen6: Add one Intel Panther Lake-H SoC support
  EDAC/igen6: Fix memory topology parsing for Panther Lake-H SoCs
  EDAC/igen6: Fix call trace due to missing release()
  EDAC/sb_edac: fix grammar in sb_decode_ddr3 warning
  EDAC/i5400: disable error reporting at teardown and refactor helper
  EDAC/i5100: disable error reporting at teardown and create helper
  EDAC/i5000: disable error reporting at teardown and refactor helper
  EDAC/i7300: disable error reporting if init fails and refactor helper
  ...

* ras/edac-misc:
  RAS/AMD/ATL: Drop malformed default N from Kconfig

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>

Merge branch 'bpf-fix-bpf_get-setsockopt-to-tos-for-ipv4-mapped-ipv6-socket'

Leon Hwang says:

====================
bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket

When TCP over IPv4 via INET6 API, sk->sk_family is AF_INET6, but it is a
v4 pkt. inet_csk(sk)->icsk_af_ops is ipv6_mapped and use ip_queue_xmit.
The tos sockopt does not work for bpf [get,set]sockopt() helpers.

Changelog:
v3 -> v4:
* Add 'sk->sk_type != SOCK_RAW && !ipv6_only_sock(sk)' check.
* Re-implement test with LLM assistance.
* v3: https://lore.kernel.org/all/20240914103226.71109-1-zhoufeng.zf@bytedance.com/

v2->v3:
* Use sk_is_inet() helper. (Eric Dumazet)
* https://lore.kernel.org/bpf/CANn89i+9GmBLCdgsfH=WWe-tyFYpiO27wONyxaxiU6aOBC6G8g@mail.gmail.com/T/

v1->v2:
* Fix compilation error. (kernel test robot)
* https://lore.kernel.org/bpf/202408152058.YXAnhLgZ-lkp@intel.com/T/
====================

Link: https://patch.msgid.link/20260613162443.60515-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper

Verify the fix by:

1. Attach cgroup sockops prog.
2. Build a tcp connection using ipv4 addr in ipv6 socket.
3. Verify the return value of bpf_setsockopt() helper.

Assisted-by: Codex:gpt-5.5-xhigh
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260613162443.60515-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket

When TCP over IPv4 via INET6 API, bpf_get/setsockopt with ipv4 will
fail, because sk->sk_family is AF_INET6. With ipv6 will success, not
take effect, because inet_csk(sk)->icsk_af_ops is ipv6_mapped and
use ip_queue_xmit, inet_sk(sk)->tos.

To relax this restriction, allow getting/setting tos for those possible
ipv4-mapped ipv6 sockets.

Fixes: ee7f1e1302f5 ("bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()")
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260613162443.60515-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'tools-build-bpf-append-extra_cflags-and-host_extracflags'

Leo Yan says:

====================
tools build: bpf: Append EXTRA_CFLAGS and HOST_EXTRACFLAGS

Append EXTRA_CFLAGS and HOST_EXTRACFLAGS to the BPF build.

This mitigates an issue introduced in GCC 15, where a {0} initializer
does not guarantee zeroing the entire union [1].

The common changes under tools to support EXTRA_CFLAGS and
HOST_EXTRACFLAGS are sent separately [2]. As suggested, BPF patches
would be picked up via the bpf tree, so this series only includes BPF
related changes.

Verification on bpf-ci (with tools changes [2]:

https://github.com/kernel-patches/bpf/actions/runs/26815163486

[1] https://gcc.gnu.org/gcc-15/changes.html
[2] https://lore.kernel.org/all/20260602-tools_build_fix_zero_init-v7-0-631baf679fe7@arm.com/

Signed-off-by: Leo Yan <leo.yan@arm.com>
---
Changes in v2:
- Used strscpy() instead in patch 06 (Ihor).
- Added prefix "bpf-next" in subject (Alexei).
- Added patch 01 to pass host cflags to bootstrap libbpf.
- Added patch 08 to avoid static LLVM linking for cross build.
- Link to v1: https://lore.kernel.org/r/20260323-tools_build_fix_zero_init_bpf_only-v1-0-d1cfad2f4cd1@arm.com
====================

Link: https://patch.msgid.link/20260602-tools_build_fix_zero_init_bpf_only-v2-0-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Avoid static LLVM linking for cross builds

The BPF selftests prefer static LLVM linking, which works for native
builds but can break cross builds. Its --link-static output may include
host-only libraries that are unavailable for the cross compilation,
causing link failures.

Avoid static LLVM linking for cross builds and use shared LLVM libraries
instead. Native builds keep the existing behavior.

Signed-off-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-8-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Use common CFLAGS for urandom_read

The urandom_read helper and its shared library are built with $(CLANG)
directly rather than through the normal selftest $(CC) rules.

The CFLAGS variable can contain specific flags only for $(CC) but might
be imcompatible for $(CLANG) and those flags are not necessarily valid
for the clang-only urandom_read build.

Split the BPF selftest local flags into COMMON_CFLAGS and append them to
CFLAGS for the normal build path. Use COMMON_CFLAGS directly for
urandom_read and liburandom_read.so, while still filtering out -static as
before.

Signed-off-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-7-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Initialize operation name before use

ASAN reports stack-buffer-overflow due to the uninitialized op_name.

Initialize it to fix the issue.

Fixes: 054b6c7866c7 ("selftests/bpf: Add verifier log tests for BPF_BTF_LOAD command")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-6-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools/bpf: build: Append extra cflags

Append EXTRA_CFLAGS to CFLAGS so that additional flags can be applied to
the compiler.

Signed-off-by: Leo Yan <leo.yan@arm.com>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-5-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Initialize CFLAGS before including Makefile.include

tools/scripts/Makefile.include may expand EXTRA_CFLAGS in a future
change. This could alter the initialization of CFLAGS, as the default
options "-g -O2" would never be set once EXTRA_CFLAGS is expanded.

Prepare for this by moving the CFLAGS initialization before including
tools/scripts/Makefile.include, so it is not affected by the extended
EXTRA_CFLAGS.

Append EXTRA_CFLAGS to CFLAGS only after including Makefile.include and
place it last so that the extra flags propagate properly and can
override the default options.

tools/scripts/Makefile.include already appends $(CLANG_CROSS_FLAGS) to
CFLAGS, the Makefile appends $(CLANG_CROSS_FLAGS) again, remove the
redundant append.

Signed-off-by: Leo Yan <leo.yan@arm.com>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-4-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Append extra host flags

Append HOST_EXTRACFLAGS to HOST_CFLAGS so that additional flags can be
applied to the host compiler.

Acked-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-3-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Avoid adding EXTRA_CFLAGS to HOST_CFLAGS

Prepare for future changes where EXTRA_CFLAGS may include flags not
applicable to the host compiler.

Move the HOST_CFLAGS assignment before appending EXTRA_CFLAGS to
CFLAGS so that HOST_CFLAGS does not inherit flags from EXTRA_CFLAGS.

Acked-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-2-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpftool: Pass host flags to bootstrap libbpf

bpftool builds a bootstrap libbpf with HOSTCC, but the libbpf submake can
still inherit target build flags through CFLAGS. This can break cross
builds when host objects are compiled with target-only options.

Since HOST_CFLAGS contains warning options that are not suitable for
building libbpf, use LIBBPF_BOOTSTRAP_CFLAGS with the warning options
removed to build the bootstrap libbpf. Clear EXTRA_CFLAGS so target
extra flags are not mixed into the host bootstrap libbpf build.

Signed-off-by: Leo Yan <leo.yan@arm.com>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20260602-tools_build_fix_zero_init_bpf_only-v2-1-c76e5250ea1c@arm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: correct CONFIG_PPC64 macro name in comment

A comment in tools/testing/selftests/bpf/progs/test_fill_link_info.c
incorrectly refers to CONFIG_PPC6 instead of CONFIG_PPC64. Correct it.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://lore.kernel.org/r/20260610044023.225820-1-enelsonmoore@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpf-allow-uprobe_multi-binary-specified-by-file-descriptor'

Jiri Olsa says:

====================
bpf: Allow uprobe_multi binary specified by file descriptor

Add ability to open uprobe_multi link on top of binary identified
by file descriptor. This allows us to avoid the race where the binary is
replaced between path resolution and attachment, ensuring we monitor the
intended binary.

v1: https://lore.kernel.org/bpf/20260609104244.588321-1-jolsa@kernel.org/T/#m0275d5f39805c57dc8fd3308c640237dc7aec4db
v2: https://lore.kernel.org/bpf/20260610143627.804790-1-jolsa@kernel.org/T/#m153e18fa426140fdcc773cc97b10e006531656c0

v3 changes:
- guard t_user acesss with access_ok [sashiko]

v2 changes:
- move path retrieval in separate function so CLASS(..) is not used in function
with goto-based cleanup [sashiko]
- force zero path_fd in case BPF_F_UPROBE_MULTI_PATH_FD is not set [sashiko]
- add space around | in bpf_uprobe_multi_link_attach [Alexei]
====================

Link: https://patch.msgid.link/20260611114230.950379-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Fix typo in verify_umulti_link_info

We verify info.uprobe_multi.flags against wrong kprobe-multi flag
(BPF_F_KPROBE_MULTI_RETURN). It's the same value as the correct
flag (BPF_F_UPROBE_MULTI_RETURN), so there's not functional change.

Fixes: 147c69307bcf ("selftests/bpf: Add link_info test for uprobe_multi link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-8-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add uprobe_multi path_fd fail tests

Adding tests to attach_api_fails suite to make sure we fail
wrong setup for path_fd usage.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-7-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add uprobe_multi path_fd test

Add a uprobe_multi link API selftest that opens /proc/self/exe and passes
the resulting descriptor through opts.uprobe_multi.path_fd with
BPF_F_UPROBE_MULTI_PATH_FD set.

Assisted-by: Codex:GPT-5.4
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: Add path_fd to struct bpf_link_create_opts

Adding the path_fd field to struct bpf_link_create_opts and passing it
through kernel attr interface.

Assisted-by: Codex:GPT-5.4
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add support to specify uprobe_multi target via file descriptor

Allow uprobe_multi link to identify the target binary by an already
opened file descriptor.

Adding new BPF_F_UPROBE_MULTI_PATH_FD flag and the path_fd field for
the attr.link_create.uprobe_multi struct.

When the flag is set, we resolve the target from path_fd, without the
flag, we keep the existing string path behavior.

I don't see a use case for supporting O_PATH file descriptors, because
we need to read the binary first to get probes offsets, so I'm using
the CLASS(fd, f), which fails for O_PATH fds.

Assisted-by: Codex:GPT-5.4
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Use user_path_at for path resolution in uprobe_multi

Resolve the uprobe_multi user path with user_path_at() instead of copying
the string with strndup_user() and passing it to kern_path(). This removes
the temporary allocation and keeps the lookup logic in one helper.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Guard __get_user acesss with access_ok for uprobe_multi data

As reported by sashiko [1] we need to use access_ok to check the user
space data bounds before we use __get-user to get it.

[1] https://lore.kernel.org/bpf/20260610145235.CB1441F00893@smtp.kernel.org/
Fixes: 0b779b61f651 ("bpf: Add cookies support for uprobe_multi link")
Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260611114230.950379-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'for-linus-7.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

- Several small cleanups of various Xen related drivers
   (xen/platform-pci, xen-balloon, xenbus, xen/mcelog)

- Cleanup for Xen PV-mode related code (includes dropping the Xen
   debugfs code)

- Drop the additional lazy mmu mode tracking done by Xen specific code

* tag 'for-linus-7.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/xenbus: Replace strcpy() with memcpy()
  x86/xen: Replace generic lazy tracking with cpu specific one
  x86/xen: Get rid of last XEN_LAZY_MMU uses
  mm: Refactor lazy_mmu_mode_pause() and lazy_mmu_mode_resume()
  x86/xen: Change interface of xen_mc_issue()
  x86/xen: Drop lazy mode from trace entries
  x86/xen: Remove Xen debugfs support
  x86/xen: Cleanup Xen related trace points
  x86/xen: Guard PV-only stuff in xen-ops.h with CONFIG_XEN_PV
  xen: balloon: Replace sprintf() with sysfs_emit()
  xen/mcelog: mark g_physinfo, ncpus and xen_mce_chrdev_device as __ro_after_init
  xen: constify xsd_errors array
  xen/platform-pci: Simplify initialization of pci_device_id array

Merge tag 'kbuild-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux

Pull Kbuild / Kconfig updates from Nathan Chancellor:
"Kbuild:

   - Remove broken module linking exclusion for BTF

   - Add documentation around how offset header files work

   - Include unstripped vDSO libraries in pacman packages

   - Bump minimum version of LLVM for building the kernel to 17.0.1 and
     clean up unnecessary workarounds

   - Use a context manager in run-clang-tools

   - Add dist macro value if present to release tag for RPM packages

   - Detect and report truncated buf_printf() output in modpost

   - Add __llvm_covfun and __llvm_covmap to section whitelist in modpost

   - Support Clang's distributed ThinLTO mode

   - Remove architecture specific configurations for AutoFDO and
     Propeller to ease individual architecture maintenance

  Kconfig:

   - Add kconfig-sym-check target to look for dangling Kconfig symbol
     references and invalid tristate literal values

   - Harden against potential NULL pointer dereference

   - Fix typo in Kconfig test comment"

* tag 'kbuild-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (31 commits)
  kconfig: tests: fix typo in comment
  kconfig: Remove the architecture specific config for Propeller
  kconfig: Remove the architecture specific config for AutoFDO
  modpost: Add __llvm_covfun and __llvm_covmap to section_white_list
  kconfig: add kconfig-sym-check static checker
  kbuild: Remove unnecessary 'T' modifier in cmd_ar_builtin_fixup
  kbuild: distributed build support for Clang ThinLTO
  kbuild: move vmlinux.a build rule to scripts/Makefile.vmlinux_a
  scripts: modpost: detect and report truncated buf_printf() output
  kbuild: rpm-pkg: append %{?dist} macro to Release tag
  run-clang-tools: run multiprocessing.Pool as context manager
  compiler-clang.h: Drop explicit version number from "all" diagnostic macro
  compiler-clang.h: Remove __cleanup -Wunused-variable workaround
  kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT
  x86/entry/vdso32: Remove conditional omission of '.cfi_offset eflags'
  x86/module: Revert "Deal with GOT based stack cookie load on Clang < 17"
  x86/build: Drop unnecessary '-ffreestanding' addition to KBUILD_CFLAGS
  scripts/Makefile.warn: Drop -Wformat handling for clang < 16
  riscv: Drop tautological condition from TOOLCHAIN_NEEDS_OLD_ISA_SPEC
  riscv: Remove tautological condition from selection of ARCH_SUPPORTS_CFI
  ...

arm64: mm: Remove misleading pte_none() comment from ptep_try_set()

This comment was thoughtlessly copied from the x86 version and doesn't
apply to arm64. Remove it.

Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20260614210209.2371030-1-tj@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'pull-configfs-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull configfs updates from Al Viro:
"A couple of fixes (UAF in configfs_lookup() and really old races
  introduced when lseek() on configfs directories stopped locking those
  directories; impact up to and including UAF).

  Fixes aside, the main result is that configfs is finally switched to
  tree-in-dcache machinery. It's *not* making use of recursive removal
  helpers yet, and it still does the bloody awful "build subtree in full
  sight of userland, with possibility of failure halfway through and
  need to unroll" that forces the locking model from hell; dealing with
  that is a separate patch series, once this one is out of the way.
  However, it is using DCACHE_PERSISTENT properly now. And apparmorfs is
  the sole remaining user of __simple_{unlink,rmdir}() at that point"

* tag 'pull-configfs-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  create_default_group(): pass parent's dentry instead of config_group
  configfs_attach_group(): drop the unused parent_item argument
  configs_attach_item(): drop unused parent_item argument
  configfs_create(): lift parent timestamp updates into callers
  kill configfs_drop_dentry()
  configfs: mark pinned dentries persistent
  configfs: dentry refcount needs to be pinned only once
  switch configfs_detach_{group,item}() to passing dentry
  configfs_remove_dir(), detach_attrs(): switch to passing dentry
  populate_attrs(): move cleanup to the sole caller
  populate_group(): move cleanup on failure to the sole caller
  configfs_detach_rollback(): pass configfs_dirent instead of dentry
  configfs_do_depend_item(): pass configfs_dirent instead of dentry
  configfs_depend_prep(): pass configfs_dirent instead of dentry
  configfs_detach_prep(): pass configfs_dirent instead of dentry
  configfs_mkdir(): use take_dentry_name_snapshot()
  configfs: fix lockless traversals of ->s_children
  configfs_lookup(): don't leave ->s_dentry dangling on failure

Merge tag 'pull-d_add' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dentry d_add() cleanups from Al Viro:
"This converts a bunch of unidiomatic uses of d_add() in ->lookup()
  instances to equivalent uses of d_splice_alias(), which is the normal
  mechanism for ->lookup()"

* tag 'pull-d_add' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: use d_splice_alias() for ->lookup() return value
  ntfs: use d_splice_alias() for ->lookup() return value
  simple_lookup(): use d_splice_alias() for ->lookup() return value
  ecryptfs: use d_splice_alias() for ->lookup() return value
  configfs_lookup(): switch to d_splice_alias()
  tracefs: use d_splice_alias() in ->lookup() instances

Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dcache updates from Al Viro:

- d_alloc_parallel() API change (Neil's with my changes)

- NORCU fixes

- Reorganization and simplification of dentry eviction logic

- Simplifying rcu_read_lock() scopes in fs/dcache.c

- Secondary roots work - getting rid of NFS fake root dentries and
   dealing with remaining shrink_dcache_for_umount() and
   shrink_dentry_list() races

- making cursors NORCU (surprisingly easy)

* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits)
  make cursors NORCU
  nfs: get rid of fake root dentries
  wind ->s_roots via ->d_sib instead of ->d_hash
  shrink_dentry_tree(): unify the calls of shrink_dentry_list()
  shrinking rcu_read_lock() scope in d_alloc_parallel()
  d_walk(): shrink rcu_read_lock() scope
  document dentry_kill()
  adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
  Document rcu_read_lock() use in select_collect2()
  Shift rcu_read_{,un}lock() inside fast_dput()
  simplify safety for lock_for_kill() slowpath
  fold lock_for_kill() and __dentry_kill() into common helper
  fold lock_for_kill() into shrink_kill()
  shrink_dentry_list(): start with removing from shrink list
  d_prune_aliases(): make sure to skip NORCU aliases
  kill d_dispose_if_unused()
  make to_shrink_list() return whether it has moved dentry to list
  select_collect(): ignore dentries on shrink lists if they have positive refcounts
  find_acceptable_alias(): skip NORCU aliases with zero refcount
  fix a race between d_find_any_alias() and final dput() of NORCU dentries
  ...

Merge tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull procfs updates from Christian Brauner:

- Revamp fs/filesystems.c

   The file was a mess with a hand-rolled linked list in desperate need
   of a cleanup. The filesystems list is now RCU-ified, /proc files can
   be marked permanent from outside fs/proc/, and the string emitted
   when reading /proc/filesystems is pre-generated and cached instead of
   pointer-chasing and printfing entry by entry on every read.

   The file is read frequently because libselinux reads it and is linked
   into numerous frequently used programs (even ones you would not
   suspect, like sed!). Scalability also improves since reference
   maintenance on open/close is bypassed.

    open+read+close cycle single-threaded (ops/s):
      before: 442732
      after:  1063462 (+140%)

    open+read+close cycle with 20 processes (ops/s):
      before: 606177
      after:  3300576 (+444%)

   A follow-up patch adds missing unlocks in some corner cases and
   tidies things up.

- Relax the mount visibility check for subset=pid mounts

   When procfs is mounted with subset=pid, all static files become
   unavailable and only the dynamic pid information is accessible. In
   that case there is no point in imposing the full mount visibility
   restrictions on the mounter - everything that can be hidden in procfs
   is already inaccessible. These restrictions prevented procfs from
   being mounted inside rootless containers since almost all container
   implementations overmount parts of procfs to hide certain
   directories.

   As part of this /proc/self/net is only shown in subset=pid mounts for
   CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the
   SB_I_USERNS_VISIBLE superblock flag is replaced with an
   FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are
   recorded in a list, and the mount restrictions are finally
   documented.

- Protect ptrace_may_access() with exec_update_lock in procfs

   Most uses of ptrace_may_access() in procfs should hold
   exec_update_lock to avoid TOCTOU issues with concurrent privileged
   execve() (like setuid binary execution).

   This fixes the easy cases - the owner and visibility checks and the
   FD link permission checks - with the gnarlier ones to follow later.

* tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix ups and tidy ups to /proc/filesystems caching
  proc: protect ptrace_may_access() with exec_update_lock (FD links)
  proc: protect ptrace_may_access() with exec_update_lock (part 1)
  docs: proc: add documentation about mount restrictions
  proc: handle subset=pid separately in userns visibility checks
  proc: prevent reconfiguring subset=pid
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  fs: cache the string generated by reading /proc/filesystems
  sysfs: remove trivial sysfs_get_tree() wrapper
  fs: RCU-ify filesystems list
  fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
  proc: allow to mark /proc files permanent outside of fs/proc/
  namespace: record fully visible mounts in list

Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
"Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...

Merge tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull simple_xattr updates from Christian Brauner:
"This reworks the simple xattr api to make it more efficient and easier
  to use for all consumers.

  The simple_xattr hash table moves from the inode into a per-superblock
  cache, removing the per-inode overhead for the common case of few or
  no xattrs. The interface now passes struct simple_xattrs ** so lazy
  allocation is handled internally instead of by every caller, kernfs
  xattr operations on kernfs nodes shared between multiple superblocks
  are properly serialized, and tmpfs constructs "security.foo" xattr
  names with kasprintf() instead of kmalloc() plus two memcpy()s.

  A follow-up fix links kernfs nodes to their parent before the LSM init
  hook runs: with the per-sb cache kernfs_xattr_set() computes the cache
  via kernfs_root(kn), which faulted on a freshly allocated node when
  selinux_kernfs_init_security() called into it - reproducible as a NULL
  pointer dereference on the first cgroup mkdir on SELinux-enabled
  systems.

  On top of this bpffs gains support for trusted.* and security.* xattrs
  so that user space and BPF LSM programs can attach metadata - for
  example a content hash or a security label - to pinned objects and
  directories and inspect it uniformly like on other filesystems. The
  store is in-memory and non-persistent, living only for the lifetime of
  the mount like everything else in bpffs"

* tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  bpf: Add simple xattr support to bpffs
  kernfs: link kn to its parent before the LSM init hook
  simpe_xattr: use per-sb cache
  simple_xattr: change interface to pass struct simple_xattrs **
  tmpfs: simplify constructing "security.foo" xattr names
  kernfs: fix xattr race condition with multiple superblocks

Merge tag 'vfs-7.2-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:

- Add the vfs infrastructure required to implement fs-verity support
   for XFS with a post-EOF merkle tree: fsverity generates and stores a
   zero-block hash, and iomap learns to verify data on buffered reads,
   to handle fsverity during writeback via the new IOMAP_F_FSVERITY
   flag, and to write fsverity metadata through iomap_fsverity_write().

- Skip the memset of the iomap in iomap_iter() once the iteration is
   done. In high-IOPS scenarios (4k randread NVMe polling via io_uring)
   the pointless memset wasted memory write bandwidth; this improves
   IOPS by about 5% on ext4 and xfs.

- Add balance_dirty_pages_ratelimited() to iomap_zero_iter(), aligning
   it with iomap_write_iter(). This prepares for the exFAT iomap
   conversion where zeroing beyond valid_size can trigger large-scale
   zeroing operations that caused memory pressure without throttling.

- Remove the over-strict inline data boundary check. If a filesystem
   provides a valid inline_data pointer and length there is no reason to
   require that inline data must not cross a page boundary.

- Don't make REQ_POLLED imply REQ_NOWAIT, matching the earlier
   equivalent block layer fix: there are valid cases to poll for I/O
   completion without REQ_NOWAIT, and REQ_NOWAIT for file system writes
   is currently not supported as writes aren't idempotent.

- Introduce IOMAP_F_ZERO_TAIL for filesystems that maintain a separate
   valid data length (exFAT, NTFS). For a write starting at or beyond
   valid_size, __iomap_write_begin() now zeroes only the tail portion of
   the block while preserving valid data before it, instead of leaving
   stale data in the page cache. The flag is also added to the iomap
   trace event strings.

* tag 'vfs-7.2-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: Add IOMAP_F_ZERO_TAIL flag to trace event strings
  iomap: introduce iomap_fsverity_write() for writing fsverity metadata
  iomap: teach iomap to read files with fsverity
  iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
  fsverity: generate and store zero-block hash
  iomap: introduce IOMAP_F_ZERO_TAIL flag
  iomap: don't make REQ_POLLED imply REQ_NOWAIT
  iomap: remove over-strict inline data boundary check
  iomap: add dirty page control to iomap_zero_iter
  iomap: avoid memset iomap when iter is done

Merge tag 'vfs-7.2-rc1.eventpoll' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull eventpoll updates from Christian Brauner:

- eventpoll clarity refactor

   The recent eventpoll UAF fixes (a6dc643c6931 and follow-ups) depended
   on invariants in fs/eventpoll.c that were nowhere documented and had
   to be reverse-engineered from the code: the lifetime relationships
   between struct eventpoll, struct epitem, and struct file, the three
   removal paths coordinating via epi_fget() pins and ep->mtx, the
   ovflist sentinel-encoded scan state machine, the POLLFREE
   release/acquire handshake, and the loop / path check globals
   serialized by epnested_mutex. The fixes were correct but the next
   person to touch this code would hit the same learning curve.

   This series codifies those invariants in source and tightens the
   surrounding structure. No functional changes intended:

     - Documentation: a top-of-file overview with field-protection
       tables for struct eventpoll and struct epitem, a section
       gathering the loop-check / path-check globals next to their
       declarations, labelled comments on the two sides of the POLLFREE
       handshake, refreshed comments on epi_fget() and ep_remove_file(),
       and a docblock on ep_clear_and_put() that names its two-pass
       structure as load-bearing.

     - Mechanical renames: ep_refcount_dec_and_test() -> ep_put() to
       pair with ep_get(), attach_epitem() -> ep_attach_file() for
       ep_remove_file() symmetry, the unused depth argument dropped from
       epoll_mutex_lock(), and the CONFIG_KCMP block relocated next to
       CONFIG_COMPAT so the hot-path code is contiguous.

     - Helper extraction: ep_insert() splits into ep_alloc_epitem() and
       ep_register_epitem(), ep_clear_and_put()'s two passes become
       ep_drain_pollwaits() and ep_drain_tree() so the ordering
       invariant is enforced by the call sequence rather than
       convention, the per-event delivery loop body becomes
       ep_deliver_event(), and the ep->mtx + epnested_mutex acquisition
       dance lifts out of do_epoll_ctl() into ep_ctl_lock() /
       ep_ctl_unlock().

     - Sentinel and predicate cleanup: the EP_UNACTIVE_PTR overload is
       hidden behind named helpers (ep_is_scanning, epi_on_ovflist,
       ...), epi->next is renamed to epi->ovflist_next, and the boolean
       predicates return bool.

     - The per-CTL_ADD scratch state (tfile_check_list, path_count[],
       inserting_into) moves from file-scope globals into a
       stack-allocated struct ep_ctl_ctx plumbed through the loop / path
       check chain.

   Two follow-up fixes are included: missing kernel-doc for the new @ctx
   parameters, and restoring the EP_UNACTIVE_PTR sentinel for
   ctx->tfile_check_list - replacing it with NULL termination broke
   ep_remove_file()'s "never listed" check for the list tail, causing a
   syzbot-reported use-after-free.

- io_uring related epoll cleanups

   One of the nastier things about epoll is how it allows nesting
   contexts inside each other, leading to the necessity of loop
   detection and the issues that have come with that. There is no reason
   to support nesting on the io_uring side, so contain the damage and
   disallow nested contexts from there: eventpoll gains a file based
   control interface and struct epoll_filefd is renamed to epoll_key.
   The io_uring side proper goes on top of this through the block tree.

- Fix epoll_wait() reporting false negatives

   ep_events_available() checks ep->rdllist and ep_is_scanning() without
   a lock and can race with a concurrent scan such that neither check
   sees the events, causing epoll_wait() with a zero timeout to wrongly
   report no events even though events are available. A sequence lock
   closes the race and a reproducer is added to the eventpoll selftests.

* tag 'vfs-7.2-rc1.eventpoll' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  eventpoll: restore EP_UNACTIVE_PTR sentinel for ctx->tfile_check_list
  eventpoll: Fix epoll_wait() report false negative
  selftests/eventpoll: Add test for multiple waiters
  eventpoll: add missing kernel-doc for @ctx function parameters
  eventpoll: rename struct epoll_filefd to epoll_key
  eventpoll: add file based control interface
  eventpoll: export is_file_epoll()
  eventpoll: pass struct epoll_filefd through ep_find() and ep_insert()
  eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx
  eventpoll: use bool for predicate helpers
  eventpoll: rename epi->next and txlist for clarity
  eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers
  eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock()
  eventpoll: extract ep_deliver_event() from ep_send_events()
  eventpoll: split ep_clear_and_put() into drain helpers
  eventpoll: split ep_insert() into alloc + register stages
  eventpoll: relocate KCMP helpers near compat syscalls
  eventpoll: rename attach_epitem() to ep_attach_file()
  eventpoll: drop unused depth argument from epoll_mutex_lock()
  eventpoll: rename ep_refcount_dec_and_test() to ep_put()
  ...

Merge tag 'vfs-7.2-rc1.bh' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull buffer_head updates from Christian Brauner:
"This removes b_end_io from struct buffer_head.

  Instead of setting bio->bi_end_io to end_bio_bh_io_sync() which then
  calls bh->b_end_io(), the new bh_submit() and __bh_submit() interfaces
  set bio->bi_end_io to the appropriate completion handler directly,
  replacing two indirect function calls in the completion path with one.
  It is also one fewer function pointer in the middle of a writable data
  structure that can be corrupted, it shrinks struct buffer_head from
  104 to 96 bytes allowing roughly 7% more buffer_heads to be cached in
  the same amount of memory, and it removes some atomic operations as
  the buffer refcount is no longer incremented before calling the end_io
  handler.

  All in-tree users (fs/buffer.c itself, ext4, jbd2, ocfs2, gfs2,
  nilfs2, and md-bitmap) are converted, and submit_bh(),
  mark_buffer_async_write(), and end_buffer_write_sync() are removed"

* tag 'vfs-7.2-rc1.bh' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (34 commits)
  buffer: Remove end_buffer_write_sync()
  buffer: Change calling convention for end_buffer_read_sync()
  buffer: Remove b_end_io
  buffer: Remove submit_bh()
  md-bitmap: Convert read_file_page and write_file_page to bh_submit()
  nilfs2: Convert nilfs_mdt_submit_block to bh_submit()
  nilfs2: Convert nilfs_gccache_submit_read_data to bh_submit()
  nilfs2: Convert nilfs_btnode_submit_block to bh_submit()
  buffer: Remove mark_buffer_async_write()
  gfs2: Convert gfs2_aspace_write_folio to bh_submit()
  gfs2: Remove use of b_end_io in gfs2_meta_read_endio()
  gfs2: Convert gfs2_dir_readahead to bh_submit()
  gfs2: Convert gfs2_metapath_ra to bh_submit()
  ocfs2: Convert ocfs2_write_super_or_backup to bh_submit()
  ocfs2: Convert ocfs2_read_blocks to bh_submit()
  ocfs2: Convert ocfs2_read_block to bh_submit()
  ocfs2: Convert ocfs2_write_block to bh_submit()
  jbd2: Convert jbd2_write_superblock() to bh_submit()
  jbd2: Convert journal commit to bh_submit()
  ext4: Convert ext4_commit_super() to bh_submit()
  ...

drm/bridge: anx7625: Add WQ_PERCPU add to alloc_workqueue

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.

In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>

wifi: ath6kl: fix invalid workqueue flags in ath6kl_usb_create()

ath6kl_usb_create() currently creates ath6kl_wq with flags set to 0:

  alloc_workqueue("ath6kl_wq", 0, 0)

This triggers a runtime warning in __alloc_workqueue() because the queue is
created with neither WQ_PERCPU nor WQ_UNBOUND set:

  workqueue: ath6kl_wq is using neither WQ_PERCPU or WQ_UNBOUND.
  Setting WQ_PERCPU.

Set WQ_PERCPU explicitly to match the actual execution model and remove the
warning during device probe. No functional change intended.

Fixes: 21c05ca88a54 ("workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present")
Reported-by: syzbot+f80c62f371ba6a1e7d79@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/6a289c01.39669fcc.33b062.00aa.GAE@google.com/T/
Signed-off-by: wuyankun <wuyankun@uniontech.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

btrfs: Drop WQ_PERCPU from ordered_flags in btrfs_init_workqueues()

After commit 21c05ca88a54 ("workqueue: Add warnings and ensure one among
WQ_PERCPU or WQ_UNBOUND is present"), there is a warning from the
btrfs-qgroup-rescan workqueue at run time:

workqueue: btrfs-qgroup-rescan uses both WQ_PERCPU and WQ_UNBOUND. Dropped WQ_PERCPU, keeping WQ_UNBOUND.

WQ_PERCPU is included in ordered_flags after commit 69635d7f4b34 ("fs:
WQ_PERCPU added to alloc_workqueue users") and WQ_UNBOUND is set in
alloc_ordered_workqueue(), which btrfs_alloc_ordered_workqueue() calls.

Drop WQ_PERCPU from ordered_flags, as alloc_ordered_workqueue() notes
that only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful.

Fixes: 69635d7f4b34 ("fs: WQ_PERCPU added to alloc_workqueue users")
Fixes: 21c05ca88a54 ("workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Acked-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'vfs-7.2-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs writeback updates from Christian Brauner:

- Fix a race between cgroup_writeback_umount() and inode_switch_wbs()

   When a container exits, a race between cgroup_writeback_umount() and
   inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy
   inodes after unmount" followed by a use-after-free on percpu
   counters.

   There is a window between inode_prepare_wbs_switch() returning true
   (having passed the SB_ACTIVE check and grabbed the inode) and the
   subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes
   the global isw_nr_in_flight counter as non-zero but flush_workqueue()
   finds nothing queued yet, it returns early - leaving a held inode
   reference that blocks evict_inodes() and a later iput() that hits
   freed percpu counters.

   The race is closed by covering the window from
   inode_prepare_wbs_switch() through wb_queue_isw() with an RCU
   read-side critical section and synchronizing in the umount path.

   On top of that the now-dead rcu_barrier() left over from the
   queue_rcu_work() era is removed, and the global
   synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb
   in-flight counter plus pin/unpin/drain helpers so umount no longer
   serializes against switch activity on unrelated superblocks.

   Under cgroup writeback churn on a 16 vCPU guest this takes umount
   latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost
   of cgroup_writeback_umount() from ~62ms to ~4us per call.

   The initial race fix is kept separate and minimal so it backports
   cleanly to stable trees that still queue switches via
   queue_rcu_work().

- Improve write performance with RWF_DONTCACHE

   Dirty DONTCACHE pages are now tracked per bdi_writeback so that the
   writeback flusher can be kicked in a targeted fashion for
   IOCB_DONTCACHE writes instead of relying on global writeback, and the
   PG_dropbehind flag is preserved when a folio is split.

* tag 'vfs-7.2-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  mm: track DONTCACHE dirty pages per bdi_writeback
  mm: preserve PG_dropbehind flag during folio split
  writeback: use a per-sb counter to drain inode wb switches at umount
  writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()
  writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

Merge tag 'vfs-7.2-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs superblock updates from Christian Brauner:
"This retires sget().

  CIFS plus the two ext4 KUnit tests (extents-test, mballoc-test) were
  the last in-tree callers, and all three convert cleanly to sget_fc().

  That lets sget() and its prototype come out, taking ~60 lines that
  only existed to be kept in lockstep with sget_fc() on every
  publish-path change"

* tag 'vfs-7.2-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: retire sget()
  smb: client: convert cifs_smb3_do_mount() to sget_fc()
  ext4: convert mballoc KUnit test to sget_fc()
  ext4: convert extents KUnit test to sget_fc()

Merge tag 'vfs-7.2-rc1.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull openat2 updates from Christian Brauner:
"Features:

   - Add O_EMPTYPATH to openat(2)/openat2(2). To get an operable file
     descriptor from an O_PATH file descriptor it is possible to use
     openat(fd, ".", O_DIRECTORY) for directories, but other file types
     require going through open("/proc/<pid>/fd/<nr>") and thus depend
     on a functioning procfs.

     With O_EMPTYPATH an empty path string is accepted and LOOKUP_EMPTY
     is set at path resolution time, allowing to reopen the file behind
     the file descriptor directly. Selftests are included.

   - Add an OPENAT2_REGULAR flag for openat2(2) which refuses to open
     anything but regular files with the new EFTYPE error code.

     This implements the "ability to only open regular files" feature
     requested by userspace via uapi-group.org and protects services
     from being redirected to fifos, device nodes, and friends.

     All atomic_open implementations were audited for OPENAT2_REGULAR
     handling. Explicit checks were added to ceph, gfs2, nfs (v4), and
     cifs/smb - these are the filesystems whose atomic_open can
     encounter an existing non-regular file and would otherwise call
     finish_open() on it or return a misleading error code.

     The remaining implementations (9p, fuse, vboxsf, nfs v2/v3) only
     call finish_open() on freshly created files and use
     finish_no_open() for lookup hits, letting the VFS catch non-regular
     files via the do_open() safety net.

  Cleanups:

   - Migrate the openat2 selftests to the kselftest harness and move
     them under selftests/filesystems/. The tests were written in the
     early days of selftests' TAP support and the modern kselftest
     harness is much easier to follow and maintain. The contents of the
     tests are unchanged and the new emptypath tests are ported on top.

   - Make the LAST_XXX last-type constants private to fs/namei.c. The
     only user outside of fs/namei.c was ksmbd which only needs to know
     whether the last component is a regular one, so
     vfs_path_parent_lookup() now performs the LAST_NORM check
     internally. The ints are replaced with a dedicated enum last_type"

* tag 'vfs-7.2-rc1.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: replace ints with enum last_type for LAST_XXX
  vfs: make LAST_XXX private to fs/namei.c
  selftests: openat2: port emptypath_test to kselftest harness
  kselftest/openat2: test for OPENAT2_REGULAR flag
  openat2: new OPENAT2_REGULAR flag support
  openat2: introduce EFTYPE error code
  selftest: add tests for O_EMPTYPATH
  vfs: add O_EMPTYPATH to openat(2)/openat2(2)
  selftests: openat2: migrate to kselftest harness
  selftests: openat2: switch from custom ARRAY_LEN to ARRAY_SIZE
  selftests: openat2: move helpers to header
  selftests: move openat2 tests to selftests/filesystems/

Merge tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc kernel updates from Christian Brauner:
"Fixes

   - rhashtable: give each instance its own lockdep class

     syzbot reported a circular locking dependency between ht->mutex and
     fs_reclaim via the simple_xattrs rhashtable being torn down during
     inode eviction.

     The predicted deadlock cannot occur: rhashtable_free_and_destroy()
     cancels the deferred worker before taking ht->mutex and
     acquisitions on distinct rhashtables are on distinct mutexes.

     Lockdep flags a cycle anyway because every ht->mutex in the kernel
     shared the single static lockdep class from
     rhashtable_init_noprof().

     The lockdep key is lifted to a per-call-site static key so every
     rhashtable instance gets its own class.

   - selftests/clone3: fix misuse of the libcap library interface in the
     cap_checkpoint_restore test and remove unused variables

   - selftests/pid_namespace: compute the pid_max test limits
     dynamically instead of hardcoding values below the kernel-enforced
     minimum of PIDS_PER_CPU_MIN * num_possible_cpus() which made the
     tests fail on machines with many possible CPUs

   - selftests: fix the Makefile TARGETS entry for nsfs which wasn't
     adjusted when the tests moved under filesystems/

  Cleanups

   - ipc/sem.c: use unsigned int for nsops to match the declaration in
     syscalls.h"

* tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests/clone3: remove unused variables
  selftests/clone3: fix libcap interface usage
  ipc/sem.c: use unsigned int for nsops
  selftests: Fix Makefile target for nsfs
  rhashtable: give each instance its own lockdep class
  selftests/pid_namespace: compute pid_max test limits dynamically

Merge tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull task_exec_state updates from Christian Brauner:
"This introduces a new per-task task_exec_state structure and relocates
  the dumpable mode and the user namespace captured at execve() from
  mm_struct onto it. It stays attached to the task for its full
  lifetime.

  __ptrace_may_access() and several /proc owner and visibility checks
  need to consult two pieces of state for any observable task, including
  zombies that have already gone through exit_mm(): the dumpable mode
  and the user namespace captured at execve(). Both live on mm_struct
  today, which exit_mm() clears from the task long before the task is
  reaped. A reader that races with do_exit() observes task->mm == NULL
  and either fails the check or falls back to init_user_ns - which
  denies legitimate access to non-dumpable zombies that were running in
  a nested user namespace.

  mm_struct loses ->user_ns and the dumpability bits in ->flags.
  MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed
  via /proc/<pid>/coredump_filter stays stable. task->user_dumpable and
  its exit_mm() snapshot are removed.

  task_exec_state is the privilege domain established by an execve().
  Within a thread group it is shared via refcount; across thread groups
  each task has its own:

   - CLONE_VM siblings (thread-group members, io_uring workers)
     refcount-share the parent's exec_state.

   - Non-CLONE_VM clones (fork(), vfork() without CLONE_VM) allocate a
     fresh exec_state inheriting the parent's dumpable mode and user_ns.

   - execve() in the child allocates a fresh instance and installs it
     under task_lock + exec_update_lock via task_exec_state_replace().

   - Credential changes (setresuid, capset, ...) and
     prctl(PR_SET_DUMPABLE) update dumpability on the current task's
     exec_state, i.e., on the thread group's shared instance.

  On top of this exec_mmap() no longer tears down the old mm while
  holding exec_update_lock for writing and cred_guard_mutex. Neither
  lock is needed for that: exec_update_lock only exists to make the mm
  swap atomic with the later commit_creds() and all its readers operate
  on the new mm; none looks at the detached old mm.

  The cost was real: __mmput() runs exit_mmap() over the entire old
  address space and can block in exit_aio() waiting for in-flight AIO,
  so execve() of a large process blocked ptrace_attach() and every
  exec_update_lock reader for the duration of the teardown.

  The old mm is now stashed in bprm->old_mm and released from
  setup_new_exec() after both locks are dropped, with a backstop in
  free_bprm() for the error paths"

* tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exec: free the old mm outside the exec locks
  exec_state: relocate dumpable information
  ptrace: add ptracer_access_allowed()
  exec: introduce struct task_exec_state
  sched/coredump: introduce enum task_dumpable

Merge tag 'vfs-7.2-rc1.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs casefolding updates from Christian Brauner:
"This exposes the case folding behavior of local filesystems so that
  file servers - nfsd, ksmbd, and user space file servers - can report
  the actual behavior to clients instead of guessing.

  Filesystems report case-insensitive and case-nonpreserving behavior
  via new file_kattr flags in their fileattr_get implementations. fat,
  exfat, ntfs3, hfs, hfsplus, xfs, cifs, nfs, vboxsf, and isofs are
  wired up. Local filesystems that are not explicitly handled default to
  the usual POSIX behavior of case-sensitive and case-preserving.

  nfsd uses this to report case folding via NFSv3 PATHCONF and to
  implement the NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
  attributes - both have been part of the NFS protocols for decades to
  support clients on non-POSIX systems - and ksmbd reports it via
  FS_ATTRIBUTE_INFORMATION. Exposing the information through the
  fileattr uapi covers user space file servers.

  The immediate motivation is interoperability: Windows NFS clients
  hard-require servers to report case-insensitivity for Win32
  applications to work correctly, and a client that knows the server is
  case-insensitive can avoid issuing multiple LOOKUP/READDIR requests
  searching for case variants.

  The Linux NFS client already grew support for case-insensitive shares
  years ago in support of the Hammerspace NFS server - negative dentry
  caching must be disabled (a lookup for "FILE.TXT" failing must not
  cache a negative entry when "file.txt" exists) and directory change
  invalidation must drop cached case-folded name variants. Such servers
  often operate in multi-protocol environments where a single file
  service instance caters to both NFS and SMB clients, and nfsd needs to
  report case folding properly to participate as a first-class citizen
  there.

  A follow-up series brings fixes for the initial work: the nfsd
  case-info probe now uses kernel credentials, maps -ESTALE to
  NFS3ERR_STALE, and has its cost capped across READDIR entries; the nfs
  client avoids transiently zeroed case capability bits during the probe
  and skips the pathconf probe when neither field is consumed; the
  FS_CASEFOLD_FL semantics are clarified in the UAPI header; and the
  tools UAPI headers are synced"

* tag 'vfs-7.2-rc1.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  nfsd: Cap case-folding probe cost across READDIR entries
  nfsd: Map -ESTALE from case probe to NFS3ERR_STALE
  nfsd: Use kernel credentials for case-info probe
  fs: Clarify FS_CASEFOLD_FL semantics in UAPI header
  nfs: Skip pathconf probe when neither field is consumed
  nfs: Avoid transient zeroed case capability bits during probe
  tools headers UAPI: Sync case-sensitivity flags from linux/fs.h
  ksmbd: Report filesystem case sensitivity via FS_ATTRIBUTE_INFORMATION
  nfsd: Implement NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING
  nfsd: Report export case-folding via NFSv3 PATHCONF
  isofs: Implement fileattr_get for case sensitivity
  vboxsf: Implement fileattr_get for case sensitivity
  nfs: Implement fileattr_get for case sensitivity
  cifs: Implement fileattr_get for case sensitivity
  xfs: Report case sensitivity in fileattr_get
  hfsplus: Report case sensitivity in fileattr_get
  hfs: Implement fileattr_get for case sensitivity
  ntfs3: Implement fileattr_get for case sensitivity
  exfat: Implement fileattr_get for case sensitivity
  fat: Implement fileattr_get for case sensitivity
  ...

Merge tag 'vfs-7.2-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs directory delegations from Christian Brauner:
"This contains the VFS prerequisites for supporting directory
  delegations in nfsd via CB_NOTIFY callbacks.

  The filelock core gains support for ignoring delegation breaks for
  directory change events together with an inode_lease_ignore_mask()
  helper, and fsnotify gains fsnotify_modify_mark_mask() and a
  FSNOTIFY_EVENT_RENAME data type.

  With this in place nfsd can request delegations on directories and set
  up inotify watches to trigger sending CB_NOTIFY events to clients
  instead of having every directory change break the delegation.

  New tracepoints are added to fsnotify() and to the start of
  break_lease(), and trace_break_lease_block() is passed the currently
  blocking lease instead of the new one.

  A follow-up fix moves the LEASE_BREAK_* flags out of
  #ifdef CONFIG_FILE_LOCKING to fix the build for CONFIG_FILE_LOCKING=n
  configurations"

* tag 'vfs-7.2-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  filelock: move LEASE_BREAK_* flags out of #ifdef CONFIG_FILE_LOCKING
  fsnotify: add FSNOTIFY_EVENT_RENAME data type
  fsnotify: add fsnotify_modify_mark_mask()
  fsnotify: new tracepoint in fsnotify()
  filelock: add an inode_lease_ignore_mask helper
  filelock: add a tracepoint to start of break_lease()
  filelock: add support for ignoring deleg breaks for dir change events
  filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"

Merge tag 'vfs-7.2-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
"This extends the lockless ->i_count handling.

  iput() could already decrement any value greater than one locklessly
  but acquiring a reference always required taking inode->i_lock. Now
  acquiring a reference is lockless as long as the count was already at
  least 1, i.e., only the 0->1 and 1->0 transitions take the lock.

  This avoids the lock for the common cases of nfs calling into the
  inode hash and btrfs using igrab(). Cleanup-wise icount_read_once() is
  added to line up with inode_state_read_once() and the open-coded
  ->i_count loads across the tree are converted, and ihold() is
  relocated and tidied up.

  On top of that some stale lock ordering annotations are retired from
  the inode hash code: iunique() no longer takes the hash lock since the
  inode hash became RCU-searchable and s_inode_list_lock is no longer
  taken under the hash lock either"

* tag 'vfs-7.2-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: retire stale lock ordering annotations from inode hash
  fs: allow lockless ->i_count bumps as long as it does not transition 0->1
  fs: relocate and tidy up ihold()
  fs: add icount_read_once() and stop open-coding ->i_count loads

Merge tag 'vfs-7.2-rc1.exportfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull exportfs updates from Christian Brauner:
"This cleans up the exportfs support for block-style layouts that
  provide direct block device access: the operations for layout-based
  block device access are split out of struct export_operations into a
  separate header, ->commit_blocks() no longer takes a struct iattr
  argument, and the way support for layout-based block device access is
  detected is reworked.

  nfsd's blocklayout code also stops honoring loca_time_modify. This is
  preparation for supporting export of more than a single device per
  file system"

* tag 'vfs-7.2-rc1.exportfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exportfs,nfsd: rework checking for layout-based block device access support
  exportfs: don't pass struct iattr to ->commit_blocks
  exportfs: split out the ops for layout-based block device access
  nfsd/blocklayout: always ignore loca_time_modify

Merge tag 'vfs-7.2-rc1.kfunc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull bpf filesystem kfunc fix from Christian Brauner:
"The bpf_set_dentry_xattr() and bpf_remove_dentry_xattr() kfuncs locked
  the inode of the supplied dentry without checking whether the dentry
  is negative.

  Passing a negative dentry (e.g., from security_inode_create) caused a
  NULL pointer dereference. Negative dentries now fail with EINVAL. The
  WARN_ON(!inode) in the bpf xattr permission helpers is dropped as well
  since it could be triggered the same way, amounting to a denial of
  service on systems with panic_on_warn enabled"

* tag 'vfs-7.2-rc1.kfunc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  bpf: fix crash in bpf_[set|remove]_dentry_xattr for negative dentries

bpf: Raise maximum call chain depth to 16 frames

Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains
that Rust-BPF requires and update selftests.

Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

smb: client: Use more common code in SMB2_tcon()

Use an additional label so that a bit of common code can be better reused
at the end of this function implementation.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: Use more common error handling code in smb3_reconfigure()

Use an additional label so that a bit of exception handling can be better
reused at the end of this function implementation.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb/client: Fix error code in smb2_aead_req_alloc()

The "*num_sgs" variable is a u32 so "ERR_PTR(*num_sgs)" doesn't work.
We would have to do something similar to the previous line where it's
cast to int and then long. However, it's simpler to store the return in
an int ret variable.

This bug would eventually result in a crash when dereference the invalid
error pointer.

Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Cc: stable@kernel.org
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb/client: clean up a type issue in cifs_xattr_get()

The cifs_xattr_get() function returns type int, not ssize_t so
declare "rc" as int as well. This has no effect on runtime.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb/client: allow FS_IOC_SETFLAGS to clear compression

The CIFS FS_IOC_SETFLAGS path can set FS_COMPR_FL now, but it cannot
clear it again. This can be reproduced on a share backed by a filesystem
that supports compression, for example btrfs exported by Samba:

[compress_share]
vfs objects = btrfs

$ touch test.bin
$ chattr +c test.bin
$ lsattr test.bin
$ chattr -c test.bin

The final chattr -c fails with EOPNOTSUPP, and leaves the remote object
with the compressed attribute still set, because the client always sends
FSCTL_SET_COMPRESSION with COMPRESSION_FORMAT_DEFAULT. That is correct
for setting FS_COMPR_FL, but clearing FS_COMPR_FL requires sending
COMPRESSION_FORMAT_NONE.

Fix this by passing the requested compression state through the
set_compression operation.  The SMB1 and SMB2 helpers no longer hard-code
COMPRESSION_FORMAT_DEFAULT.

When FS_COMPR_FL is set, send COMPRESSION_FORMAT_DEFAULT.  When it is
cleared, send COMPRESSION_FORMAT_NONE.  If the server accepts the request,
update the cached FILE_ATTRIBUTE_COMPRESSED bit under i_lock so
FS_IOC_GETFLAGS reports the new state.

Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb/client: use writable handle for FS_IOC_SETFLAGS compression

Setting the compressed flag on a CIFS mount can fail with -EACCES:

[compress_share]
vfs objects = btrfs

        $ touch test.bin
        $ chattr +c test.bin
        chattr: Permission denied while setting flags on test.bin

This can be reproduced against a Samba share backed by a filesystem that
supports compression, such as btrfs.

FS_IOC_SETFLAGS is issued on the file handle opened by userspace.  chattr
opens the target read-only before setting FS_COMPR_FL, so the SMB client
currently sends FSCTL_SET_COMPRESSION on a handle that may not have
FILE_WRITE_DATA access.  Samba requires FILE_WRITE_DATA for
FSCTL_SET_COMPRESSION and rejects the request.

Use the current handle only if it already has FILE_WRITE_DATA.  Otherwise
try an existing writable handle for the inode.  If none is available, open
a temporary FILE_WRITE_DATA handle for the compression request.

After FSCTL_SET_COMPRESSION succeeds, update the cached compressed
attribute immediately, matching how smb2_set_sparse() updates
FILE_ATTRIBUTE_SPARSE_FILE after a successful FSCTL_SET_SPARSE.

Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb/client: always return a value for FS_IOC_GETFLAGS

Currently, repeated lsattr calls on a regular CIFS file without the
compressed attribute may show random flags:

$ touch test.bin
$ lsattr test.bin
s-S-ia-A-EjI---------m test.bin
$ lsattr test.bin
------d-cEjI---------m test.bin

The lsattr reproducer depends on the previous contents of its userspace
buffer, so it may not reproduce on every setup. A deterministic
reproducer is to initialize the ioctl argument before FS_IOC_GETFLAGS
on a file without the compressed attribute:

int flags = 0x7fffffff;
ioctl(fd, FS_IOC_GETFLAGS, &flags);

On an affected kernel, flags remains 0x7fffffff. With the fix, it is
set to 0.

This happens because when the cached inode does not have the compressed
bit set, the CIFS fallback path in FS_IOC_GETFLAGS returns success
without calling put_user() to write the zero flags value into the user
buffer. As a result, the caller observes stale contents from its own
buffer.

Fix this by always writing the visible flags value back to the user
buffer before returning success, even when the value is zero.

Fixes: 64a5cfa6db94 ("Allow setting per-file compression via SMB2/3")
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb/client: update i_blocks after contiguous writes

When a lease allows CIFS to use cached inode attributes, getattr may
return the locally cached attributes instead of revalidating them from
the server. After local writes extend a file, the write path updates the
file size, but i_blocks can remain based on the old allocation size.

For example, while the file is still open after two contiguous writes,
the local block count can remain smaller than the written range:

        after first write:   st_size = 4096,  st_blocks = 7
        after second write:  st_size = 12288, st_blocks = 21
        after close:         st_size = 12288, st_blocks = 24

This can make a fully written file look sparse:

        i_blocks * 512 < i_size

and can cause swap activation to reject a valid write-created swapfile
as having holes. This results in xfstests skipping swap-related tests
on CIFS mounts:

generic/472         [not run] swapfiles are not supported
generic/494         [not run] swapfiles are not supported
generic/497         [not run] swapfiles are not supported
generic/569         [not run] swapfiles are not supported
generic/636         [not run] swapfiles are not supported
generic/643         [not run] swapfiles are not supported

Update the local i_blocks estimate after successful writes, but only
when the write starts at or before the currently known allocated range.
This lets sequential writes grow i_blocks while avoiding treating
write-past-EOF holes as allocated.

Skip the local estimate for files that are already marked sparse, since
their allocation needs to come from the server rather than from a
contiguous-write estimate.

Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix races in cifsd thread creation

The cifsd demultiplex thread can run and access tcp_ses before the parent
thread has finished populating tcp_ses, which the worker thread accesses
locklessly.

Also, the kthread_run macro may start the thread before returning the
thread pointer. Because the pointer is part of the structure that the
thread can access, if the kernel is preempted after the thread is spawned,
but before the thread pointer is populated and the thread attempts to exit,
it will sleep, waiting for a SIGKILL signal.

Fix this by moving creation of the thread to after all of tcp_ses'es
fields are populated, and spawning the thread last, using a split
kthread_create/wake_up_process logic.

Signed-off-by: Fredric Cover <fredric.cover.lkernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: validate full SID length in security descriptors

parse_sid() only verified that the fixed SID header fit in the
returned security descriptor, but did not verify that the full SID
body described by num_subauth was present.

A malicious server can return a truncated owner or group SID whose
header lies within the descriptor buffer while sub_auth[] extends
past the end of the allocation, leading to an out-of-bounds read
when the client later parses or copies that SID.

Validate the full SID body in parse_sid(), centralize owner/group SID
lookup and bounds checking in sid_from_sd(), and use that validation
in parse_sec_desc(), build_sec_desc(), and copy_sec_desc() before
sub_auth[] is accessed.

Signed-off-by: Qihang <q.h.hack.winter@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: resolve SWN tcon from live registrations

cifs_swn_notify() looks up a witness registration by id under
cifs_swnreg_idr_mutex, drops the mutex, and then uses the registration's
cached tcon pointer.  That pointer is not a lifetime reference, and it is
not a stable representative once cifs_get_swn_reg() lets multiple tcons
for the same net/share name share one registration id.

A same-share second mount can keep the cifs_swn_reg alive after the first
tcon unregisters and is freed.  The registration then still points at the
freed first tcon, so taking tc_lock or incrementing tc_count through
swnreg->tcon only moves the use-after-free earlier.  Taking tc_lock while
holding cifs_swnreg_idr_mutex also violates the documented CIFS lock
order.

Fix this by making the registration store only the stable witness
identity: id, net name, share name, and notify flags.  When a notify
arrives, copy that identity under cifs_swnreg_idr_mutex, drop the mutex,
then find and pin a live witness tcon that currently matches the net/share
pair under the normal cifs_tcp_ses_lock -> tc_lock order.  The notification
path uses that pinned tcon directly and drops the reference when done.

Registration and unregister messages now use the live tcon passed by the
caller instead of a cached tcon in the registration.  The final unregister
send is folded into cifs_swn_unregister() while the registration is still
protected by cifs_swnreg_idr_mutex.  This removes the previous
find/drop/reacquire raw-pointer window.  The release path only removes the
idr entry and frees the stable identity strings.

This preserves the intended one-registration/many-tcon behavior: a
registration id represents a net/share pair, and notify handling acts on a
live representative selected at use time.  It also preserves CLIENT_MOVE
ordering for the representative tcon because the old-IP unregister is sent
before cifs_swn_register() sends the new-IP register.

Fixes: fed979a7e082 ("cifs: Set witness notification handler for messages from userspace daemon")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: remove all cifs files before kill super

Cifs files may be put into fileinfo_put_wq during umounting cifs.
After umount done, cifsFileInfo_put_final is called, which cause
following BUG:

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[  134.222152]  list_lru_add+0x64/0x1a0
[  134.222399]  ? cifs_put_tcon+0x171/0x340 [cifs]
[  134.222772]  d_lru_add+0x44/0x60
[  134.222997]  dput+0x1fc/0x210
[  134.223213]  cifsFileInfo_put_final+0x11a/0x140 [cifs]
[  134.223576]  process_one_work+0x17c/0x320
[  134.223843]  worker_thread+0x188/0x280
[  134.224084]  ? __pfx_worker_thread+0x10/0x10
[  134.224366]  kthread+0xcc/0x100
[  134.224576]  ? __pfx_kthread+0x10/0x10
[  134.224827]  ret_from_fork+0x30/0x50
[  134.225063]  ? __pfx_kthread+0x10/0x10
[  134.225328]  ret_from_fork_asm+0x1b/0x30

This can be reproduce by following:
unshare -n bash -c "
mkdir -p ${CIFS_MNT}
ip netns attach root 1
ip link add eth0 type veth peer veth0 netns root
ip link set eth0 up
ip -n root link set veth0 up
ip addr add 192.168.0.2/24 dev eth0
ip -n root addr add 192.168.0.1/24 dev veth0
ip route add default via 192.168.0.1 dev eth0
ip netns exec root sysctl net.ipv4.ip_forward=1
ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o
${DEV} -j MASQUERADE
mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o
vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
touch ${CIFS_MNT}/a.txt
ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o
${DEV} -j MASQUERADE
"
umount ${CIFS_MNT}

Fixes: 340cea84f691 ("cifs: open files should not hold ref on superblock")
Signed-off-by: Jian Zhang <zhangjian496@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix conflicting option validation for new mount API

Apply conflicting option validation consistently across all the new
mount API paths, for both mount and remount.

Some checks were only applied during initial mount validation, while
others were handled during option parsing, causing mount and
remount/reconfigure to behave differently.

Move the conflicting option checks into smb3_handle_conflicting_options()
and call it from the common validation paths, including for
multichannel/max_channels handling.

Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api")
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: invalidate cfid on unlink/rename/rmdir

Today we do not invalidate the cached_dirent or the entire
parent cfid when a dentry in a dir has been removed/moved.

This change invalidates the parent cfid so that we don't serve
directory contents from the cache.

Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

ALSA: timer: Fix racy timeri->timer changes with rwlock

Although we've covered the races around the timer object assignment
and release for timer instances, there are still races at starting or
stopping the timer instance.  They refer to timeri->timer without
lock, hence they can still trigger UAFs.

For addressing it, this patch changes the existing slave_active_lock
spinlock to timeri_lock rwlock.  It's a global rwlock applied as
read-lock when snd_timer_start() & co are called as well as
snd_timeri_timer_get() is called.  In turn, the places where
timeri->timer is assigned or released are covered by the write-lock.

The patch replaces spinlock_irqsave with spinlock in a couple of
spaces because they are now already protected by timeri_lock, too.

Reported-by: Kyle Zeng <kylebot@openai.com>
Link: https://patch.msgid.link/20260614090714.773216-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: core: Fix unintuitive behavior of snd_power_ref_and_wait()

snd_power_ref_and_wait() takes the power refcount and doesn't leave it
no matter whether it returns an error or not. However, the majority
of callers don't expect but just returns without unreferencing in the
caller side upon errors.

For addressing the potential refcount unbalance, rather correct the
behavior of snd_power_ref_wait() to unreference upon returning an
error.

Note that the problem above is likely negligible; the function returns
an error only when the sound card is being shutdown, hence it doesn't
matter about the power refcount any longer at such a state.

Fixes: e94fdbd7b25d ("ALSA: control: Track in-flight control read/write/tlv accesses")
Reported-by: WenTao Liang <vulab@iscas.ac.cn>
Closes: https://lore.kernel.org/20260612022121.14329-1-vulab@iscas.ac.cn
Link: https://patch.msgid.link/20260614090507.772540-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Linux 7.1

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux

Pull ARM fixes from Russell King:

- Avoid KASAN instrumentation of half-word IO

- Use a byte load for KASAN shadow stack

- Fix kexec and hibernation with PAN

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
  ARM: 9476/1: mm: fix kexec and hibernation with CONFIG_CPU_TTBR0_PAN
  ARM: 9475/1: entry: use byte load for KASAN VMAP stack shadow
  ARM: 9474/1: io: avoid KASAN instrumentation of raw halfword I/O

geneve: Fix off-by-one comparing with GRO_LEGACY_MAX_SIZE

GRO_LEGACY_MAX_SIZE = 65536; total_len being 65536 is too big to fit
into a u16. As can be seen in skb_gro_receive, packets bigger or equal
to gro_max_size (or GRO_LEGACY_MAX_SIZE) are dropped with -E2BIG. Apply
the same boundary to geneve_post_decap_hint to avoid writing 65536 to a
16-bit iph->tot_len field with an overflow.

Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path")
Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260611192955.604661-3-alice.kernel@fastmail.im
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: act_csum: don't mangle UDP tunnel GSO packets

Similar to commit add641e7dee3 ("sched: act_csum: don't mangle TCP and
UDP GSO packets"), UDP tunnel GSO packets going through act_csum
shouldn't have their checksum calculated at this point, because it will
be done after segmentation. Setting the checksum in act_csum modifies
skb->ip_summed and prevents inner IP csum offload from kicking in,
resulting in a packet with a bad checksum.

Add UDP tunnel GSO packets to the exceptions, and also add UDP GSO
(SKB_GSO_UDP_L4), as the same logic as in the commit mentioned above
applies to UDP GSO too.

Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260611192955.604661-2-alice.kernel@fastmail.im
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'for-next/sysregs' into for-next/core

* for-next/sysregs:
arm64/sysreg: Add HDBSS related register information

Merge branch 'for-next/selftests' into for-next/core

* for-next/selftests:
  kselftest/arm64: Add 2025 dpISA coverage to hwcaps
  kselftest/arm64: Add tests for POR_EL0 save/reset/restore
  kselftest/arm64: Move/add POE helpers to test_signals_utils.h
  kselftest/arm64: Add POE as a feature in the signal tests
  selftests/mm: Fix resv_sz when parsing arm64 signal frame

Merge branch 'for-next/perf' into for-next/core

* for-next/perf:
  perf/arm-cmn: Fix DVM node events
  perf: qcom: Unify user-visible "Qualcomm" name
  MAINTAINERS: Update HiSilicon PMU driver maintainer to Yushan Wang

Merge branch 'for-next/mpam' into for-next/core

* for-next/mpam:
arm_mpam: Update architecture version check for MPAM MSC
arm64: cpufeature: Add support for the MPAM v0.1 architecture version

Merge branch 'for-next/mm' into for-next/core

* for-next/mm: (24 commits)
  Revert "arm64: mm: Unmap kernel data/bss entirely from the linear map"
  Revert "arm64: mm: Defer remap of linear alias of data/bss"
  arm64/mm: Rename ptdesc_t
  arm64: mm: Defer remap of linear alias of data/bss
  KVM: arm64: Omit tag sync on stage-2 mappings of the zero page
  arm64: Avoid double evaluation of __ptep_get()
  kasan: Move generic KASAN page tables out of BSS too
  arm64: Rename page table BSS section to .bss..pgtbl
  arm64: mm: Unmap kernel data/bss entirely from the linear map
  arm64: mm: Map the kernel data/bss read-only in the linear map
  mm: Make empty_zero_page[] const
  sh: Drop cache flush of the zero page at boot
  powerpc/code-patching: Avoid r/w mapping of the zero page
  arm64: mm: Don't abuse memblock NOMAP to check for overlaps
  arm64: Move fixmap and kasan page tables to end of kernel image
  arm64: mm: Permit contiguous attribute for preliminary mappings
  arm64: kfence: Avoid NOMAP tricks when mapping the early pool
  arm64: mm: Permit contiguous descriptors to be manipulated
  arm64: mm: Preserve non-contiguous descriptors when mapping DRAM
  arm64: mm: Preserve existing table mappings when mapping DRAM
  ...

Merge branch 'for-next/misc' into for-next/core

* for-next/misc:
  arm64: arch_timer: reuse arch_timer_read_cnt{p,v}ct_el0() helpers
  arm64: patching: replace min_t with min in __text_poke
  ARM64: remove unnecessary architecture-specific <asm/device.h>
  arm64: Implement _THIS_IP_ using inline asm
  arm64: panic from init_IRQ if IRQ handler stacks cannot be allocated
  arm64: smp: Do not mark secondary CPUs possible under nosmp
  arm64/daifflags: Make local_daif_*() helpers __always_inline

Merge branch 'for-next/fpsimd-cleanups' into for-next/core

* for-next/fpsimd-cleanups:
  arm64: fpsimd: Remove <asm/fpsimdmacros.h>
  arm64: fpsimd: Move SME save/restore inline
  arm64: fpsimd: Move sve_flush_live() inline
  arm64: fpsimd: Move SVE save/restore inline
  arm64: fpsimd: Use opaque type for SME state
  arm64: fpsimd: Use opaque type for SVE state
  arm64: fpsimd: Move fpsimd save/restore inline
  arm64: fpsimd: Split FPSR/FPCR from SVE save/restore
  arm64: sysreg: Add FPCR and FPSR
  arm64: fpsimd: Move sve_get_vl() and sme_get_vl() inline
  arm64: fpsimd: Use assembler for baseline SME instructions
  arm64: fpsimd: Use assembler for SVE instructions
  arm64: fpsimd: Remove sve_set_vq() and sme_set_vq()
  arm64: fpsimd: Fold sve_init_regs() into do_sve_acc()
  KVM: arm64: pkvm: Remove struct cpu_sve_state
  KVM: arm64: pkvm: Save host FPMR in host cpu context
  KVM: arm64: Don't override FFR save/restore argument
  KVM: arm64: Don't include <asm/fpsimdmacros.h>
  arm64: fpsimd: Fix type mismatch in sme_{save,load}_state()
  arm64: fpsimd: Fix type mismatch in sve_{save,load}_state()

Merge branch 'for-next/errata' into for-next/core

* for-next/errata:
  arm64: errata: Mitigate TLBI errata on Microsoft Azure Cobalt 100 CPU
  arm64: errata: Mitigate TLBI errata on NVIDIA Olympus CPU
  arm64: errata: Mitigate TLBI errata on various Arm CPUs
  arm64: cputype: Add C1-Premium definitions
  arm64: cputype: Add C1-Ultra definitions
  arm64: kernel: Disable CNP on HiSilicon HIP09
  arm64: cpufeature: Add WORKAROUND_DISABLE_CNP capability
  arm64: proton-pack: use sysfs_emit in sysfs show functions
  arm64: errata: Reformat table for IDs

Merge branch 'for-next/cpufeature' into for-next/core

* for-next/cpufeature:
arm64: Document SVE constraints on new hwcaps
arm64/cpufeature: Define hwcaps for 2025 dpISA features

netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them

Update nft_dup and nft_fwd to use the nf_dev_xmit_recursion() helpers.
This patch also disables BH when transmitting the skb to address a
possible migration to different CPU leading to imbalanced decrementation
of the recursion counters.

This is modeled after Florian Westphal's dev_xmit_recursion*() API
available since commit 97cdcf37b57e ("net: place xmit recursion in
softnet data") according to its current state in the tree.

Fixes: 1d47b55b36d2 ("netfilter: nft_fwd_netdev: use recursion counter in neigh egress path")
Fixes: f37ad9127039 ("netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ipvs: fix doc syntax for conn_max sysctl

Fix the docutils error reported by kernel test robot
for the new conn_max sysctl:

Documentation/networking/ipvs-sysctl.rst:76: WARNING: Block quote ends
without a blank line; unexpected unindent. [docutils]
Documentation/networking/ipvs-sysctl.rst:76: ERROR: Unexpected section
title or transition.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606071851.Dc1H7hOO-lkp@intel.com/
Fixes: 4a15044a2b06 ("ipvs: add conn_max sysctl to limit connections")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flowtable: bail out if forward path cannot be discovered

If forward path discovery fails for any reason or netdevice is not
registered for this flowtable, then bail out to classic forwarding path
rather than providing incomplete forwarding path.

Update the existing forward path parser functions to report an error
so the flow_offload expressions gives up on setting up the flowtable
entry.

Link: https://sashiko.dev/#/patchset/20260607094954.48892-15-pablo%40netfilter.org?part=14
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: check NULL when retrieving ct extension

nf_ct_ext_find() might return NULL if ct extension is not found.

Add also the null checks to:

- nfct_help()
- nfct_help_data()
- nfct_seqadj()
- nfct_nat()

This is defensive, for safety reasons.

nf_ct_ext_find() used to return NULL if the extension is stale for
unconfirmed conntracks if the genid validation fails.

Skip NULL check in nf_nat_inet_fn() given this is valid to be NULL
for non-initialized ct nat extensions.

While at it, fetch ct helper area in nf_ct_expect_related_report() only
once and pass it on to other ancilliary functions. Replace WARN_ON()
by WARN_ON_ONCE() in nf_ct_unlink_expect_report().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: gc and rcu fixes

Another drive-by AI review:

1) tree_gc_worker fails to wrap around after it can't find more pending
   work.  Update data->gc_tree unconditionally.  If its 0, start from
   the first pending tree (which can be 0).

2) tree_gc_worker() iterates the rbtree without lock. This is never
   safe.  Move iteration under the spinlock.  If this takes too long
   (resched needed), save key of next node, drop lock, resched, re-lock,
   then search for the key (node).  In very rare cases this node might
   no longer exist, in that case we can just wait for next gc.

3) use disable_work_sync(), we don't want any restarts.

4) module exit function needs rcu_barrier before we zap the kmem cache.

Fixes: 5c789e131cbb ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Closes: https://sashiko.dev/#/patchset/20260525182924.28456-1-fw%40strlen.de
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: add sequence counter to detect tree modifications

There a two issues with traversal:
1. Key lookup (tree search) cannot detect concurrent modifications and may
not find a result in case of parallel modification.

2. Worker does a lockless iteration. This is never safe.

Add a sequence counter and re-do the lookup under lock in case the
tree was modified / seqcount changed.

gc_worker bugs are addressed in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: split count_tree_node rbtree walk into helper

Add find_tree_node() helper that fetches a matching rbtree node.

This is used by followup patch to optionally search the tree again while
preventing concurrent updates via tree lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: use per nf_conncount_data spinlocks

This change replaces the rb_root with a new container structure.
Instead of an array of locks shared by all nf_conncount_data objects,
each tree gains its own dedicated lock.

Downside: nf_conncount_data increases in size.  Before this change:
struct nf_conncount_data {
        [..]
        /* --- cacheline 33 boundary (2112 bytes) was 16 bytes ago --- */
        unsigned int               gc_tree;              /*  2128     4 */
        /* size: 2136, cachelines: 34, members: 7 */
        /* padding: 4 */

After:
        /* size: 4184, cachelines: 66, members: 7 */
        /* padding: 4 */

On LOCKDEP enabled kernels, this is even worse:
/* size: 18560, cachelines: 290, members: 7 */

(due to lockdep map in each spinlock).

For this reason also switch to kvzalloc.  The zeroing variant is needed
to not start with random (heap memory content) in the ->pending_trees
bitmap.

Followup patch will add and use a sequence counter.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: callers must hold rcu read lock

rcu_derefence_raw() should not have been used here, it concealed this bug.
Its used because struct rb_node lacks __rcu annotated pointers, so plain
rcu_derefence causes sparse warnings.

The major tradeoff is that rcu_derefence_raw() doesn't warn when the caller
isn't in a rcu read section.

Extend the rcu read lock scope accordingly and cause sparse warnings,
those warnings are the lesser evil.

Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
Closes: https://sashiko.dev/#/patchset/20260603230610.7900-1-fw%40strlen.de
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths

Replace raw warning macros with DEBUG_NET_WARN_ON_ONCE across the
nf_tables API, core engine, and expression evaluations. This prevents
unnecessary system panics when panic_on_warn=1 is enabled in production
systems.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ipvs: Replace use of system_unbound_wq with system_dfl_long_wq

This patch continues the effort to refactor workqueue APIs, which has
begun with the changes introducing new workqueues and a new
alloc_workqueue flag:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior
of workqueues to become unbound by default so that their workload
placement is optimized by the scheduler.

Before that to happen, workqueue users must be converted to the better
named new workqueues with no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This way the old obsolete workqueues (system_wq, system_unbound_wq) can
be removed in the future.

This specific work is considered long, so enqueue it using
system_dfl_long_wq instead of system_dfl_wq.

Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ALSA: seq: avoid stale FIFO cells during resize

snd_seq_fifo_resize() still needs to publish the replacement pool
before it waits for FIFO users. A blocking snd_seq_read() holds
f->use_lock while it sleeps, so concurrent senders must be able to
queue to the new pool and wake that reader instead of failing against a
closing old pool.

However, snd_seq_fifo_event_in() duplicates an event before it takes
f->lock, and snd_seq_read() can dequeue a cell and later call
snd_seq_fifo_cell_putback() if copy_to_user() or
snd_seq_expand_var_event() fails. If resize swaps f->pool and detaches
oldhead in between, either path can relink an old-pool cell after the
snapshot. That stale cell sits outside the drained oldhead list, keeps
oldpool->counter elevated, and can leave snd_seq_pool_delete() waiting
for the retired pool to drain.

Keep the existing swap-before-wait ordering in snd_seq_fifo_resize(),
but reject stale cells before any FIFO relink. Revalidate event-in cells
under f->lock and retry them against the published replacement pool, and
free stale putback cells instead of linking them back into the FIFO.

The buggy scenario involves two paths, with each column showing the
order within that path:

resize path:                    relink path:
1. Allocate newpool.             1. Take f->use_lock.
2. Swap f->pool to newpool and   2. Duplicate or dequeue an old-pool
   detach oldhead.                  cell before oldpool closes.
3. Mark oldpool closing and      3. Reach a later relink point after
   wait for FIFO users.             resize published newpool.
4. Free oldhead and delete       4. Relink the old-pool cell after
   oldpool.                         resize detached oldhead.
                                 5. Drop f->use_lock.

The reproducer reports a resize ioctl blocked in the expected pool
teardown path:

signal: resize iteration=98 target_pool=4 exceeded 250ms
        (elapsed=251ms)
diagnostic: resize_tid=651 wchan=snd_seq_pool_done
diagnostic: resize_tid=651 stack=
  snd_seq_pool_done+0x5b/0x140
  snd_seq_pool_delete+0x7a/0x90
  snd_seq_fifo_resize+0x193/0x1e0
  snd_seq_ioctl_set_client_pool+0x214/0x260
  snd_seq_ioctl+0x119/0x540
  __x64_sys_ioctl+0xd1/0x120
  do_syscall_64+0xbb/0x2f0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

A second run with larger pools hit the same target path:

signal: resize iteration=32 target_pool=64 exceeded 250ms
        (elapsed=251ms)
diagnostic: resize_tid=663 wchan=snd_seq_pool_done
diagnostic: resize_tid=663 stack=
  snd_seq_pool_done+0x5b/0x140
  snd_seq_pool_delete+0x7a/0x90
  snd_seq_fifo_resize+0x193/0x1e0
  snd_seq_ioctl_set_client_pool+0x214/0x260
  snd_seq_ioctl+0x119/0x540
  __x64_sys_ioctl+0xd1/0x120
  do_syscall_64+0xbb/0x2f0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 2d7d54002e39 ("ALSA: seq: Fix race during FIFO resize")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Link: https://patch.msgid.link/20260614004801.3507773-2-zzzccc427@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: seq: oss: Serialize readq reset state with q->lock

snd_seq_oss_readq_clear() resets qlen, head, and tail without
q->lock even though the normal reader and producer paths serialize the
same ring state under that spinlock. A reset can therefore race
snd_seq_oss_readq_free() or snd_seq_oss_readq_put_event() and leave
stale records in the queue, drop freshly queued ones, or report the
wrong readiness after wakeup. KCSAN reports a data race between
snd_seq_oss_readq_clear() and snd_seq_oss_readq_free().

Take q->lock while clearing the ring and resetting input_time. Factor
the enqueue logic into a caller-locked helper so
snd_seq_oss_readq_put_timestamp() updates its suppression state under
the same lock instead of racing the reset path.

The buggy scenario involves two paths, with each column showing the
order within that path:

reset path:                      locked readq updater:
1. snd_seq_oss_reset() or        1. A reader or callback producer
   release reaches                  takes q->lock on the same queue.
   snd_seq_oss_readq_clear().
2. snd_seq_oss_readq_clear()     2. The updater tests or modifies
   resets qlen, head, tail,         qlen, head, and tail.
   and input_time.
3. snd_seq_oss_readq_clear()     3. The updater completes its
   wakes sleepers on                read-modify-write sequence.
   q->midi_sleep.
4. Without q->lock, the reset    4. The resulting ring state drives
   can overlap the locked           later reads and readiness.
   update.

KCSAN reports:

BUG: KCSAN: data-race in snd_seq_oss_readq_clear /
snd_seq_oss_readq_free

write to 0xffff8881069fe608 of 4 bytes by task 120516 on cpu 0:
  snd_seq_oss_readq_free+0x6c/0x80
  snd_seq_oss_read+0xcb/0x250
  odev_read+0x38/0x60
  vfs_read+0xff/0x600
  ksys_read+0xb4/0x140
  __x64_sys_read+0x46/0x60
  do_syscall_64+0xbb/0x2f0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881069fe608 of 4 bytes by task 120517 on cpu 1:
  snd_seq_oss_readq_clear+0x1f/0x90
  snd_seq_oss_reset+0xa7/0xf0
  snd_seq_oss_ioctl+0x6f6/0x7e0
  odev_ioctl+0x56/0xc0
  __x64_sys_ioctl+0xd1/0x120
  do_syscall_64+0xbb/0x2f0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00000001 -> 0x00000000

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Link: https://patch.msgid.link/20260614004801.3507773-1-zzzccc427@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>