Andrii Nakryiko [Thu, 26 Jun 2025 17:28:51 +0000 (10:28 -0700)]
Merge branch 'support-array-presets-in-veristat'
Mykyta Yatsenko says:
====================
Support array presets in veristat
From: Mykyta Yatsenko <yatsenko@meta.com>
This patch series implements support for array variable presets in
veristat. Currently users can set values to global variables before
loading BPF program, but not for arrays. With this change array
elements are supported as well, for example:
```
sudo ./veristat set_global_vars.bpf.o -G "arr[0] = 1"
```
v4 -> v5
* Simplify array elements offset calculation using btf__resolve_size
* Support white spaces in array indexes, e.g. arr [ 0 ]
* Rename btf_find_member into adjust_var_secinfo_member
v3 -> v4
* Rework implementation to support multi dimensional arrays
* Rework data structures for storing parsed presets
v2 -> v3
* Added more negative tests
* Fix mem leak
* Other small fixes
v1 -> v2
* Support enums as indexes
* Separating parsing logic from preset processing
* Add more tests
====================
Mykyta Yatsenko [Wed, 25 Jun 2025 16:59:04 +0000 (17:59 +0100)]
selftests/bpf: Test array presets in veristat
Modify existing veristat tests to verify that array presets are applied
as expected.
Introduce few negative tests as well to check that common error modes
are handled.
Mykyta Yatsenko [Wed, 25 Jun 2025 16:59:03 +0000 (17:59 +0100)]
selftests/bpf: Support array presets in veristat
Implement support for presetting values for array elements in veristat.
For example:
```
sudo ./veristat set_global_vars.bpf.o -G "arr[3] = 1"
```
Arrays of structures and structure of arrays work, but each individual
scalar value has to be set separately: `foo[1].bar[2] = value`.
Mykyta Yatsenko [Wed, 25 Jun 2025 16:59:02 +0000 (17:59 +0100)]
selftests/bpf: Separate var preset parsing in veristat
Refactor var preset parsing in veristat to simplify implementation.
Prepare parsed variable beforehand so that parsing logic is separated
from functionality of calculating offsets and searching fields.
Introduce rvalue struct, storing either int or enum (string value),
will be reused in the next patch, extract parsing rvalue into a
separate function.
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3
Cross-merge BPF, perf and other fixes after downstream PRs.
It restores BPF CI to green after critical fix
commit bc4394e5e79c ("perf: Fix the throttle error of some clock events")
====================
bpf: Add kfuncs for read-only string operations
String operations are commonly used in programming and BPF programs are
no exception. Since it is cumbersome to reimplement them over and over,
this series introduce kfuncs which provide the most common operations.
For now, we only limit ourselves to functions which do not copy memory
since these usually introduce undefined behaviour in case the
source/destination buffers overlap which would have to be prevented by
the verifier.
The kernel already contains implementations for all of these, however,
it is not possible to use them from BPF context. The main reason is that
the verifier is not able to check that it is safe to access the entire
string and that the string is null-terminated and the function won't
loop forever. Therefore, the operations are open-coded using
__get_kernel_nofault instead of plain dereference and bounded to at most
XATTR_SIZE_MAX characters to make them safe. That allows to skip all the
verfier checks for the passed-in strings as safety is ensured
dynamically.
All of the proposed functions return integers, even those that normally
(in the kernel or libc) return pointers into the strings. The reason is
that since the strings are generally treated as unsafe, the pointers
couldn't be dereferenced anyways. So, instead, we return an index to the
string and let user decide what to do with it. The integer APIs also
allow to return various error codes when unexpected situations happen
while processing the strings.
The series include both positive and negative tests using the kfuncs.
Changelog
---------
Changes in v8:
- Return -ENOENT (instead of -1) when "item not found" for relevant
functions (Alexei).
- Small adjustments of the string algorithms (Andrii).
- Adapt comments to kernel style (Alexei).
Changes in v7:
- Disable negative tests passing NULL and 0x1 to kfuncs on s390 as they
aren't relevant (see comment in string_kfuncs_failure1.c for details).
Changes in v6:
- Improve the third patch which allows to use macros in __retval in
selftests. The previous solution broke several tests.
Changes in v5:
- Make all kfuncs return integers (Andrii).
- Return -ERANGE when passing non-kernel pointers on arches with
non-overlapping address spaces (Alexei).
- Implement "unbounded" variants using the bounded ones (Andrii).
- Add more negative test cases.
Changes in v4 (all suggested by Andrii):
- Open-code all the kfuncs, not just the unbounded variants.
- Introduce `pagefault` lock guard to simplify the implementation
- Return appropriate error codes (-E2BIG and -EFAULT) on failures
- Const-ify all arguments and return values
- Add negative test-cases
Changes in v3:
- Open-code unbounded variants with __get_kernel_nofault instead of
dereference (suggested by Alexei).
- Use the __sz suffix for size parameters in bounded variants (suggested
by Eduard and Alexei).
- Make tests more compact (suggested by Eduard).
- Add benchmark.
====================
Viktor Malik [Thu, 26 Jun 2025 06:08:31 +0000 (08:08 +0200)]
selftests/bpf: Add tests for string kfuncs
Add both positive and negative tests cases using string kfuncs added in
the previous patches.
Positive tests check that the functions work as expected.
Negative tests pass various incorrect strings to the kfuncs and check
for the expected error codes:
-E2BIG when passing too long strings
-EFAULT when trying to read inaccessible kernel memory
-ERANGE when passing userspace pointers on arches with non-overlapping
address spaces
A majority of the tests use the RUN_TESTS helper which executes BPF
programs with BPF_PROG_TEST_RUN and check for the expected return value.
An exception to this are tests for long strings as we need to memset the
long string from userspace (at least I haven't found an ergonomic way to
memset it from a BPF program), which cannot be done using the RUN_TESTS
infrastructure.
Viktor Malik [Thu, 26 Jun 2025 06:08:30 +0000 (08:08 +0200)]
selftests/bpf: Allow macros in __retval
Allow macro expansion for values passed to the `__retval` and
`__retval_unpriv` attributes. This is especially useful for testing
programs which return various error codes.
With this change, the code for parsing special literals can be made
simpler, as the literals are defined via macros. The only exception is
INT_MIN which expands to (-INT_MAX -1), which is not single number and
cannot be parsed by strtol. So, we instead use a prefixed literal
_INT_MIN in __retval and handle it separately (assign the expected
return to INT_MIN). Also, strtol cannot handle the "ll" suffix so change
the value of POINTER_VALUE from 0xcafe4all to 0xbadcafe.
Viktor Malik [Thu, 26 Jun 2025 06:08:29 +0000 (08:08 +0200)]
bpf: Add kfuncs for read-only string operations
String operations are commonly used so this exposes the most common ones
to BPF programs. For now, we limit ourselves to operations which do not
copy memory around.
Unfortunately, most in-kernel implementations assume that strings are
%NUL-terminated, which is not necessarily true, and therefore we cannot
use them directly in the BPF context. Instead, we open-code them using
__get_kernel_nofault instead of plain dereference to make them safe and
limit the strings length to XATTR_SIZE_MAX to make sure the functions
terminate. When __get_kernel_nofault fails, functions return -EFAULT.
Similarly, when the size bound is reached, the functions return -E2BIG.
In addition, we return -ERANGE when the passed strings are outside of
the kernel address space.
Note that thanks to these dynamic safety checks, no other constraints
are put on the kfunc args (they are marked with the "__ign" suffix to
skip any verifier checks for them).
All of the functions return integers, including functions which normally
(in kernel or libc) return pointers to the strings. The reason is that
since the strings are generally treated as unsafe, the pointers couldn't
be dereferenced anyways. So, instead, we return an index to the string
and let user decide what to do with it. This also nicely fits with
returning various error codes when necessary (see above).
Jiawen Wu [Wed, 25 Jun 2025 02:39:24 +0000 (10:39 +0800)]
net: libwx: fix the creation of page_pool
'rx_ring->size' means the count of ring descriptors multiplied by the
size of one descriptor. When increasing the count of ring descriptors,
it may exceed the limit of pool size.
[ 864.209610] page_pool_create_percpu() gave up with errno -7
[ 864.209613] txgbe 0000:11:00.0: Page pool creation failed: -7
Fix to set the pool_size to the count of ring descriptors.
Fixes: 850b971110b2 ("net: libwx: Allocate Rx and Tx resources") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/434C72BFB40E350A+20250625023924.21821-1-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Tue, 24 Jun 2025 18:32:58 +0000 (11:32 -0700)]
net: selftests: fix TCP packet checksum
The length in the pseudo header should be the length of the L3 payload
AKA the L4 header+payload. The selftest code builds the packet from
the lower layers up, so all the headers are pushed already when it
constructs L4. We need to subtract the lower layer headers from skb->len.
Linus Torvalds [Thu, 26 Jun 2025 04:09:02 +0000 (21:09 -0700)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix use-after-free in libbpf when map is resized (Adin Scannell)
- Fix verifier assumptions about 2nd argument of bpf_sysctl_get_name
(Jerome Marchand)
- Fix verifier assumption of nullness of d_inode in dentry (Song Liu)
- Fix global starvation of LRU map (Willem de Bruijn)
- Fix potential NULL dereference in btf_dump__free (Yuan Chen)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: adapt one more case in test_lru_map to the new target_free
libbpf: Fix possible use-after-free for externs
selftests/bpf: Convert test_sysctl to prog_tests
bpf: Specify access type of bpf_sysctl_get_name args
libbpf: Fix null pointer dereference in btf_dump__free on allocation failure
bpf: Adjust free target to avoid global starvation of LRU map
bpf: Mark dentry->d_inode as trusted_or_null
Linus Torvalds [Thu, 26 Jun 2025 03:48:48 +0000 (20:48 -0700)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount fixes from Al Viro:
"Several mount-related fixes"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
userns and mnt_idmap leak in open_tree_attr(2)
attach_recursive_mnt(): do not lock the covering tree when sliding something under it
replace collect_mounts()/drop_collected_mounts() with a safer variant
atm: Release atm_dev_mutex after removing procfs in atm_dev_deregister().
syzbot reported a warning below during atm_dev_register(). [0]
Before creating a new device and procfs/sysfs for it, atm_dev_register()
looks up a duplicated device by __atm_dev_lookup(). These operations are
done under atm_dev_mutex.
However, when removing a device in atm_dev_deregister(), it releases the
mutex just after removing the device from the list that __atm_dev_lookup()
iterates over.
So, there will be a small race window where the device does not exist on
the device list but procfs/sysfs are still not removed, triggering the
splat.
Let's hold the mutex until procfs/sysfs are removed in
atm_dev_deregister().
====================
netlink: specs: enforce strict naming of properties
I got annoyed once again by the name properties in the ethtool spec
which use underscore instead of dash. I previously assumed that there
is a lot of such properties in the specs so fixing them now would
be near impossible. On a closer look, however, I only found 22
(rough grep suggests we have ~4.8k names in the specs, so bad ones
are just 0.46%).
Add a regex to the JSON schema to enforce the naming, fix the few
bad names. I was hoping we could start enforcing this from newer
families, but there's no correlation between the protocol and the
number of errors. If anything classic netlink has more recently
added specs so it has fewer errors.
The regex is just for name properties which will end up visible
to the user (in Python or YNL CLI). I left the c-name properties
alone, those don't matter as much. C codegen rewrites them, anyway.
I'm not updating the spec for genetlink-c. Looks like it has no
users, new families use genetlink, all old ones need genetlink-legacy.
If these patches are merged I will remove genetlink-c completely
in net-next.
====================
Jakub Kicinski [Tue, 24 Jun 2025 21:10:02 +0000 (14:10 -0700)]
netlink: specs: enforce strict naming of properties
Add a regexp to make sure all names which may end up being visible
to the user consist of lower case characters, numbers and dashes.
Underscores keep sneaking into the specs, which is not visible
in the C code but makes the Python and alike inconsistent.
Note that starting with a number is okay, as in C the full
name will include the family name.
For legacy families we can't enforce the naming in the family
name or the multicast group names, as these are part of the
binary uAPI of the kernel.
For classic netlink we need to allow capital letters in names
of struct members. TC has some structs with capitalized members.
Jakub Kicinski [Tue, 24 Jun 2025 21:10:01 +0000 (14:10 -0700)]
netlink: specs: tc: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Jakub Kicinski [Tue, 24 Jun 2025 21:10:00 +0000 (14:10 -0700)]
netlink: specs: rt-link: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Fixes: b2f63d904e72 ("doc/netlink: Add spec for rt link messages") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250624211002.3475021-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 24 Jun 2025 21:09:59 +0000 (14:09 -0700)]
netlink: specs: mptcp: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Fixes: bc8aeb2045e2 ("Documentation: netlink: add a YAML spec for mptcp") Reviewed-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250624211002.3475021-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 24 Jun 2025 21:09:58 +0000 (14:09 -0700)]
netlink: specs: ovs_flow: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Fixes: 93b230b549bc ("netlink: specs: add ynl spec for ovs_flow") Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Link: https://patch.msgid.link/20250624211002.3475021-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 24 Jun 2025 21:09:57 +0000 (14:09 -0700)]
netlink: specs: devlink: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Fixes: 429ac6211494 ("devlink: define enum for attr types of dynamic attributes") Fixes: f2f9dd164db0 ("netlink: specs: devlink: add the remaining command to generate complete split_ops") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250624211002.3475021-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 24 Jun 2025 21:09:56 +0000 (14:09 -0700)]
netlink: specs: dpll: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Fixes: 3badff3a25d8 ("dpll: spec: Add Netlink spec in YAML") Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250624211002.3475021-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 24 Jun 2025 21:09:55 +0000 (14:09 -0700)]
netlink: specs: ethtool: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen replaces special chars in names)
but gives more uniform naming in Python.
Fixes: 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash") Fixes: 46fb3ba95b93 ("ethtool: Add an interface for flashing transceiver modules' firmware") Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250624211002.3475021-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 24 Jun 2025 21:09:54 +0000 (14:09 -0700)]
netlink: specs: fou: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Jakub Kicinski [Tue, 24 Jun 2025 21:09:53 +0000 (14:09 -0700)]
netlink: specs: nfsd: replace underscores with dashes in names
We're trying to add a strict regexp for the name format in the spec.
Underscores will not be allowed, dashes should be used instead.
This makes no difference to C (codegen, if used, replaces special
chars in names) but it gives more uniform naming in Python.
Fixes: 13727f85b49b ("NFSD: introduce netlink stubs") Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20250624211002.3475021-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Horman [Tue, 24 Jun 2025 16:35:12 +0000 (17:35 +0100)]
net: enetc: Correct endianness handling in _enetc_rd_reg64
enetc_hw.h provides two versions of _enetc_rd_reg64.
One which simply calls ioread64() when available.
And another that composes the 64-bit result from ioread32() calls.
In the second case the code appears to assume that each ioread32() call
returns a little-endian value. However both the shift and logical or
used to compose the return value would not work correctly on big endian
systems if this were the case. Moreover, this is inconsistent with the
first case where the return value of ioread64() is assumed to be in host
byte order.
It appears that the correct approach is for both versions to treat the
return value of ioread*() functions as being in host byte order. And
this patch corrects the ioread32()-based version to do so.
This is a bug but would only manifest on big endian systems
that make use of the ioread32-based implementation of _enetc_rd_reg64.
While all in-tree users of this driver are little endian and
make use of the ioread64-based implementation of _enetc_rd_reg64.
Thus, no in-tree user of this driver is affected by this bug.
Anton Protopopov [Wed, 25 Jun 2025 15:16:21 +0000 (15:16 +0000)]
bpf: add btf_type_is_i{32,64} helpers
There are places in BPF code which check if a BTF type is an integer
of particular size. This code can be made simpler by using helpers.
Add new btf_type_is_i{32,64} helpers, and simplify code in a few
files. (Suggested by Eduard for a patch which copy-pasted such a
check [1].)
v1 -> v2:
* export less generic helpers (Eduard)
* make subject less generic than in [v1] (Eduard)
====================
bpf: allow void* cast using bpf_rdonly_cast()
Currently, pointers returned by `bpf_rdonly_cast()` have a type of
"pointer to btf id", and only casts to structure types are allowed.
Access to memory pointed to by these pointers is done through
`BPF_PROBE_{MEM,MEMSX}` instructions and does not produce errors on
invalid memory access.
This patch set extends `bpf_rdonly_cast()` to allow casts to an
equivalent of 'void *', effectively replacing
`bpf_probe_read_kernel()` calls in situations where access to
individual bytes or integers is necessary.
The mechanism was suggested and explored by Andrii Nakryiko in [1].
To help with detecting support for this feature, an
`enum bpf_features` is added with intended usage as follows:
if (bpf_core_enum_value_exists(enum bpf_features,
BPF_FEAT_RDONLY_CAST_TO_VOID))
...
v2: https://lore.kernel.org/bpf/20250625000520.2700423-1-eddyz87@gmail.com/
v2 -> v3:
- dropped direct numbering for __MAX_BPF_FEAT.
v1: https://lore.kernel.org/bpf/20250624191009.902874-1-eddyz87@gmail.com/
v1 -> v2:
- renamed BPF_FEAT_TOTAL to __MAX_BPF_FEAT and moved patch introducing
bpf_features enum to the start of the series (Alexei);
- dropped patch #3 allowing optout from CAP_SYS_ADMIN drop in
prog_tests/verifier.c, use a separate runner in prog_tests/*
instead.
====================
Eduard Zingerman [Wed, 25 Jun 2025 18:24:14 +0000 (11:24 -0700)]
selftests/bpf: check operations on untrusted ro pointers to mem
The following cases are tested:
- it is ok to load memory at any offset from rdonly_untrusted_mem;
- rdonly_untrusted_mem offset/bounds are not tracked;
- writes into rdonly_untrusted_mem are forbidden;
- atomic operations on rdonly_untrusted_mem are forbidden;
- rdonly_untrusted_mem can't be passed as a memory argument of a
helper of kfunc;
- it is ok to use PTR_TO_MEM and PTR_TO_BTF_ID in a same load
instruction.
Eduard Zingerman [Wed, 25 Jun 2025 18:24:13 +0000 (11:24 -0700)]
bpf: allow void* cast using bpf_rdonly_cast()
Introduce support for `bpf_rdonly_cast(v, 0)`, which casts the value
`v` to an untyped, untrusted pointer, logically similar to a `void *`.
The memory pointed to by such a pointer is treated as read-only.
As with other untrusted pointers, memory access violations on loads
return zero instead of causing a fault.
Technically:
- The resulting pointer is represented as a register of type
`PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED` with size zero.
- Offsets within such pointers are not tracked.
- Same load instructions are allowed to have both
`PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED` and `PTR_TO_BTF_ID`
as the base pointer types.
In such cases, `bpf_insn_aux_data->ptr_type` is considered the
weaker of the two: `PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED`.
The following constraints apply to the new pointer type:
- can be used as a base for LDX instructions;
- can't be used as a base for ST/STX or atomic instructions;
- can't be used as parameter for kfuncs or helpers.
These constraints are enforced by existing handling of `MEM_RDONLY`
flag and `PTR_TO_MEM` of size zero.
Eduard Zingerman [Wed, 25 Jun 2025 18:24:12 +0000 (11:24 -0700)]
bpf: add bpf_features enum
This commit adds a kernel side enum for use in conjucntion with BTF
CO-RE bpf_core_enum_value_exists. The goal of the enum is to assist
with available BPF features detection. Intended usage looks as
follows:
if (bpf_core_enum_value_exists(enum bpf_features, BPF_FEAT_<f>))
... use feature f ...
Changes v1 => v2:
1. Split new selftests to a separate patch. (Eduard)
2. Reset reg id on BPF_NEG. (Eduard)
3. Use env->fake_reg instead of a bpf_reg_state on the stack. (Eduard)
4. Add __msg for passing selftests.
Song Liu [Wed, 25 Jun 2025 16:40:25 +0000 (09:40 -0700)]
selftests/bpf: Add tests for BPF_NEG range tracking logic
BPF_REG now has range tracking logic. Add selftests for BPF_NEG.
Specifically, return value of LSM hook lsm.s/socket_connect is used to
show that the verifer tracks BPF_NEG(1) falls in the [-4095, 0] range;
while BPF_NEG(100000) does not fall in that range.
Yonghong Song [Tue, 24 Jun 2025 21:18:02 +0000 (14:18 -0700)]
selftests/bpf: Fix usdt multispec failure with arm64/clang20 selftest build
When building the selftest with arm64/clang20, the following test failed:
...
ubtest_multispec_usdt:PASS:usdt_100_called 0 nsec
subtest_multispec_usdt:PASS:usdt_100_sum 0 nsec
subtest_multispec_usdt:FAIL:usdt_300_bad_attach unexpected pointer: 0xaaaad82a2a80
#471/2 usdt/multispec:FAIL
#471 usdt:FAIL
But arm64/gcc11 built kernel selftests succeeded. Further debug found arm64/clang
generated code has much less argument pattern after dedup, but gcc generated
code has a lot more.
Check usdt probes with usdt.test.o on arm64 platform:
There are total 300 locations for usdt_300. For gcc11 built binary, there are
300 spec's. But for clang20 built binary, there are 3 spec's. The default
BPF_USDT_MAX_SPEC_CNT is 256, so bpf_program__attach_usdt() will fail for gcc
but it will succeed with clang.
To fix the problem, do not do bpf_program__attach_usdt() for usdt_300
with arm64/clang setup.
Adin Scannell [Wed, 25 Jun 2025 05:02:15 +0000 (22:02 -0700)]
libbpf: Fix possible use-after-free for externs
The `name` field in `obj->externs` points into the BTF data at initial
open time. However, some functions may invalidate this after opening and
before loading (e.g. `bpf_map__set_value_size`), which results in
pointers into freed memory and undefined behavior.
The simplest solution is to simply `strdup` these strings, similar to
the `essent_name`, and free them at the same time.
In order to test this path, the `global_map_resize` BPF selftest is
modified slightly to ensure the presence of an extern, which causes this
test to fail prior to the fix. Given there isn't an obvious API or error
to test against, I opted to add this to the existing test as an aspect
of the resizing feature rather than duplicate the test.
Linus Torvalds [Wed, 25 Jun 2025 18:20:14 +0000 (11:20 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Fixes all in drivers.
ufs and megaraid_sas are small and obvious.
The large diffstat in fnic comes from two pieces: the addition of
quite a bit of logging (no change to function) and the reworking of
the timeout allocation path for the two conditions that can occur
simultaneously to prevent reusing the same abort frame and then both
trying to free it"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: fnic: Fix missing DMA mapping error in fnic_send_frame()
scsi: fnic: Set appropriate logging level for log message
scsi: fnic: Add and improve logs in FDMI and FDMI ABTS paths
scsi: fnic: Turn off FDMI ACTIVE flags on link down
scsi: fnic: Fix crash in fnic_wq_cmpl_handler when FDMI times out
scsi: ufs: core: Fix clk scaling to be conditional in reset and restore
scsi: megaraid_sas: Fix invalid node index
Linus Torvalds [Wed, 25 Jun 2025 18:13:31 +0000 (11:13 -0700)]
Merge tag 'uml-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML fixes from Johannes Berg:
- fix FP registers in seccomp mode
- prevent duplicate devices in VFIO support
- don't ignore errors in UBD thread start
- reduce stack use with clang 19
* tag 'uml-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: vector: Reduce stack usage in vector_eth_configure()
um: Use correct data source in fpregs_legacy_set()
um: vfio: Prevent duplicate device assignments
um: ubd: Add missing error check in start_io_thread()
Jakub Kicinski [Wed, 25 Jun 2025 17:26:16 +0000 (10:26 -0700)]
Merge tag 'wireless-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Just a few fixes:
- iwlegacy: work around large stack with clang/kasan
- mac80211: fix integer overflow
- mac80211: fix link struct init vs. RCU publish
- iwlwifi: fix warning on IFF_UP
* tag 'wireless-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: finish link init before RCU publish
wifi: iwlwifi: mvm: assume '1' as the default mac_config_cmd version
wifi: mac80211: fix beacon interval calculation overflow
wifi: iwlegacy: work around excessive stack usage on clang/kasan
====================
Tiwei Bie [Mon, 23 Jun 2025 11:08:29 +0000 (19:08 +0800)]
um: vector: Reduce stack usage in vector_eth_configure()
When compiling with clang (19.1.7), initializing *vp using a compound
literal may result in excessive stack usage. Fix it by initializing the
required fields of *vp individually.
====================
bpf, verifier: Improve precision of BPF_ADD and BPF_SUB
This patchset improves the precision of BPF_ADD and BPF_SUB range
tracking. It also adds selftests that exercise the cases where precision
improvement occurs, and selftests for the cases where precise bounds
cannot be computed and the output register state values are set to
unbounded.
Changelog:
v3:
* Improve readability in selftests and commit message by using
more readable constants (suggested by Eduard Zingerman).
* Add four new selftests for the cases where precise output register
state bounds cannot be computed in scalar(32)_min_max_add/sub, so the
output register state must be set to unbounded, i.e., [0, U64_MAX]
or [0, U32_MAX].
* Add suggested-by Eduard tag to commit message for changes to
verifier_bounds.c
v2:
* Add clearer example of precision improvement in the commit message for
verifier.c changes.
* Add selftests that exercise the precision improvement to
verifier_bounds.c (suggested by Eduard Zingerman).
selftests/bpf: Add testcases for BPF_ADD and BPF_SUB
The previous commit improves the precision in scalar(32)_min_max_add,
and scalar(32)_min_max_sub. The improvement in precision occurs in cases
when all outcomes overflow or underflow, respectively.
This commit adds selftests that exercise those cases.
This commit also adds selftests for cases where the output register
state bounds for u(32)_min/u(32)_max are conservatively set to unbounded
(when there is partial overflow or underflow).
bpf, verifier: Improve precision for BPF_ADD and BPF_SUB
This patch improves the precison of the scalar(32)_min_max_add and
scalar(32)_min_max_sub functions, which update the u(32)min/u(32)_max
ranges for the BPF_ADD and BPF_SUB instructions. We discovered this more
precise operator using a technique we are developing for automatically
synthesizing functions for updating tnums and ranges.
According to the BPF ISA [1], "Underflow and overflow are allowed during
arithmetic operations, meaning the 64-bit or 32-bit value will wrap".
Our patch leverages the wrap-around semantics of unsigned overflow and
underflow to improve precision.
Below is an example of our patch for scalar_min_max_add; the idea is
analogous for all four functions.
There are three cases to consider when adding two u64 ranges [dst_umin,
dst_umax] and [src_umin, src_umax]. Consider a value x in the range
[dst_umin, dst_umax] and another value y in the range [src_umin,
src_umax].
(a) No overflow: No addition x + y overflows. This occurs when even the
largest possible sum, i.e., dst_umax + src_umax <= U64_MAX.
(b) Partial overflow: Some additions x + y overflow. This occurs when
the largest possible sum overflows (dst_umax + src_umax > U64_MAX), but
the smallest possible sum does not overflow (dst_umin + src_umin <=
U64_MAX).
(c) Full overflow: All additions x + y overflow. This occurs when both
the smallest possible sum and the largest possible sum overflow, i.e.,
both (dst_umin + src_umin) and (dst_umax + src_umax) are > U64_MAX.
The current implementation conservatively sets the output bounds to
unbounded, i.e, [umin=0, umax=U64_MAX], whenever there is *any*
possibility of overflow, i.e, in cases (b) and (c). Otherwise it
computes tight bounds as [dst_umin + src_umin, dst_umax + src_umax]:
Our synthesis-based technique discovered a more precise operator.
Particularly, in case (c), all possible additions x + y overflow and
wrap around according to eBPF semantics, and the computation of the
output range as [dst_umin + src_umin, dst_umax + src_umax] continues to
work. Only in case (b), do we need to set the output bounds to
unbounded, i.e., [0, U64_MAX].
Case (b) can be checked by seeing if the minimum possible sum does *not*
overflow and the maximum possible sum *does* overflow, and when that
happens, we set the output to unbounded:
Below is an example eBPF program and the corresponding log from the
verifier.
The current implementation of scalar_min_max_add() sets r3's bounds to
[0, U64_MAX] at instruction 5: (0f) r3 += r3, due to conservative
overflow handling.
The logic for scalar32_min_max_add is analogous. For the
scalar(32)_min_max_sub functions, the reasoning is similar but applied
to detecting underflow instead of overflow.
We verified the correctness of the new implementations using Agni [3,4].
We since also discovered that a similar technique has been used to
calculate output ranges for unsigned interval addition and subtraction
in Hacker's Delight [2].
[1] https://docs.kernel.org/bpf/standardization/instruction-set.html
[2] Hacker's Delight Ch.4-2, Propagating Bounds through Add’s and Subtract’s
[3] https://github.com/bpfverif/agni
[4] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf
From the crash dump, we found that the cpu_map_flush_list inside
redirect info is partially corrupted: its list_head->next points to
itself, but list_head->prev points to a valid list of unflushed bq
entries.
This turned out to be a result of missed XDP flush on redirect lists. By
digging in the actual source code, we found that
commit 7f0a168b0441 ("bnxt_en: Add completion ring pointer in TX and RX
ring structures") incorrectly overwrites the event mask for XDP_REDIRECT
in bnxt_rx_xdp. We can stably reproduce this crash by returning XDP_TX
and XDP_REDIRECT randomly for incoming packets in a naive XDP program.
Properly propagate the XDP_REDIRECT events back fixes the crash.
Fixes: a7559bc8c17c ("bnxt: support transmit and free of aggregation buffers") Tested-by: Andrew Rzeznik <arzeznik@cloudflare.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Link: https://patch.msgid.link/aFl7jpCNzscumuN2@debian.debian Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 25 Jun 2025 00:20:43 +0000 (17:20 -0700)]
Merge tag 'selinux-pr-20250624' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux fix from Paul Moore:
"Another small SELinux patch to fix a problem seen by the dracut-ng
folks during early boot when SELinux is enabled, but the policy has
yet to be loaded"
* tag 'selinux-pr-20250624' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: change security_compute_sid to return the ssid or tsid on match
If a userspace application just include <linux/vm_sockets.h> will fail
to build with the following errors:
/usr/include/linux/vm_sockets.h:182:39: error: invalid application of ‘sizeof’ to incomplete type ‘struct sockaddr’
182 | unsigned char svm_zero[sizeof(struct sockaddr) -
| ^~~~~~
/usr/include/linux/vm_sockets.h:183:39: error: ‘sa_family_t’ undeclared here (not in a function)
183 | sizeof(sa_family_t) -
|
Include <sys/socket.h> for userspace (guarded by ifndef __KERNEL__)
where `struct sockaddr` and `sa_family_t` are defined.
We already do something similar in <linux/mptcp.h> and <linux/if.h>.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250623100053.40979-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Having PM put sync in remove function is causing PM underflow during
remove operation. This is caused by the function, runtime_pm_get_sync,
not being called anywhere during the op. Ensure that calls to
pm_runtime_enable()/pm_runtime_disable() and
pm_runtime_get_sync()/pm_runtime_put_sync() match.
Continuous bind/unbind will result in an "Unbalanced pm_runtime_enable" error.
Subsequent unbind attempts will return a "No such device" error, while bind
attempts will return a "Resource temporarily unavailable" error.
Also, change clk_disable_unprepare() to clk_disable() since continuous
bind and unbind operations will trigger a warning indicating that the clock is
already unprepared.
Al Viro [Tue, 24 Jun 2025 14:25:04 +0000 (10:25 -0400)]
userns and mnt_idmap leak in open_tree_attr(2)
Once want_mount_setattr() has returned a positive, it does require
finish_mount_kattr() to release ->mnt_userns. Failing do_mount_setattr()
does not change that.
As the result, we can end up leaking userns and possibly mnt_idmap as
well.
Fixes: c4a16820d901 ("fs: add open_tree_attr()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Johannes Berg [Tue, 24 Jun 2025 11:07:49 +0000 (13:07 +0200)]
wifi: mac80211: finish link init before RCU publish
Since the link/conf pointers can be accessed without any
protection other than RCU, make sure the data is actually
set up before publishing the structures.
Miri Korenblit [Tue, 24 Jun 2025 07:14:27 +0000 (10:14 +0300)]
wifi: iwlwifi: mvm: assume '1' as the default mac_config_cmd version
Unfortunately, FWs of some devices don't have the version of the
iwl_mac_config_cmd defined in the TLVs. We send 0 as the 'def argument
to iwl_fw_lookup_cmd_ver, so for such FWs, the return value will be 0,
leading to a warning, and to not sending the command.
Fix this by assuming that the default version is 1.
Paolo Abeni [Tue, 24 Jun 2025 08:10:09 +0000 (10:10 +0200)]
Merge branch 'af_unix-fix-two-oob-issues'
Kuniyuki Iwashima says:
====================
af_unix: Fix two OOB issues.
From: Kuniyuki Iwashima <kuniyu@google.com>
Recently, two issues are reported regarding MSG_OOB.
Patch 1 fixes issues that happen when multiple consumed OOB
skbs are placed consecutively in the recv queue.
Patch 2 fixes an inconsistent behaviour that close()ing a socket
with a consumed OOB skb at the head of the recv queue triggers
-ECONNRESET on the peer's recv().
af_unix: Don't set -ECONNRESET for consumed OOB skb.
Christian Brauner reported that even after MSG_OOB data is consumed,
calling close() on the receiver socket causes the peer's recv() to
return -ECONNRESET:
Let's add a test case where consecutive concumed OOB skbs stay
at the head of the queue.
Without the previous patch, ioctl(SIOCATMARK) assertion fails.
Before:
# RUN msg_oob.no_peek.ex_oob_ex_oob_oob ...
# msg_oob.c:305:ex_oob_ex_oob_oob:Expected answ[0] (0) == oob_head (1)
# ex_oob_ex_oob_oob: Test terminated by assertion
# FAIL msg_oob.no_peek.ex_oob_ex_oob_oob
not ok 12 msg_oob.no_peek.ex_oob_ex_oob_oob
After:
# RUN msg_oob.no_peek.ex_oob_ex_oob_oob ...
# OK msg_oob.no_peek.ex_oob_ex_oob_oob
ok 12 msg_oob.no_peek.ex_oob_ex_oob_oob
Even though a user reads OOB data, the skb holding the data stays on
the recv queue to mark the OOB boundary and break the next recv().
After the last send() in the scenario above, the sk2's recv queue has
2 leading consumed OOB skbs and 1 real OOB skb.
Then, the following happens during the next recv() without MSG_OOB
1. unix_stream_read_generic() peeks the first consumed OOB skb
2. manage_oob() returns the next consumed OOB skb
3. unix_stream_read_generic() fetches the next not-yet-consumed OOB skb
4. unix_stream_read_generic() reads and frees the OOB skb
, and the last recv(MSG_OOB) triggers KASAN splat.
The 3. above occurs because of the SO_PEEK_OFF code, which does not
expect unix_skb_len(skb) to be 0, but this is true for such consumed
OOB skbs.
In addition to this use-after-free, there is another issue that
ioctl(SIOCATMARK) does not function properly with consecutive consumed
OOB skbs.
So, nothing good comes out of such a situation.
Instead of complicating manage_oob(), ioctl() handling, and the next
ECONNRESET fix by introducing a loop for consecutive consumed OOB skbs,
let's not leave such consecutive OOB unnecessarily.
Now, while receiving an OOB skb in unix_stream_recv_urg(), if its
previous skb is a consumed OOB skb, it is freed.
[0]:
BUG: KASAN: slab-use-after-free in unix_stream_read_actor (net/unix/af_unix.c:3027)
Read of size 4 at addr ffff888106ef2904 by task python3/315
The buggy address belongs to the object at ffff888106ef28c0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 68 bytes inside of
freed 224-byte region [ffff888106ef28c0, ffff888106ef29a0)
Memory state around the buggy address: ffff888106ef2800: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc ffff888106ef2880: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888106ef2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff888106ef2980: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc ffff888106ef2a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Arnd Bergmann [Fri, 20 Jun 2025 11:39:42 +0000 (13:39 +0200)]
wifi: iwlegacy: work around excessive stack usage on clang/kasan
In some rare randconfig builds, I seem to trigger a bug in clang where
it unrolls a loop but then runs out of registers, which then get
spilled to the stack:
This seems to be the same one I saw in the omapdrm driver, and there is
an easy workaround by not inlining the il4965_rs_rate_scale_clear_win
function.
====================
bpf: Specify access type of bpf_sysctl_get_name args
The second argument of bpf_sysctl_get_name() helper is a pointer to a
buffer that is being written to. However that isn't specify in the
prototype. Until commit 37cce22dbd51a ("bpf: verifier: Refactor helper
access type tracking") that mistake was hidden by the way the verifier
treated helper accesses. Since then, the verifier, working on wrong
infromation from the prototype, can make faulty optimization that
would had been caught by the test_sysctl selftests if it was run by
the CI.
The first patch fixes bpf_sysctl_get_name prototype.
The second patch converts the test_sysctl to prog_tests so that it
will be run by the CI and catch similar issues in the future.
Changes in v3:
- Use ASSERT* macro instead of CHECK_FAIL.
- Remove useless code.
Changes in v2:
- Replace ARG_PTR_TO_UNINIT_MEM by ARG_PTR_TO_MEM | MEM_WRITE.
- Converts test_sysctl to prog_tests.
====================
Jerome Marchand [Thu, 19 Jun 2025 14:06:02 +0000 (16:06 +0200)]
bpf: Specify access type of bpf_sysctl_get_name args
The second argument of bpf_sysctl_get_name() helper is a pointer to a
buffer that is being written to. However that isn't specify in the
prototype.
Until commit 37cce22dbd51a ("bpf: verifier: Refactor helper access
type tracking"), all helper accesses were considered as a possible
write access by the verifier, so no big harm was done. However, since
then, the verifier might make wrong asssumption about the content of
that address which might lead it to make faulty optimizations (such as
removing code that was wrongly labeled dead). This is what happens in
test_sysctl selftest to the tests related to sysctl_get_name.
Add MEM_WRITE flag the second argument of bpf_sysctl_get_name().
Ido Schimmel [Thu, 19 Jun 2025 18:22:28 +0000 (21:22 +0300)]
bridge: mcast: Fix use-after-free during router port configuration
The bridge maintains a global list of ports behind which a multicast
router resides. The list is consulted during forwarding to ensure
multicast packets are forwarded to these ports even if the ports are not
member in the matching MDB entry.
When per-VLAN multicast snooping is enabled, the per-port multicast
context is disabled on each port and the port is removed from the global
router port list:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1
# ip link add name dummy1 up master br1 type dummy
# ip link set dev dummy1 type bridge_slave mcast_router 2
$ bridge -d mdb show | grep router
router ports on br1: dummy1
# ip link set dev br1 type bridge mcast_vlan_snooping 1
$ bridge -d mdb show | grep router
However, the port can be re-added to the global list even when per-VLAN
multicast snooping is enabled:
# ip link set dev dummy1 type bridge_slave mcast_router 0
# ip link set dev dummy1 type bridge_slave mcast_router 2
$ bridge -d mdb show | grep router
router ports on br1: dummy1
Since commit 4b30ae9adb04 ("net: bridge: mcast: re-implement
br_multicast_{enable, disable}_port functions"), when per-VLAN multicast
snooping is enabled, multicast disablement on a port will disable the
per-{port, VLAN} multicast contexts and not the per-port one. As a
result, a port will remain in the global router port list even after it
is deleted. This will lead to a use-after-free [1] when the list is
traversed (when adding a new port to the list, for example):
# ip link del dev dummy1
# ip link add name dummy2 up master br1 type dummy
# ip link set dev dummy2 type bridge_slave mcast_router 2
Similarly, stale entries can also be found in the per-VLAN router port
list. When per-VLAN multicast snooping is disabled, the per-{port, VLAN}
contexts are disabled on each port and the port is removed from the
per-VLAN router port list:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1
# ip link add name dummy1 up master br1 type dummy
# bridge vlan add vid 2 dev dummy1
# bridge vlan global set vid 2 dev br1 mcast_snooping 1
# bridge vlan set vid 2 dev dummy1 mcast_router 2
$ bridge vlan global show dev br1 vid 2 | grep router
router ports: dummy1
# ip link set dev br1 type bridge mcast_vlan_snooping 0
$ bridge vlan global show dev br1 vid 2 | grep router
However, the port can be re-added to the per-VLAN list even when
per-VLAN multicast snooping is disabled:
# bridge vlan set vid 2 dev dummy1 mcast_router 0
# bridge vlan set vid 2 dev dummy1 mcast_router 2
$ bridge vlan global show dev br1 vid 2 | grep router
router ports: dummy1
When the VLAN is deleted from the port, the per-{port, VLAN} multicast
context will not be disabled since multicast snooping is not enabled
on the VLAN. As a result, the port will remain in the per-VLAN router
port list even after it is no longer member in the VLAN. This will lead
to a use-after-free [2] when the list is traversed (when adding a new
port to the list, for example):
# ip link add name dummy2 up master br1 type dummy
# bridge vlan add vid 2 dev dummy2
# bridge vlan del vid 2 dev dummy1
# bridge vlan set vid 2 dev dummy2 mcast_router 2
Fix these issues by removing the port from the relevant (global or
per-VLAN) router port list in br_multicast_port_ctx_deinit(). The
function is invoked during port deletion with the per-port multicast
context and during VLAN deletion with the per-{port, VLAN} multicast
context.
Note that deleting the multicast router timer is not enough as it only
takes care of the temporary multicast router states (1 or 3) and not the
permanent one (2).
[1]
BUG: KASAN: slab-out-of-bounds in br_multicast_add_router.part.0+0x3f1/0x560
Write of size 8 at addr ffff888004a67328 by task ip/384
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_address_description.constprop.0+0x6f/0x350
print_report+0x108/0x205
kasan_report+0xdf/0x110
br_multicast_add_router.part.0+0x3f1/0x560
br_multicast_set_port_router+0x74e/0xac0
br_setport+0xa55/0x1870
br_port_slave_changelink+0x95/0x120
__rtnl_newlink+0x5e8/0xa40
rtnl_newlink+0x627/0xb00
rtnetlink_rcv_msg+0x6fb/0xb70
netlink_rcv_skb+0x11f/0x350
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x124/0x1c0
do_syscall_64+0xbb/0x360
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[2]
BUG: KASAN: slab-use-after-free in br_multicast_add_router.part.0+0x378/0x560
Read of size 8 at addr ffff888009f00840 by task bridge/391
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_address_description.constprop.0+0x6f/0x350
print_report+0x108/0x205
kasan_report+0xdf/0x110
br_multicast_add_router.part.0+0x378/0x560
br_multicast_set_port_router+0x6f9/0xac0
br_vlan_process_options+0x8b6/0x1430
br_vlan_rtm_process_one+0x605/0xa30
br_vlan_rtm_process+0x396/0x4c0
rtnetlink_rcv_msg+0x2f7/0xb70
netlink_rcv_skb+0x11f/0x350
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x124/0x1c0
do_syscall_64+0xbb/0x360
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Linus Torvalds [Mon, 23 Jun 2025 22:02:57 +0000 (15:02 -0700)]
Merge tag 'for-6.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mikulas Patocka:
- dm-crypt: fix a crash on 32-bit machines
- dm-raid: replace "rdev" with correct loop variable name "r"
* tag 'for-6.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-raid: fix variable in journal device check
dm-crypt: Extend state buffer size in crypt_iv_lmk_one
Linus Torvalds [Mon, 23 Jun 2025 21:55:40 +0000 (14:55 -0700)]
Merge tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
- fix double-unlock introduced by the recent folio conversion
- fix stale page content beyond EOF complained by xfstests/generic/363
* tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix to zero post-eof page
f2fs: Fix __write_node_folio() conversion
Breno Leitao [Fri, 20 Jun 2025 18:48:55 +0000 (11:48 -0700)]
net: netpoll: Initialize UDP checksum field before checksumming
commit f1fce08e63fe ("netpoll: Eliminate redundant assignment") removed
the initialization of the UDP checksum, which was wrong and broke
netpoll IPv6 transmission due to bad checksumming.
udph->check needs to be set before calling csum_ipv6_magic().
Linus Torvalds [Mon, 23 Jun 2025 18:16:38 +0000 (11:16 -0700)]
Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes:
- fix invalid inode pointer dereferences during log replay
- fix a race between renames and directory logging
- fix shutting down delayed iput worker
- fix device byte accounting when dropping chunk
- in zoned mode, fix offset calculations for DUP profile when
conventional and sequential zones are used together
Regression fixes:
- fix possible double unlock of extent buffer tree (xarray
conversion)
- in zoned mode, fix extent buffer refcount when writing out extents
(xarray conversion)
Error handling fixes and updates:
- handle unexpected extent type when replaying log
- check and warn if there are remaining delayed inodes when putting a
root
- fix assertion when building free space tree
- handle csum tree error with mount option 'rescue=ibadroot'
Other:
- error message updates: add prefix to all scrub related messages,
include other information in messages"
* tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix alloc_offset calculation for partly conventional block groups
btrfs: handle csum tree error with rescue=ibadroots correctly
btrfs: fix race between async reclaim worker and close_ctree()
btrfs: fix assertion when building free space tree
btrfs: don't silently ignore unexpected extent type when replaying log
btrfs: fix invalid inode pointer dereferences during log replay
btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
btrfs: update superblock's device bytes_used when dropping chunk
btrfs: fix a race between renames and directory logging
btrfs: scrub: add prefix for the error messages
btrfs: warn if leaking delayed_nodes in btrfs_put_root()
btrfs: fix delayed ref refcount leak in debug assertion
btrfs: include root in error message when unlinking inode
btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
Yuan Chen [Wed, 18 Jun 2025 01:19:33 +0000 (09:19 +0800)]
libbpf: Fix null pointer dereference in btf_dump__free on allocation failure
When btf_dump__new() fails to allocate memory for the internal hashmap
(btf_dump->type_names), it returns an error code. However, the cleanup
function btf_dump__free() does not check if btf_dump->type_names is NULL
before attempting to free it. This leads to a null pointer dereference
when btf_dump__free() is called on a btf_dump object.
Al Viro [Sun, 22 Jun 2025 22:03:29 +0000 (18:03 -0400)]
attach_recursive_mnt(): do not lock the covering tree when sliding something under it
If we are propagating across the userns boundary, we need to lock the
mounts added there. However, in case when something has already
been mounted there and we end up sliding a new tree under that,
the stuff that had been there before should not get locked.
IOW, lock_mnt_tree() should be called before we reparent the
preexisting tree on top of what we are adding.
Fixes: 3bd045cc9c4b ("separate copying and locking mount tree on cross-userns copies") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 17 Jun 2025 04:09:51 +0000 (00:09 -0400)]
replace collect_mounts()/drop_collected_mounts() with a safer variant
collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with callback passed to iterate_mounts();
it has an oopsable race with d_invalidate(); it creates temporary clones
of mounts invisibly for sync umount (IOW, you can have non-lazy umount
succeed leaving filesystem not mounted anywhere and yet still busy).
A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.
* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or
a pointer to array of struct path, one for each chunk of tree visible under
'where' (i.e. the first element is a copy of where, followed by (mount,root)
for everything mounted under it - the same set collect_mounts() would give).
Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning
references to the roots of subtrees in the caller's namespace.
Array is terminated by {NULL, NULL} struct path. If it fits into
preallocated array (on-stack, normally), that's where it goes; otherwise
it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned
by an earlier call of collect_paths() and the preallocated array passed to that
call. All mount/dentry references are dropped and array is kfree'd if it's not
equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array
of struct path - nothing exotic is needed for that. Existing users (all in
audit_tree.c) are converted.
[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]
Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry") Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Menglong Dong [Sat, 21 Jun 2025 04:55:01 +0000 (12:55 +0800)]
bpf: Make update_prog_stats() always_inline
The function update_prog_stats() will be called in the bpf trampoline.
In most cases, it will be optimized by the compiler by making it inline.
However, we can't rely on the compiler all the time, and just make it
__always_inline to reduce the possible overhead.
Linus Torvalds [Mon, 23 Jun 2025 16:20:39 +0000 (09:20 -0700)]
Merge tag 'mm-hotfixes-stable-2025-06-22-18-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 7 are cc:stable and the remainder address post-6.15
issues or aren't considered necessary for -stable kernels. Only 4 are
for MM.
- The series `Revert "bcache: update min_heap_callbacks to use
default builtin swap"' from Kuan-Wei Chiu backs out the author's
recent min_heap changes due to a performance regression.
A fix for this regression has been developed but we felt it best to
go back to the known-good version to give the new code more bake
time.
- A lot of MAINTAINERS maintenance.
I like to get these changes upstreamed promptly because they can't
break things and more accurate/complete MAINTAINERS info hopefully
improves the speed and accuracy of our responses to submitters and
reporters"
* tag 'mm-hotfixes-stable-2025-06-22-18-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add additional mmap-related files to mmap section
MAINTAINERS: add memfd, shmem quota files to shmem section
MAINTAINERS: add stray rmap file to mm rmap section
MAINTAINERS: add hugetlb_cgroup.c to hugetlb section
MAINTAINERS: add further init files to mm init block
MAINTAINERS: update maintainers for HugeTLB
maple_tree: fix MA_STATE_PREALLOC flag in mas_preallocate()
MAINTAINERS: add missing test files to mm gup section
MAINTAINERS: add missing mm/workingset.c file to mm reclaim section
selftests/mm: skip uprobe vma merge test if uprobes are not enabled
bcache: remove unnecessary select MIN_HEAP
Revert "bcache: remove heap-related macros and switch to generic min_heap"
Revert "bcache: update min_heap_callbacks to use default builtin swap"
selftests/mm: add configs to fix testcase failure
kho: initialize tail pages for higher order folios properly
MAINTAINERS: add linux-mm@ list to Kexec Handover
mm: userfaultfd: fix race of userfaultfd_move and swap cache
mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"
selftests/mm: increase timeout from 180 to 900 seconds
mm/shmem, swap: fix softlockup with mTHP swapin
Bluetooth: hci_core: Fix use-after-free in vhci_flush()
syzbot reported use-after-free in vhci_flush() without repro. [0]
From the splat, a thread close()d a vhci file descriptor while
its device was being used by iotcl() on another thread.
Once the last fd refcnt is released, vhci_release() calls
hci_unregister_dev(), hci_free_dev(), and kfree() for struct
vhci_data, which is set to hci_dev->dev->driver_data.
The problem is that there is no synchronisation after unlinking
hdev from hci_dev_list in hci_unregister_dev(). There might be
another thread still accessing the hdev which was fetched before
the unlink operation.
We can use SRCU for such synchronisation.
Let's run hci_dev_reset() under SRCU and wait for its completion
in hci_unregister_dev().
Another option would be to restore hci_dev->destruct(), which was
removed in commit 587ae086f6e4 ("Bluetooth: Remove unused
hci-destruct cb"). However, this would not be a good solution, as
we should not run hci_unregister_dev() while there are in-flight
ioctl() requests, which could lead to another data-race KCSAN splat.
Note that other drivers seem to have the same problem, for exmaple,
virtbt_remove().
[0]:
BUG: KASAN: slab-use-after-free in skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline]
BUG: KASAN: slab-use-after-free in skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937
Read of size 8 at addr ffff88807cb8d858 by task syz.1.219/6718
The buggy address belongs to the object at ffff88807cb8d800
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 88 bytes inside of
freed 1024-byte region [ffff88807cb8d800, ffff88807cb8dc00)
Fixes: bf18c7118cf8 ("Bluetooth: vhci: Free driver_data on file release") Reported-by: syzbot+2faa4825e556199361f9@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f62d64848fc4c7c30cd6 Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Arnd Bergmann [Fri, 20 Jun 2025 13:09:53 +0000 (15:09 +0200)]
net: qed: reduce stack usage for TLV processing
clang gets a bit confused by the code in the qed_mfw_process_tlv_req and
ends up spilling registers to the stack hundreds of times. When sanitizers
are enabled, this can end up blowing the stack warning limit:
Apparently the problem is the complexity of qed_mfw_update_tlvs()
after inlining, and marking the four main branches of that function
as noinline_for_stack makes this problem completely go away, the stack
usage goes down to 100 bytes.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Mon, 23 Jun 2025 11:11:50 +0000 (19:11 +0800)]
dm-crypt: Extend state buffer size in crypt_iv_lmk_one
Add a macro CRYPTO_MD5_STATESIZE for the Crypto API export state
size of md5 and use that in dm-crypt instead of relying on the
size of struct md5_state (the latter is currently undergoing a
transition and may shrink).
Song Liu [Mon, 23 Jun 2025 06:38:54 +0000 (23:38 -0700)]
selftests/bpf: Add tests for bpf_cgroup_read_xattr
Add tests for different scenarios with bpf_cgroup_read_xattr:
1. Read cgroup xattr from bpf_cgroup_from_id;
2. Read cgroup xattr from bpf_cgroup_ancestor;
3. Read cgroup xattr from css_iter;
4. Use bpf_cgroup_read_xattr in LSM hook security_socket_connect.
5. Use bpf_cgroup_read_xattr in cgroup program.
Song Liu [Mon, 23 Jun 2025 06:38:52 +0000 (23:38 -0700)]
bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node
BPF programs, such as LSM and sched_ext, would benefit from tags on
cgroups. One common practice to apply such tags is to set xattrs on
cgroupfs folders.
Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's
xattr.
Note that, we already have bpf_get_[file|dentry]_xattr. However, these
two APIs are not ideal for reading cgroupfs xattrs, because:
1) These two APIs only works in sleepable contexts;
2) There is no kfunc that matches current cgroup to cgroupfs dentry.
bpf_cgroup_read_xattr is generic and can be useful for many program
types. It is also safe, because it requires trusted or rcu protected
argument (KF_RCU). Therefore, we make it available to all program types.
All allocations of struct kernfs_iattrs are serialized through a global
mutex. Simply do a racy allocation and let the first one win. I bet most
callers are under inode->i_rwsem anyway and it wouldn't be needed but
let's not require that.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-2-song@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
Eric Dumazet [Fri, 20 Jun 2025 14:28:44 +0000 (14:28 +0000)]
atm: clip: prevent NULL deref in clip_push()
Blamed commit missed that vcc_destroy_socket() calls
clip_push() with a NULL skb.
If clip_devs is NULL, clip_push() then crashes when reading
skb->truesize.
Fixes: 93a2014afbac ("atm: fix a UAF in lec_arp_clear_vccs()") Reported-by: syzbot+1316233c4c6803382a8b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68556f59.a00a0220.137b3.004e.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Gengming Liu <l.dmxcsnsbh@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 22 Jun 2025 17:50:36 +0000 (10:50 -0700)]
Merge tag 'i2c-for-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
- subsystem: convert drivers to use recent callbacks of struct
i2c_algorithm A typical after-rc1 cleanup, which I couldn't send in
time for rc2
- tegra: fix YAML conversion of device tree bindings
- k1: re-add a check which got lost during upstreaming
* tag 'i2c-for-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: k1: check for transfer error
i2c: use inclusive callbacks in struct i2c_algorithm
dt-bindings: i2c: nvidia,tegra20-i2c: Specify the required properties
Linus Torvalds [Sun, 22 Jun 2025 17:30:44 +0000 (10:30 -0700)]
Merge tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure the array tracking which kernel text positions need to be
alternatives-patched doesn't get mishandled by out-of-order
modifications, leading to it overflowing and causing page faults when
patching
- Avoid an infinite loop when early code does a ranged TLB invalidation
before the broadcast TLB invalidation count of how many pages it can
flush, has been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an
overflow of the bitmap tracking dynamic ASIDs which need to be
flushed when the kernel switches between the user and kernel address
space
- Handle the case of a CPU going offline and thus reporting zeroes when
reading top-level events in the resctrl code
* tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fix int3 handling failure from broken text_poke array
x86/mm: Fix early boot use of INVPLGB
x86/its: Fix an ifdef typo in its_alloc()
x86/mm: Disable INVLPGB when PTI is enabled
x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
Linus Torvalds [Sun, 22 Jun 2025 17:17:51 +0000 (10:17 -0700)]
Merge tag 'irq_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Fix missing prototypes warnings
- Properly initialize work context when allocating it
- Remove a method tracking when managed interrupts are suspended during
hotplug, in favor of the code using a IRQ disable depth tracking now,
and have interrupts get properly enabled again on restore
- Make sure multiple CPUs getting hotplugged don't cause wrong tracking
of the managed IRQ disable depth
* tag 'irq_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ath79-misc: Fix missing prototypes warnings
genirq/irq_sim: Initialize work context pointers properly
genirq/cpuhotplug: Restore affinity even for suspended IRQ
genirq/cpuhotplug: Rebalance managed interrupts across multi-CPU hotplug
Faisal Bukhari [Sat, 21 Jun 2025 10:32:04 +0000 (16:02 +0530)]
Fix typo in marvell octeontx2 documentation
Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
Fixes a spelling mistake: "funcionality" → "functionality".
Signed-off-by: Faisal Bukhari <faisalbukhari523@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 22 Jun 2025 17:11:45 +0000 (10:11 -0700)]
Merge tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Avoid a crash on a heterogeneous machine where not all cores support
the same hw events features
- Avoid a deadlock when throttling events
- Document the perf event states more
- Make sure a number of perf paths switching off or rescheduling events
call perf_cgroup_event_disable()
- Make sure perf does task sampling before its userspace mapping is
torn down, and not after
* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix crash in icl_update_topdown_event()
perf: Fix the throttle error of some clock events
perf: Add comment to enum perf_event_state
perf/core: Fix WARN in perf_cgroup_switch()
perf: Fix dangling cgroup pointer in cpuctx
perf: Fix cgroup state vs ERROR
perf: Fix sample vs do_exit()
Linus Torvalds [Sun, 22 Jun 2025 17:09:23 +0000 (10:09 -0700)]
Merge tag 'locking_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Make sure the switch to the global hash is requested always under a
lock so that two threads requesting that simultaneously cannot get to
inconsistent state
- Reject negative NUMA nodes earlier in the futex NUMA interface
handling code
- Selftests fixes
* tag 'locking_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Verify under the lock if hash can be replaced
futex: Handle invalid node numbers supplied by user
selftests/futex: Set the home_node in futex_numa_mpol
selftests/futex: getopt() requires int as return value.