Mykyta Yatsenko [Fri, 5 Jun 2026 11:41:18 +0000 (04:41 -0700)]
rhashtable: Add rhashtable_next_key() API
Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.
Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.
Best-effort: in case of concurrent resize, provides no guarantees:
- may produce duplicate elements
- may skip any amount of elements
- termination of the loop is not guaranteed in case of
sustained rehash. Callers are advised to bound loop externally
or avoid inserting new elements during such loop.
Returns ERR_PTR(-ENOENT) if prev_key is not found.
Behavior on tables with duplicate keys is undefined.
rhltable is not supported — returns ERR_PTR(-EOPNOTSUPP).
netfilter: conntrack: call nf_ct_gre_keymap_destroy() if master helper is pptp
For GRE flows, validate that the ct master helper (if any) is pptp
before calling nf_ct_gre_keymap_destroy(), so the helper data area
can be accessed safely. Note that only the pptp helper provides a
.destroy callback.
This infrastructure is not used anymore after moving ct timeout and
helper to use datapath refcount to track object use.
Revert commit c56716c69ce1 ("netfilter: extensions: introduce extension
genid count") this patch disables all ct extensions (leading to NULL)
for unconfirmed conntracks, when this is only targeted at ct helper and
ct timeout. There is also codebase that dereferences the ct extension
without checking for NULL which could lead to crash.
netfilter: nf_conntrack_helper: add refcounting from datapath
This patch adds a new ->ct_refcnt field to struct nf_conntrack_helper
which is bumped when the helper is used by the ct helper extension. Drop
this reference count when the conntrack entry is released. This is a
packet path refcount which ensures that struct nf_conntrack_helper
remains in place for tricky scenarios where a packet sits in nfqueue, or
elsewhere, with a conntrack that refers to this helper.
For simplicity, this leaves a single refcount for helper objects in
place, remove the existing refcount for control plane that ensures that
the helper does not go away if it is used by ruleset.
On helper removal, the help callback is set to NULL to disable it from
packet path and, after rcu grace period, existing expectations are
removed. Update ctnetlink to disable access to .to_nlattr and
.from_nlattr if the helper is going away.
Remove nf_queue_nf_hook_drop() since it has proven not to be effective
because packets with unconfirmed conntracks which are still flying to
sit in nfqueue.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_conntrack_pptp: move GRE specific cleanup to GRE tracker
Move the GRE specific cleanup to nf_conntrack_proto_gre.c to ensure that
the .destroy callback for the pptp helper is still reachable by existing
conntrack entries while pptp module is being removed.
This is a preparation patch, no functional changes are intended.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
wifi: mac80211: Add 802.3 multicast encapsulation offload support
mac80211 converts 802.3 multicast packets to 802.11 format
before driver TX, even when Ethernet encapsulation offload
is enabled. This prevents drivers that support multicast
Ethernet encapsulation offload from receiving frames in
native 802.3 format.
Introduce the IEEE80211_OFFLOAD_ENCAP_MCAST flag to bypass the
802.11 encapsulation step and pass the multicast packet to the
driver in 802.3 format. Drivers that support multicast Ethernet
encapsulation offload can advertise this flag.
Disable multicast encapsulation offload in MLO case for drivers not
advertising MLO_MCAST_MULTI_LINK_TX support for AP mode and for
3-address AP_VLAN multicast packets.
wifi: mac80211: Add multicast to unicast support for 802.3 path
mac80211 already supports multicast-to-unicast conversion for
native 802.11 TX paths, but this handling is missing for the
802.3 transmit path. Due to that the packet never converted to
unicast and directly pass it to 802.11 Tx path by checking the
destination address as multicast.
Extend ieee80211_subif_start_xmit_8023() to honor the
multicast_to_unicast setting by cloning and converting multicast
Ethernet frames into per-station unicast transmissions, following
the same behavior of the native 802.11 TX path and allow it
to take 802.3 path.
wifi: mac80211: Add sta pointer sanity check in ieee80211_8023_xmit()
Currently ieee80211_8023_xmit() accesses the sta pointer without any
sanity check, assuming that only unicast packets for an authorized
station are processed. But the sta pointer could become NULL when
a framework to support 802.3 offload for the multicast packets is
added in the follow-up patches. Add the valid sta pointer sanity
check to avoid the invalid pointer access.
This aligns with some of the subordinate functions called by
ieee80211_8023_xmit() that already NULL-check 'sta' such as
ieee80211_select_queue() and ieee80211_aggr_check().
Oliver Upton [Tue, 2 Jun 2026 16:59:01 +0000 (09:59 -0700)]
KVM: arm64: Correctly identify executable PTEs at stage-2
KVM invalidates the I-cache before installing an executable PTE on
implementations without DIC. Unfortunately, support for FEAT_XNX
broke this check as KVM_PTE_LEAF_ATTR_HI_S2_XN was expanded to a
bitfield.
Fix it by reusing kvm_pgtable_stage2_pte_prot() and testing the abstract
permission bits instead.
Fixes: 2608563b466b ("KVM: arm64: Add support for FEAT_XNX stage-2 permissions") Reported-by: Sashiko (gemini/gemini-3.1-pro-preview) Signed-off-by: Oliver Upton <oupton@kernel.org> Reviewed-by: Wei-Lin Chang <weilin.chang@arm.com> Link: https://patch.msgid.link/20260602165901.52800-3-oupton@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
Oliver Upton [Tue, 2 Jun 2026 16:59:00 +0000 (09:59 -0700)]
KVM: arm64: nv: Fix handling of XN[0] when !FEAT_XNX
XN has already been extracted from its bitfield position so using
FIELD_PREP() on the mask that clears XN[0] is completely broken, having
the effect of unconditionally granting execute permissions...
Fix the obvious mistake by manipulating the right bit.
Cc: stable@vger.kernel.org Fixes: d93febe2ed2e ("KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2") Reviewed-by: Wei-Lin Chang <weilin.chang@arm.com> Signed-off-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/20260602165901.52800-2-oupton@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
Dmitry Ilvokhin [Fri, 5 Jun 2026 10:06:22 +0000 (03:06 -0700)]
cleanup: Specify nonnull argument index
The guard constructors were annotated with an empty __nonnull_args(),
relying on __nonnull__() marking every pointer parameter as non-NULL.
Sparse cannot parse the empty argument list.
Both constructors take the lock pointer as their first parameter, so
specify the index explicitly: __nonnull_args(1).
Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/all/aiJi0WcYE8FZt-jO@stanley.mountain/ Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/aiKpH3cLBEj3TF2Q@shell.ilvokhin.com
Andy Shevchenko [Thu, 4 Jun 2026 09:52:02 +0000 (11:52 +0200)]
fs/read_write: Do not export __kernel_write() to the entire world
Since we have EXPORT_SYMBOL_FOR_MODULES(), we may narrow
the __kernel_write() export to the only which really needs it.
With that being done, update the respective comment.
David Woodhouse [Thu, 4 Jun 2026 09:35:18 +0000 (10:35 +0100)]
ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing
When the system clocksource is kvmclock or Hyper-V (not the TSC directly),
vmclock_get_crosststamp() falls through to a separate get_cycles() call,
losing the atomic pairing between the system time snapshot and the TSC
reading.
Now that ktime_get_snapshot_id() populates hw_cycles with the underlying
TSC value for derived clocksources, use it when available. This gives a
perfect (system_time, tsc) pairing for the device time calculation.
The SUPPORT_KVMCLOCK wrapper is still needed to convert the TSC into
kvmclock nanoseconds for system_counter->cycles, because otherwise
get_device_system_crosststamp() can't interpret the result against the
system clock.
David Woodhouse [Thu, 4 Jun 2026 09:35:17 +0000 (10:35 +0100)]
x86/kvmclock: Implement read_snapshot() for kvmclock clocksource
Implement the read_snapshot() callback for the kvmclock clocksource. This
returns the kvmclock nanosecond value (for timekeeping) while also
providing the raw TSC value that was used to compute it.
The TSC is read inside the pvclock seqlock-protected region, ensuring the
raw TSC and derived kvmclock value are atomically paired.
This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like the vmclock PTP driver, which currently has to do a separate call to
get_cycles() to obtain a value at *approximately* the same time, to feed
through the vmclock calculation.
David Woodhouse [Thu, 4 Jun 2026 09:35:16 +0000 (10:35 +0100)]
clocksource/hyperv: Implement read_snapshot() for TSC page clocksource
Implement the read_snapshot() callback for the Hyper-V TSC page clock-
source. This returns the derived 10MHz reference time (for timekeeping)
while also providing the raw TSC value that was used to compute it.
When the TSC page is valid, hv_read_tsc_page_tsc() atomically captures both
values from a single RDTSC inside the sequence-counter protected read. When
the TSC page is invalid (sequence == 0), the hw_csid and hw_cycles are set
to zero indicating no value is available.
This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like KVM's master clock when running nested guests under Hyper-V.
pwm: th1520: Remove requirement for mul_u64_u64_div_u64_roundup
The cycle register is always u32, so cycles_to_ns() can take a u32
instead of a u64. With that narrowing, cycles * NSEC_PER_SEC is at most
u32::MAX * 1e9 (~4.3e18), which fits in u64 without overflow. The
saturating arithmetic is therefore no longer needed, and the ceiling
division can use Rust's u64::div_ceil() directly instead of the
open-coded numerator/denominator form.
This also drops the TODO referring to a future
mul_u64_u64_div_u64_roundup kernel helper, which is no longer required.
Shengming Hu [Thu, 4 Jun 2026 12:27:32 +0000 (20:27 +0800)]
mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
_kmalloc_nolock_noprof() retries from the next kmalloc bucket when the
initial allocation fails. The retry currently reuses `size` as the
bucket selector and overwrites it with s->object_size + 1.
That value is later passed as the original allocation size to
__slab_alloc_node(), slab_post_alloc_hook() and kasan_kmalloc(). On a
successful retry this makes KASAN/slub-debug observe the retry bucket
selector rather than the caller requested size, potentially widening the
valid kmalloc range and hiding overflows.
Keep the caller requested size separately as orig_size and pass it to
the allocation/debug/KASAN paths. Continue using `size` as the retry cache
selector.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock()") Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/202606042027323804pk3MRY42Jy7y42OHAhQZ@zte.com.cn Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Namjae Jeon [Wed, 3 Jun 2026 14:40:31 +0000 (23:40 +0900)]
iomap: Add IOMAP_F_ZERO_TAIL flag to trace event strings
Add IOMAP_F_ZERO_TAIL to the flag string mapping in iomap trace
events. This allows the new flag to be properly displayed in
ftrace output when iomap operations use it.
chachapoly_create() still accepts the compatibility poly1305 parameter
in the template name, but it assumes the second template argument is
always present and immediately passes it to strcmp().
When the argument is missing, crypto_attr_alg_name() returns an error
pointer. Check for that before comparing the name so malformed template
instantiations fail with an error instead of dereferencing the error
pointer in strcmp().
This matches the surrounding Crypto API template pattern where
crypto_attr_alg_name() results are validated before string-specific use.
Junyuan Wang [Tue, 26 May 2026 09:28:39 +0000 (09:28 +0000)]
crypto: qat - add KPT support for GEN6 devices
Add support for Intel Key Protection Technology (KPT) on QAT GEN6
devices.
KPT protects private keys from exposure by keeping them wrapped
(encrypted) while in use, in-flight, and at rest. Keys remain in wrapped
form and are not exposed in plaintext in host memory. This feature
operates outside of the Linux crypto framework and kernel keyring.
Extend the firmware admin interface to enable and configure KPT. During
device initialisation, if KPT is enabled, the driver sends an admin
message to firmware to enable KPT mode and configure parameters such as
the maximum number of SWK (Symmetric Wrapping Key) slots and the SWK
time-to-live (TTL).
Expose KPT configuration via a new sysfs attribute group, "qat_kpt", and
add ABI documentation.
Co-developed-by: Nitesh Venkatesh <nitesh.venkatesh@intel.com> Signed-off-by: Nitesh Venkatesh <nitesh.venkatesh@intel.com> Signed-off-by: Junyuan Wang <junyuan.wang@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Reviewed-by: Ahsan Atta <ahsan.atta@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ruijie Li [Mon, 25 May 2026 11:45:21 +0000 (19:45 +0800)]
crypto: pcrypt - restore callback for non-parallel fallback
pcrypt installs pcrypt_aead_done() on the child AEAD request before
trying to submit it through padata. If padata_do_parallel() returns
-EBUSY, pcrypt falls back to calling the child AEAD directly.
That fallback must not keep the padata completion callback. Otherwise
an asynchronous completion runs pcrypt_aead_done() even though the
request was never enrolled in padata.
Restore the original request callback and callback data before calling
the child AEAD directly. This keeps the fallback path aligned with a
direct AEAD request while leaving the parallel path unchanged.
Fixes: 662f2f13e66d ("crypto: pcrypt - Call crypto layer directly when padata_do_parallel() return -EBUSY") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:gpt-5.4 Signed-off-by: Ruijie Li <ruijieli51@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The Inline Crypto Engine found in Hawi SoC is compatible with the common
baseline IP 'qcom,inline-crypto-engine'. Hence, document the compatible as
such.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge patch series "vfs infrastructure for fs-verity support for XFS with post EOF merkle tree"
Christian Brauner <brauner@kernel.org> says:
This brings in the vfs infrastructure required to implement fs-verity
support for XFS.
* patches from https://patch.msgid.link/20260520123722.405752-1-aalbersh@kernel.org:
iomap: introduce iomap_fsverity_write() for writing fsverity metadata
iomap: teach iomap to read files with fsverity
iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
fsverity: generate and store zero-block hash
iomap: introduce iomap_fsverity_write() for writing fsverity metadata
This is just a wrapper around iomap_file_buffered_write() to create
necessary iterator over metadata.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://patch.msgid.link/20260520123722.405752-10-aalbersh@kernel.org Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Obtain fsverity info for folios with file data and fsverity metadata.
Filesystem can pass vi down to ioend and then to fsverity for
verification. This is different from other filesystems ext4, f2fs, btrfs
supporting fsverity, these filesystems don't need fsverity_info for
reading fsverity metadata. While reading merkle tree iomap requires
fsverity info to synthesize hashes for zeroed data block.
fsverity metadata has two kinds of holes - ones in merkle tree and one
after fsverity descriptor.
Merkle tree holes are blocks full of hashes of zeroed data blocks. These
are not stored on the disk but synthesized on the fly. This saves a bit
of space for sparse files. Due to this iomap also need to lookup
fsverity_info for folios with fsverity metadata. ->vi has a hash of the
zeroed data block which will be used to fill the merkle tree block.
The hole past descriptor is interpreted as end of metadata region. As we
don't have EOF here we use this hole as an indication that rest of the
folio is empty. This patch marks rest of the folio beyond fsverity
descriptor as uptodate.
For file data, fsverity needs to verify consistency of the whole file
against the root hash, hashes of holes are included in the merkle tree.
Verify them too.
Issue reading of fsverity merkle tree on the fsverity inodes. This way
metadata will be available at I/O completion time.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://patch.msgid.link/20260520123722.405752-9-aalbersh@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
This flag indicates that I/O is for fsverity metadata.
In the write path skip i_size check and i_size updates as metadata is
past EOF. In writeback don't update i_size and continue writeback if
even folio is beyond EOF. In read path don't zero fsverity folios, again
they are past EOF.
The iomap_block_needs_zeroing() is also called from write path. For
folios of larger order we don't want to zero out pages in the folio as
these could contain other merkle tree blocks. For fsverity, filesystem
will request to read PAGE_SIZE memory regions. For data folios, iomap
will zero the rest of the folio for anything which is beyond EOF. We
don't want this for fsverity folios.
Christian Brauner <brauner@kernel.org> says:
Changed IOMAP_F_FSVERITY from (1U << 10) to (1U << 11) to avoid colliding
with IOMAP_F_ZERO_TAIL, which already uses (1U << 10).
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://patch.msgid.link/20260520123722.405752-8-aalbersh@kernel.org Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Compute the hash of one filesystem block's worth of zeros. A filesystem
implementation can decide to elide merkle tree blocks containing only
this hash and synthesize the contents at read time.
Let's pretend that there's a file containing 131 data block and whose
merkle tree looks roughly like this:
If data[0-128] are sparse holes, then leaf0 will contain a repeating
sequence of @zero_digest. Therefore, leaf0 need not be written to disk
because its contents can be synthesized.
A subsequent xfs patch will use this to reduce the size of the merkle
tree when dealing with sparse gold master disk images and the like.
Note that this works only on the first-level (data holes). fsverity
doesn't store/generate zero_digest for any higher levels.
Add a helper to pre-fill folio with hashes of empty blocks. This will be
used by iomap to synthesize blocks full of zero hashes on the fly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://patch.msgid.link/20260520123722.405752-5-aalbersh@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Adapt all existing helpers to use a modified version of
nf_ct_helper_init(), to dynamically allocate struct nf_conntrack_helper.
Allocate expect_policy[] built-in into the helper to ensure this area is
reachable after helper removal since a follow up patch adds refcount to
track use of the nf_conntrack_helper structure from packet path so it
remains around until last reference from ct helper extension is dropped.
Export __nf_conntrack_helper_register() which allows to register
nfnetlink_cthelper dynamically allocated helper. Adapt nfnetlink_cthelper
to use the built-in expect_policy[].
This is a preparation patch to add packet path refcounting to helpers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Clément Léger [Thu, 4 Jun 2026 16:07:13 +0000 (09:07 -0700)]
io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
When a bundle recv retries inside io_recv_finish(), the merge logic OR
the saved cflags from the previous iteration with the cflags returned by
the new iteration:
cflags = req->cqe.flags | (cflags & CQE_F_MASK);
Bits listed in CQE_F_MASK are inherited from the new iteration, and all
other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the
saved cflags. Before this change CQE_F_MASK covered only
IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE.
When using provided buffer rings (IOU_PBUF_RING_INC) with incremental
mode, and bundle recv, io_kbuf_inc_commit() can leave the head ring
entry partially consumed, __io_put_kbufs() then sets
IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the
buffer ID will be reused for subsequent completions.
Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above
silently dropped it whenever the final retry iteration partially
consumed the buffer, and the subsequent req->cqe.flags = cflags &
~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the
carried-over cflags had one been present. Userspace would then
wrongfully advance it ring head past an entry the kernel still uses.
Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the
new iteration into the user-visible CQE and stripped from the saved
cflags between iterations.
Wyatt Feng [Tue, 2 Jun 2026 16:46:27 +0000 (00:46 +0800)]
xfrm: espintcp: do not reuse an in-progress partial send
espintcp keeps a single in-flight transmit in ctx->partial.
Before building a new sk_msg, espintcp_sendmsg() first tries to flush
that state through espintcp_push_msgs().
For blocking callers, espintcp_push_msgs() may return success even when
the previous partial send is still pending. espintcp_sendmsg() would
then reinitialize emsg->skmsg and reuse ctx->partial while the old
transfer still owns that state.
Do not rebuild the send message when ctx->partial is still in progress.
If espintcp_push_msgs() returns with emsg->len still set, fail the new
send instead of overwriting the live partial state.
This is a memory-safety fix: reusing the live partial-send state can
leave a stale offset attached to a new sk_msg and lead to an out-of-
bounds read in the send path.
tcp_sendmsg_locked() already handles waiting for send buffer memory, so
the fix here is just to preserve espintcp's one-message-at-a-time
transmit state.
Jens Axboe [Fri, 5 Jun 2026 11:18:58 +0000 (05:18 -0600)]
Merge tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme into for-7.2/block
Pull NVMe updates from Keith:
"- Per-controller timeouts
- Multipath telemetry
- Namespace format validation
- Various other fixes"
* tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits)
nvme: export controller reconnect event count via sysfs
nvme: export controller reset event count via sysfs
nvme: export I/O failure count when no path is available via sysfs
nvme: export I/O requeue count when no path is usable via sysfs
nvme: export command error counters via sysfs
nvme: export multipath failover count via sysfs
nvme: export command retry count via sysfs
nvme: add diag attribute group under sysfs
nvme-tcp: lockdep: use dynamic lockdep keys per socket instance
nvme-tcp: move nvme_tcp_reclassify_socket()
nvme: validate FDP configuration descriptor sizes
nvmet-auth: validate reply message payload bounds against transfer length
nvme: refresh multipath head zoned limits from path limits
nvme: fix FDP fdpcidx bounds check
nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
nvme-multipath: require exact iopolicy names for module parameter
nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()
nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools
...
netfilter: cttimeout: detach dataplane timeout policy and repurpose refcount
Add a refcount for struct nf_ct_timeout which is used by ct extension to
set the custom ct timeout policy, this tells us that the ct timeout is
being used by a conntrack entry. When the last conntrack entry drops the
refcount on the ct timeout, the ct timeout is released.
Remove the refcount for control plane which controls if the ruleset
refers to the timeout policy. After this update, it is possible to
remove the ct timeout policy from nfnetlink_cttimeout immediately.
This is for simplicity not to handle two refcounts on a single object.
Remove nf_queue_nf_hook_drop(): a packet sitting in nfqueue will just
hold a reference to the nf_ct_timeout object until packet is reinjected,
since this is part of the ct extension, this will be released by the
time the conntrack is freed.
nf_ct_untimeout() is still called to clean up in a best effort basis:
the ct timeout on existing entries gets removed when the ct timeout goes
away, but as long as the iptables ruleset still refers to the ct timeout
through a template, new conntracks may keep attaching it and extend its
lifetime until the rule is removed.
nf_ct_untimeout() is not called anymore from module removal path, this
is unlikely to find timeouts give module refcount is bumped, and the new
refcount already tracks the ct timeout policy use so it is released when
unused.
Fixes: 50978462300f ("netfilter: add cttimeout infrastructure for fine timeout tuning") Fixes: 7e0b2b57f01d ("netfilter: nft_ct: add ct timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: synproxy: protect nf_ct_seqadj_init() with conntrack lock
nf_ct_seqadj_init() is called without holding the ct lock. This can race
with nf_ct_seq_adjust() when a connection is in CLOSE state due to an
RST or connection reopening. In addition for SYN_RECV state, concurrent
processing of packets can trigger nf_ct_seq_adjust() too. These
situations create a read/write data race.
As synproxy is the only user of nf_ct_seqadj_init() at the moment, fix
this by holding ct->lock inside nf_ct_seqadj_init() until all is done.
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: synproxy: fix unaligned memory access in timestamp adjustment
Use get_unaligned_be32() and put_unaligned_be32() to safely read and
write the timestamp fields. This prevents performance degradation due to
unaligned memory access or even a crash on strict alignment
architectures.
This follows the implementation of timestamp parsing in the networking
stack at tcp_parse_options() and synproxy_parse_options().
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
RFC 9293 does not mention anything about duplicated options and each
networking stack handles it in their own way. Currently, Linux kernel is
processing options sequentially and in case of duplicated timestamp
options, the value from the latest one overrides the others.
As SYNPROXY is modifying only the first timestamp option found, a packet
can reach the backend server and it might parse the wrong timestamp
value. Let's just continue parsing the following options and in case a
duplicated timestamp is found, adjust it too.
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: synproxy: drop packets if timestamp adjustment fails
If a packet was malformed or if skb_ensure_writable() failed, the
synproxy_tstamp_adjust() function returned 0 indicating an error but it
was ignored on the callers.
Make the function return a boolean instead to clarify the result and
drop the packet if synproxy_tstamp_adjust() failed due to ENOMEM from
skb_ensure_writable(). In addition, if there are malformed options, skip
the tstamp update but do not drop the packet as that should be done by
the policy directly.
Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nfnetlink_cthelper: use {READ,WRITE}_ONCE for accessing helper flags
Conntrack helper flags are accessed from packet and netlink dump path.
Concurrent update of userspace helper flags is not possible, because the
nfnl_mutex in held on updates. These flags are only used by userspace
helpers. Use {READ,WRITE}_ONCE() to access this flags from lockless
paths.
netfilter: nfnetlink_osf: fix mss parsing on big-endian architectures
The MSS calculation in nf_osf_match_one() manually shifts bytes to
construct a 16-bit value before passing it to ntohs().
This works on little-endian hosts but it does not work on big-endian as
the bytes are being always shifted and set in the same way for all
architectures.
Use get_unaligned_be16() to fix this on big-endian systems. It also
simplifies the code.
Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Julian Anastasov [Mon, 25 May 2026 03:54:39 +0000 (06:54 +0300)]
ipvs: add conn_max sysctl to limit connections
Currently, we are using atomic_t to track the number of
connections. On 64-bit setups with large memory there is
a risk this counter to overflow. Also, setups with many
containers may need to tune the limit for connections.
Add sysctl control to limit the number of connections to
1,073,741,824 (64-bit) and 16,777,216 (32-bit).
Depending on the admin's privilege, the value is
used to change a soft or hard limit allowing
unprivileged admins to change the soft limit in
range determined by privileged admins.
Tristan Madani [Tue, 2 Jun 2026 17:16:41 +0000 (17:16 +0000)]
xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
iptfs_destroy_state() calls hrtimer_cancel() while holding a spinlock
that the timer callback also acquires, leading to an ABBA deadlock on
SMP systems.
For the output timer (iptfs_timer):
- iptfs_destroy_state() holds x->lock, calls hrtimer_cancel()
- iptfs_delay_timer() callback takes x->lock
For the drop timer (drop_timer):
- iptfs_destroy_state() holds drop_lock, calls hrtimer_cancel()
- iptfs_drop_timer() callback takes drop_lock
Both timers use HRTIMER_MODE_REL_SOFT, so their callbacks run in softirq
context. When hrtimer_cancel() is called for a soft timer that is
currently executing on another CPU, hrtimer_cancel_wait_running() spins
on softirq_expiry_lock -- the same lock held by the softirq running the
callback. If the callback is blocked waiting for the spinlock held by
the caller of hrtimer_cancel(), a circular dependency forms:
CPU 0: holds lock_A -> waits for softirq_expiry_lock
CPU 1: holds softirq_expiry_lock -> waits for lock_A
Fix by calling hrtimer_cancel() before acquiring the respective locks.
hrtimer_cancel() is safe to call without holding any lock and will wait
for any in-progress callback to complete. For the output timer, the
lock is still acquired afterwards to drain the packet queue. For the
drop timer, the lock/unlock pair is removed entirely since it only
existed to serialize with the timer callback, which hrtimer_cancel()
already guarantees.
__arch_counter_get_cntpct() and __arch_counter_get_cntvct() open-code
the same ECV-aware ALTERNATIVE block that arch_timer_read_cntpct_el0()
and arch_timer_read_cntvct_el0() already provide in the same header.
The two pairs are byte-for-byte identical except for the trailing
arch_counter_enforce_ordering() the __arch_counter_get_* variants add.
Replace the duplicated inline assembly in __arch_counter_get_cntpct()
and __arch_counter_get_cntvct() with calls to the corresponding helpers.
This mirrors commit 00b39d150986 ("arm64: vdso: Use
__arch_counter_get_cntvct()"), which removed similar duplication from
the vDSO, and keeps the system-counter read sequence in a single place,
reducing assembly code in the kernell
No functional change: the resulting inline assembly, alternatives, and
clobbers are unchanged; only the source-level expression of the read
moves into the existing helper.
Verified by rebuilding the consumers of these helpers before and after
the change and comparing the resulting disassembly:
- arch/arm64/kernel/vdso/vdso.so (final linked vDSO):
bit-identical (same sha256 across rebuilds)
- arch/arm64/kernel/vdso/vgettimeofday.o: identical disassembly
- arch/arm64/lib/delay.o: identical disassembly
- drivers/clocksource/arm_arch_timer.o: same 50 functions with
byte-identical instruction streams; only difference is function
ordering inside .text and NOP padding, with no opcodes added or
removed.
Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Will Deacon <will@kernel.org>
kvm->arch.nested_mmus[] is walked under kvm->mmu_lock, including from the
MMU notifier path (kvm_unmap_gfn_range() -> kvm_nested_s2_unmap()), which
can run at any time. kvm_vcpu_init_nested() reallocates the array and frees
the old buffer while holding only kvm->arch.config_lock, so such a walker
can reference the freed array.
Allocate the new array outside of mmu_lock, as the allocation can sleep.
Under the lock, copy the existing entries, fix up the back pointers and
reassign the array. Free the old buffer after dropping the lock, as
kvfree() can sleep as well.
Fixes: 4f128f8e1aaac ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/aiKIVVeIr1aAB1yp@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger,kernel.org
Joey Gouly [Thu, 4 Jun 2026 10:54:34 +0000 (11:54 +0100)]
KVM: arm64: Restore POR_EL0 access to host EL0
CPTR_EL2.E0POE was being cleared in __deactivate_cptr_traps_vhe(), which meant
that any accesses to POR_EL0 from host EL0 would trap and be reported to
userspace as an Illegal instruction. This would happen after running any VM,
regardless if it used POE or not.
ptdesc_t sounds very similar to the core MM struct ptdesc which is actually
the memory descriptor for page table allocations. Hence rename this typedef
element as ptval_t instead for better clarity and separation.
Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: linux-efi@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Thu, 4 Jun 2026 15:11:57 +0000 (17:11 +0200)]
arm64: mm: Defer remap of linear alias of data/bss
Marking the linear alias of data/bss invalid involves calling
set_memory_valid(), which calls split_kernel_leaf_mapping() under the
hood.
On BBML2_NOABORT capable systems, this may result in the need to
allocate page tables at a time when the generic memory allocation APIs
are not yet available, resulting in a splat like
Ard Biesheuvel [Thu, 4 Jun 2026 15:11:56 +0000 (17:11 +0200)]
KVM: arm64: Omit tag sync on stage-2 mappings of the zero page
Commit
f620d66af316 ("arm64: mte: Do not flag the zero page as PG_mte_tagged")
removed the PG_mte_tagged flag from the zero page, but missed a KVM code
path that may set this flag on the zero page when it is used in a
stage-2 CoW mapping of anonymous memory.
So disregard the zero page explicitly in sanitise_mte_tags().
Fixes: f620d66af316 ("arm64: mte: Do not flag the zero page as PG_mte_tagged") Cc: stable@vger.kernel.org # 5.10.x Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Thu, 4 Jun 2026 15:11:55 +0000 (17:11 +0200)]
arm64: Avoid double evaluation of __ptep_get()
Sashiko warns that the new pte_valid_noncont() macro is used in a manner
where the argument (which performs a READ_ONCE() of the descriptor) is
evaluated twice.
Drop the macro that we just added, and move the check into the newly
added users.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Ard Biesheuvel [Thu, 4 Jun 2026 15:11:53 +0000 (17:11 +0200)]
arm64: Rename page table BSS section to .bss..pgtbl
Rename the .pgdir.bss section to .bss..pgtbl so that the compiler will
notice the leading ".bss" and mark it as NOBITS by default (rather than
PROGBITS, which would take up space in Image binary, forcing all of the
preceding BSS to be emitted into the image as well). This supersedes the
NOLOAD linker directive, which achieves the same thing, and can be
therefore be dropped.
Also, rename .pgdir to .pgtbl to be more generic, as page tables of
various levels will reside here.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Removing the try_vesa_interface gate caused a backlight regression on
panels whose VBT correctly reports INTEL_BACKLIGHT_DISPLAY_DDI and whose
PWM path is the actual backlight control, but whose DPCD optimistically
advertises DP_EDP_BACKLIGHT_AUX_ENABLE_CAP / _BRIGHTNESS_AUX_SET_CAP.
After the commit such panels silently bind to the VESA AUX backlight
funcs; AUX writes complete but the panel ignores them, leaving
brightness stuck (no-op backlight). Observed on at least KBL and TGL
eDP setups.
HyeongJun An [Thu, 4 Jun 2026 15:00:52 +0000 (00:00 +0900)]
x86/process: Convert rdmsr() to rdmsrq() in arch_post_acpi_subsys_init() to address W=1 warning
arch_post_acpi_subsys_init() reads MSR_K8_INT_PENDING_MSG with rdmsr()
into a lo/hi pair but only uses the low 32 bits: K8_INTP_C1E_ACTIVE_MASK
(0x18000000) lies entirely within them. The 'hi' half is never consumed,
which triggers a -Wunused-but-set-variable warning under W=1:
arch/x86/kernel/process.c: In function 'arch_post_acpi_subsys_init':
arch/x86/kernel/process.c:972:17: warning: variable 'hi' set but not used
Read the full MSR into a single u64 with rdmsrq() and test the mask
against it, dropping the now-unnecessary lo/hi variables.
Hyunwoo Kim [Wed, 3 Jun 2026 12:09:33 +0000 (21:09 +0900)]
KVM: arm64: Take the SRCU lock for page table walks in fault injection and AT emulation
walk_s1() and kvm_walk_nested_s2() expect to be called while holding
kvm->srcu to guard against memslot changes. While this is generally
the case, __kvm_at_s12() and __kvm_find_s1_desc_level() call into the
respective walkers without taking kvm->srcu.
Fix by acquiring kvm->srcu prior to the table walk in both instances.
Cc: stable@vger.kernel.org Fixes: 50f77dc87f13 ("KVM: arm64: Populate level on S1PTW SEA injection") Fixes: be04cebf3e78 ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}") Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/aiAZfdeyanIvP8SD@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org>
Hyunwoo Kim [Mon, 1 Jun 2026 14:53:26 +0000 (23:53 +0900)]
KVM: arm64: vgic-its: Drop the translation cache reference only for the erased entry
vgic_its_invalidate_cache() walks the per-ITS translation cache with
xa_for_each() and drops the cache's reference on each entry with
vgic_put_irq(). It puts the iterated pointer, though, rather than the
value returned by xa_erase().
The function is called from contexts that do not exclude one another: the
ITS command handlers hold its_lock, the GITS_CTLR write path holds
cmd_lock, and the path that clears EnableLPIs in a redistributor's
GICR_CTLR holds neither. Two or more of them can drain the same cache
concurrently, and if each one observes the same entry, erases it and then
puts it, the single reference the cache holds on that entry is dropped
more than once. The entry can then be freed while an ITE still maps it.
xa_erase() is atomic and returns the previous entry, so put only the entry
that this context actually removed. The cache reference is then dropped
exactly once per entry even when the invalidations run concurrently, and
the behavior is unchanged when only one context runs.
Fixes: 8201d1028caa ("KVM: arm64: vgic-its: Maintain a translation cache per ITS") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/ah2c5lu4JbUg7dj-@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
The Realtek interrupt driver currently supports only single core
systems. So the higher end devices like RTL839x and RTL930x with
dual VPEs must be driven with NR_CPU=1. Enhance the driver to
support multicore (dual VPE) systems. For this:
- Extend the register map for multiple cores
- Search for multiple CPU cores in the devicetree
- Improve the register helpers to support multiple cores
- Add an affinity setter
- Enhance the IRQ handler for multiple cores
The usage of these registers is very inconsistent. GIMR is addressed
directly while IRR has a helper that needs a macro as an input. Harmonize
this by providing consistent helpers that improve code readability.
The callers of these helpers use classic lock/unlock functions and
sometimes use the wrong locking helper. E.g. irqsave variants are used in
mask/unmask although not needed. Adapt and fix the surrounding call
locations.
Tony Luck [Fri, 5 Jun 2026 04:46:49 +0000 (21:46 -0700)]
x86/resctrl: Only check Intel systems for SNC
topology_num_nodes_per_package() reports values greater than one on certain
AMD systems resulting in resctrl's Intel model specific SNC detection
printing the confusing message:
"CoD enabled system? Resctrl not supported"
Add a check for Intel systems before looking at the topology.
Namjae Jeon [Mon, 18 May 2026 11:46:55 +0000 (20:46 +0900)]
iomap: introduce IOMAP_F_ZERO_TAIL flag
In filesystems that maintain a separate Valid Data Length, such as exFAT
and NTFS, a partial write may start at or beyond the current valid_size and
extend it. In this case, the region after the previous valid_size but
within the same filesystem block is considered unwritten.
This patch introduces IOMAP_F_ZERO_TAIL. When this flag is set in iomap,
__iomap_write_begin() will zero only the tail portion while preserving any
valid data before it in the same block.
Without this tail zeroing, stale data in the unwritten portion of the block
can remain in the page cache. Subsequent reads can then return incorrect
contents from that region.
Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Link: https://patch.msgid.link/20260518114705.9601-2-linkinjeon@kernel.org Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Gary Guo [Tue, 2 Jun 2026 14:17:55 +0000 (15:17 +0100)]
rust: dma: update to keyworded index projection syntax
Demonstrate the preferred syntax of index projection in DMA documentation
and examples. A few `[i]?` cases are converted to demonstrate the new
variant.
Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Gary Guo <gary@garyguo.net> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-4-6989470f5440@garyguo.net Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Gary Guo [Tue, 2 Jun 2026 14:17:54 +0000 (15:17 +0100)]
rust: ptr: add panicking index projection variant
There have been a few cases where the programmer knows that the indices are
in bounds but the compiler cannot deduce that. This is also
compiler-version-dependent, so using build indexing here can be
problematic. On the other hand, it is also not ideal to use the fallible
variant, as it adds an error handling path that is never hit.
Add a new panicking index projection for this scenario. Like all panicking
operations, this should be used carefully only in cases where the user
knows the index is going to be in bounds, and panicking would indicate
something is catastrophically wrong.
To signify this, require users to explicitly denote the type of index being
used. The existing two types of index projections also gain the keyworded
version, which will be the recommended way going forward.
The keyworded syntax also paves the way of perhaps adding more flavors in
the future, e.g. `unsafe` index projection. However, unless the code is
extremely performance sensitive and bounds checking cannot be tolerated,
the panicking variant is safer and should be preferred, so it will be left
to the future when demand arises.
Signed-off-by: Gary Guo <gary@garyguo.net> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Danilo Krummrich <dakr@kernel.org> Link: https://patch.msgid.link/20260602-projection-syntax-rework-v2-3-6989470f5440@garyguo.net
[ Fixed broken intra-doc link. Added a few extra intra-doc links. Reworded
some docs slightly. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Gary Guo [Tue, 2 Jun 2026 14:17:52 +0000 (15:17 +0100)]
rust: ptr: rename `ProjectIndex::index` to `build_index`
The corresponding `SliceIndex` trait in Rust uses `index` to mean the
panicking variant, which is also being added to `ProjectIndex`. Hence
rename our custom `build_error!` index variant to `build_index`.
Kyle Zeng [Fri, 5 Jun 2026 08:02:04 +0000 (01:02 -0700)]
ALSA: seq: dummy: fix UMP event stack overread
The dummy sequencer port forwards events by copying an incoming
struct snd_seq_event into a stack temporary, rewriting source and
destination, and dispatching the temporary to subscribers. That legacy
event storage is smaller than struct snd_seq_ump_event.
When a UMP event reaches the dummy client, the copy leaves the UMP flag
set but only provides legacy-sized stack storage. The subscriber
delivery path then uses snd_seq_event_packet_size() and copies a
UMP-sized packet from that stack object, reading past the end of the
temporary.
Use the existing union __snd_seq_event storage and copy the packet size
reported for the incoming event before rewriting the common routing
fields. This preserves the full UMP packet for UMP events while keeping
legacy event handling unchanged.
Merge patch series "proc: protect ptrace_may_access() with exec_update_lock"
Jann Horn <jannh@google.com> says:
My understanding is that procfs is effectively maintained by the VFS
maintainers (though scripts/get_maintainer.pl claims that there are
no maintainers for procfs because the VFS entry only claims files
directly in fs/, and the procfs entry has no maintainers listed on
it).
In procfs, most uses of ptrace_may_access() should use
exec_update_lock to avoid TOCTOU issues with concurrent privileged
execve() (like setuid binary execution).
This series doesn't fix all the remaining issues in procfs, but it fixes
the easy cases for now; I will probably follow up with fixes for the
gnarlier cases later unless someone else wants to do that.
I have checked that procfs files still work with these changes and that
CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
(checkpatch complains about missing argument names in
proc_op::proc_get_link, but that was already the case before my patch.)
* patches from https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-0-5c3d20e0ac33@google.com:
proc: protect ptrace_may_access() with exec_update_lock (FD links)
proc: protect ptrace_may_access() with exec_update_lock (part 1)
Jann Horn [Mon, 18 May 2026 16:35:16 +0000 (18:35 +0200)]
proc: protect ptrace_may_access() with exec_update_lock (FD links)
proc_pid_get_link() and proc_pid_readlink() currently look up the task from
the pid once, then do the ptrace access check on that task, then look up
the task from the pid a second time to do the actual access.
That's racy in several ways.
To fix it, pass the task to the ->proc_get_link() handler, and instead of
proc_fd_access_allowed(), introduce a new helper call_proc_get_link() that
looks up and locks the task, does the access check, and calls
->proc_get_link().
Jann Horn [Mon, 18 May 2026 16:35:15 +0000 (18:35 +0200)]
proc: protect ptrace_may_access() with exec_update_lock (part 1)
Fix the easy cases where procfs currently calls ptrace_may_access() without
exec_update_lock protection, where the fix is to simply add the extra lock
or use mm_access():
- do_task_stat(): grab exec_update_lock
- proc_pid_wchan(): grab exec_update_lock
- proc_map_files_lookup(): use mm_access() instead of get_task_mm()
- proc_map_files_readdir(): use mm_access() instead of get_task_mm()
- proc_ns_get_link(): grab exec_update_lock
- proc_ns_readlink(): grab exec_update_lock
wifi: nl80211: Increase ie_len size to prevent truncated IEs in new peer notifications
Currently, ie_len in cfg80211_notify_new_peer_candidate is defined as
1-byte field, capping the maximum IE list size at 255 bytes. When a
large beacon is received, the IE list is truncated, passing incomplete
data to wpa_supplicant. This causes supplicant to fail parsing the IEs.
Increasing the size of ie_len to allow the full length of the IE list to
be forwarded properly.
Rosen Penev [Fri, 5 Jun 2026 00:56:27 +0000 (17:56 -0700)]
wifi: mac80211: fold tid_ampdu_rx allocations into a flexible array
Convert the separately-allocated reorder_buf pointer to a C99 flexible
array member at the end of struct tid_ampdu_rx, with both the
sk_buff_head and the jiffies timestamp in each array element. This
collapses three allocations into one and removes the corresponding
kfree() pairs from the error and free paths.
wifi: mac80211: Fix -Wc23-extensions in hwmp_route_info_get()
When building with a version of clang that supports
'-fms-anonymous-structs' (which will be used by the kernel instead of
the wider '-fms-extensions'), there are a couple warnings after some
recent mesg_hwmp.c changes:
net/mac80211/mesh_hwmp.c:373:3: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
373 | struct ieee80211_mesh_hwmp_preq_top *preq_elem_top =
| ^
net/mac80211/mesh_hwmp.c:390:3: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
390 | struct ieee80211_mesh_hwmp_prep_top *prep_elem_top =
| ^
2 errors generated.
Enclose the switch case blocks in braces to clear up the warning.
Sven Eckelmann [Thu, 4 Jun 2026 19:35:52 +0000 (21:35 +0200)]
batman-adv: fix kernel-doc typos and grammar errors
Various minor errors were gathered over the time in batman-adv's kernel-doc
comments. Get rid of many of them before they are copied (again) to new
functions.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
All receive handlers in batman-adv are consuming the skbuff independent of
the result of the handler. The "(without freeing the skb) on failure" is
therefore not corrrect anymore for the current implementation.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 4 Jun 2026 19:58:31 +0000 (21:58 +0200)]
batman-adv: uapi: keep kernel-doc in struct member order
The order of the members of struct batadv_coded_packet and struct
batadv_unicast_tvlv_packet didn't match the kernel doc. This is the case
for all other structures and should also be done the same way for these
two.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 4 Jun 2026 19:35:04 +0000 (21:35 +0200)]
batman-adv: bla: update stale kernel-doc
The bridge-loop-avoidance code was changed recently to avoid inconsistent
state and race condition problems. The kernel-doc addded in these commits
(and related code) has various minor deficits which are now resolved.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 4 Jun 2026 09:09:04 +0000 (11:09 +0200)]
batman-adv: tp_meter: update stale kernel-doc after refactoring
The tp_meter codebase was recently refactored:
* throughput meter sender and receiver variables were split into
two different structures
* the congestion control variables were extracted in a separate structure
But the kernel-doc was not updated everywhere to reflect these changes.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Wed, 3 Jun 2026 08:47:53 +0000 (10:47 +0200)]
batman-adv: document cleanup of batadv_wifi_net_devices entries
It doesn't seem to be obvious how the entries from the
batadv_wifi_net_devices rhashtable are getting removed before the actual
rhashtable is destroyed. Document the idea behind the process and which
steps are involved.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 2 Jun 2026 15:46:10 +0000 (17:46 +0200)]
batman-adv: use GFP_KERNEL allocations for the wifi detection cache
The batadv_wifi_net_device_insert() is called with ASSERT_RTNL() held, but
not inside a spinlock or another context which prevents "might_sleep"
functions. To relax the requirements for the allocator, use GFP_KERNEL.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 2 Jun 2026 15:41:15 +0000 (17:41 +0200)]
batman-adv: drop duplicated wifi_flags assignments
During the initialization of the batadv_wifi_net_device_state, it is enough
to write the wifi_flags once before the batadv_wifi_net_device_state is
added to the batadv_wifi_net_devices rhashtable.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 26 May 2026 07:09:32 +0000 (09:09 +0200)]
batman-adv: convert cancellation of work items to disable helper
With commit 86898fa6b8cd ("workqueue: Implement disable/enable for
(delayed) work items"), work queues gained the ability to permanently
disallow re-queuing of work items. This is particularly important during
object teardown, where a work item must not be re-armed after shutdown
begins.
Convert all cancel_work_sync() and cancel_delayed_work_sync() call sites to
their disable_* equivalents to clarify the intent to prevent re-arming
after teardown.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 4 Jun 2026 08:58:51 +0000 (10:58 +0200)]
batman-adv: tp_meter: initialize last_recv_time during init
The last_recv_time is the most important indicator for a receiver session
to figure out whether a session timed out or not. But this information was
only initialized after the session was added to the tp_receiver_list and
after the timer was started.
In the worst case, the timer (function) could have tried to access this
information before the actual initialization was reached. Like rest of the
variables of the tp_meter receiver session, this field has to be filled out
before any other (parallel running) context has the chance to access it.
Cc: stable@kernel.org Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Al Viro [Tue, 5 May 2026 04:20:19 +0000 (00:20 -0400)]
make cursors NORCU
All it requires is making sure that d_walk() will skip *all*
CURSOR dentries, even if somebody passes it one as an argument.
Cursors are negative and unhashed all along, never get added to
LRU or to shrink lists and no RCU references via ->d_sib are
possible for those - dentry_unlist() makes sure that no killed
dentry has ->d_sib.next left pointing to a cursor.
Seeing that a cursor is allocated every time we open a directory
on autofs, debugfs, devpts, etc., avoiding an RCU delay when such
opened files get closed is attractive...
Al Viro [Wed, 15 Apr 2026 23:29:53 +0000 (19:29 -0400)]
nfs: get rid of fake root dentries
... just grab the reference to the (real) root we are about to return
for the first mount of this superblock and be done with that.
Once upon a time dentry tree eviction at fs shutdown used to break
if ->s_root had been spliced on top of something; that hadn't been
the case for years now, and these fake root dentries violate a bunch
of invariants. Let's get rid of them...
Al Viro [Sat, 18 Apr 2026 22:39:03 +0000 (18:39 -0400)]
wind ->s_roots via ->d_sib instead of ->d_hash
shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.
That relies upon dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list. Unfortunately, the secondary roots are linked
via ->d_hash, rather than ->d_sib and they become removed from
that list before their inode references are dropped.
If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and gets to that point in dentry_kill()
when shrink_dcache_for_umount() is looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.
Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.
AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use d_add_waiter() machinery to wait for eviction in progress.
Changes:
* secondary roots look the same as ->s_root from d_unhashed()
and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib"
instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock
spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating
a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and
removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and
removes it from the list; no need to play the games for d_walk() sake,
since the latter is not going to look for the next sibling of those
anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to
the list of secondary roots and if it runs into one that is in the middle
of eviction waits for that to finish.
Al Viro [Thu, 23 Apr 2026 18:29:18 +0000 (19:29 +0100)]
shrinking rcu_read_lock() scope in d_alloc_parallel()
The current use of rcu_read_lock() uses in d_alloc_parallel()
is fairly opaque - the single large scope serves two purposes.
We start with lookup in normal hash, and there rcu_read_lock()
scope puts __d_lookup_rcu() and subsequent lockref_get_not_dead() into
the same RCU read-side critical area.
If no match is found, we proceed to lock the hash chain of
in-lookup hash and scan that for a match. If we find a match, we want
to grab it and wait for lookup in progress to finish. Since the bitlock
we use for these hash chains has to nest inside ->d_lock, we need to
unlock the chain first and use lockref_get_not_dead() on the match.
That has to be done without breaking the RCU read-side critical area,
and we use the same rcu_read_lock() scope to bridge over.
The thing is, after having grabbed the reference (and it is
very unlikely to fail) we proceed to grab ->d_lock - d_wait_lookup()
and __d_lookup_unhash()/__d_wake_in_lookup_waiters() are using that for
serialization. That makes lockref_get_not_dead() pointless - trying
to avoid grabbing ->d_lock for refcount increment, only to grab it
anyway immediately after that. If we grab ->d_lock first and replace
lockref_get_not_dead() with direct check for sign and increment if
non-negative we can move rcu_read_unlock() to immediately after grabbing
->d_lock. Moreover, we don't need the RCU read-side critical area to
be contiguous since before earlier __d_lookup_rcu() - we can just as
well terminate the earlier one ASAP and call rcu_read_lock() again only
after having found a match (if any) in the in-lookup hash chain.
That makes the entire thing easier to follow and the purpose
of those rcu_read_lock() calls easier to describe - the first scope is
for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges
over from the bitlock scope to the ->d_lock scope on the match found in
in-lookup hash.
Al Viro [Tue, 21 Apr 2026 19:52:13 +0000 (15:52 -0400)]
d_walk(): shrink rcu_read_lock() scope
we only need it to bridge over from ->d_lock scope of child to ->d_lock
scope of parent; dropping ->d_lock at rename_retry doesn't need to be
in rcu_read_lock() scope.
Al Viro [Sat, 11 Apr 2026 08:17:02 +0000 (04:17 -0400)]
adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
Pull dropping ->d_lock on lock_for_kill() failure into lock_for_kill() itself.
That reduces dentry_kill() to
if (!lock_for_kill(dentry))
return NULL;
return __dentry_kill(dentry);
at which point it's easier to move that if (...) into the beginning of __dentry_kill()
itself and rename it into dentry_kill().
Document the new calling conventions of lock_for_kill().
Al Viro [Sat, 11 Apr 2026 08:01:28 +0000 (04:01 -0400)]
Document rcu_read_lock() use in select_collect2()
If select_collect2() finds something that is neither busy nor can
be moved to shrink list, it needs to return that to caller's caller
(shrink_dcache_tree()) ASAP and do so without grabbing references (among
other things, it might be already dying, in which case refcount can't be
incremented). We are called inside a ->d_lock scope, but that scope is
going to be terminated as soon as we return to caller (d_walk()); ->d_lock
will be retaken by shrink_dcache_tree(), but we need to bridge between
these scopes, turning them into contiguous RCU read-side critical area.
We do that with rcu_read_lock() scope - it spans from unbalanced
rcu_read_lock() in select_collect2() to unbalanced rcu_read_unlock()
in shrink_dcache_tree(). That works, but it really needs to be documented;
it's rather unidiomatic and it had caused quite a bit of confusion - some
of it in form of patches "fixing" the damn thing.