Ricardo Ribalda [Thu, 7 May 2026 20:58:06 +0000 (20:58 +0000)]
media: v4l2-dev: Add range check for vdev->minor
If the fixed minor ranges are not properly set we could end up in a
situation where the calculated minor is invalid. Add a check for this in
the code to make it more robust.
This check also fixes the following false positive smatch warning:
Hungyu Lin [Thu, 7 May 2026 02:22:13 +0000 (02:22 +0000)]
media: tegra-video: vi: fix invalid u32 return value in format lookup
tegra_get_format_fourcc_by_idx() returns a u32 but uses -EINVAL to
signal an out-of-bounds index. This results in a large unsigned
value being returned, which may be interpreted as a valid fourcc.
Returning 0 is not a valid fourcc either. This condition should
never happen, so use WARN_ON_ONCE() to catch unexpected out-of-bounds
access and return a valid fallback format instead.
Suggested-by: Hans Verkuil <hverkuil+cisco@kernel.org>
Fixes: 3d8a97eabef0 ("media: tegra-video: Add Tegra210 Video input driver")
Cc: stable@vger.kernel.org
Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
Signed-off-by: Hungyu Lin <dennylin0707@gmail.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
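The conversion problem described above can be reproduced in user space. The following is a minimal sketch, not the driver code: the format table and all demo_* names are hypothetical, and the fixed variant only models "return a valid fallback" without the kernel's WARN_ON_ONCE().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Build a fourcc code from four characters (V4L2-style layout). */
#define DEMO_FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Hypothetical format table, for illustration only. */
static const uint32_t demo_formats[] = {
	DEMO_FOURCC('Y', 'U', 'Y', 'V'),
	DEMO_FOURCC('N', 'V', '1', '2'),
};
#define DEMO_NUM_FORMATS (sizeof(demo_formats) / sizeof(demo_formats[0]))

/* Buggy pattern: the negative errno is converted to a large unsigned value. */
static uint32_t demo_fourcc_by_idx_buggy(unsigned int idx)
{
	if (idx >= DEMO_NUM_FORMATS)
		return -EINVAL;	/* converted modulo 2^32, no longer negative */
	return demo_formats[idx];
}

/* Fixed pattern: fall back to a valid table entry on a bad index. */
static uint32_t demo_fourcc_by_idx_fixed(unsigned int idx)
{
	if (idx >= DEMO_NUM_FORMATS)
		return demo_formats[0];
	return demo_formats[idx];
}

/* Self-check: the buggy variant yields a huge value, the fixed one a real format. */
static void demo_fourcc_check(void)
{
	assert(demo_fourcc_by_idx_buggy(5) > 0x7fffffffu);
	assert(demo_fourcc_by_idx_fixed(5) == demo_formats[0]);
}
```

A caller that compares the buggy return value against known fourcc codes has no way to tell it apart from an unknown but legitimate format, which is why a fallback is used instead of an error code.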
Ben Hoff [Sun, 10 May 2026 23:50:36 +0000 (19:50 -0400)]
media: pci: add AVMatrix HWS capture driver
Add an in-tree AVMatrix HWS PCIe capture driver. The driver supports
up to four HDMI inputs and exposes the video capture path through
V4L2 with vb2-dma-contig streaming, DV timings, and per-input
controls. Audio support is intentionally omitted from this
submission.
This patch also adds the MAINTAINERS entry for the new driver.
This driver is derived from a GPL out-of-tree driver.
Changes since v6:
- v6 accidentally contained legacy history, resubmitting with latest
- remove an unused mode-change label reported by W=1
Changes since v5:
- keep queue_setup() and alloc_sizeimage on the logical sizeimage value
- drop the dead queue_setup() fallback that rebuilt pix.sizeimage
on the fly
- clarify that hws_calc_sizeimage() models the packed-YUYV-only path
- add Assisted-by attribution for Codex
Changes since v4:
- replace plain 64-bit elapsed-time divisions in debug logging with
div_u64() so i386 module builds do not emit __udivdi3 references
Changes since v3:
- fold the MAINTAINERS update into this patch so per-patch CI sees the
new file pattern
- wrap the validation text for checkpatch
Changes since v2:
- keep scratch DMA allocation on a single probe-owned path
- avoid double-freeing V4L2 control handlers on register unwind
- drop the extra per-node resolution sysfs ABI
- turn live geometry changes into explicit SOURCE_CHANGE renegotiation
- report live DV timings and reject attempts to retime a live source
- stop advertising RESOLUTION source changes for fps-only updates
- keep live fps state across harmless S_FMT restarts
- stop exposing an unvalidated DV RX power-present signal
- clean the imported sources for checkpatch and W=1 builds
Validation:
- build-tested with W=1 against a local kernel build tree
- compiled the driver with ARCH=i386 allmodconfig and verified the
resulting hws_pci.o, hws_video.o, and hws.o do not reference
__udivdi3
- v4l2-compliance 1.33.0-5459 from v4l-utils commit 4a0d2c3b4f523406cb9a6f4c541ef14f72f19f3d on /dev/video2:
48 tests succeeded, 0 failed, 1 warning
DV_RX_POWER_PRESENT is intentionally left unsupported in this revision
because current hardware evidence does not expose a validated
receiver-side power-detect signal distinct from active video presence.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604020522.z22eZuW8-lkp@intel.com/
Assisted-by: Codex:gpt-5.5
Signed-off-by: Ben Hoff <hoff.benjamin.k@gmail.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
The two halves of the splat refer to two different events on
&ht->mutex.
The kswapd0 path is unambiguous: shmem_evict_inode at mm/shmem.c:1429
calls simple_xattrs_free(), which calls rhashtable_free_and_destroy()
on the per-inode simple_xattrs rhashtable being torn down with the
inode.
The previously-recorded ht->mutex -> fs_reclaim edge comes from
rht_deferred_worker -> rhashtable_rehash_alloc ->
bucket_table_alloc(GFP_KERNEL) -> __kvmalloc_node ->
might_alloc -> fs_reclaim. That stack stops at generic library code:
there is no subsystem-specific frame above rht_deferred_worker, so
the splat does not identify which rhashtable's worker recorded the
edge -- only that some rhashtable in the system did.
Whether or not that recording happened on the same simple_xattrs ht
that is now being destroyed, the predicted deadlock cannot occur:
rhashtable_free_and_destroy() does cancel_work_sync(&ht->run_work)
before taking ht->mutex, so the deferred worker cannot be running on
the instance being torn down. If the recording was on a different
rhashtable instance, the two ht->mutex acquisitions are on distinct
mutex objects and cannot deadlock either.
Lockdep flags a cycle regardless because mutex_init(&ht->mutex) lives
on a single source line in rhashtable_init_noprof(), so every
ht->mutex in the kernel shares one static lockdep class. Lockdep
matches by class, not by instance, and collapses all of these into
one node.
Lift the lockdep key out of rhashtable_init_noprof() and into the
caller. The user-visible rhashtable_init_noprof() /
rhltable_init_noprof() identifiers become macros that declare a
per-call-site static lock_class_key.
Link: https://patch.msgid.link/20260427-work-rhashtable-lockdep-v1-1-f69e8bd91cb2@kernel.org
Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc")
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: syzbot+5af806780f38a5fe691f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/69e798fe.050a0220.24bfd3.0032.GAE@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
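The idea of a per-call-site key can be modelled in plain C. This is only a sketch under stated assumptions: struct demo_lock_class_key, demo_table_init_key() and the demo_table_init() macro are hypothetical stand-ins, not the rhashtable API, and the "class" is represented by nothing more than the address of a static object.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a lock class key; only its address matters here. */
struct demo_lock_class_key {
	char dummy;
};

struct demo_table {
	const struct demo_lock_class_key *key;	/* identifies the lock class */
};

/* The key is supplied by the caller instead of living in the init function. */
static void demo_table_init_key(struct demo_table *t,
				const struct demo_lock_class_key *key)
{
	assert(t && key);
	t->key = key;
}

/* Each expansion declares its own static key: one class per call site. */
#define demo_table_init(t)					\
	do {							\
		static struct demo_lock_class_key demo_key_;	\
		demo_table_init_key((t), &demo_key_);		\
	} while (0)

/* Two hypothetical users, each with its own call site. */
static void demo_init_xattrs(struct demo_table *t)
{
	demo_table_init(t);
}

static void demo_init_other(struct demo_table *t)
{
	demo_table_init(t);
}
```

Tables initialized through the same call site still share a key, while tables from different call sites no longer do, which is the property that keeps unrelated users from being collapsed into one node.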
drm/i915/dp: Fix VSC dynamic range signaling for RGB formats
For RGB, set dynamic_range to CTA or VESA based on
crtc_state->limited_color_range so sinks apply correct
quantization. YCbCr remains limited (CTA) range.
(DP v1.4, Table 5-1)
drm/i915: skip __i915_request_skip() for already signaled requests
After a GPU reset the HWSP is zeroed, so previously completed
requests appear incomplete. If such a request is picked up during
reset_rewind() and marked guilty, i915_request_set_error_once()
returns early (fence already signaled), leaving fence.error without
a fatal error code. The subsequent __i915_request_skip() then hits:
```
GEM_BUG_ON(!fatal_error(rq->fence.error))
```
Fixes a kernel BUG observed on Sandy Bridge (Gen6) during
heartbeat-triggered engine resets.
```
kernel BUG at drivers/gpu/drm/i915/i915_request.c:556!
RIP: __i915_request_skip+0x15e/0x1d0 [i915]
...
__i915_request_reset+0x212/0xa70 [i915]
reset_rewind+0xe4/0x280 [i915]
intel_gt_reset+0x30d/0x5b0 [i915]
heartbeat+0x516/0x530 [i915]
```
Guard __i915_request_skip() with i915_request_signaled(): if the
fence is already signaled, the ring content is committed and there
is nothing left to skip.
Fixes: 36e191f0644b ("drm/i915: Apply i915_request_skip() on submission")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/13729
Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
Cc: stable@vger.kernel.org # v5.7+
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://lore.kernel.org/r/fe76921d35b6ae85aa651822726d0d9815aa5362.1776339012.git.sebastian.brzezinka@intel.com
(cherry picked from commit 5ba54393dcd7adf75a9f39f5a933b1538349cad5)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Sophie D [Sat, 9 May 2026 02:54:05 +0000 (22:54 -0400)]
drm/gud: Add RCade Display Adapter VID/PID pair
The RCade Display Adapter is a hardware device that allows driving an
Arcade CRT display via the GUD protocol. Currently it spoofs an
existing GUD VID/PID pair. However, now that it has its own pair
assigned, it makes sense to add this to the list of pairs that GUD
supports natively.
More information can be found in the project repositories:
https://gitlab.scd31.com/stephen/stm32-usb-vga-adapter-hardware
https://gitlab.scd31.com/stephen/stm32-usb-vga-rcade-adapter
Sven Eckelmann [Sat, 2 May 2026 19:25:19 +0000 (21:25 +0200)]
batman-adv: tt: prevent TVLV entry number overflow
The helpers to prepare the buffers for the local and global TT based
replies are trying to sum up all TT entries which can be found for each
VLAN. In theory, this sum can be too big for a u16 and therefore overflow.
A too small buffer would then be allocated for the TVLV.
The too small buffer will be handled gracefully by
batadv_tt_tvlv_generate() and is not causing a buffer overflow - just a
truncated reply. But this overflow shouldn't have happened in the first
place, and the too small buffer should never have been allocated when an
overflow was detected.
Cc: stable@kernel.org
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
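One way to detect such an overflow is to accumulate in a wider type and check after every addition. The sketch below assumes a hypothetical helper, demo_sum_tt_entries(), operating on a plain array of per-VLAN counts; it is not the batman-adv code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum per-VLAN entry counts into a u16 total.
 * Returns false, leaving *total untouched, if the sum does not fit.
 */
static bool demo_sum_tt_entries(const uint16_t *per_vlan, size_t num_vlans,
				uint16_t *total)
{
	/* Checked after every addition, so sum stays below 2 * UINT16_MAX. */
	uint32_t sum = 0;
	size_t i;

	assert(per_vlan && total);
	for (i = 0; i < num_vlans; i++) {
		sum += per_vlan[i];
		if (sum > UINT16_MAX)
			return false;	/* caller must not allocate a buffer */
	}
	*total = (uint16_t)sum;
	return true;
}
```

With a plain u16 accumulator, 40000 + 30000 would silently wrap to 4464 and the buffer would be sized for that smaller number.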
Sven Eckelmann [Sat, 2 May 2026 18:47:34 +0000 (20:47 +0200)]
batman-adv: tt: avoid empty VLAN responses
The commit 16116dac2339 ("batman-adv: prevent TT request storms by not
sending inconsistent TT TLVLs") added checks to the local (direct) TT
response code. But the response can also be done indirectly by another node
using the global TT state. To avoid the inconsistent states reported in
the original fix, also avoid sending empty VLANs for replies from the
global TT state.
Cc: stable@kernel.org
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 2 May 2026 17:47:11 +0000 (19:47 +0200)]
batman-adv: tt: fix TOCTOU race for reported vlans
The local TT based TVLV is generated by first checking the number of VLANs
which have at least one TT entry. A new buffer with the correct size for
the VLANs is then allocated. Only then, the list of VLANs is used to fill
the VLAN entries in the buffer. During this time, the meshif_vlan_list_lock
is held. But the actual number of TT entries of each VLAN can still
increase during this time - just not the number of VLANs in the list.
But the prefilter used in the buffer size calculation might still cause an
increase of the number of VLANs which need to be stored. Simply because a
VLAN might now suddenly have at least one entry when it had none in the
pre-alloc check - and then needs to occupy space which was not allocated.
It is better to overestimate the buffer size at the beginning and then fill
the buffer only with the VLANs which are not empty.
Cc: stable@kernel.org
Fixes: 16116dac2339 ("batman-adv: prevent TT request storms by not sending inconsistent TT TLVLs")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
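The "overestimate, then filter while filling" approach can be sketched as follows. The struct and helpers are hypothetical simplifications: a real TVLV carries more than a VLAN ID per slot, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified VLAN record. */
struct demo_vlan {
	uint16_t vid;
	unsigned int tt_entries;	/* may grow between sizing and filling */
};

/* Size for the worst case: every VLAN in the list, empty or not. */
static size_t demo_slots_for(size_t num_vlans)
{
	return num_vlans;
}

/* Store only VLANs that are non-empty at fill time; return slots used. */
static size_t demo_fill(const struct demo_vlan *vlans, size_t num_vlans,
			uint16_t *out, size_t out_slots)
{
	size_t used = 0;
	size_t i;

	for (i = 0; i < num_vlans; i++) {
		if (!vlans[i].tt_entries)
			continue;
		/* Holds as long as out_slots >= num_vlans. */
		assert(used < out_slots);
		out[used++] = vlans[i].vid;
	}
	return used;
}
```

Because the size no longer depends on which VLANs were empty at sizing time, a VLAN that gains its first entry before the fill step still has a slot available.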
Sven Eckelmann [Sat, 2 May 2026 17:53:21 +0000 (19:53 +0200)]
batman-adv: tt: fix negative last_changeset_len
batadv_priv_tt::last_changeset_len was declared as s16, but the field is
never intended to hold a negative value. When a value greater than 32767 is
assigned, it wraps to a negative signed integer.
In batadv_send_my_tt_response(), last_changeset_len is temporarily widened
to s32. The incorrectly negative s16 value propagates into the s32, causing
batadv_tt_prepare_tvlv_local_data() to allocate a full sized buffer but
populate only a small portion of it with the collected changeset. All
remaining bits are left uninitialized.
Using a u16 avoids this type confusion and ensures that no (negative) sign
extension is performed in batadv_send_my_tt_response().
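The sign extension can be shown in isolation. This is a minimal sketch with hypothetical helper names; the conversion of an out-of-range value to int16_t is implementation-defined before C23, and the comments assume the usual two's-complement wrap-around that GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* Widening through a signed 16-bit field, as with the old s16 member. */
static int32_t demo_widen_s16(uint16_t stored_len)
{
	/* Values above 32767 wrap to negative on two's-complement targets. */
	int16_t field = (int16_t)stored_len;

	return field;	/* sign-extended to 32 bits */
}

/* Widening through an unsigned 16-bit field, as with the new u16 member. */
static int32_t demo_widen_u16(uint16_t stored_len)
{
	uint16_t field = stored_len;

	return field;	/* zero-extended to 32 bits */
}

/* Self-check for one length that does not fit into an s16. */
static void demo_widen_check(void)
{
	assert(demo_widen_s16(40000) < 0);
	assert(demo_widen_u16(40000) == 40000);
}
```

Lengths up to 32767 behave identically in both variants, which is why the problem only shows up with large changesets.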
Sven Eckelmann [Sat, 2 May 2026 17:53:21 +0000 (19:53 +0200)]
batman-adv: tt: fix negative tt_buff_len
batadv_orig_node::tt_buff_len was declared as s16, but the field is never
intended to hold a negative value. When a value greater than 32767 is
assigned, it wraps to a negative signed integer.
In batadv_send_other_tt_response(), tt_buff_len is temporarily widened to
s32. The incorrectly negative s16 value propagates into the s32, causing
batadv_tt_prepare_tvlv_global_data() to allocate a full sized buffer but
populate only a small portion of it with the collected changeset. All
remaining bits are left uninitialized.
Using a u16 avoids this type confusion and ensures that no (negative) sign
extension is performed in batadv_send_other_tt_response().
Sven Eckelmann [Sat, 2 May 2026 17:08:37 +0000 (19:08 +0200)]
batman-adv: tt: reject oversized local TVLV buffers
The commit 3a359bf5c61d ("batman-adv: reject oversized global TT response
buffers") added a check to ensure that a global return buffer size can be
stored in a u16. The same buffer handling also exists for the local data
buffer but was not touched.
A similar check should also be in place for the local TVLV buffer. It
doesn't have the same attack surface because it is only generated from
locally discovered MAC addresses, but the dynamic nature could still
temporarily cause too large buffers.
Cc: stable@kernel.org
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
powerpc/hv-gpci: fix preempt count leak in sysfs show paths
Four sysfs show() callbacks in hv-gpci take get_cpu_var(hv_gpci_reqb)
(which calls preempt_disable()) but only call the matching put_cpu_var()
on the error path under the 'out:' label. Every successful read leaks
one preempt_disable():
(affinity_domain_via_partition_show() was already correct.)
On a CONFIG_PREEMPT=y kernel, repeated reads raise preempt_count and
eventually return to userspace with preemption still disabled. The
next user-mode page fault then hits faulthandler_disabled() == 1,
gets forced to SIGSEGV, and the resulting coredump trips
'BUG: scheduling while atomic' in call_usermodehelper_exec ->
wait_for_completion_state -> schedule:
powerpc: fix dead default for GUEST_STATE_BUFFER_TEST
The GUEST_STATE_BUFFER_TEST config option should default
to KUNIT_ALL_TESTS so that if all tests are enabled then
it is included, but currently the 'default KUNIT_ALL_TESTS'
statement is shadowed by 'def_tristate n',
meaning that this second default statement is currently dead code.
It looks to me like the commit 6ccbbc33f06a ("KVM: PPC: Add helper library for Guest State Buffers")
intended to set the default to KUNIT_ALL_TESTS, but mistakenly
missed the def_tristate.
This dead code was found by kconfirm, a static analysis tool for Kconfig.
Fixes: 6ccbbc33f06a ("KVM: PPC: Add helper library for Guest State Buffers")
Signed-off-by: Julian Braha <julianbraha@gmail.com>
Tested-by: Gautam Menghani <gautam@linux.ibm.com>
Reviewed-by: Amit Machhiwal <amachhiw@linux.ibm.com>
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260405161545.161006-1-julianbraha@gmail.com
Commit a28d3af2a26c ("[PATCH] 2/5 powerpc: Rework PowerMac i2c part 2")
removed the last calls to the pmac_low_i2c_{lock,unlock}() functions.
Hence, remove these two functions.
Ma Ke [Sun, 16 Nov 2025 02:44:11 +0000 (10:44 +0800)]
powerpc/warp: Fix error handling in pika_dtm_thread
pika_dtm_thread() acquires client through of_find_i2c_device_by_node()
but fails to release it in error handling path. This could result in a
reference count leak, preventing proper cleanup and potentially
leading to resource exhaustion. Add put_device() to release the
reference in the error handling path.
Found by code review.
Cc: stable@vger.kernel.org
Fixes: 3984114f0562 ("powerpc/warp: Platform fix for i2c change")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251116024411.21968-1-make24@iscas.ac.cn
Ally Heev [Sun, 16 Nov 2025 14:25:44 +0000 (19:55 +0530)]
powerpc: 82xx: fix uninitialized pointers with free attribute
Uninitialized pointers with `__free` attribute can cause undefined
behavior as the memory allocated to the pointer is freed automatically
when the pointer goes out of scope.
powerpc/km82xx doesn't have any bugs related to this as of now, but,
it is better to initialize and assign pointers with `__free` attribute
in one statement to ensure proper scope-based cleanup.
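The recommended pattern can be illustrated in user space with the compiler's cleanup attribute, which is what the kernel's `__free` builds on. This is a sketch only: demo_auto_free and demo_free_ptr() are hypothetical names, malloc()/free() stand in for the kernel allocators, and the attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdlib.h>

static int demo_frees;	/* counts cleanup calls that saw a non-NULL pointer */

/* Cleanup handler: receives a pointer to the annotated variable. */
static void demo_free_ptr(void *p)
{
	void *ptr = *(void **)p;

	if (ptr) {
		demo_frees++;
		free(ptr);
	}
}

#define demo_auto_free __attribute__((cleanup(demo_free_ptr)))

/*
 * Declare and assign in one statement: the variable never exists in an
 * uninitialized state, so no return path can hand a garbage pointer to
 * the cleanup handler.
 */
static int demo_use_buffer(size_t len)
{
	assert(len > 0);

	char *buf demo_auto_free = malloc(len);

	if (!buf)
		return -1;
	buf[0] = 0;
	return 0;	/* buf is freed automatically when it goes out of scope */
}
```

Had buf been declared without an initializer and assigned later, any return between the declaration and the assignment would run the handler on an indeterminate value.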
Linus Walleij [Tue, 5 May 2026 18:47:56 +0000 (20:47 +0200)]
powerpc/g5: Enable all windfarms by default
The G5 defconfig is clearly intended for the G5 Powermac
series, and that should enable all the available
windfarm drivers, or the machine will overheat a short
while after booting and shut itself down, which is
annoying.
Sibi Sankar [Fri, 13 Mar 2026 12:08:14 +0000 (17:38 +0530)]
arm64: dts: qcom: glymur-crd: Enable ADSP and CDSP
Enable ADSP and CDSP on Glymur CRD board.
Signed-off-by: Sibi Sankar <sibi.sankar@oss.qualcomm.com>
Reviewed-by: Abel Vesa <abel.vesa@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260313120814.1312410-6-sibi.sankar@oss.qualcomm.com
[bjorn: Moved snippet to common glymur-crd.dtsi]
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Guangshuo Li [Thu, 7 May 2026 10:06:03 +0000 (18:06 +0800)]
drm/bridge: imx8qxp-pxl2dpi: avoid ERR_PTR with device_node cleanup
imx8qxp_pxl2dpi_get_available_ep_from_port() returns ERR_PTR()
on errors. imx8qxp_pxl2dpi_find_next_bridge() stores its return
value in a __free(device_node) variable before checking IS_ERR().
When the function returns on the error path, the cleanup action calls
of_node_put() on the ERR_PTR() value.
Do not let a device_node cleanup variable hold error pointers. Change
imx8qxp_pxl2dpi_get_available_ep_from_port() to return an int and pass
the endpoint node through an output argument. Initialize the output
argument to NULL so callers hold either NULL on error paths or a valid
device_node pointer on the successful path.
Fixes: ceea3f7806a10 ("drm/bridge: imx8qxp-pxl2dpi: simplify put of device_node pointers")
Cc: stable@vger.kernel.org
Reviewed-by: Liu Ying <victor.liu@nxp.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260507100604.667731-1-lgs201920130244@gmail.com
Signed-off-by: Liu Ying <victor.liu@nxp.com>
Tao Cui [Fri, 8 May 2026 12:54:12 +0000 (20:54 +0800)]
net: ethtool: fix missing closing paren in rings_reply_size()
sizeof(u32) on the _RINGS_CQE_SIZE line is missing its closing
parenthesis, causing nla_total_size() to absorb the subsequent
_TX_PUSH and _RX_PUSH entries.
The resulting size estimate happens to be numerically identical
due to NLA alignment, so not treating this as a real fix.
But the nesting is wrong and misleading.
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260508125412.189804-1-cuitao@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
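Why both nestings give the same number can be checked with a user-space copy of the attribute size arithmetic. The sketch assumes a 4-byte attribute header, 4-byte alignment, and that the two absorbed push attributes carry u8 payloads; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* User-space copies of the netlink attribute size arithmetic. */
#define DEMO_NLA_ALIGNTO	4
#define DEMO_NLA_ALIGN(len) \
	(((len) + DEMO_NLA_ALIGNTO - 1) & ~(DEMO_NLA_ALIGNTO - 1))
#define DEMO_NLA_HDRLEN		4	/* assumed aligned header size */

/* Header plus payload, rounded up to the alignment. */
static int demo_nla_total_size(int payload)
{
	return DEMO_NLA_ALIGN(DEMO_NLA_HDRLEN + payload);
}

/* Intended nesting: three independent attributes (u32, u8, u8). */
static int demo_size_intended(void)
{
	return demo_nla_total_size(sizeof(uint32_t)) +
	       demo_nla_total_size(sizeof(uint8_t)) +
	       demo_nla_total_size(sizeof(uint8_t));
}

/* Misplaced parenthesis: the two u8 attributes end up inside the payload. */
static int demo_size_misnested(void)
{
	return demo_nla_total_size(sizeof(uint32_t) +
	       demo_nla_total_size(sizeof(uint8_t)) +
	       demo_nla_total_size(sizeof(uint8_t)));
}

/* Self-check: 8 + 8 + 8 on one side, align(4 + 4 + 8 + 8) on the other. */
static void demo_size_check(void)
{
	assert(demo_size_intended() == 24);
	assert(demo_size_misnested() == 24);
}
```

The equality is a coincidence of these particular sizes: every term is already a multiple of the alignment, so wrapping them in one more header-plus-align step adds exactly what the missing outer call would have added.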
ASoC: SOF: amd: Fix error code handling in psp_send_cmd()
The smn_read_register() helper returns negative error codes on failure
or the register value on success. When used with read_poll_timeout(),
the return value is stored in the 'data' variable.
Currently 'data' is declared as u32, which causes negative error codes
to be cast to large positive values. This makes the condition 'data > 0'
incorrectly treat errors as success.
Fix by changing 'data' from u32 to int, matching the pattern used in
psp_mbox_ready() which correctly handles the same helper function.
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/linux-sound/agGES8vWrLOrBu28@stanley.mountain/
Fixes: f120cf33d232 ("ASoC: SOF: amd: Use AMD_NODE")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20260511153638.724810-1-mario.limonciello@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
qed: fix division by zero in qed_init_wfq_param when all vports are configured
In qed_init_wfq_param(), variable non_requested_count can become zero
when the number of vports with the configured flag set (including the
current vport being configured) equals total num_vports. This happens
when configuring the last unconfigured vport or when re-configuring
an already configured vport.
The function then calculates left_rate_per_vp = total_left_rate /
non_requested_count, which causes division by zero.
Fix this by skipping the division when non_requested_count is zero.
In that case, there is no remaining bandwidth to distribute, so just
record the configuration for the current vport and return success.
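The guard amounts to a zero check before the division. A minimal sketch, assuming a hypothetical helper demo_left_rate_per_vp() that isolates just this calculation; the surrounding bookkeeping of qed_init_wfq_param() is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share the leftover rate among vports that did not request a rate.
 * Returns 0 when there is nobody to share with, instead of dividing by zero.
 */
static uint32_t demo_left_rate_per_vp(uint32_t total_left_rate,
				      uint32_t non_requested_count)
{
	if (non_requested_count == 0)
		return 0;	/* nothing left to distribute */
	return total_left_rate / non_requested_count;
}

/* Self-check for the normal case and the all-configured case. */
static void demo_rate_check(void)
{
	assert(demo_left_rate_per_vp(900, 3) == 300);
	assert(demo_left_rate_per_vp(900, 0) == 0);
}
```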
====================
net: phy: motorcomm: add ACPI _DSD property support
This series makes the Motorcomm PHY driver parse firmware properties via
device_property_*() so the same property set can be provided by either
Devicetree or ACPI _DSD.
Patch 1 switches drivers/net/phy/motorcomm.c from of_property_*() to
device_property_*() on &phydev->mdio.dev.
Patch 2 documents Motorcomm yt8xxx PHY ACPI _DSD properties under
Documentation/firmware-guide/acpi/dsd and links the new document from
the ACPI index.
====================
chunzhi.lin [Thu, 7 May 2026 04:02:20 +0000 (12:02 +0800)]
net: phy: motorcomm: use device properties for firmware tuning
The Motorcomm PHY driver reads optional firmware properties via
of_property_read_*() from phydev->mdio.dev.of_node. This works for
Device Tree based systems, but causes ACPI platforms to ignore the same
properties when they are supplied through _DSD.
As a result, ACPI-described Motorcomm PHY devices fall back to default
settings instead of applying firmware-provided tuning such as
rx/tx internal delay, drive strength, clock output frequency, and
optional boolean controls like auto-sleep-disabled,
keep-pll-enabled, and tx clock inversion.
Switch these lookups to device_property_read_*() so the driver uses the
generic firmware node interface and can consume the same property names
from either Device Tree or ACPI.
This keeps the existing DT behavior unchanged while allowing ACPI
platforms to honor PHY configuration from firmware.
We have completed testing on Sophgo RISC-V architecture server SD3-10.
This server has a 64-core Thead C920 CPU whose DWMAC is connected to
Motorcomm's PHY YT8531. This server supports UEFI boot, and we would like
it to use ACPI tables.
Davide Caratti [Fri, 8 May 2026 17:05:10 +0000 (19:05 +0200)]
net/sched: dualpi2: initialize timer earlier in dualpi2_init()
'pi2_timer' needs to be initialized in all error paths of dualpi2_init():
otherwise, a failure in qdisc_create_dflt() causes the following crash in
dualpi2_destroy():
Allow switching from 8 to 64 for the maximum number of subflows and
accepted ADD_ADDR, and from 8 to 255 for the number of MPTCP endpoints.
The previous limit of 8 subflows makes sense in most cases. Using more
subflows will very likely *not* improve the situation, and could even
decrease performance. But raising this limit has no technical obstacle
and no performance impact, so let's do it: this will allow
people with very specific use-cases, and researchers to easily create
more subflows, and measure the performance impact by themselves.
- Patches 1-2: increase subflows and accepted ADD_ADDR limits.
- Patches 3-4: increase endpoints limit.
- Patches 5-7: validate the new limits: 64 subflows, 255 endpoints.
- Patch 8: selftests: use send()/recv() instead of sendto()/recvfrom().
====================
These limits have been recently updated, from 8 to:
- 64 for the subflows and accepted add_addr
- 255 for the MPTCP endpoints
These modifications validate the new limits, but are also compatible
with the previous ones, to be able to continue to validate stable kernel
using the last version of the selftests. That's why new variables are
now used instead of hard-coded values.
The limits have been recently increased, so it is required to validate
that having 64 subflows is allowed.
Here, both the client and the server have 8 network interfaces. The
server has 8 endpoints marked as 'signal' to announce all its v4
addresses. The client also has 8 endpoints, but marked as 'subflow' and
'fullmesh' in order to create 8 subflows to each address announced by
the server. This means 63 additional subflows will be created after the
initial one.
If it is not possible to increase the limits to 64, it means an older
kernel version is being used, and the test is skipped.
selftests: mptcp: join: allow changing ifaces nr per test
By default, 4 network interfaces are created per subtest in a dedicated
net namespace. Each netns has a dedicated pair of v4 and v6 addresses.
Future tests will need more.
Simply always creating more network interfaces per test will increase
the execution time for all other tests, for no other benefits. So now it
is possible to change this number only when needed, by setting ifaces_nr
when calling 'reset' and 'init_shapers', e.g.
Note that it might also be interesting to decrease the default value to
2 to reduce the setup time, especially when a debug kernel config is
being used.
The endpoints are managed in a list which was limited to 8 entries.
This limit can be too small in some cases: by having the same limit as
the number of subflows, it might not allow creating all expected
subflows when having a mix of v4 and v6 addresses that can all use MPTCP
on v4/v6 only networks.
While increasing the limit above the new subflows one, why not use the
technical limit: 255? Indeed, each endpoint will have an ID that
will be used on the wire, limited to u8, and the ID 0 is reserved to the
initial subflow.
mptcp: pm: kernel: allow flushing more than 8 endpoints
The mptcp_rm_list structure contains an array of 8 IDs, to be able to
send a RM_ADDR with 8 IDs. This limitation was OK so far because there
could be a maximum of 8 endpoints.
But this is going to change in the next commit. To cope with that, if
one of the arrays is full, the iteration stops, the lists are processed,
then the iteration continues where it previously stopped.
Note that if there are many endpoints to remove, and multiple RM_ADDR to
send, it might be more likely that some of these RM_ADDRs are dropped or
lost. This is a known limitation: RM_ADDR are not retransmitted in
MPTCPv1.
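The "stop when full, process, continue" iteration can be sketched like this. It is a simplified model with a single list and hypothetical names (struct demo_rm_list, demo_process()); processing is reduced to counting, where the kernel would send RM_ADDR and close subflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RM_IDS_MAX 8	/* capacity of one removal list */

struct demo_rm_list {
	uint8_t ids[DEMO_RM_IDS_MAX];
	uint8_t nr;
};

static unsigned int demo_batches_sent;	/* number of processed lists */
static unsigned int demo_ids_sent;	/* number of IDs in those lists */

/* Stand-in for "process the list": count it, then reset it for reuse. */
static void demo_process(struct demo_rm_list *list)
{
	if (!list->nr)
		return;
	demo_batches_sent++;
	demo_ids_sent += list->nr;
	list->nr = 0;
}

/* Walk all endpoint IDs; flush whenever the list is full, then continue. */
static void demo_flush_endpoints(const uint8_t *endpoint_ids, size_t count)
{
	struct demo_rm_list list = { .nr = 0 };
	size_t i;

	assert(endpoint_ids || !count);
	for (i = 0; i < count; i++) {
		list.ids[list.nr++] = endpoint_ids[i];
		if (list.nr == DEMO_RM_IDS_MAX)
			demo_process(&list);
	}
	demo_process(&list);	/* remaining partial batch, if any */
}
```

Flushing 20 endpoints this way produces three batches (8, 8 and 4 IDs), which also shows why more endpoints mean more RM_ADDR messages that can individually be lost.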
This means switching the maximum from 8 to 64 for the number of subflows
and accepted ADD_ADDR.
The previous limit of 8 subflows makes sense in most cases. Using more
subflows will very likely *not* improve the situation, and could even
decrease performance. But raising this limit has no technical obstacle
and no performance impact, so let's do it: this will allow
people with very specific use-cases, and researchers to easily create
more subflows, and measure the performance impact by themselves.
The theoretical limit is 255 -- the ID is written in a u8 on the wire --
but 64 is more than enough. With so many subflows, it will be costly to
iterate over all of them when operations are done in bottom half.
Note that the in-kernel PM will continue to create subflows in reply to
ADD_ADDR with a single batch of maximum 8 subflows. Same when adding new
"subflow" endpoints with the fullmesh flag. Increasing those batch
limits would have a memory impact, and it looks fine not to cover these
cases with larger batches for the moment. If more is needed later, the
position of the last subflow from the list could be remembered, and the
list iteration could continue later.
mptcp: pm: in-kernel: explicitly limit batches to array size
The in-kernel PM can create subflows in reply to ADD_ADDR by batch of
maximum 8 subflows for the moment. Same when adding new "subflow"
endpoints with the fullmesh flag. This limit is linked to the arrays
used during these steps.
There was no explicit limit to the array size (8), because the limit of
extra subflows is the same (8). It seems safer to use an explicit limit,
and these two sizes are also going to be different in the next commit.
Abel Vesa [Tue, 14 Apr 2026 17:05:51 +0000 (20:05 +0300)]
arm64: dts: qcom: glymur: Drop RPMh CXO clocks from QMP PHYs
On Glymur, all QMP PHYs except the one used by USB SS0 take their
reference clock from the TCSR clock controller. Since these TCSR clocks
already derive from RPMH_CXO_CLK as their sole parent, there is no need
to provide an extra `clkref` clock to the PHY nodes.
Drop the extra RPMh CXO clock inputs and use the TCSR clocks as the PHY
reference clocks instead.
This also fixes the devicetree schema validation, as the bindings do not
allow a separate `clkref` clock.
Fixes: 4eee57dd4df9 ("arm64: dts: qcom: glymur: Add USB related nodes")
Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reported-by: Rob Herring <robh@kernel.org>
Closes: https://lore.kernel.org/r/20260410145205.GA554754-robh@kernel.org/
Signed-off-by: Abel Vesa <abel.vesa@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260414-dts-glymur-drop-rpmh-cxo-clk-from-qmpphys-v1-1-ab12d77c4aec@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Breno Leitao [Fri, 8 May 2026 13:55:04 +0000 (06:55 -0700)]
tools/bootconfig: render kernel.* subtree as cmdline string with -C
Add a -C option that finds the "kernel" subtree of a bootconfig file
and prints it as a flat, space-separated cmdline string by calling the
shared xbc_snprint_cmdline() renderer. An empty or absent kernel.*
subtree produces empty output and exits successfully.
This lets the kernel build embed a bootconfig file as a plain cmdline
string at build time, so embedded bootconfig values can reach
parse_early_param() during architecture setup without parsing the
bootconfig at runtime.
The renderer is intentionally limited to the kernel.* subtree: that is
the only thing the kernel build needs to embed; init.* and other
subtrees keep going through the runtime parser.
Example of this new mode:
# cat /tmp/test.bconf
kernel {
foo = bar
baz = "hello world"
arr = 1, 2
}
init.foo = nope
Breno Leitao [Fri, 8 May 2026 13:55:03 +0000 (06:55 -0700)]
bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.c
Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the
function (and its xbc_namebuf scratch buffer) becomes part of the shared
parser library. tools/bootconfig already compiles lib/bootconfig.c
directly, which lets a follow-up patch reuse the same renderer in the
userspace tool to convert a bootconfig file into a flat cmdline string
at build time.
David Yang [Thu, 7 May 2026 21:40:51 +0000 (05:40 +0800)]
net: mention the convention for .ndo_setup_tc()
qdisc_offload_dump_helper(), which originated from commit 602f3baf2218
("net_sch: red: Add offload ability to RED qdisc"), is designed so that:
    Whether RED is being offloaded is being determined every time dump
    action is being called because parent change of this qdisc could
    change its offload state but doesn't require any RED function to be
    called.
and returning -EOPNOTSUPP (for dump queries) does not mean "I don't have
any statistics", but "I don't offload this qdisc anymore". At least two
existing drivers did it wrong, so it is worth mentioning.
net: ena: PHC: Check return code before setting timestamp output
ena_phc_gettimex64() is setting the output parameter regardless
of whether ena_com_phc_get_timestamp() succeeded or failed.
When ena_com_phc_get_timestamp() returns an error, the timestamp
parameter may contain uninitialized stack memory (e.g., when PHC is
disabled or in blocked state) or invalid hardware values. Passing
these to userspace via the PTP ioctl is both a security issue
(information leak) and a correctness bug.
Fix by checking the return code after releasing the lock and only
setting the output timestamp on success.
Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver")
Cc: stable@vger.kernel.org
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260507003518.22554-1-akiyano@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
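The "check first, publish second" ordering looks roughly like the sketch below. Everything here is hypothetical: demo_read_timestamp() stands in for the device read, -EBUSY is an arbitrary error code chosen for illustration, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical device read: fails without touching *ts when blocked. */
static int demo_read_timestamp(int blocked, uint64_t *ts)
{
	if (blocked)
		return -EBUSY;
	*ts = 1234567890ULL;
	return 0;
}

/* Only publish the value to the caller after the read succeeded. */
static int demo_gettime(int blocked, uint64_t *out_ns)
{
	uint64_t ts;	/* deliberately not initialized, like a stack local */
	int rc;

	assert(out_ns);
	rc = demo_read_timestamp(blocked, &ts);
	if (rc)
		return rc;	/* *out_ns is left untouched on failure */
	*out_ns = ts;
	return 0;
}
```

On the failure path the local ts is never read, so whatever happened to be on the stack cannot reach the caller.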
net/rds: reset op_nents when zerocopy page pin fails
When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(),
the pinned pages are released with put_page(), and
rm->data.op_mmp_znotifier is cleared. But we fail to properly
clear rm->data.op_nents.
Later, when rds_message_purge() is called from rds_sendmsg(), the
cleanup loop iterates over the stale, non-zero op_nents count and
frees the entries again.
Fix this by resetting op_nents in the failure path of
rds_message_zcopy_from_user().
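A minimal userspace sketch of the same class of bug, assuming a
hypothetical struct msg with heap-allocated entries in place of pinned
pages. None of these names come from the RDS code; the point is that the
error path must reset the count so a later generic purge finds nothing
left to release.

```c
#include <stdlib.h>

#define MAX_ENTS 4

struct msg {
	void *ents[MAX_ENTS];
	unsigned int nents;
};

/* Generic cleanup: frees whatever nents claims is populated. */
static void msg_purge(struct msg *m)
{
	for (unsigned int i = 0; i < m->nents; i++) {
		free(m->ents[i]);
		m->ents[i] = NULL;
	}
	m->nents = 0;
}

/* Fill up to want entries; fail_at simulates a pinning failure. */
static int msg_fill(struct msg *m, unsigned int want, unsigned int fail_at)
{
	for (unsigned int i = 0; i < want && i < MAX_ENTS; i++) {
		if (i == fail_at)
			goto err;
		m->ents[i] = malloc(16);
		if (!m->ents[i])
			goto err;
		m->nents++;
	}
	return 0;
err:
	for (unsigned int i = 0; i < m->nents; i++)
		free(m->ents[i]);
	m->nents = 0;	/* the reset: a later purge must see nothing to free */
	return -1;
}
```

Without the reset in the err path, msg_purge() would walk the same
entries a second time and free them again.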
Ihor Solodrai [Sat, 9 May 2026 00:57:30 +0000 (17:57 -0700)]
selftests/bpf: Use both hrtimer enqueue helpers in vmlinux test
The vmlinux selftest triggers nanosleep and checks that both kprobe
and fentry programs observe the hrtimer enqueue path.
After the hrtimer_start_expires_user() conversion [1], nanosleep
reaches hrtimer_start_range_ns_user() instead of
hrtimer_start_range_ns(). Hard-coding either symbol makes the test
fail either on bpf tree or on linux-next [2].
Update the test to resolve the target symbol at runtime via
libbpf_find_vmlinux_btf_id(). This is a nice example of how to modify
a BPF program to work on both older and newer kernel revisions.
QA output created by 637
entries 7 and 8 have duplicate d_off 8
Found unlinked files in open dir (see xfstests-dev/results//generic/637.full for details)
Debugging of the hfsplus_readdir() logic showed this:
It means that hfsplus_readdir() stopped the processing of
folder's items on ctx->pos 8, then, item with ino 28 has
been deleted and hfsplus_readdir() re-started the logic
from ctx->pos 7. As a result, previous and new sets of
folder's items have overlapping values for the case of
d_off 8.
Currently, HFS+ has very complicated and fragile logic
of rd->file->f_pos correction in hfsplus_delete_cat().
This patch removes this logic and it stores the current
pos into hfsplus_readdir_data. Finally, if rd->pos == ctx->pos
then hfsplus_readdir() tries to find the position in
b-tree's node by means of hfsplus_cat_key. This position is
used to re-start the folder's content traversal.
zonefs: handle integer overflow in zonefs_fname_to_fno
In zonefs the file name in one of the two directories corresponds to the
zone number.
Here Alexey reported a possible integer overflow in zonefs_fname_to_fno(),
where the parsing of the zone number from the file name can overflow the
'long' data type.
Add a check for integer overflow and, if the 'long' fno did overflow,
return -ENOENT.
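One way to write such a check, sketched in userspace C. fname_to_fno()
here is a hypothetical helper, not the zonefs implementation; it parses a
decimal name digit by digit and tests for overflow before each
multiply-and-add.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Parse a decimal name into a file number; -ENOENT on bad input or overflow. */
static long fname_to_fno(const char *name, size_t len)
{
	long fno = 0;

	if (len == 0)
		return -ENOENT;

	for (size_t i = 0; i < len; i++) {
		int digit;

		if (name[i] < '0' || name[i] > '9')
			return -ENOENT;
		digit = name[i] - '0';
		/* Would fno * 10 + digit exceed LONG_MAX? Check before computing. */
		if (fno > (LONG_MAX - digit) / 10)
			return -ENOENT;
		fno = fno * 10 + digit;
	}
	return fno;
}
```

The comparison is done on the right-hand side of the inequality, so the
overflowing expression itself is never evaluated.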
Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Fixes: d207794ababe ("zonefs: Dynamically create file inodes when needed") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tejun Heo [Mon, 11 May 2026 22:43:39 +0000 (12:43 -1000)]
Merge branch 'for-7.1-fixes' into for-7.2
Pull to receive:
9a415cc53711 ("sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path")
Conflicts with for-7.2's scx_task_iter_relock() rework. The fix moves
put_task_struct(p) past scx_error(); for-7.2 still has it at the old
position. Resolved by dropping the old one.
Linus Torvalds [Mon, 11 May 2026 22:38:49 +0000 (15:38 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah Khan:
"Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix
KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it
will not be useful"
* tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: config: KUNIT_DEBUGFS should depend on DEBUG_FS
kunit: config: Enable KUNIT_DEBUGFS by default
Changelog:
RFC: https://lore.kernel.org/all/20260420111726.2118636-1-puranjay@kernel.org/
Changes in v1:
- Replace bpf_get_cpu_time_counter() with bpf_ktime_get_ns()
- Replace bpf_repeat() with plain for loop and may_goto
- Refactor collect_measurements() to reuse bench_force_done()
- Remove histogram, verbose calibration output, and per-scenario status prints
- Trim run script table to p50/stddev/p99
- Set env.quiet when --machine-readable is passed
- Add || true to run script benchmark invocation for set -e safety
- Add bpf-nop benchmark as timing overhead baseline (patch 3)
- Use named struct for LRU inner map to fix build on older toolchains
This series adds an XDP load-balancer benchmark (based on Katran) to the BPF
selftest bench framework.
Motivation
----------
Existing BPF bench tests measure individual operations (map lookups,
kprobes, ring buffers) in isolation. Production BPF programs combine
parsing, map lookups, branching, and packet rewriting in a single call
chain. The performance characteristics of such programs depend on the
interaction of these operations -- register pressure, spills, inlining
decisions, branch layout -- which isolated micro-benchmarks do not
capture.
This benchmark implements a simplified L4 load-balancer modeled after
katran [1]. The BPF program reproduces katran's core datapath:
L3/L4 parsing -> VIP hash lookup -> per-CPU LRU connection table
with consistent-hash fallback -> real server selection -> per-VIP
and per-real stats -> IPIP/IP6IP6 encapsulation
The BPF code exercises hash maps, array-of-maps (per-CPU LRU),
percpu arrays, jhash, bpf_xdp_adjust_head(), bpf_ktime_get_ns(),
and bpf_get_smp_processor_id() in a single pipeline.
This is intended as the first in a series of BPF workload benchmarks
covering other use cases (sched_ext, etc.).
Design
------
A userspace loop calling bpf_prog_test_run_opts(repeat=1) would
measure syscall overhead, not BPF program cost -- the ~4 ns early-exit
paths would be buried under kernel entry/exit. Using repeat=N is
also unsuitable: the kernel re-runs the same packet without resetting
state between iterations, so the second iteration of an encap scenario
would process an already-encapsulated packet.
Instead, timing is measured inside the BPF program using
bpf_ktime_get_ns(). BENCH_BPF_LOOP() brackets N iterations with
timestamp reads using a plain for loop with may_goto, runs a
caller-supplied reset block between iterations to undo side effects
(e.g. strip encapsulation), and records the elapsed time per batch.
One extra untimed iteration runs afterward for output validation.
Auto-calibration picks a batch size targeting ~10 ms per invocation.
A proportionality sanity check verifies that 2N iterations take ~2x
as long as N.
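The calibration arithmetic described above can be sketched as follows.
This is an illustration only: calibrate_batch(), is_proportional() and
the percentage tolerance are assumptions, not the helpers or thresholds
used by the series.

```c
#include <stdint.h>

#define TARGET_NS 10000000ULL	/* ~10 ms per invocation, as stated above */

/* Scale a probe batch so that one invocation lasts about TARGET_NS. */
static uint64_t calibrate_batch(uint64_t probe_iters, uint64_t probe_ns)
{
	uint64_t iters;

	if (probe_iters == 0 || probe_ns == 0)
		return 1;
	iters = probe_iters * TARGET_NS / probe_ns;
	return iters ? iters : 1;
}

/* Accept if t_2n is within tol_pct percent of 2 * t_n (tolerance assumed). */
static int is_proportional(uint64_t t_n, uint64_t t_2n, unsigned int tol_pct)
{
	uint64_t expect = 2 * t_n;
	uint64_t diff = t_2n > expect ? t_2n - expect : expect - t_2n;

	return diff * 100 <= expect * tol_pct;
}
```

For example, a probe of 1000 iterations that took 4000 ns (4 ns per
iteration) yields a batch of 2500000 iterations.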
24 scenarios cover the code-path matrix:
- Protocol: TCP, UDP
- Address family: IPv4, IPv6, cross-AF (IPv4-in-IPv6)
- LRU state: hit, miss (16M flow space), diverse (4K flows), cold
- Consistent-hash: direct (LRU bypass)
- TCP flags: SYN (skip LRU, force CH), RST (skip LRU insert)
- Early exits: unknown VIP, non-IP, ICMP, fragments, IP options
Each scenario validates correctness before benchmarking by comparing
the output packet byte-for-byte against a pre-built expected packet
and checking BPF map counters.
Wire up the userspace side of the XDP load-balancer benchmark.
24 scenarios cover the full code-path matrix: TCP/UDP, IPv4/IPv6,
cross-AF encap, LRU hit/miss/diverse/cold, consistent-hash bypass,
SYN/RST flag handling, and early exits (unknown VIP, non-IP, ICMP,
fragments, IP options).
Before benchmarking, each scenario validates correctness: the output
packet is compared byte-for-byte against a pre-built expected packet
and BPF map counters are checked against the expected values.
Add the BPF datapath for the XDP load-balancer benchmark, a
simplified L4 load-balancer inspired by katran.
The pipeline: L3/L4 parse -> VIP lookup -> per-CPU LRU connection
table or consistent-hash fallback -> real server lookup -> per-VIP
and per-real stats -> IPIP/IP6IP6 encapsulation. TCP SYN forces
the consistent-hash path (skipping LRU); TCP RST skips LRU insert
to avoid polluting the table.
process_packet() is marked __noinline so that the BENCH_BPF_LOOP
reset block (which strips encapsulation) operates on valid packet
pointers after bpf_xdp_adjust_head().
selftests/bpf: Add XDP load-balancer common definitions
Add the shared header for the XDP load-balancer benchmark. This
defines the data structures used by both the BPF program and
userspace: flow_key, vip_definition, real_definition, and the
stats/control structures.
Also provides the encapsulation source-address helpers shared
between the BPF datapath (for encap) and userspace (for building
expected output packets used in validation).
selftests/bpf: Add bpf-nop benchmark for timing overhead baseline
Add a minimal benchmark that measures the overhead of the batch-timing
infrastructure itself. The BPF program runs an empty BENCH_BPF_LOOP body
(~1.5-2 ns/op), establishing the floor cost that all timing-library
benchmarks include.
[root@virtme-ng tools/testing/selftests/bpf]# sudo ./bench -a -p8 bpf-nop
Setting up benchmark 'bpf-nop'...
Benchmark 'bpf-nop' started.
bpf-nop: median 1.82 ns/op, stddev 0.01, p99 1.86 (1754 samples)
Add a reusable timing library for BPF benchmarks that need to measure
BPF program execution time.
The BPF side (progs/bench_bpf_timing.bpf.h) provides per-CPU sample
arrays and BENCH_BPF_LOOP(), a macro that brackets batch_iters
iterations with bpf_ktime_get_ns() reads and records the elapsed time.
One extra untimed iteration runs afterward for output validation.
The userspace side (benchs/bench_bpf_timing.c) collects samples from
the skeleton BSS, computes percentile statistics, and auto-calibrates
batch_iters to target ~10 ms per batch.
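As a rough illustration of the percentile step, here is a nearest-rank
percentile over an array of samples. The function name and the choice of
the nearest-rank method are assumptions; the commit does not say which
estimator the library uses.

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Nearest-rank percentile; sorts samples in place. pct is 1..100. */
static uint64_t percentile(uint64_t *samples, size_t n, unsigned int pct)
{
	size_t rank;

	if (n == 0)
		return 0;
	qsort(samples, n, sizeof(*samples), cmp_u64);
	rank = (pct * n + 99) / 100;	/* ceil(pct * n / 100) */
	if (rank == 0)
		rank = 1;
	if (rank > n)
		rank = n;
	return samples[rank - 1];
}
```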
selftests/bpf: Add bench_force_done() for early benchmark completion
The bench framework waits for duration_sec to elapse before collecting
results. Benchmarks that know exactly how many samples they need can
call bench_force_done() to signal completion early, avoiding wasted
wall-clock time.
Also refactor collect_measurements() to reuse bench_force_done()
instead of open-coding the same mutex/cond_signal sequence.
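The mutex/cond_signal sequence mentioned above looks roughly like this
in POSIX threads. The names (force_done(), wait_done(), done) are
hypothetical, and the explicit predicate flag is an addition of this
sketch, not a claim about how the bench framework tracks completion.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t done_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static bool done;

/* Signal completion early; safe to call more than once. */
static void force_done(void)
{
	pthread_mutex_lock(&done_mutex);
	done = true;
	pthread_cond_signal(&done_cond);
	pthread_mutex_unlock(&done_mutex);
}

/* Block until force_done() has been called. */
static void wait_done(void)
{
	pthread_mutex_lock(&done_mutex);
	while (!done)	/* loop guards against spurious wakeups */
		pthread_cond_wait(&done_cond, &done_mutex);
	pthread_mutex_unlock(&done_mutex);
}
```

Keeping the sequence in one helper means every caller takes the mutex
before signalling, which is what open-coded copies tend to get wrong.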
Tejun Heo [Mon, 11 May 2026 22:05:48 +0000 (12:05 -1000)]
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p->comm and p->pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.
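The ordering issue can be shown with a toy reference count in userspace
C. struct task, task_put() and report_and_put() are hypothetical and
only mimic the shape of the problem: every read of the object has to
happen before the reference that keeps it alive is dropped.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct task {
	int refs;
	int pid;
	char comm[16];
};

/* Drop one reference; the last drop frees the object. */
static void task_put(struct task *t)
{
	if (--t->refs == 0)
		free(t);
}

/* Format the error first, drop the reference last. */
static void report_and_put(struct task *t, char *buf, size_t len)
{
	snprintf(buf, len, "init failed for %s[%d]", t->comm, t->pid);
	task_put(t);	/* t may be freed here; do not touch it afterwards */
}
```

Swapping the two statements in report_and_put() reproduces the
use-after-free whenever the caller holds the last reference.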
drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12
gfx_v12_0_init_microcode() always loaded RS64 CP ucode but never set
adev->gfx.rs64_enable, so it stayed false and code that branches on it
(e.g. MEC pipe reset) used the legacy CP_MEC_CNTL path incorrectly.
Match GFX11: derive RS64 mode from the PFP firmware header (v2.0) via
amdgpu_ucode_hdr_version(). Log at debug when RS64 is enabled.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b03d53598b0d2048e8fa7303b8d0784768ec4fa6)
Xiang Liu [Thu, 7 May 2026 12:56:15 +0000 (20:56 +0800)]
drm/amd/ras: Fix CPER ring debugfs read overflow
The legacy CPER debugfs reader can reach the payload path without a
valid pointer snapshot. The remaining user byte count is also treated as
the ring occupancy in dwords, so reads past the header can copy more than
requested.
Take the CPER lock before sampling pointers. Resample rptr/wptr for
payload reads, bound the payload copy by available dwords and the
remaining user size, and advance the file position for each dword copied.
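A sketch of the bounding logic in userspace C, assuming a
power-of-two-sized ring indexed in dwords. The function names and the
ring layout are assumptions for illustration and do not reflect the
amdgpu CPER code.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of unread dwords in a ring of ring_dw entries (power of two). */
static uint32_t ring_avail_dw(uint32_t rptr, uint32_t wptr, uint32_t ring_dw)
{
	return (wptr - rptr) & (ring_dw - 1);
}

/* Copy at most min(available dwords, user_bytes / 4); returns bytes copied. */
static size_t ring_read(const uint32_t *ring, uint32_t ring_dw,
			uint32_t *rptr, uint32_t wptr,
			uint32_t *dst, size_t user_bytes)
{
	uint32_t avail = ring_avail_dw(*rptr, wptr, ring_dw);
	size_t want = user_bytes / sizeof(uint32_t);
	size_t n = avail < want ? avail : want;

	for (size_t i = 0; i < n; i++) {
		dst[i] = ring[*rptr];
		*rptr = (*rptr + 1) & (ring_dw - 1);	/* advance per dword */
	}
	return n * sizeof(uint32_t);
}
```

The user byte count and the ring occupancy are kept in separate
variables with separate units, so neither can be mistaken for the other.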
Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1e40ef87ffdc291e05ccdade8b9170cc9c1c4249)
drm/amd/display: Wrap DCN32 phantom-plane allocation in DC_RUN_WITH_PREEMPTION_ENABLED
[Why]
dcn32_validate_bandwidth() wraps dcn32_internal_validate_bw() with
DC_FP_START()/DC_FP_END(). In x86 non-RT, DC_FP_START takes fpregs_lock(),
which disables local softirqs.
The DML1 path through dcn32_enable_phantom_plane() calls kvzalloc() to
allocate ~335 KiB for dc_plane_state. This triggers the vmalloc path,
which calls BUG_ON(in_interrupt()) because it's invoked within the
FPU-enabled (softirq disabled) region, leading to a kernel crash.
[How]
Wrap the dc_state_create_phantom_plane() call with the
DC_RUN_WITH_PREEMPTION_ENABLED() macro to allow preemption during
this memory allocation.
Fixes: 235c67634230 ("drm/amd/display: add DCN32/321 specific files for Display Core") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4470 Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: James Lin <pinglei.lin@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 885ccbef7b94a8b38f69c4211c679021aa27ad11) Cc: stable@vger.kernel.org
Christian König [Mon, 20 Apr 2026 14:08:35 +0000 (16:08 +0200)]
drm/amdgpu: fix userq hang detection and reset
Fix lock inversions pointed out by Prike and Sunil. The hang detection
timeout *CAN'T* grab locks under which we wait for fences, especially
not the userq_mutex lock.
Then, instead of the completely broken handling with the
hang_detect_fence, just cancel the work when fences are processed and
restart it if necessary.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1b62077f045ac6ffde7c97005c6659569ac5c1ec)
Christian König [Mon, 20 Apr 2026 13:13:57 +0000 (15:13 +0200)]
drm/amdgpu: remove almost all calls to amdgpu_userq_detect_and_reset_queues
Well the reset handling seems broken on multiple levels.
As a first step towards fixing this, remove most calls to the hang detection.
That function should only be called after we run into a timeout! And *NOT*
as random check spread over the code in multiple places.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 71bea36b54ccfb14cbc90f94267af6369af4e702)
Christian König [Thu, 16 Apr 2026 13:32:11 +0000 (15:32 +0200)]
drm/amdgpu: rework amdgpu_userq_signal_ioctl v3
Fortunately, this one did not look as bad as the wait ioctl path, but
there were still a few things which could be fixed/improved:
1. Allocating with GFP_ATOMIC was quite unnecessary, we can do that
before taking the userq_lock.
2. Use a new mutex as protection for the fence_drv_xa so that we can do
memory allocations while holding it.
3. Starting the reset timer is unnecessary when the fence is already
signaled when we create it.
4. Clean up error handling; avoid trying to free the queue when we
never got one.
v2: fix incorrect usage of xa_find, destroy the new mutex on error
v3: cleanup ref ordering
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1609eb0f81a609d350169839128cecf298c84e7a)
Christian König [Mon, 20 Apr 2026 18:18:43 +0000 (20:18 +0200)]
drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset
The purpose of a GPU reset is to make sure that fences can be signaled
again and the signal and resume workers can make progress again.
So waiting for the resume worker or any fence in the GPU reset path is
utter nonsense.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)
Merge patch series "proc: subset=pid: Relax check of mount visibility"
Alexey Gladkov <legion@kernel.org> says:
When mounting procfs with the subset=pid option, all static files become
unavailable and only the dynamic part with information about pids is accessible.
In this case, there is no point in imposing additional restrictions on the
visibility of the entire filesystem for the mounter. Everything that can be
hidden in procfs is already inaccessible.
Currently, these restrictions prevent procfs from being mounted inside rootless
containers, as almost all container implementations override part of procfs to
hide certain directories. Relaxing these restrictions will allow pidfs to be
used in nested containerization.
* patches from https://patch.msgid.link/cover.1777278334.git.legion@kernel.org:
docs: proc: add documentation about mount restrictions
proc: handle subset=pid separately in userns visibility checks
proc: prevent reconfiguring subset=pid
proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
sysfs: remove trivial sysfs_get_tree() wrapper
fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
namespace: record fully visible mounts in list
proc: handle subset=pid separately in userns visibility checks
When procfs is mounted with subset=pid, only the dynamic process-related
part of the filesystem remains visible. That part cannot be hidden by
overmounts, so checking whether an existing procfs mount is fully
visible does not make sense for this mode.
At the same time, a subset=pid procfs mount must not be used as evidence
that a later procfs mount would not reveal additional information. It
provides a restricted view of procfs, not the full filesystem view.
Mark subset=pid procfs instances as restricted variants. Ignore
restricted variants when looking for an already-visible mount, and allow
new restricted variants without consulting mnt_already_visible().
Changing subset=pid on an existing procfs instance is not safe. If a
full procfs mount has entries hidden by overmounts, switching it to
subset=pid would hide the top-level procfs entries from lookup and
readdir while leaving the existing overmounts reachable.
Reject attempts to change the subset=pid state during reconfigure before
applying any other procfs mount options, so a failed reconfigure cannot
partially update the instance.
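The validate-before-apply ordering can be sketched like this. struct
fs_opts, its fields, and the -EINVAL return value are assumptions for
illustration; the commit message does not state which error the real
code returns.

```c
#include <errno.h>
#include <stdbool.h>

struct fs_opts {
	bool subset_pid;
	int hidepid;
};

/* Validate everything first; only then touch the live options. */
static int reconfigure(struct fs_opts *cur, const struct fs_opts *req)
{
	if (req->subset_pid != cur->subset_pid)
		return -EINVAL;	/* nothing has been modified yet */

	cur->hidepid = req->hidepid;
	return 0;
}
```

Because the rejection happens before the first assignment, a failed
call leaves every field of the live options as it was.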
The file is a mess with a hand-rolled linked list in desperate need of
a cleanup.
The code to emit /proc/filesystems is used frequently because libselinux
reads the file, which in turn is linked into numerous frequently used
programs (even ones you would not suspect, like sed!). To combat that,
pre-generate the string instead of chasing pointers and printing
entries one by one.
The main bottleneck afterwards is the spurious lockref trip on open.
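The caching idea, reduced to a userspace sketch: build the listing once,
hand out the cached copy on every read, and drop it whenever the list
changes. All names here are hypothetical, and the sketch has no locking
or RCU, which the kernel version needs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FS 8

static const char *fs_names[MAX_FS];
static size_t fs_count;
static char *fs_cache;	/* NULL means "needs regeneration" */

static int fs_register(const char *name)
{
	if (fs_count == MAX_FS)
		return -1;
	fs_names[fs_count++] = name;
	free(fs_cache);		/* invalidate on every list change */
	fs_cache = NULL;
	return 0;
}

/* Return the cached listing, building it once per list change. */
static const char *fs_listing(void)
{
	size_t len = 1, off = 0;

	if (fs_cache)
		return fs_cache;
	for (size_t i = 0; i < fs_count; i++)
		len += strlen(fs_names[i]) + 1;	/* name + '\n' */
	fs_cache = malloc(len);
	if (!fs_cache)
		return NULL;
	fs_cache[0] = '\0';
	for (size_t i = 0; i < fs_count; i++)
		off += (size_t)snprintf(fs_cache + off, len - off, "%s\n",
					fs_names[i]);
	return fs_cache;
}
```

Reads are far more frequent than registrations, so paying the
formatting cost on the write side is the cheaper trade.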
* patches from https://patch.msgid.link/20260425220844.1763933-1-mjguzik@gmail.com:
fs: cache the string generated by reading /proc/filesystems
fs: RCU-ify filesystems list
proc: allow to mark /proc files permanent outside of fs/proc/
proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.
Do not show /proc/self/net when proc is mounted with subset=pid option
and the mounter does not have CAP_NET_ADMIN. To avoid inadvertently
allowing access to /proc/<pid>/net, updating mounter credentials is not
supported.
Now that FS_USERNS_MOUNT_RESTRICTED is a file_system_type flag,
sysfs_get_tree() is a trivial wrapper around kernfs_get_tree() with no
additional logic. Point sysfs_fs_context_ops.get_tree directly at
kernfs_get_tree() and remove the wrapper.
The drivers list was protected by an rwlock; every mount, every open
of /proc/filesystems and the legacy sysfs(2) syscall walked a
hand-rolled singly-linked list under it. /proc/filesystems is
especially hot because libselinux causes programs as mundane as
mkdir, ls and sed to open and read it on every invocation.
Convert the list to an RCU-protected hlist and switch the writer side
to a plain spinlock. Writers keep their existing non-sleeping
section while readers walk under rcu_read_lock() with no lock traffic:
- register_filesystem()/unregister_filesystem() take
file_systems_lock, publish via hlist_{add_tail,del_init}_rcu()
and invalidate the cached /proc/filesystems string.
unregister_filesystem() keeps its synchronize_rcu() after
dropping the lock so in-flight readers are drained before the
module (and its embedded file_system_type) can go away.
- __get_fs_type(), list_bdev_fs_names() and the
fs_index()/fs_name()/fs_maxindex() helpers walk the list under
rcu_read_lock(). fs_name() continues to drop the read-side
lock after try_module_get() and accesses ->name outside the RCU
section; the module reference pins the embedded file_system_type
across the boundary.
struct file_system_type::next becomes struct hlist_node list; no
in-tree caller references the old ->next field outside
fs/filesystems.c.
fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
Whether a filesystem's mounts need to undergo a visibility check in user
namespaces is a static property of the filesystem type, not a runtime
property of each superblock instance. Both proc and sysfs always set
SB_I_USERNS_VISIBLE on their superblocks unconditionally (sysfs does so
on first creation, and subsequent mounts reuse the same superblock).
Move this flag from sb->s_iflags (SB_I_USERNS_VISIBLE) to
file_system_type->fs_flags (FS_USERNS_MOUNT_RESTRICTED) so the intent
is expressed at the filesystem type level where it belongs.
All check sites are updated to test sb->s_type->fs_flags instead of
sb->s_iflags. The SB_I_NOEXEC and SB_I_NODEV flags remain on the
superblock as they are runtime properties set during fill_super.