Commit 8cf5b3f83614 ("Revert "kbuild, rust: use -fremap-path-prefix
to make paths relative"") removed `--remap-path-prefix` from the build
system, so the workarounds are not needed anymore.
Thus remove them.
Note that the flag has landed again in parallel in this cycle in
commit dda135077ecc ("rust: build: remap path to avoid absolute path"),
together with `--remap-path-scope=macro` [1]. However, they are gated on
`rustc-option-yn, --remap-path-scope=macro`, which means they are both
only passed starting with Rust 1.95.0 [2]:
`--remap-path-scope` is only stable in Rust 1.95, so use `rustc-option`
to detect its presence. This feature has been available as
`-Zremap-path-scope` for all versions that we support; however due to
bugs in the Rust compiler, it does not work reliably until 1.94. I opted
to not enable it for 1.94 as it's just a single version that we missed.
In turn, that means the workarounds removed here should not be needed
again (even with the flag added again above), since:
- `rustdoc` now recognizes the `--remap-path-prefix` flag since Rust
1.81.0 [3] (even if it is still an unstable feature [4]).
- The Internal Compiler Error [5] that the comment mentions was fixed in
Rust 1.87.0 [6]. We tested that was the case in a previous version
of this series by making the workaround conditional [7][8].
...which are both older versions than Rust 1.95.0.
We will still need to skip `--remap-path-scope` for `rustdoc` though,
since `rustdoc` does not support that one yet [4].
Jouni Högander [Fri, 27 Mar 2026 11:45:53 +0000 (13:45 +0200)]
drm/i915/psr: Do not use pipe_src as borders for SU area
This far using crtc_state->pipe_src as borders for Selective Update area
haven't caused visible problems as drm_rect_width(crtc_state->pipe_src) ==
crtc_state->hw.adjusted_mode.crtc_hdisplay and
drm_rect_height(crtc_state->pipe_src) ==
crtc_state->hw.adjusted_mode.crtc_vdisplay when pipe scaling is not
used. On the other hand using pipe scaling is forcing full frame updates and all the
Selective Update area calculations are skipped. Now this improper usage of
crtc_state->pipe_src is causing following warnings:
"drm/i915/dsc: Add helper for writing DSC Selective Update ET parameters"
These warnings are seen when DSC and pipe scaling are enabled
simultaneously. This is because on full frame update SU area is improperly
set as pipe_src which is not aligned with DSC slice height.
Fix these by creating local rectangle using
crtc_state->hw.adjusted_mode.crtc_hdisplay and
crtc_state->hw.adjusted_mode.crtc_vdisplay. Use this local rectangle as
borders for SU area.
Arthur Husband [Mon, 6 Apr 2026 22:23:35 +0000 (15:23 -0700)]
ata: ahci: force 32-bit DMA for JMicron JMB582/JMB585
The JMicron JMB585 (and JMB582) SATA controllers advertise 64-bit DMA
support via the S64A bit in the AHCI CAP register, but their 64-bit DMA
implementation is defective. Under sustained I/O, DMA transfers targeting
addresses above 4GB silently corrupt data -- writes land at incorrect
memory addresses with no errors logged.
The failure pattern is similar to the ASMedia ASM1061
(commit 20730e9b2778 ("ahci: add 43-bit DMA address quirk for ASMedia
ASM1061 controllers")), which also falsely advertised full 64-bit DMA
support. However, the JMB585 requires a stricter 32-bit DMA mask rather
than 43-bit, as corruption occurs with any address above 4GB.
On the Minisforum N5 Pro specifically, the combination of the JMB585's
broken 64-bit DMA with the AMD Family 1Ah (Strix Point) IOMMU causes
silent data corruption that is only detectable via checksumming
filesystems (BTRFS/ZFS scrub). The corruption occurs when 32-bit IOVA
space is exhausted and the kernel transparently switches to 64-bit DMA
addresses.
Add device-specific PCI ID entries for the JMB582 (0x0582) and JMB585
(0x0585) before the generic JMicron class match, using a new board type
that combines AHCI_HFLAG_IGN_IRQ_IF_ERR (preserving existing behavior)
with AHCI_HFLAG_32BIT_ONLY to force 32-bit DMA masks.
Signed-off-by: Arthur Husband <artmoty@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org>
selftests/nolibc: don't skip tests for unimplemented syscalls anymore
The automatic skipping of tests on ENOSYS returns was introduced in
commit 349afc8a52f8 ("selftests/nolibc: skip tests for unimplemented
syscalls"). It handled the fact that nolibc would return ENOSYS for many
syscall wrappers on riscv32.
Nowadays nolibc handles all these correctly, so this logic is not used
anymore. To make missing nolibc functionality more obvious fail the
tests again if something is not implemented.
selftests/nolibc: explicitly handle ENOSYS from ptrace()
The automatic ENOSYS handling in EXPECT_SYSER() is about to be removed.
ptrace() will return legitimately return ENOSYS on qemu-user, so handle
it explicitly.
The standard syscall() function or macro uses the libc return value
convention. Errors returned from the kernel as negative values are
stored in errno and -1 is returned. Users who want to avoid using
errno don't have a way to call raw syscalls and check the returned
error.
Add a new macro _syscall() which works like the standard syscall()
but passes through the return value from the kernel unchanged.
The naming scheme and return values match the named _sys_foo()
system call wrappers already part of nolibc.
tools/nolibc: move the call to __sysret() into syscall()
__sysret() transforms the return value from the kernel into the libc
return value convention. There is no reason for it to be called in the
middle of the internals of the syscall() implementation macros.
Move the call up, directly into syscall(), to make the code simpler.
tools/nolibc: rename the internal macros used in syscall()
These macros are the internal implementation of syscall().
They can not be used by users. Align them with the standard naming
scheme for internal symbols.
The current name also prevents the addition of an application-usable
_syscall() symbol.
to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and
runtime values, but it returns unsigned long. tg_rt_schedulable() also
stores the current group limit and the accumulated child sum in unsigned
long.
On 32-bit builds, large bandwidth ratios can be truncated and the RT
group sum can wrap when enough siblings are present. That can let an
overcommitted RT hierarchy pass the schedulability check, and it also
narrows the helper result for other callers.
Return u64 from to_ratio() and use u64 for the RT group totals so
bandwidth ratios are preserved and compared at full width on both 32-bit
and 64-bit builds.
Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints") Assisted-by: Codex:GPT-5 Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com
When retry_aligned_read() encounters an overlapped stripe, it releases
the stripe via raid5_release_stripe() which puts it on the lockless
released_stripes llist. In the next raid5d loop iteration,
release_stripe_list() drains the stripe onto handle_list (since
STRIPE_HANDLE is set by the original IO), but retry_aligned_read()
runs before handle_active_stripes() and removes the stripe from
handle_list via find_get_stripe() -> list_del_init(). This prevents
handle_stripe() from ever processing the stripe to resolve the
overlap, causing an infinite loop and soft lockup.
Fix this by using __release_stripe() with temp_inactive_list instead
of raid5_release_stripe() in the failure path, so the stripe does not
go through the released_stripes llist. This allows raid5d to break out
of its loop, and the overlap will be resolved when the stripe is
eventually processed by handle_stripe().
Zide Chen [Fri, 13 Mar 2026 17:40:50 +0000 (10:40 -0700)]
perf/x86/intel/uncore: Remove extra double quote mark
The third argument in INTEL_UNCORE_FR_EVENT_DESC() is subject to
__stringify(), and the extra double quote marks can result in the
expansion "3.814697266e-6" in the sysfs knobs, instead of
3.814697266e-6.
This is incorrect, though it may still work for perf, e.g.
perf stat -e uncore_iio_free_running_0/bw_in_port0/
Fixes: d8987048f665 ("perf/x86/intel/uncore: Support IIO free-running counters on DMR") Closes: https://lore.kernel.org/all/20251231224233.113839-1-zide.chen@intel.com/ Reported-by: Chun-Tse Shao <ctshao@google.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chun-Tse Shao <ctshao@google.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260313174050.171704-5-zide.chen@intel.com
Zide Chen [Fri, 13 Mar 2026 17:40:49 +0000 (10:40 -0700)]
perf/x86/intel/uncore: Fix die ID init and look up bugs
In snbep_pci2phy_map_init(), in the nr_node_ids > 8 path,
uncore_device_to_die() may return -1 when all CPUs associated
with the UBOX device are offline.
Remove the WARN_ON_ONCE(die_id == -1) check for two reasons:
- The current code breaks out of the loop. This is incorrect because
pci_get_device() does not guarantee iteration in domain or bus order,
so additional UBOX devices may be skipped during the scan.
- Returning -EINVAL is incorrect, since marking offline buses with
die_id == -1 is expected and should not be treated as an error.
Separately, when NUMA is disabled on a NUMA-capable platform,
pcibus_to_node() returns NUMA_NO_NODE, causing uncore_device_to_die()
to return -1 for all PCI devices. As a result,
spr_update_device_location(), used on Intel SPR and EMR, ignores the
corresponding PMON units and does not add them to the RB tree.
Fix this by using uncore_pcibus_to_dieid(), which retrieves topology
from the UBOX GIDNIDMAP register and works regardless of whether NUMA
is enabled in Linux. This requires snbep_pci2phy_map_init() to be
added in spr_uncore_pci_init().
Keep uncore_device_to_die() only for the nr_node_ids > 8 case, where
NUMA is expected to be enabled.
Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info") Fixes: 65248a9a9ee1 ("perf/x86/uncore: Add a quirk for UPI on SPR") Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Steve Wahl <steve.wahl@hpe.com> Link: https://patch.msgid.link/20260313174050.171704-4-zide.chen@intel.com
Zide Chen [Fri, 13 Mar 2026 17:40:48 +0000 (10:40 -0700)]
perf/x86/intel/uncore: Skip discovery table for offline dies
This warning can be triggered if NUMA is disabled and the system
boots with fewer CPUs than the number of CPUs in die 0.
WARNING: CPU: 9 PID: 7257 at uncore.c:1157 uncore_pci_pmu_register+0x136/0x160 [intel_uncore]
Currently, the discovery table continues to be parsed even if all CPUs
in the associated die are offline. This can lead to an array overflow
at "pmu->boxes[die] = box" in uncore_pci_pmu_register(), which may
trigger the warning above or cause other issues.
Fixes: edae1f06c2cd ("perf/x86/intel/uncore: Parse uncore discovery tables") Reported-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Steve Wahl <steve.wahl@hpe.com> Link: https://patch.msgid.link/20260313174050.171704-3-zide.chen@intel.com
Zide Chen [Fri, 13 Mar 2026 17:40:47 +0000 (10:40 -0700)]
perf/x86/intel/uncore: Fix iounmap() leak on global_init failure
Kernel test robot reported:
Unverified Error/Warning (likely false positive, kindly check if
interested):
arch/x86/events/intel/uncore_discovery.c:293:2-8:
ERROR: missing iounmap; ioremap on line 288 and execution via
conditional on line 292
If domain->global_init() fails in __parse_discovery_table(), the
ioremap'ed MMIO region is not released before returning, resulting
in an MMIO mapping leak.
Fixes: b575fc0e3357 ("perf/x86/intel/uncore: Add domain global init callback") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260313174050.171704-2-zide.chen@intel.com
Yu Kuai [Fri, 27 Mar 2026 14:07:29 +0000 (22:07 +0800)]
md: wake raid456 reshape waiters before suspend
During raid456 reshape, direct IO across the reshape position can sleep
in raid5_make_request() waiting for reshape progress while still
holding an active_io reference. If userspace then freezes reshape and
writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io
and waits for all in-flight IO to drain.
This can deadlock: the IO needs reshape progress to continue, but the
reshape thread is already frozen, so the active_io reference is never
dropped and suspend never completes.
raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do
the same for normal md suspend when reshape is already interrupted, so
waiting raid456 IO can abort, drop its reference, and let suspend
finish.
The mdadm test tests/25raid456-reshape-deadlock reproduces the hang.
Xiao Ni [Tue, 24 Mar 2026 07:24:54 +0000 (15:24 +0800)]
md/raid1: serialize overlap io for writemostly disk
Previously, using wait_event() would wake up all waiters simultaneously,
and they would compete for the tree lock. The bio which gets the lock
first will be handled, so the write sequence cannot be guaranteed.
For example:
bio1(100,200)
bio2(150,200)
bio3(150,300)
The write sequence of fast device is bio1,bio2,bio3. But the write sequence
of slow device could be bio1,bio3,bio2 due to lock competition. This causes
data corruption.
Replace waitqueue with a fifo list to guarantee the write sequence. And it
also needs to iterate the list when removing one entry. If not, it may miss
the opportunity to wake up the waiting io.
For example:
bio1(1,3), bio2(2,4)
bio3(5,7), bio4(6,8)
These four bios are in the same bucket. bio1 and bio3 are inserted into
the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the
first one. bio3 returns from slow disk and tries to wake up the waiting
bios. bio2 is removed from the list and will be handled. But bio1 hasn't
finished. So bio2 will be added into waiting list again. Then bio1 returns
from slow disk and wakes up waiting bios. bio4 is removed from the list
and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left
on the waiting list. So it needs to iterate the waiting list to wake up
the right bio.
Yu Kuai [Mon, 23 Mar 2026 05:46:44 +0000 (13:46 +0800)]
md/md-llbitmap: optimize initial sync with write_zeroes_unmap support
For RAID-456 arrays with llbitmap, if all underlying disks support
write_zeroes with unmap, issue write_zeroes to zero all disk data
regions and initialize the bitmap to BitCleanUnwritten instead of
BitUnwritten.
This optimization skips the initial XOR parity building because:
1. write_zeroes with unmap guarantees zeroed reads after the operation
2. For RAID-456, when all data is zero, parity is automatically
consistent (0 XOR 0 XOR ... = 0)
3. BitCleanUnwritten indicates parity is valid but no user data
has been written
The implementation adds two helper functions:
- llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active
disks support write_zeroes with unmap
- llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each
rdev's data region to zero all disks
The zeroing and bitmap state setting happens in llbitmap_init_state()
during bitmap initialization. If any disk fails to zero, we fall back
to BitUnwritten and normal lazy recovery.
This significantly reduces array initialization time for RAID-456
arrays built on modern NVMe SSDs or other devices that support
write_zeroes with unmap.
Yu Kuai [Mon, 23 Mar 2026 05:46:43 +0000 (13:46 +0800)]
md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building
Add new states to the llbitmap state machine to support proactive XOR
parity building for RAID-5 arrays. This allows users to pre-build parity
data for unwritten regions before any user data is written.
New states added:
- BitNeedSyncUnwritten: Transitional state when proactive sync is triggered
via sysfs on Unwritten regions.
- BitSyncingUnwritten: Proactive sync in progress for unwritten region.
- BitCleanUnwritten: XOR parity has been pre-built, but no user data
written yet. When user writes to this region, it transitions to BitDirty.
New actions added:
- BitmapActionProactiveSync: Trigger for proactive XOR parity building.
- BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/
SyncingUnwritten states back to Unwritten before recovery starts.
State flows:
- Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean
- New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten
- On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean
- On disk replacement: CleanUnwritten regions are converted to Unwritten
before recovery starts, so recovery only rebuilds regions with user data
A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync
(write-only) to trigger proactive sync. This only works for RAID-456 arrays.
Yu Kuai [Mon, 23 Mar 2026 05:46:42 +0000 (13:46 +0800)]
md: add fallback to correct bitmap_ops on version mismatch
If default bitmap version and on-disk version doesn't match, and mdadm
is not the latest version to set bitmap_type, set bitmap_ops based on
the disk version.
Junrui Luo [Sat, 4 Apr 2026 07:44:35 +0000 (15:44 +0800)]
md/raid5: validate payload size before accessing journal metadata
r5c_recovery_analyze_meta_block() and
r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a
journal metadata block using on-disk payload size fields without
validating them against the remaining space in the metadata block.
A corrupted journal contains payload sizes extending beyond the PAGE_SIZE
boundary can cause out-of-bounds reads when accessing payload fields or
computing offsets.
Add bounds validation for each payload type to ensure the full payload
fits within meta_size before processing.
Gregory Price [Sun, 8 Mar 2026 23:42:02 +0000 (19:42 -0400)]
md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations
syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof()
triggered by create_strip_zones() in the RAID0 driver.
When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB
on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER).
Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to
kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with
__GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the
corresponding kfree calls to kvfree.
Both arrays are pure metadata lookup tables (arrays of pointers and zone
descriptors) accessed only via indexing, so they do not require physically
contiguous memory.
erofs: handle 48-bit blocks/uniaddr for extra devices
erofs_init_device() only reads blocks_lo and uniaddr_lo from the
on-disk device slot, ignoring blocks_hi and uniaddr_hi that were
introduced alongside the 48-bit block addressing feature.
For the primary device (dif0), erofs_read_superblock() already handles
this correctly by combining blocks_lo with blocks_hi when 48-bit
layout is enabled. But the same logic was not applied to extra
devices.
With a 48-bit EROFS image using extra devices whose uniaddr or blocks
exceed 32-bit range, the truncated values cause erofs_map_dev() to
compute wrong physical addresses, leading to silent data corruption.
Fix this by reading blocks_hi and uniaddr_hi in erofs_init_device()
when 48-bit layout is enabled, consistent with the primary device
handling. Also fix the erofs_deviceslot on-disk definition where
blocks_hi was incorrectly declared as __le32 instead of __le16.
DRBD used a custom mechanism to mark netlink attributes as "mandatory":
bit 14 of nla_type was repurposed as DRBD_GENLA_F_MANDATORY. Attributes
sent from userspace that had this bit present and that were unknown
to the kernel would lead to an error.
Since commit ef6243acb478 ("genetlink: optionally validate strictly/dumps"),
the generic netlink layer rejects unknown top-level attributes when
strict validation is enabled. DRBD never opted out of strict
validation, so unknown top-level attributes are already rejected by
the netlink core.
The mandatory flag mechanism was required for nested attributes, because
these are parsed liberally, silently dropping attributes unknown to the
kernel.
This prepares for the move to a new YNL-based family, which will use the
now-default strict parsing.
The current family is not expected to gain any new attributes, which
makes this change safe.
Old userspace that still sets bit 14 is unaffected: nla_type()
strips it before __nla_validate_parse() performs attribute validation,
so the bit never reaches DRBD.
Remove all references to the mandatory flag in DRBD.
selftests: mptcp: join: recreate signal endp with same ID
In this "delete re-add signal" MPTCP Join subtest, the endpoint linked
to the initial subflow is removed, but readded once with different ID.
It appears that there was an issue when reusing the same ID, recently
fixed by commit d191101dee25 ("mptcp: pm: in-kernel: always set ID as
avail when rm endp"). The test then now reuses the same ID the first
time, but continue to use another one (88) the second time.
Factor out a new helper tcp_recv_should_stop() from tcp_recvmsg_locked()
and tcp_splice_read() to check whether to stop receiving. And use this
helper in mptcp_recvmsg() and mptcp_splice_read() to reduce redundant code.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-3-b0b33bea3fed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gang Yan [Fri, 3 Apr 2026 11:29:28 +0000 (13:29 +0200)]
mptcp: preserve MSG_EOR semantics in sendmsg path
Extend MPTCP's sendmsg handling to recognize and honor the MSG_EOR flag,
which marks the end of a record for application-level message boundaries.
Data fragments tagged with MSG_EOR are explicitly marked in the
mptcp_data_frag structure and skb context to prevent unintended
coalescing with subsequent data chunks. This ensures the intent of
applications using MSG_EOR is preserved across MPTCP subflows,
maintaining consistent message segmentation behavior.
Gang Yan [Fri, 3 Apr 2026 11:29:27 +0000 (13:29 +0200)]
mptcp: reduce 'overhead' from u16 to u8
The 'overhead' in struct mptcp_data_frag can safely use u8, as it
represents 'alignment + sizeof(mptcp_data_frag)'. With a maximum
alignment of 7('ALIGN(1, sizeof(long)) - 1'), the overhead is at most
47, well below U8_MAX and validated with BUILD_BUG_ON().
This patch also adds a field named 'unused' for further extensions.
dpaa2: avoid linking objects into multiple modules
Each object file contains information about which module it gets linked
into, so linking the same file into multiple modules now causes a warning:
scripts/Makefile.build:254: drivers/net/ethernet/freescale/dpaa2/Makefile: dpaa2-mac.o is added to multiple modules: fsl-dpaa2-eth fsl-dpaa2-switch
scripts/Makefile.build:254: drivers/net/ethernet/freescale/dpaa2/Makefile: dpmac.o is added to multiple modules: fsl-dpaa2-eth fsl-dpaa2-switch
Change the way that dpaa2 is built by moving the two common files into a
separate module with exported symbols instead.
net: ethernet: ti-cpsw: fix linking built-in code to modules
There are six variants of the cpsw driver, sharing various parts of
the code: davinci-emac, cpsw, cpsw-switchdev, netcp, netcp_ethss and
am65-cpsw-nuss.
I noticed that this means some files can be linked into more than
one loadable module, or even part of vmlinux but also linked into
a loadable module, both of which mess up assumptions of the build
system, and causes warnings:
scripts/Makefile.build:279: cpsw_ale.o is added to multiple modules: ti-am65-cpsw-nuss ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: cpsw_priv.o is added to multiple modules: ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: cpsw_sl.o is added to multiple modules: ti-am65-cpsw-nuss ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: cpsw_ethtool.o is added to multiple modules: ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: davinci_cpdma.o is added to multiple modules: ti_cpsw ti_cpsw_new ti_davinci_emac
Change this back to having separate modules for each portion that
can be linked standalone, exporting symbols as needed:
- ti-cpsw-common.ko now contains both cpsw-common.o and
davinci_cpdma.o as they are always used together
- ti-cpsw-priv.ko contains cpsw_priv.o, cpsw_sl.o and cpsw_ethtool.o,
which are the core of the cpsw and cpsw-new drivers.
- ti-cpsw-sl.ko contains the cpsw-sl.o object and is used on
ti-am65-cpsw-nuss.ko in addition to the two other cpsw variants.
- ti-cpsw-ale.o is the one standalone module that is used by all
except davinci_emac.
Each of these will be built-in if any of its users are built-in, otherwise
it's a loadable module if there is at least one module using it. I did
not bring back the separate Kconfig symbols for this, but just handle
it using Makefile logic.
Note: ideally this is something that Kbuild complains about, but usually
we just notice when something using THIS_MODULE misbehaves in a way that
a user notices.
net: ethernet: ti-cpsw:: rename soft_reset() function
While looking at the glob symbols shared between the cpsw drivers,
I noticed that soft_reset() is the only one that is missing a proper
namespace prefix, and will pollute the kernel namespace, so rename
it to be consistent with the other symbols.
Jakub Kicinski [Fri, 3 Apr 2026 22:05:01 +0000 (15:05 -0700)]
eth: remove the driver for acenic / tigon1&2
The entire git history for this driver looks like tree-wide
and automated cleanups. There's even more coming now with
AI, so let's try to delete it instead.
net: skb: fix cross-cache free of KFENCE-allocated skb head
SKB_SMALL_HEAD_CACHE_SIZE is intentionally set to a non-power-of-2
value (e.g. 704 on x86_64) to avoid collisions with generic kmalloc
bucket sizes. This ensures that skb_kfree_head() can reliably use
skb_end_offset to distinguish skb heads allocated from
skb_small_head_cache vs. generic kmalloc caches.
However, when KFENCE is enabled, kfence_ksize() returns the exact
requested allocation size instead of the slab bucket size. If a caller
(e.g. bpf_test_init) allocates skb head data via kzalloc() and the
requested size happens to equal SKB_SMALL_HEAD_CACHE_SIZE, then
slab_build_skb() -> ksize() returns that exact value. After subtracting
skb_shared_info overhead, skb_end_offset ends up matching
SKB_SMALL_HEAD_HEADROOM, causing skb_kfree_head() to incorrectly free
the object to skb_small_head_cache instead of back to the original
kmalloc cache, resulting in a slab cross-cache free:
kmem_cache_free(skbuff_small_head): Wrong slab cache. Expected
skbuff_small_head but got kmalloc-1k
Fix this by always calling kfree(head) in skb_kfree_head(). This keeps
the free path generic and avoids allocator-specific misclassification
for KFENCE objects.
Fixes: bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head") Reported-by: Antonius <antonius@bluedragonsec.com> Closes: https://lore.kernel.org/netdev/CAK8a0jxC5L5N7hq-DT2_NhUyjBxrPocoiDazzsBk4TGgT1r4-A@mail.gmail.com/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260403014517.142550-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When send() or recv() returns -1 with errno == EINTR, the code skips
the break but still adds the return value to nwritten/nread, making it
decrease by 1. This leads to wrong buffer offsets and wrong bytes count.
Fix it by explicitly continuing the loop on EINTR, so the return value
is only added when it is positive.
====================
xsk: tailroom reservation and MTU validation
here we fix a long-standing issue regarding multi-buffer scenario in ZC
mode - we have not been providing space at the end of the buffer where
multi-buffer XDP works on skb_shared_info. This has been brought to our
attention via [0].
Unaligned mode does not get any specific treatment, it is user's
responsibility to properly handle XSK addresses in queues.
With adjustments included here in this set against xskxceiver I have
been able to pass the full test suite on ice.
selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom
Since we have changed how big user defined headroom in umem can be,
change the logic in testapp_stats_rx_dropped() so we pass updated
headroom validation in xdp_umem_reg() and still drop half of frames.
Test works on non-mbuf setup so __xsk_pool_get_rx_frame_size() that is
called on xsk_rcv_check() will not account skb_shared_info size. Taking
the tailroom size into account in test being fixed is needed as
xdp_umem_reg() defaults to respect it.
selftests: bpf: have a separate variable for drop test
Currently two different XDP programs share a static variable for
different purposes (picking where to redirect on shared umem test &
whether to drop a packet). This can be a problem when running full test
suite - idx can be written by shared umem test and this value can cause
a false behavior within XDP drop half test.
Introduce a dedicated variable for drop half test so that these two
don't step on each other toes. There is no real need for using
__sync_fetch_and_add here as XSK tests are executed on single CPU.
Skip tail adjust tests in xskxceiver for SKB mode as it is not very
friendly for it. multi-buffer case does not work as xdp_rxq_info that is
registered for generic XDP does not report ::frag_size. The non-mbuf
path copies packet via skb_pp_cow_data() which only accounts for
headroom, leaving us with no tailroom and causing underlying XDP prog to
drop packets therefore.
For multi-buffer test on other modes, change the amount of bytes we use
for growth, assume worst-case scenario and take care of headroom and
tailroom.
selftests: bpf: introduce a common routine for reading procfs
Parametrize current way of getting MAX_SKB_FRAGS value from {sys,proc}fs
so that it can be re-used to get cache line size of system's CPU. All
that just to mimic and compute size of kernel's struct skb_shared_info
which for xsk and test suite interpret as tailroom.
Introduce two variables to ifobject struct that will carry count of skb
frags and tailroom size. Do the reading and computing once, at the
beginning of test suite execution in xskxceiver, but for test_progs such
way is not possible as in this environment each test setups and torns
down ifobject structs.
xsk: validate MTU against usable frame size on bind
AF_XDP bind currently accepts zero-copy pool configurations without
verifying that the device MTU fits into the usable frame space provided
by the UMEM chunk.
This becomes a problem since we started to respect tailroom which is
subtracted from chunk_size (among with headroom). 2k chunk size might
not provide enough space for standard 1500 MTU, so let us catch such
settings at bind time. Furthermore, validate whether underlying HW will
be able to satisfy configured MTU wrt XSK's frame size multiplied by
supported Rx buffer chain length (that is exposed via
net_device::xdp_zc_max_segs).
Currently xp_assign_dev_shared() is missing XDP_USE_SG being propagated
to flags so set it in order to preserve mtu check that is supposed to be
done only when no multi-buffer setup is in picture.
Also, this flag has the same value as XDP_UMEM_TX_SW_CSUM so we could
get unexpected SG setups for software Tx checksums. Since csum flag is
UAPI, modify value of XDP_UMEM_SG_FLAG.
Fixes: d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem") Reviewed-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260402154958.562179-4-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multi-buffer XDP stores information about frags in skb_shared_info that
sits at the tailroom of a packet. The storage space is reserved via
xdp_data_hard_end():
Currently we do not respect this tailroom space in multi-buffer AF_XDP
ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
configure length of HW Rx buffer.
Typically drivers on Rx Hw buffers side work on 128 byte alignment so
let us align the value returned by xsk_pool_get_rx_frame_size() in order
to avoid addressing this on driver's side. This addresses the fact that
idpf uses mentioned function *before* pool->dev being set so we were at
risk that after subtracting tailroom we would not provide 128-byte
aligned value to HW.
Since xsk_pool_get_rx_frame_size() is actively used in xsk_rcv_check()
and __xsk_rcv(), add a variant of this routine that will not include 128
byte alignment and therefore old behavior is preserved.
Reviewed-by: Björn Töpel <bjorn@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260402154958.562179-3-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xsk: tighten UMEM headroom validation to account for tailroom and min frame
The current headroom validation in xdp_umem_reg() could leave us with
insufficient space dedicated to even receive minimum-sized ethernet
frame. Furthermore if multi-buffer would come to play then
skb_shared_info stored at the end of XSK frame would be corrupted.
HW typically works with 128-aligned sizes so let us provide this value
as bare minimum.
Multi-buffer setting is known later in the configuration process so
besides accounting for 128 bytes, let us also take care of tailroom space
upfront.
Reviewed-by: Björn Töpel <bjorn@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 99e3a236dd43 ("xsk: Add missing check on user supplied headroom size") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260402154958.562179-2-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Properly load values from insn_arays with non-zero offsets
The PTR_TO_INSN is always loaded via BPF_LDX_MEM instruction.
However, the verifier doesn't properly verify such loads when the
offset is not zero. Fix this and extend selftests with more scenarios.
v2 -> v3:
* Add a C-level selftest which triggers a load with nonzero offset (Alexei)
* Rephrase commit messages a bit
bpf: Do not ignore offsets for loads from insn_arrays
When a pointer to PTR_TO_INSN is dereferenced, the offset field
of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier
to not ignore this field.
Reported-by: Jiyong Yang <ksur673@gmail.com> Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Apparently, struct bpf_empty_prog_array exists entirely to populate a
single element of "items" in a global variable. "null_prog" is only
used during the initializer.
None of this is needed; globals will be correctly sized with an array
initializer of a flexible-array member.
So, remove struct bpf_empty_prog_array and adjust the rest of the code,
accordingly.
With these changes, fix the following warnings:
./include/linux/bpf.h:2369:31: warning: structure containing a flexible
array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jason Xing [Thu, 2 Apr 2026 03:41:14 +0000 (11:41 +0800)]
net: advance skb_defer_disable_key check in napi_consume_skb
When net.core.skb_defer_max is adjusted to zero, napi_consume_skb()
shouldn't go into that deeper in skb_attempt_defer_free() because it adds
an additional pair of local_bh_enable/disable() which is evidently not
needed. Advancing the check of the static key saves more cycles and
benefits non defer case.
====================
net: dsa: mxl862xx: add support for bridge offloading
As a next step to complete the mxl862xx DSA driver, add support for
offloading forwarding between bridged ports to the switch hardware.
This works pretty much without any big surprises, apart from two
subtleties:
* per-port control over flooding behavior has to be implemented by
(ab)using a 0-rate QoS meters as stopper in lack of any better
option.
* STP state transition unconditionally enables learning on a port
even if it was previously explicitely disabled (a firmware bug)
Note that as the driver is still lacking all VLAN features (which
are going to be added next), at this point some of the
bridge_vlan_aware.sh tests are failing after applying this series.
This is expected and cannot be avoided without implementing
port_vlan_filtering + port_vlan_add/del. And adding both bridge and
VLAN offloading at the same time would be too much for anyone to
review, so VLAN support is going to be submitted in a follow-up
series immediately after this series has been accepted.
All other relevant selftests (including bridge_vlan_unaware.sh) are
still passing.
Inspired by the comments received from Paolo Abeni as reply to v5
the driver now no longer caches bridge port membership in the
driver, but instead imports an existing helper from yt921x.c to dsa.h
in order to allow the driver to easily iterate over bridge members.
The mapping between DSA bridge num and firmware bridge ID is done
using a simple fixed-size array in mxl862xx_priv.
====================
Daniel Golle [Wed, 1 Apr 2026 13:35:01 +0000 (14:35 +0100)]
net: dsa: mxl862xx: implement bridge offloading
Implement joining and leaving bridges as well as add, delete and dump
operations on isolated FDBs, port MDB membership management, and
setting a port's STP state.
The switch supports a maximum of 63 bridges, however, up to 12 may
be used as "single-port bridges" to isolate standalone ports.
Allowing up to 48 bridges to be offloaded seems more than enough on
that hardware, hence that is set as max_num_bridges.
A total of 128 bridge ports are supported in the bridge portmap, and
virtual bridge ports have to be used eg. for link-aggregation, hence
potentially exceeding the number of hardware ports.
The firmware-assigned bridge identifier (FID) for each offloaded bridge
is stored in an array used to map DSA bridge num to firmware bridge ID,
avoiding the need for a driver-private bridge tracking structure.
Bridge member portmaps are rebuilt on join/leave using
dsa_switch_for_each_bridge_member().
As there are now more users of the BRIDGEPORT_CONFIG_SET API and the
state of each port is cached locally, introduce a helper function
mxl862xx_set_bridge_port(struct dsa_switch *ds, int port) which
applies the cached per-port state to hardware. For standalone user
ports (dp->bridge == NULL), it additionally resets the port to
single-port bridge state: CPU-only portmap, learning and flooding
disabled. The CPU port path sets its state explicitly before calling
this helper and is therefore not affected by the reset.
Note that MASK_VLAN_BASED_MAC_LEARNING is intentionally absent from
the firmware write mask. After mxl862xx_reset(), the firmware
initialises all VLAN-based MAC learning fields to 0 (disabled), so
SVL is the active mode by default without having to set it explicitly.
Note that there is no convenient way to control flooding on per-port
level, so the driver is using a 0-rate QoS meter setup as a stopper in
lack of any better option. In order to be perfect the firmware-enforced
minimum bucket size is bypassed by directly writing 0s to the relevant
registers -- without that at least one 64-byte packet could still
pass before the meter would change from 'yellow' into 'red' state.
Daniel Golle [Wed, 1 Apr 2026 13:34:52 +0000 (14:34 +0100)]
dsa: tag_mxl862xx: set dsa_default_offload_fwd_mark()
The MxL862xx offloads bridge forwarding in hardware, so set
dsa_default_offload_fwd_mark() to avoid duplicate forwarding of
packets of (eg. flooded) frames arriving at the CPU port.
Link-local frames are directly trapped to the CPU port only, so don't
set dsa_default_offload_fwd_mark() on those.
Daniel Golle [Wed, 1 Apr 2026 13:34:42 +0000 (14:34 +0100)]
net: dsa: add bridge member iteration macro
Drivers that offload bridges need to iterate over the ports that are
members of a given bridge, for example to rebuild per-port forwarding
bitmaps when membership changes. Currently drivers typically open-code
this by combining dsa_switch_for_each_user_port() with a
dsa_port_offloads_bridge_dev() check, or cache bridge membership
within the driver.
Add dsa_switch_for_each_bridge_member() macro to express this pattern
directly, and use it for the existing dsa_bridge_ports() inline
helper.
vsock: avoid timeout for non-blocking accept() with empty backlog
A common pattern in epoll network servers is to eagerly accept all
pending connections from the non-blocking listening socket after
epoll_wait indicates the socket is ready by calling accept in a loop
until EAGAIN is returned indicating that the backlog is empty.
Scheduling a timeout for a non-blocking accept with an empty backlog
meant AF_VSOCK sockets used by epoll network servers incurred hundreds
of microseconds of additional latency per accept loop compared to
AF_INET or AF_UNIX sockets.
Signed-off-by: Laurence Rowe <laurencerowe@gmail.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260402204918.130395-1-laurencerowe@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Fri, 3 Apr 2026 10:26:31 +0000 (03:26 -0700)]
psp: add missing device stats to get-stats reply attributes
Commit f05d26198cf2 ("psp: add stats from psp spec to driver facing
api") added device statistics (rx-packets, rx-bytes, rx-auth-fail,
rx-error, rx-bad, tx-packets, tx-bytes, tx-error) to the stats
attribute-set but did not add them to the get-stats operation reply
attributes. The kernel reports these attributes in the reply, so
list them in the spec to match.
Jeremy Kerr [Fri, 3 Apr 2026 02:24:51 +0000 (10:24 +0800)]
net: mctp: defer creation of dst after source-address check
Sashiko reports:
> mctp_dst_from_route() increments the device reference count by calling
> mctp_dev_hold(). When a valid route is found and dst is NULL, the
> structure copy is bypassed and rc is set to 0.
Instead of optimistically creating a dst from the final route (then
releasing it if the saddr is invalid), perform the saddr check first.
This means we don't have an unuecessary hold/release on the dev, which
could leak if the dst pointer is NULL. No caller passes a NULL dst at
present though (so the leak is not possible), but this is an intended
use of mctp_dst_from_route().
Jakub Kicinski [Thu, 2 Apr 2026 21:54:44 +0000 (14:54 -0700)]
selftests: net: py: color the basics in the output
Sometimes it's hard to spot the ok / not ok lines in the output.
This is especially true for the GRO tests which retries a lot
so there's a wall of non-fatal output printed.
Try to color the crucial lines green / red / yellow when running
in a terminal.
====================
Allow variable offsets for syscall PTR_TO_CTX
Enable pointer modification with variable offsets accumulated in the
register for PTR_TO_CTX for syscall programs where it won't be
rewritten, and the context is user-supplied and checked against the max
offset. See patches for details. Fixed offset support landed in [0].
By combining this set with [0], examples like the one below should
succeed verification now.
SEC("syscall")
int prog(void *ctx) {
int *arr = ctx;
int i;
* Drop comment around describing choice of fixed or variable offsets. (Eduard)
* Simplify offset adjustment for different cases. (Eduard)
* Add PTR_TO_CTX case in __check_mem_access(). (Eduard)
* Drop aligned access constraint from syscall_prog_is_valid_access().
* Wrap naked checks for BPF_PROG_TYPE_SYSCALL in a utility function. (Eduard)
* Split tests into separate clean up and addition patches. (Eduard)
* Remove CAP_SYS_ADMIN changes. (Eduard)
* Enable unaligned access to syscall ctx, add tests.
* Add more tests for various corner cases.
* Add acks. (Puranjay, Mykyta)
* Harden check_func_arg_reg_off check with ARG_PTR_TO_CTX.
* Add tests for unmodified ctx into tail calls.
* Squash unmodified ctx change into base commit.
* Add Reviewed-by's from Emil.
====================
selftests/bpf: Test modified syscall ctx for ARG_PTR_TO_CTX
Ensure that global subprogs and tail calls can only accept an unmodified
PTR_TO_CTX for syscall programs. For all other program types, fixed or
variable offsets on PTR_TO_CTX is rejected when passed into an argument
of any call instruction type, through the unified logic of
check_func_arg_reg_off.
Finally, add a positive example of a case that should succeed with all
our previous changes.
Add various tests to exercise fixed and variable offsets on PTR_TO_CTX
for syscall programs, and cover disallowed cases for other program types
lacking convert_ctx_access callback. Load verifier_ctx with CAP_SYS_ADMIN
so that kfunc related logic can be tested. While at it, convert assembly
tests to C. Unfortunately, ctx_pointer_to_helper_2's unpriv case conflicts
with usage of kfuncs in the file and cannot be run.
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests
will be added in later commits. Unaligned offsets already work for
variable offsets.
bpf: Support variable offsets for syscall PTR_TO_CTX
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit de6c7d99f898 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note
that we also lift the restriction on passing syscall context into
helpers, which was not permitted before, and passing modified syscall
context into kfuncs.
The structure of check_mem_access can be mostly shared and preserved,
but we must use check_mem_region_access to correctly verify access with
variable offsets.
The check made in check_helper_mem_access is hardened to only allow
PTR_TO_CTX for syscall programs to be passed in as helper memory. This
was the original intention of the existing code anyway, and it makes
little sense for other program types' context to be utilized as a memory
buffer. In case a convincing example presents itself in the future, this
check can be relaxed further.
We also no longer use the last-byte access to simulate helper memory
access, but instead go through check_mem_region_access. Since this no
longer updates our max_ctx_offset, we must do so manually, to keep track
of the maximum offset at which the program ctx may be accessed.
Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not
relax any fixed or variable offset constraints around PTR_TO_CTX even in
syscall programs, and require them to be passed unmodified. There are
several reasons why this is necessary. First, if we pass a modified ctx,
then the global subprog's accesses will not update the max_ctx_offset to
its true maximum offset, and can lead to out of bounds accesses. Second,
tail called program (or extension program replacing global subprog) where
their max_ctx_offset exceeds the program they are being called from can
also cause issues. For the latter, unmodified PTR_TO_CTX is the first
requirement for the fix, the second is ensuring max_ctx_offset >= the
program they are being called from, which has to be a separate change
not made in this commit.
All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and
make our relaxation around offsets conditional on it.
Drop coverage of syscall tests from verifier_ctx.c temporarily for
negative cases until they are updated in subsequent commits.
Agalakov Daniil [Wed, 18 Mar 2026 12:05:05 +0000 (15:05 +0300)]
e1000: check return value of e1000_read_eeprom
[Why]
e1000_set_eeprom() performs a read-modify-write operation when the write
range is not word-aligned. This requires reading the first and last words
of the range from the EEPROM to preserve the unmodified bytes.
However, the code does not check the return value of e1000_read_eeprom().
If the read fails, the operation continues using uninitialized data from
eeprom_buff. This results in corrupted data being written back to the
EEPROM for the boundary words.
Add the missing error checks and abort the operation if reading fails.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Alex Dvoretsky [Thu, 12 Mar 2026 13:52:55 +0000 (14:52 +0100)]
igb: remove napi_synchronize() in igb_down()
When an AF_XDP zero-copy application terminates abruptly (e.g., kill -9),
the XSK buffer pool is destroyed but NAPI polling continues.
igb_clean_rx_irq_zc() repeatedly returns the full budget, preventing
napi_complete_done() from clearing NAPI_STATE_SCHED.
igb_down() calls napi_synchronize() before napi_disable() for each queue
vector. napi_synchronize() spins waiting for NAPI_STATE_SCHED to clear,
which never happens. igb_down() blocks indefinitely, the TX watchdog
fires, and the TX queue remains permanently stalled.
napi_disable() already handles this correctly: it sets NAPI_STATE_DISABLE.
After a full-budget poll, __napi_poll() checks napi_disable_pending(). If
set, it forces completion and clears NAPI_STATE_SCHED, breaking the loop
that napi_synchronize() cannot.
napi_synchronize() was added in commit 41f149a285da ("igb: Fix possible
panic caused by Rx traffic arrival while interface is down").
napi_disable() provides stronger guarantees: it prevents further
scheduling and waits for any active poll to exit.
Other Intel drivers (ixgbe, ice, i40e) use napi_disable() without a
preceding napi_synchronize() in their down paths.
Remove redundant napi_synchronize() call and reorder napi_disable()
before igb_set_queue_napi() so the queue-to-NAPI mapping is only
cleared after polling has fully stopped.
Fixes: 2c6196013f84 ("igb: Add AF_XDP zero-copy Rx support") Cc: stable@vger.kernel.org Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Alex Dvoretsky <advoretsky@gmail.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Patryk Holda <patryk.holda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Michal Schmidt [Fri, 13 Mar 2026 08:22:29 +0000 (09:22 +0100)]
ixgbevf: add missing negotiate_features op to Hyper-V ops table
Commit a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by
negotiating supported features") added the .negotiate_features callback
to ixgbe_mac_operations and populated it in ixgbevf_mac_ops, but forgot
to add it to ixgbevf_hv_mac_ops. This leaves the function pointer NULL
on Hyper-V VMs.
During probe, ixgbevf_negotiate_api() calls ixgbevf_set_features(),
which unconditionally dereferences hw->mac.ops.negotiate_features().
On Hyper-V this results in a NULL pointer dereference:
Add ixgbevf_hv_negotiate_features_vf() that returns -EOPNOTSUPP and
wire it into ixgbevf_hv_mac_ops. The caller already handles -EOPNOTSUPP
gracefully.
Fixes: a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by negotiating supported features") Reported-by: Xiaoqiang Xiong <xxiong@redhat.com> Closes: https://issues.redhat.com/browse/RHEL-155455 Assisted-by: Claude:claude-4.6-opus-high Cursor Tested-by: Xiaoqiang Xiong <xxiong@redhat.com> Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ixgbe: stop re-reading flash on every get_drvinfo for e610
ixgbe_get_drvinfo() calls ixgbe_refresh_fw_version() on every ethtool
query for e610 adapters. That ends up in ixgbe_discover_flash_size(),
which bisects the full 16 MB NVM space issuing one ACI command per
step (~20 ms each, ~24 steps total = ~500 ms).
Profiling on an idle E610-XAT2 system with telegraf scraping ethtool
stats every 10 seconds:
kretprobe:ixgbe_get_drvinfo took 527603 us
kretprobe:ixgbe_get_drvinfo took 523978 us
kretprobe:ixgbe_get_drvinfo took 552975 us
kretprobe:ice_get_drvinfo took 3 us
kretprobe:igb_get_drvinfo took 2 us
kretprobe:i40e_get_drvinfo took 5 us
The half-second stall happens under the RTNL lock, causing visible
latency on ip-link and friends.
The FW version can only change after an EMPR reset. All flash data is
already populated at probe time and the cached adapter->eeprom_id is
what get_drvinfo should be returning. The only place that needs to
trigger a re-read is ixgbe_devlink_reload_empr_finish(), right after
the EMPR completes and new firmware is running. Additionally, refresh
the FW version in ixgbe_reinit_locked() so that any PF that undergoes a
reinit after an EMPR (e.g. triggered by another PF's devlink reload)
also picks up the new version in adapter->eeprom_id.
ixgbe_devlink_info_get() keeps its refresh call for explicit
"devlink dev info" queries, which is fine given those are user-initiated.
Fixes: c9e563cae19e ("ixgbe: add support for devlink reload") Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Petr Oros [Fri, 27 Mar 2026 07:46:58 +0000 (08:46 +0100)]
ice: fix PTP timestamping broken by SyncE code on E825C
The E825C SyncE support added in commit ad1df4f2d591 ("ice: dpll:
Support E825-C SyncE and dynamic pin discovery") introduced a SyncE
reconfiguration block in ice_ptp_link_change() that prevents
ice_ptp_port_phy_restart() from being called in several error paths.
Without the PHY restart, PTP timestamps stop working after any link
change event.
There are three ways the PHY restart gets blocked:
1. When DPLL initialization fails (e.g. missing ACPI firmware node
properties), ICE_FLAG_DPLL is not set and the function returns early
before reaching the PHY restart.
2. When ice_tspll_bypass_mux_active_e825c() fails to read the CGU
register, WARN_ON_ONCE fires and the function returns early.
3. When ice_tspll_cfg_synce_ethdiv_e825c() fails to configure the
clock divider for an active pin, same early return.
SyncE and PTP are independent features. SyncE reconfiguration failures
must not prevent the PTP PHY restart that is essential for timestamp
recovery after link changes.
Fix by making the entire SyncE block conditional on ICE_FLAG_DPLL
without an early return, and replacing the WARN_ON_ONCE + return error
handling inside the loop with dev_err_once + break. The function always
proceeds to ice_ptp_port_phy_restart() regardless of SyncE errors.
Fixes: ad1df4f2d591 ("ice: dpll: Support E825-C SyncE and dynamic pin discovery") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kohei Enju [Sun, 1 Feb 2026 14:14:00 +0000 (14:14 +0000)]
ice: ptp: don't WARN when controlling PF is unavailable
In VFIO passthrough setups, it is possible to pass through only a PF
which doesn't own the source timer. In that case the PTP controlling PF
(adapter->ctrl_pf) is never initialized in the VM, so ice_get_ctrl_ptp()
returns NULL and triggers WARN_ON() in ice_ptp_setup_pf().
Since this is an expected behavior in that configuration, replace
WARN_ON() with an informational message and return -EOPNOTSUPP.
Fixes: e800654e85b5 ("ice: Use ice_adapter for PTP shared data instead of auxdev") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Thu, 19 Mar 2026 21:13:35 +0000 (14:13 -0700)]
idpf: set the payload size before calling the async handler
Set the payload size before forwarding the reply to the async handler.
Without this, xn->reply_sz will be 0 and idpf_mac_filter_async_handler()
will never get past the size check.
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Li Li <boolli@google.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Thu, 19 Mar 2026 21:13:34 +0000 (14:13 -0700)]
idpf: improve locking around idpf_vc_xn_push_free()
Protect the set_bit() operation for the free_xn bitmask in
idpf_vc_xn_push_free(), to make the locking consistent with rest of the
code and avoid potential races in that logic.
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org Reported-by: Ray Zhang <sgzhang@google.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Thu, 19 Mar 2026 21:13:33 +0000 (14:13 -0700)]
idpf: fix PREEMPT_RT raw/bh spinlock nesting for async VC handling
Switch from using the completion's raw spinlock to a local lock in the
idpf_vc_xn struct. The conversion is safe because complete/_all() are
called outside the lock and there is no reason to share the completion
lock in the current logic. This avoids invalid wait context reported by
the kernel due to the async handler taking BH spinlock:
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Ray Zhang <sgzhang@google.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
David Gow [Sat, 28 Feb 2026 10:07:22 +0000 (18:07 +0800)]
kunit: tool: Terminate kernel under test on SIGINT
kunit.py will attempt to catch SIGINT / ^C in order to ensure the TTY isn't
messed up, but never actually attempts to terminate the running kernel (be
it UML or QEMU). This can lead to a bit of frustration if the kernel has
crashed or hung.
Terminate the kernel process in the signal handler, if it's running. This
requires plumbing through the process handle in a few more places (and
having some checks to see if the kernel is still running in places where it
may have already been killed).
Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com> Closes: https://lore.kernel.org/all/aaFmiAmg9S18EANA@smile.fi.intel.com/ Signed-off-by: David Gow <david@davidgow.net> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Shuvam Pandey [Fri, 27 Feb 2026 12:31:36 +0000 (18:16 +0545)]
kunit: tool: skip stty when stdin is not a tty
run_kernel() cleanup and signal_handler() invoke stty unconditionally.
When stdin is not a tty (for example in CI or unit tests), this writes
noise to stderr.
Call stty only when stdin is a tty.
Add regression tests for these paths:
- run_kernel() with non-tty stdin
- signal_handler() with non-tty stdin
- signal_handler() with tty stdin
David Gow [Fri, 27 Feb 2026 10:56:49 +0000 (18:56 +0800)]
kunit: tool: Recommend --raw_output=all if no KTAP found
If no KTAP header is found in the kernel output (e.g., because the kernel
crashed before the KUnit executor was run), it's very useful to re-run the
test with --raw_output=all, as that will show any error output (such as a
stacktrace, log message, BUG, etc). This is not particularly intuitive,
however, as --raw_output=all is not well known.
Add an extra log line to advertise --raw_output=all in this case, as it's
a terrible user experience to just get "Did any KUnit tests run?"
Signed-off-by: David Gow <david@davidgow.net> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Yuto Ohnuki [Mon, 16 Mar 2026 07:03:59 +0000 (07:03 +0000)]
blk-wbt: remove WARN_ON_ONCE from wbt_init_enable_default()
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
Joseph Qi [Fri, 3 Apr 2026 06:38:30 +0000 (14:38 +0800)]
ocfs2: fix out-of-bounds write in ocfs2_write_end_inline
KASAN reports a use-after-free write of 4086 bytes in
ocfs2_write_end_inline, called from ocfs2_write_end_nolock during a
copy_file_range splice fallback on a corrupted ocfs2 filesystem mounted on
a loop device. The actual bug is an out-of-bounds write past the inode
block buffer, not a true use-after-free. The write overflows into an
adjacent freed page, which KASAN reports as UAF.
The root cause is that ocfs2_try_to_write_inline_data trusts the on-disk
id_count field to determine whether a write fits in inline data. On a
corrupted filesystem, id_count can exceed the physical maximum inline data
capacity, causing writes to overflow the inode block buffer.
damon_stat_start() always allocates the module's damon_ctx object
(damon_stat_context). Meanwhile, if damon_call() in the function fails,
the damon_ctx object is not deallocated. Hence, if the damon_call() is
failed, and the user writes Y to “enabled” again, the previously
allocated damon_ctx object is leaked.
This cannot simply be fixed by deallocating the damon_ctx object when
damon_call() fails. That's because damon_call() failure doesn't guarantee
the kdamond main function, which accesses the damon_ctx object, is
completely finished. In other words, if damon_stat_start() deallocates
the damon_ctx object after damon_call() failure, the not-yet-terminated
kdamond could access the freed memory (use-after-free).
Fix the leak while avoiding the use-after-free by keeping returning
damon_stat_start() without deallocating the damon_ctx object after
damon_call() failure, but deallocating it when the function is invoked
again and the kdamond is completely terminated. If the kdamond is not yet
terminated, simply return -EAGAIN, as the kdamond will soon be terminated.
Sechang Lim [Tue, 31 Mar 2026 18:08:11 +0000 (18:08 +0000)]
mm/vma: fix memory leak in __mmap_region()
commit 605f6586ecf7 ("mm/vma: do not leak memory when .mmap_prepare
swaps the file") handled the success path by skipping get_file() via
file_doesnt_need_get, but missed the error path.
When /dev/zero is mmap'd with MAP_SHARED, mmap_zero_prepare() calls
shmem_zero_setup_desc() which allocates a new shmem file to back the
mapping. If __mmap_new_vma() subsequently fails, this replacement
file is never fput()'d - the original is released by
ksys_mmap_pgoff(), but nobody releases the new one.
Add fput() for the swapped file in the error path.
Reproducible with fault injection.
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 1
CPU: 2 UID: 0 PID: 366 Comm: syz.7.14 Not tainted 7.0.0-rc6 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x164/0x1f0
should_fail_ex+0x525/0x650
should_failslab+0xdf/0x140
kmem_cache_alloc_noprof+0x78/0x630
vm_area_alloc+0x24/0x160
__mmap_region+0xf6b/0x2660
mmap_region+0x2eb/0x3a0
do_mmap+0xc79/0x1240
vm_mmap_pgoff+0x252/0x4c0
ksys_mmap_pgoff+0xf8/0x120
__x64_sys_mmap+0x12a/0x190
do_syscall_64+0xa9/0x580
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Hao Li [Mon, 30 Mar 2026 03:57:49 +0000 (11:57 +0800)]
mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug
N_NORMAL_MEMORY is initialized from zone population at boot, but memory
hotplug currently only updates N_MEMORY. As a result, a node that gains
normal memory via hotplug can remain invisible to users iterating over
N_NORMAL_MEMORY, while a node that loses its last normal memory can stay
incorrectly marked as such.
The most visible effect is that
/sys/devices/system/node/has_normal_memory does not report a node even
after that node has gained normal memory via hotplug.
Also, list_lru-based shrinkers can undercount objects on such a node
and may skip reclaim on that node entirely, which can lead to a higher
memory footprint than expected.
Restore N_NORMAL_MEMORY maintenance directly in online_pages() and
offline_pages(). Set the bit when a node that currently lacks normal
memory onlines pages into a zone <= ZONE_NORMAL, and clear it when
offlining removes the last present pages from zones <= ZONE_NORMAL.
This restores the intended semantics without bringing back the old
status_change_nid_normal notifier plumbing which was removed in 8d2882a8edb8.
Current users that benefit include list_lru, zswap, nfsd filecache,
hugetlb_cgroup, and has_normal_memory sysfs reporting.
Link: https://lkml.kernel.org/r/20260330035941.518186-1-hao.li@linux.dev Fixes: 8d2882a8edb8 ("mm,memory_hotplug: remove status_change_nid_normal and update documentation") Signed-off-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 27 Mar 2026 00:32:22 +0000 (17:32 -0700)]
mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails
damon_call() for repeat_call_control of DAMON_SYSFS could fail if somehow
the kdamond is stopped before the damon_call(). It could happen, for
example, when te damon context was made for monitroing of a virtual
address processes, and the process is terminated immediately, before the
damon_call() invocation. In the case, the dyanmically allocated
repeat_call_control is not deallocated and leaked.
Fix the leak by deallocating the repeat_call_control under the
damon_call() failure.
Joanne Koong [Thu, 26 Mar 2026 21:51:27 +0000 (14:51 -0700)]
mm: reinstate unconditional writeback start in balance_dirty_pages()
Commit 64dd89ae01f2 ("mm/block/fs: remove laptop_mode") removed this
unconditional writeback start from balance_dirty_pages():
if (unlikely(!writeback_in_progress(wb)))
wb_start_background_writeback(wb);
This logic needs to be reinstated to prevent performance regressions for
strictlimited BDIs and memcg setups. The problem occurs because:
a) For strictlimited BDIs, throttling is calculated using per-wb
thresholds. The per-wb threshold can be exceeded even when the global
dirty threshold was not exceeded (nr_dirty < gdtc->bg_thresh)
b) For memcg-based throttling, memcg uses its own dirty count /
thresholds and can trigger throttling even when the global threshold
isn't exceeded
Without the unconditional writeback start, IO is throttled as it waits for
dirty pages to be written back but there is no writeback running. This
leads to severe stalls. On fuse, buffered write performance dropped from
1400 MiB/s to 2000 KiB/s.
Reinstate the unconditional writeback start so that writeback is
guaranteed to be running whenever IO needs to be throttled.
Link: https://lkml.kernel.org/r/20260326215127.3857682-2-joannelkoong@gmail.com Fixes: 64dd89ae01f2 ("mm/block/fs: remove laptop_mode") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
luo_session_deserialize() ignored the return value from
luo_file_deserialize(). As a result, a session could be left partially
restored even though the /dev/liveupdate open path treats deserialization
failures as fatal.
Propagate the error so a failed file deserialization aborts session
deserialization instead of silently continuing.
After analyzing this page’s state, it is hard to understand why the
mapcount is not 0 while the refcount is 0, since this page is not where
the issue first occurred. By enabling the CONFIG_DEBUG_VM config, I can
reproduce the crash as well and captured the first warning where the issue
appears:
The code that triggers the warning is: "VM_WARN_ON_FOLIO(page_folio(page +
nr_pages - 1) != folio, folio)", which indicates that set_pte_range()
tried to map beyond the large folio’s size.
By adding more debug information, I found that 'nr_pages' had overflowed
in filemap_map_pages(), causing set_pte_range() to establish mappings for
a range exceeding the folio size, potentially corrupting fields of pages
that do not belong to this folio (e.g., page->_mapcount).
After above analysis, I think the possible race is as follows:
CPU 0 CPU 1
filemap_map_pages() ext4_setattr()
//get and lock folio with old inode->i_size
next_uptodate_folio()
.......
//shrink the inode->i_size
i_size_write(inode, attr->ia_size);
//calculate the end_pgoff with the new inode->i_size
file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
end_pgoff = min(end_pgoff, file_end);
......
//nr_pages can be overflowed, cause xas.xa_index > end_pgoff
end = folio_next_index(folio) - 1;
nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
......
//map large folio
filemap_map_folio_range()
......
//truncate folios
truncate_pagecache(inode, inode->i_size);
To fix this issue, move the 'end_pgoff' calculation before
next_uptodate_folio(), so the retrieved folio stays consistent with the
file end to avoid 'nr_pages' calculation overflow. After this patch, the
crash issue is gone.
Link: https://lkml.kernel.org/r/1cf1ac59018fc647a87b0dad605d4056a71c14e4.1773739704.git.baolin.wang@linux.alibaba.com Fixes: 743a2753a02e ("filemap: cap PTE range to be created to allowed zero fill in folio_map_range()") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Yuanhe Shu <xiangzao@linux.alibaba.com> Tested-by: Yuanhe Shu <xiangzao@linux.alibaba.com> Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cheng-Yang Chou [Tue, 31 Mar 2026 09:18:33 +0000 (17:18 +0800)]
tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
scx_alloc_free_idx() zeroes the payload of a freed arena allocation
one word at a time. The loop bound was alloc->pool.elem_size / 8, but
elem_size includes sizeof(struct sdt_data) (the 8-byte union sdt_id
header). This caused the loop to write one extra u64 past the
allocation, corrupting the tid field of the adjacent pool element.
Fix the loop bound to (elem_size - sizeof(struct sdt_data)) / 8 so
only the payload portion is zeroed.
Test plan:
- Add a temporary sanity check in scx_task_free() before the free call:
if (mval->data->tid.idx != mval->tid.idx)
scx_bpf_error("tid corruption: arena=%d storage=%d",
mval->data->tid.idx, (int)mval->tid.idx);
Without this fix, running scx_sdt under fork-heavy load triggers the
corruption error. With the fix applied, the same workload completes
without error.
Fixes: 36929ebd17ae ("tools/sched_ext: add arena based scheduler") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
selftests/nolibc: only use libgcc when really necessary
nolibc should work without libgcc to be compatible with as many
toolchains as possible. Currently the functionality tested by
nolibc-test does not contain any dependencies, make sure it stays
this way by not linking libgcc anymore.
On the ppc target GCC always emits references to '_restgpr_' functions,
so keep linking libgcc there.