Maíra Canal [Sun, 31 May 2026 20:18:55 +0000 (17:18 -0300)]
drm/v3d: Fix global performance monitor reference counting
In the SET_GLOBAL ioctl, v3d_perfmon_find() bumps the reference count on
the perfmon it returns, but v3d_perfmon_set_global_ioctl() and
v3d_perfmon_delete() fail to release that reference on several paths:
1. v3d_perfmon_set_global_ioctl() leaks the reference on its error
paths.
2. CLEAR_GLOBAL leaks both the find reference and the reference
previously stashed in v3d->global_perfmon by the SET_GLOBAL ioctl
that configured it.
3. Destroying a perfmon that is the current global perfmon leaks the
reference stashed by the SET_GLOBAL ioctl.
drm/xe: Clear pending_disable before signaling suspend fence
In the schedule-disable done path for suspend, we
signal the suspend fence before clearing pending_disable.
That wakeup can let suspend_wait complete and resume be queued
immediately. The resume path may then reach enable_scheduling()
while pending_disable is still set and hit the
!exec_queue_pending_disable(q) assertion.
Fix this by clearing pending_disable before signaling
the suspend fence, so any resumed transition observes a
consistent state.
The idle-skip optimization bypasses GuC suspend, so the GPU may not
perform the context switch that flushes TLB entries for invalidated
userptr VMAs. In LR/preempt-fence VM mode, this can lead to missed TLB
invalidation and page faults during userptr invalidation tests.
Restore unconditional schedule toggling on suspend so the context-switch
TLB flush is always performed.
This optimization will be reintroduced with a fix that does not skip
suspend in LR/preempt-fence VM mode.
Fixes: 8533051ce920 ("drm/xe: Skip exec queue schedule toggle if queue is idle during suspend") Cc: stable@vger.kernel.org # v7.0+ Suggested-by: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Signed-off-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260603065217.3131066-2-tilak.tirumalesh.tangudu@intel.com
(cherry picked from commit 6a1e7934d9a6cf46aecae00a99c2603d1295e170) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Amit Matityahu [Wed, 3 Jun 2026 17:01:39 +0000 (17:01 +0000)]
timers/migration: Fix livelock in tmigr_handle_remote_up()
tmigr_handle_remote_cpu() skips timer_expire_remote() when cpu ==
smp_processor_id(), assuming the local softirq path already handled this
CPU's timers.
This assumption is wrong because jiffies can advance after the handling of
the CPU's global timers in run_timer_base(BASE_GLOBAL) and before
tmigr_handle_remote() evaluates the expiry times.
As a consequence a timer which expires after the CPU local timer wheel
advanced and becomes expired in the remote handling is ignored and the
callback is never invoked and removed from the timer wheel.
What's worse is that fetch_next_timer_interrupt_remote() keeps reporting it
as expired, and the event is re-queued with expires == now on each
iteration. The goto-again loop spins indefinitely.
Fix this by calling timer_expire_remote() unconditionally. That's minimal
overhead for the common case as __run_timer_base() returns immediately if
there is nothing to expire in the local wheel.
[ tglx: Amend change log and add a comment ]
Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Reported-by: Alon Kariv <alonka@amazon.com> Signed-off-by: Amit Matityahu <amitmat@amazon.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260603170139.33628-1-amitmat@amazon.com
Zhan Wei [Fri, 29 May 2026 14:25:33 +0000 (22:25 +0800)]
eventpoll: restore EP_UNACTIVE_PTR sentinel for ctx->tfile_check_list
Commit e09c77d94003 ("eventpoll: hoist CTL_ADD scratch state into
struct ep_ctl_ctx") moved tfile_check_list from a file-scope global into
the stack-allocated struct ep_ctl_ctx, and in doing so replaced the
EP_UNACTIVE_PTR sentinel with NULL on the grounds that "NULL is the
obvious 'empty' value and zero-init handles it for free", describing the
change as "No functional change". It is not.
epitems_head->next is overloaded with two roles:
1. the "next" pointer that threads a head onto ctx->tfile_check_list;
2. a membership flag: ep_remove_file() uses
!smp_load_acquire(&v->next) to mean "this head is not on any
pending ctx->tfile_check_list and is therefore safe to free".
Before that change the EP_UNACTIVE_PTR sentinel kept the two roles
disjoint: a head on the list always had a non-NULL ->next (another head,
or the sentinel at the tail), so ->next == NULL was equivalent to "never
listed". With the sentinel gone the list is NULL-terminated, so the tail
head's ->next is NULL as well. ep_remove_file()'s gate can no longer
distinguish "never listed" from "listed at the tail", and misfires on
the tail head.
The reader (reverse_path_check_proc) holds epnested_mutex +
rcu_read_lock; the freer (ep_remove_file) holds ep->mtx + file->f_lock.
The two sides share no mutex -- the sentinel was the invariant the gate
relied on to know it could skip the read side. With it gone,
ep_remove_file() frees the tail head while reverse_path_check_proc() is
still walking it, producing the slab-use-after-free read. The syzbot
reproducer hits this within seconds on a multi-CPU VM.
Restore the sentinel: initialize ctx.tfile_check_list to EP_UNACTIVE_PTR
in do_epoll_ctl_file(), and terminate the walk on "!= EP_UNACTIVE_PTR"
in reverse_path_check() and clear_tfile_check_list(). The tail head's
->next becomes the sentinel again rather than NULL, so
ep_remove_file()'s gate regains its exclusivity and stops misfiring on
the tail. ep_remove_file() itself is unchanged.
This restores the invariant the file-scope tfile_check_list relied on
before that change while preserving the ctx packaging it introduced.
Raf Dickson [Tue, 26 May 2026 10:43:56 +0000 (10:43 +0000)]
vsock/vmci: fix sk_ack_backlog leak on failed handshake
When vmci_transport_recv_connecting_server() returns an error,
vmci_transport_recv_listen() calls vsock_remove_pending() but never
calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented
permanently.
Repeated handshake failures (malformed packets, queue pair alloc
failure, event subscribe failure) cause sk_ack_backlog to climb
toward sk_max_ack_backlog. Once it reaches the limit the listener
permanently refuses all new connections with -ECONNREFUSED, a
silent denial of service requiring a process restart to recover.
The two existing sk_acceptq_removed() calls in af_vsock.c do not
cover this path: line 764 checks vsock_is_pending() which returns
false after vsock_remove_pending(), and line 1889 is only reached
on successful accept().
Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on
the error path.
The ASUS Vivobook 18 M1807GA (AMD ACP7.x, PCI 1022:15e2, subsystem
1043:3531) exposes a single Realtek RT721 SDCA codec on SoundWire link 1.
The BIOS reports the ACP audio config flag as 0 (SoundWire mode), so
snd_pci_ps claims the device, brings up the SoundWire managers and
enumerates the RT721 peripheral (sdw:0:1:025d:0721:01); the rt721-sdca
codec driver binds successfully.
No sound card is created, however: acp63_sdw_machine_select() walks
snd_soc_acpi_amd_acp70_sdw_machines[] and finds no entry whose declared
SoundWire peripherals are all present on the bus. The only existing RT721
entry, acp70_rt721_l1u0_tas2783x2_l1u8b, additionally requires two
TAS2783 amplifiers and deliberately exposes the RT721 as jack + DMIC
only. This M1807GA variant has no external amplifiers - the RT721's
internal AIF2 amplifier path drives the speakers - so that entry never
matches and no machine device is registered.
Add a standalone RT721 machine entry for link 1 exposing all three RT721
endpoints (jack/AIF1, speaker amplifier/AIF2, DMIC/AIF3), mirroring the
standalone RT722 configuration. Place it after the TAS2783 combo entry so
platforms that do have the external amplifiers continue to match the more
specific entry first.
ACPI _ADR of the codec: 0x000130025D072101
(link_id=1 version=3 mfg_id=0x025d Realtek part_id=0x0721 class=0x01).
Verified on the hardware: with the entry present the amd_sdw machine
binds, an "amd-soundwire" card is registered exposing the rt721-sdca
AIF1 (SimpleJack) and AIF2 (SmartAmp) PCM devices, and audio plays out
of the built-in speakers.
The MSI Raider A18 HX A9WJG has an internal digital microphone
connected through AMD ACP6x, but this machine does not expose the
AcpDmicConnected ACPI property, so acp_yc_mach does not bind.
Add a DMI quirk for this model.
This was tested on an MSI Raider A18 HX A9WJG with board MS-182L,
BIOS E182LAMS.31A, AMD ACP6x rev 0x62, and Realtek ALC274. After
applying the quirk, the internal microphone appears as an acp6x DMIC
capture device and records correctly.
The literal "1" is a signed 32-bit int. Shifting it by 32 positions is
undefined behaviour which may set this register to 0xFFFFFFFF, masking
all 32 slots.
Use GENMASK_U32() macro instead. For 32 slots this produces a zero mask:
~GENMASK_U32(31, 0) = ~0xFFFFFFFF = 0x00000000
Behaviour for fewer than 32 slots is unchanged.
Fixes: 770f58d7d2c5 ("ASoC: fsl_sai: Support multiple data channel enable bits") Cc: stable@vger.kernel.org Signed-off-by: Chancel Liu <chancel.liu@nxp.com> Reviewed-by: Shengjiu Wang <shengjiu.wang@gmail.com> Link: https://patch.msgid.link/20260601083327.1535185-1-chancel.liu@oss.nxp.com Signed-off-by: Mark Brown <broonie@kernel.org>
Sanghyun Park [Tue, 2 Jun 2026 09:49:05 +0000 (18:49 +0900)]
xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx()
Fix the race by pruning the bin while still holding xfrm_policy_lock,
before dropping it. Use __xfrm_policy_inexact_prune_bin() directly since
the lock is already held. The wrapper xfrm_policy_inexact_prune_bin()
becomes unused and is removed.
Race:
CPU0 (XFRM_MSG_DELPOLICY) CPU1 (XFRM_MSG_NEWSPDINFO)
========================== ==========================
xfrm_policy_bysel_ctx():
spin_lock_bh(xfrm_policy_lock)
bin = xfrm_policy_inexact_lookup()
__xfrm_policy_unlink(pol)
spin_unlock_bh(xfrm_policy_lock)
xfrm_policy_kill(ret)
// wide window, lock not held
xfrm_hash_rebuild():
spin_lock_bh(xfrm_policy_lock)
__xfrm_policy_inexact_flush():
kfree_rcu(bin) // bin freed
spin_unlock_bh(xfrm_policy_lock)
xfrm_policy_inexact_prune_bin(bin)
// UAF: bin is freed
Fixes: 6be3b0db6db8 ("xfrm: policy: add inexact policy search tree infrastructure") Signed-off-by: Sanghyun Park <sanghyun.park.cnu@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
ZhaoJinming [Mon, 1 Jun 2026 08:56:49 +0000 (16:56 +0800)]
net: bonding: fix NULL pointer dereference in bond_do_ioctl()
In bond_do_ioctl(), slave_dev is obtained via __dev_get_by_name() which
can return NULL if the requested interface name does not exist. However,
the subsequent slave_dbg() call is placed before the NULL check:
The slave_dbg() macro expands to netdev_dbg(bond_dev, "(slave %s): " fmt,
(slave_dev)->name, ...) which unconditionally dereferences slave_dev->name
before the NULL check is performed. This results in a NULL pointer
dereference kernel oops when a user calls bonding ioctl (e.g.
SIOCBONDENSLAVE, SIOCBONDRELEASE, etc.) with a non-existent slave
interface name.
This is reachable from userspace via the bonding ioctl interface with
CAP_NET_ADMIN capability, making it a potential local denial-of-service
vector.
Fix by moving the slave_dbg() call after the NULL check.
Sandipan Das [Mon, 1 Jun 2026 12:13:05 +0000 (17:43 +0530)]
perf/x86/amd/uncore: Use Node ID to identify DF and UMC domains
For DF and UMC PMUs, a single context is shared across all CPUs that
are connected to the same Data Fabric (DF) instance. Currently, the
Package ID, which also happens to be the Socket ID, is used to identify
DF instances. This approach works for configurations having a single IO
Die (IOD) but fails in the following cases.
* Older Zen 1 processors, where each chiplet has its own DF instance.
* Any configurations with multiple DF instances or multiple IODs in
the same package.
The correct way to identify DF instances is through the Node ID (not to
be confused with NUMA Node ID). This is available in ECX[7:0] of CPUID
leaf 0x8000001e and returned via topology_amd_node_id(). Hence, replace
usage of topology_logical_package_id() with topology_amd_node_id().
Zide Chen [Tue, 2 Jun 2026 14:49:08 +0000 (07:49 -0700)]
perf/x86/intel/uncore: Implement global init callback for GNR uncore
On Sierra Forest and Clearwater Forest, the FRZ_ALL bit in the global
control register defaults to 0 at boot, but UBOX PMON units do not
work until the global control register is explicitly written with 0
to trigger hardware initialization properly.
Implement the generic uncore_msr_global_init() callback and add it to
gnr_uncore_init[], which is shared by GNR, GRR, SRF, and CWF.
Fixes: 632c4bf6d007 ("perf/x86/intel/uncore: Support Granite Rapids") Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260602144908.263680-8-zide.chen@intel.com
Zide Chen [Tue, 2 Jun 2026 14:49:07 +0000 (07:49 -0700)]
perf/x86/intel/uncore: Fix uncore_die_to_cpu() for offline dies
If the die is offline when uncore_die_to_cpu() is called, it silently
returns 0, which is misleading. Return -1 in this case to indicate
that all CPUs on the die are offline and the caller can take care of
it accordingly.
Opportunistically, replace -EPERM with -ENODEV, as -ENODEV is
the appropriate error when no CPUs are online across all dies.
Zide Chen [Tue, 2 Jun 2026 14:49:06 +0000 (07:49 -0700)]
perf/x86/intel/uncore: Move die_to_cpu() to uncore.c
Move die_to_cpu() into uncore.c so it can be reused by the MSR
initialization path, preparing for the introduction of an MSR global
initialization callback.
Move the cpus_read_{lock,unlock}() out of the API, in order to make
it possible to be called when the lock is being held.
Add the uncore_ prefix for consistency with other uncore APIs.
Zide Chen [Tue, 2 Jun 2026 14:49:04 +0000 (07:49 -0700)]
perf/x86/intel/uncore: Fix PCI device refcount leak in UPI discovery
pci_get_domain_bus_and_slot() increments the reference count of the
returned PCI device and therefore requires a matching pci_dev_put().
In skx_upi_topology_cb() and discover_upi_topology(), the lookup is
performed inside a loop, but pci_dev_put() is only called once after
the loop. As a result, references from all previous iterations are
leaked.
Move pci_dev_put(dev) into the if (dev) block immediately after
upi_fill_topology() returns.
Opportunistically, fix uninitialized variable in skx_upi_topology_cb().
Fixes: 4cfce57fa42d ("perf/x86/intel/uncore: Enable UPI topology discovery for Skylake Server") Fixes: f680b6e6062e ("perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server") Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260602144908.263680-4-zide.chen@intel.com
Zide Chen [Tue, 2 Jun 2026 14:49:03 +0000 (07:49 -0700)]
perf/x86/intel/uncore: Guard against invalid box control address
Theoretically, intel_uncore_find_discovery_unit() could return NULL,
e.g., when a CPU die is offline during uncore enumeration and its PMU
units are not added to the discovery RB-tree.
Guard against a NULL return value and the resulting invalid box control
address (0) before accessing hardware.
Zide Chen [Tue, 2 Jun 2026 14:49:02 +0000 (07:49 -0700)]
perf/x86/intel/uncore: Fix discovery unit lookup for multi-die systems
In uncore_find_add_unit(), PMON units with the same unit ID may be
added to the uncore discovery RB-tree for different dies. These units
are distinguished by node->die.
However, intel_generic_uncore_box_ctl() uses a fixed die ID of -1 when
looking up the discovery unit, which may retrieve the wrong node on
multi-die systems.
Use box->dieid instead so the correct discovery unit is selected.
No functional issue has been observed so far because currently supported
platforms happen to use the same unit control register for such units.
Remove WARN_ON_ONCE() because with the above change a NULL unit can be
expected, e.g. when a CPU die is offline during uncore enumeration and
the unit is not added to the RB-tree. In this case,
intel_uncore_find_discovery_unit() returns NULL once the die becomes
online, and it is expected that the PMU box is not functional for that
die.
Fixes: b1d9ea2e1ca4 ("perf/x86/uncore: Apply the unit control RB tree to MSR uncore units") Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260602144908.263680-2-zide.chen@intel.com
Sandipan Das [Mon, 1 Jun 2026 14:58:46 +0000 (20:28 +0530)]
perf/x86/amd/core: Always use the NMI latency mitigation
Commit df4d29732fda ("perf/x86/amd: Change/fix NMI latency mitigation
to use a timestamp") fixed handling of late-arriving NMIs but limited
the mitigation to processors having X86_FEATURE_PERFCTR_CORE. However,
it is unclear if processors without this feature are also affected.
When Mediated vPMU is enabled on affected hardware, it is also possible
to bypass the fix inside KVM guests if X86_FEATURE_PERFCTR_CORE is
removed from the guest CPUID (e.g. using "-cpu host,-perfctr-core" with
QEMU). Hence, use the mitigation at all times.
David Woodhouse [Fri, 29 May 2026 20:01:29 +0000 (22:01 +0200)]
timekeeping: Add clocksource read_snapshot() method and hw_cycles to snapshot
Add a read_snapshot() callback to struct clocksource which returns the
derived clocksource value while also providing the underlying hardware
counter reading and the related clocksource ID.
This allows ktime_get_snapshot_id() to populate new hw_cycles and hw_csid
fields in struct system_time_snapshot.
For clocksources that are derived from an underlying counter (e.g., Hyper-V
TSC page scales TSC to 10MHz, kvmclock scales TSC to 1GHz), this provides
atomic access to both the derived value needed for timekeeping
calculations, and the raw hardware counter needed by consumers like KVM's
master clock and the vmclock PTP driver.
Thomas Gleixner [Fri, 29 May 2026 20:01:25 +0000 (22:01 +0200)]
ptp: Switch to ktime_get_snapshot_id() for pre/post timestamps
To prepare for a new PTP IOCTL, which exposes the raw counter value along
with the requested system time snapshot, switch the pre/post time stamp
sampling over to use ktime_get_snapshot_id() and fix up all usage sites.
No functional change intended.
The ptp_vmclock conversion was simplified by David Woodhouse.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260529195558.149589566@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:01:21 +0000 (22:01 +0200)]
timekeeping: Add support for AUX clock cross timestamping
Now that all prerequisites are in place add the final support for AUX
clocks in get_device_system_crosststamp(), which enables the PTP layer to
support hardware cross timestamps with a new IOTCL.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195558.097464513@kernel.org
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195558.046694580@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:01:13 +0000 (22:01 +0200)]
ALSA: hda/common: Use system_device_crosststamp::sys_systime
sys_systime is an alias for sys_realtime. The latter will be removed so
switch the code over to the new naming scheme.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.995298795@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:01:09 +0000 (22:01 +0200)]
wifi: iwlwifi: Use system_device_crosststamp::sys_systime
sys_systime is an alias for sys_realtime. The latter will be removed so
switch the code over to the new naming scheme.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.946612509@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:01:05 +0000 (22:01 +0200)]
ptp: Use system_device_crosststamp::sys_systime
.. to prepare for cross timestamps with variable clock IDs.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.897808371@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:01:00 +0000 (22:01 +0200)]
timekeeping: Prepare for cross timestamps on arbitrary clock IDs
PTP device system crosstime stamps support only CLOCK_REALTIME, which is
meaningless for AUX clocks. The PTP core hands in the clock ID already, so
prepare the core code to honor it.
- Add a new sys_systime field to struct system_device_crosststamp which
aliases the sys_realtime field. Once all users are converted
sys_realtime can be removed.
- Prepare get_device_system_crosststamp() and the related code for it by
switching to sys_systime and providing the initial changes to utilize
different time keepers.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.846634842@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:56 +0000 (22:00 +0200)]
timekeeping: Remove ktime_get_snapshot()
All users have been converted to ktime_get_snapshot_id().
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.795510496@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:52 +0000 (22:00 +0200)]
virtio_rtc: Use provided clock ID for history snapshot
The PTP core indicates in system_device_crosststamp::clock_id the clock ID
for which the system time stamp should be taken. That allows to utilize
hardware timestamps with e.g. AUX clocks.
Use ktime_get_snapshot_id() and hand the provided clock ID in.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260529195557.744271454@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:48 +0000 (22:00 +0200)]
net/mlx5: Use provided clock ID for history snapshot
The PTP core indicates in system_device_crosststamp::clock_id the clock ID
for which the system time stamp should be taken. That allows to utilize
hardware timestamps with e.g. AUX clocks.
Use ktime_get_snapshot_id() and hand the provided clock ID in.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.689836531@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:44 +0000 (22:00 +0200)]
igc: Use provided clock ID for history snapshot
The PTP core indicates in system_device_crosststamp::clock_id the clock ID
for which the system time stamp should be taken. That allows to utilize
hardware timestamps with e.g. AUX clocks.
Save the provided clock ID and use it in igc_phc_get_syncdevicetime() for
taking the history snapshot.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.637381831@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:40 +0000 (22:00 +0200)]
ice/ptp: Use provided clock ID for history snapshot
The PTP core indicates in system_device_crosststamp::clock_id the clock ID
for which then system time stamp should be taken. That allows to utilize
hardware timestamps with e.g. AUX clocks.
Save the provided clock ID and use it in ice_capture_crosststamp() for
taking the history snapshot.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.587226681@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:36 +0000 (22:00 +0200)]
wifi: iwlwifi: Adopt PTP cross timestamps to core changes
iwlwifi only supports CLOCK_REALTIME timestamps and provides an incomplete
result without system counter values etc.
It also zeros struct system_device_crosststamp, which is already zeroed in
the core and initialized with the clock ID.
Remove the zeroing and reject any request for a clock ID other than REALTIME.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.535447186@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:32 +0000 (22:00 +0200)]
timekeeping: Add CLOCK ID to system_device_crosststamp
The normal capture for system/device cross timestamps is CLOCK_REALTIME,
but that's meaningless for AUX clocks.
Add a clock_id field to struct system_device_crosststamp and initialize it
with CLOCK_REALTIME at the two places which prepare for cross
timestamps.
After the related code has been cleaned up, the core code will honor the
clock_id field when calculating the system time from the system counter
snapshot.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.482153523@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:28 +0000 (22:00 +0200)]
timekeeping: Add system_counterval_t to struct system_device_crosststamp
An upcoming extension to the PTP IOCTL requires to return the system counter
value and the clocksource ID to user space. get_device_system_crosststamp() has
this information already.
Extend struct system_device_crosststamp with a system_counterval_t member
and fill in the data.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.429406675@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:24 +0000 (22:00 +0200)]
timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id()
Now that all users are converted it's possible to enable snapshotting of
CLOCK_AUX time. The underlying clocksource is the same as for all other
CLOCK variants.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.380601005@kernel.org
All users are converted over to ktime_get_snapshot_id() and
system_time_snapshot::systime and ::monoraw.
Remove the leftovers.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.330029635@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:16 +0000 (22:00 +0200)]
ptp: ptp_vmclock: Convert to ktime_get_snapshot_id()
ktime_get_snapshot() is replaced by ktime_get_snapshot_id() which allows to
request a particular CLOCK ID to be captured along with the clocksource
counter.
Convert vmclock over and use the new system_time_snapshot::systime field,
which holds the system timestamp selected by the CLOCK ID argument.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260529195557.281425262@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:12 +0000 (22:00 +0200)]
KVM: arm64: Use ktime_get_snapshot_id() to snapshot CLOCK_REALTIME
ktime_get_snapshot() is replaced by ktime_get_snapshot_id() which allows to
request a particular CLOCK ID to be captured along with the clocksource
counter.
Convert the usage in kvm_get_ptp_time() over and use the new
system_time_snapshot::systime field, which holds the system timestamp
selected by the CLOCK ID argument.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://patch.msgid.link/20260529195557.225399927@kernel.org
Thomas Gleixner [Fri, 29 May 2026 20:00:08 +0000 (22:00 +0200)]
KVM: arm64: Use ktime_get_snapshot_id() to retrieve CLOCK_BOOTTIME
ktime_get_snapshot() is replaced by ktime_get_snapshot_id() which allows to
request a particular CLOCK ID to be captured along with the clocksource
counter.
Convert the tracing mechanism over and use the new
system_time_snapshot::systime field, which holds the system timestamp
selected by the CLOCK ID argument.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260529195557.174373054@kernel.org
Eva Kurchatova [Wed, 3 Jun 2026 15:31:42 +0000 (18:31 +0300)]
tracing: Fix CFI violation in probestub being called by tprobes
The probestub is a function to allow tprobes to hook to a tracepoint to
gain access to its parameters. The function itself is only referenced by
the tracepoint structure which lives in the __tracepoint section. objtool
explicitly ignores that section and when processing functions in the
kernel, if it detects one that has no references it will seal it to have
its ENDBR stripped on boot up.
This means when a tprobe is attached to the sched_wakeup tracepoint, when it
is triggered it will call __probestub_sched_wakeup and due to the missing
ENDBR on a CFI-enabled machine it will take a #CP exception.
Fix this by adding CFI_NOSEAL annotation to probestub declaration.
Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260603153147.573589-1-eva.kurchatova@virtuozzo.com Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks") Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
[ Updated change log ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reinette Chatre [Thu, 4 Jun 2026 03:38:47 +0000 (20:38 -0700)]
cpu: Add lockdep_is_cpus_held()/lockdep_is_cpus_write_held() stubs for !CONFIG_HOTPLUG_CPU
lockdep_is_cpus_held() and lockdep_is_cpus_write_held() are undefined when
!CONFIG_HOTPLUG_CPU. This is ok because their few callers protect the calls
with a "if (IS_ENABLED(CONFIG_HOTPLUG_CPU) ..." check.
It is error prone to require callers to protect lockdep_is_cpus_held()
and lockdep_is_cpus_write_held() with an IS_ENABLED(CONFIG_HOTPLUG_CPU)
check while the custom for equivalent functions, for example the more
prevalent lockdep_is_held(), is to not require similar protection.
It is also inconsistent with CPU hotplug lockdep code self since related
call lockdep_assert_cpus_held() does not require protection.
Create stubs for lockdep_is_cpus_held() and lockdep_is_cpus_write_held()
that returns 1 (LOCK_STATE_UNKNOWN/LOCK_STATE_HELD) when !CONFIG_HOTPLUG_CPU.
This makes the CPU hotplug lockdep checks consistent while following
existing lockdep custom. Drop the "extern" from the function declaration
as part of the move to match kernel coding style.
Keep the IS_ENABLED(CONFIG_HOTPLUG_CPU) checks in existing users since
removing them would change the logic of these expressions.
Tomas Glozar [Tue, 2 Jun 2026 12:55:06 +0000 (14:55 +0200)]
rtla: Fix parsing of multi-character short options
A bug was reported where the parsing of multi-character short options,
be it a short option with an argument specified without space (e.g.
"-p100") or multiple short options in one argument (e.g. -un), ignores
options specific to individual tools.
Furthermore, if the rest of the option is supposed to be an argument, it
gets reinterpreted as a string of options. For example, -p100 gets
interpreted as -100, which is due to hackish implementation read as
--no-thread --no-irq --no-irq with timerlat hist, causing rtla to error
out:
$ rtla timerlat hist -p100
no-irq and no-thread set, there is nothing to do here
This behavior is caused by getopt_long() being called twice on each
argument, once in common_parse_options(), once in [tool]_parse_args():
- common_parse_options() calls getopt_long() with an array of options
common for all rtla tools, while suppressing errors (opterr = 0).
- If the option fails to parse, common_parse_options() returns 0.
- If 0 is returned from common_parse_options(), [tool]_parse_args()
calls getopt_long() again, with its own set of options.
* [tool] means one of {osnoise,timerlat}_{top,hist}
At least in glibc, getopt_long() increments its internal nextchar
variable even if the option is not recognized. That means that in the
case of "-p100", common_parse_options() sets nextchar pointing to '1',
and timerlat_hist_parse_args() sees '1', not 'p'; the same then repeats
for the first and second '0'.
As there is no way to restore the correct internal state of
getopt_long() reliably, fix the issue by merging the common options back
to the longopt array and option string of the [tool]_parse_args()
functions using a macro; only the switch part is left in the original
function, which is renamed to set_common_option().
Antoine Tenart [Fri, 29 May 2026 14:47:00 +0000 (16:47 +0200)]
geneve: fix length used in GRO hint UDP checksum adjustment
In geneve_post_decap_hint the length used for adjusting the UDP checksum
should be 'skb->len - gro_hint->nested_tp_offset' (UDP length) instead
of 'skb->len - gro_hint->nested_nh_offset' (IP length).
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path") Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260521131436.748832-1-jhs%40mojatatu.com Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260529144713.780938-1-atenart@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge patch series "Remove b_end_io from struct buffer_head"
Matthew Wilcox (Oracle) <willy@infradead.org> says:
There are four benefits to this patchset. First, it removes an
indirect function call from the completion path. Instead of setting
bio->bi_end_io to end_bio_bh_io_sync() which then calls bh->b_end_io(),
we set bio->bi_end_io to the appropriate completion handler, replacing
two indirect function calls with one.
Second, there is a slight security advantage to this. It is one fewer
function pointer in the middle of a writable data structure that can
be corrupted. Third, it shrinks struct buffer_head from 104 bytes to 96
bytes, allowing for appropriximately 7% reduction in the amount of memory
used by buffer_heads (or, alternatively, allows 7% more buffer_heads to
be cached in the same amount of memory). Fourth, it removes some
atomic operations as the buffer refcount is no longer incremented before
calling the end_io handler.
I've run ext4 through its paces, and everything seems OK. I've only
compiled ocfs2/gfs2/nilfs/md-bitmap. Hopefully the maintainers can give
this series a try. I'm sending the entire series to linux-fsdevel
and cc'ing the fs-specific mailing lists for the fs-specific patches.
* patches from https://patch.msgid.link/20260528173150.1093780-1-willy@infradead.org: (34 commits)
buffer: Remove end_buffer_write_sync()
buffer: Change calling convention for end_buffer_read_sync()
buffer: Remove b_end_io
buffer: Remove submit_bh()
md-bitmap: Convert read_file_page and write_file_page to bh_submit()
nilfs2: Convert nilfs_mdt_submit_block to bh_submit()
nilfs2: Convert nilfs_gccache_submit_read_data to bh_submit()
nilfs2: Convert nilfs_btnode_submit_block to bh_submit()
buffer: Remove mark_buffer_async_write()
gfs2: Convert gfs2_aspace_write_folio to bh_submit()
gfs2: Remove use of b_end_io in gfs2_meta_read_endio()
gfs2: Convert gfs2_dir_readahead to bh_submit()
gfs2: Convert gfs2_metapath_ra to bh_submit()
ocfs2: Convert ocfs2_write_super_or_backup to bh_submit()
ocfs2: Convert ocfs2_read_blocks to bh_submit()
ocfs2: Convert ocfs2_read_block to bh_submit()
ocfs2: Convert ocfs2_write_block to bh_submit()
jbd2: Convert jbd2_write_superblock() to bh_submit()
jbd2: Convert journal commit to bh_submit()
ext4: Convert ext4_commit_super() to bh_submit()
...
buffer: Change calling convention for end_buffer_read_sync()
Unify end_buffer_read_sync() and __end_buffer_read_notouch()
by requiring the caller put the refcount on the buffer. The only caller
is in the gfs2_meta_read() path, and there we can put the refcount
after locking the buffer.
This shrinks buffer_head by 8 bytes, letting us pack more buffer heads
per slab. With a Debian config, it shrinks from 104 bytes to 96 bytes
which is 42 objects per 4KiB page rather than 39, a 7% reduction in the
amount of memory used.
There are no more callers of this function, so delete it.
end_buffer_async_write() then has only one caller left, so
inline it into bh_end_async_write().
Avoid an extra indirect function call by using bh_submit() instead of
submit_bh(). Also simplify the control flow now that the buffer
refcount is not put by bh_end_read().
Avoid an extra indirect function call by using bh_submit() instead
of submit_bh(). Also simplify the control flow now that the buffer
refcount is not put by bh_end_read().
ocfs2: Convert ocfs2_write_super_or_backup to bh_submit()
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-22-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-21-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-20-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer
refcount by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-19-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call by using bh_submit()
instead of submit_bh() in journal_submit_commit_record()
and jbd2_journal_commit_transaction(). These both use
journal_end_buffer_io_sync(), so it's more straightforward to do them
both at once.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-17-willy@infradead.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer refcount
by converting ext4_end_bitmap_read() from bh_end_io_t to bio_end_io_t
and calling bh_submit().
buffer: Convert __block_write_full_folio to __bh_submit()
Avoid an extra indirect function call by using __bh_submit() instead
of submit_bh_wbc(). Since there is only one caller of submit_bh_wbc()
left, inline it into submit_bh().
buffer: Convert block_read_full_folio to bh_submit()
Avoid an extra indirect function call by using bh_submit() instead of
submit_bh(). Since mark_buffer_async_read() would collapse to a single
function call, inline it into block_read_full_folio() along with its
extensive comment. Convert end_buffer_async_read_io() to
bh_end_async_read().
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-10-willy@infradead.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
buffer: Add bh_end_read(), bh_end_write() and bh_end_async_write()
These are the bio_end_io_t versions of end_buffer_read_sync(),
end_buffer_write_sync() and end_buffer_async_write(). They do not
contain a put_bh() call as it is no longer necessary.
All callers of mark_buffer_async_write_endio() pass
end_buffer_async_write, so we can inline mark_buffer_async_write_endio()
into mark_buffer_async_write() and just call that instead.
bh_submit() takes a bio_end_io allowing users to avoid the indirect
function call through bh->b_end_io, and eventually allowing us to remove
bh->b_end_io.
Nam Cao [Tue, 2 Jun 2026 17:51:46 +0000 (19:51 +0200)]
eventpoll: Fix epoll_wait() report false negative
ep_events_available() checks for available events by looking at
ep->rdllist and ep_is_scanning(). However, this is done without a lock
and can report false negative if ep_start_scan() or ep_done_scan() are
executed by another task concurrently. For example:
_________________________________________________________________________
|ep_start_scan()
| list_splice_init(&ep->rdllist, ...)
ep_events_available() |
!list_empty_careful(&ep->rdllist)|
|| ep_is_scanning(ep) |
| ep_enter_scan(ep)
___________________________________|_____________________________________
In the above examples, ep_events_available() sees no event despite
events being available. In case epoll_wait() is called with timeout=0,
epoll_wait() will wrongly return "no event" to user.
Introduce a sequence lock to resolve this issue.
Measuring the time consumption of 10 million loop iterations doing
epoll_wait(), the following performance drop is observed:
timeout #event before after diff
0ms 0 3727ms 3974ms +6.6%
0ms 1 8099ms 9134ms +13%
1ms 1 13525ms 13586ms +0.45%
Considering the use case of epoll_wait() (wait for events, do something
with the events, repeat), it should only contribute to a small portion of
user's CPU consumption. Therefore this performance drop is not alarming.
Nam Cao [Tue, 2 Jun 2026 17:51:45 +0000 (19:51 +0200)]
selftests/eventpoll: Add test for multiple waiters
Add a test whichs creates 64 threads who all epoll_wait() on the same
eventpoll. The source eventfd is written but never read, therefore all the
threads should always see an EPOLLIN event.
This test fails because of a kernel bug, which will be fixed by a follow-up
commit.
Merge patch series "mm: improve write performance with RWF_DONTCACHE"
Jeff Layton <jlayton@kernel.org> says:
This patch series is intended to improve write performance with
RWF_DONTCACHE. This version fixes additional stat accounting issues
found during review: integer promotion on 32-bit, cgroup writeback
domain migration, folio split flag preservation, and a UAF that could
occur in filemap_dontcache_kick_writeback().
* patches from https://patch.msgid.link/20260511-dontcache-v7-0-2848ddce8090@kernel.org:
mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
mm: track DONTCACHE dirty pages per bdi_writeback
mm: preserve PG_dropbehind flag during folio split
Jeff Layton [Mon, 11 May 2026 11:58:29 +0000 (07:58 -0400)]
mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context. Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).
Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background. This moves writeback submission
completely off the writer's hot path.
To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back. The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.
Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations. Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.
Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.
In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.
Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.
Dontcache now reaches 81% of buffered throughput (was 35%).
Competing writers (dontcache vs buffered, separate files):
Before After
buffered writer 868 433 MB/s
dontcache writer 415 433 MB/s
Aggregate 1,284 866 MB/s
Previously the buffered writer starved the dontcache writer 2:1.
With per-bdi_writeback tracking, both writers now receive equal
bandwidth. The aggregate matches the buffered-vs-buffered baseline
(863 MB/s), indicating fair sharing regardless of I/O mode.
The dontcache writer's p99.9 latency collapsed from 119 ms to
33 ms (-73%), eliminating the severe periodic stalls seen in the
baseline. Both writers now share identical latency profiles,
matching the buffered-vs-buffered pattern.
The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.
Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Jeff Layton [Mon, 11 May 2026 11:58:28 +0000 (07:58 -0400)]
mm: track DONTCACHE dirty pages per bdi_writeback
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).
Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
to prevent concurrent lookups from double-decrementing the counter, and
guarding the decrement with mapping_can_writeback() to match the increment
path.
Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
that the stat is properly migrated when an inode switches cgroup writeback
domains.
The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.
Suggested-by: Jan Kara <jack@suse.cz> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Jeff Layton [Mon, 11 May 2026 11:58:27 +0000 (07:58 -0400)]
mm: preserve PG_dropbehind flag during folio split
__split_folio_to_order() copies page flags from the original folio to
newly created sub-folios using an explicit allowlist, but PG_dropbehind
is not included. When a large folio with PG_dropbehind set is split,
only the head sub-folio retains the flag; all tail sub-folios silently
lose it and will not be reclaimed eagerly after writeback completes.
Add PG_dropbehind to the flag copy mask so that the drop-behind hint
is preserved across folio splits.
Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag") Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-1-2848ddce8090@kernel.org Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Merge patch series "libfs: set SB_I_NOEXEC and SB_I_NODEV in init_pseudo()"
John Hubbard <jhubbard@nvidia.com> says:
This began as a one-line dma-buf fix for a path_noexec() warning added
by commit 1e7ab6f67824 ("anon_inode: rework assertions"). Christoph
pointed out that the fix belongs higher up: a pseudo filesystem has no
reason not to set SB_I_NOEXEC by default. This series does that.
* Patch 1 sets both flags in init_pseudo(), so every pseudo
filesystem gets them. This is the only patch that changes a flag,
and the only one with Fixes:/Cc: stable.
* Patch 2 drops the assignments that are now redundant in the callers
that set them by hand.
Most callers already set one or both flags. I audited every
init_pseudo() caller. Here is what patch 1 actually changes for each.
The only visible effect is on dma-buf, where SB_I_NOEXEC silences the
warning. SB_I_NODEV is never consulted on these SB_NOUSER mounts, and
none of the callers that gain SB_I_NOEXEC are executed from.
caller had patch 1 adds
--------------------------- -------- --------------
fs/anon_inodes.c both nothing new
mm/secretmem.c both nothing new
virt/kvm/guest_memfd.c both nothing new
fs/nsfs.c both nothing new
fs/pidfs.c both nothing new
fs/aio.c NOEXEC NODEV
drivers/dma-buf/dma-buf.c neither NOEXEC + NODEV
net/socket.c neither NOEXEC + NODEV
fs/pipe.c neither NOEXEC + NODEV
kernel/resource.c neither NOEXEC + NODEV
fs/erofs/super.c neither NOEXEC + NODEV
fs/btrfs/tests/... neither NOEXEC + NODEV
drivers/vfio/vfio_main.c neither NOEXEC + NODEV
drivers/gpu/drm/drm_drv.c neither NOEXEC + NODEV
drivers/dax/super.c neither NOEXEC + NODEV
block/bdev.c neither NOEXEC + NODEV
* patches from https://patch.msgid.link/20260604025315.245910-1-jhubbard@nvidia.com:
libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
John Hubbard [Thu, 4 Jun 2026 02:53:14 +0000 (19:53 -0700)]
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
Since commit 1e7ab6f67824 ("anon_inode: rework assertions"),
path_noexec() warns when an anonymous-inode file is mmap'd from a
superblock that has not set SB_I_NOEXEC. dma-buf backs its files this
way and never set the flag, so mmap of any exported buffer trips the
warning on a CONFIG_DEBUG_VFS=y kernel:
init_pseudo() sets up internal SB_NOUSER mounts that are never
path-reachable. Set both flags here so every pseudo filesystem gets
them by default instead of each caller setting them.
SB_I_NODEV is inert for unreachable mounts. SB_I_NOEXEC has one
visible effect: an executable mapping of a pseudo-fs fd, such as a
dma-buf, now fails with -EPERM, which is the invariant the assertion
enforces. No in-tree caller maps these executable.
Reproduce on CONFIG_DEBUG_VFS=y:
make -C tools/testing/selftests/dmabuf-heaps
sudo ./tools/testing/selftests/dmabuf-heaps/dmabuf-heap -t system
Fixes: 1e7ab6f67824 ("anon_inode: rework assertions") Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: John Hubbard <jhubbard@nvidia.com> Link: https://patch.msgid.link/20260604025315.245910-2-jhubbard@nvidia.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Joanne Koong [Thu, 4 Jun 2026 01:18:58 +0000 (18:18 -0700)]
iomap: avoid potential null folio->mapping deref during error reporting
When a buffered read fails, iomap_finish_folio_read() reports the error
with fserror_report_io(folio->mapping->host, ...). This is called after
ifs->read_bytes_pending has been decremented by the bytes attempted to
be read.
For a folio split across multiple read completions, the folio is only
guaranteed to stay locked while read_bytes_pending > 0. Once
iomap_finish_folio_read() decrements read_bytes_pending, another
in-flight read can complete and end the read on the folio, which unlocks
it. This allows truncate logic to run and detach the folio (set
folio->mapping to NULL). The error reporting path then can dereference a
NULL folio->mapping. As reported by Sam Sun, this is the race that can
occur:
CPU0: failed completion CPU1: final completion CPU2: truncate
----------------------- ---------------------- --------------
read_bytes_pending -= len
finished = false
/* preempted before
fserror_report_io() */
read_bytes_pending -= len
finished = true
folio_end_read()
truncate clears
folio->mapping
fserror_report_io(
folio->mapping->host, ...)
^ NULL deref
Fix this by reporting the error first before decrementing
ifs->read_bytes_pending.
Fixes: a9d573ee88af ("iomap: report file I/O errors to the VFS") Cc: stable@vger.kernel.org Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/linux-fsdevel/CAEkJfYPhWdd59RKmuNLJg-bkypHz7xiOwaWyNVu3A8CUqQCnvg@mail.gmail.com/ Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260604011858.2297561-1-joannelkoong@gmail.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Jann Horn [Wed, 3 Jun 2026 19:31:57 +0000 (21:31 +0200)]
fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
may_decode_fh() accesses mount::mnt_ns without holding any locks; that
means the mount can concurrently be unmounted, and the mnt_namespace can
concurrently be freed after an RCU grace period.
This race can happens as follows, assuming that the mount point was
created by open_tree(..., OPEN_TREE_CLONE):
Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
in __prepend_path().
Additionally, document the semantics of mount::mnt_ns, and use WRITE_ONCE()
for writers that can race with lockless readers.
This bug is unreachable unless one of the following is set:
because it requires an RCU grace period to happen during a syscall without
an explicit preemption.
This doesn't seem to have interesting security impact; worst-case, it could
leak the result of an integer comparison to userspace (from the level
check in cap_capable()), cause an endless loop, or crash the kernel by
dereferencing an invalid address.
Miguel Ojeda [Tue, 2 Jun 2026 15:16:38 +0000 (17:16 +0200)]
kbuild: rust: rename flag to `-Zdebuginfo-for-profiling` for Rust >= 1.98
Starting with Rust 1.98.0 (expected 2026-08-20), the
`-Zdebug-info-for-profiling` flag has been renamed to
`-Zdebuginfo-for-profiling` (i.e. one less dash, to match `debuginfo`s
in other flags) [1].
Without this change, one gets in the latest nightlies:
Andrew Jones [Wed, 27 May 2026 14:27:03 +0000 (09:27 -0500)]
kconfig: add kconfig-sym-check static checker
Add 'make kconfig-sym-check', a static checker that finds Kconfig
symbols referenced in expressions (select, depends on, default, etc.)
but never defined via config/menuconfig anywhere in the tree. New
dangling symbols are reported as errors (exit 1) unless they are
listed in an exclusion file, e.g.
KCONFIG_SYM_CHECK_EXCLUDES=sym-check-excludes make kconfig-sym-check
The exclusion file lists one symbol per line; blank lines and lines
starting with '#' are ignored.
The checker also warns about uppercase N/Y/M used as tristate literal
values following the same logic as checkpatch.
This new static checker is the script used for [1] with a few
improvements to avoid some false positives.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216748 Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Andrew Jones <andrew.jones@linux.dev> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Julian Braha <julianbraha@gmail.com> Tested-by: Nicolas Schier <nsc@kernel.org> Acked-by: Nicolas Schier <nsc@kernel.org> Link: https://patch.msgid.link/20260527142703.107110-1-andrew.jones@linux.dev Signed-off-by: Nathan Chancellor <nathan@kernel.org>
====================
Fix use-after-free in metadata dst teardown in airoha_eth and mtk_eth_soc drivers
airoha_metadata_dst_free() and mtk_free_dev() call metadata_dst_free()
which frees the metadata_dst with kfree() immediately, bypassing the RCU
grace period.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path and runs call_rcu_hurry() if refcount goes to
zero.
====================
net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown
mtk_free_dev() calls metadata_dst_free() which frees the metadata_dst
with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, a use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.