git.ipfire.org Git - thirdparty/linux.git/log

Bluetooth: hci_sync: Set HCI_CMD_DRAIN_WORKQUEUE during device close

Since hci_dev_close_sync() can now be called during the reset path, we
should also set HCI_CMD_DRAIN_WORKQUEUE. This avoids queuing timeouts
while the hdev workqueue is being drained.

Fixes: 877afadad2dc ("Bluetooth: When HCI work queue is drained, only queue chained work")
Signed-off-by: Heitor Alves de Siqueira <halves@igalia.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: hci_core: Rework hci_dev_do_reset() to use hci_sync functions

The current HCI reset function in hci_core.c duplicates most of the work
done by hci_dev_close_sync(), and doesn't handle LE, advertising or
discovery.

Instead of porting these to hci_dev_do_reset(), directly call the
close/open functions from hci_sync to reset the hdev. MGMT now notifies
when a user performs a reset.

Suggested-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Heitor Alves de Siqueira <halves@igalia.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: ISO: serialize iso_sock_clear_timer with socket lock

iso_sock_close() calls iso_sock_clear_timer() before acquiring
lock_sock(sk).

iso_sock_clear_timer() reads iso_pi(sk)->conn twice without the
socket lock held:

    if (!iso_pi(sk)->conn)
        return;
    cancel_delayed_work(&iso_pi(sk)->conn->timeout_work);

Concurrently, iso_conn_del() executes under lock_sock(sk) and calls
iso_chan_del(), which sets iso_pi(sk)->conn to NULL and may result in
the final reference to the connection being dropped:

    CPU0                         CPU1
    ----                         ----
    iso_sock_clear_timer()
      if (conn != NULL) ...      lock_sock(sk)
                                   iso_chan_del()
                                   iso_pi(sk)->conn = NULL
      cancel_delayed_work(conn)  /* NULL deref or UAF */

iso_pi(sk)->conn is not stable across the unlock window, causing a
NULL pointer dereference or use-after-free.

Serialize iso_sock_clear_timer() with the socket lock by moving it
inside lock_sock()/release_sock(), matching the pattern used in
iso_conn_del() and all other call sites.

Fixes: ccf74f2390d60a2f9a75ef496d2564abb478f46a ("Bluetooth: Add BTPROTO_ISO socket type")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: ISO: fix UAF in iso_recv_frame

iso_recv_frame reads conn->sk under iso_conn_lock but releases the lock
before using sk, with no reference held. A concurrent iso_sock_kill()
can free sk in that window, causing use-after-free on sk->sk_state and
sock_queue_rcv_skb().

Fix by replacing the bare pointer read with iso_sock_hold(conn), which
calls sock_hold() while the spinlock is held, atomically elevating the
refcount before the lock drops. Add a drop_put label so sock_put() is
called on all exit paths where the hold succeeded.

Fixes: ccf74f2390d60a2f9a75ef496d2564abb478f46a ("Bluetooth: Add BTPROTO_ISO socket type")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: L2CAP: Fix possible crash on l2cap_ecred_conn_rsp

If dcid is received for an already-assigned destination CID the spec
requires that both channels to be discarded, but calling l2cap_chan_del
may invalidate the tmp cursor created by list_for_each_entry_safe and
in fact it is the wrong procedure as the chan->dcid may be assigned
previously it really needs to be disconnected.

Calling l2cap_chan_clone directly may still lead to l2cap_chan_del so
instead schedule l2cap_chan_timeout with delay 0 to close the channel
asynchronously.

Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: l2cap: clear chan->ident on ECRED reconfiguration success

l2cap_ecred_reconf_rsp() returns early on success without clearing
chan->ident. Every other L2CAP response handler (l2cap_ecred_conn_rsp,
l2cap_le_connect_rsp, l2cap_config_rsp) clears chan->ident after a
successful transaction to prevent the channel from matching subsequent
responses with the recycled ident value.

A remote attacker that completed a reconfiguration as the peer can
replay a failure response with the stale ident, causing the kernel to
match and destroy the already-established channel via
l2cap_chan_del(chan, ECONNRESET).

Clear chan->ident for all matching channels on success, and harden the
failure path by using l2cap_chan_hold_unless_zero() consistent with
other L2CAP handlers (l2cap_le_command_rej, __l2cap_get_chan_by_ident).

Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Zhenghang Xiao <kipreyyy@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

spi: spi-mem: avoid mutating op template in spi_mem_supports_op()

spi_mem_supports_op() accepts a const struct spi_mem_op pointer but
casts away const internally to call spi_mem_adjust_op_freq(). This
mutates the caller's op template, which causes stale max_freq values
when callers reuse persistent templates - subsequent calls won't
re-apply the device frequency cap since spi_mem_adjust_op_freq()
skips non-zero values.

Fix by operating on a stack-local copy instead.

Fixes: a4f8e70d75dd ("spi: spi-mem: add spi_mem_adjust_op_freq() in spi_mem_supports_op()")
Cc: Tianyu Xu <xtydtc@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Santhosh Kumar K <s-k6@ti.com>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://patch.msgid.link/20260527173736.2243004-1-s-k6@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>

fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling

A SOFTIRQ-safe to SOFTIRQ-unsafe lock order deadlock can occur in
send_sigio() and send_sigurg() when a process group receives a signal.

When FASYNC is configured for a process group (PIDTYPE_PGID), both
functions use read_lock(&tasklist_lock) to traverse the task list.
However, they are frequently called from softirq context:
- send_sigio() via input_inject_event -> kill_fasync
- send_sigurg() via tcp_check_urg -> sk_send_sigurg (NET_RX_SOFTIRQ)

The deadlock is caused by the rwlock writer fairness mechanism:
1. CPU 0 (process context) holds read_lock(&tasklist_lock) in do_wait().
2. CPU 1 (process context) attempts write_lock(&tasklist_lock) in
   fork() or exit() and spins, which blocks all new readers.
3. CPU 0 is interrupted by a softirq (e.g., TCP URG packet reception).
4. The softirq calls send_sigurg() and attempts to acquire
   read_lock(&tasklist_lock), deadlocking because CPU 1 is waiting.

Since PID hashing and do_each_pid_task() traversals are already
RCU-protected, the read_lock on tasklist_lock is no longer strictly
required for safe traversal. Fix this by replacing tasklist_lock with
rcu_read_lock(), aligning the process group signaling path with the
single-PID path. This also mitigates a potential remote denial of
service vector via TCP URG packets.

Lockdep splat:
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[...]
Chain exists of:
  &dev->event_lock --> &f_owner->lock --> tasklist_lock

Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(tasklist_lock);
                           local_irq_disable();
                           lock(&dev->event_lock);
                           lock(&f_owner->lock);
  <Interrupt>
    lock(&dev->event_lock);

*** DEADLOCK ***

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
Link: https://patch.msgid.link/20260523135210.590928-1-w15303746062@163.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Merge patch series "fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock"

Breno Leitao <leitao@debian.org> says:

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once
per page while holding pipe->mutex. The allocation can sleep doing
direct reclaim and runs memcg charging, which extends the critical
section and stalls any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up
to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page
alloc_page() loop before the mutex is taken. anon_pipe_get_page()
then drains the prealloc array first, falls back to the per-pipe
tmp_page[] cache, and only enters the allocator under the mutex for
the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page
writes that skip prealloc, or shortfalls when the prealloc loop
fails). Leftover prealloc pages are recycled into tmp_page[] before
unlock and any remainder is put_page()'d after unlock, keeping the
allocator out of the critical section on both sides.

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page
when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit
8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for
__GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the
task NUMA mempolicy honoured uniformly without open-coding the
charge.

I also vibe-coded a microbenchmark to validate the change. It sweeps
writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a
1 MB pipe and prints throughput + latency percentiles per config.

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1              readers=5               readers=10
          1   1119 -> 1354  (+21%)   1132 -> 1195   (+6%)   1060 -> 1240  (+17%)
          2   1162 -> 1487  (+28%)   1034 -> 1285  (+24%)   1069 -> 1213  (+14%)
          5   1152 -> 1357  (+18%)   1021 -> 1164  (+14%)    997 -> 1239  (+24%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1                 readers=5                readers=10
          1    55786 ->  46103 (-17%)   55164 ->  52260  (-5%)   58906 ->  50370 (-14%)
          2   107546 ->  84011 (-22%)  120837 ->  97206 (-20%)  116860 -> 103036 (-12%)
          5   271293 -> 230170 (-15%)  306089 -> 268429 (-12%)  313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5%
to 22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the
sweep so the alloc_page() calls inside anon_pipe_write() routinely
hit direct reclaim -- exactly the regime the patch targets.

  Throughput in MB/s (baseline -> patched, delta):
    writers   readers=1            readers=5            readers=10
          1   1088 -> 1438  (+32%)   996  -> 1477  (+48%)   989  -> 1194  (+21%)
          2   1076 -> 1378  (+28%)   1007 -> 1269  (+26%)   1018 -> 1234  (+21%)
          5   1052 -> 1311  (+25%)   986  -> 1225  (+24%)   972  -> 1249  (+29%)

  Avg write latency in ns (baseline -> patched, delta):
    writers   readers=1              readers=5              readers=10
          1    57397 ->  43406 (-24%)   62690 ->  42272 (-33%)   63136 ->  52272 (-17%)
          2   116121 ->  90700 (-22%)  124098 ->  98481 (-21%)  122754 -> 101217 (-18%)
          5   297122 -> 238322 (-20%)  316836 -> 255095 (-19%)  321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops
17% to 33% -- a noticeably bigger win than the no-pressure run.

That tracks: when alloc_page() has to dip into reclaim, the cost
of holding pipe->mutex across it is highest, and pulling the
allocation out of the critical section pays the most.

* patches from https://patch.msgid.link/20260524-fix_pipe-v3-0-bb4a75d23a90@debian.org:
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf
Link: https://patch.msgid.link/20260524-fix_pipe-v3-0-bb4a75d23a90@debian.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

selftests/pipe: add pipe_bench microbenchmark

Add a small selftest that stresses pipe->mutex contention by spawning N
writer threads that hammer a single pipe with multi-page writes, plus M
reader threads that drain. Each writer records its own write() latency
samples into a log2-bucketed histogram; main aggregates and prints
total writes, throughput, average and percentile (p50/p99) latencies,
and the maximum observed latency.

Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80%
--vm-method all) for the duration of the run, so alloc_page() in
anon_pipe_write() routinely hits direct reclaim. The flag fails
fast if stress-ng is not on $PATH.

Program print something like the following, for different writes,
readers, msgsizes and memory pressure:

config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576
memory_pressure=[no|yes]
writes: total=54451 rate=18150/s
throughput_MBps: 1134.40
lat_avg_ns: 275355
lat_p50_ns_upper: 262143
lat_p99_ns_upper: 1048575
lat_max_ns: 2145633

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260524-fix_pipe-v3-2-bb4a75d23a90@debian.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write

anon_pipe_write() takes pipe->mutex (aka "mutex protecting the whole
thing") and then, from the per-iteration anon_pipe_get_page() helper,
used to call alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT) once per page
while still holding it.

That allocation can sleep doing direct reclaim and/or runs memcg
charging, which extends the critical section and stalls a concurrent
reader on the very same mutex.

Just pre-alloc the required pages before the lock in an array and just pop
them inside the lock.

This can improve the pipe throughput up to 48% and reduce the
latency in 33%, easily seen when there is memory pressure and direct
reclaim.

Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260524-fix_pipe-v3-1-bb4a75d23a90@debian.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

fs: retire stale comment in fget_task_next()

The routine originally showed up in e9a53aeb5e0a838f ("file: Implement
task_lookup_next_fd_rcu"), afterwards it got renamed and started
entering RCU on its own in 8fd3395ec9051a52 ("get rid of
...lookup...fdget_rcu() family").

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20260522142152.1515572-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

PCI: qcom: Disable ASPM L0s for SA8775P

Due to a hardware issue, L0s is not properly supported by the PCIe
controller on the SA8775p SoC. If enabled, the L0s to L0 transition
triggers below correctable AER errors and may also affect link stability:

  pcieport 0000:00:00.0: PME: Signaling with IRQ 332
  pcieport 0000:00:00.0: AER: enabled with IRQ 332
  pcieport 0000:00:00.0: AER: Correctable error message received from 0000:01:00.0
  pci 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
  pci 0000:01:00.0:   device [17cb:1103] error status/mask=00001000/0000e000
  pci 0000:01:00.0:    [12] Timeout
  pcieport 0000:00:00.0: AER: Multiple Correctable error message received from 0000:01:00.0
  pcieport 0000:00:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
  pcieport 0000:00:00.0:   device [17cb:0115] error status/mask=00001000/0000e000
  pcieport 0000:00:00.0:    [12] Timeout

Hence, disable L0s for the SA8775p SoC to allow it to properly function
by sacrificing a little bit of power saving.

Fixes: 58d0d3e032b3 ("PCI: qcom-ep: Add support for SA8775P SOC")
Assisted-by: Claude:claude-4-6-sonnet
Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
[mani: commit log, corrected fixes tag]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260419093934.1223027-1-shengchao.guo@oss.qualcomm.com

fs/qnx6: fix pointer arithmetic in directory iteration

The conversion to qnx6_get_folio() in commit b2aa61556fcf
("qnx6: Convert qnx6_get_page() to qnx6_get_folio()")
introduced a regression in directory iteration. The pointer 'de'
and the 'limit' address were calculated using byte offsets from
a char pointer without scaling by the size of a QNX6 directory
entry.

This causes the driver to read from incorrect memory offsets,
leading to "invalid direntry size" errors and premature
termination of directory scans.

Fix this by casting 'kaddr' to 'struct qnx6_dir_entry *' before
applying the offset and last_entry(...) increments. This allows the
compiler to correctly scale the pointer arithmetic by the 32-byte
stride of the directory entry structure.

Fixes: b2aa61556fcf ("qnx6: Convert qnx6_get_page() to qnx6_get_folio()")
Cc: stable@vger.kernel.org
Signed-off-by: Arpith Kalaginanavoor <arpithk@nvidia.com>
Link: https://patch.msgid.link/20260526123858.1683035-1-arpithk@nvidia.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

VFS: fix possible failure to unlock in nfsd4_create_file()

atomic_create() in fs/namei.c drops the reference to the dentry
when it returns an error.
This behaviour was imported into dentry_create() so that it
will drop the reference if an error is returned from atomic_create(),
though not if vfs_create() returns an error (in the case where
->atomic_create is not supported).

The caller - nfsd4_create_file() - is made aware of this by checking
path->dentry, which will either be a counted reference to a dentry, or
an error pointer.

However the change to use start_creating()/end_creating() (which landed
shortly before the dentry_create() change landed, though was likely
developed around the same time) means that nfsd4_create_file() *needs* a
valid dentry so that it can unlock the parent.

The net result is that if NFSD exports a filesystem which uses
->atomic_create, and if a call to ->atomic_create returns an error, then
nfsd4_create_file() will pass an error pointer to end_creating()
and the parent will not be unlocked.

Fix this by changing dentry_create() to make sure path->dentry is always
a valid dentry, never an error-pointer. The actual error is already
returned a different way.

Note that if ->atomic_create() returns a different dentry (which may not
be possible in practice) we are guaranteed (because it is only ever
provided by d_spliace_alias()) that it will have the same d_parent and
so it will have the same effect when passed to end_creating().

Fixes: 64a989dbd144 ("VFS/knfsd: Teach dentry_create() to use atomic_open()")
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/177969022571.3379282.16448744624428323496@noble.neil.brown.name
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Reviewed-by: Jori Koolstra <jkoolstra@xs4all.nl>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

fs: fix spelling mistakes in comment

Fix three spelling errors in the comment for an internal file structure
allocation function:
- happend  →  happened
- over     →  exceed (grammatical fix)
- int      →  in

Changes since v1:
- Fix comma after e.g.
- Fix incorrect use of "imbalance"

Signed-off-by: Qingshuang Fu <fuqingshuang@kylinos.cn>
Link: https://patch.msgid.link/20260527100025.960339-1-fffsqian@163.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Merge branch 'dpll-zl3073x-various-fixes'

Ivan Vecera says:

====================
dpll: zl3073x: various fixes

Three fixes for the zl3073x DPLL driver.

Patch 1 exports __dpll_device_change_ntf() for use by drivers that
need to send device change notifications from within callbacks
already running under dpll_lock.

Patch 2 replaces the change_work workqueue mechanism with direct
calls to __dpll_device_change_ntf(), eliminating a race condition
where the work handler could dereference a freed dpll_dev pointer
during device teardown.

Patch 3 moves the freq_monitor flag from per-DPLL to per-device
scope to match the hardware behavior where frequency measurement
registers are shared across all DPLL channels.
====================

Link: https://patch.msgid.link/20260526074525.1451008-1-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: zl3073x: make frequency monitor a per-device attribute

The frequency monitoring feature uses shared hardware registers
that measure input reference frequencies independently of
individual DPLL channels. However, the freq_monitor flag was
incorrectly placed in the per-DPLL structure, causing each
channel to track its own enable/disable state independently.

Since the DPLL core calls measured_freq_get() only for the first
pin registration, the measured_freq_check() in the periodic worker
was gated by the per-DPLL freq_monitor flag of whichever channel
happens to be checked. If the first DPLL channel had frequency
monitoring disabled while another had it enabled, measurements
were never reported.

Move freq_monitor from struct zl3073x_dpll to struct zl3073x_dev
so all DPLL channels share a single flag, matching the hardware
behavior. Update freq_monitor_set() to notify other DPLL devices
about the change (like phase_offset_avg_factor_set() already does)
and remove the mode-dependent guard in zl3073x_dpll_changes_check()
since all input pin monitoring (pin state, phase offset, FFO, and
measured frequency) works correctly in all DPLL modes.

Fixes: bfc923b642874 ("dpll: zl3073x: implement frequency monitoring")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260526074525.1451008-4-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: zl3073x: use __dpll_device_change_ntf() and remove change_work

The change_work was introduced to send device change notifications
from DPLL device callbacks without deadlocking on dpll_lock, since
the callbacks are already invoked under that lock. Now that
__dpll_device_change_ntf() is exported for callers that already
hold dpll_lock, use it directly and remove the change_work
infrastructure entirely.

This eliminates a race condition where change_work could be
re-scheduled after cancel_work_sync() during device teardown,
potentially causing the handler to dereference a freed or NULL
dpll_dev pointer.

Fixes: 9363b4837659 ("dpll: zl3073x: Allow to configure phase offset averaging factor")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260526074525.1451008-3-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: export __dpll_device_change_ntf() for use under dpll_lock

Export __dpll_device_change_ntf() so that drivers can send device
change notifications from within device callbacks, which are already
called under dpll_lock. Using dpll_device_change_ntf() in that
context would deadlock.

Add lockdep_assert_held() to catch misuse without the lock held.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260526074525.1451008-2-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge patch series "fs: replace __get_free_pages() call with kmalloc()"

Mike Rapoport (Microsoft) <rppt@kernel.org> says:

This is a (small) part of larger work of replacing page allocator calls
with kmalloc.

* patches from https://patch.msgid.link/20260523-b4-fs-v1-0-275e36a83f0e@kernel.org:
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  fuse: replace __get_free_page() with kmalloc()
  isofs: replace __get_free_page() with kmalloc()
  jbd2: replace __get_free_pages() with kmalloc()
  jfs: replace __get_free_page() with kmalloc()
  libfs: simple_transaction_get(): replace get_zeroed_page() with kzalloc()
  NFSD: replace __get_free_page() with kmalloc() in nfsd_buffered_readdir()
  NFS: remove unused page and page2 in nfs4_replace_transport()
  NFS: replace __get_free_page() with kmalloc() in nfs_show_devname()
  nilfs2: replace get_zeroed_page() with kzalloc()
  ocfs2/dlm: replace __get_free_page() with kmalloc()
  proc: replace __get_free_page() with kmalloc()
  quota: allocate dquot_hash with kmalloc()

Link: https://patch.msgid.link/20260523-b4-fs-v1-0-275e36a83f0e@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

bfs: replace get_zeroed_page() with kzalloc()

bfs_dump_imap() allocates temporary buffer with get_zeroed_page().

kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.

Replace use of get_zeroed_page() with kzalloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-17-275e36a83f0e@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

binfmt_misc: replace __get_free_page() with kmalloc()

bm_entry_read() allocates temporary buffer using __get_free_page().

kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.

Replace use of __get_free_page() with kmalloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-16-275e36a83f0e@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

configfs: replace __get_free_pages() with kzalloc()

configfs allocates staging buffers __get_free_pages().

kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.

Replace use of __get_free_pages() with kzalloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-15-275e36a83f0e@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

fs/namespace: use __getname() to allocate mntpath buffer

mnt_warn_timestamp_expiry() allocates memory for a path with
__get_free_page() although there is a dedicated helper for allocation of
file paths: __getname().

Replace __get_free_page() for allocation of a path buffer with __getname().

Christian Brauner <brauner@kernel.org> says:

Pass PATH_MAX (not PAGE_SIZE) to d_path() to match the size that
__getname() actually allocates, and drop the now-unnecessary NULL check
around __putname() since __putname() handles NULL. Both per Jan Kara's
review feedback, acked by the author.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-14-275e36a83f0e@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

MAINTAINERS: Add my employer to my entries

AMD pays for my IOMMU maintainer work, so mention that in the
MAINTAINERS file as well.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

MAINTAINERS: Add Vasant Hegde to reviewers of AMD IOMMU

Vasant has a long history of providing valuable feedback and testing
results for the AMD IOMMU code. Still, too often he gets not Cc'ed on
code changes, so make his reviewer status official.

Acked-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

Merge tag 'asoc-fix-v7.1-rc5' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v7.1

This round of fixes is mostly Sirini's Qualcomm cleanups that have been
in review for a while, we also have a couple of small fixes from Cássio.

Merge branch 'net-handshake-anchor-request-lifetime-to-a-pinned-file-reference'

Chuck Lever says:

====================
net/handshake: anchor request lifetime to a pinned file reference

handshake_nl_accept_doit() has accumulated four follow-on fixes
since 3b3009ea8abb ("net/handshake: Create a NETLINK service for
handling handshake requests"): 7ea9c1ec66bc, 7798b59409c3,
fe67b063f687, and dabac51b8102.  Each was a local refcount or
NULL-check correction; none moved where the file reference is
owned, and the same code keeps producing the same class of bug.
Reworking the ownership is what breaks the pattern.

For the duration of a request, sock->file has no single owner.
Submit publishes the request without taking a file reference;
accept_doit acquires one inside the handler, after the request
has already left the pending list.  The consumer can drop its
own reference at any time, including the moment between
handshake_req_next() popping the request and accept_doit
reaching get_file().  The submit-side sock_hold() pins only
struct sock; struct socket and sock->file remain under the
consumer's control via the file descriptor.

This series places the file reference under unambiguous
ownership.  handshake_req_submit() pins it on the request and
completion or cancel drops it (patches 4-5); the submit-side
sock_hold() then becomes redundant, and dropping it also closes
a publish-before-pin race the late sock_hold itself opened
(patch 6).  The handshake_complete() API and its consumers move
to a uniform negative-errno sign convention (patch 3), with the
matching sign correction in nvme-tcp (patch 2).  Patch 1
hardens hn_lock for BH context, the netns-exit drain fix
builds on the new file-pin infrastructure (patch 8), and new
KUnit file-count assertions verify the refcount contract
(patch 7).

Three things in this restructuring want a careful look.  In
handshake_complete(), the fput() of the request's file
reference has to come after hp_done() -- fput() can transitively
run handshake_sk_destruct() and free the request, so the patch
stashes hr_file in a local first.  handshake_sk_destruct()
itself is kept on purpose: it owns rhashtable removal and
kfree, and remains the backstop if a consumer path bypasses
handshake_complete() entirely.  Third, handshake_req_next() now
returns its request with an extra get_file() held under
hn_lock; accept_doit must consume that reference (FD_PREPARE on
success, explicit fput on the fdf.err path), and any future
caller has to honor the same contract.

v2: https://patch.msgid.link/20260521-handshake-file-pin-v2-0-b9dadc472840@oracle.com
v1: https://patch.msgid.link/20260518-handshake-file-pin-v1-0-4bbcb7e62fda@oracle.com
====================

Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-0-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

platform/x86: Move delayed work on system_dfl_wq

Currently the code enqueue work items using {queue|mod}_delayed_work(),
using system_wq, which will be deprecated soon and replaced by
system_percpu_wq.

commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")

The function(s) mentioned earlier, end up calling __queue_delayed_work(),
which set a global timer that could fire anywhere, enqueuing the work
where the timer fired.

Unbound works could benefit from scheduler task placement, to optimize
performance and power consumption.

Since the workqueue work doesn't rely on per-cpu variables, there is no
obvious reason that justify the use of a per-cpu workqueue. So change
system_wq with system_dfl_wq so that the work may benefit from
scheduler task placement.

Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20260515145851.318787-1-marco.crivellari@suse.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: x86-android-tablets: Use named initializers for struct i2c_device_id

While being less compact, using named initializers allows to more easily
see which members of the structs are assigned which value without having
to lookup the declaration of the struct. And it's also more robust
against changes to the struct definition.

This patch doesn't modify the compiled array, only its representation in
source form benefits. The former was confirmed with x86 and arm64
builds.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20260519154040.1594878-2-u.kleine-koenig@baylibre.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform: arm64 Use named initializers for struct i2c_device_id

While being less compact, using named initializers allows to more easily
see which members of the structs are assigned which value without having
to lookup the declaration of the struct. And it's also more robust
against changes to the struct definition.

This patch doesn't modify the compiled arrays, only their representation
in source form benefits. The former was confirmed with x86 and arm64
builds.

While touching all these arrays, unify usage of whitespace in the list
terminator.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Link: https://patch.msgid.link/20260519144341.1589034-2-u.kleine-koenig@baylibre.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

net/handshake: Drain pending requests at net namespace exit

The arguments to list_splice_init() in handshake_net_exit() are
reversed. The call moves the local empty "requests" list onto
hn->hn_requests, leaving the local list empty, so the subsequent
drain loop runs zero iterations. Pending handshake requests that
had not yet been accepted are not torn down when the net namespace
is destroyed; each one keeps a reference on a socket file and on
the handshake_req allocation.

Pass the source and destination in the documented order
(list_splice_init(list, head) moves list onto head) so the pending
list is transferred to the local scratch list and drained through
handshake_complete().

Fixing the splice direction exposes a list-corruption race. After
the splice each req->hr_list still has non-empty link pointers,
threading the stack-local scratch list rather than hn_requests.
A concurrent handshake_req_cancel() -- for example, from sunrpc's
TLS timeout on a kernel socket whose netns reference was not
taken -- finds the request through the rhashtable, calls
remove_pending(), and sees !list_empty(&req->hr_list).
__remove_pending_locked() then list_del_init()s an entry off the
scratch list while the drain iterates, corrupting it. The same
call arriving after the drain loop has run list_del() on an
entry hits LIST_POISON instead.

Have remove_pending() check HANDSHAKE_F_NET_DRAINING under
hn_lock and report not-found when drain is in progress. The
drain has already taken ownership; handshake_complete()'s existing
test_and_set on HANDSHAKE_F_REQ_COMPLETED still arbitrates
between drain and cancel for who calls the consumer's hp_done. Use
list_del_init() rather than list_del() in the drain so req->hr_list
does not carry LIST_POISON after drain releases the entry.

The DRAINING guard in remove_pending() makes cancel return false,
but cancel still falls through to test_and_set_bit on
HANDSHAKE_F_REQ_COMPLETED and drops the request's hr_file reference.
Without another pin, if that is the last reference, sk_destruct frees
the request while it is still linked on the drain loop's local list.
Pin each request's hr_file under hn_lock before releasing the list,
and drop that drain pin after the loop finishes with the request.

Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-8-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/handshake: Verify file-reference balance in submit paths

The new file-reference contract on struct handshake_req is silently
breakable: a missing get_file() at submit or a missing fput() on an
error path leaves the file leaked but does not crash the test, so
the existing absence-of-crash checks pass either way.

Snapshot file_count(filp) before each handshake_req_submit() in
the submit-success, EAGAIN, EBUSY, and cancel tests, and assert
the expected balance after submit and again after cancel. The
already-completed cancel test also asserts the post-complete
balance, which pins down that handshake_complete() drops the
reference and that the subsequent cancel does not double-fput.
The destroy test gets the same treatment before __fput_sync(),
which double-checks that cancel's fput() ran and the only
remaining reference is the one sock_alloc_file() established.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-7-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/handshake: Close the submit-side sock_hold race

handshake_req_submit() publishes the request via
handshake_req_hash_add() and __add_pending_locked(), drops
hn_lock, and calls handshake_genl_notify() (which can sleep)
before taking sock_hold() on req->hr_sk. A fast tlshd ACCEPT
followed by DONE can drive handshake_complete()'s sock_put()
into the window between the spin_unlock and the late
sock_hold(); on a system where the consumer's fd held the
only sk reference, the late sock_hold() then operates on an
sk whose refcount has reached zero.

The preceding two patches install an explicit file reference
on struct handshake_req. That file pins sock->file, which
pins the embedded struct socket, which defers inet_release()'s
sock_put(). As long as hr_file is held, sk cannot reach refcount
zero from the consumer side, and the submit-side sock_hold()
with its matching sock_put() calls in handshake_complete() and
handshake_req_cancel() is now redundant.

Drop all three. The file reference already keeps each request's
socket alive, and the lifetime story is contained in a single
get_file()/fput() pair.

Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-6-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/handshake: hand off the pinned file reference to accept_doit

handshake_req_next() removes the request from the per-net
pending list and drops hn_lock before handshake_nl_accept_doit()
reads req->hr_sk->sk_socket and dereferences sock->file (once in
FD_PREPARE() and again in get_file()).  In that window a
consumer running tls_handshake_cancel() followed by sockfd_put()
(svc_sock_free) or __fput_sync() (xs_reset_transport) releases
sock->file.  sock_release() then runs sock_orphan(), zeroing
sk_socket, and frees the struct socket.  The accept-side code
either reads NULL through sk_socket or chases freed memory.

The submit-side sock_hold() does not prevent this.  sk_refcnt
protects struct sock, but struct socket and sock->file are
independently refcounted via the file descriptor the consumer
owns.  Pinning sk leaves sock and sock->file unprotected.

Retarget the accept-side dereferences at req->hr_file, which was
pinned at submit time, instead of req->hr_sk->sk_socket->file.
Pinning on its own is not sufficient: a consumer that cancels
between handshake_req_next() returning and accept_doit reaching
FD_PREPARE() takes the !remove_pending() branch in
handshake_req_cancel() and drops hr_file before the accept side
takes its own reference.  Hand off an additional file reference
inside handshake_req_next(), under hn_lock, so the accept side
operates on a reference that no concurrent handshake_req_cancel()
can revoke.  FD_PREPARE() consumes that handed-off reference,
either by transferring it to the new fd in fd_publish() or by
dropping it in the cleanup destructor on error; the explicit
get_file() that previously balanced FD_PREPARE() is therefore
redundant and goes away.

Update handshake_req_cancel_test2 and _test3 to simulate the
FD_PREPARE() consumption with an fput() so the kunit file-count
assertions stay balanced.

Reported-by: Chris Mason <clm@meta.com>
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-5-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/handshake: Take a long-lived file reference at submit

handshake_nl_accept_doit() needs the file pointer backing
req->hr_sk->sk_socket to survive the window between
handshake_req_next() and the subsequent FD_PREPARE() and get_file().
The submit-side sock_hold() does not provide that.  sk_refcnt keeps
struct sock alive, but struct socket is owned by sock->file: when
the consumer fputs the last file reference, sock_release() tears
the socket down regardless of any sock_hold.

Add an hr_file pointer to struct handshake_req and acquire an
explicit reference on sock->file during handshake_req_submit().
handshake_complete() and handshake_req_cancel() release the
reference on the completion-bit-winning path.

The submit error path must also release the file reference, but
after rhashtable insertion a concurrent handshake_req_cancel() can
discover the request and race the error path.  Gate the error-path
cleanup -- sk_destruct restoration, fput, and request destruction
-- with test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED), the same
serialization handshake_complete() and handshake_req_cancel()
already use.  When cancel has already claimed ownership, the submit
error path returns without touching the request; socket teardown
handles final destruction.

The accept-side dereferences are not yet retargeted; that change
comes in the next patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-4-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/handshake: Pass negative errno through handshake_complete()

handshake_complete() declares status as unsigned int and
tls_handshake_done() negates that value (-status) before handing
it to the TLS consumer. Consumers match on negative errno
constants -- xs_tls_handshake_done() has

switch (status) {
case 0:
case -EACCES:
case -ETIMEDOUT:
lower_transport->xprt_err = status;
break;
default:
lower_transport->xprt_err = -EACCES;
}

so the API as designed expects callers to pass positive errno
values that the tlshd shim then negates.

Three internal callers in handshake_nl_accept_doit(), the
net-exit drain, and a kunit test follow kernel convention and
pass negative errnos -- -EIO, -ETIMEDOUT, -ETIMEDOUT. The
implicit conversion to unsigned int turns -ETIMEDOUT into
0xFFFFFF92; the subsequent -status in tls_handshake_done()
wraps back to 110, the consumer's switch falls through, and
the xprt reports -EACCES on what should be -ETIMEDOUT or -EIO.

Fix the API rather than the call sites. The natural kernel
convention is negative errno in, negative errno out. Change
handshake_complete() and hp_done to take int status, drop the
negation in tls_handshake_done(), and negate once in
handshake_nl_done_doit() where status arrives from the wire
as an unsigned netlink attribute. The three internal callers
were already correct under that convention and need no change.

At the same wire boundary, declare MAX_ERRNO as the netlink
policy upper bound for HANDSHAKE_A_DONE_STATUS. Attribute
validation rejects out-of-range values before
handshake_nl_done_doit() runs, and negating a bounded u32 there
stays within int range -- closing the UBSAN-visible signed-
integer overflow that an unconstrained u32 would invoke.

Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-3-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

nvme-tcp: store negative errno in queue->tls_err

nvme_tcp_tls_done() assigns queue->tls_err in three branches.  The
ENOKEY lookup failure and the EOPNOTSUPP initializer both store
negative errnos.  The third branch, reached when the handshake
layer reports a non-zero status, stores -status.

The handshake layer delivers status to the consumer callback as a
negative errno; the other in-tree consumers --
xs_tls_handshake_done() and the nvmet target callback -- treat
their status argument that way.  The extra negation in
nvme_tcp_tls_done() flips the sign, leaving tls_err as a positive
value (for instance, +EIO), which nvme_tcp_start_tls() then
returns to its caller.

Drop the extra negation so queue->tls_err uniformly carries a
negative errno on failure.

Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-2-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/handshake: Use spin_lock_bh for hn_lock

nvmet_tcp_state_change(), a socket callback that runs in BH context,
can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue()
and tls_handshake_cancel().  handshake_req_cancel() acquires
hn->hn_lock with plain spin_lock().  If a process-context thread on
the same CPU holds hn->hn_lock when a softirq invokes the cancel path,
the lock attempt deadlocks.  This is the only caller that invokes
tls_handshake_cancel() from BH context; every other consumer calls it
from process context.

Deferring the cancel to process context in the NVMe target is not
straightforward: nvmet_tcp_schedule_release_queue() must call
tls_handshake_cancel() atomically with its state transition to
DISCONNECTING.  If the cancel were deferred, the handshake completion
callback could fire in the window before the cancel runs, observe the
unexpected state, and return without dropping its kref on the queue.
Reworking that interlock is considerably more invasive than hardening
the handshake lock.  Convert all hn->hn_lock acquisitions from
spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is
never taken with softirqs enabled.

Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-1-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: skbuff: fix missing zerocopy reference in pskb_carve helpers

pskb_carve_inside_header() and pskb_carve_inside_nonlinear() both copy
the old skb_shared_info header into a new buffer via memcpy(), which
includes the destructor_arg pointer (uarg) for MSG_ZEROCOPY skbs.
Neither function calls net_zcopy_get() for the new shinfo, creating an
unaccounted holder: every skb_shared_info with destructor_arg set will
call skb_zcopy_clear() once when freed, but the corresponding
net_zcopy_get() was never called for the new copy. Repeated calls
drive uarg->refcnt to zero prematurely, freeing ubuf_info_msgzc while
TX skbs still hold live destructor_arg pointers.

KASAN reports use-after-free on a freed ubuf_info_msgzc:

  BUG: KASAN: slab-use-after-free in skb_release_data+0x77b/0x810
  Read of size 8 at addr ffff88801574d3e8 by task poc/220

  Call Trace:
   skb_release_data+0x77b/0x810
   kfree_skb_list_reason+0x13e/0x610
   skb_release_data+0x4cd/0x810
   sk_skb_reason_drop+0xf3/0x340
   skb_queue_purge_reason+0x282/0x440
   rds_tcp_inc_free+0x1e/0x30
   rds_recvmsg+0x354/0x1780
   __sys_recvmsg+0xdf/0x180

  Allocated by task 219:
   msg_zerocopy_realloc+0x157/0x7b0
   tcp_sendmsg_locked+0x2892/0x3ba0

  Freed by task 219:
   ip_recv_error+0x74a/0xb10
   tcp_recvmsg+0x475/0x530

The skb consuming the late access still referenced the same uarg via
shinfo->destructor_arg copied by pskb_carve_inside_nonlinear() without
a refcount bump. This has been verified to be reliably exploitable: a
working proof-of-concept achieves full root privilege escalation from
an unprivileged local user on a default kernel configuration.

The fix follows the pattern of pskb_expand_head() which has the same
memcpy/cloned structure. For pskb_carve_inside_header(), net_zcopy_get()
is placed after skb_orphan_frags() succeeds, so the orphan error path
needs no cleanup. For pskb_carve_inside_nonlinear(), net_zcopy_get() is
placed after all failure points and just before skb_release_data(), so
no error path needs cleanup at all -- matching pskb_expand_head() more
closely and avoiding the need for a balancing net_zcopy_put().

Fixes: 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Minh Nguyen <minhnguyen.080505@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260526041240.329462-1-minhnguyen.080505@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ASoC: soc-card: add snd_soc_card_set_topology_name()

Some drivers want to use topology name, but currently each drivers are
setting it by own method.
This patch adds new snd_soc_card_set_topology_name() and do it by
same method.

Almost all driver doesn't set topology name, let's remove fixed name
array, and use devm_kasprintf() instead.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/878q942wce.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

drm/i915/dp: Account for AS_SDP guardband only when enabled

Currently the intel_dp_sdp_min_guardband() accounts for AS_SDP for all
platforms that support adaptive sync SDP even for configurations where
it cannot be enabled. Instead account for adaptive sync SDP guardband
only when it is enabled.

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-13-ankit.k.nautiyal@intel.com

drm/i915/dp: Enable AS SDP whenever VRR is possible or PR !async

Currently AS SDP is only configured when VRR is enabled.
With optimized guardband, we also need to account for wakeup time and other
relevant details that depend on the AS SDP position whenever AS SDP is
enabled. If a feature enabling AS SDP gets turned on later (after modeset),
the guardband might not be sufficient and may need to increase, triggering
a full modeset.

Additionally, for Panel Replay with Aux-less ALPM where the sink does
not support asynchronous video timing in PR active, the source must
keep transmitting Adaptive-Sync SDPs while PR is active.

So, always send AS SDP whenever there is a possibility to use it for VRR
OR for Panel Replay for synchronization.

v2: Check if AS SDP can be used for synchronization for VRR or PR. (Ville)
v3: Use intel_psr_needs_alpm_aux_less() instead of
intel_alpm_is_alpm_aux_less() to avoid including the LOBF case. (Ville)
Modify the commit message and subject.

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-12-ankit.k.nautiyal@intel.com

drm/i915/dp: Compute AS SDP after PSR compute config

A subsequent change makes intel_dp_needs_as_sdp() depend on
crtc_state->has_panel_replay, which is set by intel_psr_compute_config().

Move call for intel_dp_compute_as_sdp() after the
intel_psr_compute_config().

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-11-ankit.k.nautiyal@intel.com

drm/i915/dp: Compute and include coasting vtotal for AS SDP

DP v2.1 allows the source to temporarily suspend Adaptive-Sync SDP
transmission while Panel Replay is active when the sink supports
asynchronous video timing.

In such cases, the sink relies on the last transmitted AS SDP timing
information to maintain the refresh rate. To support this behavior,
compute and populate the coasting vtotal field in the AS SDP payload.

Include coasting vtotal in AS SDP packing, unpacking, and comparison,
and set it during late AS SDP configuration for PR with Aux-less ALPM
when asynchronous video timing is supported.

Note:
The coasting vtotal value is fully under driver control i.e. the HW does
not overwrite these payload bytes. HW only samples the PR_ALPM_CTL[AS SDP
Transmission in Active Disable] bit during PR active state and reflects it
in the AS SDP payload at the appropriate time.

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-10-ankit.k.nautiyal@intel.com

drm/i915/dp: Program AS SDP DB[1:0] for PR with Link off

For Panel Replay with AUX-less ALPM (link-off PR), the source must send
Adaptive-Sync SDP v2. Program DB[1:0] per DP spec v2.1:
- VRR AVT: 00b (variable VTotal)
- VRR FAVT: 10b/11b (TRR not reached/reached)
- Fixed timing with PR link-off (VRR off): 01b (AS disabled; VTotal fixed)

Also, drop the redundant target_rr assignment.

v2: Fix the else case. (Ville)

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-9-ankit.k.nautiyal@intel.com

drm/i915/dp: Set relevant Downspread Ctrl DPCD bits for PR + Auxless ALPM

If a Panel Replay capable sink, supports Async Video timing in
PR active state, then source does not necessarily need to send AS SDPs
during PR active.

However, if asynchronous video timing is not supported, then for PR with
Aux-less ALPM, the source must transmit Adaptive-Sync SDPs for video
timing synchronization while PR is active.

If the source needs to send AS SDP during PR active, this requires setting
DPCD 0x0107[6] (FIXED_VTOTAL_AS_SDP_EN_IN_PR_ACTIVE). This applies whether
VRR is enabled (AVT/FAVT) or fixed-timing mode is used.

This bit defines AS SDP timing behavior during PR Active, even if AS SDPs
are briefly suspended.

Program the relevant Downspread Ctrl DPCD bits accordingly.

v2: Instead of Panel Replay check simply use AS SDP enable check. (Ville)
v3: Since the bit is defined in context of Panel Replay and AS SDP, add
a check for both. (Ville)
v4: Extract pr_with_as_sdp logic into helper function. (Ville)

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-8-ankit.k.nautiyal@intel.com

drm/i915/psr: Program Panel Replay CONFIG3 using AS SDP transmission time

Panel Replay requires the AS SDP transmission time to be written into
PANEL_REPLAY_CONFIG3. This field was previously not programmed.

Use the AS SDP transmission-time helper to populate CONFIG3.

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-7-ankit.k.nautiyal@intel.com

drm/i915/display: Add helper for AS SDP transmission time selection

AS SDP may be transmitted at T1 or T2 depending on Panel Replay and
Adaptive Sync SDP configuration as per DP 2.1. Current we are using
T1 only, but future PR/AS SDP modes/features may require T2 or dynamic
selection.

Introduce a helper to return the appropriate AS SDP transmission time so
that a single value is consistently used for programming PR_ALPM.
For now this returns T1.

v2: Avoid adding new member to crtc_state; use a helper. (Ville)
v3: Clarify why AS SDP transmission time is fixed to T1. (Ville)
v4: Return u8 from intel_dp_as_sdp_transmission_time(). (Ville)

Bspec: 68920
Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-6-ankit.k.nautiyal@intel.com

drm/i915/psr: Write the PR config DPCDs in burst mode

Replace the consecutive single-byte writes to PANEL_REPLAY_CONFIG and
CONFIG2 with one drm_dp_dpcd_write() burst starting at PANEL_REPLAY_CONFIG,
reducing AUX transactions.

v2: Drop extra conditions, and optimize variables. (Ville)
v3: Drop the error check after write. (Ville)

Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-5-ankit.k.nautiyal@intel.com

drm/i915/dp: Allow AS SDP only if v2 is supported

We do not support AS SDP version 1, so allow AS SDP only if AS SDP v2 is
supported.

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-4-ankit.k.nautiyal@intel.com

drm/i915/dp: Add member to intel_dp to store AS SDP v2 support

eDP v1.5a advertises support for Adaptive Sync SDP and with that the
support for AS SDP v2 is mandatory.

DP v2.1 SCR advertises support for FAVT payload fields parsing in DPCD
0x2214 Bit 2. This indicates the support for Adaptive-Sync SDP version 2
(AS SDP v2), which allows the source to set the version in HB2[4:0] and the
payload length in HB3[5:0] of the AS SDP header.

DP v2.1 SCR also introduces ASYNC_VIDEO_TIMING_NOT_SUPPORTED_IN_PR in the
Panel Replay Capability DPCD 0x00b1 (Bit 3). When this bit is set, the sink
does not support asynchronous video timing while in a Panel Replay Active
state and the source is required to keep transmitting Adaptive-Sync
SDPs. The spec mandates that such sinks shall support AS SDP v2.

Infer AS SDP v2 support from these capabilities and store it in
struct intel_dp for use by subsequent feature enablement changes.

v2:
- Include parsing ASYNC_VIDEO_TIMING_NOT_SUPPORTED_IN_PR bit to
determine AS SDP v2 support. (Ville)
v3:
- Use helper to determine asynch video timing support.
v4:
- Add AS SDP v2 support for eDP as per v1.5a.
- Add a check for Panel Replay support before checking for Async video
timing support in PR
- Add a TODO for Display ID and PCON considerations. (Ville)

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-3-ankit.k.nautiyal@intel.com

drm/i915/psr: Add helper to get Async Video timing support in PR active

Introduce a helper to check if Panel Replay has Async Video Timing support
during PR Active state.

v2: Confirm that Panel Replay is supported before checking for
Async Video Timing Support during PR active. (Ville)

Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260527041050.601735-2-ankit.k.nautiyal@intel.com

platform/x86: lenovo-wmi-capdata: Add debugfs file for dumping capdata

The Lenovo GameZone/Other interfaces have some delicate divergences
among different devices. When making a bug report or adding support for
new devices/interfaces, capdata is the most important information to
cross-check with.

Add a debugfs file (lenovo_wmi/<device_name>/capdata), so that users can
dump capdata and include it in their reports.

Since `struct capdata01' is just an extension to `struct capdata00',
also convert the former to include the latter anonymously
(-fms-extensions, since v6.19). This is declared as a union in the
capdata01 struct, with both the anonymous declaration and as a named
member to avoid type casting when passing just the capdata00 struct
pointer.

Tested-by: Kurt Borja <kuurtb@gmail.com>
Signed-off-by: Rong Zhang <i@rong.moe>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-8-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: lenovo-wmi-helpers: Add helper for creating per-device debugfs dir

We are about to add debugfs support for lenovo-wmi-capdata. Let's setup
a debugfs directory called "lenovo_wmi" for tidiness, so that any
lenovo-wmi-* device can put its subdirectory under the directory.
Subdirectories will be named after the corresponding WMI devices.

Signed-off-by: Rong Zhang <i@rong.moe>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-7-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: lenovo-wmi-other: Add force_load_psy_ext module parameter

Some Lenovo BIOS have been shown to have incomplete and/or broken
capability data and WMI attribute IDs. In some cases the capability data
reports that a feature is not supported when the get/set methods are
fully implemented. It is also possible that the ACPI methods from the
ideapad_laptop driver we defer to could be bugged while the WMI method
is fully working. To aid end users in submitting more complete bug
reports in these situations, add an override to skip the ACPI and
compatibility checks to force load the power supply extension as if it
is fully supported and has no conflicts.

Reviewed-by: Rong Zhang <i@rong.moe>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-6-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: lenovo-wmi-other: Add WMI battery charge limiting

Add charge_behaviour and charge_types attributes through a power supply
extension for devices that support WMI based charge enable & disable.

Lenovo Legion devices that implement WMI function and capdata ID
0x03010001 in their BIOS are able to enable or disable charging at 80%
through the lenovo-wmi-other interface. Add a charge_types attribute for
BATX devices to expose this capability with Standard and Long_Life types
enabled.

Additionally, devices that support WMI function and capdata ID 0x03020000
are able to force discharge of the battery. Expose this capability with
a charge_behaviour attribute in the power supply extension, with the auto
and force-discharge behaviors enabled. The GET method for this attribute
is bugged. After analyzing the DSDT, and some testing, it appears the
method grabs bit(3) instead of bit(4) from the EC register that stores the
current status, and will only report if charging has been inhibited or
not. To work around this, store and report the last setting written to the
attribute.

As some devices only expose one attribute or the other, a bitmask is
added with a lookup table and some helper macros to select the correct
configuration for the hardware at runtime.

The ideapad_laptop driver provides the charge_types attribute to provide
similar functionality. When the WMI method is set this can corrupt the
ACPI method return and cause hardware and driver errors. To avoid
conflicts between the drivers, we get the acpi_handle and do the same
check that ideapad_laptop does when it enables the feature. If the
feature is supported in ideapad_laptop, abort adding the extension from
lenovo-wmi-other. The ACPI method is more reliable when both are
present from my testing, so we can prefer that implementation and do
not need to worry about de-conflicting from inside that driver.

Reviewed-by: Rong Zhang <i@rong.moe>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-5-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: lenovo-wmi-other: Rename LWMI_OM_FW_ATTR_BASE_PATH

In the next patch a power supply extension is added which requires
a name attribute. Instead of creating another const macro with the
same information, rename LWMI_OM_FW_ATTR_BASE_PATH to
LWMI_OM_SYSFS_NAME.

Reviewed-by: Rong Zhang <i@rong.moe>
Tested-by: Rong Zhang <i@rong.moe>
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-4-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: lenovo-wmi-other: Add GPU tunable attributes

Use an enum for all GPU attribute feature ID's and add GPU attributes.

Reviewed-by: Rong Zhang <i@rong.moe>
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-3-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: lenovo-wmi-other: Add missing CPU tunable attributes

Use an enum for all device ID's and CPU attribute feature ID's,
add missing CPU attributes.

Reviewed-by: Rong Zhang <i@rong.moe>
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com>
Link: https://patch.msgid.link/20260520060740.119554-2-derekjohn.clark@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

KVM: selftests: Enable pre_fault_memory_test for s390

Enable the pre_fault_memory_test to run on s390.

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260527144358.186359-6-imbrenda@linux.ibm.com>

KVM: selftests: Fix pre_fault_memory_test to run on s390

Add a missing #include <ucall_common.h> which is needed and otherwise
not included on s390.

Remove the assertion vcpu->run->exit_reason == KVM_EXIT_IO since it
is x86-specific and redundant anyway.

Acked-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260527144358.186359-5-imbrenda@linux.ibm.com>

KVM: s390: Update KVM_PRE_FAULT_MEMORY API documentation

Update the API documentation for KVM_PRE_FAULT_MEMORY to account for
its s390 implementation.

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260527144358.186359-4-imbrenda@linux.ibm.com>

KVM: s390: Implement KVM_PRE_FAULT_MEMORY

Implement and enable the KVM_PRE_FAULT_MEMORY ioctl for s390.

Faulted-in pages will be marked as accessed, unlike x86, otherwise they
will trigger a minor fault when accessed. Avoiding such faults is one of
the points of KVM_PRE_FAULT_MEMORY.

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260527144358.186359-3-imbrenda@linux.ibm.com>

KVM: s390: Track page size in struct guest_fault

Until now, the members of struct guest_fault are always accessed while
holding the required locks, and thus the ptep and crstep pointers can
be dereferenced safely.

There will be some new cases where callers of kvm_s390_faultin_gfn()
need to know the size of the page used to solve the fault, at which
point no locks are held anymore, and dereferencing the crstep field
is not possible.

Introduce a new crste_region3 flag for struct guest_fault to indicate
whether the crstep used to solve the fault was a region 3 entry with FC=1
(large pud).

This allows to disambiguate all three possible scenarios:
* If ptep is not NULL, the fault was solved with a pte.
* If ptep is NULL and crste_region3 is 0, a segment entry with FC=1
(large pmd) was used.
* If ptep is NULL and crste_region3 is 1, a region 3 entry with FC=1
(large pud) was used.

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260527144358.186359-2-imbrenda@linux.ibm.com>

Merge branch 'hibmcge-fix-rx-packet-corruption-issue'

Jijie Shao says:

====================
hibmcge: fix RX packet corruption issue

This series fixes an RX packet corruption issue observed when SMMU is
disabled on the hibmcge driver. The fixes include disabling PCI Relaxed
Ordering and correcting the order of DMA barrier operations in the RX
data sync path.
====================

Link: https://patch.msgid.link/20260525144525.94884-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Documentation/rtla: Add -A/--aligned option

Cover the newly added -A/--aligned option that aligns timerlat threads
using the corresponding feature of the timerlat tracer.

A note is added to clarify what alignment means, similar to the note in
the tracer implementation in commit 4245bf4dc58f ("tracing/osnoise: Add
option to align tlat threads").

Link: https://lore.kernel.org/r/20260527144928.2944472-3-tglozar@redhat.com
[ remove spurious newline ]
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/tests: Add unit tests for -A/--aligned option

Add both parse_args() and opt_* tests for the newly added -A/--aligned
option.

Assisted-by: Claude:claude-4.5-opus-high-thinking
Link: https://lore.kernel.org/r/20260527144928.2944472-2-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/timerlat: Add -A/--aligned CLI option

Add a new option, -A/--aligned, that enables timerlat thread alignment
implemented on the kernel-side in commit 4245bf4dc58f ("tracing/osnoise:
Add option to align tlat threads"). The option takes an argument,
representing alignment between timerlat threads in microseconds.

The feature is modeled after the option of the same name in the
cyclictest tool.

Link: https://lore.kernel.org/r/20260527144928.2944472-1-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/tests: Add unit tests for CLI option callbacks

In addition to testing all tool_parse_args() functions, test also all
callbacks used for parsing custom option formats.

The callbacks represent a middle layer between the parsing functions
and utility functions dedicated to checking specific argument formats,
for example, scheduling class and duration. Callback tests are run
before parsing functions to make sure any issue in the former is
reported before it is encountered through the latter.

Tests verify both successful parsing and proper rejection of invalid
inputs (via exit tests). To enable testing static callbacks, a pragma
once guard is added to timerlat.h for safe inclusion by cli_p.h.

Add dependency of UNIT_TESTS_IN on LIBSUBCMD_INCLUDES, as the new test
file tests/unit/cli_opt_callback.c includes cli_p.h which includes
subcmd/parse-options.h.

Link: https://lore.kernel.org/r/20260528103254.2990068-7-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/tests: Add unit tests for _parse_args() functions

Add a test suite for the _parse_args() function of each tool that checks
the params structures (struct common_params, struct osnoise_params,
struct timerlat_params) returned by them for correctness.

One test case is added per option, as well as a few special cases for
tricky combinations of options. Test cases are ordered the same as the
option arrays and help message to allow easy checking of whether all
options are covered.

This should help clarify what the proper command line behavior of RTLA
is in case there are holes in the documentation and verify that the
intended behavior is implemented correctly.

A few necessary changes to the unit tests were done as part of this
commit:

- Unit tests now also link to libsubcmd and its dependencies.
- A new global variable in_unit_test is added to RTLA's CLI interface,
  causing it to skip check for root if running in unit tests. This
  allows the CLI unit tests to run as non-root, like existing unit
  tests.

There is quite a lot of duplication, some of it is mitigated with macros,
but partially it is intentional so that future changes in behavior are
tracked across tools.

Link: https://lore.kernel.org/r/20260528103254.2990068-6-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla: Parse cmdline using libsubcmd

Instead of using getopt_long() directly to parse the command line
arguments given to an RTLA tool, use libsubcmd's parse_options().

Utilizing libsubcmd for parsing command line arguments has several
benefits:

- A help message is automatically generated by libsubcmd from the
  specification, removing the need of writing it by hand.
- Options are sorted into groups based on which part of tracing (CPU,
  thread, auto-analysis, tuning, histogram) they relate to.
- Common parsing patterns for numerical and boolean values now share
  code, with the target variable being stored in the option array.

To avoid duplication of the option parsing logic, RTLA-specific
macros defining struct option values are created:

- RTLA_OPT_* for options common to all tools
- OSNOISE_OPT_* and TIMERLAT_OPT_* for options specific to
  osnoise/timerlat tools
- HIST_OPT_* macros for options specific to histogram-based tools.

Individual *_parse_args() functions then construct an array out of
these macros that is then passed to libsubcmd's parse_options().

All code specific to command line options parsing is moved out of the
individual tool files into a new file, cli.c, which also contains the
contents of the rtla.c file. A private header, cli_p.h, is added
alongside the public header cli.h, so that unit tests are able to test
statically declared option callbacks.

Minor changes:

- The return value of tool-level help option changes to 129, as this is
  the value set by libsubcmd; this is reflected in affected test cases.
  The implementation of help for command-level and tracer-level help
  is set to 129 as well for consistency, and the change is reflected in
  exit value documentation.
- Related to the above, {rtla,osnoise,timerlat}_usage() are marked
  __noreturn and exit() is removed from after they are called for
  cleaner code.
- The error messages for invalid argument for options --dma-latency and
  -E/--entries were corrected, fixing off-by-one in the limits.

Note that unsetting options (using --no-<opt> syntax) is currently not
implemented for options that use custom callbacks. For --irq and
--thread, it will never be implemented, as they conflict with already
existing --no-irq and --no-thread with a different meaning.

Assisted-by: Composer:composer-1.5
Link: https://lore.kernel.org/r/20260528103254.2990068-5-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

tools subcmd: allow parsing distinct --opt and --no-opt

libsubcmd automatically generates for every option --opt an equivalent
negated option, --no-opt, to unset the option. Vice versa, for every
option declared as --no-opt, a shorthand --opt is declared for
convenience.

Add a flag, PARSE_OPT_NOAUTONEG, to disable this behavior. This new flag
behaves similarly to the already existing PARSE_OPT_NONEG, only it does
not reject the --no-opt variant, but leaves it undefined. That is useful
when there is a conflicting distinct --no-opt option in the syntax of
the tool.

PARSE_OPT_NOAUTONEG is enabled per-option, allowing to unset other
options that do not have this conflict.

Link: https://lore.kernel.org/r/20260528103254.2990068-4-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

tools subcmd: support optarg as separate argument

In addition to "-ovalue" and "--opt=value" syntax, allow also "-o value"
and "--opt value" for options with optional argument when the newly
added PARSE_OPT_OPTARG_ALLOW_NEXT flag is set.

This behavior is turned off by default since it does not make sense for
tools using non-option command line arguments. Consider the ambiguity
of "cmd -d x", where "-d x" can mean either "-d with argument of x" or
"-d without argument, followed by non-option argument x". This is not an
issue in the case that the tool takes no non-option arguments.

To implement this, a new local variable, force_defval, is created in
get_value(), along with a comment explaining the logic.

Link: https://lore.kernel.org/r/20260528103254.2990068-3-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla: Add libsubcmd dependency

In preparation for migrating RTLA to libsubcmd, build libsubcmd from the
appropriate directory next to the RTLA build proper, and link the
resulting object to RTLA.

libsubcmd uses str_error_r() and strlcpy() at several places. To support
these, also link the respective libraries from tools/lib.

For completeness, also add tools/include to include path. This will
allow other userspace functions and macros shipped with the kernel to be
used in RTLA; perf and bpftool, two other users of libsubcmd, already do
that.

To prevent a name conflict, rename RTLA's run_command() function to
run_tool_command(), and replace RTLA's own container_of implementation
with the one in tools/include/linux/container_of.h.

Assisted-by: Composer:composer-1
Link: https://lore.kernel.org/r/20260528103254.2990068-2-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/tests: Add runtime tests for restoring continue flag

In case an action preceding the continue action fails, not only
the continue flag should not be set, it should be unset if it was set
from a previous run of actions_perform().

Add a runtime test to both osnoise and timerlat tools that checks that
this works properly by creating a temporary file.

Link: https://lore.kernel.org/r/20260526102523.2662391-4-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/tests: Run runtime tests in temporary directory

Create a temporary directory before each test case to serve as working
directory during the duration of the test.

This prevents littering of the original working directory as well as
allows tests to use it to avoid path conflicts.

In order not to break already existing tests, also add a new "testdir"
variable containing the directory where the test file is located. This
is then used to locate artifacts used during testing like BPF programs
and scripts for checking the tracer threads.

Link: https://lore.kernel.org/r/20260526102523.2662391-3-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/tests: Add unit test for restoring continue flag

In case an action preceding the continue action fails, not only
the continue flag should not be set, it should be unset if it was set
from a previous run of actions_perform().

Add a unit test to check if this is implemented correctly.

Link: https://lore.kernel.org/r/20260526102523.2662391-2-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

rtla/actions: Restore continue flag in actions_perform()

Currently, actions_perform() only ever sets the continue flag (when
performing the continue action), but never resets it. That leads to
RTLA continuing tracing even if the continue action was not performed in
the current iteration.

For example, the following command:

$ rtla timerlat hist -T 100 --on-threshold shell,command='
    echo Spike!
    if [ -f /tmp/a ]
    then
      exit 1
    else
      touch /tmp/a
    fi' --on-threshold continue

should print Spike! at most once, because after hitting the threshold
for the first time, /tmp/a exists, the shell action will fail, and the
continue action is not performed. However, unless /tmp/a exists before
the measurement, it will print Spike! until stopped, as the continue
flag stays set.

Set the continue flag to false in the beginning of actions_perform() to
make RTLA continue only if the action was actually performed.

Fixes: 8d933d5c89e8 ("rtla/timerlat: Add continue action")
Link: https://lore.kernel.org/r/20260526102523.2662391-1-tglozar@redhat.com
[ correct Fixes tag to include 12 characters of hash ]
Signed-off-by: Tomas Glozar <tglozar@redhat.com>

net: hibmcge: move dma_rmb() after dma_sync_single_for_cpu() in RX path

The dma_rmb() barrier was placed before dma_sync_single_for_cpu(), which
is incorrect. DMA sync must complete first to make the buffer accessible
to the CPU, then the rmb barrier ensures subsequent descriptor reads
observe the latest data written by the hardware.

Reorder the operations so dma_sync_single_for_cpu() is called before
dma_rmb() to guarantee the driver reads consistent data from the DMA
buffer.

Fixes: f72e25594061 ("net: hibmcge: Implement rx_poll function to receive packets")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260525144525.94884-3-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: hibmcge: disable Relaxed Ordering to fix RX packet corruption

When SMMU is disabled, the hibmcge driver may receive corrupted packets.
The hardware writes packet data and descriptors to the same page, but
with Relaxed Ordering enabled, PCI write transactions may not be
strictly ordered. This can cause the driver to observe a valid
descriptor before the corresponding packet data is fully written.

Fix this by clearing PCI_EXP_DEVCTL_RELAX_EN in the PCI bridge control
register to ensure strict write ordering between packet data and
descriptors.

Fixes: f72e25594061 ("net: hibmcge: Implement rx_poll function to receive packets")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260525144525.94884-2-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

platform/x86: dell-dw5826e: Add reset driver for DW5826e

If the DW5826e is in a frozen state and unable to receive USB commands,
this driver provides a method for the user to reset the DW5826e via ACPI.

E.g: echo 1 > /sys/bus/platform/devices/PALC0001\:00/wwan_reset

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Armin Wolf <W_Armin@gmx.de>
Signed-off-by: Jack Wu <jackbb_wu@compal.com>
Link: https://patch.msgid.link/20260526-dell-reset-v8-v8-1-d3a29cb4cf2f@compal.com
[ij: removed default m]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

Merge branch 'net-sched-fix-packet-loops-in-mirred-and-netem'

Jamal Hadi Salim says:

====================
net/sched: Fix packet loops in mirred and netem

This patchset adds a 2-bit per-skb tc_depth counter that travels with
the packet. The existing per-CPU mirred nest tracking loses state
when a packet is deferred through the backlog or moves between CPUs
via XPS/RPS. A per-skb field covers both cases.

Patch 1 adds the tc_depth field in a padding hole in sk_buff.
Patches 2-3 revert the check_netem_in_tree() fix and its tests,
which broke legitimate multi-netem configurations.
Patch 4 uses tc_depth to stop netem duplicate recursion.
Patch 5 uses tc_depth to catch mirred ingress redirect loops.
Patch 6 fixes the infinite loop in the mirred egress blockcast case.
Patch 7 fixes drop stats in early return error scenarios in tcf_mirred_act
for redirect (caught by Sashiko [1]).
Patches 8-9 add mirred and netem test cases.

[1] https://sashiko.dev/#/patchset/20260413082027.2244884-1-hxzene%40gmail.com
====================

Link: https://patch.msgid.link/20260525122556.973584-1-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/tc-testing: Add netem test case exercising loops

Add a netem nested duplicate test case to validate that it won't
cause an infinite loop

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-10-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/tc-testing: Add mirred test cases exercising loops

Add mirred loop test cases to validate that those will be caught and other
test cases that were previously misinterpreted as loops by mirred.

This commit adds 12 test cases:

- Redirect multiport: dummy egress -> dev1 ingress -> dummy egress (Loop)
- Redirect singleport: dev1 ingress -> dev1 egress -> dev1 ingress (Loop)
- Redirect multiport: dev1 ingress -> dummy ingress -> dev1 egress (No Loop)
- Redirect multiport: dev1 ingress -> dummy ingress -> dev1 ingress (Loop)
- Redirect multiport: dev1 ingress -> dummy egress -> dev1 ingress (Loop)
- Redirect multiport: dummy egress -> dev1 ingress -> dummy egress, different prios (Loop)
- Redirect multiport: dev1 ingress -> dummy ingress -> dummy egress -> dev1 egress (No Loop)
- Redirect multiport: dev1 ingress -> dummy egress -> dev1 egress (No Loop)
- Redirect multiport: dev1 ingress -> dummy egress -> dummy ingress (No Loop)
- Redirect singleport: dev1 ingress -> dev1 ingress (Loop)
- Redirect singleport: dummy egress -> dummy ingress (No Loop)
- Redirect multiport: dev1 ingress -> dummy ingress -> dummy egress (No Loop)

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-9-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: act_mirred: Fix return code in early mirred redirect error paths

Since retval is set as TC_ACT_STOLEN in the mirred redirect case, returning
retval in cases where redirect failed will make the callers not register
the skb as being dropped.

Fix this by returning TC_ACT_SHOT instead in such scenarios.

Fixes: 16085e48cb48 ("net/sched: act_mirred: Create function tcf_mirred_to_dev and improve readability")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260413082027.2244884-1-hxzene%40gmail.com
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-8-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow

tcf_mirred_act() checks sched_mirred_nest against MIRRED_NEST_LIMIT (4)
to prevent deep recursion.  However, when the action uses blockcast
(tcfm_blockid != 0), the function returns at the tcf_blockcast() call
BEFORE reaching the counter increment.  As a result, the recursion
counter never advances and the limit check is entirely bypassed.

When two devices share a TC egress block with a mirred blockcast rule,
a packet egressing on device A is mirrored to device B via blockcast;
device B's egress TC re-enters tcf_mirred_act() via blockcast and
mirrors back to A, creating an unbounded recursion loop:

  tcf_mirred_act -> tcf_blockcast -> tcf_mirred_to_dev -> dev_queue_xmit
  -> sch_handle_egress -> tcf_classify -> tcf_mirred_act -> (repeat)

This recursion continues until the kernel stack overflows.

The bug is reachable from an unprivileged user via
unshare(CLONE_NEWUSER | CLONE_NEWNET): user namespaces grant
CAP_NET_ADMIN in the new network namespace, which is sufficient to
create dummy devices, attach clsact qdiscs with shared blocks, and
install mirred blockcast filters.

BUG: TASK stack guard page was hit at ffffc90000b7fff8
Oops: stack guard page: 0000 [#1] SMP KASAN NOPTI
CPU: 2 UID: 1000 PID: 169 Comm: poc Not tainted 7.0.0-rc7-next-20260410
RIP: 0010:xas_find+0x17/0x480
Call Trace:
  xa_find+0x17b/0x1d0
  tcf_mirred_act+0x640/0x1060
  tcf_action_exec+0x400/0x530
  basic_classify+0x128/0x1d0
  tcf_classify+0xd83/0x1150
  tc_run+0x328/0x620
  __dev_queue_xmit+0x797/0x3100
  tcf_mirred_to_dev+0x7b1/0xf70
  tcf_mirred_act+0x68a/0x1060
  [repeating ~30+ times until stack overflow]
Kernel panic - not syncing: Fatal exception in interrupt

Fix this by incrementing sched_mirred_nest before calling
tcf_blockcast() and decrementing it on return, mirroring the
non-blockcast path.  This ensures subsequent recursive entries see the
updated counter and are correctly limited by MIRRED_NEST_LIMIT.

Fixes: fe946a751d9b ("net/sched: act_mirred: add loop detection")
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
Link: https://patch.msgid.link/20260525122556.973584-7-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop

When mirred redirects to ingress (from either ingress or egress) the loop
state from sched_mirred_dev array dev is lost because of 1) the packet
deferral into the backlog and 2) the fact the sched_mirred_dev array is
cleared. In such cases, if there was a loop we won't discover it.

Here's a simple test to reproduce:
ip a add dev port0 10.10.10.11/24

tc qdisc add dev port0 clsact
tc filter add dev port0 egress protocol ip \
prio 10 matchall action mirred ingress redirect dev port1

tc qdisc add dev port1 clsact
tc filter add dev port1 ingress protocol ip \
prio 10 matchall action mirred egress redirect dev port0

ping -c 1 -W0.01 10.10.10.10

Fixes: fe946a751d9b ("net/sched: act_mirred: add loop detection")
Tested-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-6-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: fix packet loop on netem when duplicate is on

When netem duplicates a packet it re-enqueues the copy at the root qdisc.
If another netem sits in the tree the copy can be duplicated
again, recursing until the stack or memory is exhausted.

The original duplication guard temporarily zeroed q->duplicate around
the re-enqueue, but that does not cover all cases because it is
per-qdisc state shared across all concurrent enqueue paths
and is not safe without additional locking.

Use the skb tc_depth field introduced in an earlier patch:
- increment it on the duplicate before re-enqueue
- skip duplication for any skb whose tc_depth is already non-zero.

This marks the packet itself rather than mutating qdisc state,
therefore it is safe regardless of tree topology or concurrency.

Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Reported-by: William Liu <will@willsroot.io>
Reported-by: Savino Dicanosa <savy@syst3mfailure.io>
Closes: https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: William Liu <will@willsroot.io>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-5-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Revert "selftests/tc-testing: Add tests for restrictions on netem duplication"

This reverts commit ecdec65ec78d67d3ebd17edc88b88312054abe0d.

The tests added were related to check_netem_in_tree() which was
just reverted in the previous patch.

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-4-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: Revert "net/sched: Restrict conditions for adding duplicating netems to qdisc tree"

This reverts commit ec8e0e3d7adef940cdf9475e2352c0680189d14e.

The original patch rejects any tree containing two netems when
either has duplication set, even when they sit on unrelated classes
of the same classful parent. That broke configurations that have
worked since netem was introduced.

The re-entrancy problem the original commit was trying to solve is
handled by later patch using tc_depth flag.

Doing this revert will (re)expose the original bug with multiple
netem duplication. When this patch is backported make sure
and get the full series.

Fixes: ec8e0e3d7ade ("net/sched: Restrict conditions for adding duplicating netems to qdisc tree")
Reported-by: Ji-Soo Chung <jschung2@proton.me>
Reported-by: Gerlinde <lrGerlinde@mailfence.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220774
Reported-by: zyc zyc <zyc199902@zohomail.cn>
Closes: https://lore.kernel.org/all/19adda5a1e2.12410b78222774.9191120410578703463@zohomail.cn/
Reported-by: Manas Ghandat <ghandatmanas@gmail.com>
Closes: https://lore.kernel.org/netdev/f69b2c8f-8325-4c2e-a011-6dbc089f30e4@gmail.com/
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-3-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: Introduce skb tc depth field to track packet loops

Add a 2-bit per-skb tc depth field to track packet loops across the stack.

The previous per-CPU loop counters like MIRRED_NEST_LIMIT
assume a single call stack and lose state in two cases:
1) When a packet is queued and reprocessed later (e.g., egress->ingress
   via backlog), the per-cpu state is gone by the time it is dequeued.
2) With XPS/RPS a packet may arrive on one CPU and be processed on
   another.

A per-skb field solves both by travelling with the packet itself.

The field fits in existing padding, using 2 bits that were previously a
hole:

pahole before(-) and after (+) diff looks like:
   __u8       slow_gro:1;           /*   132: 3  1 */
   __u8       csum_not_inet:1;      /*   132: 4  1 */
   __u8       unreadable:1;         /*   132: 5  1 */
+ __u8       tc_depth:2;           /*   132: 6  1 */

- /* XXX 2 bits hole, try to pack */
   /* XXX 1 byte hole, try to pack */

   __u16      tc_index;             /*   134     2 */

There used to be a ttl field which was removed as part of tc_verd in commit
aec745e2c520 ("net-tc: remove unused tc_verd fields").  It was already
unused by that time, due to remove earlier in commit c19ae86a510c ("tc: remove
unused redirect ttl").

The first user of this field is netem, which increments tc_depth on
duplicated packets before re-enqueueing them at the root qdisc.  On
re-entry, netem skips duplication for any skb with tc_depth already set,
bounding recursion to a single level regardless of tree topology.

The other user is mirred which increments it on each pass
and limits to depth to MIRRED_DEFER_LIMIT (3).

The new field was called ttl in earlier versions of this patch
but renamed to tc_depth to avoid confusion with IP ttl.

Note (looking at you Sashiko! Dont ignore me and continue bringing this up):
1. Since both mirred and netem utilize the same 2-bit tc_depth field it is
   possible when netem and mirred are used together that netem qdisc to skip
   the duplication step. This is a known trade-off, as a 2-bit field cannot
   independently track both features' recursion depths and it is not considered
   sane to have a setup that addresses both features on at the same time.

2. skb_scrub_packet does not clear tc_depth. This means a packet's loop history
  is preserved even across namespaces. While this might be restrictive for
  some topologies, it is also design intent to provide robustness against loops
  across namespaces.

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260525122556.973584-2-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'locking/context' into locking/core

platform/x86: uniwill-laptop: Enable battery charge modes on supported devices

Enable battery charge modes on supported TUXEDO devices by adding the
feature bit to the respective device descriptors.

Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Link: https://patch.msgid.link/20260512232145.329260-9-W_Armin@gmx.de
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: uniwill-laptop: Add support for battery charge modes

Many Uniwill-based devices do not supports the already existing
charge limit functionality, but instead support an alternative
interface for controlling the battery charge algorithm.

Add support for this interface and update the documentation.

Reviewed-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Link: https://patch.msgid.link/20260512232145.329260-8-W_Armin@gmx.de
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: uniwill-laptop: Mark EC_ADDR_OEM_4 as volatile

It turned out that EC_ADDR_OEM_4 also contains bits with a volatile
nature. Mark the whole register as volatile to prepare for the usage
of said bits. This also means that we now have to save/restore the
touchpad toggle state ourself.

Reviewed-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Link: https://patch.msgid.link/20260512232145.329260-7-W_Armin@gmx.de
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: uniwill-laptop: Rework FN lock/super key suspend handling

Currently the suspend handling for the FN lock and super key enable
features saves the whole values of the affected registers instead of
the individual feature state. This duplicates the register access
logic from the associated sysfs attributes.

Rework the suspend handling to reuse said register access logic and
only store the individual feature state as a boolean value.

Reviewed-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Link: https://patch.msgid.link/20260512232145.329260-6-W_Armin@gmx.de
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

seqlock: Allow UBSAN_ALIGNMENT to fail optimizing

With gcc-15 and gcc-16 with UBSAN_ALIGNMENT enabled the compiler fails to
inline and optimize __scoped_seqlock_bug() away on s390:

s390x-16.1.0-ld: kernel/sched/build_policy.o: in function `__scoped_seqlock_next':
/.../seqlock.h:1286:(.text+0x22030): undefined reference to `__scoped_seqlock_bug'

Fix this by adding UBSAN_ALIGNMENT to the list of config options where a
not inlined empty __scoped_seqlock_bug() is allowed.

Closes: https://lore.kernel.org/r/20260515092057.810542-1-arnd@kernel.org/
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260519110315.1385307-1-hca@linux.ibm.com

Merge branch 'fixes' into for-next

Merge uniwill driver fixes to the for-next branch to be able to continue
feature work there.