git.ipfire.org Git - thirdparty/kernel/linux.git/log

xfrm: ah: use skb_to_full_sk in async output callbacks

When AH output is offloaded to an asynchronous crypto provider
(hardware accelerators such as AMD CCP, or a forced-async software
shim used for testing), the digest completion fires
ah_output_done() / ah6_output_done() on a workqueue.  The egress
skb at that point may have been originated by a TCP listener
sending a SYN-ACK, which sets skb->sk to a request_sock via
skb_set_owner_edemux(); it may also have been originated by an
inet_timewait_sock retransmit.  Neither is a full struct sock, and
passing the raw skb->sk to xfrm_output_resume() then forwards a
non-full socket through the rest of the xfrm output chain.

xfrm_output_resume() and its downstream consumers expect a full
sk wherever they dereference it.  The natural egress path
through ah_output_done() does not crash today because the
consumers that read past sock_common are either gated by
sk_fullsock() or short-circuit on flags that are clear on a fresh
request_sock; an exhaustive walk of the 50 most plausible
consumers under sch_fq, dev_queue_xmit, netfilter, tc-egress and
cgroup-egress BPF found no current unguarded deref.  The bug is
still a real type confusion that future consumer changes could
turn into a memory-corruption primitive.

This is the same bug class fixed for ESP in commit 1620c88887b1
("xfrm: Fix the usage of skb->sk").  Apply the analogous fix to
AH: convert skb->sk to a full socket pointer (or NULL) via
skb_to_full_sk() before handing it to xfrm_output_resume().
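
As a rough illustration of the conversion described above, the following
userspace model maps a request socket to its listener and returns NULL
when no full socket is available.  All names (model_sock, model_state,
model_to_full_sk) are hypothetical stand-ins and not kernel API; mapping
the timewait case to NULL is an assumption drawn from the "(or NULL)"
wording above.

```c
#include <stddef.h>

/* Hypothetical userspace model; the real types live in the kernel. */
enum model_state {
	MODEL_ESTABLISHED,
	MODEL_LISTEN,
	MODEL_NEW_SYN_RECV,	/* request socket created for a SYN-ACK */
	MODEL_TIME_WAIT,	/* timewait mini-socket */
};

struct model_sock {
	enum model_state state;
	struct model_sock *listener;	/* only used for MODEL_NEW_SYN_RECV */
};

/* Return a full socket or NULL, never a request or timewait mini-socket. */
static struct model_sock *model_to_full_sk(struct model_sock *sk)
{
	if (!sk)
		return NULL;
	if (sk->state == MODEL_NEW_SYN_RECV)
		return sk->listener;	/* hand back the listening socket */
	if (sk->state == MODEL_TIME_WAIT)
		return NULL;		/* assumption: no full socket exists */
	return sk;
}
```

In this model the async callback would pass the result of
model_to_full_sk() onward instead of the raw pointer.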

The same async AH callbacks were touched recently for an
independent ESN-related ICV layout bug in commit ec54093e6a8f
("xfrm: ah: account for ESN high bits in async callbacks"); the
sk type-confusion addressed here is orthogonal.  This patch is
part of an ongoing audit of the AH callback paths; an ah_output
ihl-validation hardening series is also currently under review on
netdev.

Reproduced under UML + KASAN + lockdep with a forced-async
hmac(sha1) shim that registers at priority 9999 and wraps the
sync in-tree hmac-sha1-lib.  With the shim loaded, ah_output_done
runs on every SYN-ACK egress through a transport-mode AH SA and
skb->sk arrives as a request_sock (TCP_NEW_SYN_RECV); after this
patch, xfrm_output_resume() receives the listener (the result of
sk_to_full_sk()) and consumer derefs land on full-sock fields as
intended.

Fixes: 9ab1265d5231 ("xfrm: Use actual socket sk instead of skb socket for xfrm_output_resume")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

drm/xe/display: fix oops in suspend/shutdown without display

The xe driver keeps track of whether to probe display, and whether
display hardware is there, using xe->info.probe_display. It gets set to
false if there's no display after intel_display_device_probe(). However,
the display may also be disabled via fuses, detected at a later time in
intel_display_device_info_runtime_init().

In this case, the xe driver does for_each_intel_crtc() on uninitialized
mode config in xe_display_flush_cleanup_work(), leading to a NULL
pointer dereference, and generally calls display code with display info
cleared.

Check for intel_display_device_present() after
intel_display_device_info_runtime_init(), and reset
xe->info.probe_display as necessary. Also do unset_display_features()
for completeness, although display runtime init has already done
that. This will need to be unified across all cases later.

Move intel_display_device_info_runtime_init() call slightly earlier,
similar to i915, to avoid a bunch of unnecessary setup for no display
cases.

Note #1: The xe driver has no business doing low level display plumbing
like for_each_intel_crtc() to begin with. It all needs to happen in
display code.

Note #2: The actual bug is present already in commit 44e694958b95
("drm/xe/display: Implement display support"), but the oops was likely
introduced later at commit ddf6492e0e50 ("drm/xe/display: Make display
suspend/resume work on discrete").

Fixes: 44e694958b95 ("drm/xe/display: Implement display support")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7904
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/6150
Cc: stable@vger.kernel.org # v6.8+
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Link: https://patch.msgid.link/20260515160920.1082842-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>

riscv: dts: spacemit: k1-musepi-pro: set default console baud rate

Allow serial output with the same U-Boot/OpenSBI settings so the
console works without providing a cmdline.

Signed-off-by: Andre Heider <a.heider@gmail.com>
Reviewed-by: Yixun Lan <dlan@kernel.org>
Link: https://patch.msgid.link/20260513071958.29574-7-a.heider@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1-musepi-pro: enable PCIe ports

Enable the two PCIe controllers along with their associated PHYs. They
are routed to the M.2 M-key connector and to the PCIe slot.

Signed-off-by: Andre Heider <a.heider@gmail.com>
Link: https://patch.msgid.link/20260513071958.29574-6-a.heider@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1-musepi-pro: enable USB 3 ports

Enable the DWC3 USB 3.0 controller, its associated combo_phy (USB 3 PHY)
and usbphy2 (USB 2 PHY) on the MusePi Pro board.

The board uses a VLI VL817 hub, providing four ports.

Signed-off-by: Andre Heider <a.heider@gmail.com>
Link: https://patch.msgid.link/20260513071958.29574-5-a.heider@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1-musepi-pro: enable QSPI and add SPI NOR

Add the QSPI controller node and describe the attached SPI NOR flash
(Winbond W25Q64FWSSAQ).

Add a corresponding vendor flash partition layout.

Signed-off-by: Andre Heider <a.heider@gmail.com>
Link: https://patch.msgid.link/20260513071958.29574-4-a.heider@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1-musepi-pro: add 24c02 eeprom

Enable i2c2 and add the connected GT24C02B EEPROM.

It contains an ONIE TLV table:
=> tlv_eeprom
TLV: 0
[  12.162] TlvInfo Header:
[  12.162]    Id String:    TlvInfo
[  12.165]    Version:      1
[  12.168]    Total Length: 58
[  12.171] TLV Name             Code Len Value
[  12.175] -------------------- ---- --- -----
[  12.179] Product Name         0x21  16 k1-x_MUSE-Pi-Pro
[  12.184] Serial Number        0x23  17 BPMIMXXXXXXXXXXXX
[  12.189] Unknown              0x41   1  0x02
[  12.194] Base MAC Address     0x24   6 FE:FE:FE:XX:XX:XX
[  12.199] MAC Addresses        0x2A   2 2
[  12.203] CRC-32               0xFE   4 0x395ECD34
[  12.207] Checksum is valid.

(With 0x41 as TLV_CODE_DDR_CSNUM)

Signed-off-by: Andre Heider <a.heider@gmail.com>
Link: https://patch.msgid.link/20260513071958.29574-3-a.heider@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1-musepi-pro: add PMIC and power infrastructure

Enable i2c8 and add the connected SpacemiT P1 PMIC with its related
regulators for the board's power infrastructure and voltage regulation
support.

Signed-off-by: Andre Heider <a.heider@gmail.com>
Link: https://patch.msgid.link/20260513071958.29574-2-a.heider@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

ASoC: sigmadsp: Use flexible array for control cache

Allocate SigmaDSP controls with kzalloc_flex() for the trailing
cache data instead of open-coding the size calculation.

Annotate the cache array with its existing byte count so the allocation
helper can initialize the counter.
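
For readers unfamiliar with the pattern, here is a minimal userspace
sketch of a structure with a trailing flexible array and an open-coded
size calculation of the kind the helper replaces.  It uses calloc()
rather than any kernel allocator, and the names ctrl_model and
ctrl_model_alloc are hypothetical; the counter annotation mentioned
above is not modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a control with trailing cache bytes. */
struct ctrl_model {
	size_t num_bytes;	/* number of valid bytes in cache[] */
	uint8_t cache[];	/* flexible array member */
};

/* Open-coded allocation: header plus num_bytes of zeroed cache. */
static struct ctrl_model *ctrl_model_alloc(size_t num_bytes)
{
	struct ctrl_model *c;

	/* Reject sizes whose total would overflow size_t. */
	if (num_bytes > SIZE_MAX - sizeof(*c))
		return NULL;

	c = calloc(1, sizeof(*c) + num_bytes);
	if (c)
		c->num_bytes = num_bytes;	/* counter set at allocation time */
	return c;
}
```

A dedicated helper folds the overflow check, the size calculation and
the counter initialization into one call.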

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260511230351.28868-1-rosenp@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: pcm6240: Use flexible array for config blocks

Store the per-config block pointer table in the config allocation
instead of allocating it separately.

This ties the table to the config object lifetime and removes the
extra allocation and free path.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260511231313.31929-1-rosenp@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: acp: Add DMI quirk for ASUS Zenbook S16 UM5606GA

The ASUS Zenbook S16 (UM5606GA) with AMD Ryzen AI 9 465 (Strix Point,
ACP 7.0) has a BIOS that incorrectly sets the ACPI property
'acp-audio-config-flag' to 0x10 (FLAG_AMD_LEGACY_ONLY_DMIC) for the ACP
device. This prevents snd_pci_ps from probing the SoundWire bus, resulting
in no internal audio (dummy output only).

The hardware uses a Cirrus Logic CS42L43 (headphone/jack) and four CS35L56
smart amplifiers (speakers), all on SoundWire link 1. The corresponding
machine table entry (acp70_cs42l43_l1u0_cs35l56x4_l1u0123) already exists
in amd-acp70-acpi-match.c and correctly describes this topology.

Add a DMI quirk to override the flag to 0, consistent with the existing
entry for the HN7306EA.

Signed-off-by: Jasper Smet <josbeir@gmail.com>
Link: https://patch.msgid.link/20260513052137.56703-1-josbeir@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

spi: mtk-snfi: Fix resource leak in mtk_snand_read_page_cache()

When a DMA read times out in mtk_snand_read_page_cache(), the original
code jumps to a cleanup label that skips DMA unmapping and ECC disable,
causing a resource leak.

Fixes: 764f1b748164 ("spi: add driver for MTK SPI NAND Flash Interface")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260510-snfi-v1-1-bc375cf1af8e@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: acp-sdw-legacy: check CPU DAI name before logging

devm_kasprintf() can fail and return NULL. The legacy AMD SoundWire
machine driver logs cpus->dai_name before checking the allocation result.

Move the debug print after the NULL check, matching the ordering used by
the SOF AMD SoundWire path after commit 5726b68473f7 ("ASoC: amd/sdw_utils:
avoid NULL deref when devm_kasprintf() fails").

Fixes: 2981d9b0789c ("ASoC: amd: acp: add soundwire machine driver for legacy stack")
Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Link: https://patch.msgid.link/20260511-asoc-amd-acp-sdw-legacy-dai-name-null-v1-1-dc6151b6da8a@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

spi: rspi: Simplify reset control handling

Use devm_reset_control_get_optional_exclusive_deasserted() to combine
get + deassert + cleanup in a single call, removing the redundant
rspi_reset_control_assert() helper.

Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260507-rspi-v1-1-8cfa47cd56aa@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: qcom: q6apm-dai: Allocate an extra page for PCM buffers

Some old DSP firmware versions use 32-bit address arithmetic and size for
validating the PCM buffer address range. If a buffer is allocated near
the top of the 32-bit address space, arithmetic calculations involving
the end address can overflow and fail checks.

Work around this by increasing the preallocated PCM buffer size by one
page. The DSP is still passed the usable buffer size, excluding the extra
page, which prevents the firmware from seeing an end address that crosses
the 32-bit boundary.
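
The arithmetic can be sketched in userspace as follows.  The exact
firmware check is not known here, so fw_range_ok() is only an assumed
model of a 32-bit range validation, and MODEL_PAGE_SIZE, highest_base()
and the addresses used are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 0x1000u	/* assumed 4 KiB page */

/* Assumed firmware-style check: the 32-bit end address must not wrap. */
static bool fw_range_ok(uint32_t base, uint32_t size)
{
	uint32_t end = base + size;	/* wraps modulo 2^32 */

	return end > base;
}

/* Highest base at which an allocation of alloc_size ends at or below 2^32. */
static uint32_t highest_base(uint32_t alloc_size)
{
	return (uint32_t)(0x100000000ull - alloc_size);
}
```

With a 64 KiB buffer placed at highest_base(0x10000), base + size wraps
to 0 and the check fails.  If the allocation is one page larger while
the DSP is still told 0x10000, the worst-case base moves down by a page
and the end address stays below the 32-bit boundary.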

This was not hit before because PCM buffer allocation and DSP-side
mapping happened at different points, and the size mapped on the DSP was
usually nperiods * period_size. Therefore the mapped size was unlikely to
match the full preallocated buffer size exactly, although the issue was
still possible. With early buffer mapping on the DSP, the full
preallocated buffer is mapped during PCM creation, making the failure
reproducible at boot.

Fixes: 8ea6e25c8536 ("ASoC: qcom: q6apm: Add support for early buffer mapping on DSP")
Cc: Stable@vger.kernel.org
Reported-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
Closes: https://lore.kernel.org/all/7f10abbd-fb78-4c3a-ab90-7ca78239891a@oldschoolsolutions.biz/
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@oss.qualcomm.com>
Tested-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
Link: https://patch.msgid.link/20260514090607.2435484-1-srinivas.kandagatla@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>

riscv: dts: spacemit: k1-bananapi-f3: add SD card support with UHS modes

Add complete SD card controller support with UHS high-speed modes.

- Enable sdhci0 controller with 4-bit bus width
- Configure card detect GPIO with GPIO_ACTIVE_LOW and internal pull-up
  support
- Connect vmmc-supply to buck4 for 3.3V card power
- Connect vqmmc-supply to aldo1 for 1.8V/3.3V I/O switching
- Add dual pinctrl states for voltage-dependent pin configuration
- Support UHS-I SDR25, SDR50, and SDR104 modes
- Add stable MMC device aliases (mmc0 = eMMC, mmc1 = SD card)

This enables full SD card functionality including high-speed UHS modes
for improved performance.

Suggested-by: Anand Moon <linux.amoon@gmail.com>
Tested-by: Anand Moon <linux.amoon@gmail.com>
Tested-by: Margherita Milani <margherita.milani@amarulasolutions.com>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Iker Pedrosa <ikerpedrosam@gmail.com>
Link: https://patch.msgid.link/20260515-orangepi-sd-card-uhs-v10-3-094af27e310d@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1-orangepi-rv2: add SD card support with UHS modes

Add complete SD card controller support with UHS high-speed modes.

- Enable sdhci0 controller with 4-bit bus width
- Configure card detect GPIO with GPIO_ACTIVE_LOW logic
- Connect vmmc-supply to buck4 for 3.3V card power
- Connect vqmmc-supply to aldo1 for 1.8V/3.3V I/O switching
- Add dual pinctrl states for voltage-dependent pin configuration
- Support UHS-I SDR25, SDR50, and SDR104 modes
- Add stable MMC device aliases (mmc0 = eMMC, mmc1 = SD card)

This enables full SD card functionality including high-speed UHS modes
for improved performance.

Tested-by: Anand Moon <linux.amoon@gmail.com>
Tested-by: Trevor Gamblin <tgamblin@baylibre.com>
Tested-by: Michael Opdenacker <michael.opdenacker@rootcommit.com>
Tested-by: Vincent Legoll <legoll@online.fr>
Signed-off-by: Iker Pedrosa <ikerpedrosam@gmail.com>
Link: https://patch.msgid.link/20260515-orangepi-sd-card-uhs-v10-2-094af27e310d@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

riscv: dts: spacemit: k1: add SD card controller and pinctrl support

Add SD card controller infrastructure for SpacemiT K1 SoC with complete
pinctrl support for both standard and UHS modes.

- Add sdhci0 controller definition with clocks, resets and interrupts
- Add mmc1_cfg pinctrl for 3.3V standard SD operation
- Add mmc1_uhs_cfg pinctrl for 1.8V UHS high-speed operation
- Configure appropriate drive strength and power-source properties

This provides complete SD card infrastructure that K1-based boards can
enable.

Tested-by: Anand Moon <linux.amoon@gmail.com>
Tested-by: Trevor Gamblin <tgamblin@baylibre.com>
Tested-by: Vincent Legoll <legoll@online.fr>
Reviewed-by: Troy Mitchell <troy.mitchell@linux.dev>
Signed-off-by: Iker Pedrosa <ikerpedrosam@gmail.com>
Link: https://patch.msgid.link/20260515-orangepi-sd-card-uhs-v10-1-094af27e310d@gmail.com
Signed-off-by: Yixun Lan <dlan@kernel.org>

net: hsr: defer node table free until after RCU readers

HSR node-list and node-status generic-netlink operations run under
rcu_read_lock(). They walk hsr->node_db through hsr_get_next_node() and
hsr_get_node_data(), but RTM_DELLINK teardown removes the same node table
with plain list_del() and frees each node immediately.

That lets a generic-netlink reader hold a struct hsr_node pointer across
hsr_dellink(). In a KASAN build, widening the reader window after
hsr_get_next_node() obtains the node reproduces a slab-use-after-free
when the reader copies node->macaddress_A; the freeing stack is
hsr_del_nodes() from hsr_dellink().

Use list_del_rcu() and defer the free through the existing
hsr_free_node_rcu() callback. This matches the lifetime rule used by the
HSR prune paths, which already delete nodes with list_del_rcu() and
call_rcu().

Fixes: b9a1e627405d ("hsr: implement dellink to clean up resources")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260513233838.3064715-2-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: addrconf: bail out of dad_failure when state is no longer POSTDAD

addrconf_dad_failure() transitions ifp->state from DAD to POSTDAD
via addrconf_dad_end(), which drops ifp->lock on return.  The lock
is re-acquired after net_info_ratelimited().  A concurrent
ipv6_del_addr() can take the lock in that window, set ifp->state
to DEAD and run list_del_rcu(&ifp->if_list).

addrconf_dad_failure() then overwrites DEAD with ERRDAD at errdad:
and schedules a new dad_work.  The work calls ipv6_del_addr()
again, hitting the already-poisoned list entry:

  general protection fault: 0000 [#1] SMP NOPTI
  CPU: 4 PID: 217 Comm: kworker/4:1
  Workqueue: ipv6_addrconf addrconf_dad_work
  RIP: 0010:ipv6_del_addr+0xe9/0x280
  RAX: dead000000000122
  Call Trace:
   addrconf_dad_stop+0x113/0x140
   addrconf_dad_work+0x28c/0x430
   process_one_work+0x1eb/0x3b0
   worker_thread+0x4d/0x400
   kthread+0x104/0x140
   ret_from_fork+0x35/0x40

Fold the addrconf_dad_end() logic into addrconf_dad_failure() under
a single ifp->lock critical section.  The STABLE_PRIVACY branch
temporarily drops ifp->lock around address regeneration, so at
lock_errdad: verify the state is still POSTDAD before transitioning
to ERRDAD; bail out otherwise to avoid overwriting a state set by
another path while the lock was released.

Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing to workqueue")
Signed-off-by: Linmao Li <lilinmao@kylinos.cn>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260513025509.3776405-1-lilinmao@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sched-sch_htb-first-round-of-fixes'

Eric Dumazet says:

====================
net/sched: sch_htb: first round of fixes

First round of fixes in sch_htb.

I chose to send a small series, to reduce chances of multiple versions.
====================

Link: https://patch.msgid.link/20260514095935.3926276-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_htb: annotate data-races (III)

htb_dump_class_stats() will soon run locklessly
(no RTNL, no qdisc spinlock).

Add READ_ONCE()/WRITE_ONCE() annotations on these fields:

- cl->overlimits
- cl->drops
- cl->tokens
- cl->ctokens
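
A minimal userspace sketch of the annotation pattern follows.  The
MODEL_READ_ONCE()/MODEL_WRITE_ONCE() macros are simplified stand-ins
for the kernel macros (they rely on the GCC/Clang __typeof__ extension),
and struct cl_model with its field types is an assumption made for
illustration.

```c
#include <stdint.h>

/* Simplified stand-ins: force a single volatile access to the field. */
#define MODEL_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical subset of a class; field types are assumed. */
struct cl_model {
	uint32_t overlimits;
	uint32_t drops;
	int64_t tokens;
	int64_t ctokens;
};

/* Writer side: the updater still runs under the qdisc lock. */
static void cl_model_overlimit(struct cl_model *cl)
{
	MODEL_WRITE_ONCE(cl->overlimits, cl->overlimits + 1);
}

/* Lockless reader side: each field is loaded exactly once. */
static struct cl_model cl_model_snapshot(struct cl_model *cl)
{
	struct cl_model snap = {
		.overlimits = MODEL_READ_ONCE(cl->overlimits),
		.drops = MODEL_READ_ONCE(cl->drops),
		.tokens = MODEL_READ_ONCE(cl->tokens),
		.ctokens = MODEL_READ_ONCE(cl->ctokens),
	};

	return snap;
}
```

The annotations do not make the snapshot atomic across fields; they
only keep the compiler from tearing or repeating individual accesses.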

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260514095935.3926276-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_htb: annotate data-races (II)

htb_dump_class_stats() will soon run locklessly
(no RTNL, no qdisc spinlock).

Remove cl->xstats and replace it with two fields:

- xstats_lends
- xstats_borrows

Then use READ_ONCE()/WRITE_ONCE() annotations on them, and change
htb_dump_class_stats to use a private struct tc_htb_xstats.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260514095935.3926276-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_htb: annotate data-races (I)

htb_dump() runs without holding qdisc spinlock.

Add missing READ_ONCE()/WRITE_ONCE() annotations around
q->overlimits and q->direct_pkts.

Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260514095935.3926276-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: sch_htb: do not change sch->flags in htb_dump()

htb_dump() runs without holding qdisc spinlock.

It is illegal to touch sch->flags with a non-locked RMW,
as concurrent readers might see intermediate wrong values.

Set TCQ_F_OFFLOADED in control path (htb_init()) instead.

Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260514095935.3926276-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

btrfs: swallow btrfs_record_squota_delta() ENOENT

I thought that it was likely I could harden squota deletion to the point
that it was impossible to end up with an extent accounted to a qgroup
outliving its qgroup. Several recent bugs have made me re-consider that
position.

Ultimately, this is a tradeoff between short term stability and long
term strictness, but I think given that there could be another layer of
bugs behind the 2-3 I just fixed, I would feel much more confident in
people using squotas if the risk was "your values can get a bit out of
whack which you can fix by deleting stuff or
disabling/re-enabling/repairing" vs "it will abort your filesystem".

As the final nail in the coffin, the Meta production kernel was lacking
earlier fixes from me and Qu regarding subvol qgroup lifetime, so this
is what we have been testing at scale, so I think at least for now
upstream should have the same extra layer of protection.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: clamp to avoid squota underflow

Simple quota accounting can undercount metadata tree block allocations
in certain scenarios. When an undercounted subvolume is deleted and its
tree blocks freed, the free deltas decrement rfer/excl past zero,
wrapping the u64 to a value near U64_MAX.

Once wrapped, can_delete_squota_qgroup() sees non-zero rfer and refuses
to delete the qgroup. The qgroup becomes permanently orphaned in the
quota tree, since there is no subvolume left to generate frees that
would bring the counter back to zero.

While we ultimately want to fix any mis-accounting at the source, it is
also helpful and worthwhile to mitigate the damage by clamping rfer and
excl to zero on underflow rather than allowing the u64 to wrap. This at
least allows us to clean up the messed up qgroups on subvol deletion.
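
The clamping idea can be shown with a small userspace helper.  The name
sub_clamped() is hypothetical and is not the function used in the
patch.

```c
#include <stdint.h>

/* Subtract delta from counter, clamping at zero instead of wrapping. */
static uint64_t sub_clamped(uint64_t counter, uint64_t delta)
{
	return counter >= delta ? counter - delta : 0;
}
```

For example, freeing 16384 bytes from a counter that only recorded 4096
yields 0 with the clamp, whereas a plain u64 subtraction would wrap to
a value 12288 below 2^64.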

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix squota accounting during enable generation

The first transaction that enables squotas is special and a bit tricky.
We have to set BTRFS_FS_QUOTA_ENABLED after the transaction to avoid a
deadlock, so any delayed refs that run before we set the bit are not
squota accounted. For data this is fine, we don't get an owner_ref, so
there is no real harm, it's as if the extent predated squotas. However
for metadata, the tree block will have gen == enable_gen so when we free
it later, we will decrement the squota accounting, which can result in
an underflow. Before it is freed, btrfs check shows errors, as we have
mismatched usage between the node generations/owners and the squota
values.

There are two angles to this fix:

1. For extents that come in delayed_refs that run during the
   enable_gen transaction, we must actually set enable_gen to the *next*
   transaction. That is the first transaction that we can really
   properly account in any way.
2. For extents that come in between the end of our transaction handle
   and the time we set the BTRFS_FS_QUOTA_ENABLED bit, we need an
   additional bit, BTRFS_FS_SQUOTA_ENABLING which only affects recording
   squota deltas, so we do pick up those extents. Otherwise, we would
   miss them, even for enable_gen + 1.

Fixes: bd7c1ea3a302 ("btrfs: qgroup: check generation when recording simple quota delta")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: check for subvolume before deleting squota qgroup

The invariant that we want to maintain with subvolume qgroups is that
the qgroup can only be deleted if there is no root. With squotas, we
thought that it was sufficient to just check the usage, because we
assumed that deleting a subvolume will drive its qgroup's usage to 0,
and thus 0 usage implies no subvolume.

However, this is false, for two reasons:

- A subvol whose extents are all from before squotas were enabled.
- A subvol that was created in this transaction and for which we have
  not yet run any delayed refs.

In both cases, deleting the qgroup breaks the desired invariant and we
are left with a subvolume with no qgroup but squotas are enabled.

Fix this by unifying the deletion check logic between full qgroups and
squotas. Squotas do all the same checks *and* the additional usage == 0
check, which is the one extra rule peculiar to squotas.

Link: https://lore.kernel.org/linux-btrfs/adnBhWfJQ1n3hZC8@merlins.org/
Fixes: a8df35619948 ("btrfs: forbid deleting live subvol qgroup")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: always drop root->inodes lock before cond_resched()

find_first_inode() and find_first_inode_to_shrink() lock root->inodes,
then loop over them, occasionally skipping some inodes. When they skip
an inode, they attempt to share the cpu/lock with cond_resched_lock().

However, that has a subtle problem associated with it.
cond_resched_lock() only drops the lock if it needs to actually call
schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as
detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So
regardless of HZ, we will run for at least 2.8ms in this loop without
dropping the lock, assuming it finds no suitable inodes. If HZ is
small enough, it might be even worse as the tick granularity becomes
bigger than the timeslice.

The knock-on effect of this is that callers to
btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab
or userspace threads calling evict() will spin on xa_lock(&root->inodes)
for 2.8ms, so the extent map shrinker dominates the lock even though
ostensibly it is intending to share it. This produces memory pressure as
there is only one kswapd and it runs sequentially so it can get stuck in
the inode slab shrinking.

To fix it, simply replace cond_resched_lock() with an open coded variant
which unconditionally does unlock/lock around cond_resched. Sharing the
lock is decoupled from sharing the CPU, and all the users of the lock
now share it fairly.

I was able to reproduce this on test systems by producing a lot of empty
files (to make a big root->inodes xarray), then producing memory
pressure by reading files larger than RAM, triggering kswapd and
the extent_map shrinker. The lock contention is visible with perf or
lockstat. This patch also relieved a user-apparent bottleneck on a
production system from the original report.

Tested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: mark file extent range dirty after converting prealloc extents

When writing into a preallocated extent, ordered extent completion calls
btrfs_mark_extent_written() to convert the file extent item from the
BTRFS_FILE_EXTENT_PREALLOC type to the BTRFS_FILE_EXTENT_REG type.

If the preallocated extent was created beyond i_size with fallocate
keep-size, and the inode is evicted and loaded again before the write,
the inode's file_extent_tree is initialized only up to i_size.

The beyond i_size prealloc extent is therefore not tracked there.

After a write into that extent extends i_size, btrfs_mark_extent_written()
updates the file extent item, but the corresponding range is not marked
dirty in the inode's file_extent_tree.

This can leave disk_i_size stale when the filesystem does not use the
no-holes feature, so after remount the file size can go back to the old
value.

The following reproducer triggers the problem:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f -O ^no-holes $DEV
  mount $DEV $MNT

  touch $MNT/file
  fallocate -n -l 2M $MNT/file

  umount $MNT
  mount $DEV $MNT

  dd if=/dev/zero of=$MNT/file bs=1M count=1 conv=notrunc
  ls -lh $MNT/file

  umount $MNT
  mount $DEV $MNT

  ls -lh $MNT/file
  umount $MNT

Running the reproducer gives the following result:

  $ ./test.sh
  (...)
  1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000596024 s, 1.8 GB/s
  -rw-rw-r-- 1 root root 1.0M May  8 16:34 /mnt/sdi/file
  -rw-rw-r-- 1 root root 0 May  8 16:34 /mnt/sdi/file

Fix this by marking the written range dirty in the inode's
file_extent_tree after successfully converting the prealloc extent to a
regular extent.

Fixes: 9ddc959e802b ("btrfs: use the file extent tree infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
[ Minor change log updates ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

net/sched: sch_dualpi2: annotate data-races in dualpi2_dump_stats()

dualpi2_dump_stats() runs without qdisc lock held.

Add missing READ_ONCE()/WRITE_ONCE() annotations.

Fixes: d4de8bffbef4 ("sched: Dump configuration and statistics of dualpi2 qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vineet Agarwal <agarwal.vineet2006@gmail.com>
Link: https://patch.msgid.link/20260514114713.4134674-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

llc: avoid sparse cast-truncates warning in counter clamps

llc_conn_ac_inc_npta_value() and llc_conn_ac_inc_tx_win_size()
clamp their counters to the maximum valid 7-bit value via
(u8) ~LLC_2_SEQ_NBR_MODULO. LLC_2_SEQ_NBR_MODULO is defined as
((u8) 128) in include/net/llc_pdu.h, but the (u8) cast does not
prevent integer promotion of the operand of ~: ~128 is computed
as int (0xffffff7f), and the surrounding (u8) cast truncates
back to 0x7f. The result is correct (127), but the implicit
truncation is flagged by sparse:

  net/llc/llc_c_ac.c:1008:38: warning: cast truncates bits from
      constant value (ffffff7f becomes 7f)
  (and three more at lines 1009, 1099, 1100)

Replace the (u8) ~LLC_2_SEQ_NBR_MODULO expression with
LLC_2_SEQ_NBR_MODULO - 1, which evaluates to 127 directly and
silences sparse.

The same ~LLC_2_SEQ_NBR_MODULO pattern also appears in
include/net/llc_pdu.h:148 as part of PDU_GET_NEXT_Vr, but there
the result is immediately &-masked, so the int promotion is
harmless and sparse does not flag it; it is left alone.

This patch is the minimum diff to silence the warning. The
counter-clamp idiom itself could be modernized to
min_t(u8, ..., LLC_2_SEQ_NBR_MODULO - 1), but that is a
separate cleanup left for another patch.

No functional change.
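
The promotion behaviour described above can be checked with a small
userspace program fragment.  MODEL_SEQ_NBR_MODULO mirrors the quoted
definition using uint8_t; the function names are illustrative only.

```c
#include <stdint.h>

#define MODEL_SEQ_NBR_MODULO ((uint8_t)128)	/* mirrors ((u8) 128) */

/* The operand of ~ is promoted to int, so the result is ~128 as an int. */
static int promoted_complement(void)
{
	return ~MODEL_SEQ_NBR_MODULO;
}

/* Old idiom: the outer cast truncates the int back to 8 bits. */
static uint8_t old_clamp(void)
{
	return (uint8_t)~MODEL_SEQ_NBR_MODULO;
}

/* New idiom: no complement, no truncation of high bits. */
static uint8_t new_clamp(void)
{
	return MODEL_SEQ_NBR_MODULO - 1;
}
```

Both clamp values are 127; only the intermediate int value differs,
which is what sparse complains about.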

Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
Link: https://patch.msgid.link/20260513092253.3035961-1-avinash.duduskar@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: always declare __sock_wfree() and tcp_wfree()

Even if guarded by IS_ENABLED(CONFIG_INET), compilers need to know
what __sock_wfree() and tcp_wfree() are:

   include/net/sock.h:1861:63: note: each undeclared identifier is reported only once for each function it appears in
   include/net/sock.h:1862:63: error: 'tcp_wfree' undeclared (first use in this function); did you mean 'sock_wfree'?
    1862 |                (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);

Fixes: f0de88303d5e ("net: make is_skb_wmem() available to modules")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605141607.mDXnYFKY-lkp@intel.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260514095506.3919094-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/virtio: fix zerocopy completion for multi-skb sends

When a large message is fragmented into multiple skbs, the zerocopy
uarg is only allocated and attached to the last skb in the loop.
Non-final skbs carry pinned user pages with no completion tracking,
so the kernel has no way to notify userspace when those pages are safe
to reuse. If the loop breaks early the uarg is never allocated at all,
leaking pinned pages with no completion notification.

Fix this by following the approach used by TCP: allocate the zerocopy
uarg (if not provided by the caller) before the send loop and attach
it to every skb via skb_zcopy_set(), which takes a reference per skb.
Each skb's completion properly decrements the refcount, and the
notification only fires after the last skb is freed.
On failure, if no data was sent, the uarg is cleanly aborted via
net_zcopy_put_abort().
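
The reference counting described above can be modelled in userspace as
follows. This is a sketch under stated assumptions, not the vsock code:
struct zc_uarg, struct zc_skb and the zc_* helpers are hypothetical
names, the model assumes at least one fragment is sent, and the abort
path for "nothing sent" is not modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for the zerocopy completion context (the uarg). */
struct zc_uarg {
	int refcnt;		/* one per attached skb, plus the sender's */
	int notifications;	/* completions delivered to userspace */
};

/* Hypothetical stand-in for an skb carrying pinned user pages. */
struct zc_skb {
	struct zc_uarg *uarg;
};

/* Drop one reference; the notification fires only on the last put. */
static void zc_put(struct zc_uarg *uarg)
{
	if (--uarg->refcnt == 0)
		uarg->notifications++;
}

/* Attach the shared uarg to one skb, taking a reference for it. */
static void zc_attach(struct zc_skb *skb, struct zc_uarg *uarg)
{
	uarg->refcnt++;
	skb->uarg = uarg;
}

/* Freeing an skb releases the reference it held. */
static void zc_free_skb(struct zc_skb *skb)
{
	if (skb->uarg)
		zc_put(skb->uarg);
	skb->uarg = NULL;
}

/* Set up the uarg before the loop and attach it to every fragment. */
static void zc_send(struct zc_uarg *uarg, struct zc_skb *skbs, size_t n)
{
	size_t i;

	uarg->refcnt = 1;	/* sender's own reference */
	uarg->notifications = 0;
	for (i = 0; i < n; i++)
		zc_attach(&skbs[i], uarg);
	zc_put(uarg);		/* sender is done; skbs keep it alive */
}
```

With three fragments, the notification counter stays at zero until the
third zc_free_skb() call, which mirrors the "only fires after the last
skb is freed" behaviour.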

This issue was initially discovered by sashiko while reviewing commit
1cb36e252211 ("vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting")
but was pre-existing.

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Closes: https://sashiko.dev/#/patchset/20260420132051.217589-1-sgarzare%40redhat.com
Reported-by: Maher Azzouzi <maherazz04@gmail.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Link: https://patch.msgid.link/20260514092948.268720-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hsr: reject unresolved interlink ifindex

In hsr_newlink(), a provided but invalid IFLA_HSR_INTERLINK attribute
was silently ignored if __dev_get_by_index() returned NULL. This leads
to incorrect RedBox topology creation without notifying the user.

Fix this by returning -EINVAL and an extack message when the
interlink attribute is present but cannot be resolved.
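
The intended behaviour can be sketched in a userspace model. All names
here (struct model_dev, model_dev_get(), model_resolve_interlink()) and
the error string are hypothetical; model_dev_get() merely stands in for
the index lookup, and the const char pointer stands in for extack.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a net device known to the namespace. */
struct model_dev {
	int ifindex;
};

/* Stand-in for the index lookup: returns NULL when nothing matches. */
static struct model_dev *model_dev_get(struct model_dev *devs, size_t n,
				       int ifindex)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/*
 * interlink_attr is NULL when the attribute was not supplied.
 * A supplied but unresolvable index is an error, not a silent no-op.
 */
static int model_resolve_interlink(struct model_dev *devs, size_t n,
				   const int *interlink_attr,
				   struct model_dev **out,
				   const char **errmsg)
{
	*out = NULL;
	*errmsg = NULL;

	if (!interlink_attr)
		return 0;	/* no interlink requested */

	*out = model_dev_get(devs, n, *interlink_attr);
	if (!*out) {
		*errmsg = "Interlink does not exist";	/* illustrative text */
		return -EINVAL;
	}
	return 0;
}
```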

Reviewed-by: Felix Maurer <fmaurer@redhat.com>
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260513182657.20346-3-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Reserve RX headroom to avoid skb reallocation

Reserve NET_SKB_PAD + NET_IP_ALIGN bytes of headroom for received packets
to avoid skb head reallocation when pushing protocol headers into the skb.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260513-airoha-rx-headroom-v1-1-bd87798e422d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: CGX: add bounds check to cgx_speed_mbps index

cgx_speed_mbps has 13 elements but RESP_LINKSTAT_SPEED can yield values
0-15. If it returns a value >= 13, this causes an out-of-bounds array
access. Add a bounds check and default to speed 0 if the index is out of
range.
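
A minimal sketch of the check, assuming a 4-bit speed field. The table
contents and the names speed_mbps and speed_from_field() are
placeholders for illustration and are not copied from the driver.

```c
#include <stddef.h>

/* Illustrative 13-entry table; the values are placeholders. */
static const unsigned int speed_mbps[13] = {
	0, 10, 100, 1000, 2500, 5000, 10000,
	20000, 25000, 40000, 50000, 80000, 100000,
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* A 4-bit field encodes 0-15, so 13-15 must never index the table. */
static unsigned int speed_from_field(unsigned int field)
{
	unsigned int idx = field & 0xf;

	if (idx >= ARRAY_LEN(speed_mbps))
		return 0;	/* out of range: report speed 0 */
	return speed_mbps[idx];
}
```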

Fixes: 61071a871ea6 ("octeontx2-af: Forward CGX link notifications to PFs")
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: Linu Cherian <lcherian@marvell.com>
Cc: Geetha sowjanya <gakula@marvell.com>
Cc: hariprasad <hkelam@marvell.com>
Cc: Subbaraya Sundeep <sbhatta@marvell.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: stable <stable@kernel.org>
Signed-off-by: Sam Daly <sam@samdaly.ie>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026051352-refined-demise-e88d@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

IB/IPoIB: ndo_set_rx_mode_async conversion

The commit in the Fixes tag added a warning telling ops-locked
netdevs to convert to .ndo_set_rx_mode_async. IPoIB for mlx5 is
such a driver, and it was missed during the conversion because
the flow is more complex:
- mlx5 part of IPoIB device was converted to ops-lock in commit [1].
- ipoib_intf_init() then overrides netdev_ops with
  ipoib_netdev_ops_{pf,vf}, which still wired ndo_set_rx_mode to the
  legacy sync path -- tripping the new warning on every probe.

So now we have the following splat:
  netdevice: ib0 (uninitialized): ops-locked drivers should use ndo_set_rx_mode_async
  WARNING: net/core/dev.c:11366 at register_netdevice+0x83c/0x21d0
  ...
  register_netdev+0x1f/0x40
  ipoib_add_one+0x35c/0x880 [ib_ipoib]

This patch implements .ndo_set_rx_mode_async but it simply schedules the
multicast restart task like before. This is done to maintain the
assumption that this task and others [2] must run on the same ordered
workqueue to avoid racing with themselves. The race between
ipoib_mcast_join_task() and ipoib_mcast_restart_task() would be the most
obvious example.

[1] 8f7b00307bf1 ("net/mlx5e: Convert mlx5 netdevs to instance locking")
[2] ipoib_mcast_join_task, ipoib_mcast_restart_task,
    ipoib_mcast_carrier_on_task, ipoib_reap_ah, ipoib_reap_neigh

Fixes: 3cbd22938877 ("net: warn ops-locked drivers still using ndo_set_rx_mode")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://patch.msgid.link/20260513124519.3357165-1-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'drm-fixes-2026-05-16' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly fixes pull, small and all over fixes, mostly xe and amdgpu,
  with some ttm and a core fix for the handle change pain.

  core:
   - fix for the fix for the handle change race

  ttm:
   - avoid infinite loop in swap out
   - avoid infinite loop in BO shrinking
   - convert -EAGAIN from dmem_cgroup_try_charge to -ENOSPC

  bridge:
   - imx8qxp-pxl2dpi: avoid ERR_PTR with device_node cleanup

  i915:
   - Skip __i915_request_skip() for already signaled requests
   - Fix VSC dynamic range signaling for RGB formats [dp]

  xe:
   - Madvise fix around purgeability tracking
   - Restore engine mask for specific blitter style
   - Couple UAF fixes
   - Drop unused ggtt_balloon field

  amdgpu:
   - Userq fixes
   - DCN 3.2 fix
   - RAS fix
   - GC 12 fix

  gma500:
   - oaktrail_lvds: fix i2c handling

  loongson:
   - use managed cleanup for connector polling

  panfrost:
   - handle results from reservation locking correctly

  qaic:
   - check for integer overflows in mmap logic

  rocket:
   - handle results from reservation locking correctly"

* tag 'drm-fixes-2026-05-16' of https://gitlab.freedesktop.org/drm/kernel: (26 commits)
  drm: Replace old pointer to new idr
  drm/loongson: Use managed KMS polling
  drm/ttm: Fix ttm_bo_shrink() infinite LRU walk on backup failure
  drm/ttm: Convert -EAGAIN from dmem_cgroup_try_charge to -ENOSPC
  drm/gma500/oaktrail_lvds: fix i2c adapter leaks on init
  drm/gma500/oaktrail_lvds: fix hang on init failure
  drm/gma500/oaktrail_hdmi: fix i2c adapter leak on setup
  drm/xe: Drop unused ggtt_balloon field
  accel/qaic: Add overflow check to remap_pfn_range during mmap
  drm/i915/dp: Fix VSC dynamic range signaling for RGB formats
  drm/i915: skip __i915_request_skip() for already signaled requests
  drm/bridge: imx8qxp-pxl2dpi: avoid ERR_PTR with device_node cleanup
  drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12
  drm/amd/ras: Fix CPER ring debugfs read overflow
  drm/amd/display: Wrap DCN32 phantom-plane allocation in DC_RUN_WITH_PREEMPTION_ENABLED
  drm/amdgpu: fix userq hang detection and reset
  drm/amdgpu: remove almost all calls to amdgpu_userq_detect_and_reset_queues
  drm/amdgpu: rework amdgpu_userq_signal_ioctl v3
  drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset
  drm/xe/dma-buf: fix UAF with retry loop
  ...

drm: Replace old pointer to new idr

Commit 5e28b7b94408 introduced a logical error by failing to replace the
newly generated IDR pointer to old id's pointer at the correct location
within the "change handle" logic; this resulted in the issue reported by
syzbot [1].

Specifically, the new IDR object pointer is intended to replace the original
id's pointer during the normal execution flow.

Additionally, an unnecessary conditional check for the ret exit path has
been removed.

[1]
!RB_EMPTY_ROOT(&prime_fpriv->dmabufs)
WARNING: drivers/gpu/drm/drm_prime.c:224 at drm_prime_destroy_file_private+0x48/0x60 drivers/gpu/drm/drm_prime.c:224, CPU#0: syz.0.17/5833
Call Trace:
drm_file_free.part.0+0x7e6/0xcc0 drivers/gpu/drm/drm_file.c:269
drm_file_free drivers/gpu/drm/drm_file.c:237 [inline]
drm_close_helper.isra.0+0x186/0x200 drivers/gpu/drm/drm_file.c:290
drm_release+0x1ab/0x360 drivers/gpu/drm/drm_file.c:438

Fixes: 5e28b7b94408 ("drm: Set old handle to NULL before prime swap in change_handle")
Reported-by: syzbot+d7c9eed171647e421013@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d7c9eed171647e421013
Cc: stable@vger.kernel.org
Tested-by: syzbot+d7c9eed171647e421013@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patch.msgid.link/tencent_C267296443AAA4567771176886DFF364A305@qq.com

dt-bindings: Consolidate "sram" property definition

The "sram" property has become a de facto standard property, so create a
common schema for it and drop all the duplicated definitions.

Reviewed-by: Linus Walleij <linusw@kernel.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Liu Ying <victor.liu@nxp.com> #fsl,imx8qxp-dc-command-sequencer.yaml
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Vinod Koul <vkoul@kernel.org>
Acked-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> # display/msm
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Tanmay Shah <tanmay.shah@amd.com>
Link: https://patch.msgid.link/20260511165942.2774868-1-robh@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

dt-bindings: Fix phandle-array constraints, again

The unfortunately named 'phandle-array' property type is really a matrix
with phandle and fixed arg cells entries. A matrix property should have 2
levels of items constraints.

Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260507201749.2605365-1-robh@kernel.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

ipv4: raw: reject IP_HDRINCL packets with ihl < 5

raw_send_hdrinc() validates that the caller-supplied IPv4 header
fits within the message length:

    iphlen = iph->ihl * 4;
    err = -EINVAL;
    if (iphlen > length)
        goto error_free;

    if (iphlen >= sizeof(*iph)) {
        /* fix up saddr, tot_len, id, csum, transport_header */
    }

It does not, however, reject ihl < 5.  For such a packet the
"if (iphlen >= sizeof(*iph))" branch is skipped, leaving the
crafted iphdr untouched, but the packet is still handed to
__ip_local_out() and onward.  Downstream consumers that read
iph->ihl assume a sane value: net/ipv4/ah4.c:ah_output() in
particular subtracts sizeof(struct iphdr) from top_iph->ihl * 4
and passes the (signed-int-negative, then cast to size_t)
result to memcpy(), producing an OOB access of length close to
SIZE_MAX and a host kernel panic.

An IPv4 header with ihl < 5 is malformed by definition (RFC 791:
"Internet Header Length is the length of the internet header in
32 bit words ... Note that the minimum value for a correct header
is 5.").  The kernel should not be willing to inject such a
packet into its own output path.

Reject "iphlen < sizeof(*iph)" alongside the existing
"iphlen > length" check.  This matches the principle that locally
constructed packets that re-enter the IP stack must pass the same
basic sanity tests that a foreign packet would be subjected to.
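
The check and the failure it prevents can be sketched in userspace as
follows. MODEL_IPHDR_LEN, model_optlen() and model_check_ihl() are
hypothetical names; model_optlen() only mimics the arithmetic described
above for ah_output() and is not the kernel code.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_IPHDR_LEN 20u	/* size of an IPv4 header without options */

/*
 * Option length as a consumer trusting ihl would compute it:
 * signed subtraction first, then conversion to size_t.
 * For ihl < 5 the result wraps to a value close to SIZE_MAX.
 */
static size_t model_optlen(unsigned int ihl)
{
	int optlen = (int)(ihl * 4) - (int)MODEL_IPHDR_LEN;

	return (size_t)optlen;
}

/* Header must fit in the message and be at least a minimal header. */
static int model_check_ihl(unsigned int ihl, size_t length)
{
	size_t iphlen = (size_t)ihl * 4;

	if (iphlen > length)
		return -EINVAL;	/* existing check */
	if (iphlen < MODEL_IPHDR_LEN)
		return -EINVAL;	/* added check: ihl < 5 */
	return 0;
}
```

With ihl = 4 the modelled option length is (size_t)-4, which is why
such a header has to be rejected before it reaches any consumer that
uses the value as a copy length.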

Once this lands, the "if (iphlen >= sizeof(*iph))" wrapper around
the fixup branch becomes redundant; it is left in place to keep the
patch minimal and backport-friendly.  A follow-up can unwrap it.

Note that commit 86f4c90a1c5c ("ipv4, ipv6: ensure raw socket
message is big enough to hold an IP header") ensures the message
buffer is large enough to hold an iphdr, but does not constrain
the self-reported iph->ihl.

Reachability: the malformed packet source is any caller with
CAP_NET_RAW, including an unprivileged process in a user+net
namespace on a kernel with CONFIG_USER_NS=y.  The reproduced AH
crash also requires a matching xfrm AH policy on the outgoing
route; a container granted CAP_NET_ADMIN can install that state
and policy in its netns.  Loopback bypasses xfrm_output, so the
trigger uses a real netdev.

Reproduced on UML + KASAN: kernel-mode fault at addr 0x0 with
memcpy_orig at the crash site.  Same shape reproduces inside a
rootless Docker container with --cap-add NET_ADMIN on a stock
distro kernel.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/77ec2b5e8111961c2c39883c92e8aa2709039c17.1778614451.git.michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: netlink: Correct buffer sizing info

Update the docs to match the code (include/linux/netlink.h):

  /*
   * skb should fit one page. This choice is good for headerless malloc.
   * But we should limit to 8K so that userspace does not have to
   * use enormous buffer sizes on recvmsg() calls just to avoid
   * MSG_TRUNC when PAGE_SIZE is very large.
   */
  #if PAGE_SIZE < 8192UL
  #define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(PAGE_SIZE)
  #else
  #define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(8192UL)
  #endif

Signed-off-by: Konstantin Shabanov <mail@etehtsea.me>
Link: https://patch.msgid.link/20260512103101.1076173-1-mail@etehtsea.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 MPAM fixes from Catalin Marinas:

- Fix NULL dereference and a false-positive warning when the driver
   probes hardware with surprising version numbers

- Fix writing values to the wrong registers when probing
   cache-utilisation counters. Replace 'NRDY' probing with a version
   that is robust for platforms where the bit is writeable by both
   hardware and software

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm_mpam: Check whether the config array is allocated before destroying it
  arm_mpam: Fix false positive assert failure during mpam_disable()
  arm_mpam: Improve check for whether or not NRDY is hardware managed
  arm_mpam: Pretend that NRDY is always hardware managed
  arm_mpam: Fix monitor instance selection when checking for hardware NRDY

Merge tag 'iommu-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu fixes from Joerg Roedel:
"This is probably the largest fixes pull-request ever sent for IOMMU. I
  partially blame it on AI code review which found some issues but there
  is also some rework in here to fix issues in the iommu parts of PCI
  device reset.

  AMD-Vi:
   - Add bounds checks to debugfs and table lookups

  Intel VT-d:
   - Apply an existing quirk for Q35 graphic device
   - Skip dev_pasid teardown for the blocked domain to avoid
     out-of-bounds access
   - Return early if dev_pasid is missing to prevent NULL dereference
     or UAF

  Core:
   - Fix bugs and corner cases in pci_dev_reset_iommu_prepare/done()
   - Fix various issues found by AI in iommupt code

  MAINTAINERS email address update for RISCV IOMMU"

* tag 'iommu-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
  MAINTAINERS: update Tomasz Jeznach's email address
  iommupt: Fix the end_index calculation in __map_range_leaf()
  iommupt: Check for missing PAGE_SIZE in the pgsize_bitmap
  iommu: Handle unmap error when iommu_debug is enabled
  iommu: Fix up map/unmap debugging for iommupt domains
  iommu: Fix loss of errno on map failure for classic ops
  iommu/vt-d: Avoid NULL pointer dereference or refcount corruption
  iommu/vt-d: Fix oops due to out of scope access
  iommu/vt-d: Disable DMAR for Intel Q35 IGFX
  iommu: Warn on premature unblock during DMA aliased sibling reset
  iommu: Fix WARN_ON in __iommu_group_set_domain_nofail() due to reset
  iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid()
  iommu: Fix nested pci_dev_reset_iommu_prepare/done()
  iommu: Fix pasid attach in pci_dev_reset_iommu_prepare/done()
  iommu: Replace per-group resetting_domain with per-gdev blocked flag
  iommu: Fix kdocs of pci_dev_reset_iommu_done()
  iommu: Fix NULL group->domain dereference in pci_dev_reset_iommu_done()
  iommu/amd: Bounds-check devid in __rlookup_amd_iommu()
  iommu/amd: Remove latent out-of-bounds access in IOMMU debugfs

Merge tag 'vfio-v7.1-rc4' of https://github.com/awilliam/linux-vfio

Pull VFIO fixes from Alex Williamson:

- Convert vfio-pci BAR resource requests and iomaps initialization
   from a lazy, on-demand model to an eager pre-allocation model to
   avoid races while preserving legacy error behavior.  Fix unchecked
   barmap access in dma-buf export path (Matt Evans)

- Introduce an implicit unsigned cast in converting vfio-pci device
   offsets to region indexes, closing a potential out-of-bounds
   access through the vfio_pci_ioeventfd() interface (Matt Evans)

- Fix a dma-buf kref underflow and stuck wait_for_completion() when
   closing a previously revoked dma-buf (Alex Williamson)
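
The second item above concerns a signed offset being turned into a
region index. A userspace sketch of why the unsigned conversion
matters, assuming a shift of 40 bits and an illustrative region count;
all MODEL_* names and helpers are hypothetical:

```c
#include <stdint.h>

#define MODEL_OFFSET_SHIFT 40	/* assumed position of the index bits */
#define MODEL_NUM_REGIONS 9	/* illustrative region count */

/*
 * Signed variant. Right-shifting a negative value is
 * implementation-defined in C; common compilers use an arithmetic
 * shift, so a negative offset yields a negative index.
 */
static int64_t index_signed(int64_t off)
{
	return off >> MODEL_OFFSET_SHIFT;
}

/* Unsigned variant: a negative offset becomes a very large index. */
static uint64_t index_unsigned(int64_t off)
{
	return (uint64_t)off >> MODEL_OFFSET_SHIFT;
}

/* Upper-bound-only checks, as code expecting index >= 0 might do. */
static int passes_check_signed(int64_t off)
{
	return index_signed(off) < MODEL_NUM_REGIONS;
}

static int passes_check_unsigned(int64_t off)
{
	return index_unsigned(off) < MODEL_NUM_REGIONS;
}
```

For an offset of -1, the signed index slips under the upper bound while
the unsigned index is rejected by the same comparison.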

* tag 'vfio-v7.1-rc4' of https://github.com/awilliam/linux-vfio:
  vfio/pci: Check BAR resources before exporting a DMABUF
  vfio/pci: Set up BAR resources and maps in vfio_pci_core_enable()
  vfio/pci: Make VFIO_PCI_OFFSET_TO_INDEX() return unsigned
  vfio/pci: fix dma-buf kref underflow after revoke

Merge tag 'drm-misc-fixes-2026-05-15' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

Short summary of fixes pull:

bridge:
- imx8qxp-pxl2dpi: avoid ERR_PTR with device_node cleanup

gma500:
- oaktrail_lvds: fix i2c handling

loongson:
- use managed cleanup for connector polling

panfrost:
- handle results from reservation locking correctly

qaic:
- check for integer overflows in mmap logic

rocket:
- handle results from reservation locking correctly

ttm:
- avoid infinite loop in swap out
- avoid infinite loop in BO shrinking
- convert -EAGAIN from dmem_cgroup_try_charge to -ENOSPC

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patch.msgid.link/20260515070816.GA88575@2a02-2455-9062-2500-7dec-552d-233d-9fe0.dyn6.pyur.net

Merge tag 'v7.1-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Fix integer overflow in read

- Fix smbdirect error cleanup

- Multichannel reconnect fix

- Add some missing defines and correct some references to protocol spec

- Fix oob symlink read

* tag 'v7.1-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smbdirect: Fix error cleanup in smbdirect_map_sges_from_iter()
  smb: client: avoid integer overflow in SMB2 READ length check
  cifs: client: stage smb3_reconfigure() updates and restore ctx on failure
  smb/client: fix possible infinite loop and oob read in symlink_data()
  SMB3.1.1: add missing QUERY_DIR info levels

bpf: Add Jiayuan Chen to sockmap maintainers

Nominate Jiayuan Chen as a sockmap co-maintainer. Jiayuan has been a
regular contributor and reviewer for the sockmap and networking code.

Since we are now down to just two maintainers, and John has to split his
time between BPF core, BPF networking, and sockmap, having three
maintainers again will help with the review load.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260511-sockmap-ktls-fix-1-v1-1-96ff8c1906e4@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'ceph-for-7.1-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
"An important patch from Hristo that squashes a folio reference leak
  that could lead to OOM kills in CephFS and a number of miscellaneous
  fixes from Raphael and Slava.

  All but two are marked for stable"

* tag 'ceph-for-7.1-rc4' of https://github.com/ceph/ceph-client:
  libceph: Fix potential null-ptr-deref in decode_choose_args()
  libceph: handle rbtree insertion error in decode_choose_args()
  libceph: Fix potential out-of-bounds access in osdmap_decode()
  ceph: put folios not suitable for writeback
  ceph: add ceph_has_realms_with_quotas() check to ceph_quota_update_statfs()
  libceph: Fix potential out-of-bounds access in __ceph_x_decrypt()
  ceph: fix BUG_ON in __ceph_build_xattrs_blob() due to stale blob size
  ceph: fix a buffer leak in __ceph_setxattr()
  libceph: Fix unnecessarily high ceph_decode_need() for uniform bucket
  libceph: Fix potential out-of-bounds access in crush_decode()

Merge tag 'drm-xe-fixes-2026-05-14' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

- Madvise fix around purgeability tracking (Arvind)
- Restore engine mask for specific blitter style (Roper)
- Couple UAF fixes (Auld)
- Drop unused ggtt_balloon field (Wajdeczko)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/agXWkM3Y98bqt6TG@intel.com

drm/xe/reg_sr: Do sanity check for MCR vs non-MCR

The type struct xe_reg_mcr exists to ensure that the correct API is used
when handling MCR registers.  However, for the register save/restore
functionality, the RTP processing always casts the register to a struct
xe_reg, and then apply_one_mmio() selects the MMIO API based on the "mcr"
field of the register instance.

This allows the developer to make mistakes like passing a MCR register
for an RTP action for a GT where the respective register is not MCR, and
vice versa.

To capture such scenarios, do a sanity check in xe_reg_sr_add() that,
upon an inconsistency:

- "fixes" the register type by favoring what we have in our MCR range
  tables instead of what the developer selected for the save/restore
  entry;
- raises a notice-level message to inform about the inconsistency.
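
The two steps above can be sketched as a userspace model. The types and
helpers (struct model_reg, struct model_range, model_is_mcr(),
model_sanitize()) are hypothetical and only illustrate the idea of
trusting the range table over the flag on the entry; they are not the
xe driver's API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical save/restore entry: address plus the developer's flag. */
struct model_reg {
	uint32_t addr;
	bool mcr;
};

/* Hypothetical MCR range table entry, bounds inclusive. */
struct model_range {
	uint32_t start;
	uint32_t end;
};

/* True when addr falls inside any range of the table. */
static bool model_is_mcr(const struct model_range *ranges, size_t n,
			 uint32_t addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (addr >= ranges[i].start && addr <= ranges[i].end)
			return true;
	return false;
}

/*
 * Returns true when the entry disagreed with the table and was
 * corrected; a caller would emit its notice-level message then.
 */
static bool model_sanitize(struct model_reg *reg,
			   const struct model_range *ranges, size_t n)
{
	bool expected = model_is_mcr(ranges, n, reg->addr);

	if (reg->mcr == expected)
		return false;
	reg->mcr = expected;	/* favor the range table */
	return true;
}
```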

Note: As a side effect of this change, we need to include MCR
initialization in xe_wa_test.c, otherwise a bunch of test cases end up
failing because xe_gt_mcr_check_reg() will always return false, meaning
that it will incorrectly report a MCR register as not being MCR.

v2:
  - Downgrade messages to notice level so as not to block CI execution
    when inconsistencies are found. (Matt)
  - Add missing EXPORT_SYMBOL_IF_KUNIT() calls. (Gustavo)

Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-7-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

drm/xe/mcr: Extract reg_in_steering_type_ranges()

The logic to check if a register falls within one of the ranges for a
steering type is already duplicated in
xe_gt_mcr_get_nonterminated_steering(). We will also want to use that
same logic in another upcoming function. Let's factor out that logic
and put it into a function named reg_in_steering_type_ranges().

Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-6-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

drm/xe/kunit: Use KUNIT_EXPECT_EQ() in xe_wa_gt()

Use KUNIT_EXPECT_EQ() in xe_wa_gt() as reg_sr errors in one GT do not
impact the next GT in the test.

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-5-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

drm/xe: Extract xe_hw_engine_setup_reg_lrc()

The steps for processing RTP rules that build up an engine's reg_lrc
arguably belong to xe_hw_engine.c and should be encapsulated into a
function in that unit.

Move that logic to a new function called xe_hw_engine_setup_reg_lrc().

Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-4-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

drm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN4

The register COMMON_SLICE_CHICKEN4 is a MCR register on both Xe2 and
Xe3. Let's make sure to define a MCR version of it and use it for the
relevant IP versions.

Use XEHP_ as prefix for the register name, since it is MCR as of Xe_HP.

v2:
  - Also change one entry in lrc_tunings, which was caught by
    manual testing, and add the corresponding Fixes tag to the
    commit message. (Gustavo)

Fixes: 8d6f16f1f082 ("drm/xe: Extend Wa_22021007897 to Xe3 platforms")
Fixes: e5c13e2c505b ("drm/xe/xe2hpg: Add Wa_22021007897")
Fixes: 8ccf5f6b2295 ("drm/xe/tuning: Apply windower hardware filtering setting on Xe3 and Xe3p")
Bspec: 66534, 71185, 74417
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-3-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

drm/xe: Define and use MCR version of COMMON_SLICE_CHICKEN1

The register COMMON_SLICE_CHICKEN1 is a MCR register on Xe2.
Let's make sure to define a MCR version of it and use it for the
relevant IP versions.

Use XEHP_ as prefix for the register name, since it is MCR as of Xe_HP.

Fixes: a5d221924e13 ("drm/xe/xe2_hpg: Add set of workarounds")
Fixes: 9f18b55b6d3f ("drm/xe/xe2: Add workaround 18033852989")
Bspec: 66534, 71185
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-2-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

drm/xe: Define CACHE_MODE_1 as MCR register

CACHE_MODE_1 is a MCR register for all platforms that currently use it
in the Xe driver. Use XE_REG_MCR() when defining it.

Fixes: 8cd7e9759766 ("drm/xe: Add missing DG2 lrc workarounds")
Fixes: ff063430caa8 ("drm/xe/mtl: Add some initial MTL workarounds")
Bspec: 66534, 67788
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260514-rtp-mcr-check-v3-1-30dd47855fee@intel.com
Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com>

thermal/core: Split __thermal_cooling_device_register() into two functions

In preparation for the upcoming changes separating OF and non-OF code,
split __thermal_cooling_device_register() into allocation and addition
phases.

This allows moving the device node assignment out of the core
initialization path.

This change is not a trivial split. The lifetime of the cooling device
is managed by the device core through put_device(), which triggers
thermal_release() to free all associated resources.

With the introduction of thermal_cooling_device_alloc(), the allocation
path must mirror what thermal_release() undoes. In contrast,
thermal_cooling_device_add() must not perform any rollback and relies
on put_device() for cleanup on error paths. This avoids both double
free and resource leaks.

As part of this rework, add the missing device_initialize() call when
allocating the cooling device.
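
The ownership rule described above can be illustrated with a userspace
model. Everything here is hypothetical (struct model_cdev and the
model_* helpers); model_put() plays the role of put_device(),
model_release() the role of the release callback, and setting the
refcount to 1 is only comparable to what device_initialize() does.

```c
#include <stdlib.h>

static int release_calls;	/* counts how often the release hook ran */

/* Hypothetical stand-in for a refcounted cooling device. */
struct model_cdev {
	int refcount;
	int *stats;
	int added;
};

/* Release hook: frees everything the allocation phase set up. */
static void model_release(struct model_cdev *cdev)
{
	free(cdev->stats);
	free(cdev);
	release_calls++;
}

/* Drop a reference; the last put triggers the release hook. */
static void model_put(struct model_cdev *cdev)
{
	if (--cdev->refcount == 0)
		model_release(cdev);
}

/* Allocation phase: on failure, undo by hand what release would undo. */
static struct model_cdev *model_alloc(void)
{
	struct model_cdev *cdev = calloc(1, sizeof(*cdev));

	if (!cdev)
		return NULL;
	cdev->stats = calloc(4, sizeof(*cdev->stats));
	if (!cdev->stats) {
		free(cdev);
		return NULL;
	}
	cdev->refcount = 1;
	return cdev;
}

/* Addition phase: never rolls back; the caller puts on error. */
static int model_add(struct model_cdev *cdev, int simulate_failure)
{
	if (simulate_failure)
		return -1;
	cdev->added = 1;
	return 0;
}

/* Combined registration: one cleanup path, so no double free or leak. */
static struct model_cdev *model_register(int simulate_failure)
{
	struct model_cdev *cdev = model_alloc();

	if (!cdev)
		return NULL;
	if (model_add(cdev, simulate_failure)) {
		model_put(cdev);
		return NULL;
	}
	return cdev;
}
```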

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Replace device_register() with device_add() ]
[ rjw: Rebase on top of previously applied material ]
Link: https://patch.msgid.link/20260505144447.2853933-1-daniel.lezcano@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Merge tag 'for-7.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fixup warning when allocating memory for readahead, __GFP_NOWARN was
   accidentally dropped when setting mapping constraints

- in tracepoint of file sync, fix sleeping in atomic context when
   handling dentries

- harden initial loading of block group on crafted/fuzzed images,
   iterate all chunk mapping entries unconditionally

- fix freeing pages of submitted io after checking for errors

- fix incorrect inode size after remount when using fallocate KEEP_SIZE
   mode (also requires disabled 'no-holes' feature)

* tag 'for-7.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix incorrect i_size after remount caused by KEEP_SIZE prealloc gap
  btrfs: only release the dirty pages io tree after successful writes
  btrfs: tracepoints: fix sleep while in atomic context in btrfs_sync_file()
  btrfs: always pass __GFP_NOWARN from add_ra_bio_pages()
  btrfs: fix check_chunk_block_group_mappings() to iterate all chunk maps

Merge tag 'xfs-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:
"A few bug fixes, nothing really special stands out"

* tag 'xfs-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Fix typo in comment
  xfs: fix the "limiting open zones" message
  xfs: flush delalloc blocks on ENOSPC in xfs_trans_alloc_icreate
  xfs: check da node block pad field during scrub
  xfs: fix memory leak for data allocated by xfs_zone_gc_data_alloc()
  xfs: fix memory leak on error in xfs_alloc_zone_info()
  xfs: check directory data block header padding in scrub
  xfs: zero directory data block padding on write verification
  xfs: zero entire directory data block header region at init
  xfs: remove the meaningless XFS_ALLOC_FLAG_FREEING

Merge tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
"Fixes for this release:
   - Correctness fix for the new sunrpc cache netlink protocol

  Marked for stable:
   - Correctness fixes for delegated attributes
   - Prevent an infinite loop when revoking layouts"

* tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix infinite loop in layout state revocation
  sunrpc: start cache request seqno at 1 to fix netlink GET_REQS
  nfsd: update mtime/ctime on COPY in presence of delegated attributes
  nfsd: update mtime/ctime on CLONE in presense of delegated attributes
  nfsd: fix file change detection in CB_GETATTR
  nfsd: fix GET_DIR_DELEGATION when VFS leases are disabled

Merge tag 'block-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block fixes from Jens Axboe:

- NVMe merge request via Keith:
     - Fix memory leak on a passthrough integrity mapping failure (Keith)
     - Hide secrets behind debug option (Hannes)
     - Fix pci use-after-free for host memory buffer (Chia-Lin Kao)
     - Fix tcp target use-after-free for data digest (Sagi)
     - Revert a mistaken quirk (Alan Cui)
     - Fix uevent and controller state race condition (Maurizio)
     - Fix apple submission queue re-initialization (Nick Chan)

- Three fixes for blk-integrity, fixing an issue with the user data
   mapping and two problems with recomputing number of segments

- Two fixes for the iov_iter bounce buffering

- Fix for the handling of dead zoned write plugs

- ublk max_sectors validation fix, with associated selftest addition

* tag 'block-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  nvme-apple: Reset q->sq_tail during queue init
  block: align down bounces bios
  block: pass a minsize argument to bio_iov_iter_bounce
  selftests: ublk: cap nthreads to kernel's actual nr_hw_queues
  block: fix handling of dead zone write plugs
  block: bio-integrity: Fix null-ptr-deref in bio_integrity_map_user()
  block: recompute nr_integrity_segments in blk_insert_cloned_request
  block: don't overwrite bip_vcnt in bio_integrity_copy_user()
  nvme: fix race condition between connected uevent and STARTED_ONCE flag
  Revert "nvme: add quirk NVME_QUIRK_IGNORE_DEV_SUBNQN for 144d:a808"
  nvmet-tcp: Fix potential UAF when ddgst mismatch
  nvme-pci: fix use-after-free in nvme_free_host_mem()
  nvmet-auth: Do not print DH-HMAC-CHAP secrets
  nvme: fix bio leak on mapping failure
  nvme: make prp passthrough usage less scary
  ublk: reject max_sectors smaller than PAGE_SECTORS in parameter validation

Merge tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

- Small series sanitizing the locking done for either modifying or
   reading a chain of requests

- If the application has a pid namespace, ensure that the sqthread pid
   is correctly printed in fdinfo

- Fix for a hashing issue in the io-wq thread pool, which could lead to
   a use-after-free

- Kill dead argument from io_prep_rw_pi()

- Fix for a missed validation of the CQ ring head, affecting CQE refill

* tag 'io_uring-7.1-20260515' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
  io-wq: check that the predecessor is hashed in io_wq_remove_pending()
  io_uring/rw: drop unused attr_type_mask from io_prep_rw_pi()
  io_uring: hold uring_lock across io_kill_timeouts() in cancel path
  io_uring: defer linked-timeout chain splice out of hrtimer context
  io_uring: hold uring_lock when walking link chain in io_wq_free_work()
  io_uring/fdinfo: translate SqThread PID through caller's pid_ns

Merge tag 'hardening-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fix from Kees Cook:

- gcc-plugins: Fix GCC 16 removal of CONST_CAST macros

* tag 'hardening-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  gcc-plugins: Always define CONST_CAST_GIMPLE and CONST_CAST_TREE

Merge tag 'docs-7.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux

Pull documentation fixes from Jonathan Corbet:
"This is Willy Tarreau's new document clarifying the definition and
  handling of security-related bugs, which we're trying to get out there
  quickly on the theory that some of the bug reporters might actually
  read and pay attention to it"

* tag 'docs-7.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux:
  docs: threat-model: don't limit root capabilities to CAP_SYS_ADMIN
  docs: security-bugs: add a link to the threat-model documentation
  Documentation: security-bugs: clarify requirements for AI-assisted reports
  Documentation: security-bugs: explain what is and is not a security bug
  Documentation: security-bugs: do not systematically Cc the security team

perf pmu: Skip test on Arm64 when #slots is zero

Some Arm64 PMUs expose 'caps/slots' as 0 when the slot count is not
implemented. tool_pmu__read_event() currently returns false in this case,
so metrics that reference #slots are reported as syntax errors.

Since commit 3a61fd866ef9 ("perf expr: Return -EINVAL for syntax
error in expr__find_ids()"), these syntax errors are propagated as
failures and make the PMU metric test fail:

    9.3: Parsing of PMU event table metrics:
    --- start ---
    ...

    Found metric 'backend_bound'
    metric expr 100 * (stall_slot_backend / (#slots * cpu_cycles)) for backend_bound
    parsing metric: 100 * (stall_slot_backend / (#slots * cpu_cycles))
    Failure to read '#slots'
    literal: #slots = nan
    syntax error
    Fail to parse metric or group `backend_bound'

    ...
    ---- end(-1) ----
    9.3: Parsing of PMU event table metrics    : FAILED!

This commit introduces a new function is_expected_broken_metric() to
identify broken metrics, and treats metrics containing "#slots" as
expected broken when #slots == 0 on Arm64 platforms.
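
The check described above might look roughly like the following. This is a minimal sketch, not the actual perf code: the signature, the `slots` argument, and the `on_arm64` flag are assumptions made for illustration.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: the real is_expected_broken_metric() in perf may take
 * different arguments. Here the caller passes the metric expression, the
 * value read for #slots, and whether the test runs on Arm64.
 */
static bool is_expected_broken_metric(const char *metric_expr,
				      unsigned long slots, bool on_arm64)
{
	/* Only Arm64 PMUs that report zero slots are treated as expected-broken. */
	if (!on_arm64 || slots != 0)
		return false;

	/* A metric is affected only if its expression references #slots. */
	return strstr(metric_expr, "#slots") != NULL;
}
```

With a check of this shape, a metric such as backend_bound is skipped only when the platform really reports zero slots; any other parse failure still fails the test.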

Fixes: 3a61fd866ef9aaa1 ("perf expr: Return -EINVAL for syntax error in expr__find_ids()")
Reviewed-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf trace beauty fcntl: Fix build with older kernel headers

Toolchains with older kernel headers that do not include upstream commit
c75b1d9421f80f41 ("fs: add fcntl() interface for setting/getting write
life time hints") will now fail to build perf due to missing definitions
for F_GET_RW_HINT/F_SET_RW_HINT/F_GET_FILE_RW_HINT/F_SET_FILE_RW_HINT.

Provide a fallback definition for these when they are not already
defined.
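
A fallback of this kind can be sketched as below. The numeric values match the kernel's include/uapi/linux/fcntl.h; where the actual patch places the definitions and how it guards them are assumptions here.

```c
#include <fcntl.h>

/* Base for Linux-specific fcntl commands, as in the kernel uapi header. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif

/* Define each write-hint command only when the toolchain headers lack it. */
#ifndef F_GET_RW_HINT
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef F_GET_FILE_RW_HINT
#define F_GET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 13)
#endif
#ifndef F_SET_FILE_RW_HINT
#define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14)
#endif
```

Guarding each macro separately keeps the build working with headers that provide only some of the four definitions.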

Fixes: 9c47f66748381ecb ("perf trace beauty fcntl: Basic 'arg' beautifier")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Markus Mayer <mmayer@broadcom.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf unwind-libunwind: Add RISC-V libunwind support

Add a RISC-V implementation for unwinding.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andrew Jones <andrew.jones@oss.qualcomm.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Dmitrii Dolgov <9erthalion6@gmail.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Li Guan <guanli.oerv@isrc.iscas.ac.cn>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shimin Guo <shimin.guo@skydio.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf unwind-libunwind: Remove libunwind-local

Local unwinding only works on the machine libunwind is built for,
rather than cross platform. The APIs for remote and local unwinding
are similar, but types like unw_word_t depend on the included
header. Place the architecture specific code into the appropriate
libunwind-<arch>.c file. Put generic code in unwind-libunwind.c and
use libunwind-arch to choose the correct implementation based on the
thread's e_machine. Structuring the code this way avoids including the
unwind-libunwind-local.c for each architecture of remote
unwinding. Data is moved into the struct unwind_info to simplify the
architecture and generic code, trying to keep as much code as possible
generic.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andrew Jones <andrew.jones@oss.qualcomm.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Dmitrii Dolgov <9erthalion6@gmail.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Li Guan <guanli.oerv@isrc.iscas.ac.cn>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shimin Guo <shimin.guo@skydio.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf unwind-libunwind: Move flush/finish access out of local

Flush and finish access are relatively simple calls into libunwind;
move them out of struct unwind_libunwind_ops. So that the correct version
can be called, add an e_machine variable to maps. This size regression
will go away when the unwind_libunwind_ops no longer need stashing in
the maps. To set the e_machine up pass it into unwind__prepare_access,
which no longer needs to determine the unwind operations based on a
map dso because of this. This also means the maps copying code can
call unwind__prepare_access once for the e_machine rather than once
per map.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andrew Jones <andrew.jones@oss.qualcomm.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Dmitrii Dolgov <9erthalion6@gmail.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Li Guan <guanli.oerv@isrc.iscas.ac.cn>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shimin Guo <shimin.guo@skydio.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf unwind-libunwind: Make libunwind register reading cross platform

Move the libunwind register to perf register mapping functions in
arch/../util/unwind-libunwind.c into a new libunwind-arch
directory. Rename the functions to
__get_perf_regnum_for_unw_regnum_<arch>. Add untested ppc32 and s390
functions. Add a get_perf_regnum_for_unw_regnum function that takes an
ELF machine as well as a register number and chooses the appropriate
architecture implementation.

Split the x86 and powerpc 32 and 64-bit implementations apart so that
a single libunwind-<arch>.h header is included.

Move the e_machine into the unwind_info struct to make it easier to
pass.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andrew Jones <andrew.jones@oss.qualcomm.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Dmitrii Dolgov <9erthalion6@gmail.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Li Guan <guanli.oerv@isrc.iscas.ac.cn>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <pjw@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shimin Guo <shimin.guo@skydio.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Will Deacon <will@kernel.org>
[ Map UNW_PPC32_NIP to PERF_REG_POWERPC_NIP like done for 64-bit, pointed out by a local sashiko ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

drm/xe/pf: Fix CFI failure in debugfs access

Reading the debugfs file /sys/kernel/debug/dri/0/gt*/pf/adverse_events
with CFI (Control Flow Integrity) enabled makes the kernel panic at
xe_gt_debugfs_simple_show+0x82/0xc0.

xe_gt_debugfs_simple_show() declares a function pointer with an int
return type, but xe_gt_sriov_pf_monitor_print_events() returns void,
leading to a CFI failure and kernel panic.

[507620.973657] CFI failure at xe_gt_debugfs_simple_show+0x82/0xc0 [xe]
(target: xe_gt_sriov_pf_monitor_print_events+0x0/0x130 [xe]; expected
type: 0xd72c7139)

Fix this by changing xe_gt_sriov_pf_monitor_print_events() to return
int.
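
The mismatch can be illustrated in plain userspace C. This is a sketch with hypothetical stand-in names (`print_fn`, `print_events`, `simple_show`), not the xe driver code: the point is only that the callee's type must match the pointer type it is called through, which is what CFI checks at the indirect call.

```c
#include <stdio.h>

/* Stand-in for the callback type the debugfs show helper expects. */
typedef int (*print_fn)(FILE *out);

/*
 * Stand-in for the print function after the fix: it returns int, so its
 * type matches print_fn exactly. A void-returning version would have a
 * different type and trip the CFI check when called through print_fn.
 */
static int print_events(FILE *out)
{
	fprintf(out, "adverse events: none\n");
	return 0;
}

/* Stand-in for the show helper: an indirect call through print_fn. */
static int simple_show(FILE *out, print_fn print)
{
	return print(out);
}
```

Calling `simple_show(stdout, print_events)` goes through a pointer whose type matches the target, so no cast is needed anywhere.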

Fixes: 1c99d3d3edab ("drm/xe/pf: Expose PF monitor details via debugfs")
Signed-off-by: Mohanram Meenakshisundaram <mohanram.meenakshisundaram@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20260514174918.1556357-2-mohanram.meenakshisundaram@intel.com

drm/xe/vf: Fix signature of print functions

We have plugged-in existing VF print functions into our GT debugfs
show helper as-is, but we missed that the helper expects functions
to return int, while they were defined as void. This can lead to
errors being reported when CFI is enabled.

Fixes: 63d8cb8fe3dd ("drm/xe/vf: Expose SR-IOV VF attributes to GT debugfs")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Mohanram Meenakshisundaram <mohanram.meenakshisundaram@intel.com>
Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://patch.msgid.link/20260514155726.7165-1-michal.wajdeczko@intel.com

ring-buffer remote: Avoid unexpected symbol warnings (arm, s390)

The now more verbose check found more architecture-specific symbols
missing from the whitelist during randconfig testing on s390
and 32-bit arm:

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
         U __aeabi_unwind_cpp_pr1

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                 U __s390_indirect_jump_r1
                 U __s390_indirect_jump_r10
                 U __s390_indirect_jump_r14
                 U __s390_indirect_jump_r2
                 U __s390_indirect_jump_r5
                 U __s390_indirect_jump_r7
                 U __s390_indirect_jump_r8
                 U __s390_indirect_jump_r9
make[6]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:160: kernel/trace/simple_ring_buffer.o.checked] Error 1

Add these to the list and keep it roughly sorted into sanitizer
and architecture symbols.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20260515105717.1023007-1-arnd@kernel.org
Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Merge tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

- one simple cleanup

- a fix for a corner case when running as Xen PV dom0

- a fix of a regression for Xen PV guests, introduced in 7.0

* tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving
  x86/xen: Fix xen_e820_swap_entry_with_ram()
  xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h

Merge tag 'platform-drivers-x86-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Ilpo Järvinen:

- asus-nb-wmi:
    - Use existing keyboard quirk for ASUS Zenbook Duo UX8407AA

- hp-wmi:
    - Add support for Victus 16-r0xxx (8BC2)

- intel/vsec_tpmi:
    - Move debugfs register before creating devices
    - Prevent fault during unbind

- lenovo-wmi-*:
    - Fix memory leak in lwmi_dev_evaluate_int()
    - Balance IDA id allocation and free
    - Balance component bind and unbind
    - Prevent sending uninitialized WMI arguments to the device
    - Decouple lenovo-wmi-gamezone and lenovo-wmi-other to simplify
      module dependency graph
    - Limit adding attributes to supported devices

- samsung-galaxybook:
    - Handle kbd backlight, mic mute and camera block hotkeys

* tag 'platform-drivers-x86-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8407AA
  platform/x86: lenovo-wmi-other: Limit adding attributes to supported devices
  platform/x86: lenovo-wmi-other: Add Attribute ID helper functions
  platform/x86: lenovo-wmi-helpers: Move gamezone enums to wmi-helpers
  platform/x86: lenovo: Decouple lenovo-wmi-gamezone and lenovo-wmi-other
  platform/x86: lenovo-wmi-other: Fix tunable_attr_01 struct members
  platform/x86: lenovo-wmi-other: Zero initialize WMI arguments
  platform/x86: lenovo-wmi-other: Balance component bind and unbind
  platform/x86: lenovo-wmi-other: Balance IDA id allocation and free
  platform/x86: lenovo-wmi-helpers: Fix memory leak in lwmi_dev_evaluate_int()
  platform/x86: hp-wmi: Add support for Victus 16-r0xxx (8BC2)
  platform/x86/intel/tpmi/plr: Prevent fault during unbind
  platform/x86: intel: Add notifiers support
  platform/x86: intel: Move debugfs register before creating devices
  platform/x86: samsung-galaxybook: Handle ACPI hotkey notifications
  platform/x86: samsung-galaxybook: Refactor camera lens cover input device

PCI: brcmstb: Assign pcie->gen from of_pci_get_max_link_speed()

After commit 03f920936977 ("PCI: controller: Validate max-link-speed"),
pcie->gen stopped being assigned and as a result the established PCIe link
would stop supporting Gen3 speeds on 2712 since pcie->gen is used to
populate LnkCntl2 and LnkCap in brcm_pcie_set_gen().

If the 'max-link-speed' property is not specified, or it exceeds Gen3,
resort to the HW defaults.
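
The selection rule can be sketched as follows. The helper name and the convention that 0 means "keep the hardware defaults" are assumptions for illustration; the input models the return value of of_pci_get_max_link_speed(), which is negative when the property is missing or invalid.

```c
/*
 * Hypothetical helper: returns the PCIe generation to program, or 0 when
 * the hardware defaults should be left untouched.
 */
static int pick_pcie_gen(int max_link_speed)
{
	/* Property missing or invalid (negative errno), or above Gen3. */
	if (max_link_speed < 1 || max_link_speed > 3)
		return 0;

	return max_link_speed;
}
```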

Link: https://github.com/raspberrypi/linux/issues/7343
Reported-by: Dom Cobley <popcornmix@gmail.com>
Reported-by: Phil Elwell <phil@raspberrypi.com>
Fixes: 03f920936977 ("PCI: controller: Validate max-link-speed")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Hans Zhang <18255117159@163.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260506164537.103196-1-florian.fainelli@broadcom.com

Merge tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:

- Fix potential dead-lock in rhashtable when used by xattr

- Avoid calling kvfree on atomic path in rhashtable

* tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  rhashtable: Add bucket_table_free_atomic() helper
  mm/slab: Add kvfree_atomic() helper
  rhashtable: drop ht->mutex in rhashtable_free_and_destroy()

PCI: altera: Fix resource leaks on probe failure

The chained IRQ handler is set during probe, but is only removed during the
driver remove(). If pci_host_probe() fails, the handler and INTx IRQ
domain remain set even though the devm-managed host bridge storage
containing struct altera_pcie will be released, leaving the handler with
a stale data pointer.

Interrupts are also enabled before pci_host_probe() is called. If probe
fails after that point, the controller interrupt source should be disabled
before the chained handler and INTx domain are removed.

So set the chained handler only after the INTx domain has been created.
Disable controller interrupts during IRQ teardown, and tear the IRQ setup
down if pci_host_probe() fails.
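
The ordering described above can be modelled in userspace C. This is only a sketch: the struct, the flags, and the `sketch_` function names are hypothetical, and the real driver operates on IRQ domains and hardware registers rather than integers.

```c
/* Hypothetical model of the IRQ-related state set up during probe. */
struct sketch_pcie {
	int domain_created;	/* INTx IRQ domain exists */
	int handler_set;	/* chained handler installed on the parent IRQ */
	int irqs_enabled;	/* controller interrupt source enabled */
};

/* Teardown: disable interrupts, then remove the handler, then the domain. */
static void sketch_irq_teardown(struct sketch_pcie *p)
{
	p->irqs_enabled = 0;
	p->handler_set = 0;
	p->domain_created = 0;
}

/* Probe: host_probe_ret stands in for the result of pci_host_probe(). */
static int sketch_probe(struct sketch_pcie *p, int host_probe_ret)
{
	p->domain_created = 1;
	p->handler_set = 1;	/* only after the INTx domain has been created */
	p->irqs_enabled = 1;

	if (host_probe_ret) {
		/* Undo the IRQ setup so no handler keeps a stale pointer. */
		sketch_irq_teardown(p);
		return host_probe_ret;
	}
	return 0;
}
```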

Fixes: c63aed7334c2 ("PCI: altera: Use pci_host_probe() to register host")
Signed-off-by: Mahesh Vaidya <mahesh.vaidya@altera.com>
[mani: commit log]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Subhransu S. Prusty <subhransu.sekhar.prusty@altera.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430204330.3121003-3-mahesh.vaidya@altera.com

PCI: altera: Do not dispose parent IRQ mapping

altera_pcie_irq_teardown() calls irq_dispose_mapping() on pcie->irq.
However, pcie->irq is the parent IRQ returned by platform_get_irq(), not
the mapping created by Altera INTx irq_domain.

The Altera driver only sets the chained handler on the parent IRQ. It
should detach that handler during teardown, but it should not dispose the
parent IRQ mapping, which belongs to the parent interrupt controller's
irq_domain.

Drop irq_dispose_mapping(pcie->irq) from the teardown path.

Note that during irqchip remove(), the child IRQs should have been
disposed. But since the chained handler itself is removed, there is no way
stale child IRQs (if any exist) could fire. So it is safe here.

Fixes: ec15c4d0d5d2 ("PCI: altera: Allow building as module")
Signed-off-by: Mahesh Vaidya <mahesh.vaidya@altera.com>
[mani: added a note about IRQ disposal]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Subhransu S. Prusty <subhransu.sekhar.prusty@altera.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430204330.3121003-2-mahesh.vaidya@altera.com

Merge patch series "VFS changes for nfsd CB_NOTIFY callbacks in directory delegations"

The series starts with patches to allow the vfs to ignore certain types
of events on directories. nfsd can then request these sorts of
delegations on directories, and then set up inotify watches on the
directory to trigger sending CB_NOTIFY events.

* patches from https://patch.msgid.link/20260428-dir-deleg-v3-0-5a0780ba9def@kernel.org:
  fsnotify: add FSNOTIFY_EVENT_RENAME data type
  fsnotify: add fsnotify_modify_mark_mask()
  fsnotify: new tracepoint in fsnotify()
  filelock: add an inode_lease_ignore_mask helper
  filelock: add a tracepoint to start of break_lease()
  filelock: add support for ignoring deleg breaks for dir change events
  filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"

Link: https://patch.msgid.link/20260428-dir-deleg-v3-0-5a0780ba9def@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

fsnotify: add FSNOTIFY_EVENT_RENAME data type

Add a new fsnotify_rename_data struct and FSNOTIFY_EVENT_RENAME data
type that carries both the moved dentry and the inode that was
overwritten by the rename (if any).

Update fsnotify_data_inode(), fsnotify_data_dentry(), and
fsnotify_data_sb() to handle the new type, and add a new
fsnotify_data_rename_target() helper for extracting the overwritten
target inode.

Update fsnotify_move() to use the new data type for FS_RENAME and
FS_MOVED_TO events, passing the overwritten target inode through the
event data. FS_MOVED_FROM is unchanged since the source directory
doesn't need overwrite information.

This is done so that fsnotify consumers like nfsd can atomically
observe the overwritten file when a rename replaces an existing entry,
without needing a separate FS_DELETE event.

Assisted-by: Claude (Anthropic Claude Code)
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-7-5a0780ba9def@kernel.org
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

fsnotify: add fsnotify_modify_mark_mask()

nfsd needs to be able to modify the mask on an existing mark when new
directory delegations are set or unset. Add an exported function that
allows the caller to set and clear bits in the mark->mask, and does
the recalculation if something changed.
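
The set/clear-and-recalculate behaviour can be sketched like this. The struct, the function name, and the choice to apply the set bits before the clear bits are assumptions; the real function works on struct fsnotify_mark and its signature may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a mark: a mask plus a recalculation counter. */
struct sketch_mark {
	uint32_t mask;
	unsigned int recalcs;
};

/* Set and clear bits in the mask; recalculate only when the mask changed. */
static bool sketch_modify_mark_mask(struct sketch_mark *mark,
				    uint32_t set, uint32_t clear)
{
	uint32_t new_mask = (mark->mask | set) & ~clear;

	if (new_mask == mark->mask)
		return false;

	mark->mask = new_mask;
	mark->recalcs++;	/* stands in for the mask recalculation */
	return true;
}
```

Skipping the recalculation when nothing changed keeps repeated delegation updates with the same bits cheap.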

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-6-5a0780ba9def@kernel.org
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

fsnotify: new tracepoint in fsnotify()

Add a tracepoint so we can see exactly how this is being called.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-5-5a0780ba9def@kernel.org
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()

Same race shape as the rmdir path that 93618edf7538 ("cgroup: Defer css
percpu_ref kill on rmdir until cgroup is depopulated") fixed: a task past
exit_signals() whose cset subsys[ssid] still pins the disabled controller's
css can be touching subsys state while ->css_offline() runs. The earlier
patches in this series built up the per-subsys-css deferral machinery and
routed cgroup_destroy_locked() through it. Apply the same shape to
cgroup_apply_control_disable():

	kill_css_sync(css);
	if (!css_is_populated(css))
		kill_css_finish(css);

When the dying css is still populated, kill_css_finish() is deferred. The
walker in css_update_populated() fires kill_finish_work once the css's
hierarchical populated count drops to zero.
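
The deferral can be modelled with a single-threaded userspace sketch. All names are hypothetical, and the model ignores locking, memory barriers, workqueues, and the hierarchy; it only shows when the finish step runs.

```c
#include <stdbool.h>

/* Hypothetical model of one subsystem css. */
struct sketch_css {
	int nr_populated;	/* tasks still referencing this css */
	bool dying;		/* set by the sync step, like CSS_DYING */
	bool finished;		/* set once the finish step has run */
};

static void sketch_kill_finish(struct sketch_css *css)
{
	css->finished = true;
}

/* Disable path: mark dying, finish immediately only if already empty. */
static void sketch_disable(struct sketch_css *css)
{
	css->dying = true;
	if (css->nr_populated == 0)
		sketch_kill_finish(css);
}

/* Populated-count update: the last task to leave a dying css finishes it. */
static void sketch_task_exit(struct sketch_css *css)
{
	if (--css->nr_populated == 0 && css->dying && !css->finished)
		sketch_kill_finish(css);
}
```

In this model a css with two remaining tasks is marked dying right away but finishes only after the second sketch_task_exit() call.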

cgroup_lock_and_drain_offline()'s wait predicate switches from
percpu_ref_is_dying() to css_is_dying(). CSS_DYING is set by kill_css_sync()
and is a strict superset of percpu_ref_is_dying. Without this change, a +cpu
re-enable after a deferred -cpu disable would skip the drain (percpu_ref
isn't killed yet) and observe the still-CSS_DYING css through cgroup_css(),
treating it as live.

Signed-off-by: Tejun Heo <tj@kernel.org>

filelock: add an inode_lease_ignore_mask helper

Add a new routine that returns a mask of all dir change events that are
currently ignored by any leases. nfsd will use this to determine how to
configure the fsnotify_mark mask.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-4-5a0780ba9def@kernel.org
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

filelock: add a tracepoint to start of break_lease()

...mostly to show the LEASE_BREAK_* flags.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-3-5a0780ba9def@kernel.org
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

filelock: add support for ignoring deleg breaks for dir change events

If an NFS client requests a directory delegation with a notification
bitmask covering directory change events, the server shouldn't recall
the delegation. Instead the client will be notified of the change after
the fact.

Add support for ignoring lease breaks on directory changes. Add a new
flags parameter to try_break_deleg() and teach __break_lease how to
ignore certain types of delegation break events.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-2-5a0780ba9def@kernel.org
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"

The break_lease_block tracepoint currently just shows the type of
"new_fl", which we can predict from the "flags" value. Switch it to
display info about "fl" instead, as that's the file_lease on which the
code is blocking.

For trace_break_lease_unblock(), pass it a NULL pointer. "fl" may have
been freed by that point, and passing it the info in new_fl is
deceptive.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260428-dir-deleg-v3-1-5a0780ba9def@kernel.org
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

cgroup: Add per-subsys-css kill_css_finish deferral

93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir until cgroup is
depopulated") deferred kill_css_finish() at the cgroup level: rmdir waits
for the entire cgroup's populated count to drop to zero, then fires
kill_css_finish() on every subsystem css at once. Replace that with
per-subsys-css deferral. Each subsystem css now tracks its own hierarchical
populated count and independently defers its kill_css_finish() until its own
subtree drains.

The rmdir-race fix carries through unchanged in shape. The dying css's
->css_offline() still waits until no PF_EXITING task references it, and v2's
cgroup-level machinery goes away.

cgroup_apply_control_disable() has the same race shape (PF_EXITING tasks
pinning a css whose ->css_offline() is about to run) and stays synchronous
here. This patch lays the groundwork for fixing it - per-cgroup waiting
can't gate one subsys css being killed while the rest of the cgroup stays
live, but per-css can.

Subtree-wide invariant preserved: a dying ancestor css stays populated
through nr_populated_children until every dying descendant's task drains, so
the walker fires the ancestor's kill_finish_work only after all descendants
have drained.

Add paired smp_mb()s in kill_css_sync() and css_update_populated() to fence
the StoreLoad on (CSS_DYING, populated counter), guaranteeing that either
the walker queues kill_finish_work or the caller fires synchronously.
cgroup_destroy_locked() was implicitly fenced by an unrelated css_set_lock
pair; cgroup_apply_control_disable() in the next patch is not.

Signed-off-by: Tejun Heo <tj@kernel.org>

cgroup: Move populated counters to cgroup_subsys_state

Later patches replace the cgroup-level finish_destroy_work deferral added
by 93618edf7538 ("cgroup: Defer css percpu_ref kill on rmdir until cgroup
is depopulated") with a per-subsys-css deferral. That needs each subsystem
css to track its own populated count. Move the populated counters from
cgroup onto cgroup_subsys_state. cgroup->self is itself a
cgroup_subsys_state and self.parent walks the same chain as cgroup_parent(),
so cgroup_update_populated() generalizes to a single css_update_populated()
taking a css. The cgroup-side bookkeeping runs only when the walk started
from a self css.

Keep nr_populated_{domain,threaded}_children on cgroup. Both sum to
self.nr_populated_children, but they stay as dedicated fields to allow readers
like cgroup_can_be_thread_root() unlocked access.

css_set_update_populated() also walks the per-subsys-css chain so each
subsystem css's hierarchical populated count is maintained. No reader
consumes those counts yet.

Signed-off-by: Tejun Heo <tj@kernel.org>

cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE

cgroup_update_populated() updates nr_populated_csets,
nr_populated_domain_children, and nr_populated_threaded_children under
css_set_lock, but cgroup_has_tasks(), cgroup_is_populated(), and
cgroup_can_be_thread_root() read them without holding it. Use
READ_ONCE/WRITE_ONCE.
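
The annotation pattern can be sketched in userspace as follows. The macros are simplified stand-ins for the kernel's READ_ONCE/WRITE_ONCE (which do more type checking), they rely on the GCC/Clang `__typeof__` extension, and the struct and function names are hypothetical.

```c
/* Simplified stand-ins: force a single volatile access to the variable. */
#define SKETCH_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct sketch_cgroup {
	int nr_populated_csets;
};

/* Writer: in the real code this runs with css_set_lock held. */
static void sketch_update_populated(struct sketch_cgroup *cgrp, int delta)
{
	SKETCH_WRITE_ONCE(cgrp->nr_populated_csets,
			  cgrp->nr_populated_csets + delta);
}

/* Lockless reader: the annotated load cannot be torn or refetched. */
static int sketch_has_tasks(struct sketch_cgroup *cgrp)
{
	return SKETCH_READ_ONCE(cgrp->nr_populated_csets) > 0;
}
```

The writer's plain read of the counter is fine because writers are serialized by the lock; only the store and the lockless loads need the annotation.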

Signed-off-by: Tejun Heo <tj@kernel.org>

cgroup: Inline cgroup_has_tasks() in cgroup.h

cpuset reads cs->css.cgroup->nr_populated_csets directly in two places to
test whether a cgroup has tasks. cgroup.c already has a matching helper,
cgroup_has_tasks(). Move it to cgroup.h as static inline and use that
instead. This is to prepare for relocation of cgroup->nr_populated_csets. No
semantic change.

Signed-off-by: Tejun Heo <tj@kernel.org>

PCI: keembay: Use common mode field in struct dw_pcie

Remove the redundant mode field from struct keembay_pcie and use the
existing mode field in struct dw_pcie instead.

This avoids duplication and prevents potential inconsistencies between
the two mode fields.

Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260501161010.71688-5-18255117159@163.com

PCI: dwc: Use common mode field in struct dw_pcie

Remove the redundant mode field from struct dw_plat_pcie and use the
existing mode field in struct dw_pcie instead.

This avoids duplication and prevents potential inconsistencies between
the two mode fields.

Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260501161010.71688-4-18255117159@163.com

PCI: artpec6: Use common mode field in struct dw_pcie

Remove the redundant mode field from struct artpec6_pcie and use the
existing mode field in struct dw_pcie instead.

This avoids duplication and prevents potential inconsistencies between
the two mode fields.

Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260501161010.71688-3-18255117159@163.com

PCI: dra7xx: Use common mode field in struct dw_pcie

Remove the redundant mode field from struct dra7xx_pcie and use the
existing mode field in struct dw_pcie instead.

This avoids duplication and prevents potential inconsistencies between
the two mode fields.

Signed-off-by: Hans Zhang <18255117159@163.com>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260501161010.71688-2-18255117159@163.com