git.ipfire.org Git - thirdparty/kernel/stable.git/log

net: openvswitch: Avoid releasing netdev before teardown completes

The patch cited in the Fixes tag below changed the teardown code for
OVS ports to no longer unconditionally take the RTNL. After this change,
the netdev_destroy() callback can proceed immediately to the call_rcu()
invocation if the IFF_OVS_DATAPATH flag is already cleared on the
netdev.

The ovs_netdev_detach_dev() function clears the flag before completing
the unregistration, and if it gets preempted after clearing the flag (as
can happen on an -rt kernel), netdev_destroy() can complete and the
device can be freed before the unregistration completes. This leads to a
splat like:

[  998.393867] Oops: general protection fault, probably for non-canonical address 0xff00000001000239: 0000 [#1] SMP PTI
[  998.393877] CPU: 42 UID: 0 PID: 55177 Comm: ip Kdump: loaded Not tainted 6.12.0-211.1.1.el10_2.x86_64+rt #1 PREEMPT_RT
[  998.393886] Hardware name: Dell Inc. PowerEdge R740/0JMK61, BIOS 2.24.0 03/27/2025
[  998.393889] RIP: 0010:dev_set_promiscuity+0x8d/0xa0
[  998.393901] Code: 00 00 75 d8 48 8b 53 08 48 83 ba b0 02 00 00 00 75 ca 48 83 c4 08 5b c3 cc cc cc cc 48 83 bf 48 09 00 00 00 75 91 48 8b 47 08 <48> 83 b8 b0 02 00 00 00 74 97 eb 81 0f 1f 80 00 00 00 00 90 90 90
[  998.393906] RSP: 0018:ffffce5864a5f6a0 EFLAGS: 00010246
[  998.393912] RAX: ff00000000ffff89 RBX: ffff894d0adf5a05 RCX: 0000000000000000
[  998.393917] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffff894d0adf5a05
[  998.393921] RBP: ffff894d19252000 R08: ffff894d19252000 R09: 0000000000000000
[  998.393924] R10: ffff894d19252000 R11: ffff894d192521b8 R12: 0000000000000006
[  998.393927] R13: ffffce5864a5f738 R14: 00000000ffffffe2 R15: 0000000000000000
[  998.393931] FS:  00007fad61971800(0000) GS:ffff894cc0140000(0000) knlGS:0000000000000000
[  998.393936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  998.393940] CR2: 000055df0a2a6e40 CR3: 000000011c7fe003 CR4: 00000000007726f0
[  998.393944] PKRU: 55555554
[  998.393946] Call Trace:
[  998.393949]  <TASK>
[  998.393952]  ? show_trace_log_lvl+0x1b0/0x2f0
[  998.393961]  ? show_trace_log_lvl+0x1b0/0x2f0
[  998.393975]  ? dp_device_event+0x41/0x80 [openvswitch]
[  998.394009]  ? __die_body.cold+0x8/0x12
[  998.394016]  ? die_addr+0x3c/0x60
[  998.394027]  ? exc_general_protection+0x16d/0x390
[  998.394042]  ? asm_exc_general_protection+0x26/0x30
[  998.394058]  ? dev_set_promiscuity+0x8d/0xa0
[  998.394066]  ? ovs_netdev_detach_dev+0x3a/0x80 [openvswitch]
[  998.394092]  dp_device_event+0x41/0x80 [openvswitch]
[  998.394102]  notifier_call_chain+0x5a/0xd0
[  998.394106]  unregister_netdevice_many_notify+0x51b/0xa60
[  998.394110]  rtnl_dellink+0x169/0x3e0
[  998.394121]  ? rt_mutex_slowlock.constprop.0+0x95/0xd0
[  998.394125]  rtnetlink_rcv_msg+0x142/0x3f0
[  998.394128]  ? avc_has_perm_noaudit+0x69/0xf0
[  998.394130]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[  998.394132]  netlink_rcv_skb+0x50/0x100
[  998.394138]  netlink_unicast+0x292/0x3f0
[  998.394141]  netlink_sendmsg+0x21b/0x470
[  998.394145]  ____sys_sendmsg+0x39d/0x3d0
[  998.394149]  ___sys_sendmsg+0x9a/0xe0
[  998.394156]  __sys_sendmsg+0x7a/0xd0
[  998.394160]  do_syscall_64+0x7f/0x170
[  998.394162]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  998.394165] RIP: 0033:0x7fad61bf4724
[  998.394188] Code: 89 02 b8 ff ff ff ff eb bb 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d c5 e9 0c 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89
[  998.394189] RSP: 002b:00007ffd7e2f7cb8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[  998.394191] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fad61bf4724
[  998.394193] RDX: 0000000000000000 RSI: 00007ffd7e2f7d20 RDI: 0000000000000003
[  998.394194] RBP: 00007ffd7e2f7d90 R08: 0000000000000010 R09: 000000000000003f
[  998.394195] R10: 000055df11558010 R11: 0000000000000202 R12: 00007ffd7e2f8380
[  998.394196] R13: 0000000069b233d7 R14: 000055df0a256040 R15: 0000000000000000
[  998.394200]  </TASK>

To fix this, reorder the operations in ovs_netdev_detach_dev() to only
clear the flag after completing the other operations, and introduce an
smp_wmb() to make the ordering requirement explicit. The smp_wmb() is
paired with a full smp_mb() in netdev_destroy() to make sure the
call_rcu() invocation does not happen before the unregister operations
are visible.

Reported-by: Minxi Hou <mhou@redhat.com>
Tested-by: Minxi Hou <mhou@redhat.com>
Fixes: 549822767630 ("net: openvswitch: Avoid needlessly taking the RTNL on vport destroy")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260318155554.1133405-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-drv-net-driver-tests-for-hw-gro'

Jakub Kicinski says:

====================
selftests: drv-net: driver tests for HW GRO

Add tests for HW GRO stats, packet ordering and depth.

The ynltool and bnxt patches from v2 were applied separately.
====================

Link: https://patch.msgid.link/20260318033819.1469350-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: gro: add a test for GRO depth

Reuse the long sequence test to max out the GRO contexts.
Repeat for a single queue, 8 queues, and default number
of queues but flow steering to just one.

The SW GRO's capacity should be around 64 per queue
(8 buckets, up to 8 skbs in a chain).

Link: https://patch.msgid.link/20260318033819.1469350-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: gro: add test for packet ordering

Add a test to check if the NIC reorders packets if the hit GRO.

Link: https://patch.msgid.link/20260318033819.1469350-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: gro: test GRO stats

Test accuracy of GRO stats. We want to cover two potentially tricky
cases:
- single segment GRO
- packets which were eligible but didn't get GRO'd

The first case is trivial, teach gro.c to send one packet, and check
GRO stats didn't move.

Second case requires gro.c to send a lot of flows expecting the NIC
to run out of GRO flow capacity.

To avoid system traffic noise we steer the packets to a dedicated
queue and operate on qstat.

Link: https://patch.msgid.link/20260318033819.1469350-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: gro: use SO_TXTIME to schedule packets together

Longer packet sequence tests are quite flaky when the test is run
over a real network. Try to avoid at least the jitter on the sender
side by scheduling all the packets to be sent at once using SO_TXTIME.
Use hardcoded tx time of 5msec in the future. In my test increasing
this time past 2msec makes no difference so 5msec is plenty of margin.
Since we now expect more output buffering make sure to raise SNDBUF.

Note that this is an opportunistic reliability improvement which
will only work if the qdisc can schedule Tx time for us (fq).
Fiddling with qdisc config was deemed too complex, so it's not
part of the patch.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260318033819.1469350-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: give HW stats sync time extra 25% of margin

There are transient failures for devices which update stats
periodically, especially if it's the FW DMA'ing the stats
rather than host periodic work querying the FW. Wait 25%
longer than strictly necessary.

For devices which don't report stats-block-usecs we retain
25 msec as the default wait time (0.025sec == 20,000usec * 1.25).

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260318033819.1469350-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: move gro to lib for HW vs SW reuse

The gro.c packet sender is used for SW testing but bulk of incoming
new tests will be HW-specific. So it's better to put them under
drivers/net/hw/, to avoid tip-toeing around netdevsim. Move gro.c
to lib so we can reuse it.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260318033819.1469350-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfc: nci: fix circular locking dependency in nci_close_device

nci_close_device() flushes rx_wq and tx_wq while holding req_lock.
This causes a circular locking dependency because nci_rx_work()
running on rx_wq can end up taking req_lock too:

  nci_rx_work -> nci_rx_data_packet -> nci_data_exchange_complete
    -> __sk_destruct -> rawsock_destruct -> nfc_deactivate_target
    -> nci_deactivate_target -> nci_request -> mutex_lock(&ndev->req_lock)

Move the flush of rx_wq after req_lock has been released.
This should safe (I think) because NCI_UP has already been cleared
and the transport is closed, so the work will see it and return
-ENETDOWN.

NIPA has been hitting this running the nci selftest with a debug
kernel on roughly 4% of the runs.

Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation")
Reviewed-by: Ian Ray <ian.ray@gehealthcare.com>
Link: https://patch.msgid.link/20260317193334.988609-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-net-2026-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

- hci_ll: Fix firmware leak on error path
- hci_sync: annotate data-races around hdev->req_status
- L2CAP: Fix null-ptr-deref on l2cap_sock_ready_cb
- L2CAP: Validate PDU length before reading SDU length in l2cap_ecred_data_rcv()
- L2CAP: Fix regressions caused by reusing ident
- L2CAP: Fix stack-out-of-bounds read in l2cap_ecred_conn_req
- MGMT: Fix dangling pointer on mgmt_add_adv_patterns_monitor_complete
- SCO: Fix use-after-free in sco_recv_frame() due to missing sock_hold

* tag 'for-net-2026-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: L2CAP: Fix regressions caused by reusing ident
  Bluetooth: L2CAP: Fix null-ptr-deref on l2cap_sock_ready_cb
  Bluetooth: hci_ll: Fix firmware leak on error path
  Bluetooth: hci_sync: annotate data-races around hdev->req_status
  Bluetooth: MGMT: Fix dangling pointer on mgmt_add_adv_patterns_monitor_complete
  Bluetooth: SCO: Fix use-after-free in sco_recv_frame() due to missing sock_hold
  Bluetooth: L2CAP: Validate PDU length before reading SDU length in l2cap_ecred_data_rcv()
  Bluetooth: L2CAP: Fix stack-out-of-bounds read in l2cap_ecred_conn_req
====================

Link: https://patch.msgid.link/20260319190455.135302-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'parisc-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc fix from Helge Deller:
"Fix for the cacheflush() syscall which had D/I caches mixed up"

* tag 'parisc-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Flush correct cache in cacheflush() syscall

Merge tag 'pci-v7.0-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci fixes from Bjorn Helgaas:

- Create pwrctrl devices only for DT nodes below a PCI controller that
   describe PCI devices and are related to a power supply; this prevents
   waiting indefinitely for pwrctrl drivers that will never probe
   (Manivannan Sadhasivam)

- Restore endpoint BAR mapping on subrange setup failure to make
   selftest reliable (Koichiro Den)

* tag 'pci-v7.0-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI: endpoint: pci-epf-test: Roll back BAR mapping when subrange setup fails
  PCI/pwrctrl: Create pwrctrl devices only for PCI device nodes
  PCI/pwrctrl: Ensure that remote endpoint node parent has supply requirement

i2c: npcm7xx: Use NULL instead of 0 for pointer

Pointers should use NULL instead of explicit '0', as pointed out by
sparse:

i2c-npcm7xx.c:1387:61: warning: Using plain integer as NULL pointer

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Tali Perry <tali.perry1@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260216085825.70568-2-krzysztof.kozlowski@oss.qualcomm.com

hsi: hsi_core: use kzalloc_flex

Simplifies allocations by using a flexible array member in this struct.

Add __counted_by to get extra runtime analysis.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260318191037.5661-1-rosenp@gmail.com
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>

i2c: pxa: defer reset on Armada 3700 when recovery is used

The I2C communication is completely broken on the Armada 3700 platform
since commit 0b01392c18b9 ("i2c: pxa: move to generic GPIO recovery").

For example, on the Methode uDPU board, probing of the two onboard
temperature sensors fails ...

  [    7.271713] i2c i2c-0: using pinctrl states for GPIO recovery
  [    7.277503] i2c i2c-0:  PXA I2C adapter
  [    7.282199] i2c i2c-1: using pinctrl states for GPIO recovery
  [    7.288241] i2c i2c-1:  PXA I2C adapter
  [    7.292947] sfp sfp-eth1: Host maximum power 3.0W
  [    7.299614] sfp sfp-eth0: Host maximum power 3.0W
  [    7.308178] lm75 1-0048: supply vs not found, using dummy regulator
  [   32.489631] lm75 1-0048: probe with driver lm75 failed with error -121
  [   32.496833] lm75 1-0049: supply vs not found, using dummy regulator
  [   82.890614] lm75 1-0049: probe with driver lm75 failed with error -121

... and accessing the plugged-in SFP modules also does not work:

  [  511.298537] sfp sfp-eth1: please wait, module slow to respond
  [  536.488530] sfp sfp-eth0: please wait, module slow to respond
  ...
  [ 1065.688536] sfp sfp-eth1: failed to read EEPROM: -EREMOTEIO
  [ 1090.888532] sfp sfp-eth0: failed to read EEPROM: -EREMOTEIO

After a discussion [1], there was an attempt to fix the problem by
reverting the offending change by commit 7b211c767121 ("Revert "i2c:
pxa: move to generic GPIO recovery""), but that only helped to fix
the issue in the 6.1.y stable tree. The reason behind the partial succes
is that there was another change in commit 20cb3fce4d60 ("i2c: Set i2c
pinctrl recovery info from it's device pinctrl") in the 6.3-rc1 cycle
which broke things further.

The cause of the problem is the same in case of both offending commits
mentioned above. Namely, the I2C core code changes the pinctrl state to
GPIO while running the recovery initialization code. Although the PXA
specific initialization also does this, but the key difference is that
it happens before the controller is getting enabled in i2c_pxa_reset(),
whereas in the case of the generic initialization it happens after that.

Change the code to reset the controller only before the first transfer
instead of before registering the controller. This ensures that the
controller is not enabled at the time when the generic recovery code
performs the pinctrl state changes, thus avoids the problem described
above.

As the result this change restores the original behaviour, which in
turn makes the I2C communication to work again as it can be seen from
the following log:

  [    7.363250] i2c i2c-0: using pinctrl states for GPIO recovery
  [    7.369041] i2c i2c-0:  PXA I2C adapter
  [    7.373673] i2c i2c-1: using pinctrl states for GPIO recovery
  [    7.379742] i2c i2c-1:  PXA I2C adapter
  [    7.384506] sfp sfp-eth1: Host maximum power 3.0W
  [    7.393013] sfp sfp-eth0: Host maximum power 3.0W
  [    7.399266] lm75 1-0048: supply vs not found, using dummy regulator
  [    7.407257] hwmon hwmon0: temp1_input not attached to any thermal zone
  [    7.413863] lm75 1-0048: hwmon0: sensor 'tmp75c'
  [    7.418746] lm75 1-0049: supply vs not found, using dummy regulator
  [    7.426371] hwmon hwmon1: temp1_input not attached to any thermal zone
  [    7.432972] lm75 1-0049: hwmon1: sensor 'tmp75c'
  [    7.755092] sfp sfp-eth1: module MENTECHOPTO      POS22-LDCC-KR    rev 1.0  sn MNC208U90009     dc 200828
  [    7.764997] mvneta d0040000.ethernet eth1: unsupported SFP module: no common interface modes
  [    7.785362] sfp sfp-eth0: module Mikrotik         S-RJ01           rev 1.0  sn 61B103C55C58     dc 201022
  [    7.803426] hwmon hwmon2: temp1_input not attached to any thermal zone

Link: https://lore.kernel.org/r/20230926160255.330417-1-robert.marko@sartura.hr
Cc: stable@vger.kernel.org # 6.3+
Fixes: 20cb3fce4d60 ("i2c: Set i2c pinctrl recovery info from it's device pinctrl")
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260226-i2c-pxa-fix-i2c-communication-v4-1-797a091dae87@gmail.com

docs: symbol-namespaces: mention sysfs attribute

Reference the new /sys/module/*/import_ns sysfs attribute in docs as an
alternative to modinfo for inspecting imported namespaces of loaded
modules.

Signed-off-by: Nicholas Sielicki <linux@opensource.nslick.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>

ionic: fix persistent MAC address override on PF

The use of IONIC_CMD_LIF_SETATTR in the MAC address update path causes
the ionic firmware to update the LIF's identity in its persistent state.
Since the firmware state is maintained across host warm boots and driver
reloads, any MAC change on the Physical Function (PF) becomes "sticky.

This is problematic because it causes ethtool -P to report the
user-configured MAC as the permanent factory address, which breaks
system management tools that rely on a stable hardware identity.

While Virtual Functions (VFs) need this hardware-level programming to
properly handle MAC assignments in guest environments, the PF should
maintain standard transient behavior. This patch gates the
ionic_program_mac call using is_virtfn so that PF MAC changes remain
local to the netdev filters and do not overwrite the firmware's
permanent identity block.

Fixes: 19058be7c48c ("ionic: VF initial random MAC address if no assigned mac")
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260317170806.35390-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

i2c: fsi: Fix a potential leak in fsi_i2c_probe()

In the commit in Fixes:, when the code has been updated to use an explicit
for loop, instead of for_each_available_child_of_node(), the assumption
that a reference to a device_node structure would be released at each
iteration has been broken.

Now, an explicit of_node_put() is needed to release the reference.

Fixes: 095561f476ab ("i2c: fsi: Create busses for all ports")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: <stable@vger.kernel.org> # v5.3+
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/fd805c39f8de51edf303856103d782138a1633c8.1772382022.git.christophe.jaillet@wanadoo.fr

module: expose imported namespaces via sysfs

Add /sys/module/*/import_ns to expose imported namespaces for
currently loaded modules. The file contains one namespace per line and
only exists for modules that import at least one namespace.

Previously, the only way for userspace to inspect the symbol
namespaces a module imports is to locate the .ko on disk and invoke
modinfo(8) to decompress/parse the metadata. The kernel validated
namespaces at load time, but it was otherwise discarded.

Exposing this data via sysfs provides a runtime mechanism to verify
which namespaces are being used by modules. For example, this allows
userspace to audit driver API access in Android GKI, which uses symbol
namespaces to restrict vendor drivers from using specific kernel
interfaces (e.g., direct filesystem access).

Signed-off-by: Nicholas Sielicki <linux@opensource.nslick.com>
[Sami: Updated the commit message to explain motivation.]
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>

i2c: cp2615: fix serial string NULL-deref at probe

The cp2615 driver uses the USB device serial string as the i2c adapter
name but does not make sure that the string exists.

Verify that the device has a serial number before accessing it to avoid
triggering a NULL-pointer dereference (e.g. with malicious devices).

Fixes: 4a7695429ead ("i2c: cp2615: add i2c driver for Silicon Labs' CP2615 Digital Audio Bridge")
Cc: stable@vger.kernel.org # 5.13
Cc: Bence Csókás <bence98@sch.bme.hu>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Bence Csókás <bence98@sch.bme.hu>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260309075016.25612-1-johan@kernel.org

cxl: Adjust the startup priority of cxl_pmem to be higher than that of cxl_acpi

During the cxl_acpi probe process, it checks whether the cxl_nvb device
and driver have been attached. Currently, the startup priority of the
cxl_pmem driver is lower than that of the cxl_acpi driver. At this point,
the cxl_nvb driver has not yet been registered on the cxl_bus, causing
the attachment check to fail. This results in a failure to add the root
nvdimm bridge, leading to a cxl_acpi probe failure and ultimately
affecting the subsequent loading of cxl drivers. As a consequence, only
one mem device object exists on the cxl_bus, while the cxl_port device
objects and decoder device objects are missing.

The solution is to raise the startup priority of cxl_pmem to be higher
than that of cxl_acpi, ensuring that the cxl_pmem driver is registered
before the aforementioned attachment check occurs.

Co-developed-by: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
Fixes: e7e222ad73d9 ("cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko")
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260319074535.1709250-1-cuichao1753@phytium.com.cn
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

x86/cpu: Remove LASS restriction on vsyscall emulation

Vsyscall emulation has two modes of operation: XONLY and EMULATE. The
default XONLY mode is now supported with a LASS-triggered #GP. OTOH,
LASS is disabled if someone requests the deprecated EMULATE mode via the
vsyscall=emulate command line option. So, remove the restriction on LASS
when the overall vsyscall emulation support is compiled in.

As a result, there is no need for setup_lass() anymore. LASS is enabled
by default through a late_initcall().

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Reviewed-by:
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260309181029.398498-6-sohil.mehta@intel.com

x86/vsyscall: Disable LASS if vsyscall mode is set to EMULATE

The EMULATE mode of vsyscall maps the vsyscall page with a high kernel
address directly into user address space. Reading the vsyscall page in
EMULATE mode would cause LASS to trigger a #GP.

Fixing the LASS violation in EMULATE mode would require complex
instruction decoding because the resulting #GP does include the
necessary error information, and the vsyscall address is not
readily available in the RIP.

The EMULATE mode has been deprecated since 2022 and can only be enabled
using the command line parameter vsyscall=emulate. See commit
bf00745e7791 ("x86/vsyscall: Remove CONFIG_LEGACY_VSYSCALL_EMULATE") for
details. At this point, no one is expected to be using this insecure
mode. The rare usages that need it obviously do not care about security.

Disable LASS when EMULATE mode is requested to avoid breaking legacy
user software. Also, update the vsyscall documentation to reflect this.
LASS will only be supported if vsyscall mode is set to XONLY (default)
or NONE.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260309181029.398498-5-sohil.mehta@intel.com

x86/vsyscall: Restore vsyscall=xonly mode under LASS

Background
==========
The vsyscall page is located in the high/kernel part of the address
space. Prior to LASS, a vsyscall page access from userspace would always
generate a #PF. The kernel emulates the accesses in the #PF handler and
returns the appropriate values to userspace.

Vsyscall emulation has two modes of operation, specified by the
vsyscall={xonly, emulate} kernel command line option. The vsyscall page
behaves as execute-only in XONLY mode or read-execute in EMULATE mode.
XONLY mode is the default and the only one expected to be commonly used.
The EMULATE mode has been deprecated since 2022 and is considered
insecure.

With LASS, a vsyscall page access triggers a #GP instead of a #PF.
Currently, LASS is only enabled when all vsyscall modes are disabled.

LASS with XONLY mode
====================
Now add support for LASS specifically with XONLY vsyscall emulation. For
XONLY mode, all that is needed is the faulting RIP, which is trivially
available regardless of the type of fault. Reuse the #PF emulation code
during the #GP when the fault address points to the vsyscall page.

As multiple fault handlers will now be using the emulation code, add a
sanity check to ensure that the fault truly happened in 64-bit user
mode.

LASS with EMULATE mode
======================
Supporting vsyscall=emulate with LASS is much harder because the #GP
doesn't provide enough error information (such as PFEC and CR2 as in
case of a #PF). So, complex instruction decoding would be required to
emulate this mode in the #GP handler.

This isn't worth the effort as remaining users of EMULATE mode can be
reasonably assumed to be niche users, who are already trading off
security for compatibility. LASS and vsyscall=emulate will be kept
mutually exclusive for simplicity.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260309181029.398498-4-sohil.mehta@intel.com

x86/traps: Consolidate user fixups in the #GP handler

Move the UMIP exception fixup under the common "if (user_mode(regs))"
condition where the rest of user mode fixups reside. Also, move the UMIP
feature check into its fixup function to keep the calling code
consistent and clean.

No functional change intended.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260309181029.398498-3-sohil.mehta@intel.com

x86/vsyscall: Reorganize the page fault emulation code

With LASS, vsyscall page accesses will cause a #GP instead of a #PF.
Separate out the core vsyscall emulation code from the #PF specific
handling in preparation for the upcoming #GP emulation.

No functional change intended.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Link: https://patch.msgid.link/20260309181029.398498-2-sohil.mehta@intel.com

perf evlist: Improve default event for s390

Frame pointer callchains are not supported on s390 and dwarf
callchains are only supported on software events.

Switch the default event from the hardware 'cycles' event to the
software 'cpu-clock' or 'task-clock' on s390 if callchains are
enabled. Move some of the target initialization earlier in builtin-top
and builtin-record, so it is ready for use by evlist__new_default.

If frame pointer callchains are requested on s390 show a
warning. Modify the '-g' option of `perf top` and `perf record` to
default to dwarf callchains on s390.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf callchain: Refactor callchain option parsing

record_opts__parse_callchain is shared by builtin-record and
builtin-trace, it is declared in callchain.h. Move the declaration to
callchain.c for consistency with the header. In other cases make the
option callback a small static stub that then calls into callchain.c.

Make the no argument '-g' callchain option just a short-cut for
'--call-graph fp' so that there is consistency in how the arguments
are handled. This requires the const char* string to be strdup-ed in
__parse_callchain_report_opt. For consistency also make
parse_callchain_record use strdup and remove some unnecessary
casts. Also, be more explicit about the '-g' behavior if there is a
.perfconfig file setting.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf evsel: Constify option arguments to config functions

The options are used to configure the evsel but are not themselves
configured. Make the arguments const to better capture this.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf target: Constify simple check functions

Allow the target to be const in callers.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf evsel: Improve falling back from cycles

Switch to using evsel__match rather than comparing perf_event_attr
values, this is robust on hybrid architectures.
Ensure evsel->pmu matches the evsel->core.attr.
Remove exclude bits that get set in other fallback attempts when
switching the event.
Log the event name with modifiers when switching the event on fallback.

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf dwarf-aux: Collect all variable locations for insn tracking

Previously, only the first DWARF location entry was collected for each
variable. This was based on the assumption that instruction tracking
could reconstruct the remaining state. However, variables may have
different locations across different address ranges, and relying solely
on instruction tracking can miss valid type information.

Change __die_collect_vars_cb() to iterate over all location entries
using dwarf_getlocations() in a loop. This ensures that variables with
multiple location ranges are properly tracked, improving type coverage.

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf annotate-data: Use DWARF location ranges to preserve reg state

When a function call occurs, caller-saved registers are typically
invalidated since the callee may clobber them. However, DWARF debug info
provides location ranges that indicate exactly where a variable is valid
in a register.

Track the DWARF location range end address in type_state_reg and use it
to determine if a caller-saved register should be preserved across a
call. If the current call address is within the DWARF-specified lifetime
of the variable, keep the register state valid instead of invalidating
it.

This improves type annotation for code where the compiler knows a
register value survives across calls (e.g., when the callee is known not
to clobber certain registers or when the value is reloaded after the
call at the same logical location).

Changes:
- Add `end` and `has_range` fields to die_var_type to capture DWARF
location range information
- Add `lifetime_active` and `lifetime_end` fields to type_state_reg
- Check location lifetime before invalidating caller-saved registers

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf annotate-data: Invalidate caller-saved regs for all calls

Previously, the x86 call handler returned early without invalidating
caller-saved registers when the call target symbol could not be resolved
(func == NULL). This violated the ABI which requires caller-saved
registers to be considered clobbered after any call instruction.

Fix this by:
1. Always invalidating caller-saved registers for any call instruction
(except __fentry__ which preserves registers)
2. Using dl->ops.target.name as fallback when func->name is unavailable,
allowing return type lookup for more call targets

This is a conservative change that may reduce type coverage for indirect
calls (e.g., callq *(%rax)) where we cannot determine the return type
but it ensures correctness.

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf annotate-data: Add invalidate_reg_state() helper for x86

Add a helper function to consistently invalidate register state instead
of field assignments. This ensures kind, ok, and copied_from are all
properly cleared when a register becomes invalid.

The helper sets:
- kind = TSR_KIND_INVALID
- ok = false
- copied_from = -1

Replace all invalidation patterns with calls to this helper. No
functional change and this removes some incorrect annotations that were
caused by incomplete invalidation (e.g. a obsolete copied_from from an
invalidated register).

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf annotate-data: Handle global variable access with const register

When a register holds a constant value (TSR_KIND_CONST) and is used with
a negative offset, treat it as a potential global variable access
instead of falling through to CFA (frame) handling.

This fixes cases like array indexing with computed offsets:

movzbl -0x7d72725a(%rax), %eax # array[%rax]

Where %rax contains a computed index and the negative offset points to a
global array. Previously this fell through to the CFA path which doesn't
handle global variables, resulting in "no type information".

The fix redirects such accesses to check_kernel which calls
get_global_var_type() to resolve the type from the global variable
cache. This is only done for kernel DSOs since the pattern relies on
kernel-specific global variable resolution. We could also treat
registers with integer types to the global variable path, but this
requires more changes.

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf annotate-data: Collect global variables without name

Previously, global_var__collect() required get_global_var_info() to
succeed (i.e., the variable must have a symbol name) before caching a
global variable. This prevented variables that exist in DWARF but lack
symbol table coverage from being cached.

Remove the symbol table requirement since DW_OP_addr already provides
the variable's address directly from DWARF. The symbol table lookup is
now optional to obtain the variable name when available.

Also remove the var_offset != 0 check, which was intended to skip
variables where the access address doesn't match the symbol start. The
symbol table lookup is now optional and I found removing this check has
no effect on the annotation results for both kernel and userspace
programs.

Test results show improved annotation coverage especially for userspace
programs with RIP-relative addressing instructions.

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf dwarf-aux: Handle array types in die_get_member_type

When a struct member is an array type, die_get_member_type() would stop
iterating since array types weren't handled in the loop. This caused
accesses to array elements within structs to not resolve properly.

Add array type handling by resolving the array to its element type and
calculating the offset within an element using modulo arithmetic

This improves type annotation coverage for struct members that are
arrays.

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf annotate-data: Improve type comparison from different scopes

When comparing types from different scopes, first compare their type
offsets. A larger offset means the field belongs to an outer (enclosing)
struct. This helps resolve cases where a pointer is found in an inner
scope, but a struct containing that pointer exists in an outer scope.
Previously, is_better_type would prefer the pointer type, but the struct
type is actually more complete and should be chosen.

Prefer types from outer scopes when is_better_type cannot determine a
better type. This is a heuristic for the case `struct A { struct B; }`
where A and B have the same size but I think in most cases A is in the
outer scope and should be preferred.

Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf dwarf-aux: Skip check_variable for variable lookup

Both die_find_variable_by_reg and die_find_variable_by_addr call
match_var_offset which already performs sufficient checking and type
matching. The additional check_variable call is redundant, and its
need_pointer logic is only a heuristic. Since DWARF encodes accurate
type information, which match_var_offset verifies, skipping
check_variable improves both coverage and accuracy.

Return the matched type from die_find_variable_by_reg and
die_find_variable_by_addr via the existing `type` field in
find_var_data, removing the need for check_variable in
find_data_type_die.

Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf dwarf-aux: Preserve typedefs in match_var_offset

Preserve typedefs in match_var_offset to match the results by
__die_get_real_type. Also move the (offset == 0) branch after the
is_pointer check to ensure the correct type is used, fixing cases where
an incorrect pointer type was chosen when the access offset was 0.

Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

perf dwarf-aux: Add die_get_pointer_type to get pointer types

When a variable type is wrapped in typedef/qualifiers, callers may need
to first resolve it to the underlying DW_TAG_pointer_type or
DW_TAG_array_type. A simple tag check is not enough and directly calling
__die_get_real_type() can stop at the pointer type (e.g. typedef ->
pointer) instead of the pointee type.

Add die_get_pointer_type() helper that follows typedef/qualifier chains
and returns the underlying pointer DIE. Use it in annotate-data.c so
pointer checks and dereference work correctly for typedef'd pointers.

Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>

PCI: eswin: Add ESWIN PCIe Root Complex driver

Add driver for the ESWIN PCIe Root Complex based on the DesignWare PCIe
core, IP revision 5.96a. The PCIe Gen.3 Root Complex supports data rate of
8 GT/s and x4 lanes, with INTx and MSI interrupt capability.

Signed-off-by: Yu Ning <ningyu@eswincomputing.com>
Signed-off-by: Yanghui Ou <ouyanghui@eswincomputing.com>
Signed-off-by: Senchuan Zhang <zhangsenchuan@eswincomputing.com>
[mani: renamed "EIC7700" to "ESWIN", added maintainers entry, removed async probe]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
[bhelgaas: add driver tag in subject]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260227111808.1996-1-zhangsenchuan@eswincomputing.com

dt-bindings: PCI: eswin: Add ESWIN PCIe Root Complex

Add Device Tree binding documentation for the ESWIN PCIe Root Complex. The
Root Complex is based on Synopsys Designware PCIe IP.

Signed-off-by: Yu Ning <ningyu@eswincomputing.com>
Signed-off-by: Yanghui Ou <ouyanghui@eswincomputing.com>
Signed-off-by: Senchuan Zhang <zhangsenchuan@eswincomputing.com>
[mani: Renamed 'EIC7700' to 'ESWIN']
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
[bhelgaas: add driver tag in subject]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260227111732.1979-1-zhangsenchuan@eswincomputing.com

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-7.0-rc5).

net/netfilter/nft_set_rbtree.c
598adea720b97 ("netfilter: revert nft_set_rbtree: validate open interval overlap")
3aea466a43998 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock")
https://lore.kernel.org/abgaQBpeGstdN4oq@sirena.org.uk

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

io_uring/kbuf: propagate BUF_MORE through early buffer commit path

When io_should_commit() returns true (eg for non-pollable files), buffer
commit happens at buffer selection time and sel->buf_list is set to
NULL. When __io_put_kbufs() generates CQE flags at completion time, it
calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence
cannot determine whether the buffer was consumed or not. This means that
IORING_CQE_F_BUF_MORE is never set for non-pollable input with
incrementally consumed buffers.

Likewise for io_buffers_select(), which always commits upfront and
discards the return value of io_kbuf_commit().

Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early
commit. Then __io_put_kbuf_ring() can check this flag and set
IORING_F_BUF_MORE accordingy.

Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF

For a zero length transfer, io_kbuf_inc_commit() is called with !len.
Since we never enter the while loop to consume the buffers,
io_kbuf_inc_commit() ends up returning true, consuming the buffer. But
if no data was consumed, by definition it cannot have consumed the
buffer. Return false for that case.

Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>

selftests/landlock: Test tsync interruption and cancellation paths

Add tsync_interrupt test to exercise the signal interruption path in
landlock_restrict_sibling_threads().  When a signal interrupts
wait_for_completion_interruptible() while the calling thread waits for
sibling threads to finish credential preparation, the kernel:

1. Sets ERESTARTNOINTR to request a transparent syscall restart.
2. Calls cancel_tsync_works() to opportunistically dequeue task works
   that have not started running yet.
3. Breaks out of the preparation loop, then unblocks remaining
   task works via complete_all() and waits for them to finish.
4. Returns the error, causing abort_creds() in the syscall handler.

Specifically, cancel_tsync_works() in its entirety, the ERESTARTNOINTR
error branch in landlock_restrict_sibling_threads(), and the
abort_creds() error branch in the landlock_restrict_self() syscall
handler are timing-dependent and not exercised by the existing tsync
tests, making code coverage measurements non-deterministic.

The test spawns a signaler thread that rapidly sends SIGUSR1 to the
calling thread while it performs landlock_restrict_self() with
LANDLOCK_RESTRICT_SELF_TSYNC.  Since ERESTARTNOINTR causes a
transparent restart, userspace always sees the syscall succeed.

This is a best-effort coverage test: the interruption path is exercised
when the signal lands during the preparation wait, which depends on
thread scheduling.  The test creates enough idle sibling threads (200)
to ensure multiple serialized waves of credential preparation even on
machines with many cores (e.g., 64), widening the window for the
signaler.  Deterministic coverage would require wrapping the wait call
with ALLOW_ERROR_INJECTION() and using CONFIG_FAIL_FUNCTION.

Test coverage for security/landlock was 90.2% of 2105 lines according to
LLVM 21, and it is now 91.1% of 2105 lines with this new test.

Cc: Günther Noack <gnoack@google.com>
Cc: Justin Suess <utilityemal77@gmail.com>
Cc: Tingmao Wang <m@maowtm.org>
Cc: Yihan Ding <dingyihan@uniontech.com>
Link: https://lore.kernel.org/r/20260310190416.1913908-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>

bpf: Add warning to detect memory leak in bpf_selem_unlink_nofail()

While very unlikely, local storage theoretically may leak memory of the
size of "struct bpf_local_storage" when destroy() fails to grab
local_storage->lock and initializes selem->local_storage before other
racing map_free() see it. Warn the user to allow debugging the issue
instead of leaking the memory silently.

Note that test_maps in bpf selftests already stress tested
bpf_selem_unlink_nofail() by creating 4096 sockets and then immediately
destroying them in multiple threads. With 64 threads, 64 x 4096 socket
local storages were created and destroyed during the test and no warning
in the function were triggered.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://patch.msgid.link/20260318224219.615105-1-ameryhung@gmail.com

smb: client: fix generic/694 due to wrong ->i_blocks

When updating ->i_size, make sure to always update ->i_blocks as well
until we query new allocation size from the server.

generic/694 was failing because smb3_simple_falloc() was missing the
update of ->i_blocks after calling cifs_setsize(). So, fix this by
updating ->i_blocks directly in cifs_setsize(), so all places that
call it doesn't need to worry about updating ->i_blocks later.

Reported-by: Shyam Prasad N <sprasad@microsoft.com>
Closes: https://lore.kernel.org/r/CANT5p=rqgRwaADB=b_PhJkqXjtfq3SFv41SSTXSVEHnuh871pA@mail.gmail.com
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

pinctrl: mediatek: common: Fix probe failure for devices without EINT

Some pinctrl devices like mt6397 or mt6392 don't support EINT at all, but
the mtk_eint_init function is always called and returns -ENODEV, which
then bubbles up and causes probe failure.

To address this only call mtk_eint_init if EINT pins are present.

Tested on Xiaomi Mi Smart Clock x04g (mt6392).

Fixes: e46df235b4e6 ("pinctrl: mediatek: refactor EINT related code for all MediaTek pinctrl can fit")
Signed-off-by: Luca Leonardo Scorcia <l.scorcia@gmail.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>

Bluetooth: L2CAP: Fix regressions caused by reusing ident

This attempt to fix regressions caused by reusing ident which apparently
is not handled well on certain stacks causing the stack to not respond to
requests, so instead of simple returning the first unallocated id this
stores the last used tx_ident and then attempt to use the next until all
available ids are exausted and then cycle starting over to 1.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=221120
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221177
Fixes: 6c3ea155e5ee ("Bluetooth: L2CAP: Fix not tracking outstanding TX ident")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Christian Eggers <ceggers@arri.de>

Bluetooth: L2CAP: Fix null-ptr-deref on l2cap_sock_ready_cb

Before using sk pointer, check if it is null.

Fix the following:

KASAN: null-ptr-deref in range [0x0000000000000260-0x0000000000000267]
CPU: 0 UID: 0 PID: 5985 Comm: kworker/0:5 Not tainted 7.0.0-rc4-00029-ga989fde763f4 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-9.fc43 06/10/2025
Workqueue: events l2cap_info_timeout
RIP: 0010:kasan_byte_accessible+0x12/0x30
Code: 79 ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 40 d6 48 c1 ef 03 48 b8 00 00 00 00 00 fc ff df <0f> b6 04 07 3c 08 0f 92 c0 c3 cc cce
veth0_macvtap: entered promiscuous mode
RSP: 0018:ffffc90006e0f808 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffffffff89746018 RCX: 0000000080000001
RDX: 0000000000000000 RSI: ffffffff89746018 RDI: 000000000000004c
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: dffffc0000000000 R11: ffffffff8aae3e70 R12: 0000000000000000
R13: 0000000000000260 R14: 0000000000000260 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8880983c2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005582615a5008 CR3: 000000007007e000 CR4: 0000000000752ef0
PKRU: 55555554
Call Trace:
  <TASK>
  __kasan_check_byte+0x12/0x40
  lock_acquire+0x79/0x2e0
  lock_sock_nested+0x48/0x100
  ? l2cap_sock_ready_cb+0x46/0x160
  l2cap_sock_ready_cb+0x46/0x160
  l2cap_conn_start+0x779/0xff0
  ? __pfx_l2cap_conn_start+0x10/0x10
  ? l2cap_info_timeout+0x60/0xa0
  ? __pfx___mutex_lock+0x10/0x10
  l2cap_info_timeout+0x68/0xa0
  ? process_scheduled_works+0xa8d/0x18c0
  process_scheduled_works+0xb6e/0x18c0
  ? __pfx_process_scheduled_works+0x10/0x10
  ? assign_work+0x3d5/0x5e0
  worker_thread+0xa53/0xfc0
  kthread+0x388/0x470
  ? __pfx_worker_thread+0x10/0x10
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x51e/0xb90
  ? __pfx_ret_from_fork+0x10/0x10
veth1_macvtap: entered promiscuous mode
  ? __switch_to+0xc7d/0x1450
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
batman_adv: batadv0: Interface activated: batadv_slave_0
batman_adv: batadv0: Interface activated: batadv_slave_1
netdevsim netdevsim7 netdevsim0: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim7 netdevsim1: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim7 netdevsim2: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim7 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0
RIP: 0010:kasan_byte_accessible+0x12/0x30
Code: 79 ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 40 d6 48 c1 ef 03 48 b8 00 00 00 00 00 fc ff df <0f> b6 04 07 3c 08 0f 92 c0 c3 cc cce
ieee80211 phy39: Selected rate control algorithm 'minstrel_ht'
RSP: 0018:ffffc90006e0f808 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffffffff89746018 RCX: 0000000080000001
RDX: 0000000000000000 RSI: ffffffff89746018 RDI: 000000000000004c
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: dffffc0000000000 R11: ffffffff8aae3e70 R12: 0000000000000000
R13: 0000000000000260 R14: 0000000000000260 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8880983c2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7e16139e9c CR3: 000000000e74e000 CR4: 0000000000752ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception

Fixes: 54a59aa2b562 ("Bluetooth: Add l2cap_chan->ops->ready()")
Signed-off-by: Helen Koike <koike@igalia.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: hci_ll: Fix firmware leak on error path

Smatch reports:

drivers/bluetooth/hci_ll.c:587 download_firmware() warn:
'fw' from request_firmware() not released on lines: 544.

In download_firmware(), if request_firmware() succeeds but the returned
firmware content is invalid (no data or zero size), the function returns
without releasing the firmware, resulting in a resource leak.

Fix this by calling release_firmware() before returning when
request_firmware() succeeded but the firmware content is invalid.

Fixes: 371805522f87 ("bluetooth: hci_uart: add LL protocol serdev driver support")
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: hci_sync: annotate data-races around hdev->req_status

__hci_cmd_sync_sk() sets hdev->req_status under hdev->req_lock:

    hdev->req_status = HCI_REQ_PEND;

However, several other functions read or write hdev->req_status without
holding any lock:

  - hci_send_cmd_sync() reads req_status in hci_cmd_work (workqueue)
  - hci_cmd_sync_complete() reads/writes from HCI event completion
  - hci_cmd_sync_cancel() / hci_cmd_sync_cancel_sync() read/write
  - hci_abort_conn() reads in connection abort path

Since __hci_cmd_sync_sk() runs on hdev->req_workqueue while
hci_send_cmd_sync() runs on hdev->workqueue, these are different
workqueues that can execute concurrently on different CPUs. The plain
C accesses constitute a data race.

Add READ_ONCE()/WRITE_ONCE() annotations on all concurrent accesses
to hdev->req_status to prevent potential compiler optimizations that
could affect correctness (e.g., load fusing in the wait_event
condition or store reordering).

Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: MGMT: Fix dangling pointer on mgmt_add_adv_patterns_monitor_complete

This fixes the condition checking so mgmt_pending_valid is executed
whenever status != -ECANCELED otherwise calling mgmt_pending_free(cmd)
would kfree(cmd) without unlinking it from the list first, leaving a
dangling pointer. Any subsequent list traversal (e.g.,
mgmt_pending_foreach during __mgmt_power_off, or another
mgmt_pending_valid call) would dereference freed memory.

Link: https://lore.kernel.org/linux-bluetooth/20260315132013.75ab40c5@kernel.org/T/#m1418f9c82eeff8510c1beaa21cf53af20db96c06
Fixes: 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>

Bluetooth: SCO: Fix use-after-free in sco_recv_frame() due to missing sock_hold

sco_recv_frame() reads conn->sk under sco_conn_lock() but immediately
releases the lock without holding a reference to the socket. A concurrent
close() can free the socket between the lock release and the subsequent
sk->sk_state access, resulting in a use-after-free.

Other functions in the same file (sco_sock_timeout(), sco_conn_del())
correctly use sco_sock_hold() to safely hold a reference under the lock.

Fix by using sco_sock_hold() to take a reference before releasing the
lock, and adding sock_put() on all exit paths.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: L2CAP: Validate PDU length before reading SDU length in l2cap_ecred_data_rcv()

l2cap_ecred_data_rcv() reads the SDU length field from skb->data using
get_unaligned_le16() without first verifying that skb contains at least
L2CAP_SDULEN_SIZE (2) bytes. When skb->len is less than 2, this reads
past the valid data in the skb.

The ERTM reassembly path correctly calls pskb_may_pull() before reading
the SDU length (l2cap_reassemble_sdu, L2CAP_SAR_START case). Apply the
same validation to the Enhanced Credit Based Flow Control data path.

Fixes: aac23bf63659 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

Bluetooth: L2CAP: Fix stack-out-of-bounds read in l2cap_ecred_conn_req

Syzbot reported a KASAN stack-out-of-bounds read in l2cap_build_cmd()
that is triggered by a malformed Enhanced Credit Based Connection Request.

The vulnerability stems from l2cap_ecred_conn_req(). The function allocates
a local stack buffer (`pdu`) designed to hold a maximum of 5 Source Channel
IDs (SCIDs), totaling 18 bytes. When an attacker sends a request with more
than 5 SCIDs, the function calculates `rsp_len` based on this unvalidated
`cmd_len` before checking if the number of SCIDs exceeds
L2CAP_ECRED_MAX_CID.

If the SCID count is too high, the function correctly jumps to the
`response` label to reject the packet, but `rsp_len` retains the
attacker's oversized value. Consequently, l2cap_send_cmd() is instructed
to read past the end of the 18-byte `pdu` buffer, triggering a
KASAN panic.

Fix this by moving the assignment of `rsp_len` to after the `num_scid`
boundary check. If the packet is rejected, `rsp_len` will safely
remain 0, and the error response will only read the 8-byte base header
from the stack.

Fixes: c28d2bff7044 ("Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short")
Reported-by: syzbot+b7f3e7d9a596bf6a63e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b7f3e7d9a596bf6a63e3
Tested-by: syzbot+b7f3e7d9a596bf6a63e3@syzkaller.appspotmail.com
Signed-off-by: Minseo Park <jacob.park.9436@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO

When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
value.

The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
caller that new initial data is available in the migration stream.

If the firmware reports a new initial-data chunk, any previously dirty
bytes in memory are treated as initial bytes, since the caller must read
both sets before reaching the end of the initial-data region.

In this case, the driver issues a new SAVE command to fetch the data and
prepare it for a subsequent read() from userspace.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/mlx5: consider inflight SAVE during PRE_COPY

Consider an inflight SAVE operation during the PRE_COPY phase, so the
caller will wait when no data is currently available but is expected
to arrive.

This enables a follow-up patch to avoid returning -ENOMSG while a new
*initial_bytes* chunk is still pending from an asynchronous SAVE command
issued by the VFIO_MIG_GET_PRECOPY_INFO ioctl.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

net/mlx5: Add IFC bits for migration state

Add the relevant IFC bits for querying an extra migration state from the
device.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-5-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl

Introduce a core helper function for VFIO_MIG_GET_PRECOPY_INFO and adapt
all drivers to use it.

It centralizes the common code and ensures that output flags are cleared
on entry, in case user opts in to VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.
This preventing any unintended echoing of userspace data back to
userspace.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-4-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2

Currently, existing VFIO_MIG_GET_PRECOPY_INFO implementations don't
assign info.flags before copy_to_user().

Because they copy the struct in from userspace first, this effectively
echoes userspace-provided flags back as output, preventing the field
from being used to report new reliable data from the drivers.

Add support for a new device feature named
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.

On SET, enables the v2 pre_copy_info behaviour, where the
vfio_precopy_info.flags is a valid output field.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase

As currently defined, initial_bytes is monotonically decreasing and
precedes dirty_bytes when reading from the saving file descriptor.
The transition from initial_bytes to dirty_bytes is unidirectional and
irreversible.

The initial_bytes are considered as critical data that is highly
recommended to be transferred to the target as part of PRE_COPY, without
this data, the PRE_COPY phase would be ineffective.

We come to solve the case when a new chunk of critical data is
introduced during the PRE_COPY phase and the driver would like to report
an entirely new value for the initial_bytes.

For that, we extend the VFIO_MIG_GET_PRECOPY_INFO ioctl with an output
flag named VFIO_PRECOPY_INFO_REINIT to allow drivers reporting a new
initial_bytes value during the PRE_COPY phase.

Currently, existing VFIO_MIG_GET_PRECOPY_INFO implementations don't
assign info.flags before copy_to_user(), this effectively echoes
userspace-provided flags back as output, preventing the field from being
used to report new reliable data from the drivers.

Reliable use of the new VFIO_PRECOPY_INFO_REINIT flag requires userspace
to explicitly opt in by enabling the
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 device feature.

When the caller opts in, the driver may report an entirely new
value for initial_bytes. It may be larger, it may be smaller, it may
include the previous unread initial_bytes, it may discard the previous
unread initial_bytes, up to the driver logic and state.
The presence of the VFIO_PRECOPY_INFO_REINIT output flag set by the
driver indicates that new initial data is present on the stream.

Once the caller sees this flag, the initial_bytes value should be
re-evaluated relative to the readiness state for transition to
STOP_COPY.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260317161753.18964-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: selftests: Fix VLA initialisation in vfio_pci_irq_set()

C does not permit an initialiser expression on a variable-length array
(C99 Section 6.7.9 constraint: "The type of the entity to be initialized
shall not be a variable length array type").

vfio_pci_irq_set() declared:

      u8 buf[sizeof(struct vfio_irq_set) + sizeof(int) * count] = {};

where `count` is a runtime function parameter, making `buf` a VLA.

GCC rejects this with (tried with GCC-9.4.0):

      error: variable-sized object may not be initialized

Fix by removing the `= {}` initialiser and inserting an explicit
memset() immediately after the declaration.  memset() on a VLA is
perfectly legal and achieves the same zero-initialisation on all
conforming C implementations.

Fixes: 19faf6fd969c ("vfio: selftests: Add a helper library for VFIO selftests")
Cc: stable@vger.kernel.org
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Link: https://lore.kernel.org/r/20260317051402.3725670-1-mhonap@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: uapi: fix comment typo

The file contains a spelling error in a source comment (succes).

Typos in comments reduce readability and make text searches less reliable
for developers and maintainers.

Replace 'succes' with 'success' in the affected comment. This is a
comment-only cleanup and does not change behavior.

Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Link: https://lore.kernel.org/r/20260316185617.166414-1-joseph.salisbury@oracle.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: mdev: replace mtty_dev->vd_class with a const struct class

The class_create() call has been deprecated in favor of class_register()
as the driver core now allows for a struct class to be in read-only
memory. Replace mtty_dev->vd_class with a const struct class and drop the
class_create() call.

Compile tested and found no errors/warns in dmesg after enabling
CONFIG_VFIO and CONFIG_SAMPLE_VFIO_MDEV_MTTY.

Link: https://lore.kernel.org/all/2023040244-duffel-pushpin-f738@gregkh/
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20260308214939.1215682-1-jkoolstra@xs4all.nl
Signed-off-by: Alex Williamson <alex@shazbot.org>

bpf: Do not allow deleting local storage in NMI

Currently, local storage may deadlock when deferring freeing selem or
local storage through kfree_rcu(), call_rcu() or call_rcu_tasks_trace()
in NMI or reentrant. Since deleting selem in NMI is an unlikely use
case, partially mitigate it by returning error when calling from
bpf_xxx_storage_delete() helpers in NMI. Note that, it is still possible
to deadlock through reentrant. A full mitigation requires returning
error when irqs_disabled() is true, which, however is too heavy-handed
for bpf_xxx_storage_delete().

The long-term solution requires _nolock versions of call_rcu. Another
possible solution is to defer the free through irq_work [0], but it
would grow the size of selem, which is non-ideal.

The check is only needed in bpf_selem_unlink(), which is used by helpers
and syscalls. bpf_selem_unlink_nofail() is fine as it is called during
map and owner tear down that never run in NMI or reentrant.

[0] https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com/

Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://patch.msgid.link/20260319025716.2361065-1-ameryhung@gmail.com

Merge tag 'net-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, Bluetooth and netfilter.

  Nothing too exciting here, mostly fixes for corner cases.

  Current release - fix to a fix:

   - bonding: prevent potential infinite loop in bond_header_parse()

  Current release - new code bugs:

   - wifi: mac80211: check tdls flag in ieee80211_tdls_oper

  Previous releases - regressions:

   - af_unix: give up GC if MSG_PEEK intervened

   - netfilter: conntrack: add missing netlink policy validations

   - NFC: nxp-nci: allow GPIOs to sleep"

* tag 'net-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
  MPTCP: fix lock class name family in pm_nl_create_listen_socket
  icmp: fix NULL pointer dereference in icmp_tag_validation()
  net: dsa: bcm_sf2: fix missing clk_disable_unprepare() in error paths
  net: shaper: protect from late creation of hierarchy
  net: shaper: protect late read accesses to the hierarchy
  net: mvpp2: guard flow control update with global_tx_fc in buffer switching
  nfnetlink_osf: validate individual option lengths in fingerprints
  netfilter: nf_tables: release flowtable after rcu grace period on error
  netfilter: bpf: defer hook memory release until rcu readers are done
  net: bonding: fix NULL deref in bond_debug_rlb_hash_show
  udp_tunnel: fix NULL deref caused by udp_sock_create6 when CONFIG_IPV6=n
  net/mlx5e: Fix race condition during IPSec ESN update
  net/mlx5e: Prevent concurrent access to IPSec ASO context
  net/mlx5: qos: Restrict RTNL area to avoid a lock cycle
  ipv6: add NULL checks for idev in SRv6 paths
  NFC: nxp-nci: allow GPIOs to sleep
  net: macb: fix uninitialized rx_fs_lock
  net: macb: fix use-after-free access to PTP clock
  netdevsim: drop PSP ext ref on forward failure
  wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure
  ...

KVM: arm64: selftests: Add no-vgic-v5 selftest

Now that GICv5 is supported, it is important to check that all of the
GICv5 register state is hidden from a guest that doesn't create a
vGICv5.

Rename the no-vgic-v3 selftest to no-vgic, and extend it to check
GICv5 system registers too.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-42-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: selftests: Introduce a minimal GICv5 PPI selftest

This basic selftest creates a vgic_v5 device (if supported), and tests
that one of the PPI interrupts works as expected with a basic
single-vCPU guest.

Upon starting, the guest enables interrupts. That means that it is
initialising all PPIs to have reasonable priorities, but marking them
as disabled. Then the priority mask in the ICC_PCR_EL1 is set, and
interrupts are enable in ICC_CR0_EL1. At this stage the guest is able
to receive interrupts. The architected SW_PPI (64) is enabled and
KVM_IRQ_LINE ioctl is used to inject the state into the guest.

The guest's interrupt handler has an explicit WFI in order to ensure
that the guest skips WFI when there are pending and enabled PPI
interrupts.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-41-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Communicate userspace-driveable PPIs via a UAPI

GICv5 systems will likely not support the full set of PPIs. The
presence of any virtual PPI is tied to the presence of the physical
PPI. Therefore, the available PPIs will be limited by the physical
host. Userspace cannot drive any PPIs that are not implemented.

Moreover, it is not desirable to expose all PPIs to the guest in the
first place, even if they are supported in hardware. Some devices,
such as the arch timer, are implemented in KVM, and hence those PPIs
shouldn't be driven by userspace, either.

Provided a new UAPI:
KVM_DEV_ARM_VGIC_GRP_CTRL => KVM_DEV_ARM_VGIC_USERPSPACE_PPIs

This allows userspace to query which PPIs it is able to drive via
KVM_IRQ_LINE.

Additionally, introduce a check in kvm_vm_ioctl_irq_line() to reject
any PPIs not in the userspace mask.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-40-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

Documentation: KVM: Introduce documentation for VGICv5

Now that it is possible to create a VGICv5 device, provide initial
documentation for it. At this stage, there is little to document.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-39-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Probe for GICv5 device

The basic GICv5 PPI support is now complete. Allow probing for a
native GICv5 rather than just the legacy support.

The implementation doesn't support protected VMs with GICv5 at this
time. Therefore, if KVM has protected mode enabled the native GICv5
init is skipped, but legacy VMs are allowed if the hardware supports
it.

At this stage the GICv5 KVM implementation only supports PPIs, and
doesn't interact with the host IRS at all. This means that there is no
need to check how many concurrent VMs or vCPUs per VM are supported by
the IRS - the PPI support only requires the CPUIF. The support is
artificially limited to VGIC_V5_MAX_CPUS, i.e. 512, vCPUs per VM.

With this change it becomes possible to run basic GICv5-based VMs,
provided that they only use PPIs.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-38-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Set ICH_VCTLR_EL2.En on boot

This control enables virtual HPPI selection, i.e., selection and
delivery of interrupts for a guest (assuming that the guest itself has
opted to receive interrupts). This is set to enabled on boot as there
is no reason for disabling it in normal operation as virtual interrupt
signalling itself is still controlled via the HCR_EL2.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-37-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Introduce kvm_arm_vgic_v5_ops and register them

Only the KVM_DEV_ARM_VGIC_GRP_CTRL->KVM_DEV_ARM_VGIC_CTRL_INIT op is
currently supported. All other ops are stubbed out.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-36-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Hide FEAT_GCIE from NV GICv5 guests

Currently, NV guests are not supported with GICv5. Therefore, make
sure that FEAT_GCIE is always hidden from such guests.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-35-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic: Hide GICv5 for protected guests

We don't support running protected guest with GICv5 at the moment.
Therefore, be sure that we don't expose it to the guest at all by
actively hiding it when running a protected guest.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-34-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Mandate architected PPI for PMU emulation on GICv5

Make it mandatory to use the architected PPI when running a GICv5
guest. Attempts to set anything other than the architected PPI (23)
are rejected.

Additionally, KVM_ARM_VCPU_PMU_V3_INIT is relaxed to no longer require
KVM_ARM_VCPU_PMU_V3_IRQ to be called for GICv5-based guests. In this
case, the architectued PPI is automatically used.

Documentation is bumped accordingly.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-33-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Enlighten arch timer for GICv5

Now that GICv5 has arrived, the arch timer requires some TLC to
address some of the key differences introduced with GICv5.

For PPIs on GICv5, the queue_irq_unlock irq_op is used as AP lists are
not required at all for GICv5. The arch timer also introduces an
irq_op - get_input_level. Extend the arch-timer-provided irq_ops to
include the PPI op for vgic_v5 guests.

When possible, DVI (Direct Virtual Interrupt) is set for PPIs when
using a vgic_v5, which directly inject the pending state into the
guest. This means that the host never sees the interrupt for the guest
for these interrupts. This has three impacts.

* First of all, the kvm_cpu_has_pending_timer check is updated to
  explicitly check if the timers are expected to fire.

* Secondly, for mapped timers (which use DVI) they must be masked on
  the host prior to entering a GICv5 guest, and unmasked on the return
  path. This is handled in set_timer_irq_phys_masked.

* Thirdly, it makes zero sense to attempt to inject state for a DVI'd
  interrupt. Track which timers are direct, and skip the call to
  kvm_vgic_inject_irq() for these.

The final, but rather important, change is that the architected PPIs
for the timers are made mandatory for a GICv5 guest. Attempts to set
them to anything else are actively rejected. Once a vgic_v5 is
initialised, the arch timer PPIs are also explicitly reinitialised to
ensure the correct GICv5-compatible PPIs are used - this also adds in
the GICv5 PPI type to the intid.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-32-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

irqchip/gic-v5: Introduce minimal irq_set_type() for PPIs

GICv5 does not support configuring the handling mode or trigger mode
of PPIs at runtime - these choices are made at implementation time,
and most of the architected PPIs have an architected handling mode (as
reported in the ICH_PPI_HMRn_EL1 registers). As chip->set_irq_type()
is optional, this has not been implemented for GICv5 PPIs as it served
no real purpose.

However, although the set_irq_type() function is marked as optional,
the lack of it breaks attempts to create a domain hierarchy on top of
GICv5's PPI domain. This is due to __irq_set_trigger() calling
chip->set_irq_type(), which returns -ENOSYS if the parent domain
doesn't implement the set_irq_type() call.

In order to make things work, this change introduces a set_irq_type()
call for GICv5 PPIs. This performs a basic sanity check (that the
hardware's handling mode (Level/Edge) matches what is being set as the
type, and does nothing else.

This is sufficient to get hierarchical domains working for GICv5 PPIs
(such as the one KVM introduces for the arch timer). It has the side
benefit (or drawback) that it will catch cases where the firmware
description doesn't match what the hardware reports.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-31-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Initialise ID and priority bits when resetting vcpu

Determine the number of priority bits and ID bits exposed to the guest
as part of resetting the vcpu state. These values are presented to the
guest by trapping and emulating reads from ICC_IDR0_EL1.

GICv5 supports either 16- or 24-bits of ID space (for SPIs and
LPIs). It is expected that 2^16 IDs is more than enough, and therefore
this value is chosen irrespective of the hardware supporting more or
not.

The GICv5 architecture only supports 5 bits of priority in the CPU
interface (but potentially fewer in the IRS). Therefore, this is the
default value chosen for the number of priority bits in the CPU
IF.

Note: We replicate the way that GICv3 uses the num_id_bits and
num_pri_bits variables. That is, num_id_bits stores the value of the
hardware field verbatim (0 means 16-bits, 1 would mean 24-bits for
GICv5), and num_pri_bits stores the actual number of priority bits;
the field value + 1.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-30-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Create and initialise vgic_v5

Update kvm_vgic_create to create a vgic_v5 device. When creating a
vgic, FEAT_GCIE in the ID_AA64PFR2 is only exposed to vgic_v5-based
guests, and is hidden otherwise. GIC in ~ID_AA64PFR0_EL1 is never
exposed for a vgic_v5 guest.

When initialising a vgic_v5, skip kvm_vgic_dist_init as GICv5 doesn't
support one. The current vgic_v5 implementation only supports PPIs, so
no SPIs are initialised either.

The current vgic_v5 support doesn't extend to nested guests. Therefore,
the init of vgic_v5 for a nested guest is failed in vgic_v5_init.

As the current vgic_v5 doesn't require any resources to be mapped,
vgic_v5_map_resources is simply used to check that the vgic has indeed
been initialised. Again, this will change as more GICv5 support is
merged in.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-29-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Support GICv5 interrupts with KVM_IRQ_LINE

Interrupts under GICv5 look quite different to those from older Arm
GICs. Specifically, the type is encoded in the top bits of the
interrupt ID.

Extend KVM_IRQ_LINE to cope with GICv5 PPIs and SPIs. The requires
subtly changing the KVM_IRQ_LINE API for GICv5 guests. For older Arm
GICs, PPIs had to be in the range of 16-31, and SPIs had to be
32-1019, but this no longer holds true for GICv5. Instead, for a GICv5
guest support PPIs in the range of 0-127, and SPIs in the range
0-65535. The documentation is updated accordingly.

The SPI range doesn't cover the full SPI range that a GICv5 system can
potentially cope with (GICv5 provides up to 24-bits of SPI ID space,
and we only have 16 bits to work with in KVM_IRQ_LINE). However, 65k
SPIs is more than would be reasonably expected on systems for years to
come.

In order to use vgic_is_v5(), the kvm/arm_vgic.h header is added to
kvm/arm.c.

Note: As the GICv5 KVM implementation currently doesn't support
injecting SPIs attempts to do so will fail. This restriction will by
lifted as the GICv5 KVM support evolves.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-28-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Implement direct injection of PPIs

GICv5 is able to directly inject PPI pending state into a guest using
a mechanism called DVI whereby the pending bit for a paticular PPI is
driven directly by the physically-connected hardware. This mechanism
itself doesn't allow for any ID translation, so the host interrupt is
directly mapped into a guest with the same interrupt ID.

When mapping a virtual interrupt to a physical interrupt via
kvm_vgic_map_irq for a GICv5 guest, check if the interrupt itself is a
PPI or not. If it is, and the host's interrupt ID matches that used
for the guest DVI is enabled, and the interrupt itself is marked as
directly_injected.

When the interrupt is unmapped again, this process is reversed, and
DVI is disabled for the interrupt again.

Note: the expectation is that a directly injected PPI is disabled on
the host while the guest state is loaded. The reason is that although
DVI is enabled to drive the guest's pending state directly, the host
pending state also remains driven. In order to avoid the same PPI
firing on both the host and the guest, the host's interrupt must be
disabled (masked). This is left up to the code that owns the device
generating the PPI as this needs to be handled on a per-VM basis. One
VM might use DVI, while another might not, in which case the physical
PPI should be enabled for the latter.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-27-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Introduce set_direct_injection irq_op

GICv5 adds support for directly injected PPIs. The mechanism for
setting this up is GICv5 specific, so rather than adding
GICv5-specific code to the common vgic code, we introduce a new
irq_op.

This new irq_op is intended to be used to enable or disable direct
injection for interrupts that support it. As it is an irq_op, it has
no effect unless explicitly populated in the irq_ops structure for a
particular interrupt. The usage is demonstracted in the subsequent
change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-26-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writes

A guest should not be able to detect if a PPI that is not exposed to
the guest is implemented or not. Avoid the guest enabling any PPIs
that are not implemented as far as the guest is concerned by trapping
and masking writes to the two ICC_PPI_ENABLERx_EL1 registers.

When a guest writes these registers, the write is masked with the set
of PPIs actually exposed to the guest, and the state is written back
to KVM's shadow state. As there is now no way for the guest to change
the PPI enable state without it being trapped, saving of the PPI
Enable state is dropped from guest exit.

Reads for the above registers are not masked. When the guest is
running and reads from the above registers, it is presented with what
KVM provides in the ICH_PPI_ENABLERx_EL2 registers, which is the
masked version of what the guest last wrote.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-25-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Check for pending PPIs

This change allows KVM to check for pending PPI interrupts. This has
two main components:

First of all, the effective priority mask is calculated. This is a
combination of the priority mask in the VPEs ICC_PCR_EL1.PRIORITY and
the currently running priority as determined from the VPE's
ICH_APR_EL1. If an interrupt's priority is greater than or equal to
the effective priority mask, it can be signalled. Otherwise, it
cannot.

Secondly, any Enabled and Pending PPIs must be checked against this
compound priority mask. The reqires the PPI priorities to by synced
back to the KVM shadow state on WFI entry - this is skipped in general
operation as it isn't required and is rather expensive. If any Enabled
and Pending PPIs are of sufficient priority to be signalled, then
there are pending PPIs. Else, there are not. This ensures that a VPE
is not woken when it cannot actually process the pending interrupts.

As the PPI priorities are not synced back to the KVM shadow state on
every guest exit, they must by synced prior to checking if there are
pending interrupts for the guest. The sync itself happens in
vgic_v5_put() if, and only if, the vcpu is entering WFI as this is the
only case where it is not planned to run the vcpu thread again. If the
vcpu enters WFI, the vcpu thread will be descheduled and won't be
rescheduled again until it has a pending interrupt, which is checked
from kvm_arch_vcpu_runnable().

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-24-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Clear TWI if single task running

Handle GICv5 in kvm_vcpu_should_clear_twi(). Clear TWI if there is a
single task running, and enable it otherwise. This is a sane default
for GICv5 given the current level of support.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-23-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Init Private IRQs (PPIs) for GICv5

Initialise the private interrupts (PPIs, only) for GICv5. This means
that a GICv5-style intid is generated (which encodes the PPI type in
the top bits) instead of the 0-based index that is used for older
GICs.

Additionally, set all of the GICv5 PPIs to use Level for the handling
mode, with the exception of the SW_PPI which uses Edge. This matches
the architecturally-defined set in the GICv5 specification (the CTIIRQ
handling mode is IMPDEF, so Level has been picked for that).

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-22-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Implement PPI interrupt injection

This change introduces interrupt injection for PPIs for GICv5-based
guests.

The lifecycle of PPIs is largely managed by the hardware for a GICv5
system. The hypervisor injects pending state into the guest by using
the ICH_PPI_PENDRx_EL2 registers. These are used by the hardware to
pick a Highest Priority Pending Interrupt (HPPI) for the guest based
on the enable state of each individual interrupt. The enable state and
priority for each interrupt are provided by the guest itself (through
writes to the PPI registers).

When Direct Virtual Interrupt (DVI) is set for a particular PPI, the
hypervisor is even able to skip the injection of the pending state
altogether - it all happens in hardware.

The result of the above is that no AP lists are required for GICv5,
unlike for older GICs. Instead, for PPIs the ICH_PPI_* registers
fulfil the same purpose for all 128 PPIs. Hence, as long as the
ICH_PPI_* registers are populated prior to guest entry, and merged
back into the KVM shadow state on exit, the PPI state is preserved,
and interrupts can be injected.

When injecting the state of a PPI the state is merged into the
PPI-specific vgic_irq structure. The PPIs are made pending via the
ICH_PPI_PENDRx_EL2 registers, the value of which is generated from the
vgic_irq structures for each PPI exposed on guest entry. The
queue_irq_unlock() irq_op is required to kick the vCPU to ensure that
it seems the new state. The result is that no AP lists are used for
private interrupts on GICv5.

Prior to entering the guest, vgic_v5_flush_ppi_state() is called from
kvm_vgic_flush_hwstate(). This generates the pending state to inject
into the guest, and snapshots it (twice - an entry and an exit copy)
in order to track any changes. These changes can come from a guest
consuming an interrupt or from a guest making an Edge-triggered
interrupt pending.

When returning from running a guest, the guest's PPI state is merged
back into KVM's vgic_irq state in vgic_v5_merge_ppi_state() from
kvm_vgic_sync_hwstate(). The Enable and Active state is synced back for
all PPIs, and the pending state is synced back for Edge PPIs (Level is
driven directly by the devices generating said levels). The incoming
pending state from the guest is merged with KVM's shadow state to
avoid losing any incoming interrupts.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-21-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic: Introduce queue_irq_unlock to irq_ops

There are times when the default behaviour of vgic_queue_irq_unlock()
is undesirable. This is because some GICs, such a GICv5 which is the
main driver for this change, handle the majority of the interrupt
lifecycle in hardware. In this case, there is no need for a per-VCPU
AP list as the interrupt can be made pending directly. This is done
either via the ICH_PPI_x_EL2 registers for PPIs, or with the VDPEND
system instruction for SPIs and LPIs.

The vgic_queue_irq_unlock() function is made overridable using a new
function pointer in struct irq_ops. vgic_queue_irq_unlock() is
overridden if the function pointer is non-null.

This new irq_op is unused in this change - it is purely providing the
infrastructure itself. The subsequent PPI injection changes provide a
demonstration of the usage of the queue_irq_unlock irq_op.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-20-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Finalize GICv5 PPIs and generate mask

We only want to expose a subset of the PPIs to a guest. If a PPI does
not have an owner, it is not being actively driven by a device. The
SW_PPI is a special case, as it is likely for userspace to wish to
inject that.

Therefore, just prior to running the guest for the first time, we need
to finalize the PPIs. A mask is generated which, when combined with
trapping a guest's PPI accesses, allows for the guest's view of the
PPI to be filtered. This mask is global to the VM as all VCPUs PPI
configurations must match.

In addition, the PPI HMR is calculated.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-19-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Implement GICv5 load/put and save/restore

This change introduces GICv5 load/put. Additionally, it plumbs in
save/restore for:

* PPIs (ICH_PPI_x_EL2 regs)
* ICH_VMCR_EL2
* ICH_APR_EL2
* ICC_ICSR_EL1

A GICv5-specific enable bit is added to struct vgic_vmcr as this
differs from previous GICs. On GICv5-native systems, the VMCR only
contains the enable bit (driven by the guest via ICC_CR0_EL1.EN) and
the priority mask (PCR).

A struct gicv5_vpe is also introduced. This currently only contains a
single field - bool resident - which is used to track if a VPE is
currently running or not, and is used to avoid a case of double load
or double put on the WFI path for a vCPU. This struct will be extended
as additional GICv5 support is merged, specifically for VPE doorbells.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-18-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: gic-v5: Add vgic-v5 save/restore hyp interface

Introduce the following hyp functions to save/restore GICv5 state:

* __vgic_v5_save_apr()
* __vgic_v5_restore_vmcr_apr()
* __vgic_v5_save_ppi_state() - no hypercall required
* __vgic_v5_restore_ppi_state() - no hypercall required
* __vgic_v5_save_state() - no hypercall required
* __vgic_v5_restore_state() - no hypercall required

Note that the functions tagged as not requiring hypercalls are always
called directly from the same context. They are either called via the
vgic_save_state()/vgic_restore_state() path when running with VHE, or
via __hyp_vgic_save_state()/__hyp_vgic_restore_state() otherwise. This
mimics how vgic_v3_save_state()/vgic_v3_restore_state() are
implemented.

Overall, the state of the following registers is saved/restored:

* ICC_ICSR_EL1
* ICH_APR_EL2
* ICH_PPI_ACTIVERx_EL2
* ICH_PPI_DVIRx_EL2
* ICH_PPI_ENABLERx_EL2
* ICH_PPI_PENDRx_EL2
* ICH_PPI_PRIORITYRx_EL2
* ICH_VMCR_EL2

All of these are saved/restored to/from the KVM vgic_v5 CPUIF shadow
state, with the exception of the PPI active, pending, and enable
state. The pending state is saved and restored from kvm_host_data as
any changes here need to be tracked and propagated back to the
vgic_irq shadow structures (coming in a future commit). Therefore, an
entry and an exit copy is required. The active and enable state is
restored from the vgic_v5 CPUIF, but is saved to kvm_host_data. Again,
this needs to by synced back into the shadow data structures.

The ICSR must be save/restored as this register is shared between host
and guest. Therefore, to avoid leaking host state to the guest, this
must be saved and restored. Moreover, as this can by used by the host
at any time, it must be save/restored eagerly. Note: the host state is
not preserved as the host should only use this register when
preemption is disabled.

As with GICv3, the VMCR is eagerly saved as this is required when
checking if interrupts can be injected or not, and therefore impacts
things such as WFI.

As part of restoring the ICH_VMCR_EL2 and ICH_APR_EL2, GICv3-compat
mode is also disabled by setting the ICH_VCTLR_EL2.V3 bit to 0. The
correspoinding GICv3-compat mode enable is part of the VMCR & APR
restore for a GICv3 guest as it only takes effect when actually
running a guest.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-17-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

pinctrl: pinconf-generic: Convert ..._parse_dt_pinmux() to fwnode API

Convert pinconf_generic_parse_dt_pinmux() to fwnode API. This makes code
cleaner and potentially reusable for some other types of fwnodes, such as
software nodes.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>

KVM: arm64: gic-v5: Trap and emulate ICC_IDR0_EL1 accesses

Unless accesses to the ICC_IDR0_EL1 are trapped by KVM, the guest
reads the same state as the host. This isn't desirable as it limits
the migratability of VMs and means that KVM can't hide hardware
features such as FEAT_GCIE_LEGACY.

Trap and emulate accesses to the register, and present KVM's chosen ID
bits and Priority bits (which is 5, as GICv5 only supports 5 bits of
priority in the CPU interface). FEAT_GCIE_LEGACY is never presented to
the guest as it is only relevant for nested guests doing mixed GICv5
and GICv3 support.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-16-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

pinctrl: pinconf-generic: Validate fwnode instead of device node

Currently we convert device node to fwnode in the
pinconf_generic_parse_dt_config() and then validate the device node.
This is confusing order. Instead, assign fwnode and validate it.

Fixes: e002d162654b ("pinctrl: pinconf-generic: Use only fwnode API in parse_dt_cfg()")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>

KVM: arm64: gic-v5: Add emulation for ICC_IAFFIDR_EL1 accesses

GICv5 doesn't provide an ICV_IAFFIDR_EL1 or ICH_IAFFIDR_EL2 for
providing the IAFFID to the guest. A guest access to the
ICC_IAFFIDR_EL1 must therefore be trapped and emulated to avoid the
guest accessing the host's ICC_IAFFIDR_EL1.

The virtual IAFFID is provided to the guest when it reads
ICC_IAFFIDR_EL1 (which always traps back to the hypervisor). Writes are
rightly ignored. KVM treats the GICv5 VPEID, the virtual IAFFID, and
the vcpu_id as the same, and so the vcpu_id is returned.

The trapping for the ICC_IAFFIDR_EL1 is always enabled when in a guest
context.

Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-15-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>