git.ipfire.org Git - thirdparty/kernel/stable.git/log

ALSA: ice1712: check snd_ctl_new1() return value

snd_ctl_new1() can return NULL when memory allocation fails. The
ice1712 driver calls snd_ctl_new1() without checking the return value
before dereferencing the pointer in multiple places (ice1712.c,
ice1724.c, aureon.c), which can lead to NULL pointer dereferences.

Add NULL checks after snd_ctl_new1() calls and return -ENOMEM if any
fails.

Assisted-by: Opencode:DeepSeek-V4-Flash
Cc: stable@vger.kernel.org
Fixes: b9a4efd61b6b ("ALSA: ice1712,ice1724: fix the kcontrol->id initialization")
Signed-off-by: Zhao Dongdong <zhaodongdong@kylinos.cn>
Link: https://patch.msgid.link/tencent_42E5E2AB1B6A5101F7EE8C2117F1F687BB07@qq.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: gus: check snd_ctl_new1() return value

snd_ctl_new1() can return NULL when memory allocation fails.
snd_gf1_pcm_volume_control() does not check the return value before
dereferencing kctl->id.index, which can lead to a NULL pointer
dereference.

Add a NULL check after snd_ctl_new1() and return -ENOMEM if it fails.
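
For illustration only (not part of the patch), a minimal userspace sketch
of the check-before-dereference pattern described above. The demo_* types
and functions are hypothetical stand-ins, not the ALSA structures or API:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the control types; not the real ALSA structs. */
struct demo_ctl_id { unsigned int index; };
struct demo_kcontrol { struct demo_ctl_id id; };

/* Mimics an allocating constructor that returns NULL on failure. */
struct demo_kcontrol *demo_ctl_new(int simulate_oom)
{
	if (simulate_oom)
		return NULL;
	return calloc(1, sizeof(struct demo_kcontrol));
}

/* Returns 0 on success or -ENOMEM if the allocation failed. */
int demo_add_volume_control(int simulate_oom, unsigned int index,
			    struct demo_kcontrol **out)
{
	struct demo_kcontrol *kctl;

	assert(out);
	kctl = demo_ctl_new(simulate_oom);
	if (!kctl)	/* check before touching kctl->id */
		return -ENOMEM;

	kctl->id.index = index;
	*out = kctl;
	return 0;
}
```

The caller owns the returned object and releases it with free().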

Assisted-by: Opencode:DeepSeek-V4-Flash
Cc: stable@vger.kernel.org
Fixes: c5ae57b1bb99 ("ALSA: gus: Fix kctl->id initialization")
Signed-off-by: Zhao Dongdong <zhaodongdong@kylinos.cn>
Link: https://patch.msgid.link/tencent_F644A3DCAD32945D62DB2FEEBE8A996F6809@qq.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

iommu/arm-smmu-v3: Allow ATS to be always on

When a device's default substream attaches to an identity domain, the SMMU
driver currently sets the device's STE between two modes:

  Mode 1: Cfg=Translate, S1DSS=Bypass, EATS=1
  Mode 2: Cfg=bypass (EATS is ignored by HW)

When there is an active PASID (non-default substream), mode 1 is used. And
when there is no PASID support or no active PASID, mode 2 is used.

The driver will also downgrade an STE from mode 1 to mode 2, when the last
active substream becomes inactive.

However, there are PCIe devices that demand ATS to be always on. For these
devices, their STEs have to use the mode 1 as HW ignores EATS with mode 2.

Change the driver accordingly:
  - always use the mode 1
  - never downgrade to mode 2
  - allocate and retain a CD table (see note below)

Note that these devices might not support PASID, i.e. doing non-PASID ATS.
In such a case, the ssid_bits is set to 0. However, s1cdmax must be set to
a !0 value in order to keep the S1DSS field effective. Thus, when a master
requires ats_always_on, set its s1cdmax to at least 1, meaning that the CD
table will have a dummy entry (SSID=1) that will never be used.

Now for these devices, arm_smmu_cdtab_allocated() will always return true,
vs. false prior to this change. When its default substream is attached to
an IDENTITY domain, its first CD is NULL in the table, which is a totally
valid case. Thus, add "!master->ats_always_on" to the condition.
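
As a rough model of the policy described above (not the driver code), the
mode selection and the s1cdmax floor could be sketched as follows. The
struct, enum and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

enum demo_ste_mode {
	DEMO_STE_TRANSLATE_S1DSS_BYPASS = 1,	/* mode 1: EATS is honoured */
	DEMO_STE_BYPASS = 2,			/* mode 2: EATS is ignored by HW */
};

/* Hypothetical per-master state; field names only mirror the message. */
struct demo_master {
	bool ats_always_on;
	unsigned int ssid_bits;		/* 0 when PASID is unsupported */
	unsigned int active_pasids;	/* non-default substreams in use */
};

/* Pick the STE mode for a default substream attached to an identity domain. */
enum demo_ste_mode demo_identity_ste_mode(const struct demo_master *m)
{
	assert(m);
	if (m->ats_always_on || m->active_pasids > 0)
		return DEMO_STE_TRANSLATE_S1DSS_BYPASS;
	return DEMO_STE_BYPASS;
}

/* s1cdmax must be non-zero for the S1DSS field to stay effective. */
unsigned int demo_s1cdmax(const struct demo_master *m)
{
	assert(m);
	if (m->ats_always_on && m->ssid_bits < 1)
		return 1;
	return m->ssid_bits;
}
```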

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

PCI: Allow ATS to be always on for pre-CXL devices

Some NVIDIA GPU/NIC devices, though they don't implement CXL config space,
have many CXL-like properties. Call this kind "pre-CXL".

Similar to CXL.cache capability, these pre-CXL devices also require the ATS
function even when their RIDs are IOMMU bypassed, i.e. keep ATS "always on"
vs. "on demand" when a non-zero PASID line gets enabled in SVA use cases.

Introduce pci_dev_specific_ats_required() quirk function to scan a list of
IDs for these devices. Then, include it in pci_ats_required().
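
A quirk of this kind is essentially a table scan. The sketch below shows
the general shape in plain C; the function name and the ID values used
with it are placeholders, not the list or helper from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_pci_id {
	uint16_t vendor;
	uint16_t device;
};

/* Return true if (vendor, device) matches an entry in the given ID table. */
bool demo_id_in_list(const struct demo_pci_id *ids, size_t n,
		     uint16_t vendor, uint16_t device)
{
	size_t i;

	assert(ids || n == 0);
	for (i = 0; i < n; i++) {
		if (ids[i].vendor == vendor && ids[i].device == device)
			return true;
	}
	return false;
}
```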

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nirmoy Das <nirmoyd@nvidia.com>
Tested-by: Nirmoy Das <nirmoyd@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

ALSA: es1938: check snd_ctl_new1() return value

snd_ctl_new1() can return NULL when memory allocation fails.
snd_es1938_mixer() does not check the return value before dereferencing
the pointer, which can lead to a NULL pointer dereference.

Add a NULL check after snd_ctl_new1() and return -ENOMEM if it fails.

Assisted-by: Opencode:DeepSeek-V4-Flash
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Zhao Dongdong <zhaodongdong@kylinos.cn>
Link: https://patch.msgid.link/tencent_E0DC65165FDF2C8982BAFB6794B854B53B0A@qq.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

PCI: Add pci_ats_required() for CXL.cache capable devices

Controlled by IOMMU drivers, ATS can be enabled "on demand", when a given
PASID on a device is attached to an I/O page table. This is working, even
when a device has no translation on its RID (i.e., RID is IOMMU bypassed).

However, certain PCIe devices require non-PASID ATS on their RID even when
the RID is IOMMU bypassed. Call this "ATS always on" in IOMMU terms.

For example, CXL spec r4.0 notes in sec 3.2.5.13 Memory Type on CXL.cache:
"To source requests on CXL.cache, devices need to get the Host Physical
Address (HPA) from the Host by means of an ATS request on CXL.io."

In other words, the CXL.cache capability requires ATS; otherwise, it can't
access host physical memory.

Introduce a new pci_ats_required() helper for the IOMMU driver to scan a
PCI device and shift ATS policies between "on demand" and "always on".

Add support for CXL.cache devices first. Pre-CXL devices will be added
in the quirks.c file.

Note that pci_ats_required() validates against pci_ats_supported(), so we
ensure that untrusted devices (e.g. external ports) will not be always on.
This maintains the existing ATS security policy regarding potential side-
channel attacks via ATS.

Cc: linux-cxl@vger.kernel.org
Suggested-by: Vikram Sethi <vsethi@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/vsi: Use list_for_each_entry()

Smatch complains about the NULL check on "iommu" because list_entry()
can't be NULL. Clean up this code by using list_for_each_entry().
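
To show why the entry pointer is never NULL inside such a loop, here is a
small userspace re-creation of the idiom. The demo_* macros are simplified
copies that take the type explicitly, and the lookup function is a
hypothetical example rather than the driver's code:

```c
#include <assert.h>
#include <stddef.h>

struct demo_list_head {
	struct demo_list_head *next, *prev;
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Visit every entry; the loop ends when the cursor wraps back to head. */
#define demo_list_for_each_entry(pos, head, type, member) \
	for ((pos) = demo_container_of((head)->next, type, member); \
	     &(pos)->member != (head); \
	     (pos) = demo_container_of((pos)->member.next, type, member))

struct demo_iommu {
	int id;
	struct demo_list_head node;
};

void demo_list_init(struct demo_list_head *head)
{
	head->next = head->prev = head;
}

void demo_list_add_tail(struct demo_list_head *entry,
			struct demo_list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* NULL here means "not found"; pos itself is never NULL inside the loop. */
struct demo_iommu *demo_find_iommu(struct demo_list_head *head, int id)
{
	struct demo_iommu *pos;

	assert(head);
	demo_list_for_each_entry(pos, head, struct demo_iommu, node) {
		if (pos->id == id)
			return pos;
	}
	return NULL;
}
```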

Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu: vsi: avoid -Wformat-security warning

When -Wformat-security is enabled, it catches a call to
iommu_device_sysfs_add() that passes a string variable in
place of a format:

drivers/iommu/vsi-iommu.c: In function 'vsi_iommu_probe':
drivers/iommu/vsi-iommu.c:717:9: error: format not a string literal and no format arguments [-Werror=format-security]
  717 |         err = iommu_device_sysfs_add(&iommu->iommu, dev, NULL, dev_name(dev));
      |         ^~~

Pass this indirectly using "%s" as the format instead.
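
The same rule applies to any printf-style interface. A minimal userspace
sketch, using a hypothetical demo_set_name() helper in place of the kernel
function:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical printf-style helper standing in for a format-taking API. */
__attribute__((format(printf, 3, 4)))
int demo_set_name(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int ret;

	assert(buf && fmt);
	va_start(ap, fmt);
	ret = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return ret;
}

/* Pass the runtime string as an argument, never as the format itself. */
int demo_name_from_string(char *buf, size_t len, const char *name)
{
	return demo_set_name(buf, len, "%s", name);
}
```

With "%s" as the format, a name that happens to contain '%' is copied
verbatim instead of being parsed as a conversion specifier.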

Fixes: 917ace84b770 ("iommu: Add verisilicon IOMMU driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/amd: Fix premature break in init_iommu_one()

In init_iommu_one(), when processing IOMMU EFR attributes, the code checks
whether GASUP is enabled. If GASUP is not enabled, the code falls back to
legacy guest IR mode and then breaks out of the switch statement.

This break incorrectly skips the subsequent initialization steps that
follow the GASUP check. These initializations are independent of GASUP
support and must always be performed.

Fix this by replacing the early break with a conditional else block,
ensuring that the XTSUP check is only skipped when GASUP is not available.
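
The control-flow problem can be reproduced in a few lines. This is a
simplified model with hypothetical names and state fields, not the code
from init_iommu_one():

```c
#include <assert.h>
#include <stdbool.h>

enum { DEMO_TYPE_WITH_EFR = 1 };	/* hypothetical case label */

struct demo_state {
	bool legacy_ir;		/* fell back to legacy guest IR mode */
	bool xt_enabled;	/* result of the XTSUP-dependent step */
	bool later_init_done;	/* steps that must always run */
};

/* Before: the break in the fallback path also skips the later steps. */
void demo_init_buggy(struct demo_state *s, int type, bool gasup, bool xtsup)
{
	assert(s);
	switch (type) {
	case DEMO_TYPE_WITH_EFR:
		if (!gasup) {
			s->legacy_ir = true;
			break;	/* leaves the switch too early */
		}
		if (xtsup)
			s->xt_enabled = true;
		s->later_init_done = true;
		break;
	default:
		break;
	}
}

/* After: only the XTSUP step depends on GASUP. */
void demo_init_fixed(struct demo_state *s, int type, bool gasup, bool xtsup)
{
	assert(s);
	switch (type) {
	case DEMO_TYPE_WITH_EFR:
		if (!gasup)
			s->legacy_ir = true;
		else if (xtsup)
			s->xt_enabled = true;
		s->later_init_done = true;
		break;
	default:
		break;
	}
}
```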

Fixes: a44092e326d4 ("iommu/amd: Use IVHD EFR for early initialization of IOMMU features")
Reported-by: Sudheer Dantuluri <dantuluris@google.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

Merge tag 'alloc-7.2-rc1' of https://github.com/Rust-for-Linux/linux into rust-next

Pull alloc updates from Danilo Krummrich:

- Fix the 'Vec::reserve()' doctest to properly account for the existing
   vector length in the capacity assertion.

- Fix an incorrect operator in the 'Vec::extend_with()' SAFETY comment;
   add a doc test demonstrating basic usage and the zero-length case.

- Cleanup all imports in the alloc module and its doctests to use the
   "kernel vertical" import style.

* tag 'alloc-7.2-rc1' of https://github.com/Rust-for-Linux/linux:
  rust: alloc: cleanup doctest imports to "kernel vertical" style
  rust: alloc: cleanup imports and use "kernel vertical" style
  rust: alloc: fix `Vec::extend_with` SAFETY comment
  rust: alloc: add doc test for `Vec::extend_with`
  rust: alloc: fix assert in `Vec::reserve` doc test

drm/exynos: fix size_t format string

The exynos_gem->base.size argument is a size_t rather than an
unsigned long, so adapt the printk() format string accordingly:

In file included from drivers/gpu/drm/exynos/exynos_drm_gem.c:16:
drivers/gpu/drm/exynos/exynos_drm_gem.c: In function 'exynos_drm_alloc_buf':
drivers/gpu/drm/exynos/exynos_drm_gem.c:69:49: error: format '%lx' expects argument of type 'long unsigned int', but argument 6 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
   69 |         DRM_DEV_DEBUG_KMS(drm_dev_dma_dev(dev), "dma_addr(0x%lx), size(0x%lx)\n",
      |                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   70 |                         (unsigned long)exynos_gem->dma_addr, exynos_gem->base.size);
      |                                                              ~~~~~~~~~~~~~~~~~~~~~
      |                                                                              |
      |                                                                              size_t {aka unsigned int}

The dma_addr in the same line is already printed using a cast
to unsigned long, so change that similarly to use the correct
%pad format.
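
For the size_t half of the fix, the portable specifier is the 'z' length
modifier. A small userspace sketch with a hypothetical helper name (%pad
is a kernel printk extension and has no userspace equivalent, so it is not
shown):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* %zx matches size_t on both 32-bit and 64-bit ABIs; %lx does not. */
int demo_format_size(char *buf, size_t len, size_t size)
{
	assert(buf);
	return snprintf(buf, len, "size(0x%zx)", size);
}
```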

Fixes: 11e898373fba ("drm/exynos: Drop exynos_drm_gem.size field")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patch.msgid.link/20260527194525.45762-1-arnd@kernel.org

iommu, debugobjects: avoid gcc-16.1 section mismatch warnings

gcc-16 has gained some more advanced inter-procedural optimization
techniques that enable it to inline the dummy_tlb_add_page() and
dummy_tlb_flush() function pointers into a specialized version of
__arm_v7s_unmap:

WARNING: modpost: vmlinux: section mismatch in reference: __arm_v7s_unmap+0x2cc (section: .text) -> dummy_tlb_add_page (section: .init.text)
ERROR: modpost: Section mismatches detected.

From what I can tell, the transformation is correct, as this is only
called when __arm_v7s_unmap() is called from arm_v7s_do_selftests(),
which is also __init. Since __arm_v7s_unmap() however is not __init,
gcc cannot inline the inner function calls directly.

In debug_objects_selftest(), the same thing happens. Both the
caller and the leaf function are __init, but the IPA pulls
it into a non-init one:

WARNING: modpost: vmlinux: section mismatch in reference: lookup_object_or_alloc+0x7c (section: .text.lookup_object_or_alloc) -> is_static_object (section: .init.text)

Marking the affected functions as not "__init" would reliably avoid this
issue but is not a good solution because it removes an otherwise correct
annotation. I tried marking the functions as 'noinline', but that ended
up not covering all the affected configurations.

With some more experimenting, I found that marking these functions as
__attribute__((noipa)) is both logical and reliable.

In order to keep the syntax readable, add a custom macro for this in
include/linux/compiler_attributes.h next to other related macros and
use it to annotate both files.
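
A wrapper macro of this sort might look roughly like the sketch below. The
name demo_noipa is hypothetical (the macro added by the patch is not
quoted in this log), and the fallback to an empty definition is an
assumption for compilers without the attribute:

```c
#include <assert.h>

/* Hypothetical macro name; expands to nothing if noipa is unsupported. */
#if defined(__has_attribute)
# if __has_attribute(noipa)
#  define demo_noipa __attribute__((noipa))
# endif
#endif
#ifndef demo_noipa
# define demo_noipa
#endif

/* Excluded from inter-procedural inlining and cloning where supported. */
demo_noipa int demo_leaf(int x)
{
	return x + 1;
}

int demo_caller(int x)
{
	return demo_leaf(x) * 2;
}
```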

Link: https://lore.kernel.org/all/abRB6g-48ZX6Yl2r@willie-the-truck/
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: linux-kbuild@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

accel/ivpu: Remove disable_d0i3_msg workaround

All published NPU firmware versions support D0i3 delayed entry
flow, making this workaround obsolete. It was originally added as
a safety measure for potential firmware bugs.

Recent firmware dropped legacy D0i3 entry support, so the workaround
can't be used anyway. Hardcode d0i3_delayed_entry boot param to 1 to
ensure older firmware works in the correct mode.

No functional changes, just dead code cleanup.

Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com>
Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com>
Link: https://patch.msgid.link/20260526125521.594479-1-andrzej.kacprowski@linux.intel.com

platform/chrome: cros_ec_chardev: Introduce rwsem for protecting ec_dev

Introduce a rwsem for protecting `ec_dev` to prevent Use-After-Free on
the `ec_dev`.

- Writers: In driver's probe() and remove().
- Readers: In file operations.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20260525052654.4076429-5-tzungbi@kernel.org
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>

drm/i915/bw: Do not consider tile4 as tileY

For the purposes of memory bandwidth calculations tile4
should not be considered the same as tileY. Make it so.

This should not actually change anything as the affected
code only applies to pre-MTL integrated GPUs, which don't
have tile4.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-11-ville.syrjala@linux.intel.com
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>

drm/i915/bw: Remove deinterleave fallback for TGL+

Remove the deinterleave fallback calculation from the TGL+ codepath.
The fallback is using the ICL deinterleave calculation which was never
in the TGL+ algorithm. All supported memory types have the correct
deinterleave already specified for TGL+ anyway, so this is dead code.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-10-ville.syrjala@linux.intel.com
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>

drm/i915/bw: Round the PM demand bandwidth down

Bspec asks us to round down instead of to closest when doing the /100 for
the PM demand bandwidth. Make it so.
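
For reference, the three rounding directions that come up in this series
differ as sketched below. The helpers are userspace approximations for
unsigned values with hypothetical names, not the kernel macros or the
i915 code:

```c
#include <assert.h>

/* Round down: C integer division truncates for unsigned operands. */
unsigned int demo_div_down(unsigned int n, unsigned int d)
{
	assert(d != 0);
	return n / d;
}

/* Round to closest, with halves rounded up. */
unsigned int demo_div_closest(unsigned int n, unsigned int d)
{
	assert(d != 0);
	return (n + d / 2) / d;
}

/* Round up. */
unsigned int demo_div_up(unsigned int n, unsigned int d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}
```

For example, 1250 / 100 gives 12 when rounded down but 13 with either of
the other two helpers.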

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-9-ville.syrjala@linux.intel.com
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>

drm/i915/bw: Fix/unify peakbw calculations

We have several copies of the same memory peak bandwidth calculations,
and the rounding directions are all over the place in some of them.
Unify it all into one small function (with rounding matching what Bspec
says).

Note that 'channel_width' is always a multiple of 8 anyway, so for
'channel_width / 8' the rounding direction doesn't actually matter.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-8-ville.syrjala@linux.intel.com
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>

drm/i915/bw: Fix DEPROGBWPCLIMIT handling on BMG

DEPROGBWPCLIMIT is specified in %, so divide by 100 instead of 10.

Fortunately the deprogbwlimit is much lower than the peak memory
bandwidth on BMG, so whether we take 60% or 600% of the peak
bandwidth doesn't matter as the min() will pick the lower
deprogbwlimit anyway.

Eg. on the BMG here I get (with or without the fix):
QGV 0: deratedbw=33600 peakbw=48000
QGV 1: deratedbw=53000 peakbw=456000

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-7-ville.syrjala@linux.intel.com
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>

drm/i915/bw: Fix rounding direction in clperchgroup calculation

The '8/num_channels' in the clperchgroup is supposed to be rounded
down according to the spec. Make it so.

Not sure we can ever actually have a non-power of two number of
channels, so this might not matter.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-6-ville.syrjala@linux.intel.com
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>

drm/i915/bw: Fix 'deinterleave' rounding direction

For some reason we're rounding up when calculating the deinterleave
value. But the spec says we should round down. Fix it.

But I suppose this doesn't actually matter since the deinterleave
values should always be power of two. The only exception is therefore
the deinterleave==1 case, which gets handled by the max(..., 1).

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-5-ville.syrjala@linux.intel.com
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>

drm/i915/bw: Fix bw rounding direction

The DRAM bandwidth value should be rounded down, not up.

Bspec: 64631
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-4-ville.syrjala@linux.intel.com
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>

drm/i915/bw: Fix DCLK rounding mess

Fix up the total mess when calculating the DCLK
frequency. Some codepaths are trying to do both DIV_ROUND_UP()
and an open coded "round to nearest" at the same time. The
MTL+ codepath was the only one that was correct (using
DIV_ROUND_CLOSEST()).

Let's unify all of them, and borrow the actual '100/6'
approach from adl_calc_psf_bw() so that we get even less
rounding errors.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-3-ville.syrjala@linux.intel.com
Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com>

drm/i915/bw: Fix num_planes handling on TGL+

The TGL+ bw code has an off by one error on the num_planes
calculation, and tgl_max_bw_index() incorrectly bumps
the num_planes to 1 from 0.

That approach made sense on ICL where num_planes is more or
less a minimum number of planes to consider for the group,
but on TGL+ num_planes really is a maximum number of planes,
so these adjustments no longer make any sense there.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260522200346.17377-2-ville.syrjala@linux.intel.com
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>

platform/chrome: cros_ec_chardev: Add event relayer

Introduce an event relayer mechanism. Instead of each open file
registering directly with `ec_dev->event_notifier`, the platform device
registers a single relayer notifier. Individual files then register
with a local subscribers list in `chardev_pdata`.

This allows the driver to safely disconnect from the event chain
`ec_dev->event_notifier` during cros_ec_chardev_remove(), preventing
events from being delivered to open files after the device is removed,
while still allowing those files to be closed safely later.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20260525052654.4076429-4-tzungbi@kernel.org
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>

platform/chrome: cros_ec_chardev: Move data to chardev_pdata

Move `ec_dev` and `cmd_offset` from `chardev_priv` to `chardev_pdata` as
they are per-device properties but not per-open-file properties.

Hold a reference to `chardev_pdata` for each open file to ensure the
data remains valid even if the underlying platform device is removed.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20260525052654.4076429-3-tzungbi@kernel.org
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>

platform/chrome: cros_ec_chardev: Introduce chardev_data

Introduce struct chardev_pdata to hold platform driver data.

The platform driver data is allocated by kzalloc() instead of the devm
variant, allowing for managed cleanup that can eventually extend beyond
device removal if files are still open.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20260525052654.4076429-2-tzungbi@kernel.org
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>

ipv6: rpl: fix hdrlen overflow in ipv6_rpl_srh_decompress()

ipv6_rpl_srh_decompress() computes:

outhdr->hdrlen = (((n + 1) * sizeof(struct in6_addr)) >> 3);

hdrlen is __u8. For n >= 127 the result exceeds 255 and silently
truncates. With n=127 (cmpri=15, cmpre=15, pad=0, hdrlen=16):

(128 * 16) >> 3 = 256, truncated to 0 as __u8

The caller in ipv6_rpl_srh_rcv() then places the compressed header
at buf + ((ohdr->hdrlen + 1) << 3). With hdrlen=0 this is buf + 8,
but the decompressed region occupies buf[0..2055] (8-byte header
plus 128 full addresses). The compressed header overlaps the
decompressed data, and ipv6_rpl_srh_compress() writes into this
overlap, corrupting the routing header of the forwarded packet.

The existing guard at exthdrs.c:546 checks (n + 1) > 255, which
prevents n+1 from overflowing unsigned char (the segments_left
field), but does not prevent the computed hdrlen from overflowing
__u8. n=127 passes because 128 <= 255, yet hdrlen=256 does not
fit.

Tighten the bound to (n + 1) > 127. This caps n at 126, giving
hdrlen = (127 * 16) >> 3 = 254, which fits in __u8. The compressed
header then lands at buf + ((254 + 1) << 3) = buf + 2040, exactly
past the decompressed region (buf[0..2039]). No overlap. 127
segments is well beyond any realistic RPL deployment.
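
The arithmetic above can be checked with a short userspace sketch. The
helper names are hypothetical; only the formulas are taken from this
message:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IN6_ADDR_SIZE 16u	/* sizeof(struct in6_addr) */

/* hdrlen as stored in the header: the 8-octet count truncated to 8 bits. */
uint8_t demo_rpl_hdrlen(unsigned int n)
{
	return (uint8_t)(((n + 1) * DEMO_IN6_ADDR_SIZE) >> 3);
}

/* Offset at which the caller places the compressed header. */
size_t demo_compressed_offset(unsigned int n)
{
	return ((size_t)demo_rpl_hdrlen(n) + 1) << 3;
}

/* Decompressed size: 8-byte header plus n + 1 full addresses. */
size_t demo_decompressed_size(unsigned int n)
{
	return 8 + (size_t)(n + 1) * DEMO_IN6_ADDR_SIZE;
}

/* Tightened guard: accept only when n + 1 <= 127. */
int demo_segments_ok(unsigned int n)
{
	return (n + 1) <= 127;
}
```

With n = 127 the offset (8) falls inside the 2056-byte decompressed
region; with n = 126 the offset equals the decompressed size (2040), so
the two regions no longer overlap.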

Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr")
Signed-off-by: Rahul Chandelkar <rc@rexion.ai>
Link: https://patch.msgid.link/20260525154031.2290876-1-rc@rexion.ai
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

KVM: x86/pmu: Allow Host-Only/Guest-Only bits with nSVM and mediated PMU

Now that KVM correctly handles Host-Only and Guest-Only bits in the
event selector MSRs, allow the guest to set them if the vCPU advertises
SVM and uses the mediated PMU.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-14-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Reprogram Host/Guest-Only counters on nested transitions

Reprogram PMU counters on nested transitions for the mediated PMU, to
re-evaluate Host-Only and Guest-Only bits and enable/disable the PMU
counters accordingly. For example, if Host-Only is set and Guest-Only is
cleared, a counter should be disabled when entering guest mode and
enabled when exiting guest mode.

According to the APM, when EFER.SVME is cleared, setting Host-Only or
Guest-Only disables the counter, so also trigger counter reprogramming
when EFER.SVME is toggled.

Counters setting any of Host-Only and Guest-Only bits are already being
tracked in pmc_has_mode_specific_enables, so use the bitmap to reprogram
these counters.

Reprogram the counters synchronously on nested VMRUN/#VMEXIT and
EFER.SVME toggling. This is necessary as these instructions are counted
based on the new CPU state (after the instruction is retired in
hardware). Hence, the PMU needs to be updated before instruction
emulation is completed and kvm_pmu_instruction_retired() is called.

Defer reprogramming the counters when force leaving guest mode through
svm_leave_nested() to avoid potentially reading stale state (e.g.
incorrect EFER). All flows force leaving nested are non-architectural,
so accuracy is irrelevant.

Refactor a helper out of kvm_pmu_request_reprogram_counters() that
accepts a boolean allowing synchronous vs deferred reprogramming, and
use that from SVM code to support both scenarios.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-13-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Track mediated PMU counters with mode-specific enables

Instead of always checking if a counter needs to be disabled for
mode-specific reasons (e.g. Host-Only/Guest-Only bits in SVM), add a
bitmap to track such counters. Set the bit for counters using either
Host-Only or Guest-Only bits in EVENTSEL on SVM.

This bitmap will also be reused in following changes to selectively
apply changes to such counters.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-12-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Disable counters based on Host-Only/Guest-Only bits in SVM

Introduce an optional per-vendor PMU callback for checking if a counter
is disabled in the current mode, and register a callback on AMD to
disable a counter based on the vCPU's setting of Host-Only or Guest-Only
EVENT_SELECT bits with the mediated PMU.

If EFER.SVME is set, all events are counted if both bits are set or
cleared. If only one bit is set, the counter is disabled if the vCPU
context does not match the set bit.

If EFER.SVME is cleared, the counter is disabled if any of the bits is
set, otherwise all events are counted. Note that a Linux guest correctly
handles this and clears Host-Only when EFER.SVME is cleared, see commit
1018faa6cf23 ("perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM
disabled").
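
The rules in the two paragraphs above reduce to a small truth table. A
sketch of that logic as a standalone function, with a hypothetical name
and parameters that are not KVM's:

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if the counter should be disabled in the current context. */
bool demo_pmc_mode_disabled(bool svme, bool host_only, bool guest_only,
			    bool in_guest_mode)
{
	if (!svme)
		return host_only || guest_only;	/* any bit set disables it */
	if (host_only == guest_only)
		return false;	/* both set or both clear: count everything */

	/* Exactly one bit is set: disable when the context does not match. */
	return host_only ? in_guest_mode : !in_guest_mode;
}
```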

The callback is made from pmc_is_locally_enabled(), which is used for
the mediated PMU when updating eventsel_hw in
kvm_mediated_pmu_refresh_eventsel_hw(), as well as when checking what
PMCs count instructions/branches for emulation in
kvm_pmu_recalc_pmc_emulation().

Host-Only and Guest-Only bits are currently reserved, so this change is
a noop, but the bits will be allowed with mediated PMU in a following
change when fully supported.

Originally-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-11-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Add support for KVM_X86_PMU_OP_OPTIONAL_RET0

Add definitions for KVM_X86_PMU_OP_OPTIONAL_RET0() to resolve to
__static_call_return0, similar to KVM_X86_OP_OPTIONAL_RET0(). Move the
definition of kvm_pmu_call() to pmu.h, and add declarations for the
static PMU calls in the header to allow making callbacks from the header
in following changes.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-10-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Check mediated PMU counter enablement before event filters

If the guest disables the counter (by clearing
ARCH_PERFMON_EVENTSEL_ENABLE), KVM still performs the PMU filter lookup,
even though it doesn't end up changing eventsel_hw. Check if the
counter is enabled by the guest before doing the potentially expensive
PMU filter lookup.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-9-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Do a single atomic OR when reprogramming counters

Do a single atomic OR using the atomic overlay of reprogram_pmi bitmask,
instead of one atomic set_bit() call per counter.
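
The difference can be illustrated with C11 atomics in userspace. This is
only an analogy with hypothetical function names; the kernel uses its own
atomic and bitmap primitives:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One atomic read-modify-write per requested counter bit. */
void demo_request_per_bit(_Atomic uint64_t *pending, uint64_t counters)
{
	unsigned int bit;

	assert(pending);
	for (bit = 0; bit < 64; bit++) {
		if (counters & (UINT64_C(1) << bit))
			atomic_fetch_or(pending, UINT64_C(1) << bit);
	}
}

/* A single atomic OR publishes the whole mask at once. */
void demo_request_single_or(_Atomic uint64_t *pending, uint64_t counters)
{
	assert(pending);
	atomic_fetch_or(pending, counters);
}
```

Both variants leave the same bits set; the second performs one atomic
operation regardless of how many counters are requested.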

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-8-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Rename reprogram_counters() to clarify usage

Rename reprogram_counters() to kvm_pmu_request_counters_reprogram()
clarifying that it is more similar to
kvm_pmu_request_counter_reprogram(), and less similar to
reprogram_counter(). The kvm_pmu_* prefix is also appropriate as the
function is exposed in the header.

Opportunistically rename the argument from 'diff' to 'counters'.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-7-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move enable_pmu/enable_mediated_pmu to pmu.h and pmu.c

The declaration and definition of enable_pmu/enable_mediated_pmu
semantically belong in pmu.h and pmu.c, and more importantly, pmu.h
uses enable_mediated_pmu and relies on the caller including x86.h.

There is already precedent for other module params defined outside of
x86.c, so move enable_pmu/enable_mediated_pmu to pmu.c.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-6-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: nSVM: Move VMRUN instruction retirement after entering guest mode

A successful VMRUN retires in guest mode and should be counted by the
PMU as a guest instruction. Move the call to
kvm_pmu_instruction_retired() after potentially entering guest mode,
such that VMRUN is counted correctly.

The PMU event will be matched against L2's CPL, but otherwise this does
not change the behavior in terms of guest vs. host, because KVM does
not virtualize Host-Only/Guest-Only PMC controls yet, so all
instructions are counted regardless of the vCPU's host/guest state. But
this change is needed for the incoming support for Host-Only/Guest-Only
controls to count VMRUN correctly.

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-5-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: nSVM: Unify RIP and PMU handling calls when emulating VMRUN

The code paths for advancing RIP and retiring the instruction for RIP
are very similar whether or not caching vmcb12 succeeds. The only
difference is handling mapping failures (i.e. EFAULT).

Pull the mapping failure handling out and unify the calls to
svm_skip_emulated_instruction() and kvm_pmu_instruction_retired(), but
return immediately after if copying and caching vmcb12 failed. A nice
side effect of this is that the FIXME comment is now above the only code
path calling svm_skip_emulated_instruction().

Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-4-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: nSVM: Bail early out of VMRUN emulation if advancing RIP fails

If svm_skip_emulated_instruction() fails, then RIP could not be
advanced correctly (e.g. decode failure when NextRIP is not available).
KVM will exit to userspace to handle the emulation failure, but only
after stuffing the wrong RIP into vmcb01 and entering guest mode.

Bail early and exit to userspace before committing any side-effects of
emulating the VMRUN (e.g. entering guest mode).

Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-3-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: nSVM: Stop leaking single-stepping on VMRUN into L2

According to the APM, TF on VMRUN causes a #DB after VMRUN completes on
the _host_ side. However, KVM injects a #DB in L2 context instead (or
exits to userspace if KVM_GUESTDBG_SINGLESTEP is set) in
kvm_skip_emulated_instruction().

Avoid single-step handling on VMRUN by open-coding the rest of
kvm_skip_emulated_instruction() in nested_svm_vmrun(). This doesn't look
pretty, but following changes will need to open-code
kvm_pmu_instruction_retired() anyway, and will clean up the code. This
ignores TF on VMRUN instead of injecting a spurious exception into
L2. Document this virtualization hole with a FIXME.

Note that a failed VMRUN would have been correctly single-stepped, but
now TF is always ignored for consistency and simplicity purposes. VMX
does not support TF on a successful VMLAUNCH/VMRESUME, so it's unlikely
that single-stepping VMRUN properly is important, especially if it's
only for failed VMRUNs.

Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>

net: sch_fq: update flow delivery time on earlier EDT packet

When inserting an EDT packet with time before flow->time_next_packet,
update the flow and possibly queue next delivery time.

Reinsert the flow into the q->delayed rb-tree to position correctly
and to have fq_check_throttled set wake-up at the right next time.

Factor RB tree insertion out of fq_flow_set_throttled to avoid open
coding twice.

EDT packets do not take precedence over queue rate limit. Skip this
new step if a queue limit is set. EDT packets do take precedence over
per-socket rate limits, as can be seen from fq_dequeue reading
sk_pacing_rate if !skb->tstamp.

With this change the so_txtime selftest sends packets in the expected
order.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260526134109.2624493-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'introduce-airoha-an8801r-series-gigabit-ethernet-phy-driver'

Louis-Alexis Eyraud says:

====================
Introduce Airoha AN8801R series Gigabit Ethernet PHY driver

This series introduces the Airoha AN8801R Gigabit Ethernet PHY initial
support.

The Airoha AN8801R is a low-power single-port Ethernet PHY transceiver
with a single-port SerDes interface for 1000Base-X/RGMII.
This chip is compliant with 10Base-T, 100Base-TX and 1000Base-T IEEE
802.3(u,ab) and supports:
  - Energy Efficient Ethernet (802.3az)
  - Full Duplex Flow Control (802.3x)
  - auto-negotiation
  - crossover detection and autocorrection
  - Wake-on-LAN with Magic Packet
  - Jumbo Frames up to 9 Kilobytes
This PHY also supports up to three user-configurable LEDs, which are
usually used for LAN Activity, 100M, 1000M indication.

The series provides the devicetree binding and the driver that have been
written by AngeloGioacchino Del Regno, based on downstream
implementation ([1]). The driver allows setting up PHY LEDs, 10/100M,
1000M speeds, and Wake on LAN and PHY interrupts.

Since v2, the series also adds the air_phy_lib library, whose goal is to
share common code between the air_en8811h and air_an8801 drivers, and
makes both drivers use it. The first shared functions are the existing
BuckPBus register accessors and the air_phy_read/write_page functions
coming from the air_en8811h driver.

The series is based on net-next kernel tree (sha1: 90d03ee2c5dc) and
I have tested it on Mediatek Genio 720-EVK board (that integrates an
Airoha AN8801RIN/A Ethernet PHY) with early board hardware enablement
patches.

[1]: https://gitlab.com/mediatek/aiot/bsp/linux/-/blob/mtk-v6.6/drivers/net/phy/an8801.c
====================

Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-0-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: air_an8801: ensure maximum available speed link use

To ensure that the Airoha AN8801R PHY uses the maximum available link
speed, an additional register write is needed to configure the function
mode for either 1G or 100M/10M operation after link detection.

So, in air_an8801 driver, implement a custom read_status callback, that
after genphy_read_status determines the link speed, sets the bit 0 of
the link mode register (REG_LINK_MODE) if the detected speed is 1Gbps,
or unsets it otherwise.
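
The register update described above amounts to a single-bit
read-modify-write. The sketch below is not taken from the driver:
LINK_MODE_1G_BIT and link_mode_for_speed() are hypothetical names, and
only the rule "set bit 0 at 1 Gbps, clear it otherwise" comes from the
text.

```c
#include <stdint.h>

/* Hypothetical name for bit 0 of the link mode register. */
#define LINK_MODE_1G_BIT 0x1u

/*
 * Return the new link mode register value for a detected speed in Mbps:
 * bit 0 is set for 1000 Mbps and cleared for any other speed. All other
 * bits are preserved.
 */
uint32_t link_mode_for_speed(uint32_t reg, int speed_mbps)
{
	if (speed_mbps == 1000)
		return reg | LINK_MODE_1G_BIT;
	return reg & ~LINK_MODE_1G_BIT;
}
```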

Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-6-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Introduce Airoha AN8801R Gigabit Ethernet PHY driver

Introduce a driver for the Airoha AN8801R Series Gigabit Ethernet
PHY; this currently supports setting up PHY LEDs, 10/100M, 1000M
speeds, and Wake on LAN and PHY interrupts.

Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-5-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Rename Airoha common BuckPBus register accessors

Rename the BuckPBus register accessors functions present in air_phy_lib
and their calls in air_en8811h driver, so all exported functions start
with the same prefix.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-4-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: air_phy_lib: Factorize BuckPBus register accessors

In preparation of Airoha AN8801R PHY support, move the BuckPBus
register accessors and definitions, present in air_en8811h driver,
into the Airoha PHY shared code (air_phy_lib), so they will be usable
by the new driver without duplicating them.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-3-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Add Airoha phy library for shared code

In preparation of Airoha AN8801R PHY support, split out the interface
functions that will be common between the already present air_en8811h
driver and the new one, and put them into a new library named
air_phy_lib.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-2-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: Add support for Airoha AN8801R GbE PHY

Add a new binding to support the Airoha AN8801R Series Gigabit
Ethernet PHY.

Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Link: https://patch.msgid.link/20260526-add-airoha-an8801-support-v5-1-01aea8dee69b@collabora.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: cls_bpf: prevent unbounded recursion in offload rollback

Quan Sun reported [1] a stack overflow in cls_bpf_offload_cmd().

Reproducer on netdevsim: add a skip_sw cls_bpf filter, set the
bpf_tc_accept debugfs knob to 0, then `tc filter replace`. The replace
calls tc_setup_cb_replace() which fails. cls_bpf_offload_cmd() then
swaps prog/oldprog and recursively calls itself to roll back. But
bpf_tc_accept=0 makes the rollback fail too, which triggers yet another
rollback frame with the same arguments, and so on until the stack is
exhausted.

bpf_tc_accept is just a convenient knob for the reproducer. Any driver
whose tc_setup_cb_replace() fails twice in a row can hit the same loop,
so this is not a netdevsim-only issue.

Two ways to fix it:

  1) Have the rollback call tc_setup_cb_add() on oldprog instead of
     re-entering cls_bpf_offload_cmd().
  2) Mark the rollback frame with a flag and skip a second-level
     rollback from inside it.

Go with (2). It is the smaller change and keeps the original behaviour:
the rollback still goes through tc_setup_cb_replace(), so the driver
gets one real chance to restore its state. If that attempt also fails,
we just return the original error instead of recursing.
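
Option (2) can be sketched in plain C as follows. This is an
illustration under stated assumptions, not the cls_bpf code: struct
fake_dev, fake_replace() and offload_cmd() are hypothetical, and the
in_rollback parameter stands in for whatever flag the real patch uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver that can refuse every request. */
struct fake_dev {
	bool accept;	/* false makes every replace request fail */
	int calls;	/* number of replace attempts seen */
};

int fake_replace(struct fake_dev *dev)
{
	dev->calls++;
	return dev->accept ? 0 : -1;
}

/*
 * Try to install prog in place of oldprog. On failure, make exactly one
 * rollback attempt with the arguments swapped. The in_rollback flag marks
 * the rollback frame so that a failure inside it does not recurse again.
 * The original error is returned either way.
 */
int offload_cmd(struct fake_dev *dev, const char *prog,
		const char *oldprog, bool in_rollback)
{
	int err = fake_replace(dev);

	if (err && !in_rollback && oldprog)
		offload_cmd(dev, oldprog, prog, true);
	return err;
}
```

With a device that always refuses, the sketch makes two attempts (the
request and one rollback) and then stops, instead of recursing until the
stack is exhausted.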

[1]: https://lore.kernel.org/bpf/ce5a6005-3c5e-4696-9e05-eba9461dc860@std.uestc.edu.cn/T/#u

Fixes: 102740bd9436 ("cls_bpf: fix offload assumptions after callback conversion")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260526025529.24382-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ethtool-more-bug-fixes'

Jakub Kicinski says:

====================
ethtool: more bug fixes

Last week I sent two patch sets - one fixing bugs in RSS handling,
and one fixing CMIS / module handling. This set contains the remaining
fixes. There's a concentration of fixes around PHY and timestamp config
handling but not enough to break those out as separate sets.
====================

Link: https://patch.msgid.link/20260526153533.2779187-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: eeprom: add more safeties to EEPROM Netlink fallback

The Netlink fallback path for reading module EEPROM
(fallback_set_params()) validates that offset < eeprom_len,
but does not check that offset + length stays within eeprom_len.
The ioctl equivalent (ethtool_get_any_eeprom() in ioctl.c) has
always enforced both bounds:

    if (eeprom.offset + eeprom.len > total_len)
        return -EINVAL;

This could lead to surprises in both drivers and device FW.
Add the missing offset + length validation to fallback_set_params(),
mirroring the ioctl.

Similarly, ethtool core in general, and ethtool_get_any_eeprom()
in particular, tries to zero-init all buffers passed to the drivers
so that drivers need no extra work to zero things out. eeprom_fallback()
uses a plain kmalloc(); change it to zalloc.
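
A minimal userspace sketch of the two bounds checks discussed above.
eeprom_check_range() is a hypothetical helper, not the function touched
by the patch, and writing the second check as a subtraction (so the sum
cannot wrap) is a choice made for this sketch rather than something the
text specifies.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a read request against an EEPROM of total_len bytes.
 * The offset must lie inside the EEPROM, and offset + length must not
 * run past its end. The second check is written as a subtraction so it
 * cannot wrap around for large lengths.
 */
int eeprom_check_range(uint32_t offset, uint32_t length, uint32_t total_len)
{
	if (offset >= total_len)
		return -EINVAL;
	if (length > total_len - offset)
		return -EINVAL;
	return 0;
}
```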

Fixes: 96d971e307cc ("ethtool: Add fallback to get_module_eeprom from netlink command")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: eeprom: add missing ethnl_ops_begin() / _complete() during fallback

All ethtool driver op calls should be sandwiched between
ethnl_ops_begin() / ethnl_ops_complete(). In Netlink eeprom code,
if the paged access failed we fall back to old API, but we
first call _complete() and the fallback never does its own
ethnl_ops_begin(). Move the fallback into the _begin() / _complete()
section.
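
The intended shape can be modelled with a small sketch in which every
driver call checks whether it runs inside a begin/complete section. All
names here (struct dev_state, ops_begin(), ops_complete(), driver_op(),
read_eeprom()) are hypothetical stand-ins, not the ethtool functions.

```c
#include <stdbool.h>

struct dev_state {
	int open_sections;	/* begin() calls not yet matched by complete() */
	int unguarded_calls;	/* driver ops issued outside a section */
};

void ops_begin(struct dev_state *d)
{
	d->open_sections++;
}

void ops_complete(struct dev_state *d)
{
	d->open_sections--;
}

/* Hypothetical driver op; records calls made outside a section. */
int driver_op(struct dev_state *d, bool fail)
{
	if (d->open_sections == 0)
		d->unguarded_calls++;
	return fail ? -1 : 0;
}

/* Paged read first, legacy fallback second, both inside one section. */
int read_eeprom(struct dev_state *d, bool paged_fails)
{
	int err;

	ops_begin(d);
	err = driver_op(d, paged_fails);	/* paged access */
	if (err)
		err = driver_op(d, false);	/* fall back to the old API */
	ops_complete(d);
	return err;
}
```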

Fixes: 96d971e307cc ("ethtool: Add fallback to get_module_eeprom from netlink command")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: strset: fix header attribute index in ethnl_req_get_phydev()

strset_prepare_data() passes ETHTOOL_A_HEADER_FLAGS (3) as the header
attribute to ethnl_req_get_phydev(). This is incorrect, in the main
attr space 3 is ETHTOOL_A_STRSET_COUNTS_ONLY, not the request
header attr. The correct constant is ETHTOOL_A_STRSET_HEADER (1).

ethnl_req_get_phydev() only uses this value for the extack,
so this is not a "functionally visible"(?) bug.

Fixes: e96c93aa4be9 ("net: ethtool: strset: Allow querying phy stats by index")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: tsinfo: don't pass ERR_PTR to genlmsg_cancel on prepare failure

The goto err label leads to:

    genlmsg_cancel(skb, ehdr);
    return ret;

If ethnl_tsinfo_prepare_dump() failed, it has not started a genlmsg.
There's nothing to cancel, and passing an error pointer to
genlmsg_cancel() would cause a crash.

Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: tsinfo: fix uninitialized stats on the by-PHC path

tsinfo_prepare_data() has two code paths: a "by-PHC" path for
user-specified hardware timestamping providers, and the old path.
Commit 89e281ebff72 ("ethtool: init tsinfo stats if requested") added
ethtool_stats_init() to mark stat slots as ETHTOOL_STAT_NOT_SET before
the driver callback populates them, but placed the call inside the
old-path block.

When commit b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to
support several hwtstamp by net topology") added the by-PHC early
return, it landed above the stats initialization. On that path
the stats array retains the zero-fill from ethnl_init_reply_data()'s
zalloc. This leads to the reply including a stats nest with four
zero-valued attributes that should have been absent.

Reject GET requests for stats with HWTSTAMP_PROVIDER or dump.

Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: tsconfig: fix missing ethnl_ops_complete()

tsconfig_prepare_data() calls ethnl_ops_begin(), so we need to call
ethnl_ops_complete() before returning the error.

Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: pse-pd: fix missing ethnl_ops_complete()

pse_prepare_data() is missing ethnl_ops_complete() if
ethnl_req_get_phydev() returned an error. Move getting
phydev up so that we don't have to worry about this
(similar order to linkstate_prepare_data()).

Note that phydev may still be NULL (this is checked in
pse_get_pse_attributes()), the goal isn't really to avoid
the _begin() / _complete() calls, only to simplify the error
handling.

While at it propagate the original error. Why this code
overrides the error with -ENODEV but !phydev generates
-EOPNOTSUPP is unclear to me...

Fixes: 31748765bed3 ("net: ethtool: pse-pd: Target the command to the requested PHY")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: linkstate: fix unbalanced ethnl_ops_complete() on PHY lookup error

linkstate_prepare_data() calls ethnl_req_get_phydev() before
ethnl_ops_begin(), but routes its error path through "goto out"
which calls ethnl_ops_complete().

Fixes: fe55b1d401c6 ("ethtool: linkstate: migrate linkstate functions to support multi-PHY setups")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: tsconfig: fix reply error handling

A couple of trivial bugs in error handling in tsconfig_send_reply().
If we failed to allocate rskb we need to set the error.
If we did allocate it but failed to send it - we need to remember
to free it.

Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES

ethnl_update_profile() walks the ETHTOOL_A_PROFILE_IRQ_MODERATION
nest list with an index 'i' and writes new_profile[i++] without
bounding i. The destination is kmemdup()'d at NET_DIM_PARAMS_NUM_PROFILES
entries (5), but the Netlink nest count is entirely user-controlled.
Netlink policies do not have support for constraining the number
of nested entries (or number of multi-attr entries).
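
Since the policy cannot bound the entry count, the bound has to be
checked in the loop itself. The sketch below shows that pattern in
userspace C. update_profile() and NUM_PROFILES are hypothetical names
(the latter mirrors the value 5 given above), and rejecting an oversized
list with -E2BIG is an assumption of this sketch; the text does not say
which error the patch returns.

```c
#include <errno.h>
#include <stddef.h>

#define NUM_PROFILES 5	/* mirrors NET_DIM_PARAMS_NUM_PROFILES from the text */

/*
 * Copy caller-supplied entries into a fixed-size profile array.
 * The entry count comes from the caller and is not trusted: the index is
 * checked before every write, and an oversized list is rejected.
 */
int update_profile(int dst[NUM_PROFILES], const int *src, size_t n_src)
{
	size_t i = 0;

	while (i < n_src) {
		if (i >= NUM_PROFILES)
			return -E2BIG;
		dst[i] = src[i];
		i++;
	}
	return 0;
}
```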

Fixes: f750dfe825b9 ("ethtool: provide customized dim profile management")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: page_pool: silence static analysis warnings in page_pool_nl_stats_fill()

nla_nest_start() can return NULL if the skb runs out of space.

Jakub:
There is no bug here, if nla_nest_start() failed there's no space
left in the message. Next nla_put_uint() will also fail and we will
exit via nla_nest_cancel() which handles NULL just fine.
Various people keep sending us this patch so let's commit this.

Signed-off-by: Zhao Dongdong <zhaodongdong@kylinos.cn>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/tencent_A82EBAB365A8B888B66FDCF115A3DCB8880A@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv6-frags-adopt-__in6_dev_stats_get-a-bit-more'

Eric Dumazet says:

====================
ipv6: frags: adopt __in6_dev_stats_get() a bit more

First patch addresses Sashiko's feedback about a potential
NULL dereference in __in6_dev_stats_get().

Second patch adopts __in6_dev_stats_get() in net/ipv6/reassembly.c.
====================

Link: https://patch.msgid.link/20260526145529.3587126-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: frags: cleanup __IP6_INC_STATS() confusion

After commits e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original
netdev") and bdb7cc643fc9 ("ipv6: Count interface receive statistics
on the ingress netdev") net/ipv6/reassembly.c uses three different
ways to reach idev in various __IP6_INC_STATS() calls.

- ip6_dst_idev(skb_dst(skb))
- __in6_dev_get_safely(skb->dev)
- __in6_dev_stats_get(skb->dev)

Let's centralize this from ipv6_frag_rcv() and use __in6_dev_stats_get().

Note that ipv6_frag_rcv() tests if skb->dev could be NULL already, so
I chose to also guard against NULL, but we probably can remove the
tests in a followup patch, because I do not think skb->dev could be NULL.

    iif = skb->dev ? skb->dev->ifindex : 0;

idev can be NULL, __IP6_INC_STATS() deals with this possibility.

Small code size reduction as a bonus.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-145 (-145)
Function                                     old     new   delta
ipv6_frag_rcv                               2399    2362     -37
ip6_frag_reasm                               705     597    -108
Total: Before=31455552, After=31455407, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526145529.3587126-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: guard against possible NULL deref in __in6_dev_stats_get()

dev_get_by_index_rcu() could return NULL if the original physical
device is unregistered.
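
The guard has the usual "lookup may fail" shape, sketched here in
userspace C. struct net_dev, dev_lookup() and dev_stats_get() are
hypothetical stand-ins for the kernel structures and helpers; only the
pattern of checking the lookup result before dereferencing it is taken
from the text.

```c
#include <stddef.h>

struct net_dev {
	int ifindex;
	int *stats;
};

/* Hypothetical lookup: returns NULL when no device has that index. */
struct net_dev *dev_lookup(struct net_dev *tbl, size_t n, int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ifindex == ifindex)
			return &tbl[i];
	return NULL;
}

/* Return the device's stats pointer, or NULL if the device is gone. */
int *dev_stats_get(struct net_dev *tbl, size_t n, int ifindex)
{
	struct net_dev *dev = dev_lookup(tbl, n, ifindex);

	if (!dev)	/* device no longer registered */
		return NULL;
	return dev->stats;
}
```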

Found by Sashiko.

Fixes: e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original netdev")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526145529.3587126-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: add quirk for OEM 2.5G optical modules

Some OEM-branded SFP modules are incorrectly detected as
1000Base-X and fail to establish link on 2.5G-capable ports.

These modules do not properly advertise 2500Base-X capability
in their EEPROM and require forcing the correct SerDes mode.

Add sfp_quirk_2500basex for:

  - OEM SFP-2.5G-LH03-B
  - OEM SFP-2.5G-LH20-A

Both modules report:

  Vendor name: OEM
  Vendor PN: SFP-2.5G-LH03-B / SFP-2.5G-LH20-A

Tested on OpenWrt with successful 2.5G link establishment.

Signed-off-by: Wei Qisen <weixiansen574@163.com>
Link: https://patch.msgid.link/20260526055206.1750-1-weixiansen574@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bridge-fix-sleep-in-atomic-context'

Ido Schimmel says:

====================
bridge: Fix sleep in atomic context

Under certain circumstances the bridge driver can call
dev_set_promiscuity() while holding the bridge spin lock. This is a
problem as dev_set_promiscuity() might sleep.

Patches #1-#2 fix the problem in the netlink and sysfs configuration
paths by only taking the lock where it is actually needed, thereby
avoiding calling dev_set_promiscuity() from an atomic context.

Patch #3 adds test cases for both configuration paths in rtnetlink.sh
which already includes test cases for similar issues.

Note that dev_set_promiscuity() can sleep either when it takes the net
device mutex or when calling netif_rx_mode_sync(). I encountered the
problem with the latter, but blamed the former since it came earlier.
====================

Link: https://patch.msgid.link/20260526064818.272516-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rtnetlink: Add bridge promiscuity tests

Add two test cases that always pass, but trigger sleeping in atomic
context BUGs without "bridge: Fix sleep in atomic context in netlink
path" and "bridge: Fix sleep in atomic context in sysfs path".

Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526064818.272516-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: Fix sleep in atomic context in sysfs path

Since the start of the git history, brport_store() always acquired the
bridge lock. Back then this decision made sense: The bridge lock
protects the STP state of the bridge and its ports and at that time the
function was only used by two STP related attributes (cost and
priority).

Nowadays, brport_store() processes a lot more attributes and most of
them do not need the bridge lock:

* Bridge flags: Only require RTNL. Read locklessly by the data path.
  Annotations can be added in net-next.

* FDB port flushing: Only requires the FDB lock.

* Multicast attributes: Only require the multicast lock.

* Group forward mask: Only requires RTNL. Read locklessly by the data
  path. Annotations can be added in net-next.

* Backup port: Only requires RTNL. Read locklessly by the data path.

This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].

Fix this by reducing the scope of the bridge lock and only take it when
processing the two STP related attributes that require it. Remove the
now stale comment from br_switchdev_set_port_flag(). The
SWITCHDEV_F_DEFER flag can be removed in net-next.

[1]
BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
5 locks held by bash/372:
#0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740)
#1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343)
#2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344)
#3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326)
#4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345)
Preemption disabled at:
0x0
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
__might_resched.cold (kernel/sched/core.c:9163)
netif_rx_mode_run (net/core/dev_addr_lists.c:1262)
netif_rx_mode_sync (net/core/dev_addr_lists.c:1428)
dev_set_promiscuity (net/core/dev_api.c:289)
br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172)
br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747)
store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235)
brport_store (net/bridge/br_sysfs_if.c:346)
kernfs_fop_write_iter (fs/kernfs/file.c:352)
new_sync_write (fs/read_write.c:595)
vfs_write (fs/read_write.c:688)
ksys_write (fs/read_write.c:740)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Fixes: 78cd408356fe ("net: add missing instance lock to dev_set_promiscuity")
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bridge: Fix sleep in atomic context in netlink path

Since the introduction of the netlink configuration path for bridge
ports in commit 25c71c75ac87 ("bridge: bridge port parameters over
netlink"), br_setport() was always called with the bridge lock held
around it. Back then this decision made sense: The bridge lock protects
the STP state of the bridge and its ports and at that time the function
only processed three STP related netlink attributes (cost, priority and
state).

Nowadays, br_setport() processes a lot more attributes and most of them
do not need the bridge lock:

* Bridge flags: Only require RTNL. Read locklessly by the data path.
  Annotations can be added in net-next.

* FDB port flushing: Only requires the FDB lock.

* Multicast attributes: Only require the multicast lock.

* Group forward mask: Only requires RTNL. Read locklessly by the data
  path. Annotations can be added in net-next.

* Backup port and NHID: Only require RTNL. Read locklessly by the data
  path.

This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].

Fix this by reducing the scope of the bridge lock and only take it when
processing the three STP related attributes that require it. This is
consistent with the multicast attributes where each attribute acquires
the multicast lock instead of having one critical section for all
relevant attributes.
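
The narrowing of the critical section can be modelled with a toy lock
that records whether a possibly sleeping function was called while it was
held. Everything in this sketch is hypothetical (struct bridge,
br_lock(), br_unlock(), set_promisc(), set_port()); it shows only the
shape of the fix, not the bridge code.

```c
#include <stdbool.h>
#include <stddef.h>

struct bridge {
	bool lock_held;
	int stp_cost;
	bool promisc;
	int atomic_violations;	/* sleeping calls made under the lock */
};

void br_lock(struct bridge *br)
{
	br->lock_held = true;
}

void br_unlock(struct bridge *br)
{
	br->lock_held = false;
}

/* Stand-in for a function that may sleep; flags calls under the lock. */
void set_promisc(struct bridge *br, bool on)
{
	if (br->lock_held)
		br->atomic_violations++;
	br->promisc = on;
}

/* Take the lock only around the attribute that actually needs it. */
void set_port(struct bridge *br, const int *cost, const bool *promisc)
{
	if (cost) {
		br_lock(br);
		br->stp_cost = *cost;
		br_unlock(br);
	}
	if (promisc)
		set_promisc(br, *promisc);	/* runs outside the lock */
}
```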

[1]
BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 356, name: bridge
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
2 locks held by bridge/356:
#0: ffffffff919473a0 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg (net/core/rtnetlink.c:80 net/core/rtnetlink.c:7002)
#1: ffff888115072d58 (&br->lock){+...}-{3:3}, at: br_setlink (./include/linux/spinlock.h:348 net/bridge/br_netlink.c:1117)
Preemption disabled at:
0x0
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
__might_resched.cold (kernel/sched/core.c:9163)
netif_rx_mode_run (net/core/dev_addr_lists.c:1262)
netif_rx_mode_sync (net/core/dev_addr_lists.c:1428)
dev_set_promiscuity (net/core/dev_api.c:289)
br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172)
br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747)
br_setport (net/bridge/br_netlink.c:1000)
br_setlink (net/bridge/br_netlink.c:1118)
rtnl_bridge_setlink (net/core/rtnetlink.c:5572)
rtnetlink_rcv_msg (net/core/rtnetlink.c:7005)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
netlink_unicast (net/netlink/af_netlink.c:1318 net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
__sock_sendmsg (net/socket.c:787 (discriminator 4) net/socket.c:802 (discriminator 4))
____sys_sendmsg (net/socket.c:2698)
___sys_sendmsg (net/socket.c:2752)
__sys_sendmsg (net/socket.c:2784)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Fixes: 78cd408356fe ("net: add missing instance lock to dev_set_promiscuity")
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526064818.272516-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

KVM: TDX: Move external page table freeing to TDX code

Move the freeing of external page tables into the reclaim operation that
lives in TDX code.

The TDP MMU supports traversing the TDP without holding locks. Page tables
need to be freed via RCU to prevent walking one that gets freed.

While none of these lockless walk operations actually happen for the mirror
page table, the TDP MMU nonetheless frees the mirror page table in the same
way, and (because it's a handy place to plug it in) the external page table
as well.

However, the external page table definitely can't be walked once the page
table pages are reclaimed from the TDX module. The TDX module releases the
page for the host VMM to use, so this RCU-time free is unnecessary for the
external page table.

So move the free_page() call to TDX code. Create a
tdp_mmu_free_unused_sp() to allow for freeing external page tables that
have never left the TDP MMU code (i.e. don't need to be freed in a special
way).

Link: https://lore.kernel.org/kvm/aYpjNrtGmogNzqwT@google.com
[Based on a diff by Sean, added log]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075740.4371-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move error handling inside free_external_spt()

Move the logic for TDX's specific need to leak pages when reclaim
fails inside the free_external_spt() op, so this can be done in TDX
specific code and not the generic MMU.

Do this by passing in "sp" instead of the external page table pointer.
This way, TDX code can set sp->external_spt to NULL. Since the error is now
handled internally in TDX code (by triggering KVM_BUG_ON() or
TDX_BUG_ON_3(), which warn and stop the VM on any error), change the op to
return void. This way it also operates like a normal free in that success
is guaranteed from the caller's perspective.

Opportunistically, drop the unused level and gfn args while adjusting the
sp arg.

[ Rick: Re-wrote log and massaged op name ]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[ Yan: Updated patch log/function comment, dropped unused param in op ]
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075730.4354-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Rename tdx_sept_remove_private_spte() to show it's for leaf SPTEs

Rename tdx_sept_remove_private_spte() to tdx_sept_remove_leaf_spte() to
clearly show that this function is for removal of leaf SPTEs.

No functional change intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075719.4338-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Drop kvm_x86_ops.remove_external_spte()

Drop kvm_x86_ops.remove_external_spte(), and instead handle the removal of
leaf SPTEs in the S-EPT (a.k.a. external page table) in
kvm_x86_ops.set_external_spte(). This will also allow extending
tdx_sept_set_private_spte() to support splitting a huge S-EPT entry without
needing yet another kvm_x86_ops hook.

Now all changes for removing leaf mirror SPTEs are propagated through
kvm_x86_ops.set_external_spte().
- When removing leaf mirror SPTEs under shared mmu_lock (though currently
  no path can trigger this scenario and TDX does not support this
  scenario), tdx_sept_remove_private_spte() may produce a warning due to
  lockdep_assert_held_write() or may return -EIO and trigger TDX_BUG_ON()
  due to concurrent BLOCK, TRACK, REMOVE.
- When removing leaf mirror SPTEs under exclusive mmu_lock, all errors are
  unexpected. If any error occurs in this scenario,
  tdx_sept_remove_private_spte() will return -EIO and trigger KVM_BUG_ON().
  A redundant KVM_BUG_ON() call will also be triggered in TDP MMU core in
  handle_changed_spte(), which is benign (the WARN will fire if and only if
  the VM isn't already bugged).

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075709.4322-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte()

Arrange tdx_sept_remove_private_spte() (and its tdx_track() helper) to be
above tdx_sept_set_private_spte() in anticipation of routing all S-EPT
writes (with the exception of reclaiming non-leaf pages) through the "set"
API.

No functional change intended.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075658.4306-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Drop KVM_BUG_ON() on shared lock to zap child external PTEs

Drop the KVM_BUG_ON() in the KVM MMU core before zapping child external
PTEs, since requiring zapping PTEs to be protected by exclusive mmu_lock is
TDX's specific requirement.

No need to plumb the shared/exclusive info into the remove_external_spte()
op or move the KVM_BUG_ON() to TDX, because
- There's already an assertion of exclusive mmu_lock protection in TDX.
- The KVM_BUG_ON() is a bit redundant given that if there's any bug causing
  zapping of leaf PTEs in S-EPT under shared mmu_lock, SEAMCALL failures
  due to contention would result in TDX_BUG_ON() in TDX.

Link: https://lore.kernel.org/kvm/aYUarHf3KEwHGuJe@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075647.4290-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/tdp_mmu: Centrally propagate to-present/atomic zap updates to external PTEs

Move propagation of to-present changes and atomic zap changes to external
PTEs from function __tdp_mmu_set_spte_atomic() to function
__handle_changed_spte(), which centrally handles changes of SPTEs.

When setting a PTE to present in the mirror page tables, the update needs
to be propagated to the external page tables (in TDX parlance, the S-EPT).
Today this is handled by special mirror page tables logic/branching in
__tdp_mmu_set_spte_atomic(), which is the only place where present PTEs are
set for TDX.

The current approach obviously works, but is a bit hacked on. The hook for
setting present leaf PTEs is added only where TDX happens to need it. For
example, TDX does not support any of the operations that use the non-atomic
variant, tdp_mmu_set_spte(), to set present PTEs. Since the hook is missing
there, it is very hard to understand the code from a non-TDX lens. If the
reader doesn't know the TDX specifics it could look like the external SPTE
update is missing.

In addition to being confusing, it also litters the TDP MMU with "external"
update callbacks. This is especially unfortunate because there is already a
central place to react to TDP updates, handle_changed_spte().

Begin the process of moving towards a model where all mirror page table
updates are forwarded to TDX code where the TDX-specific logic can live
with a more proper separation of concerns. Do this by adding a helper
__handle_changed_spte() and teaching it how to return error codes, such
that it can propagate the failures that may come from TDX external page
table updates. Make the original handle_changed_spte() a no-fail version of
__handle_changed_spte(), so it handles no-fail changes which are under
exclusive mmu_lock or under the no-fail path handle_removed_pt(),
triggering KVM_BUG_ON() on error returns.

Instead of having __tdp_mmu_set_spte_atomic() do the frozen mirror SPTE
dance and trigger propagation to external PTEs, make
__tdp_mmu_set_spte_atomic() a simple helper of try_cmpxchg64() and hoist
the frozen mirror SPTE dance up a level to tdp_mmu_set_spte_atomic(). Then,
the propagation of changes to present to the external PTEs can be
centralized to __handle_changed_spte(). Aging external SPTEs is not yet
supported for the mirror page table, so just warn on mirror usage in
kvm_tdp_mmu_age_spte() and invoke __tdp_mmu_set_spte_atomic() directly
without the frozen dance. No need to warn on installing FROZEN_SPTE as a
long-term value in kvm_tdp_mmu_age_spte() since clearing the accessed bit
is mutually exclusive with installing FROZEN_SPTE (FROZEN_SPTE has the
accessed bit set on all x86 platforms).

Since tdp_mmu_set_spte_atomic() can also be invoked to atomically zap SPTEs
(though there's no path to trigger atomic zap on the mirror page table up
to now), also leverage set_external_spte() op to propagate the atomic zaps
when tdp_mmu_set_spte_atomic() zaps leaf SPTEs directly. (When
tdp_mmu_set_spte_atomic() zaps a non-leaf SPTE, zaps of the child leaf
SPTEs are propagated via the remove_external_spte() op).

Note: tdp_mmu_set_spte_atomic() invokes __handle_changed_spte() to handle
changes to new_spte while the mirror SPTE is frozen, so
(1) the update of the external PTEs and statistics, or
(2) the update of child mirror SPTEs, child external PTEs and corresponding
statistics,
now occur before the mirror SPTE is actually set to new_spte.
(1) is ok since if it fails, the mirror SPTE will be restored to its
original value. (2) is also ok since handle_removed_pt() is no-fail.
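
The ordering described in the note above can be sketched in user space with
C11 atomics. This is only an illustration of the freeze/propagate/publish
pattern, not the KVM code: set_entry_atomic(), FROZEN_VAL, the callback
type, and the two demo callbacks are all hypothetical names, and FROZEN_VAL
is a placeholder rather than KVM's actual FROZEN_SPTE encoding.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder marker value; NOT KVM's real FROZEN_SPTE encoding. */
#define FROZEN_VAL UINT64_C(0x5a0)

/* Hypothetical stand-in for the hook that updates the external table. */
typedef int (*external_update_fn)(uint64_t old_val, uint64_t new_val);

/* Demo callbacks (hypothetical): one that succeeds, one that fails. */
static int external_ok(uint64_t old_val, uint64_t new_val)
{
	(void)old_val;
	(void)new_val;
	return 0;
}

static int external_fail(uint64_t old_val, uint64_t new_val)
{
	(void)old_val;
	(void)new_val;
	return -EIO;
}

static int set_entry_atomic(_Atomic uint64_t *entry, uint64_t old_val,
			    uint64_t new_val, external_update_fn update)
{
	uint64_t expected = old_val;
	int ret;

	/* Step 1: freeze the entry; fail if someone else changed it. */
	if (!atomic_compare_exchange_strong(entry, &expected, FROZEN_VAL))
		return -EBUSY;

	/* Step 2: propagate the change while the entry is still frozen. */
	ret = update(old_val, new_val);

	/* Step 3: publish new_val on success, restore old_val on failure. */
	atomic_store(entry, ret ? old_val : new_val);
	return ret;
}
```

In this sketch the external update runs before the entry holds new_val,
which is why a failure in step 2 can be undone by writing back old_val.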

Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com
[Rick: Based on a diff by Sean Christopherson]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[Yan: added atomic zap case ]
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075634.4274-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Restrict userspace return codes for KVM_HC_MAP_GPA_RANGE

To align with the updated TDX API, which allows userspace to request
that guests retry MAP_GPA operations, make sure that userspace returns
only EINVAL or EAGAIN as possible error codes.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260305222627.4193305-3-sagis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Allow userspace to return errors to guest for MAPGPA

MAPGPA request from TDX VMs gets split into chunks by KVM using a loop
of userspace exits until the complete range is handled.

In some cases userspace VMM might decide to break the MAPGPA operation
and continue it later. For example: in the case of intrahost migration
userspace might decide to continue the MAPGPA operation after the
migration is completed.

Allow userspace to signal to TDX guests that the MAPGPA operation should
be retried the next time the guest is scheduled.

This is potentially a breaking change: if userspace sets hypercall.ret
to a value other than EBUSY or EINVAL, an EINVAL error code will be
returned to userspace. As of now, QEMU never sets hypercall.ret to a
non-zero value after handling KVM_EXIT_HYPERCALL, so this change should
be safe.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260305222627.4193305-2-sagis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

selinux: comment spelling fix in ibpkey.c

Signed-off-by: Kalevi Kolttonen <kalevi@kolttonen.fi>
[PM: updated subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>

selinux: comment typo fix in selinuxfs.c

Signed-off-by: Kalevi Kolttonen <kalevi@kolttonen.fi>
[PM: updated subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>

ipv6: mcast: annotate data-races around mca_users

/proc/net/igmp6 walks IPv6 multicast memberships under RCU and prints
mca_users without holding idev->mc_lock, while multicast join and leave
paths update the field while holding idev->mc_lock. Annotate this
intentional lockless snapshot with READ_ONCE() and the matching writers
with WRITE_ONCE().
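
A rough user-space sketch of the annotation pattern follows. The macros are
simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely on
the GCC/Clang __typeof__ extension), and struct mca_entry plus the three
helpers are hypothetical names used only for illustration, not the code
touched by this patch.

```c
/* Simplified stand-ins for the kernel macros of the same name. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, trimmed-down stand-in for a multicast membership entry. */
struct mca_entry {
	unsigned int mca_users;
};

/* Writers: in the kernel these would run with idev->mc_lock held. */
static void mca_join(struct mca_entry *mc)
{
	WRITE_ONCE(mc->mca_users, mc->mca_users + 1);
}

static void mca_leave(struct mca_entry *mc)
{
	WRITE_ONCE(mc->mca_users, mc->mca_users - 1);
}

/* Lockless reader (e.g. a /proc dump): takes one untorn snapshot. */
static unsigned int mca_users_snapshot(const struct mca_entry *mc)
{
	return READ_ONCE(mc->mca_users);
}
```

The plain read on the writer side is fine in this sketch because writers
are assumed to be serialized by the lock; only the store and the lockless
load need the annotation.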

Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260524022456.20689-1-sigefriedhyy@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/dns_resolver: consolidate namelen checks in dns_query

Consolidate the namelen checks and return -EINVAL early if needed. Drop
the namelen == 0 check since it is covered by namelen < 3.
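
As a sketch of why the separate zero check is redundant, consider a single
consolidated test. check_namelen() is a hypothetical helper, and the upper
bound of 255 is an assumption about the limit, not something stated above.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical consolidated check: namelen == 0 needs no test of its own
 * because 0 < 3 already rejects it. The 255 upper bound is an assumption.
 */
static int check_namelen(const char *name, size_t namelen)
{
	if (!name || namelen < 3 || namelen > 255)
		return -EINVAL;
	return 0;
}
```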

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260525095400.821912-3-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: refuse to enslave CAN devices

syzbot reported a kernel paging request crash in
can_rx_unregister() inside net/can/af_can.c. The crash occurs
because a virtual CAN device (vxcan) is being enslaved to a
bonding master.

During the enslavement process, the bonding driver modifies
the network device state to fit an Ethernet-like
aggregation model. However, CAN devices operate on a completely
different Layer 2 architecture, relying on the CAN mid-layer
private data structure (can_ml_priv) instead of standard
Ethernet structures. Since bonding does not initialize or
maintain these CAN structures, subsequent operations on the
half-enslaved interface (such as closing associated sockets
via isotp_release) lead to a null-pointer dereference when
accessing the CAN receiver lists.

Bonding CAN interfaces is architecturally invalid as CAN lacks
MAC addresses, ARP capabilities, and standard Ethernet
link-layer mechanisms. While generic loopback devices are
blocked globally in net/core/dev.c, virtual CAN devices
bypass this check because they do not carry the IFF_LOOPBACK
flag, despite acting as local software-loopbacks.

Fix this by explicitly blocking network devices of type
ARPHRD_CAN from being enslaved at the very beginning of
bond_enslave(). This prevents illegal state mutations,
eliminates the resulting KASAN crashes, and avoids potential
memory leaks from incomplete socket cleanups.
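
The shape of such an early rejection can be sketched as below. This is not
the patch itself: struct fake_netdev and can_enslave() are hypothetical,
the ARPHRD_* values are copied from <linux/if_arp.h>, and the -EOPNOTSUPP
return code is an assumption.

```c
#include <errno.h>

/* Values as defined in <linux/if_arp.h>. */
#define ARPHRD_ETHER 1
#define ARPHRD_CAN   280

/* Hypothetical stand-in for the one struct net_device field used here. */
struct fake_netdev {
	unsigned short type;
};

/* Reject CAN devices before any device state is touched. */
static int can_enslave(const struct fake_netdev *slave)
{
	if (slave->type == ARPHRD_CAN)
		return -EOPNOTSUPP;	/* assumed error code */
	return 0;
}
```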

As CAN support was added long after bonding, the Fixes tag points
to the introduction of ARPHRD_CAN, which would have needed
specific handling in bond_main.c.

Fixes: cd05acfe65ed ("[CAN]: Allocate protocol numbers for PF_CAN")
Reported-by: syzbot+8ed98cbd0161632bce95@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8ed98cbd0161632bce95
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Link: https://patch.msgid.link/20260526-bonding-candev-v1-1-ba1df400918a@hartkopp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

run-clang-tools: run multiprocessing.Pool as context manager

`multiprocessing.pool.Pool()` should be used as a context manager so
Python can free its internal resources and do a proper cleanup.[1]

While at it, move the code to read the `compile_commands.json` so the
opened file can be closed before the sub-processes are fork()ed.

Link: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Link: https://patch.msgid.link/40180613bef84946c45d6fbeb4bb274573cd0beb.1778849135.git.phahn-oss@avm.de
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

selinux: hooks: use __getname() to allocate path buffer

selinux_genfs_get_sid() allocates memory for a path with __get_free_page()
although there is a dedicated helper for allocation of file paths:
__getname().

Replace __get_free_page() for allocation of a path buffer with __getname().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>

selinux: use k[mz]alloc() to allocate temporary buffers

Several functions in selinuxfs.c allocate temporary buffers using
__get_free_page() or get_zeroed_page().

These buffers are used either to store a string generated by snprintf() (in
sel_make_bools()) or to copy data from user (sel_read_avc_hash_stats() and
sel_read_sidtab_hash_stats()).

Such usage does not require struct page access and it is better to allocate
these buffers with kzalloc()/kmalloc() that provide better scalability and
more debugging possibilities.

Replace use of get_zeroed_page() with kzalloc() and usage of
__get_free_page() with kmalloc().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>

KVM: VMX: Handle bad values on proxied writes to LBR MSRs

Use the "safe" WRMSR API when writing LBRs on behalf of the guest (or host
userspace), and propagate any errors back to the instigator, as the value
being written is untrusted.  E.g. if the guest (or host userspace) attempts
to set reserved bits in LBR_SELECT, then KVM needs to return an error, and
not WARN on the bad value.

Continue using the "unsafe" version of RDMSR, as it should be impossible to
reach the helper with a completely bogus MSR, i.e. WARNing on RDMSR failure
is very desirable, e.g. to make KVM bugs more visible.

  unchecked MSR access error: WRMSR to 0x1c8 (tried to write 0x0000000000004000)
  Call Trace:
   intel_pmu_set_msr+0x4e0/0x7f0 [kvm_intel]
   kvm_pmu_set_msr+0x17e/0x1c0 [kvm]
   kvm_set_msr_common+0xc76/0x1440 [kvm]
   vmx_set_msr+0x5e6/0x1570 [kvm_intel]
   kvm_emulate_wrmsr+0x54/0x1d0 [kvm]
   vmx_handle_exit+0x7fc/0x970 [kvm_intel]

Fixes: 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE")
Cc: stable@vger.kernel.org
Signed-off-by: Xuanqing Shi <1356292400@qq.com>
[sean: rework changelog, only modify WRMSR path, tag for stable@]
Link: https://patch.msgid.link/20260527022617.3973884-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

audit: fix recursive locking deadlock in audit_dupe_exe()

A deadlock occurs in the audit subsystem when duplicating
executable-related rules.

When a file is moved (e.g., via do_renameat2()), the VFS layer locks
the parent directory (I_MUTEX_PARENT), which synchronously triggers an
fsnotify_move event. If an existing executable audit rule matches the
file being moved, the audit subsystem catches this event and calls
audit_dupe_exe() to duplicate the watch and update the rule. Then,
audit_alloc_mark() would call kern_path_parent() to resolve the path,
leading to a blind attempt to acquire the exact same I_MUTEX_PARENT lock
already held by the task, resulting in the following recursive locking
deadlock:

============================================
WARNING: possible recursive locking detected
6.12.0-55.27.1.el10_0.x86_64+debug #1 Not tainted
--------------------------------------------
mv/5099 is trying to acquire lock:
ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
at: __kern_path_locked+0x10a/0x2f0

but task is already holding lock:
ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
at: lock_two_directories+0x13f/0x2b0

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&inode->i_sb->s_type->i_mutex_dir_key/1);
   lock(&inode->i_sb->s_type->i_mutex_dir_key/1);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

  6 locks held by mv/5099:
  #0: ffff888112a9c440 (sb_writers#13)
  at: do_renameat2+0x34c/0xbc0
  #1: ffff888112a9c790 (&type->s_vfs_rename_key#3)
  at: do_renameat2+0x415/0xbc0
  #2: ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1)
  at: lock_two_directories+0x13f/0x2b0
  #3: ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/5)
  at: lock_two_directories+0x175/0x2b0
  #4: ffffffffb3a1fb10 (&fsnotify_mark_srcu)
  at: fsnotify+0x454/0x28a0
  #5: ffffffffaf886230 (audit_filter_mutex)
  at: audit_update_watch+0x36/0x11e0

stack backtrace:
Call Trace:
  <TASK>
  dump_stack_lvl+0x6f/0xb0
  print_deadlock_bug.cold+0xbd/0xca
  validate_chain+0x83a/0xf00
  __lock_acquire+0xcac/0x1d20
  lock_acquire.part.0+0x11b/0x360
  down_write_nested+0x9f/0x230
  __kern_path_locked+0x10a/0x2f0
  kern_path_locked+0x26/0x40
  audit_alloc_mark+0xfb/0x4f0
  audit_dupe_exe+0x6c/0xe0
  audit_dupe_rule+0x6c2/0xc00
  audit_update_watch+0x4cc/0x11e0
  audit_watch_handle_event+0x12c/0x1b0
  send_to_group+0x5d0/0x8b0
  fsnotify+0x615/0x28a0
  fsnotify_move+0x1d8/0x630
  vfs_rename+0xdcd/0x1df0
  do_renameat2+0x9d4/0xbc0
  __x64_sys_renameat+0x192/0x260
  do_syscall_64+0x92/0x180
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f0491fe8c4e
Code: 0f 1f 40 00 48 8b 15 c1 e1 16 00 f7 d8 64 89 02 b8 ff ff ff ff
c3 66 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 08 01 00 00 0f 05 <48>
3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 89
RSP: 002b:00007ffc7210bf38 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0491fe8c4e
RDX: 0000000000000003 RSI: 00007ffc7210e6c8 RDI: 00000000ffffff9c
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: 00005575eb2dae2a R11: 0000000000000246 R12: 00005575eb2dae2a
R13: 00007ffc7210e6c8 R14: 0000000000000003 R15: 00000000ffffff9c
  </TASK>

The aforementioned deadlock can be consistently reproduced by running
the script below:

audit-dupe-exe-deadlock.sh
--------------------------
#!/bin/bash
auditctl -D
mkdir -p /tmp/foo
touch /tmp/file
auditctl -a always,exit -F exe=/tmp/file -F path=/tmp/file -S all -k dr
mv /tmp/file /tmp/foo/file
rm -Rf /tmp/foo

This patch fixes the issue by introducing struct audit_watch_ctx to pass
the fsnotify event context down to audit_alloc_mark(). By utilizing the
already-resolved directory inode provided by the event, we bypass the
kern_path_parent() path resolution entirely, safely avoiding the
recursive lock. Furthermore, it explicitly allows duplicate fsnotify
marks (allow_dups = 1) during the rename update, allowing the new rule's
mark to safely coexist with the old rule's mark until the old rule is
freed.

P.S.: This issue was identified and reproduced during a comprehensive
code coverage analysis of the audit subsystem. The full report is
available at the link below:

https://people.redhat.com/rrobaina/audit-code-coverage-analysis.pdf

P.P.S.: With the permission of both Ricardo and Nathan, I've squashed a
fixup patch from Nathan that addresses a compile time error when
CONFIG_AUDITSYSCALL=n.

Cc: stable@kernel.org
Fixes: 34d99af52ad4 ("audit: implement audit by executable")
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
[PM: move link metadata into the msg, apply fix from NC]
Signed-off-by: Paul Moore <paul@paul-moore.com>

KVM: x86/mmu: Plumb "sp" _pointer_ into the TDP MMU's handle_changed_spte()

Plumb the "sp" pointer into handle_changed_spte() to allow checking of
is_mirror_sp(sp) in handle_changed_spte(). This will allow consolidating
all S-EPT updates into a single kvm_x86_ops hook.

[Yan: Remove unused "as_id" param in tdp_mmu_set_spte() ]

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075622.4258-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/tdp_mmu: Morph !is_frozen_spte() check into a KVM_MMU_WARN_ON()

Remove the conditional logic for handling the setting of mirror page table
to frozen in __tdp_mmu_set_spte_atomic() and add it as a warning for both
mirror and direct cases.

The mirror page table needs to propagate PTE changes to the external page
table. This presents a problem for atomic updates which can't update both
page tables at once. So a special value, FROZEN_SPTE, is used as a
temporary state during these updates to prevent concurrent operations on
the PTE. If the TDP MMU tried to install FROZEN_SPTE as a long-term value,
it would confuse these updates.

On the other hand, it would also confuse other threads if FROZEN_SPTE is
installed as a long-term value for direct page tables (e.g., causing
another thread working on atomic zap to wait for a !FROZEN_SPTE value
endlessly).

Therefore, add the warning for installing FROZEN_SPTE as a long-term value
in __tdp_mmu_set_spte_atomic() without differentiating whether it's a
mirror or direct page table.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075609.4242-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Move lockdep assert in __tdp_mmu_set_spte_atomic() to TDX code

Move the MMU lockdep assert in __tdp_mmu_set_spte_atomic() into the TDX
specific op because the assert is TDX specific in intention.

The TDP MMU has many lockdep asserts for various scenarios, and in fact
the callchains that are used for TDX already have a lockdep assert which
covers the case in __tdp_mmu_set_spte_atomic(). However, these asserts are
for management of the TDP root owned by KVM. In the
__tdp_mmu_set_spte_atomic() assert case, it is helping with a scheme to
avoid contention in the TDX module during zap operations. That is very
TDX specific.

One option would be to just remove the assert in
__tdp_mmu_set_spte_atomic() and rely on the other ones in the TDP MMU. But
that assert is for a different intention, and too far away from the
SEAMCALL that needs it. So just move it to TDX code.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075557.4226-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Move KVM_BUG_ON()s in __tdp_mmu_set_spte_atomic() to TDX code

Drop some KVM_BUG_ON()s that are guarding against TDP MMU attempting to
propagate unsupported changes to the external page table through
__tdp_mmu_set_spte_atomic(). Have TDX code trigger them instead.

Now that TDP MMU logically allows propagating atomic zapping operation to
the external page table through the set_external_spte() op in
__tdp_mmu_set_spte_atomic(), TDX code will trigger the KVM_BUG_ON() on the
atomic zapping request instead. (Note: non-atomic zapping is not propagated
via the set_external_spte() op yet).

Despite the generic naming, external page table ops are designed completely
around TDX. They hook the bare minimum of what is needed, and exclude the
operations that are not supported by TDX. To help wrangle which operations
are handleable by various operations, warnings and KVM_BUG_ON()s exist in
the code. These warnings and KVM_BUG_ON()s put the burden of understanding
which operations should be forwarded to TDX code on TDP MMU developers, who
often read the code without TDX context.

Future changes will transition the encapsulation of this domain knowledge
to TDX code by funneling the external page table updates through a central
update mechanism. In this paradigm, the central update mechanism can
encapsulate the special knowledge, but will not have as much knowledge
about what operation is in progress.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075544.4210-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Plumb param "old_spte" into kvm_x86_ops.set_external_spte()

If tdp_mmu_set_spte_atomic() triggers an atomic zap on a mirror SPTE
(though currently no paths trigger it), the change is propagated via the
set_external_spte() op. Plumb the old SPTE into the set_external_spte() op,
so TDX code rather than TDP MMU code can warn if the atomic zap isn't
allowed, i.e. to let TDX enforce TDX's rules (inasmuch as possible).

Rename mirror_spte to new_spte to follow the TDP MMU's naming, and to make
it more obvious what value the parameter holds.

Opportunistically tweak the ordering of parameters to match the pattern of
most TDP MMU functions, which do "old, new, level".

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075533.4193-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/mmu: Fold set_external_spte_present() into its sole caller

Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() in
anticipation of propagating *all* changes (like atomic zap) triggered by
tdp_mmu_set_spte_atomic() to the external PTEs.

No functional change intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075520.4177-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Wrap mapping of leaf and non-leaf S-EPT entries into helpers

Add a helper, tdx_sept_map_leaf_spte(), to wrap and isolate PAGE.ADD and
PAGE.AUG operations. Rename tdx_sept_link_private_spt() to
tdx_sept_map_nonleaf_spte() to wrap SEPT.ADD for symmetry.

Thus, transition tdx_sept_set_private_spte() into a "dispatch" routine for
setting/writing S-EPT entries.

No functional change intended.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075500.4157-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Drop kvm_x86_ops.link_external_spt()

Drop the dedicated .link_external_spt() for linking S-EPT pages, and
instead funnel everything through .set_external_spte() for mapping S-EPT
entries. Using separate hooks doesn't help prevent TDP MMU details from
bleeding into TDX, and vice versa; to the contrary, dedicated callbacks
will result in _more_ pollution when hugepage support is added, e.g. will
require the TDP MMU to know details about the splitting rules for TDX that
aren't all that relevant to the TDP MMU.

Ideally, KVM would provide a single pair of hooks to set S-EPT entries,
one hook for setting SPTEs under write-lock and another for setting SPTEs
under read-lock (e.g. to ensure the entire operation is "atomic", to allow
for failure, etc.). Sadly, TDX's requirement that all child S-EPT entries
are removed before the parent makes that impractical: the TDP MMU
deliberately prunes non-leaf SPTEs and _then_ processes its children, thus
making it quite important for the TDP MMU to differentiate between zapping
leaf and non-leaf S-EPT entries.

However, that's the _only_ case that's truly special, and even that case
could be shoehorned into a single hook; it just wouldn't be a net positive.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075357.4113-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

mshv: support 1G hugepages by passing them as 2M-aligned chunks

The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.

Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
hypercall has no 1G stride, so 1G folios are processed as a
sequence of 2M chunks. Folios whose order is less than PMD_ORDER
(e.g. mTHP) fall back to single-page stride; mapping them as 2M
would fail in the hypervisor anyway.
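
The stride choice can be sketched as follows. map_stride_order() and
chunks_per_folio() are hypothetical helper names, and PMD_ORDER is assumed
to be 9 (2M with 4K base pages).

```c
/* Assumption: 4K base pages, so a 2M PMD mapping is order 9. */
#define PMD_ORDER 9

/* 2M stride for folios of at least PMD order, else single-page stride. */
static unsigned int map_stride_order(unsigned int folio_order)
{
	return folio_order >= PMD_ORDER ? PMD_ORDER : 0;
}

/* Number of stride-sized chunks needed to cover one folio. */
static unsigned long chunks_per_folio(unsigned int folio_order)
{
	return 1UL << (folio_order - map_stride_order(folio_order));
}
```

Under these assumptions a 1G folio (order 18) is fed to the hypercall as
512 chunks of 2M, while an order-4 mTHP folio is mapped as 16 single pages.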

Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>

Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs

If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
screen.lfb_base being zero [1], there is an MMIO conflict between the
drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
hv_allocate_config_window() calls vmbus_allocate_mmio() to get an
MMIO range, typically it gets a 32-bit MMIO range that overlaps with the
framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
error message "PCI Pass-through VSP failed D0 Entry with status" since
the host thinks that PCI devices must not use MMIO space that the
host has assigned to the framebuffer.

This is especially an issue if pci-hyperv is built-in and hyperv-drm is
built as a module. Consequently, the kdump/kexec kernel fails to detect
PCI devices via pci-hyperv, and may fail to mount the root file system,
which may reside on an NVMe disk. The issue described here has existed
for SR-IOV VF NICs since day one of the pci-hyperv driver, and has been
worked around on x64 when possible. With the recent introduction of
ARM64 VMs that boot from NVMe, there is no workaround, so we need a
formal fix.

On Gen2 VMs, if the screen.lfb_base is 0 in the kdump/kexec kernel [1],
fall back to the low MMIO base, which should be equal to the framebuffer
MMIO base [2] (the statement is true according to my testing on x64
Windows Server 2016, and on x64 and ARM64 Windows Server 2025 and on
Azure. I checked with the Hyper-V team and they said the statement should
continue to be true for Gen2 VMs). In the first kernel, screen.lfb_base
is not 0; if the user specifies a very high resolution, it's not enough
to only reserve 8MB: let's always reserve half of the space below 4GB,
but cap the reservation to 128MB, which is the required framebuffer size
of the highest resolution 7680*4320 supported by Hyper-V.
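
The sizing rule and the 128MB figure can be checked with a small sketch.
fb_reserve_size() and fb_bytes() are hypothetical helpers, the low MMIO
range is treated as having an inclusive end, and 4 bytes per pixel is an
assumption.

```c
#include <stdint.h>

#define SZ_128M (UINT64_C(128) << 20)

/* Reserve half of the low MMIO range (inclusive end), capped at 128MB. */
static uint64_t fb_reserve_size(uint64_t low_start, uint64_t low_end)
{
	uint64_t half = (low_end - low_start + 1) / 2;

	return half < SZ_128M ? half : SZ_128M;
}

/* Framebuffer size for a given mode; bytes per pixel is an assumption. */
static uint64_t fb_bytes(uint64_t width, uint64_t height, uint64_t bpp_bytes)
{
	return width * height * bpp_bytes;
}
```

With 4 bytes per pixel, 7680*4320 needs 132,710,400 bytes, which fits
under the 134,217,728-byte (128MB) cap.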

While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
the > to >=. Here the 'end' is an inclusive end (typically, it's
0xFFFF_FFFF for the low MMIO range).
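
The off-by-one can be illustrated with a short sketch. The helper names are
hypothetical, and the VTPM_BASE_ADDRESS value used here is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define VTPM_BASE_ADDRESS UINT64_C(0xfed40000)	/* assumed value */

/* Ground truth: does [start, end] (inclusive) contain addr? */
static bool range_contains(uint64_t start, uint64_t end, uint64_t addr)
{
	return start <= addr && addr <= end;
}

/* Old comparison: misses a range ending exactly at the vTPM base. */
static bool reaches_vtpm_old(uint64_t start, uint64_t end)
{
	return end > VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS;
}

/* Fixed comparison: correct when 'end' is an inclusive end. */
static bool reaches_vtpm_new(uint64_t start, uint64_t end)
{
	return end >= VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS;
}
```

A range whose inclusive end equals VTPM_BASE_ADDRESS contains the first
byte of the vTPM region, which only the >= form detects.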

Note: vmbus_reserve_fb() now also reserves an MMIO range at the beginning
of the low MMIO range on CVMs, which have no framebuffers (the
'screen.lfb_base' in vmbus_reserve_fb() is 0 for CVMs), just in case the
host might treat the beginning of the low MMIO range specially [3]. BTW,
the OpenHCL kernel is not affected by the change, because that kernel
boots with DeviceTree rather than ACPI (so vmbus_reserve_fb() won't run
there), and there is no framebuffer device for that kernel.

Note: normally Gen1 VMs don't have the MMIO conflict issue because the
framebuffer MMIO range (which is hardcoded to base=4GB-128MB and
size=64MB for Gen1 VMs by the host) is always reported via the legacy PCI
graphics device's BAR, so the kdump/kexec kernel can reserve the 64MB
MMIO range; however, if the VM is configured to use a very high resolution
and the required framebuffer size exceeds 64MB (AFAIK, in practice, this
isn't a typical configuration by users), the hyperv-drm driver may need to
allocate an MMIO range above 4GB and change the framebuffer MMIO location
to the allocated MMIO range -- in this case, there can still be issues [4]
which can't be easily fixed: any possible affected Gen1 users would have
to use a resolution whose framebuffer size is <= 64MB, or switch to Gen2
VMs.

[1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
[2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
[3] https://lore.kernel.org/all/SN6PR02MB415726B17D5A6027CD1717E8D4342@SN6PR02MB4157.namprd02.prod.outlook.com/
[4] https://lore.kernel.org/all/SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com/

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
CC: stable@vger.kernel.org
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>

mshv: use kmalloc_array in mshv_root_scheduler_init

Replace kmalloc() with kmalloc_array() to prevent potential
overflow, as recommended in Documentation/process/deprecated.rst.
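
A user-space analogue of the overflow check that kmalloc_array() adds over
an open-coded multiplication is sketched below. alloc_array() is a
hypothetical helper, and __builtin_mul_overflow() is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements of 'size' bytes; fail if n * size would overflow. */
static void *alloc_array(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;	/* a plain malloc(n * size) would wrap here */
	return malloc(bytes);
}
```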

No functional change.

Signed-off-by: Can Peng <pengcan@kylinos.cn>
Signed-off-by: Wei Liu <wei.liu@kernel.org>