git.ipfire.org Git - thirdparty/kernel/stable.git/log

net/mlx5: Map SF controller to pfnum for satellite PFs

SF devlink port creation and registration used the ECPF's PCI function
as pfnum. Extend this to support satellite PF controllers by introducing
mlx5_esw_sf_controller_to_pfnum() that maps a controller number to the
corresponding PF number, and use it in SF port attribute setup and SF
creation validation.

Reorder the checks in mlx5_devlink_sf_port_new() so that
mlx5_sf_table_supported() runs before attribute validation, since the
new helper requires the eswitch to be initialized.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Expose PF number from query_esw_functions

Extract pci_device_function from the query_esw_functions output for both
the host PF and satellite PFs, storing it alongside the existing
host_number field.

Add mlx5_esw_get_hpf_pf_num() helper that returns the host PF's actual
PCI device function when the new query format is supported, falling back
to PCI_FUNC(dev->pdev->devfn) for older firmware. Use it in devlink port
attribute setup so that host PF and VF devlink ports report the correct
PF number rather than the ECPF's own PCI function number.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Support SPF SFs in SF hardware table

Convert the SF hardware table from a fixed-size hwc array to a
dynamically allocated one, supporting satellite PF (SPF) SFs alongside
local and external host SFs. Initialize hwc entries for each SPF using
its host_number as controller. Rename MLX5_SF_HWC_EXTERNAL to
MLX5_SF_HWC_EXT_HOST and add MLX5_SF_HWC_FIRST_SPF for clarity.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Initialize satellite PF SF vports

Extend satellite PF (SPF) initialization to allocate SF vports for each
SPF. For each discovered SPF, query its SF capabilities, allocate SF
vports, and store the host_number for controller identification.

Add accessor APIs mlx5_esw_get_num_spfs(),
mlx5_esw_spf_get_host_number(), mlx5_esw_sf_max_spf_functions(), and
mlx5_esw_has_spf_sfs() for use by the SF hardware table in a subsequent
patch. Also extend mlx5_esw_offloads_controller_valid() to accept SPF
controllers in addition to the host PF controller.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Initialize host PF host number earlier

Move host_number from esw->offloads to esw->esw_funcs as hpf_host_number
and initialize it during vports_init instead of offloads_enable. This
makes the host PF host number available earlier in the initialization
sequence, which is required for upcoming SF hardware table support for
satellite PFs.

Add a mlx5_esw_get_hpf_host_number() accessor to retrieve the stored
host number.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Introduce generic helper for PF SFs info

Introduce mlx5_esw_sf_max_pf_functions() that queries a PF's max_num_sf
and sf_base_id using mlx5_vport_get_other_func_general_cap(), which
supports both function_id and vhca_id based addressing.

Refactor mlx5_esw_sf_max_hpf_functions() into a thin wrapper that adds
the host PF precondition checks and calls the new generic helper. Remove
mlx5_query_hca_cap_host_pf() as it is not used anymore.

This prepares for querying SFs info of Satellite PFs.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add satellite PF vport support

Discover satellite PFs from query_esw_functions output and allocate
eswitch vports for them. For each satellite PF, create a vport via the
CREATE_ESW_VPORT command using its vhca_id and allocate it in the
eswitch vport table.

When enabling switchdev mode, the ECPF acting as the eswitch manager
activates each satellite PF with enable_hca, loads its vport and adds
a representor. Since satellite PF devlink ports are registered in a
later patch, guard mlx5_esw_offloads_devlink_port() against vports
with no devlink port to avoid NULL dereference during representor
attach.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: sysctl/net: Remove ax25, netrom, rose entries

These networking subsystems were removed in commit dd8d4bc28ad7
("net: remove ax25 and amateur radio (hamradio) subsystem"),
but the sysctl directory table still listed them.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260515180200.1490926-1-costa.shul@redhat.com>

docs: pt_BR: update minimal software requirements in changes.rst

Update the Brazilian Portuguese translation of changes.rst to align with
the latest English version.

Key changes include:
- Updated minimum versions for Rust (1.85.0), bindgen (0.71.1), and
pahole (1.22).
- Fixed ReST syntax for internal references (:ref:) and external links.
- Corrected formatting for tool names and config options using inline
code backticks.
- Synchronized technical descriptions for udev, kmod, and NFS-utils.

v2:
- Fix alignment in the minimal software requirements table that broke the build.
- Fix Sphinx footnote syntax.

Signed-off-by: Daniel Pereira <danielmaraboo@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260515182200.654324-1-danielmaraboo@gmail.com>

docs/ja_JP: translate more of submitting-patches.rst (no-mime)

Translate the "No MIME, no links, no compression, no attachments.
Just plain text" and "Respond to review comments" sections in
Documentation/translations/ja_JP/process/submitting-patches.rst.

Keep the wording close to the English text and wrap lines to match
the style used in the surrounding Japanese translation.

Signed-off-by: Akiyoshi Kurita <weibu@redadmin.org>
Acked-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260513131111.432772-1-weibu@redadmin.org>

docs: maintainers_include: keep the last entry at the end

The last maintainer's entry ("THE REST") is meant to be at the
end. Ensure that.

While here, use a case-insensitive sort to avoid placing "iSCSI"
near the end.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <b4f45565eff4ba6f01e84a6877813038a23ba83b.1778952682.git.mchehab+huawei@kernel.org>

docs: maintainers_include: restore compatibility with Python 3.6

glob root_dir parameter requires Python 3.10, which is more than
our current Python minimal requirement.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <930036c189414f3f7096c22269687489f8566dd9.1778952682.git.mchehab+huawei@kernel.org>

docs: fix typo in user_mode_linux_howto_v2.rst

Replace "privilges" with "privileges"

Signed-off-by: Sakurai Shun <ssh1326@icloud.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260517022456.5895-1-ssh1326@icloud.com>

docs: fix typo in leds-lp55xx.rst

Replace "regsister" with "register"

Signed-off-by: Sakurai Shun <ssh1326@icloud.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260517043303.17111-1-ssh1326@icloud.com>

docs: threat-model: add missing closing parenthesis

Fixes: a03ef333fbd6 ("Documentation: security-bugs: explain what is and is not a security bug")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Acked-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <da8ee1e8b4e99261ec11544c4e1a4f81316ae965.1779032501.git.baruch@tkos.co.il>

docs: pt_BR: Translate process/kernel-docs.rst into Portuguese

Translate Documentation/process/kernel-docs.rst into Portuguese (pt_BR)
and update the main index.

The content was adapted following the RST formatting rules and the
appropriate technical terminology for Brazilian Portuguese.

Signed-off-by: Daniel Pereira <danielmaraboo@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260519140035.1031694-1-danielmaraboo@gmail.com>

docs: submitting-patches: Clarify that "reviewer" is a person

Common understanding of word "Reviewer" is: a person performing a review
work [1]. Tools are not persons, thus cannot be reviewers in this term.
Also tools cannot make statements and cannot take responsibility for the
review.

Our docs already clearly mark that "Reviewed-by" must come from a
person:

- "By offering my Reviewed-by: tag, I state that:"

   Usage of first person "I" and word "state"

- "A Reviewed-by tag is *a statement of opinion* that the patch is an
    appropriate modification of the kernel without any remaining serious"

   Only a person can make a statement of opinion.

- "Any interested reviewer (who has done the work) can offer a
   Reviewed-by"

   A person can offer a tag thus above does not grant the tool
   permission to offer a tag.

However this might not be enough, so let's clarify that only a person
with a known identity can state the "Reviewer's statement of oversight".

Link: https://en.wiktionary.org/wiki/reviewer
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260520154846.162170-2-krzysztof.kozlowski@oss.qualcomm.com>

arm64: dts: microchip: lan969x: add OTP node

Add the required OTP on LAN969x.

Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Link: https://lore.kernel.org/r/20260515115954.701155-3-robimarko@gmail.com
Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>

ARM: configs: at91: sama7: add sama7d65 i3c-hci

Enable the configs needed for I3C framework and microchip
sama7d65 i3c-hci driver.

Signed-off-by: Durai Manickam KR <durai.manickamkr@microchip.com>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Signed-off-by: Manikandan Muralidharan <manikandan.m@microchip.com>
Link: https://lore.kernel.org/r/20260525092405.1514213-6-manikandan.m@microchip.com
Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>

ARM: dts: microchip: add I3C controller

Add I3C controller for sama7d65 SoC.

Signed-off-by: Durai Manickam KR <durai.manickamkr@microchip.com>
Signed-off-by: Manikandan Muralidharan <manikandan.m@microchip.com>
Link: https://lore.kernel.org/r/20260525092405.1514213-5-manikandan.m@microchip.com
Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>

net: lan966x: cleanup error handling in lan966x_fdma_rx_alloc_page_pool()

This code works, but there are a few things to tidy up:
1. No need to an unlikely() because IS_ERR() already has an unlikely()
built in.
2. No need to use PTR_ERR_OR_ZERO() because it's not an error pointer.
3. Use the returned error code directly instead of using groveling in
rx->page_pool to find it.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Link: https://patch.msgid.link/ag7_YBWRpRmY9MGT@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

clk: at91: sama7d65: add peripheral clock for I3C

Add peripheral clock description for I3C.

Signed-off-by: Durai Manickam KR <durai.manickamkr@microchip.com>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Signed-off-by: Manikandan Muralidharan <manikandan.m@microchip.com>
Link: https://lore.kernel.org/r/20260525092405.1514213-3-manikandan.m@microchip.com
Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>

octeontx2-af: validate body pcifunc in rvu_mbox_handler_rep_event_notify

rvu_mbox_handler_rep_event_notify() in drivers/net/ethernet/marvell/
octeontx2/af/rvu_rep.c queues a sender-controlled REP_EVENT_NOTIFY
request body verbatim, and rvu_rep_up_notify() then forwards
event->pcifunc (the nested body field, distinct from the
AF-normalised header pcifunc) into rvu_get_pfvf(), rvu_get_pf() and
the AF->PF mailbox device index without any bounds check.

A VF attached to a PF that has been put into switchdev
representor mode reaches this path: the VF mailbox handler
otx2_pfvf_mbox_handler() forwards every message id including
MBOX_MSG_REP_EVENT_NOTIFY to AF without an allowlist, and the AF
dispatcher rewrites only msg->pcifunc, leaving struct
rep_event::pcifunc attacker-controlled. The sibling
rvu_mbox_handler_esw_cfg() refuses requests whose header pcifunc
is not rvu->rep_pcifunc; this handler has no equivalent gate.

An out-of-range body pcifunc selects an &rvu->pf[]/&rvu->hwvf[]
element past the allocated array and, for RVU_EVENT_MAC_ADDR_CHANGE,
turns into a six-byte attacker-chosen OOB ether_addr_copy() target
inside the queued worker; KASAN reports a slab-out-of-bounds write
in rvu_rep_wq_handler.

Reject malformed requests at the handler entry by gating on
is_pf_func_valid(), which is already the canonical PF/VF range check
in this driver; expose it via rvu.h so callers in rvu_rep.c can use
it instead of open-coding the same range arithmetic.

Fixes: b8fea84a0468 ("octeontx2-pf: Add support to sync link state between representor and VFs")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://patch.msgid.link/20260520154157.1439319-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-7.1/hpfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull hpfs fix from Mikulas Patocka:

- Fix a crash on corrupted filesystem

* tag 'for-7.1/hpfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
hpfs: fix a crash if hpfs_map_dnode_bitmap fails

perf test: Add stat metrics --for-each-cgroup test

Add a new shell test `stat_metrics_cgrp.sh` to verify metric reporting
with `--for-each-cgroup`, both with and without `--bpf-counters`.

The test:
- Checks if system-wide monitoring is supported (skips if not).
- Finds cgroups to test.
- Runs `perf stat` with `insn_per_cycle` metric and verifies that the
metric is reported for each cgroup.
- Dynamically pairs and verifies instructions and cycles counts to
avoid false failures on idle cgroups.
- Tests both standard mode and BPF counters mode (if supported).

Assisted-by: Antigravity:gemini-3-flash
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Svilen Kanev <skanev@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf stat: Propagate supported flag to follower cgroup BPF events

When using BPF counters with cgroups, follower events (for cgroups
other than the first one) are not opened. Because they are not opened,
their `supported` flag was left as `false`.

During metric calculation, `prepare_metric` checks if the event is
supported. If it is not supported (like the follower events), it
explicitly sets the value to `NAN`, which eventually causes the metric
to be reported as `nan %`.

Fix this by propagating the `supported` flag from the "leader" events
(the ones opened for the first cgroup) to the "follower" events.

Also add a validation check to `bperf_load_program` to ensure `nr_cgroups`
is not zero and the number of events is a multiple of `nr_cgroups`,
preventing a potential division-by-zero (SIGFPE) exception when
`num_events` evaluates to 0 (e.g., with a trailing comma in cgroups list).

Reported-by: Svilen Kanev <skanev@google.com>
Assisted-by: Antigravity:gemini-3-flash
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge tag 'for-7.1/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mikulas Patocka:

- fix crashes in dm-vdo if GFP_NOWAIT allocation fails

* tag 'for-7.1/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm vdo: use GFP_NOIO for blkdev_issue_zeroout on format path

sched_ext: Convert ops.set_cmask() to arena-resident cmask

ops_cid.set_cmask() expects a cmask. The kernel couldn't write into the
arena, so it translated cpumask -> cmask in kernel memory and passed the
result as a trusted pointer. The BPF cmask helpers all operate on arena
cmasks though, so the BPF side had to word-by-word probe-read the kernel
cmask into an arena cmask via cmask_copy_from_kernel() before any helper
could touch it. It works, but is clumsy.

With direct kernel-side arena access now in place, build the cmask in the
arena. The kernel writes to it through the kern_va side of the dual mapping.
BPF directly dereferences it via an __arena pointer like any other arena
struct.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

sched_ext: Sub-allocator over kernel-claimed BPF arena pages

Build a per-scheduler sub-allocator on top of pages claimed from the BPF
arena registered in the previous patch. Subsequent kernel-managed
arena-resident structures (e.g. per-CPU set_cmask cmask) carve their storage
from this pool.

scx_arena_pool_init() creates a gen_pool. scx_arena_alloc() returns the
kernel VA. On exhaustion, the pool grows by claiming more pages via
bpf_arena_alloc_pages_sleepable(). Chunks are added at the kernel-side
mapping address. Callers translate to the BPF-arena form themselves if
needed.

Allocations sleep (GFP_KERNEL) - they may grow the pool through vzalloc and
arena page allocation. All current consumers run from the enable path (after
ops.init() and the kernel-side arena auto-discovery, before validate_ops()),
where sleeping is fine.

scx_arena_pool_destroy() walks each chunk, returns outstanding ranges to the
gen_pool with gen_pool_free() and then calls gen_pool_destroy(). The
underlying arena pages are released when the arena map itself is torn down,
so the pool destroy doesn't free them explicitly.

v2: Switch scx_arena_alloc() to a loop. (Andrea)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Require an arena for cid-form schedulers

Upcoming patches will let the kernel place arena-resident scratch shared
with the BPF program (e.g. per-CPU set_cmask cmask) so the BPF side can
dereference it directly via __arena pointers, replacing the current
cmask_copy_from_kernel() probe-read loop. That requires each cid-form
scheduler to expose its arena to the kernel. Kernel- side accesses are
recovered by the per-arena scratch-page mechanism.

bpf_scx_reg_cid() walks the struct_ops member progs via
bpf_struct_ops_for_each_prog() and reads each prog's arena via
bpf_prog_arena(). The verifier enforces one arena per program, so each
member prog contributes at most one arena. All non-NULL contributions must
match and at least one member prog must use an arena. The map ref is held on
scx_sched and dropped on sched destroy. cpu-form schedulers (bpf_scx_reg)
are unchanged - no arena requirement.

Signed-off-by: Tejun Heo <tj@kernel.org>

macsec: fix replay protection at XPN lower-PN wrap

In macsec_post_decrypt(), when pn is U32_MAX, pn + 1 overflows u32 to 0
and the first branch never fires. If next_pn_halves.lower is also in the
upper half, pn_same_half(pn, lower) is true and the XPN else-if does not
fire either, leaving next_pn_halves unchanged. An attacker that captures
the legitimate frame carrying pn == 0xFFFFFFFF on an XPN association
can then replay it indefinitely, since lowest_pn never rises above
the captured pn and macsec_decrypt() reconstructs the same IV.

Extend the XPN else-if to also fire when pn + 1 wraps to 0, so receipt
of pn == U32_MAX advances next_pn_halves to (upper + 1, 0).

Fixes: a21ecf0e0338 ("macsec: Support XPN frame handling - IEEE 802.1AEbw")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Link: https://patch.msgid.link/SYBPR01MB78813FD49E58F253B989F197AF012@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'arena_direct_access' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next into for-7.2

Merge tag 'bootconfig-fixes-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull bootconfig fix from Masami Hiramatsu:

- Fix buf leak in apply_xbc

* tag 'bootconfig-fixes-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tools/bootconfig: Fix buf leaks in apply_xbc

rds: filter RDS_INFO_* getsockopt by caller's netns

The RDS_INFO_* family of getsockopt(2) options reads several
file-scope global lists that are not per-netns:

  rds_sock_info / rds6_sock_info,
  rds_sock_inc_info / rds6_sock_inc_info        -> rds_sock_list
  rds_tcp_tc_info / rds6_tcp_tc_info            -> rds_tcp_tc_list
  rds_conn_info / rds6_conn_info,
  rds_conn_message_info_cmn (for the *_SEND_MESSAGES and
  *_RETRANS_MESSAGES variants),
  rds_for_each_conn_info (for RDS_INFO_IB_CONNECTIONS)
                                                -> rds_conn_hash[]

The handlers do not filter by the caller's network namespace.
rds_info_getsockopt() has no netns or capable() check, and
rds_create() has no capable() check, so AF_RDS is reachable from
an unprivileged user namespace. As a result, an unprivileged
caller in a fresh user_ns plus netns can read the bound address
and sock inode of every RDS socket on the host, the peer address
of incoming messages on every RDS socket on the host, the peer
address and TCP sequence numbers of every rds-tcp connection on
the host, and the peer address and RDS sequence numbers of every
RDS connection on the host.

The rds-tcp transport is reachable from a non-initial netns (see
rds_set_transport()), so a one-shot init_net gate at
rds_info_getsockopt() would deny legitimate per-netns visibility
to rds-tcp callers. Instead, filter at each handler by comparing
the netns of the caller's socket to the netns of the list entry,
or to rds_conn_net(conn) for connection paths. Only copy entries
whose netns matches the caller. Counters (RDS_INFO_COUNTERS) are
aggregate statistics and remain global.

Reproducer (KASAN VM, rds and rds_tcp loaded): an AF_RDS socket
binds 127.0.0.1:4242 in init_net as root. A child process enters
a fresh user_ns plus netns and opens AF_RDS there, then calls
getsockopt(SOL_RDS, RDS_INFO_SOCKETS). Before this change, the
child sees the init_net socket. After this change, the child
sees zero entries.

Drop the rds_sock_count, rds_tcp_tc_count, and rds6_tcp_tc_count
globals. v2 used them for the size precheck and lens->nr; v3
replaced the precheck with a per-ns count from a first pass over
the list, so the globals have no remaining readers. The matching
increments and decrements in rds_create()/rds_destroy_sock() and
rds_tcp_set_callbacks()/rds_tcp_restore_callbacks() go away with
them. Reported by the kernel test robot under clang W=1.

Suggested-by: Allison Henderson <achender@kernel.org>
Suggested-by: Simon Horman <horms@kernel.org>
Reviewed-by: Allison Henderson <achender@kernel.org>
Co-developed-by: Praveen Kakkolangara <praveen.kakkolangara@aumovio.com>
Signed-off-by: Praveen Kakkolangara <praveen.kakkolangara@aumovio.com>
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Link: https://patch.msgid.link/20260520084236.2724349-1-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlabel: fix IPv6 unlabeled address add error handling

netlbl_unlhsh_add_addr6() always returned zero after
netlbl_af6list_add(), masking failures such as duplicate
IPv6 static label entries.

Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20260522022910.398416-1-zhaochenguang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rds: annotate data-race around rs_seen_congestion

rs_seen_congestion is read in rds_poll() and written in rds_sendmsg()
and rds_poll() without any lock. Use READ_ONCE()/WRITE_ONCE() to
annotate these lockless accesses and silence KCSAN.

Reported-by: syzbot+fbf3648ae7f5bdb05c59@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a0f8d94.050a0220.6b33c.0000.GAE@google.com/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Allison Henderson <achender@kernel.org>
Tested-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260522011621.304470-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gpu: nova-core: vbios: remove unused rom_header field

This is only used during construction, so we can remove it.

Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-22-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: move constants and functions to be associated

Move constants and functions to be inside the impls of the types they
are related to. This makes it more obvious what each type and value is
for.

Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-21-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: drop redundant TryFrom import

This is unused.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-20-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: drop unused image wrappers

These are unused currently, and it is probably sufficient to just check
the type of BIOS image in the future.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-19-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: remove unnecessary fields in PciRomHeader

Remove unnecessary fields in PciRomHeader. This allows a simplification
to use `FromBytes` instead of reading fields piecemeal. A lot of these
checks were redundant as well since it checks the size of the `data`
first in `BiosImage`.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-18-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use let-else in Vbios::new

Improve readability by moving the success path outside of a nested
branch.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-17-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use single logical block for the FWSEC section

Currently, FWSEC takes the first image and the last image. Treat the
first FWSEC image and all following image data as one logical block
for building the final FWSEC image. This avoids explicitly tracking
two FWSEC images.

Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-16-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use the first PCI-AT image

Currently, PCI-AT takes the final image if multiple exist. Use the
first one instead, to match nouveau behaviour.

Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-15-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

net/sched: sch_ets: make cl->quantum lockless

cl->quantum does not need to be protected by RTNL or qdisc spinlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260522110356.1403343-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: igmp: annotate data-races around im->users

/proc/net/igmp walks IPv4 multicast memberships under RCU and
prints im->users without holding RTNL, while multicast join and leave
paths update the field while holding RTNL. Annotate this intentional
lockless snapshot with READ_ONCE() and the matching writers with
WRITE_ONCE().

Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260522093906.39764-1-sigefriedhyy@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo()

ipv6_hop_jumbo() calls pskb_trim_rcsum(), which can change skb pointers.
Let's recompute nh pointer to make sure any change won't mess things up.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Justin Iurman <justin.iurman@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260522112013.12342-1-justin.iurman@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: refresh nh after handling HAO option

ip6_parse_tlv() caches skb_network_header(skb) in nh while walking
IPv6 TLVs.

ipv6_dest_hao() may call pskb_expand_head() for a cloned skb, which can
move the skb head and invalidate the cached network header pointer.
Refresh nh after ipv6_dest_hao() returns so any trailing padding or TLVs
are parsed from the current skb head.

This matches the existing pattern used in ip6_parse_tlv() after helpers
that can modify skb header storage.

Fixes: a831f5bbc89a ("[IPV6] MIP6: Add inbound interface of home address option.")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/7aba1debc2196189172499e5769802b026f8caf8.1779247873.git.zcliangcn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: nf_conntrack_ftp: avoid u16 overflows

get_port and try_number() parse comma-separated decimal values from FTP PORT
and EPRT commands into a u_int32_t array, but does not validate that each
value fits in a single octet. RFC 959 specifies that PORT parameters
are decimal integers in the range 0-255, representing the four octets
of an IP address followed by two octets encoding the port number.

Values exceeding 255 are silently accepted. In try_rfc959(), the raw
u32 values are combined via shift-and-OR to form the IP and port:

  cmd->u3.ip = htonl((array[0] << 24) | (array[1] << 16) |
                     (array[2] << 8) | array[3]);
  cmd->u.tcp.port = htons((array[4] << 8) | array[5]);

When array elements exceed 255, bits from one field bleed into adjacent
fields after shifting, producing IP addresses and port numbers that
differ from what the text representation suggests. For example,
"PORT 10,0,1,2,256,22" yields port (256<<8)|22 = 65558, truncated to
u16 = 22. This mismatch between the textual and computed values can
confuse network monitoring tools that parse FTP commands independently.

Ignore the command by returning 0 (no match) when any accumulated
value exceeds 255 so that no expectation is created.

Signed-off-by: Giuseppe Caruso <giuseppecaruso0990@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>

memblock: don't touch memblock arrays when memblock_free() is called late

When memblock_free() is called after memblock_discard() on architectures
that don't select ARCH_KEEP_MEMBLOCK, it tries to update memblock.reserved
that was already discarded and it causes use-after-free, for example

[    8.514775] BUG: KASAN: use-after-free in memblock_isolate_range+0x4ac/0x650
[    8.514775] Read of size 8 at addr ffff88a07fe6a000 by task swapper/0/1
[    8.514775] Call Trace:
[    8.514775]  <TASK>
[    8.514775]  kasan_report+0xb2/0x1b0
[    8.514775]  memblock_isolate_range+0x4ac/0x650
[    8.514775]  memblock_phys_free+0xc4/0x190
[    8.514775]  housekeeping_late_init+0x257/0x280
[    8.514775]  do_one_initcall+0xaa/0x470
[    8.514775]  do_initcalls+0x1b4/0x1f0
[    8.514775]  kernel_init_freeable+0x4b5/0x550
[    8.514775]  kernel_init+0x1c/0x150
[    8.514775]  ret_from_fork+0x5dc/0x8e0
[    8.514775]  ret_from_fork_asm+0x1a/0x30
[    8.514775]  </TASK>

Make sure memblock_free() updates memblock.reserved only when called early
enough or when ARCH_KEEP_MEMBLOCK is enabled.

Reported-by: Waiman Long <longman@redhat.com>
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/all/20260505051821.1107133-1-longman@redhat.com
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Breno Leitao <leitao@debian.org>
Fixes: 87ce9e83ab8b ("memblock, treewide: make memblock_free() handle late freeing")
Link: https://patch.msgid.link/20260513105122.502506-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Merge tag 'nf-26-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Florian Westphal says:

====================
netfilter: updates for net

Patches 7+8 fix a regression from 7.1-rc1. Everything else
is from 2.6.x to 5.3 releases.  There are additional known
issues with these patches (drive-by-findings in related code).

There are many old bugs all over netfilter and our ability to review
feature patches has come to a complete halt due to lack of time.
There are further security bugs that we cannot address
due to lack of time, maintainers and reviewers.

Other remarks: The xtables 32bit compat interface is already
off in many vendor kernels, the plan is to remove it soon.

1) Prevent RST packets with invalid sequence numbers from forcing TCP
   connections into the CLOSE state without a direction check.
   From Hamza Mahfooz.
2) Re-derive the TCP header pointer after skb_ensure_writable in
   synproxy_tstamp_adjust. Prevent use-after-free and invalid checksum
   updates caused by stale pointers during buffer expansion.
   From Chris Mason.
3) Fix a race condition causing keymap list corruption in conntracks gre/pptp
   helper.
4) Use raw_smp_processor_id() in xt_cpu to prevent splats under
   PREEMPT_RCU.
5) Disable netfilter payload mangling in user namespaces (nft_payload.c
   and nf_queue).
   TCP option mangling via nft_exthdr.c remains enabled.
   There will be followups here to restrict resp. revalidate
   headers.
6) Fix an out-of-bounds read in ebtables's compat_mtw_from_user function.
7) Use list_for_each_entry_rcu() to traverse fib6_siblings in
   nft_fib6_info_nh_uses_dev(). Ensure safe list walking under RCU.
8) Fix an out-of-bounds read in nft_fib_ipv6 caused by incorrect list
   traversal.
9) Add nft_fib_nexthop selftest to netfilter. Cover nexthop enumeration for
    single, group, and multipath route shapes.
    All three nft_fib6 fixes from Jiayuan Chen.
10) Fix destination corruption in shift operations when source and destination
    registers overlap.  Reject partial register overlap for all operations
    from control plane.  From Fernando Fernandez Mancera.

* tag 'nf-26-05-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: fix dst corruption in same register operation
  selftests: netfilter: add nft_fib_nexthop test
  netfilter: nft_fib_ipv6: handle routes via external nexthop
  netfilter: nft_fib_ipv6: walk fib6_siblings under RCU
  netfilter: ebtables: fix OOB read in compat_mtw_from_user
  netfilter: disable payload mangling in userns
  netfilter: xt_cpu: prefer raw_smp_processor_id
  netfilter: nf_conntrack_gre: fix gre keymap list corruption
  netfilter: synproxy: refresh tcphdr after skb_ensure_writable
  netfilter: conntrack: tcp: do not force CLOSE on invalid-seq RST without direction check
====================

Link: https://patch.msgid.link/20260522104257.2008-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'v7.1-rc5' into rdma.git for-next

For dependencies in the following patches

Resolve conflicts, use the goto labels from the rc tag.

* tag 'v7.1-rc5': (1526 commits)

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

isofs: replace __get_free_page() with kmalloc()

isofs_readdir() allocates a temporary buffer with __get_free_page().

kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.

Replace use of __get_free_page() with kmalloc().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-11-275e36a83f0e@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>

quota: allocate dquot_hash with kmalloc()

dquot_init() allocates a single page for dquot_hash with
__get_free_pages().

kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.

Replace use of __get_free_pages() with kmalloc() and get rid of the order
variable that remained 0 for more than 20 years.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-1-275e36a83f0e@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>

dm-inlinecrypt: add support for hardware-wrapped keys

Add support for hardware-wrapped encryption keys to the
dm-inlinecrypt target.

Introduce a new optional argument <key_type> to indicate
whether the provided key is a raw key or a hardware-wrapped
key. Based on this flag, the appropriate blk-crypto key type
is selected when initializing the key.

This allows dm-inlinecrypt to work with hardware that requires
keys to be wrapped and managed by the underlying inline
encryption engine.

Update the target argument parsing accordingly and pass the
key type to blk_crypto_init_key(). Documentation is also
updated to reflect the new parameter and usage.

Signed-off-by: Linlin Zhang <linlin.zhang@oss.qualcomm.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: e7f57d2c47e2 ("dm-inlinecrypt: add target for inline block device encryption")

RDMA/counter: Fix incorrect port index in rdma_counter_init() error cleanup

The error cleanup loop in rdma_counter_init() iterates with variable
'i' but accesses dev->port_data[port] instead of dev->port_data[i].
This causes the failed port's hstats to be freed multiple times while
leaking hstats of previously initialized ports.

Fixes: 56594ae1d250 ("RDMA/core: Annotate destroy of mutex to ensure that it is released as unlocked")
Link: https://patch.msgid.link/r/20260520104546.1776253-3-cuitao@kylinos.cn
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/counter: Fix num_counters leak on bind_qp failure in alloc_and_bind()

When __rdma_counter_bind_qp() fails in alloc_and_bind(), the error path
jumps to err_mode which frees the counter without decrementing
port_counter->num_counters. The only place that decrements is
rdma_counter_free(), which is unreachable since the counter was never
successfully bound.

This leak accumulates across repeated failures, permanently preventing
the port from switching to AUTO mode (-EBUSY in __counter_set_mode())
and blocking the MANUAL→NONE auto-revert in rdma_counter_free(). When
the mode was NONE before the call, the MANUAL mode set by
__counter_set_mode() also leaks since the revert logic is never
reached.

Add an err_bind label between the num_counters increment and the
existing err_mode label. It decrements num_counters and mirrors the
MANUAL→NONE revert from rdma_counter_free(), ensuring the port state
is fully restored on bind failure.

Link: https://patch.msgid.link/r/20260520104546.1776253-2-cuitao@kylinos.cn
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

remoteproc: use rsc_table_for_each_entry() in rproc_handle_resources()

Replace the open-coded resource table iteration loop in
rproc_handle_resources() with the rsc_table_for_each_entry() helper.

The remoteproc-specific dispatch logic (vendor resource handling via
rproc_handle_rsc(), RSC_LAST bounds check, handler table lookup) is
moved into a local callback rproc_handle_rsc_entry(), keeping the
iteration mechanics in one canonical place.

The callback receives the payload offset within the table so that
handlers which write back into the resource table (e.g.
rproc_handle_carveout() recording a dynamically allocated address via
rsc_offset) continue to work correctly.

No functional change.

Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260506050107.1985033-3-mukesh.ojha@oss.qualcomm.com

remoteproc: Move resource table data structure to its own header

The resource table data structure has traditionally been associated with
the remoteproc framework, where the resource table is included as a
section within the remote processor firmware binary. However, it is also
possible to obtain the resource table through other means—such as from a
reserved memory region populated by the boot firmware, statically
maintained driver data, or via a secure SMC call—when it is not embedded
in the firmware.

There are multiple Qualcomm remote processors (e.g., Venus, Iris, GPU,
etc.) in the upstream kernel that do not use the remoteproc framework to
manage their lifecycle for various reasons.

When Linux is running at EL2, similar to the Qualcomm PAS driver
(qcom_q6v5_pas.c), client drivers for subsystems like video and GPU may
also want to use the resource table SMC call to retrieve and map
resources before they are used by the remote processor.

In such cases, the resource table data structure is no longer tightly
coupled with the remoteproc headers. Client drivers that do not use the
remoteproc framework should still be able to parse the resource table
obtained through alternative means. Therefore, there is a need to
decouple the resource table definitions from the remoteproc headers.

Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260506050107.1985033-2-mukesh.ojha@oss.qualcomm.com

net/mlx5: HWS: Reject unsupported remove-header action

mlx5_cmd_hws_packet_reformat_alloc() handles
MLX5_REFORMAT_TYPE_REMOVE_HDR by looking up a matching HWS remove-header
action.

If mlx5_fs_get_action_remove_header_vlan() returns NULL, the code only
logs an error and continues. The function then returns success with a NULL
HWS action stored in the packet-reformat object.

Return an error when no matching remove-header action is available.

Fixes: aecd9d1020e3 ("net/mlx5: fs, add HWS packet reformat API function")
Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260506000054.51797-1-prathameshdeshpande7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'arena_direct_access'

Tejun Heo says:

====================
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.

v4:
- Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in
  inline comments on both x86 and arm64. (Andrea)
- Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a
  new include/linux/bpf_defs.h. (lkp)
- Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on
  pool growth. (Andrea)
- Picked up Reviewed-by tags from Emil and Andrea.

v3: https://lore.kernel.org/r/20260520235052.4180316-1-tj@kernel.org
v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org
v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org

Motivation
----------

sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.

The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.

Approach
--------

Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range-checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_set() wrapper. The kernel instruction retries and reads/writes the
scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.

The mechanism is default behavior - no UAPI flag.

What this preserves
-------------------

All the debugging properties of today's sparse-PTE design are preserved:

* BPF programs still fault on unmapped arena accesses. The fault semantics
  (instruction retry with rdst = 0) and the violation report through
  bpf_streams are unchanged for prog-side accesses.

* The first kernel-side touch of an unmapped address is reported via
  bpf_streams the same way as a prog-side fault, with the stack walk
  attributing it to the originating prog.

* User-side fault on a never-scratched address still lazy-allocates a real
  page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a
  scratched address SIGSEGVs.

What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.

Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------

  mm: Add ptep_try_set() for lockless empty-slot installs
  bpf: Recover arena kernel faults with scratch page

Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------

  bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
  bpf: Add bpf_struct_ops_for_each_prog()
  bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
====================

Link: https://lore.kernel.org/bpf/20260522172219.1423324-1-tj@kernel.org/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ACPI: video: Switch over to devres-based resource management

Turn acpi_video_bus_remove_notify_handler() into a devm
action added by acpi_video_bus_probe() after calling
acpi_video_bus_add_notify_handler and use the newly introduced
devm_acpi_install_notify_handler() to install an ACPI notify
handler for the video bus device.

This replaces the rollback path remnant in acpi_video_bus_probe()
and allows acpi_video_bus_remove() to be dropped altogether.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2556320.jE0xQCEvom@rafael.j.wysocki

ACPI: video: Use devm for video->entry and backlight cleanup

Introduce acpi_video_bus_del() for removing the video bus object
from the video_bus_head list and unregistering backlight and make
acpi_video_bus_probe() add it as a devm action after adding the
video bus object to the video_bus_head list.

Accordingly, remove the code superseded by it from
acpi_video_bus_remove() and from the rollback path in
acpi_video_bus_probe().

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2279582.Icojqenx9y@rafael.j.wysocki

ACPI: video: Use devm action for freeing video devices

Rename acpi_video_bus_put_devices() to devm_acpi_video_bus_get_devices()
and turn acpi_video_bus_put_devices() into a devm action added by it for
freeing the video devices allocated by it and the attached_array memory.

Accordingly, remove the acpi_video_bus_put_devices() calls and
attached_array freeing from acpi_video_bus_remove() and the rollback
path in acpi_video_bus_probe().

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/1932803.atdPhlSkOF@rafael.j.wysocki

ACPI: video: Use devm action for video bus object cleanup

Introduce acpi_video_bus_free() for freeing video bus object memory
and reversing changes related to it made during ACPI video bus device
probe, modify acpi_video_bus_probe() to add acpi_video_bus_free() as
a devm action, and remove the code superseded by it from
acpi_video_bus_remove().

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3892168.MHq7AAxBmi@rafael.j.wysocki

ACPI: video: Rearrange probe and remove code

Rearrange some ACPI video bus probe and remove code so that it is more
clear that the probe and removal are carried in reverse orders, which
will also facilitate subsequent changes.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2276683.Mh6RI2rZIc@rafael.j.wysocki

ACPI: video: Reduce the number of auxiliary device dereferences

Store the &aux_dev->dev pointer in a separate local variable in
acpi_video_bus_probe() to avoid dereferencing aux_dev many times.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2707186.Lt9SDvczpP@rafael.j.wysocki

ACPI: PAD: Switch over to devres-based resource management

Use the newly introduced devm_acpi_install_notify_handler() for
installing an ACPI notify handler and since that function checks the
ACPI companion of the owner device against NULL internally, remove the
the explicit ACPI companion check from acpi_pad_probe().

However, to prevent the notify handler from running acpi_pad_idle_cpus()
with the number of idle CPUs greater than zero after acpi_pad_remove()
has returned, add a bool static variable for synchronization between
the two.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/1964581.CQOukoFCf9@rafael.j.wysocki

ACPI: PAD: Fix teardown ordering in acpi_pad_remove()

The ACPI notify handler installed by acpi_pad_probe() needs to be
removed before calling acpi_pad_idle_cpus() in acpi_pad_remove()
so it doesn't schedule idle time injection on some CPUs again.

Fixes: 8e0af5141ab9 ("ACPI: create Processor Aggregator Device driver")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2064153.usQuhbGJ8B@rafael.j.wysocki

ACPI: PAD: Pass struct device pointer to acpi_pad_notify()

Use the struct device pointer to the dev member in the struct
platform_device object representing the platform device used for driver
binding as the last argument of acpi_dev_install_notify_handler() and
accordingly update acpi_pad_notify() to pass that pointer directly to
dev_name() when generating the netlink event.

Since the dev_name() value for an ACPI-enumerated platform device is the
same as the dev_name() value for the dev member of its ACPI companion
object, as per acpi_create_platform_device(), the above code modification
is not expected to cause functionality to change.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/1862521.VLH7GnMWUR@rafael.j.wysocki

ACPI: PAD: Rearrange acpi_pad_notify()

Use an if () in acpi_pad_notify() instead of a switch () statement to
make the code somewhat easier to follow and reduce its indentation
level.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3345485.5fSG56mABF@rafael.j.wysocki

ACPI: thermal: Switch over to devres-based resource management

Switch over the ACPI thermal zone driver to devres-based resource
management by making the following changes:

* Turn acpi_thermal_zone_free() into a devm action added from
   acpi_thermal_probe() after allocating the struct acpi_thermal object.

* Rename acpi_thermal_unregister_thermal_zone() to
   acpi_thermal_zone_unregister(), add acpi_thermal_pm_queue flushing to
   it, and turn it into a devm action added by acpi_thermal_probe()
   after calling acpi_thermal_register_thermal_zone().

* Use the newly introduced devm_acpi_install_notify_handler() for
   installing an ACPI notify handler.

* Drop acpi_thermal_remove() that is not necessary any more.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3698719.iIbC2pHGDl@rafael.j.wysocki

ACPI: HED: Switch over to devres-based resource management

Use the newly introduced devm_acpi_install_notify_handler() for
installing an ACPI notify handler and since that function checks the
ACPI companion of the owner device against NULL internally, remove the
the explicit ACPI companion check from acpi_hed_probe().

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/7950702.EvYhyI6sBW@rafael.j.wysocki

ACPI: HED: Refine guarding against adding a second instance

There can be only one ACPI hardware event device (HED) in use at a time,
so acpi_hed_probe() uses static variable hed_handle for guarding against
adding a second HED instance, but there is no reason for that variable
to hold an ACPI handle, so change it to a bool one.

While at it also set that variable at the end of acpi_hed_probe() to
avouid the need to clear it when installing the ACPI notify handler
fails.

Note that ACPI devices are enumerated sequentially, so there's no need
for additional locking around the accesses to that variable.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2042970.PYKUYFuaPT@rafael.j.wysocki

ACPI: battery: Switch over to devres-based resource management

The ACPI battery driver already uses devm_kzalloc() for allocating
memory and devm_mutex_init() for mutex initialization, but it still
carries out some manual rollback in acpi_battery_probe().

Switch it over to devres-based resource management completely by
making three changes:

* Rename acpi_battery_update_retry() to devm_acpi_battery_update_retry(),
   turn sysfs_battery_cleanup() into a devm action and modify the former
   to add it.

* Add devm_acpi_battery_init_wakeup() for initializing the wakeup
   source and make it add a custom devm action to automatically remove
   the wakeup source registered by it.

* Make acpi_battery_probe() use devm_acpi_install_notify_handler()
   that has just been introduced for installing an ACPI notify handler.

Note that the code ordering change related to the last of the above
changes does not matter because there is no functional dependency
between the PM notifier and the wakeup source or the ACPI notify
handler.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/10856906.nUPlyArG6x@rafael.j.wysocki

ACPI: AC: Switch over to devres-based resource management

Use devm_kzalloc() for allocating memory, devm_power_supply_register()
for registering a power supply class device and the newly introduced
devm_acpi_install_notify_handler() for installing an ACPI notify handler.

Note that the code ordering change related to the third of the above
modifications does not matter because there is no order dependency
between the battery notifier and the ACPI notify handler.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3422377.44csPzL39Z@rafael.j.wysocki

ACPI: NFIT: core: Use devm_acpi_install_notify_handler()

Now that devm_acpi_install_notify_handler() is available, use it in
acpi_nfit_probe() instead of a custom devm action removing an ACPI
notify handler installed via acpi_dev_install_notify_handler().

Also drop the explicit ACPI_COMPANION() check against NULL that is
not necessary any more becuase devm_acpi_install_notify_handler()
carries out an equivalent check internally and use ACPI_HANDLE() to
retrieve the platform device's ACPI handle.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3048737.e9J7NaK4W3@rafael.j.wysocki

ACPI: bus: Introduce devm_acpi_install_notify_handler()

Introduce devm_acpi_install_notify_handler() for installing an ACPI
notify handler managed by devres that will be removed automatically on
driver detach.

It installs the notify handler on the device object in the ACPI
namespace that corresponds to the owner device's ACPI companion, if
present (an error is returned if the owner device doesn't have an ACPI
companion).

Currently, there is no way to manually remove the notify handler
installed by it because none of its users brought on subsequently
will need to do that.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ rjw: Kerneldoc comment refinement ]
Link: https://patch.msgid.link/2268031.irdbgypaU6@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

RDMA/hns: Fix log flood after cmd_mbox failure

hns_roce_cmd_mbox() is the command interface between driver and
hardware. When hardware is abnormal, the unlimited error printings
after hns_roce_cmd_mbox() failure will cause log flood and even
system crash.

Replace ibdev_err() and ibdev_warn() with their ratelimited versions
in the error handling path after hns_roce_cmd_mbox() (and its wrappers
hns_roce_create_hw_ctx/hns_roce_destroy_hw_ctx) fails.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Link: https://patch.msgid.link/r/20260520055759.2354037-4-huangjunxian6@hisilicon.com
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/hns: Fix warning in poll cq direct mode

CQs allocated by ib_alloc_cq() always have a comp_handler. Though
in direct mode this handler is never expected to be called, it
is still called when the driver is reset, triggering the following
WARN_ONCE():

Call trace:
ib_cq_completion_direct+0x38/0x60
hns_roce_cq_completion+0x54/0x90 (hns_roce_hw_v2]
hns_roce_handle_device_err+Ox1c8/0x340 [hns_roce_hw_v2]
hns_roce_hw_v2_uninit_instance.constprop.0+0x34/0x70 [hns_roce_hw_v2]
hns_roce_hw_v2_reset_notify+0xc4/0xe0 [hns_roce_hw_v2]
hclge_notify_roce_client+0x60/0xbc [hclge]
hclge_reset_rebuild+0x48/0x34c [hclge]
hclge_reset_subtask+0xcc/0xec [hclge]
hclge_reset_service_task+0x80/0x160 [hclge]
hclge_service_task+0x50/0x80 (hclge]
process_one_work+0x1cc/0x4d0
worker_thread+0x154/0x414
kthread+0x104/0x144
ret_from_fork+0x10/0x18

Fixes: f295e4cece5c ("RDMA/hns: Delete unnecessary callback functions for cq")
Link: https://patch.msgid.link/r/20260520055759.2354037-3-huangjunxian6@hisilicon.com
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

gpu: nova-core: vbios: construct `FwSecBiosImage` directly from BIOS images

`FwSecBiosBuilder` now only contains `falcon_ucode_offset` which just
gets passed directly into `FwSecBiosImage`. Remove `FwSecBiosBuilder`
and construct `FwSecBiosImage` directly, as a simplification.

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-14-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: store PMU lookup entries in a KVVec

The current code copies the data into a KVec and parses it on demand. We
can simplify the code by storing the parsed entries.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-13-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: read PMU lookup entries using FromBytes

This simplifies the construction of `PmuLookupTableEntry` and is
allowed now that the driver can assume it is little endian.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-12-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: simplify setup_falcon_data

The code first computes `pmu_in_first_fwsec` or adjusts the offset and
then uses it in a branch just once to get the correct source for the PMU
table. This can be simplified to a single branch while also avoiding the
mutation of `offset`. Also, adjust the code after this to keep the
success case non-nested.

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-11-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: compute FWSEC-relative Falcon data offset

Push the computation of the falcon data offset into a helper function.
The subtraction to create the offset should be checked, and by doing
this the check can be folded into the existing check in
`falcon_data_ptr`.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-10-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: keep PmuLookupTable local in setup_falcon_data

This does not need to be stored in `FwSecBiosBuilder` so we can remove
it from there, and just create and use it locally in
`setup_falcon_data`.

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-9-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: drop unused falcon_data_offset from FwSecBiosBuilder

This is unused, so we can remove it.

Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-8-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use checked accesses in `setup_falcon_data`

Use checked arithmetic for `ucode_offset` in `setup_falcon_data`. This
prevents a malformed firmware from causing a panic.

Fixes: dc70c6ae2441 ("gpu: nova-core: vbios: Add support to look up PMU table in FWSEC")
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-7-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use checked access in `FwSecBiosImage::header`

Use checked access in `FwSecBiosImage::header` for getting the header
version since the value is firmware derived.

Fixes: 47c4846e4319 ("gpu: nova-core: vbios: Add support for FWSEC ucode extraction")
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-6-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use checked ops and accesses in `FwSecBiosImage::ucode`

Use checked arithmetic and access for extracting the microcode since the
offsets are firmware derived.

Fixes: 47c4846e4319 ("gpu: nova-core: vbios: Add support for FWSEC ucode extraction")
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-5-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: read BitToken using FromBytes

If `header.token_size` is smaller than `BitToken`, then we currently can
read past the end of `image.base.data`. Use checked arithmetic for
computing offsets and simplify reading it in using `FromBytes`.

Fixes: dc70c6ae2441 ("gpu: nova-core: vbios: Add support to look up PMU table in FWSEC")
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-4-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: avoid reading too far in read_more_at_offset

Fix bug where `read_more_at_offset` would unnecessarily read more data.
This happens when the window to read has some part cached and some part
not. It would read `len` bytes instead of just the uncached portion,
which could read past `BIOS_MAX_SCAN_LEN`.

Fixes: 6fda04e7f0cd ("gpu: nova-core: vbios: Add base support for VBIOS construction and iteration")
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-3-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: use checked arithmetic for bios image range end

`read_bios_image_at_offset` is called with a length from the VBIOS
header, so we should be more defensive here and use checked arithmetic.

Fixes: 6fda04e7f0cd ("gpu: nova-core: vbios: Add base support for VBIOS construction and iteration")
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-2-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: vbios: stop scanning at BIOS_MAX_SCAN_LEN

Current code lets `current_offset` go to `BIOS_MAX_SCAN_LEN` which is
one byte too far.

Fixes: 6fda04e7f0cd ("gpu: nova-core: vbios: Add base support for VBIOS construction and iteration")
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260525-fix-vbios-v5-1-e5e455251537@nvidia.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

IB/mlx4: Fix refcount leak in add_port() error path

After kobject_init_and_add(), the lifetime of the embedded struct
kobject is expected to be managed through the kobject core reference
counting.

In add_port(), failure paths after kobject_init_and_add() must not free
struct mlx4_port directly, because the embedded kobject is then managed
by the kobject core. Freeing it directly leaves the kobject reference
counting unbalanced and can lead to incorrect lifetime handling.

Allocate the pkey and gid attribute arrays before kobject_init_and_add(),
so failures before kobject initialization can be handled by directly
freeing the allocated memory. Once kobject_init_and_add() has been
called, unwind later failures by removing any successfully created sysfs
groups, calling kobject_del(), and then releasing the embedded kobject
with kobject_put().

Fixes: c1e7e466120b ("IB/mlx4: Add iov directory in sysfs under the ib device")
Link: https://patch.msgid.link/r/20260518021910.972900-1-lgs201920130244@gmail.com
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/rxe: Fix a use-after-free problem in rxe_mmap

rxe_mmap() removes a rxe_mmap_info struct from the pending_mmaps list
and releases pending_lock while the struct's kref is still at 1:

   list_del_init(&ip->pending_mmaps);
   spin_unlock_bh(&rxe->pending_lock);   /* ref == 1, no lock held */
   ret = remap_vmalloc_range(vma, ip->obj, 0);  /* walks PTEs */
   [...]
   rxe_vma_open(vma);                    /* kref_get, ref → 2 */
   remap_vmalloc_range_partial() walks PTEs without any lock.

A concurrent DESTROY_CQ ioctl on another CPU calls:

    kref_put(&q->ip->ref, rxe_mmap_release)   /* ref 1→0 */
    vfree(ip->obj)   /* clears vmalloc PTEs mid-walk */
    kfree(ip)        /* frees rxe_mmap_info */

This yields:

   1. Kernel crash, vmalloc_to_page() returns NULL when vfree wins the
   per-PTE race -> vm_insert_page(NULL) → GPF in validate_page_before_insert

   2. Page UAF, vmalloc_to_page() reads a stale PTE before vfree clears
   it. User VMA holds a PTE to a free'd page which might eventually get
   reallocated later by vmalloc which allows the attacker to get a clean
   page-level UAF.

   It is worth noting that even though a page-level UAF is possible given
   the strong primitive, it is statistically very difficult to achieve
   given the very short time window (after the last insert_page and before
   the kref_get).

The call trace are as below:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
  CPU: 0 UID: 1000 PID: 413 Comm: poc Not tainted 7.0.0-rc5-dirty #28 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:validate_page_before_insert+0x32/0x300
  Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 89 f3 e8 93 b5 a3 ff 48 8d 7b 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 7b 02 00 00 4c 8b 63 08 31 ff 4d 89 e5 41 83 e5
  RSP: 0018:ffff88811b15f2f0 EFLAGS: 00000202
  RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
  RBP: ffff88811b15f318 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881181eee00
  R13: 0000000000000000 R14: ffff8881181eee00 R15: ffff8881181eee20
  FS:  00007b1e000f76c0(0000) GS:ffff8884268e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007b1e00a24ac0 CR3: 0000000116eb3000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   insert_page+0x8f/0x190
   ? __pfx_insert_page+0x10/0x10
   ? kasan_save_alloc_info+0x38/0x60
   vm_insert_page+0x2e7/0x400
   remap_vmalloc_range_partial+0x212/0x3e0
   remap_vmalloc_range+0x6e/0xb0
   ? __kasan_check_write+0x14/0x30
   rxe_mmap+0x2e9/0x5d0
   ib_uverbs_mmap+0x1ad/0x2c0
   __mmap_region+0x12c2/0x2ad0
   ? __pfx___mmap_region+0x10/0x10
   ? __sanitizer_cov_trace_switch+0x58/0xb0
   ? mas_prev_slot+0x360/0x39c0
   ? __sanitizer_cov_trace_switch+0x58/0xb0
   ? mas_next_slot+0x1e5b/0x2f40
   ? __sanitizer_cov_trace_cmp8+0x18/0x30
   ? unmapped_area_topdown+0x4dd/0x610
   ? kfree+0x1b1/0x440
   ? free_cpumask_var+0x16/0x30
   ? __kasan_slab_free+0x7d/0xa0
   ? __sanitizer_cov_trace_cmp8+0x18/0x30
   mmap_region+0x2e6/0x3c0
   do_mmap+0xa3e/0x12a0
   ? __pfx_do_mmap+0x10/0x10
   ? __kasan_check_write+0x14/0x30
   ? down_write_killable+0xba/0x160
   ? __pfx_down_write_killable+0x10/0x10
   ? __sanitizer_cov_trace_cmp4+0x16/0x30
   vm_mmap_pgoff+0x2d4/0x4a0
   ? __pfx_vm_mmap_pgoff+0x10/0x10
   ? fget+0x1bf/0x270
   ksys_mmap_pgoff+0x40c/0x690
   ? __sanitizer_cov_trace_const_cmp4+0x16/0x30
   ? __pfx_ksys_mmap_pgoff+0x10/0x10
   ? __kasan_check_write+0x14/0x30
   ? _raw_spin_trylock+0xbb/0x130
   ? __pfx__raw_spin_trylock+0x10/0x10
   __x64_sys_mmap+0x135/0x1e0
   x64_sys_call+0x1c14/0x2790
   do_syscall_64+0xd2/0x1050
   ? rcu_core+0x352/0x7d0
   ? rcu_core_si+0xe/0x20
   ? handle_softirqs+0x1aa/0x650
   ? __sanitizer_cov_trace_cmp4+0x16/0x30
   ? fpregs_assert_state_consistent+0xe1/0x160
   ? irqentry_exit+0xb1/0x670
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Link: https://patch.msgid.link/r/20260515002537.6209-1-yanjun.zhu@linux.dev
Reported-and-tested-by: nasm <n4sm@protonmail.com>
Suggested-by: nasm <n4sm@protonmail.com>
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

firmware: zynqmp: Add dynamic CSU register discovery and sysfs interface

Add support for dynamically discovering and exposing Configuration
Security Unit (CSU) registers through sysfs. Leverage the existing
PM_QUERY_DATA API to discover available registers at runtime, making
the interface flexible and maintainable.

Key features:
- Dynamic register discovery using PM_QUERY_DATA API
  * PM_QID_GET_NODE_COUNT: Query number of available registers
  * PM_QID_GET_NODE_NAME: Query register names by index
- Automatic sysfs attribute creation under csu_registers/ group
- Read operations via existing IOCTL_READ_REG API
- Write operations via existing IOCTL_MASK_WRITE_REG API

The sysfs interface is created at:
  /sys/devices/platform/firmware:zynqmp-firmware/csu_registers/

Currently supported registers include:
  - multiboot (CSU_MULTI_BOOT)
  - idcode (CSU_IDCODE, read-only)
  - pcap-status (CSU_PCAP_STATUS, read-only)

The dynamic discovery approach allows firmware to control which
registers are exposed without requiring kernel changes, improving
maintainability and security.

The firmware does not currently expose per-register access mode
information, so the kernel cannot distinguish read-only registers
from read-write ones at discovery time. All discovered registers are
therefore created with sysfs mode 0644, and the firmware is
responsible for rejecting writes to registers it treats as read-only
(for example idcode and pcap-status); that error is propagated back
to userspace from the store callback. If a per-register access-mode
query is added to the firmware in the future, sysfs permissions can
be tightened to match.

CSU register discovery is an optional feature: on firmware that lacks
support for PM_QID_GET_NODE_COUNT or PM_QID_GET_NODE_NAME, the probe
returns gracefully without exposing any sysfs entries. To keep the
memory footprint minimal on that path, partial devm allocations made
during discovery are explicitly released on failure so that no memory
lingers until device unbind when the feature is unavailable.

Signed-off-by: Ronak Jain <ronak.jain@amd.com>
Signed-off-by: Michal Simek <michal.simek@amd.com>
Link: https://lore.kernel.org/r/20260520093654.3303917-3-ronak.jain@amd.com

Documentation: ABI: add sysfs interface for ZynqMP CSU registers

Document the new sysfs interface that exposes Configuration Security
Unit (CSU) registers through the zynqmp-firmware driver.

The interface is available under:

  /sys/devices/platform/firmware:zynqmp-firmware/csu_registers/

The CSU registers are discovered at boot time using the PM_QUERY_DATA
firmware API. The following registers are currently supported:

  - multiboot     (CSU_MULTI_BOOT)
  - idcode        (CSU_IDCODE, read-only)
  - pcap-status   (CSU_PCAP_STATUS, read-only)

Read operations use the existing IOCTL_READ_REG firmware interface,
while write operations use IOCTL_MASK_WRITE_REG.

Access control is enforced by the firmware. Write attempts to
read-only registers are rejected by firmware even though the sysfs file
permissions allow writes.

Document the ABI entry accordingly.

Signed-off-by: Ronak Jain <ronak.jain@amd.com>
Signed-off-by: Michal Simek <michal.simek@amd.com>
Link: https://lore.kernel.org/r/20260520093654.3303917-2-ronak.jain@amd.com

RDMA/irdma: Fix out-of-bounds write in irdma_copy_user_pgaddrs

The irdma_copy_user_pgaddrs function loops through all of the umem DMA
blocks to populate the PBLEs and will stop when either the last DMA
block is reached or palloc->total_cnt is reached. The issue is that
the logic for checking palloc->total_cnt would only work for non-zero
values.

When irdma_setup_pbles is called with lvl==0, it
calls irdma_copy_user_pgaddrs with palloc->total_cnt==0, which means
the only way to break out of the loop is to reach the last umem DMA
block, which means it could end up going beyond the fixed size of 4
iwmr->pgaddrmem array that is used in the lvl==0 case.

In the case of QP/CQ/SRQ rings, the value of lvl is determined by a
separate input (for example, req.cq_pages in the case of a CQ). So,
we must perform explicit checking to ensure we don't overflow the
pgaddrmem array if the user provides a umem that consists of more
blocks than their provided req.cq_pages.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://patch.msgid.link/r/20260512183852.614045-1-jmoroni@google.com
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc5

Cross-merge BPF and other fixes after downstream PR.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>