Wu Fei [Thu, 4 Jun 2026 23:03:15 +0000 (07:03 +0800)]
RISC-V: KVM: Fix skip of valid pages in kvm_riscv_gstage_unmap_range
As in kvm_riscv_gstage_wp_range, potentially valid pages must not be
skipped when !found_leaf. Unlike the wp case, which may write-protect
more than was asked, unmap must not cover more than the requested range.
No splitting is added for now; a warning is logged instead.
Wu Fei [Thu, 4 Jun 2026 23:03:14 +0000 (07:03 +0800)]
RISC-V: KVM: Fix skip of valid pages in kvm_riscv_gstage_wp_range
The current gstage range walker unconditionally advances by 'page_size'
when a leaf PTE is not found, e.g. when the range to wp is
[0xfffff01fc000, 0xfffff023c000) and page_size is 2MB, if found_leaf of
0xfffff01fc000 returns false, it skips the whole range, but it's possible
to have valid entries in [0xfffff0200000, 0xfffff023c000).
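A minimal sketch of the arithmetic involved, not the actual KVM walker: when no leaf is found at an address, stepping to the next page_size-aligned boundary instead of adding a full page_size keeps the rest of the range in play. The helper name next_level_boundary is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: first page_size-aligned address strictly above addr.
 * page_size must be a power of two.
 */
static uint64_t next_level_boundary(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (addr & ~(page_size - 1)) + page_size;
}
```

With the range from the message, an unconditional "addr += page_size" starting at 0xfffff01fc000 yields 0xfffff03fc000, which is already past the end 0xfffff023c000, while the boundary step yields 0xfffff0200000.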
Vivian Wang [Tue, 3 Mar 2026 05:29:49 +0000 (13:29 +0800)]
riscv: mm: Unconditionally sfence.vma for spurious fault
Svvptc does not guarantee that it's safe to just return here. Since we
have already cleared our bit, if, theoretically, the bounded timeframe
for the accessed page to become valid still hasn't happened after sret,
we could fault again and actually crash.
Hopefully, these spurious faults should be rare enough that this is an
acceptable slowdown.
Vivian Wang [Tue, 3 Mar 2026 05:29:48 +0000 (13:29 +0800)]
riscv: mm: Use the bitmap API for new_valid_map_cpus
The bitmap was defined with incorrect size. Fix it by using the proper
bitmap API in C code. The corresponding assembly code is still okay and
remains unchanged.
Vivian Wang [Tue, 3 Mar 2026 05:29:47 +0000 (13:29 +0800)]
riscv: mm: Rename new_vmalloc into new_valid_map_cpus
Since this mechanism is now used for the kfence pool, which comes from
the linear mapping and not vmalloc, rename new_vmalloc into
new_valid_map_cpus to avoid misleading readers.
Vivian Wang [Tue, 3 Mar 2026 05:29:46 +0000 (13:29 +0800)]
riscv: kfence: Call mark_new_valid_map() for kfence_unprotect()
In kfence_protect_page(), which kfence_unprotect() calls, we cannot send
IPIs to other CPUs to ask them to flush TLB. This may lead to those CPUs
spuriously faulting on a recently allocated kfence object despite it
being valid, leading to false positive use-after-free reports.
Fix this by calling mark_new_valid_map() so that the page fault handling
code path notices the spurious fault and flushes TLB then retries the
access.
Update the comment in handle_exception to indicate that
new_valid_map_cpus_check also handles kfence_unprotect() spurious
faults.
Note that kfence_protect() has the same stale TLB entries problem, but
that leads to false negatives, which is fine with kfence.
Rui Qi [Sun, 7 Jun 2026 02:17:59 +0000 (20:17 -0600)]
riscv: stacktrace: Remove bogus -0x4 offset in non-FP walk_stackframe
In the non-frame-pointer version of walk_stackframe, each value read
from the stack is treated as a potential return address and has 0x4
subtracted before being used as the program counter. This was intended
to convert the return address (the instruction after a call) back to
the call site, but it is incorrect:
1. RISC-V has variable-length instructions due to the RVC (compressed
instruction) extension. A call instruction can be either 4 bytes
(regular) or 2 bytes (compressed, e.g. c.jal). Subtracting a fixed
0x4 assumes all call instructions are 4 bytes, which is wrong for
compressed instructions.
2. Stack traces conventionally report return addresses, not call sites.
Other architectures (ARM64, x86, ARM) do not subtract instruction
size from return addresses in their stack unwinding code.
3. The frame-pointer version of walk_stackframe already dropped the
-0x4 offset. Commit b785ec129bd9 ("riscv/ftrace: Add
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR support") replaced "pc =
frame->ra - 0x4" with ftrace_graph_ret_addr(), and the commit
message explicitly noted that "the original calculation, pc =
frame->ra - 4, is buggy when the instruction at the return address
happened to be a compressed inst." The non-FP version was simply
overlooked.
Remove the bogus -0x4 offset to match the FP version and the
conventions used by other architectures.
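To illustrate point 1, here is a small sketch (not kernel code) of how RISC-V instruction length follows from the low two bits of the first 16-bit parcel. The helper names insn_len and call_site are hypothetical, and the sketch ignores the reserved encodings longer than 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length in bytes of a RISC-V instruction, from its first 16-bit parcel.
 * Encodings whose two low bits are not 0b11 are 16-bit compressed (RVC).
 */
static unsigned int insn_len(uint16_t first_parcel)
{
	return (first_parcel & 0x3) == 0x3 ? 4 : 2;
}

/* Call site for a return address, given the length of the call instruction. */
static uint64_t call_site(uint64_t ra, unsigned int call_len)
{
	assert(call_len == 2 || call_len == 4);
	return ra - call_len;
}
```

For a 2-byte call at 0x1000 the return address is 0x1002; subtracting a fixed 4 gives 0xffe, which is not the call site. The unwinder cannot know the call length from the return address alone, which is one more reason to report the return address unchanged.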
Zishun Yi [Sun, 7 Jun 2026 02:17:58 +0000 (20:17 -0600)]
riscv: cacheinfo: Fix node reference leak in populate_cache_leaves
Currently, the while loop drops the reference to prev in each iteration.
If the loop terminates early due to a break, the final of_node_put(np)
correctly drops the reference to the current node.
However, if the loop terminates naturally because np == NULL, calling
of_node_put(np) is a no-op. This leaves the last valid node stored in
prev without its reference dropped, resulting in a node reference leak.
Fix this by changing the final `of_node_put(np)` to `of_node_put(prev)`.
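The reference flow can be modelled in user space as follows. This is only a sketch under stated assumptions: mock_node, mock_put, mock_next and walk are hypothetical stand-ins for the device-tree node, of_node_put(), the cache-node iterator and the loop in populate_cache_leaves; they are not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Mock node with a reference count; stands in for a device-tree node. */
struct mock_node {
	int refs;
	struct mock_node *next;
};

/* Mock put: drops one reference, no-op on NULL. */
static void mock_put(struct mock_node *n)
{
	if (n)
		n->refs--;
}

/* Mock iterator: returns the next node with a reference taken, or NULL. */
static struct mock_node *mock_next(struct mock_node *n)
{
	struct mock_node *next = n->next;

	if (next)
		next->refs++;
	return next;
}

/* Walk the chain; stop early at stop_at (NULL means walk to the end). */
static void walk(struct mock_node *start, const struct mock_node *stop_at)
{
	struct mock_node *np = start, *prev = start;

	assert(start != NULL);
	start->refs++;		/* reference held on the first node */
	while ((np = mock_next(np))) {
		mock_put(prev);	/* the previous node is no longer needed */
		prev = np;
		if (np == stop_at)
			break;
	}
	mock_put(prev);		/* covers both the break and the np == NULL exit */
}
```

On an early break prev and np are the same node, so putting prev is equivalent to the old code. On a natural exit np is NULL and only prev still holds the reference that needs dropping.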
Fixes: 94f9bf118f1e ("RISC-V: Fix of_node_* refcount")
Cc: stable@vger.kernel.org
Assisted-by: Gemini:gemini-3.1-pro
Signed-off-by: Zishun Yi <vulab@iscas.ac.cn>
Link: https://patch.msgid.link/20260509074040.1747800-1-vulab@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Han Gao [Sun, 7 Jun 2026 02:17:58 +0000 (20:17 -0600)]
riscv: kexec_file: Constrain segment placement to direct map
When kexec_file_load places segments with buf_max=ULONG_MAX and
top_down=true, they land at the highest available physical addresses.
On RISC-V the size of the linear mapping is determined by the active
VM mode: SV39 caps the direct map at roughly 128GB, while SV48/SV57
extend the range substantially further. When the installed physical
memory exceeds the direct map size of the active mode, top-down
placement puts DTB/initrd at physical addresses outside the linearly
mapped region. The kexec'd kernel cannot reach them during early
boot, triggering a page fault at memcmp in start_kernel.
Fix by constraining buf_max to PFN_PHYS(max_low_pfn), which reflects
the runtime direct map boundary for the active VM mode (SV39/SV48/
SV57). This keeps all kexec segments within the linearly mapped
region while preserving the upstream top_down allocation strategy.
Vivian Wang [Sun, 7 Jun 2026 02:17:54 +0000 (20:17 -0600)]
riscv: mm: Define DIRECT_MAP_PHYSMEM_END
On RISC-V, the actual mappable range of physical address space is
dependent on the current MMU mode i.e. satp_mode (See
Documentation/arch/riscv/vm-layout.rst).
Define the DIRECT_MAP_PHYSMEM_END macro based on the existing virtual
address space layout macros to expose this information to
get_free_mem_region(). Otherwise, it returns a region that couldn't be
mapped, which breaks ZONE_DEVICE.
Hui Wang [Sun, 7 Jun 2026 02:17:54 +0000 (20:17 -0600)]
riscv: cpu_ops_sbi: No need to be bothered to check ret.error
If ret.error equals 0, sbi_err_map_linux_errno() can also handle it:
when ret.error is SBI_SUCCESS, it returns 0 immediately, so there is no
need to check ret.error here.
Hui Wang [Sun, 7 Jun 2026 02:17:54 +0000 (20:17 -0600)]
riscv: cpu_ops: Change return value type of cpu_is_stopped() to bool
In the original sbi_cpu_is_stopped(), if rc doesn't equal
SBI_HSM_STATE_STOPPED, rc is returned to the caller directly. This hides
a problem: rc could be SBI_HSM_STATE_STARTED, in which case the function
reports the CPU as stopped while it isn't really stopped.
Furthermore, the name cpu_is_stopped() suggests a bool return value:
true means the CPU is stopped, false means it is not.
Change the return type to bool and update the callers accordingly. This
fixes both issues.
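The ambiguity can be sketched in isolation. The enum values below are assumptions made for this example (with STARTED being 0), and old_cpu_is_stopped and cpu_is_stopped are simplified stand-ins that take the status directly rather than querying SBI.

```c
#include <assert.h>
#include <stdbool.h>

/* HSM state values as assumed for this sketch. */
enum hsm_state { HSM_STARTED = 0, HSM_STOPPED = 1, HSM_START_PENDING = 2 };

static_assert(HSM_STARTED == 0, "the ambiguity relies on STARTED being 0");

/*
 * Old shape: 0 means "stopped", any other status is passed through.
 * HSM_STARTED is also 0, so a running hart is reported as stopped.
 */
static int old_cpu_is_stopped(int status)
{
	if (status == HSM_STOPPED)
		return 0;
	return status;
}

/* New shape: a plain predicate. */
static bool cpu_is_stopped(int status)
{
	return status == HSM_STOPPED;
}
```

With these definitions old_cpu_is_stopped(HSM_STARTED) and old_cpu_is_stopped(HSM_STOPPED) both return 0, while the bool version distinguishes the two.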
Fixes: f1e58583b9c7c ("RISC-V: Support cpu hotplug")
Signed-off-by: Hui Wang <hui.wang@canonical.com>
Link: https://patch.msgid.link/20260413123515.48423-1-hui.wang@canonical.com
[pjw@kernel.org: cleaned up some of the pr_warn() messages]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Hui Wang [Sun, 7 Jun 2026 02:17:53 +0000 (20:17 -0600)]
riscv: kexec_elf: Remove unused pr_fmt definition
Remove the pr_fmt macro as no pr_*() calls exist in this file. The
prefix string "kexec_image: " is also not appropriate for kexec_elf.c.
If pr_fmt is needed in the future, a more appropriate prefix such as
"kexec_file(elf): " can be added at that time, with kexec_image.c as a
reference.
Chen Pei [Sun, 7 Jun 2026 02:17:53 +0000 (20:17 -0600)]
riscv: ftrace: select HAVE_BUILDTIME_MCOUNT_SORT
RISC-V already satisfies all prerequisites for build-time mcount sorting:
the sorttable host tool handles EM_RISCV in its machine-type dispatch, and
the __mcount_loc section entries are stored as direct virtual addresses in
the final vmlinux binary, so no relocation processing is required during
the sort step.
Select HAVE_BUILDTIME_MCOUNT_SORT so that BUILDTIME_MCOUNT_SORT is
automatically enabled when DYNAMIC_FTRACE is configured. This allows
sorttable to sort the __mcount_loc section at link time, making the
run-time ftrace initialisation path skip the software sort and reducing
kernel startup overhead.
Verified with CONFIG_FTRACE_SORT_STARTUP_TEST=y, which confirms that
the section produced by the build is already in ascending order:
Julian Braha [Sun, 7 Jun 2026 02:17:53 +0000 (20:17 -0600)]
riscv: dead code cleanup in kconfig for RISCV_PROBE_VECTOR_UNALIGNED_ACCESS
The same Kconfig statement 'depends on RISCV_ISA_V' appears twice for
RISCV_PROBE_VECTOR_UNALIGNED_ACCESS. The first instance is in its choice
menu, "Vector unaligned Accesses Support", making the second instance in
its specific Kconfig definition dead code. I propose removing this second
instance.
This dead code was found by kconfirm, a static analysis tool for Kconfig.
Julian Braha [Sun, 3 May 2026 04:03:31 +0000 (05:03 +0100)]
riscv: replace select with dependency for visible RELOCATABLE
RANDOMIZE_BASE currently selects RELOCATABLE even though RELOCATABLE
is visible to users. Some other architectures, like x86, use 'depends on'
for RELOCATABLE in their definition of RANDOMIZE_BASE, so let's do the same
here.
This select-visible Kconfig misuse was detected by kconfirm, a static
analysis tool for Kconfig.
Jinyu Tang [Thu, 4 Jun 2026 14:26:02 +0000 (22:26 +0800)]
KVM: selftests: Add a hugetlb memslot alignment test mode
kvm_page_table_test can already exercise hugetlb-backed guest memory,
but it always creates the test memslot with GPA alignment matching the
hugetlb backing size. That misses the case where a valid hugetlb
memslot is later moved so that the memslot GPA and HVA no longer have
the same offset within the backing huge page.
Add a -u option that moves the test memslot GPA by one guest page after
creating the hugetlb memslot. The memslot is created through the normal
helper first, so the backing allocation remains valid and hugetlb aligned.
Moving the memslot then creates a deliberate HVA/GPA offset mismatch
before the guest mapping is installed.
This mode is useful for checking that architecture MMUs do not install
a block mapping when the block would map the wrong host pages or cover
memory outside the memslot. The option is restricted to hugetlb-backed
test memory because it's specifically about hugetlb block mapping
eligibility.
Jinyu Tang [Thu, 4 Jun 2026 14:26:01 +0000 (22:26 +0800)]
KVM: riscv: Check hugetlb block mappings against memslot bounds
RISC-V KVM has used the hugetlb VMA size directly as the G-stage
mapping size since stage-2 page table support was added. That is safe
only if the block covered by the fault is fully contained in the
memslot and the userspace address has the same offset as the GPA
within that block.
The THP path already checks those constraints before installing a PMD
block mapping. The hugetlb path did not, so an unaligned memslot could
make KVM install a PMD or PUD sized G-stage block that covers memory
outside the slot or maps the wrong host pages.
Pass the target mapping size into fault_supports_gstage_huge_mapping().
The same helper can be used for both THP PMD mappings and hugetlb
PMD/PUD mappings.
Select hugetlb mapping sizes through the same memslot-boundary check,
falling back from PUD to PMD to PAGE_SIZE. When a smaller hugetlb
mapping size is selected, fault the GFN aligned to that selected size
instead of the original VMA size.
Also keep hugetlb mappings out of transparent_hugepage_adjust(). Once
the hugetlb path has chosen PAGE_SIZE, promoting it again through the
THP helper would miss the hugetlb fallback decision.
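The two constraints described above can be sketched as a standalone predicate. This is an illustration under assumptions, not the kernel helper: struct slot and block_mapping_ok are hypothetical names, and the slot is reduced to a guest-physical base, a host-virtual base and a length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified memslot description (all values in bytes). */
struct slot {
	uint64_t gpa_start;	/* guest-physical base */
	uint64_t hva_start;	/* host-virtual base */
	uint64_t size;		/* length of the slot */
};

/*
 * True if a map_size block around hva may be installed: GPA and HVA share
 * the same offset within the block, and the block lies entirely inside the
 * slot. map_size must be a power of two.
 */
static bool block_mapping_ok(const struct slot *s, uint64_t hva,
			     uint64_t map_size)
{
	uint64_t mask = map_size - 1;
	uint64_t block_start = hva & ~mask;

	assert(map_size != 0 && (map_size & mask) == 0);
	if ((s->gpa_start & mask) != (s->hva_start & mask))
		return false;
	return block_start >= s->hva_start &&
	       block_start + map_size <= s->hva_start + s->size;
}
```

A caller would try the largest size first and fall back (PUD, then PMD, then PAGE_SIZE) until the predicate holds, which mirrors the fallback order described in the message.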
In particular, the address of a label is only expected to be used with a
computed goto.
While the generic version more or less works today, it is known to be
brittle and may break with current and future optimizations. For
example, Clang -O2 always returns 1 when this function is inlined:
Fix it by overriding _THIS_IP_ in <asm/linkage.h> (which is included by
<linux/instruction_pointer.h>) using an architecture-specific inline asm
version. Additionally, avoiding taking the address of a label prevents
compilers from emitting spurious indirect branch targets (e.g. ENDBR or
BTI) under control-flow integrity schemes.
Florian Schmaus [Sun, 7 Jun 2026 02:17:53 +0000 (20:17 -0600)]
riscv: module: Use generic cmp_int() instead of custom cmp_3way()
The module-sections.c file defines a custom cmp_3way() macro to perform
3-way comparisons during relocation sorting.
Instead of maintaining our own implementation, use the generic
cmp_int() macro provided by the already included <linux/sort.h>. This
removes redundant code and relies on standard kernel interfaces.
Rui Qi [Sun, 7 Jun 2026 02:17:53 +0000 (20:17 -0600)]
riscv: Fix ftrace_graph_ret_addr() to use the correct task pointer
The walk_stackframe() function is used to unwind the stack of a given
task. When function graph tracing is enabled, ftrace_graph_ret_addr()
is called to resolve the original return address if it was modified by
the tracer.
The current code incorrectly passes 'current' instead of 'task' to
ftrace_graph_ret_addr(). This causes incorrect return address resolution
when unwinding a stack of a different task (e.g., when the task is
blocked in __switch_to).
Fix this by passing 'task' instead of 'current' to match the behavior
of other architectures (arm64, loongarch, powerpc, s390, x86).
Zong Li [Sun, 7 Jun 2026 02:17:52 +0000 (20:17 -0600)]
selftests/riscv: fix compiler output flag spacing in all Makefiles
Standardize the compiler output flag format across all RISC-V
selftests by adding a space between '-o' and '$@'.
Although '-o$@' is perfectly valid for GCC/Clang to parse,
changing it to '-o $@' with a space aligns with the GNU
official documentation conventions, improves readability by
visually separating the flag from the target variable, and
ensures consistency with other architectures.
Currently, RISC-V selftests use '-o$@' (without space) in 13
instances across 6 Makefiles, while all other architectures
consistently use '-o $@' (with space). This inconsistency makes
RISC-V an outlier in the kernel's selftest infrastructure.
Thorsten Blum [Sun, 7 Jun 2026 02:17:52 +0000 (20:17 -0600)]
riscv: propagate insert_resource result from add_resource
Currently, add_resource() returns 1 on success, even though its callers
only check for negative values. Instead, propagate the insert_resource()
result from add_resource() to align with standard kernel return-value
conventions (0 on success, negative errno on failure).
Use %pR to print the full resource range while at it.
Thorsten Blum [Sun, 7 Jun 2026 02:17:52 +0000 (20:17 -0600)]
riscv: use sysfs_emit in cpu_show_ghostwrite
Replace sprintf() with sysfs_emit() in cpu_show_ghostwrite(), which is
preferred for formatting sysfs output because it provides safer bounds
checking.
While the current code only emits fixed strings that fit easily within
PAGE_SIZE, use sysfs_emit() to follow secure coding best practices.
Thorsten Blum [Sun, 7 Jun 2026 02:17:52 +0000 (20:17 -0600)]
riscv: pi: replace strlcat with strscpy in get_early_cmdline
Use the return value of strscpy() instead of calling strlen(fdt_cmdline)
again and return early on string truncation. Drop the explicit size
argument since early_cmdline has a fixed length, which strscpy()
determines using sizeof() when the argument is omitted.
Replace strlcat() with strscpy() to append CONFIG_CMDLINE.
Also remove the unnecessary fdt_cmdline NULL initialization.
Thorsten Blum [Sun, 7 Jun 2026 02:17:51 +0000 (20:17 -0600)]
riscv/purgatory: return bool from verify_sha256_digest
Change the function's return type from int to bool and return the result
of memcmp() directly to simplify the code. While at it, cast ->start to
'const u8 *' to better match the expected type.
Richard Patel [Mon, 18 May 2026 18:39:18 +0000 (18:39 +0000)]
riscv: cfi: reject unknown flags in PR_SET_CFI
prctl(PR_SET_CFI,PR_CFI_BRANCH_LANDING_PADS) silently ignored
unknown control values. Only PR_CFI_{ENABLE,DISABLE,LOCK} should
be permitted.
This changes the behavior of the uABI (fails previously accepted bits
with EINVAL).
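The check amounts to masking against the set of known bits. A minimal sketch, assuming illustrative bit positions: the EX_CFI_* values below are placeholders rather than the real PR_CFI_* constants, and check_cfi_ctrl is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative bit values only; the real PR_CFI_* constants are in the uapi headers. */
#define EX_CFI_ENABLE	(1UL << 0)
#define EX_CFI_DISABLE	(1UL << 1)
#define EX_CFI_LOCK	(1UL << 2)
#define EX_CFI_KNOWN	(EX_CFI_ENABLE | EX_CFI_DISABLE | EX_CFI_LOCK)

static_assert((EX_CFI_ENABLE & EX_CFI_DISABLE) == 0, "flags must not overlap");

/* Return 0 if only known bits are set, -EINVAL otherwise. */
static int check_cfi_ctrl(unsigned long ctrl)
{
	if (ctrl & ~EX_CFI_KNOWN)
		return -EINVAL;
	return 0;
}
```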
Fixes: 08ee1559052b ("prctl: cfi: change the branch landing pad prctl()s to be more descriptive")
Signed-off-by: Richard Patel <ripatel@wii.dev>
Link: https://patch.msgid.link/20260518183918.322545-1-ripatel@wii.dev
[pjw@kernel.org: change the patch description to note that although this is a uABI change, it does not break the uABI]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Nam Cao [Tue, 7 Apr 2026 12:06:39 +0000 (14:06 +0200)]
riscv: Fix fast_unaligned_access_speed_key not getting initialized
The static key fast_unaligned_access_speed_key is supposed to be
initialized after check_unaligned_access_all_cpus() has been completed.
However, check_unaligned_access_all_cpus() has been moved to late_initcall
while setting fast_unaligned_access_speed_key still happens at
arch_initcall_sync, thus the static key does not get properly initialized.
fast_unaligned_access_speed_key can still be initialized in CPU hotplug
events, but that cannot be relied on.
Move fast_unaligned_access_speed_key's initialization into
check_unaligned_access_all_cpus() to fix this issue. This also prevents
someone from moving one initcall while forgetting the other in the future.
Fixes: 6455c6c11827 ("riscv: Clean up & optimize unaligned scalar access probe")
Reported-by: Michael Neuling <mikey@neuling.org>
Closes: https://lore.kernel.org/linux-riscv/CAEjGV6y0=bSLp_wrS0uHFj1S2TCRtz4GKzaU5O-L1VV-EL7Nnw@mail.gmail.com/
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/20260407120639.4006031-1-namcao@linutronix.de
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Andreas Schwab [Thu, 21 May 2026 22:34:30 +0000 (00:34 +0200)]
riscv/ptrace: Use USER_REGSET_NOTE_TYPE for REGSET_CFI
Fixes a warning while dumping core:
[54983.546369][ C7] WARNING: [!note_name] fs/binfmt_elf.c:1771 at elf_core_dump+0x910/0xf68, CPU#7: abort01/31982
Fixes: 2af7c9cf021c ("riscv/ptrace: expose riscv CFI status and state via ptrace and in core files")
Signed-off-by: Andreas Schwab <schwab@suse.de>
Link: https://patch.msgid.link/87y0hcxuh5.fsf@igel.home
Signed-off-by: Paul Walmsley <pjw@kernel.org>
A constant offset added to a PTR_TO_FLOW_KEYS register lands in
reg->var_off, but check_flow_keys_access() bounds-checks only insn->off
and never folds reg->var_off.value. A BPF_PROG_TYPE_FLOW_DISSECTOR
program can therefore do "flow_keys += 0x1000; *(flow_keys + 0)" and have
it accepted, then read/write kernel stack past struct bpf_flow_keys at
runtime. Patch 1 folds reg->var_off.value into the offset (and rejects
non-constant offsets), mirroring check_ctx_access(); patch 2 adds verifier
selftests.
This is a regression introduced in the 7.1 development cycle by commit 022ac0750883 ("bpf: use reg->var_off instead of reg->off for pointers"),
which moved the constant offset from reg->off (folded generically before 022ac0750883) into reg->var_off without updating the flow_keys path. No
released kernel is affected: v7.0.x rejects the program above, and the bug
reproduces only on v7.1-rc1..rc5, so no stable backport is needed.
It was first reported privately to security@kernel.org; per their guidance
it is handled in the open as a normal regression fix. Found by manual
verifier audit and confirmed dynamically in a disposable QEMU/KVM guest:
the load above is accepted, a runtime read leaked a kernel-stack pointer
0x1000 past bpf_flow_keys, and a runtime write of a marker faulted the
guest in net_rx_action.
An alternative -- forbidding pointer arithmetic on PTR_TO_FLOW_KEYS
outright by dropping "if (known) break;" in adjust_ptr_min_max_vals() --
was rejected because v7.0.x accepted (and correctly bounds-checked)
constant arithmetic on the keys pointer; restoring the fold preserves that
behaviour while closing the divergence.
Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
---
v2 -> v3:
- Pass existing reg/argno context into check_flow_keys_access(), avoiding
a stale regno reference in check_mem_access().
- Add a variable-offset selftest using bpf_get_prandom_u32().
v1 -> v2:
- Target bpf-next instead of bpf (per reviewer feedback).
- Base-commit updated to bpf-next/master.
Nuoqi Gui [Sat, 6 Jun 2026 10:50:38 +0000 (18:50 +0800)]
selftests/bpf: add tests for PTR_TO_FLOW_KEYS offset bounds
Add verifier tests covering pointer arithmetic on a PTR_TO_FLOW_KEYS
register. This covers the bpf-next regression where an out-of-bounds
constant offset introduced as flow_keys += K and then dereferenced at
insn->off 0 was accepted, while the equivalent flow_keys + K direct offset
was rejected.
The tests check that in-bounds constant arithmetic on the keys pointer is
still accepted, out-of-bounds constant arithmetic is rejected for both read
and write, and a truly varying offset from bpf_get_prandom_u32() remains
rejected by the existing PTR_TO_FLOW_KEYS pointer arithmetic rules.
Nuoqi Gui [Sat, 6 Jun 2026 10:50:37 +0000 (18:50 +0800)]
bpf: Fold reg->var_off into PTR_TO_FLOW_KEYS bounds check
Constant pointer arithmetic on a PTR_TO_FLOW_KEYS register lands the
constant in reg->var_off (e.g. flow_keys(imm=4096)), but the
PTR_TO_FLOW_KEYS path in check_mem_access() passes only insn->off to
check_flow_keys_access() and never folds reg->var_off.value. The
verifier therefore accepts an access that, at runtime, dereferences past
struct bpf_flow_keys -- a verifier/runtime divergence that yields an
out-of-bounds read and write of kernel stack memory.
Commit 022ac0750883 ("bpf: use reg->var_off instead of reg->off for
pointers") removed the generic "off += reg->off" that check_mem_access()
applied before the per-type dispatch and replaced it with per-path
folding of reg->var_off.value (for example the ctx path now folds the
register offset via check_ctx_access()). The PTR_TO_FLOW_KEYS path was
not given the equivalent fold, so a constant offset that used to be
folded and rejected is now silently accepted:
before 022ac0750883: the offset stays in reg->off and is folded
generically, so the access is checked with off=4096 and rejected.
after 022ac0750883: the offset lands in reg->var_off, the flow_keys
path checks off=0 and accepts; at runtime the access dereferences
base + 0x1000.
For a BPF_PROG_TYPE_FLOW_DISSECTOR program the following is accepted:
has the same effective offset but is correctly rejected with
"invalid access to flow keys off=4096 size=8", which isolates the defect
to the missing var_off fold. Once attached as a flow dissector, the
accepted program reads kernel stack past struct bpf_flow_keys (a
kernel-stack / KASLR information leak) and can likewise write past it,
corrupting kernel memory.
Fix it by folding reg->var_off.value into the offset before the bounds
check and rejecting non-constant offsets, mirroring the other pointer
types (e.g. check_ctx_access()).
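The effect of the fold can be shown with a simplified bounds check. This is a sketch under assumptions rather than the verifier code: EX_FLOW_KEYS_SIZE is a placeholder for sizeof(struct bpf_flow_keys), check_keys_access is a hypothetical name, and the -EACCES return value is chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Placeholder size; the real bound is sizeof(struct bpf_flow_keys). */
#define EX_FLOW_KEYS_SIZE 64

/*
 * insn_off:      offset encoded in the load/store instruction
 * reg_const_off: constant offset accumulated in the pointer register
 * size:          access width in bytes
 */
static int check_keys_access(int32_t insn_off, int64_t reg_const_off, int size)
{
	int64_t off;

	assert(size > 0);
	/* Reject extreme register offsets first so the sum cannot overflow. */
	if (reg_const_off < -INT32_MAX || reg_const_off > INT32_MAX)
		return -EACCES;
	off = (int64_t)insn_off + reg_const_off;	/* the fold */
	if (off < 0 || off + size > EX_FLOW_KEYS_SIZE)
		return -EACCES;
	return 0;
}
```

With the fold in place, an offset of 4096 is rejected whether it arrives in the instruction or in the register; without it, only the instruction offset would be compared against the bound.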
After commit 0652a3daa787 ("tracing: Fix CFI violation in probestub
being called by tprobes"), there are many build errors when building
ARCH=arm multi_v7_defconfig + CONFIG_CFI=y like:
In file included from drivers/base/devres.c:17:
In file included from drivers/base/trace.h:16:
In file included from include/linux/tracepoint.h:23:
include/linux/cfi.h:44:6: error: call to undeclared function 'get_kernel_nofault'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
44 | if (get_kernel_nofault(hash, func - cfi_get_offset()))
| ^
1 error generated.
get_kernel_nofault() is called in the generic version of
cfi_get_func_hash() but nothing ensures uaccess.h is always included for
a proper expansion and prototype. Include uaccess.h in cfi.h to clear
up the errors.
Cc: stable@vger.kernel.org
Fixes: 0652a3daa787 ("tracing: Fix CFI violation in probestub being called by tprobes")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Input: atkbd - skip deactivate for HONOR BCC-N's internal keyboard
After commit 9cf6e24c9fbf17e52de9fff07f12be7565ea6d61 ("Input: atkbd -
do not skip atkbd_deactivate() when skipping ATKBD_CMD_GETID"), HONOR
BCC-N, aka HONOR MagicBook 14 2026's internal keyboard stops
working. Adding the atkbd_deactivate_fixup quirk fixes it.
Linus Torvalds [Sat, 6 Jun 2026 16:49:16 +0000 (09:49 -0700)]
Merge tag 'sound-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"It's getting calmer, but we still came up with a handful of small
fixes, including two core fixes. All look sane and safe.
Core:
- Fix wait queue list corruption in snd_pcm_drain() on linked streams
- Fix UMP event stack overread in seq dummy driver
USB-audio:
- Add quirk for AB13X USB Audio
- Fix the regression with sticky mixer volumes in 7.1-rc
ASoC:
- Fix 32-slot TDM breakage on Freescale SAI
- Various DMI quirks for AMD ACP"
* tag 'sound-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: seq: dummy: fix UMP event stack overread
ALSA: usb-audio: Add iface reset and delay quirk for AB13X USB Audio
ALSA: PCM: Fix wait queue list corruption in snd_pcm_drain() on linked streams
ASoC: amd: acp70: add standalone RT721 SoundWire machine
ASoC: amd: yc: Add MSI Raider A18 HX A9WJG to quirk table
ASoC: fsl_sai: Fix 32 slots TDM broken by integer shift UB in xMR write
ASoC: amd: yc: Enable internal mic on MSI Bravo 17 C7VF
ASoC: amd: acp: Add DMI quirk for Lenovo Yoga Pro 7 15ASH11
ALSA: usb-audio: Set the value of potential sticky mixers to maximum
Linus Torvalds [Sat, 6 Jun 2026 16:44:42 +0000 (09:44 -0700)]
Merge tag 'rust-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Fix 'rustc-option' (the Makefile one) when cross-compiling that
leads to build or boot failures in certain configs
- Work around a Rust compiler bug (already fixed for Rust 1.98.0)
that leads to boot failures in certain configs due to missing
'uwtable' LLVM module flags
- Support a Rust compiler change (starting with Rust 1.98.0) in the
unstable target specification JSON files
- Forbid Rust + arm + KASAN configs, which do not build
'kernel' crate:
- Fix NOMMU build by adding a missing helper"
* tag 'rust-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: x86: support Rust >= 1.98.0 target spec
rust: arm64: set uwtable llvm module flag for CONFIG_UNWIND_TABLES
rust: helpers: add is_vmalloc_addr wrapper for NOMMU builds
rust: kasan/kbuild: fix rustc-option when cross-compiling
ARM: Do not select HAVE_RUST when KASAN is enabled
Cássio Gabriel [Fri, 5 Jun 2026 15:48:27 +0000 (12:48 -0300)]
ALSA: pcm: Fix unlocked runtime state reads in xfer ioctls
The recent runtime state locking cleanup converted several PCM ioctl state
checks to snd_pcm_get_state(), including snd_pcm_pre_prepare(),
snd_pcm_drain() and snd_pcm_kernel_ioctl(). The native and compat xfer
ioctl paths still sample runtime->state directly before dispatching to the
PCM transfer helpers, and snd_pcm_common_ioctl() still samples the
DISCONNECTED state directly in its common precheck.
Use snd_pcm_get_state() for those ioctl-side prechecks as well. This keeps
the externally visible ioctl entry checks consistent with the stream-locked
state access used by the recent PCM state-read cleanup.
HyeongJun An [Sat, 6 Jun 2026 04:09:13 +0000 (13:09 +0900)]
ALSA: seq: Fix partial userptr event expansion
snd_seq_expand_var_event_at() clamps the number of bytes to copy to the
remaining variable-event length, but passes the original buffer size to
expand_var_event().
For SNDRV_SEQ_EXT_USRPTR events, expand_var_event() copies exactly the
size argument from userspace. On the final chunk, when the remaining
event data is shorter than the caller's buffer, this can read past the
declared event data and can spuriously fail with -EFAULT if the extra
bytes cross an unmapped page.
Pass the clamped length instead. The chained and kernel-backed paths
already reclamp in dump_var_event(), but the user-pointer path handles
the size directly.
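The clamping itself is simple arithmetic. A minimal sketch, with chunk_len as a hypothetical helper name and the parameters chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes to copy for one chunk: never more than what is left of the event. */
static size_t chunk_len(size_t event_len, size_t offset, size_t buf_size)
{
	size_t remaining;

	assert(offset <= event_len);
	remaining = event_len - offset;
	return buf_size < remaining ? buf_size : remaining;
}
```

For a 100-byte event read through a 64-byte buffer, the second chunk is 36 bytes; passing 64 to the user-pointer copy instead would read 28 bytes beyond the declared event data.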
Rosen Penev [Sun, 17 May 2026 04:27:16 +0000 (21:27 -0700)]
wifi: ath9k: Clear DMA descriptors without memset
Clear ath9k DMA descriptors with explicit status word stores instead of
memset(). The descriptor rings are coherent DMA memory, which may be
mapped uncached on 32-bit powerpc. The optimized memset() path can use
dcbz there and trigger an alignment warning.
Use WRITE_ONCE() for the descriptor status words so the compiler keeps
the clears as ordinary stores instead of folding them back into bulk
memset(). This covers AR9003 TX status descriptors as well as the RX
status area cleared when setting up RX descriptors.
Rosen Penev [Wed, 6 May 2026 23:48:48 +0000 (16:48 -0700)]
wifi: ath9k_htc: use module_usb_driver
This follows the pattern used by other USB Wi-Fi drivers. Nothing
special is done in the _init and _exit functions here, so this
simplifies the code and saves some lines.
wifi: wcn36xx: fix OOB read from short trigger BA firmware response
The firmware response length is only checked against sizeof(*rsp) (20
bytes), but when candidate_cnt >= 1, a 22-byte candidate struct is read
at buf + 20 without verifying the response contains it. This causes an
out-of-bounds read of stale heap data, corrupting the BA session state.
Add validation that the response includes the candidate data.
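A sketch of the kind of length validation described, using the sizes quoted in the message (20-byte response, 22-byte candidate). The macro and function names are hypothetical and the real driver structures are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_RSP_HDR_LEN		20	/* sizeof(*rsp) per the description */
#define EX_CANDIDATE_LEN	22	/* size of one candidate entry */

static_assert(EX_RSP_HDR_LEN + EX_CANDIDATE_LEN == 42,
	      "smallest response that carries one candidate");

/* True if a response of len bytes is long enough for what will be read. */
static bool trigger_ba_rsp_len_ok(size_t len, uint16_t candidate_cnt)
{
	if (len < EX_RSP_HDR_LEN)
		return false;
	if (candidate_cnt >= 1 && len < EX_RSP_HDR_LEN + EX_CANDIDATE_LEN)
		return false;
	return true;
}
```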
wifi: wcn36xx: fix OOB read from firmware count in PRINT_REG_INFO indication
The firmware-controlled rsp->count field is used as the loop bound for
indexing into the flexible rsp->regs[] array without validation against
the message length. A count exceeding the actual data causes out-of-
bounds reads from the heap-allocated message buffer.
Add a check that count fits within the received message.
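One way to express "count fits within the received message" without risking integer overflow is to divide the available payload by the entry size. The layout below (ex_reg, ex_print_reg_ind) is a hypothetical stand-in for the firmware message, not the driver's actual structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: fixed header followed by count register entries. */
struct ex_reg {
	uint32_t addr;
	uint32_t value;
};

struct ex_print_reg_ind {
	uint32_t count;
	struct ex_reg regs[];
};

static_assert(sizeof(struct ex_reg) == 8, "no padding expected in this sketch");

/* True if a message of len bytes can hold count entries after the header. */
static bool count_fits(size_t len, uint32_t count)
{
	if (len < sizeof(struct ex_print_reg_ind))
		return false;
	return count <= (len - sizeof(struct ex_print_reg_ind)) /
			sizeof(struct ex_reg);
}
```

Comparing count against a quotient, rather than multiplying count by the entry size, avoids overflow for large firmware-supplied values.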
Fixes: 43efa3c0f241 ("wcn36xx: Implement print_reg indication")
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20260421135018.352774-3-tristmd@gmail.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
wifi: wcn36xx: fix heap overflow from oversized firmware HAL response
The firmware response dispatcher copies all synchronous HAL responses
into the 4096-byte hal_buf without validating the response length. A
response exceeding WCN36XX_HAL_BUF_SIZE causes a heap buffer overflow
with firmware-controlled content.
Add a bounds check on the response length.
Fixes: 8e84c2582169 ("wcn36xx: mac80211 driver for Qualcomm WCN3660/WCN3680 hardware")
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20260421135018.352774-2-tristmd@gmail.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Linus Torvalds [Sat, 6 Jun 2026 14:28:59 +0000 (07:28 -0700)]
Merge tag 'vfs-7.1-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix error handling in ovl_cache_get()
- Tighten access checks for exited tasks in pidfd_getfd()
- Fix selftests leak in __wait_for_test()
- Limit FUSE_NOTIFY_RETRIEVE to uptodate folios
- Reject fuse_notify() pagecache ops on directories
- Clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
- Fix failure to unlock in nfsd4_create_file()
- Fix pointer arithmetic in qnx6 directory iteration
- Fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
- Avoid potential null folio->mapping deref during iomap error
reporting
* tag 'vfs-7.1-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: avoid potential null folio->mapping deref during error reporting
fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
fs/qnx6: fix pointer arithmetic in directory iteration
VFS: fix possible failure to unlock in nfsd4_create_file()
signal: clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
fuse: reject fuse_notify() pagecache ops on directories
fuse: limit FUSE_NOTIFY_RETRIEVE to uptodate folios
selftests: harness: fix pidfd leak in __wait_for_test
pidfd: refuse access to tasks that have started exiting harder
ovl: keep err zero after successful ovl_cache_get()
When `I2cAdapter::get` executes, it first calls
`bindings::i2c_get_adapter()` which increments the device and module
reference counts. It then takes a reference to the raw pointer and
converts it to an `ARef` via `.into()`.
The implementation of `From<&T> for ARef<T>` where `T: AlwaysRefCounted`
unconditionally calls `T::inc_ref()`. This leads to a second increment
to the reference counts.
Since the returned `ARef` will only release a single reference when
dropped via `dec_ref()`, this leaks one device and module reference count
on every call.
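The double increment can be illustrated with a toy reference-count model. This is a Python sketch under stated assumptions, not the Rust code: RefCounted, OwnedRef, get_adapter and leaked_refs are hypothetical names, and adopting the existing reference (from_owned) is shown only as one way the imbalance could be avoided, not as the fix that was applied.

```python
class RefCounted:
    """Toy object with an explicit reference count."""

    def __init__(self):
        self.refs = 0

    def inc_ref(self):
        self.refs += 1

    def dec_ref(self):
        self.refs -= 1


class OwnedRef:
    """Stand-in for an owning handle that releases one reference when dropped."""

    def __init__(self, obj):
        self.obj = obj

    @classmethod
    def from_borrowed(cls, obj):
        # Models From<&T> for ARef<T>: always takes an extra reference.
        obj.inc_ref()
        return cls(obj)

    @classmethod
    def from_owned(cls, obj):
        # Adopts a reference the caller already holds; no extra increment.
        return cls(obj)

    def drop(self):
        self.obj.dec_ref()


def get_adapter(obj):
    # Models bindings::i2c_get_adapter(): returns with one reference taken.
    obj.inc_ref()
    return obj


def leaked_refs(constructor):
    """Return the reference count left over after the handle is dropped."""
    adapter = RefCounted()
    handle = constructor(get_adapter(adapter))
    handle.drop()
    return adapter.refs
```

With from_borrowed the count ends at one after the drop, which is the leak described above; with from_owned it returns to zero.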
Jori Koolstra [Thu, 4 Jun 2026 22:24:05 +0000 (00:24 +0200)]
vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
A recent build failure[1] exposed the difficulty of working with the
current octal and hex definitions of O_ flags when trying to find a gap
for a new flag. This difficulty is compounded by the fact that O_ flags
may have architecture-specific values.
Replace the hex/octal #defines, which are hard to parse when looking for
free bits, with explicit bit shifts like (1 << 11). Also, add comments
that identify which architectures redefine some of the seemingly free
("cursed") bits in uapi/asm-generic/fcntl.h. These should not be used to
define new O_ flags (for now, at least).
The translation was done with Claude Opus 4.8, and verified with a
(non-AI) gawk script. The accounting of which architectures claim
which bit-gaps in uapi/asm-generic/fcntl.h is also done by hand.
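The kind of mechanical check mentioned above can be sketched in a few lines. The following Python example is an illustration, not the gawk script from the commit: as_shift and free_bits are hypothetical helpers, and OCTAL_FLAGS is only a small subset of the generic octal definitions, which individual architectures may override.

```python
# A subset of the generic octal O_ flag values, for illustration only.
OCTAL_FLAGS = {
    "O_CREAT": 0o100,
    "O_EXCL": 0o200,
    "O_NOCTTY": 0o400,
    "O_TRUNC": 0o1000,
    "O_APPEND": 0o2000,
    "O_NONBLOCK": 0o4000,
}


def as_shift(value):
    """Return n such that value == 1 << n, or raise if value is not one bit."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{value:#o} is not a single-bit flag")
    return value.bit_length() - 1


def free_bits(flags, width):
    """List bit positions below `width` not claimed by any flag in `flags`."""
    used = {as_shift(v) for v in flags.values()}
    return [n for n in range(width) if n not in used]


# Print each flag in the (1 << n) form.
for name, value in OCTAL_FLAGS.items():
    print(f"#define {name:<12} (1 << {as_shift(value)})")
```

Because the table is a subset, free_bits() on it says nothing about which bits are really available; it only shows why the shift form makes gaps easy to spot.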
Add a sleepable BPF kfunc that resolves the real inode backing a dentry
via d_real_inode(). On overlay/union filesystems the inode attached to
the dentry is the overlay inode which does not carry the underlying
device information. d_real_inode() resolves through the overlay and
returns the inode from the lower, real filesystem.
This is used in the RestrictFilesytemAccess bpf program that has been
merged into systemd a little while ago.
Daniel Borkmann [Tue, 2 Jun 2026 07:40:12 +0000 (09:40 +0200)]
bpf: Add simple xattr support to bpffs
Add support for extended attributes on bpffs inodes so that user space
and BPF LSM programs can attach metadata (for example, a content hash
or a security label) to a pinned object or directory. BPF LSM or user
space tooling can then uniformly look at this (e.g. security.bpf.*) in
a similar way to other filesystems. The store is in-memory and non-persistent:
it lives only for the lifetime of the mount, like everything else in
bpffs. The modelling is similar to tmpfs.
bpffs serves the trusted.* and security.* namespaces; user.* is left
unsupported. As bpffs is FS_USERNS_MOUNT, security.* is reachable by
the unprivileged mounter in a user namespace, and thus we are using
the simple_xattr_set_limited infra there (trusted.* needs global
CAP_SYS_ADMIN).
bpf_fill_super() is open-coded instead of using simple_fill_super(),
because the root inode must now be allocated through bpf_fs_alloc_inode(),
i.e. carry the bpf_fs_inode wrapper and come from the right cache, which
requires s_op (and s_xattr) to be installed before the first inode is
created. While at it, also harden s_iflags with SB_I_NOEXEC and
SB_I_NODEV.
bpf_fs_listxattr() is only reachable through the filesystem via
i_op->listxattr, so the BPF token inode is left untouched. Name-based
fsetxattr()/fgetxattr() on a token fd still work since the get/set
handlers are installed at the superblock.
For the security.* namespace, we use simple_xattr_set_limited(), but
there was no simple_xattr_add_limited() API yet, which was needed
in bpf_fs_initxattrs() to avoid underflows in the accounting. The
symlink target is freed in bpf_free_inode() rather than in
bpf_destroy_inode() so that it is released only after an RCU grace
period, as an RCU path walk following the symlink may still
dereference inode->i_link in security_inode_follow_link(). Lastly,
the symlink target allocated in bpf_symlink() is switched to
GFP_KERNEL_ACCOUNT, so the string is charged to the caller's memcg.
kernfs: link kn to its parent before the LSM init hook
After commit 12e9e3cd03b5 ("simpe_xattr: use per-sb cache"),
kernfs_xattr_set() and kernfs_xattr_get() compute the cache via
kernfs_root(kn) before any other check. kernfs_root(kn) walks
kn->__parent first and falls back to kn->dir.root, both of which are
NULL on a freshly kmem_cache_zalloc()'d kn. kn->__parent was being set
in kernfs_new_node() after __kernfs_new_node() returned, and kn->dir.root
is set even later by kernfs_create_dir_ns() / kernfs_create_empty_dir().
The LSM kernfs_init_security hook is invoked from inside
__kernfs_new_node(), before either field has been initialized.
selinux_kernfs_init_security() ends with kernfs_xattr_set(kn,
XATTR_NAME_SELINUX, ...). kernfs_root(kn) then returns NULL, and
&((struct kernfs_root *)NULL)->xa_cache evaluates to
offsetof(struct kernfs_root, xa_cache) which faults:
Reproduces deterministically at PID 1 (systemd) on an SELinux-enabled
distro. The first cgroup mkdir under /sys/fs/cgroup with a labelled
parent panics the kernel.
The LSM hook's contract is that the kn_dir argument is the parent of
the new kn, so kn->__parent should already point at kn_dir when the
hook runs. Move kernfs_get(parent) and rcu_assign_pointer of
kn->__parent from kernfs_new_node() into __kernfs_new_node() right
before the security hook, and unwind the parent reference on the
err_out4 path. kernfs_root(kn) then takes its parent branch during
the hook and returns parent->dir.root, which is the correct root.
This also closes the same-shape latent bug in kernfs_xattr_get() (which
today is hidden only by kernfs_iattrs_noalloc() returning NULL on a
fresh kn).
Fixes: 12e9e3cd03b5 ("simpe_xattr: use per-sb cache")
Reported-by: Calum Mackay <calum.mackay@oracle.com>
Closes: https://lore.kernel.org/all/5386153f-9112-4971-98fc-de90d7aae2c6@oracle.com/
Link: https://patch.msgid.link/20260526-ablief-demut-wehen-aef8446ef5c9@brauner
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Miklos Szeredi [Fri, 5 Jun 2026 13:53:19 +0000 (15:53 +0200)]
simpe_xattr: use per-sb cache
Move the hash table to the super block to remove excessive overhead in case
of small number of xattrs per inode.
Add linked list to the inode, used for listxattr and eviction. Listxattr
uses rcu protection to iterate the list of xattrs.
Before being made per-sb, lazy allocation was protected by inode lock. Now
inode lock no longer provides sufficient exclusion, so use cmpxchg() to
ensure atomicity.
Though I haven't found a description of this pattern, after some research
it seems that cmpxchg_release() and READ_ONCE() should provide the
necessary memory barriers.
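The lazy-allocation pattern described above (read the pointer, allocate if it is unset, publish with a compare-and-exchange, and fall back to the winner's object on a lost race) can be sketched as follows. This Python model only emulates cmpxchg with a lock and says nothing about memory ordering; Slot and lazy_alloc are hypothetical names, not kernel APIs.

```python
import threading


class Slot:
    """Holds one pointer-like field; cmpxchg is emulated with a lock."""

    def __init__(self):
        self.value = None
        self._lock = threading.Lock()

    def cmpxchg(self, expected, new):
        # Emulation only: a real cmpxchg is a single atomic operation.
        with self._lock:
            old = self.value
            if old is expected:
                self.value = new
            return old


def lazy_alloc(slot, alloc):
    """Return the shared object, publishing at most one allocation per slot."""
    current = slot.value            # stands in for READ_ONCE()
    if current is not None:
        return current
    candidate = alloc()
    old = slot.cmpxchg(None, candidate)
    if old is None:
        return candidate            # we won the race and published ours
    return old                      # someone else won; our candidate is dropped
```

Every caller ends up with the same object regardless of which thread published it, which is the exclusion the inode lock used to provide.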
Use simple_xattr_free_rcu() in simple_xattrs_free(). This is needed because
the hash table is now shared between inodes and lookup on a different inode
might be running the compare function on the just freed element within the
RCU grace period.
Following stats are based on slabinfo diff, after creating 100k empty
files, then adding a "user.test=foo" xattr to each:
The overhead of a single xattr is reduced to nearly v7.0 levels. The per
xattr overhead is slightly larger due to the addition of three pointers to
struct simple_xattr.
Miklos Szeredi [Fri, 5 Jun 2026 13:53:18 +0000 (15:53 +0200)]
simple_xattr: change interface to pass struct simple_xattrs **
Change the simple_xattr API to accept pointer-to-pointer (struct
simple_xattrs **) instead of pointer. This allows the functions to handle
lazy allocation internally without requiring callers to use
simple_xattrs_lazy_alloc().
The simple_xattr_set(), simple_xattr_set_limited() and simple_xattr_add()
functions now handle allocation when xattrs is NULL. simple_xattrs_free()
now also frees the xattrs structure itself and sets the pointer to NULL.
This simplifies callers and removes the need for most callers to explicitly
manage xattrs allocation and lifetime.
In shmem_initxattrs(), the total required space for all initial xattrs
(ispace) is pre-calculated and deducted from sbinfo->free_ispace.
Since this patch modifies the function to add new xattrs directly to the
inode's &info->xattrs list rather than using a local temporary variable, a
failure means that the partially populated info->xattrs list remains
attached to the inode.
When the VFS caller handles the -ENOMEM error, it drops the newly created
inode via iput(), and shmem_free_inode() adds freed to sbinfo->free_ispace
a second time, permanently inflating the tmpfs free space quota.
Fix by subtracting already added xattrs from ispace.
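The accounting imbalance can be checked with simple arithmetic. The Python sketch below is a model under assumptions: init_xattrs is a hypothetical function, and it assumes the error path returns the remaining reservation to free_ispace while inode teardown refunds the xattrs already attached.

```python
def init_xattrs(free_ispace, sizes, fail_at, fixed):
    """Return free_ispace after a failed init and the inode being dropped.

    sizes: accounted size of each initial xattr.
    fail_at: index of the xattr whose allocation fails.
    fixed: whether the error path subtracts already-added xattrs from ispace.
    """
    ispace = sum(sizes)
    free_ispace -= ispace           # reserve space for all xattrs up front
    attached = sizes[:fail_at]      # xattrs already on the inode's list
    if fixed:
        ispace -= sum(attached)     # these get refunded when the inode is freed
    free_ispace += ispace           # error path returns the reservation
    free_ispace += sum(attached)    # inode teardown refunds attached xattrs
    return free_ispace
```

Without the subtraction the attached xattrs are refunded twice, so the result exceeds the starting value; with it the balance returns to where it began.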
Miklos Szeredi [Fri, 5 Jun 2026 13:53:16 +0000 (15:53 +0200)]
kernfs: fix xattr race condition with multiple superblocks
Multiple superblocks with different namespaces can share the same
kernfs_node when kernfs_test_super() finds a matching root but
different namespace. This means multiple inodes from different
superblocks can reference the same kernfs_node->iattr->xattrs
structure.
The VFS layer only holds per-inode locks during xattr operations,
which is insufficient to serialize concurrent xattr modifications on
the shared kernfs_node. This can lead to race conditions in
simple_xattr_set() where the lookup->replace/remove sequence is not
atomic with respect to operations from other superblocks.
Fix this by protecting xattr operations with the existing hashed
kernfs_locks->open_file_mutex[] array, which is already used to
protect per-node open file data. The hashed mutex array provides
scalable per-node serialization (scaled by CPU count, up to 1024 locks
on 32+ CPU systems) with zero memory overhead.
Changes:
- Rename open_file_mutex[] to node_mutex[] to reflect dual purpose
- Add kernfs_node_lock_ptr() and kernfs_node_lock() helpers
- Protect simple_xattr_set() calls in kernfs_xattr_set() and
kernfs_vfs_user_xattr_set() with the hashed mutex
- Update file.c to use new helpers via compatibility wrappers
- Update documentation to explain the extended lock usage
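A hashed lock array of this kind can be sketched in userspace. The Python model below is illustrative only: NR_NODE_LOCKS is a fixed hypothetical size (the text above says the kernel scales it with CPU count), hashing id(node) stands in for a pointer hash, and node_lock_ptr and xattr_set are stand-ins rather than the kernel helpers.

```python
import threading

NR_NODE_LOCKS = 64  # hypothetical fixed size for the sketch

node_mutex = [threading.Lock() for _ in range(NR_NODE_LOCKS)]


def node_lock_ptr(node):
    """Pick the lock for a node by hashing its identity."""
    return node_mutex[hash(id(node)) % NR_NODE_LOCKS]


def xattr_set(node, xattrs, name, value):
    """Serialize the lookup-then-replace/remove sequence per node."""
    with node_lock_ptr(node):
        if value is None:
            xattrs.pop(name, None)   # remove
        else:
            xattrs[name] = value     # add or replace
```

The same node always maps to the same lock, so callers arriving through different superblocks serialize against each other, while no per-node memory is spent on a dedicated mutex.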
Waiman Long [Fri, 5 Jun 2026 17:30:38 +0000 (13:30 -0400)]
debugobjects: Don't call fill_pool() in early boot hardirq context
When booting a debug PREEMPT_RT kernel on an ARM64 system, an "inconsistent
{HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage" lockdep warning message was
reported to the console.
During early boot, interrupts are enabled before the scheduler is
enabled. In this window (before SYSTEM_SCHEDULING is set) interrupts can
fire and, in the hard interrupt context handler, attempt to fill the pool.
This can lead to a deadlock when the interrupt hits a region which holds
a lock that is required to be taken in the allocation path.
Add a new can_fill_pool() helper, reorder the exception rule, and forbid
this scenario by excluding allocations from hard interrupt context.
Fixes: 06e0ae988f6e ("debugobjects: Allow to refill the pool before SYSTEM_SCHEDULING")
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260605173038.495075-1-longman@redhat.com
Miguel Ojeda [Sat, 6 Jun 2026 12:01:29 +0000 (14:01 +0200)]
Merge tag 'pin-init-v7.2' of https://github.com/Rust-for-Linux/linux into rust-next
Pull pin-init updates from Gary Guo:
"User visible changes:
- Do not generate 'non_snake_case' warnings for identifiers that are
syntactically just users of a field name. This would allow all
'#[allow(non_snake_case)]' in nova-core to be removed, which I will
send to the nova tree next cycle.
- Filter non-cfg attributes out properly in derived structs. This
improves pin-init compatibility with other derive macros.
- Insert projection types' where clause properly.
Other changes:
- Bump MSRV to 1.82, plus associated cleanups.
- Overhaul how init slots are projected. The new approach is easier
to justify with safety comments.
- Mark more functions as inline, which should help mitigate the
super-long symbol name issue due to lack of inlining.
- Various small code quality cleanups."
* tag 'pin-init-v7.2' of https://github.com/Rust-for-Linux/linux: (27 commits)
rust: pin_init: internal: use `loop {}` to produce never value
rust: pin-init: remove `E` from `InitClosure`
rust: pin-init: move `InitClosure` out from `__internal`
rust: pin-init: docs: fix typos in MaybeZeroable documentation
rust: pin-init: internal: suppress `non_snake_case` lint in `[pin_]init!`
rust: pin-init: internal: suppress `non_snake_case` lint in `#[pin_data]`
rust: pin-init: internal: pin_data: filter non-`#[cfg]` attr in generated code
rust: pin-init: internal: project using full slot
rust: pin-init: internal: project slots instead of references
rust: pin-init: internal: make `make_closure` inherent methods
rust: pin-init: internal: use marker on drop guard type for pinned fields
rust: pin-init: internal: init: handle code blocks early
rust: pin-init: internal: add `PhantomInvariant` and `PhantomInvariantLifetime`
rust: pin-init: internal: pin_data: add struct to record field info
rust: pin-init: internal: pin_data: use closure for `handle_field`
rust: pin-init: examples: fix `useless_borrows_in_formatting` clippy warning
rust: pin-init: internal: remove `collect_tuple` polyfill after MSRV bump
rust: pin-init: internal: turn `PhantomPinned` error into warnings
rust: pin-init: cleanup workaround for old Rust compiler
rust: pin-init: fix badge URL in README
...
Rong Xu [Thu, 4 Jun 2026 19:56:08 +0000 (12:56 -0700)]
kconfig: Remove the architecture specific config for Propeller
The CONFIG_PROPELLER_CLANG option currently depends on
ARCH_SUPPORTS_PROPELLER_CLANG, but this dependency seems unnecessary.
Remove ARCH_SUPPORTS_PROPELLER_CLANG and allow users to control
Propeller builds solely through CONFIG_PROPELLER_CLANG. This simplifies
the kconfig and avoids potential confusion.
Move the .llvm_bb_addr_map sections grouping to
include/asm-generic/vmlinux.lds.h.
The Propeller documentation has been updated to reflect the most
recent tool location and now includes instructions for arm64.
Contributor Acknowledgments:
* SPE instructions: Daniel Hoekwater <hoekwater@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Will Deacon <will@kernel.org>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Yabin Cui <yabinc@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260604195612.3757860-3-xur@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Rong Xu [Thu, 4 Jun 2026 19:56:07 +0000 (12:56 -0700)]
kconfig: Remove the architecture specific config for AutoFDO
The CONFIG_AUTOFDO_CLANG option currently depends on
ARCH_SUPPORTS_AUTOFDO_CLANG, but this dependency seems unnecessary.
Remove ARCH_SUPPORTS_AUTOFDO_CLANG and allow users to control AutoFDO
builds solely through CONFIG_AUTOFDO_CLANG. This simplifies the kconfig
and avoids potential confusion.
Expand the AutoFDO documentation to include instructions for arm64.
Contributor acknowledgments:
* SPE instructions: Daniel Hoekwater <hoekwater@google.com>
* ETM instructions: Yabin Cui <yabinc@google.com>
Signed-off-by: Rong Xu <xur@google.com>
Suggested-by: Will Deacon <will@kernel.org>
Tested-by: Yabin Cui <yabinc@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20260604195612.3757860-2-xur@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
James Lee [Thu, 4 Jun 2026 06:03:00 +0000 (14:03 +0800)]
modpost: Add __llvm_covfun and __llvm_covmap to section_white_list
Modpost emits hundreds of warnings when using Clang to build for ARCH=um
and CONFIG_GCOV=y. e.g.:
vmlinux (__llvm_covfun): unexpected non-allocatable section.
Did you forget to use "ax"/"aw" in a .S file?
Note that for example <linux/init.h> contains
section definitions for use in .S files.
For example, when we use LLVM for a kunit user mode build with coverage:
python3 tools/testing/kunit/kunit.py build --make_options LLVM=1 \
--kunitconfig=tools/testing/kunit/configs/default.config \
--kunitconfig=tools/testing/kunit/configs/coverage_uml.config
The behaviour occurs when building the kernel for ARCH=um with code
coverage enabled. The warnings come from modpost's check_sec_ref
function, which ensures no sections reference others that will be
discarded. covfun and covmap sections must reference __init and __exit
sections to collect coverage data, triggering the modpost warning.
To suppress these warnings, these section names have been added to
modpost's whitelist. This is unlikely to suppress legitimate warnings as
Clang will only insert these sections when building with coverage, and
can be assumed to manage these references safely.
Daniel Borkmann [Fri, 5 Jun 2026 21:35:18 +0000 (23:35 +0200)]
selftests/bpf: Inspect the signature verdict exposed to BPF LSM
Add a minimal BPF LSM program on lsm/bpf_prog_load that, for loads on
the monitored thread, reads back prog->aux->sig.{verdict,keyring_type,
keyring_serial}, and a signed_loader subtest that drives the same
gen_loader loader through the hook twice: i) /unsigned/ where the LSM
must observe UNSIGNED, no keyring and serial 0; ii) /signed/ where the
very same insns signed against the session keyring must be observed as
VERIFIED with a user keyring, and the recorded keyring_serial must be
equal to the resolved session keyring serial. Loading (not running) the
loader is sufficient since the verdict is attached at load time.
KP Singh [Fri, 5 Jun 2026 21:35:17 +0000 (23:35 +0200)]
bpf: Expose signature verdict via bpf_prog_aux
BPF_PROG_LOAD verifies the loader signature but does not record the
outcome on the BPF program. [BPF] LSMs and audit can read attr->signature
and attr->keyring_id to infer "was this signed, and if so, against which
keyring".
Add prog->aux->sig (verdict + keyring_{type,serial}), populated by
bpf_prog_load before the LSM hook. keyring_type classifies the keyring
the load referenced (builtin, secondary, platform or user), while
keyring_serial records the serial of the keyring the signature was
actually validated against. System keyrings carry a pseudo key pointer
with no user-visible serial and are reported as 0, as are unsigned loads.
Failed verifications reject the load before the hook runs, so it observes
only either UNSIGNED or VERIFIED.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260605213518.544262-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
selftests/bpf: libarena: Add initial data structures
Add two new data structures to libarena. These data structures initially
resided in the sched-ext repo (https://github.com/sched-ext/scx) and
have been adapted to the internal libarena build system. The data
structures are:
- Red black tree: Fundamental tree data structure that can also serve
as a base for more domain-specific data structures.
- Chase-Lev deque: Queue data structure that allows efficient work
stealing, useful in scheduling scenarios.
The data structures are accompanied by selftests that are automatically
discovered by the existing libarena test_progs selftest and incorporated
in the CI.
CHANGELOG
=========
v3 -> v4 (https://lore.kernel.org/bpf/20260604235016.20856-1-emil@etsalapatis.com/)
- Turn off load_acquire/store_release-dependent selftests for s390 (CI)
- Various style/non-functional nits (AI)
- Add workaround to handle LLVM 21 and GCC 15 assignment-to-memset promotions
that are causing verification failures for arena programs (CI)
- Incorporate Sashiko feedback for cleanup edge cases (Sashiko)
- Simplify some of the ordering semantics in spmc
- Rename tests from st_ to test_ (Alexei)
- Removed the freelist caches from the rbtrees, previously used to defer freeing (Alexei)
- Moved the type and function definitions to use the __arena identifier
- Removed the typecasts during function return and directly return __arena
pointers (Alexei)
- Renamed queues to spmc queues to abstract away the algorithm (Alexei)
- Adjusted the memory barriers in the spmc queue
- Added multithreaded testing harness for libarena programs (Alexei)
- Added parallel selftest for queues (Alexei)
- Split least upper bound and exact find operations back into separate
functions to prevent RB_DUPLICATE-related bug (AI)
====================
Emil Tsalapatis [Fri, 5 Jun 2026 22:20:20 +0000 (18:20 -0400)]
selftests/bpf: libarena: parallel test harness and spmc parallel selftest
Add a parallel test for the SPMC Chase-Lev workstealing queue. The queue
is built to be wait-free even when there are multiple consumers, and
the parallel selftest provides a signal on whether the queue behaves
correctly when stress tested.
To support the test, this patch includes a test harness for parallel
selftests. The spmc selftest acts as an example of the naming and other
conventions expected by the harness.
Emil Tsalapatis [Fri, 5 Jun 2026 22:20:19 +0000 (18:20 -0400)]
selftests/bpf: libarena: Add spmc queue data structure
Expand libarena with a single-producer, multiple-consumer (SPMC) lockless
deque that permits efficient work stealing. The structure is a Chase-Lev
queue, so it is lock-free and wait-free.
The data structure exposes three main calls. Two of them are available to
the thread owning the queue, and one is available to all threads in the program:
spmc_owner_push(): Push an item to the top of the queue.
spmc_owner_pop(): Pop an item from the top of the queue.
spmc_steal(): Steal an item from the bottom of the queue; callable from
any thread.
Note that the queue is not really FIFO for all consumers, since
non-owners of the queue can only work steal from the bottom.
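The ordering seen by the owner and by stealing threads can be illustrated with a small model. This Python sketch is single-threaded and has none of the lock-free machinery of the real queue; SpmcModel and its method names are hypothetical and only mirror the three calls listed above.

```python
from collections import deque


class SpmcModel:
    """Models only the ordering of the three calls, not their concurrency."""

    def __init__(self):
        self._items = deque()   # left end = bottom, right end = top

    def owner_push(self, item):
        # Owner adds work at the top.
        self._items.append(item)

    def owner_pop(self):
        # Owner takes the most recently pushed item, or None if empty.
        return self._items.pop() if self._items else None

    def steal(self):
        # Other threads take the oldest item from the bottom, or None if empty.
        return self._items.popleft() if self._items else None
```

After pushing 1, 2 and 3, the owner pops 3 while a stealing thread receives 1, which is why the order observed depends on who is consuming.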
Emil Tsalapatis [Fri, 5 Jun 2026 22:20:18 +0000 (18:20 -0400)]
selftests/bpf: libarena: Add rbtree data structure
Add a native red-black tree data structure to libarena.
The data structure supports multiple APIs (key-value based,
node based) with which users can query and modify it. The
tree uses the libarena memory allocator to manage its data.
Nazim Amirul [Thu, 4 Jun 2026 08:30:37 +0000 (01:30 -0700)]
net: stmmac: xgmac: report L3/L4 filter match count in ethtool stats
Read the L3FM and L4FM bits from the RX descriptor status word (RDES2)
and increment the corresponding ethtool statistics counters. This allows
users to observe L3/L4 filter hit rates via ethtool -S.
Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260604083037.24407-1-muhammad.nazim.amirul.nazle.asmade@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Haoxiang Li [Wed, 3 Jun 2026 06:17:16 +0000 (14:17 +0800)]
net: microchip: sparx5: clean up PSFP resources on flower setup failure
sparx5_tc_flower_psfp_setup() allocates PSFP stream gate, flow meter and
stream filter resources before adding VCAP actions. If a later step
fails, the resources allocated earlier in the function are not unwound.
Add error paths to release the stream filter, flow meter and stream gate
when setup fails after they have been acquired.
Also make sparx5_psfp_fm_add() return the acquired flow-meter id before
the existing-flow-meter early return. When an existing flow meter is
reused, sparx5_psfp_fm_get() increments its pool reference count, but the
caller previously kept psfp_fmid as 0. If a later setup step failed, the
error path could try to delete flow-meter id 0 instead of the reused flow
meter, leaving the incremented reference behind.
Chenguang Zhao [Wed, 3 Jun 2026 01:13:53 +0000 (09:13 +0800)]
netlabel: validate unlabeled address and mask attribute lengths
netlbl_unlabel_addrinfo_get() used the address attribute length to
determine whether the attribute data could be read as an IPv4 or IPv6
address, but did not independently validate the corresponding mask
attribute length. A crafted Generic Netlink request could therefore
provide a valid IPv4/IPv6 address attribute with a shorter mask
attribute, which would later be read as a full struct in_addr or
struct in6_addr.
NLA_BINARY policy lengths are maximum lengths by default, so use
NLA_POLICY_EXACT_LEN() for the unlabeled IPv4/IPv6 address and mask
attributes. This rejects short attributes during policy validation and
also exposes the exact length requirements through policy introspection.
Fixes: 8cc44579d1bd ("NetLabel: Introduce static network labels for unlabeled connections")
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
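The difference between a maximum-length and an exact-length policy can be shown with a short model. This Python sketch does not use the netlink API; check_max_len, check_exact_len and validate_ipv4_pair are hypothetical helpers, and only the 4-byte and 16-byte address sizes are taken as given.

```python
IPV4_LEN = 4    # sizeof(struct in_addr)
IPV6_LEN = 16   # sizeof(struct in6_addr)


def check_max_len(payload, limit):
    """Maximum-length policy: anything up to `limit` bytes passes."""
    return len(payload) <= limit


def check_exact_len(payload, required):
    """Exact-length policy: only payloads of exactly `required` bytes pass."""
    return len(payload) == required


def validate_ipv4_pair(addr, mask, check):
    """Apply the same length check to the address and to its mask."""
    return check(addr, IPV4_LEN) and check(mask, IPV4_LEN)
```

A full 4-byte address paired with a 2-byte mask passes the maximum-length check but fails the exact-length one, which is the case the commit closes.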
====================
net: airoha: Support multiple net_devices connected to the same GDM port
EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
manages the traffic in a TDM manner. As a result multiple net_devices can
connect to the same GDM{3,4} port and there is a theoretical "1:n"
relation between GDM ports and net_devices.
This series introduces support for multiple net_devices connected to the
same Frame Engine (FE) GDM port (GDM3 or GDM4) via an external hw
arbiter. Please note GDM1 or GDM2 does not support the connection with
the external arbiter.
====================
net: airoha: Support multiple LAN/WAN interfaces for hw MAC address configuration
The EN7581 and AN7583 SoCs provide registers to configure hardware LAN/WAN
MAC addresses. These registers are used during FE hw acceleration to
determine whether received traffic is destined to this host (L3 traffic)
or should be switched to another device (L2 traffic).
The SoC hardware design assumes all interfaces configured as LAN (or WAN)
share the MAC address MSBs, which are programmed into the
REG_FE_{LAN,WAN}_MAC_H register. The LSBs of 'local' mac addresses can be
expressed as a range via the REG_FE_MAC_LMIN and REG_FE_MAC_LMAX
registers. In order to properly accelerate the traffic, FE module requires
the user to configure the REG_FE_{LAN,WAN}_MAC_H register respecting this
limitation. Please note a misconfiguration in REG_FE_{LAN,WAN}_MAC_H
will still allow the user to log into the device for debugging.
Previously, only a single interface was considered when programming these
registers. Extend the logic to derive the correct minimum and maximum
values for REG_FE_MAC_LMIN/REG_FE_MAC_LMAX when two or more interfaces are
configured as LAN or WAN. Since this functionality was not available
before this series, no regression is introduced.
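Deriving the shared high part and the low-part range from a set of interface addresses can be sketched as follows. This Python example is a model under assumptions, not the driver logic: derive_mac_regs and mac_to_int are hypothetical, the 24-bit split between the high register and the LMIN/LMAX range is an assumption, and raising on mismatched MSBs is a choice made for the sketch.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


def derive_mac_regs(macs, lsb_bits=24):
    """Return (mac_h, lmin, lmax) for interfaces that share their MSBs.

    lsb_bits is an assumption: how many low bits the LMIN/LMAX range covers.
    """
    values = [mac_to_int(m) for m in macs]
    mask = (1 << lsb_bits) - 1
    highs = {v >> lsb_bits for v in values}
    if len(highs) != 1:
        # Sketch-only behaviour; the hardware limitation is that MSBs match.
        raise ValueError("interfaces do not share the MAC address MSBs")
    lows = [v & mask for v in values]
    return highs.pop(), min(lows), max(lows)
```

With one interface the minimum and maximum coincide, which matches the earlier single-interface behaviour; with several, the range widens to cover all of them.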
Introduce WAN flag to specify if a given device is used to transmit/receive
WAN or LAN traffic. Current codebase supports specifying LAN/WAN device
configuration in ndo_init() callback during device bootstrap.
In order to consider setups where LAN configuration is used even for
GDM3/GDM4 devices, check airoha_is_lan_gdm_dev() to select pse_port in
airoha_ppe_foe_entry_prepare().
Please note that after this patch it will be possible to specify multiple
LAN devices but just a single WAN one. This change is not visible to the
user, since the airoha_eth driver currently supports just the internal
phy available via the MT7530 DSA switch and there are no WAN interfaces
officially supported, because the PCS/external phy support is not merged
in mainline yet (it will be posted with following patches).
Theoretically, in the current codebase, two independent net_devices can
be connected to the same GDM port so we need to check the GDM port is not
used by any other running net_device before setting the forward
configuration to FE_PSE_PORT_DROP.
Moreover, always set the GDM_LONG_LEN_MASK field of the REG_GDM_LEN_CFG
register to the maximum MTU of all running net_devices connected to the
same GDM port.
net: airoha: Support multiple net_devices for a single FE GDM port
EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
manages the traffic in a TDM manner. As a result multiple net_devices can
connect to the same GDM{3,4} port and there is a theoretical "1:n"
relation between GDM ports and net_devices.
Introduce support for multiple net_devices connected to the same Frame
Engine (FE) GDM port (GDM3 or GDM4) via an external hw arbiter.
Please note GDM1 or GDM2 does not support the connection with the external
arbiter.
Add get_dev_from_sport callback since EN7581 and AN7583 have different
logics for the net_device type connected to GDM3 or GDM4.
net: airoha: Remove private net_device pointer in airoha_gdm_dev struct
Remove redundant net_device pointer inside airoha_gdm_dev struct and
rely on netdev_from_priv routine instead. Please note this patch does
not introduce any logical change, just code refactoring.
dt-bindings: net: airoha: Add GDM port ethernet child node
EN7581 and AN7583 SoCs support connecting multiple external SerDes to GDM3
or GDM4 ports via a hw arbiter that manages the traffic in a TDM manner.
As a result multiple net_devices can connect to the same GDM{3,4} port
and there is a theoretical "1:n" relation between GDM ports and
net_devices.
Introduce the ethernet node child of a specific GDM port in order to model
a given net_device that is connected via the external arbiter to the
GDM{3,4} port. This new ethernet node is defined by the "airoha,eth-port"
compatible string. Please note GDM1 and GDM2 do not support the
connection with the external arbiter and they are represented by an
ethernet node defined by the "airoha,eth-mac" compatible string.
====================
net: mdio: realtek-rtl9300: Refactor initialization and port lookup
The Realtek Otto switch platform consists of four different series
- RTL838x aka maple : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango : 56 port 1G/2.5G/10G Switches
A lot has been done to enhance the ethernet MDIO driver for simple
integration of more SoCs. Now it is time to solve inconveniences
that were discovered during daily operation. That includes
- Consistent "of" API usage
- Tightening error handling and improving overall robustness.
- Adding support for PHY packages.
- Fixing setup order issues. These currently hinder the driver from
properly enabling the hardware on devices where U-Boot skips the
setup and leaves the controller registers untouched.
====================
Realtek Otto switches usually make use of multiport PHYs (e.g. 8 port
1G RTL8218D or 4 port 2.5G RTL8224). The device tree can describe this
fact via an "ethernet-phy-package" node that resides between the bus
and the PHY node.
When looking up the device tree bus node via the chain port->phy->parent
the driver totally ignores the existence of a PHY package. Enhance the
lookup to take care of this feature.
After the former refactoring the existing otto_emdio_9300_mdiobus_init()
contains only the c22/c45 bus mode setup. Like the topology setup this
must run before bus registration. Otherwise the bus does not "speak" the
right protocol for PHY setup.
This setup is device-specific and other SoCs will need to set up other
register bits in the controller in the future. Therefore
- Relocate c22/c45 device tree readout to the very beginning of the probing
- Add a new device-specific setup_controller() into the info structure.
- Relocate otto_emdio_priv to satisfy the new info structure dependency.
- Rename otto_emdio_9300_mdiobus_init accordingly and add it to the
RTL9300 info structure. At the same time, adapt register naming
for the function to make it clear that it only applies to this SoC.
- Call setup_controller() prior to bus registration.
net: mdio: realtek-rtl9300: relocate c22/c45 device tree readout
otto_emdio_map_ports() is the central place to look up the topology and the
properties of the Realtek ethernet MDIO controller from the device tree.
Deviating from this, the c22/c45 detection via "ethernet-phy-ieee802.3-c45"
runs separately in otto_emdio_probe_one(). It loops over the same
nodes, just at a later point in time.
There is no benefit in dividing this setup and having a time window where
the data structure is only partially filled. Additionally, it uses the
"fwnode" API. Consolidate the setup and convert it to the "of" API.
Remark. This is a subtle change for dangling PHY nodes (not referenced
by ethernet-ports). Before this commit all PHY nodes were evaluated for
c45 setup, now only the referenced ones.
Until now the driver sets up the port to bus/address topology of the
controller after all buses are set up via otto_emdio_probe_one(). This
does not work for devices where U-Boot skips this setup. It is not
only needed for the hardware internal background PHY polling engine
but it is essential for access to the PHYs during probing.
Depending on the SoC type there exist two different register arrays
- Bus mapping registers (RTL930x, RTL931x) define to which bus the port
is attached. E.g. [1]
- Address mapping registers (RTL838x, RTL930x, RTL931x) define to which
address of the bus the port is attached. E.g. [2]
Relocate the topology setup and make it generic. For this
- Define device-specific bus_base/addr_base attributes that give the
register base address where the mapping lives. In case one or both are
not given the SoC does not support this specific type of mapping.
- Create a helper otto_emdio_setup_topology() that writes the detected
topology to the registers.
- Call this helper prior to otto_emdio_probe_one().
- Remove unneeded code from otto_emdio_9300_mdiobus_init().
- Due to the added prefixes, increase define indentation
Subtle change: The old code used regmap_bulk_write() and silently wrote
bus=0/address=0 to the mapping registers of ports that are out of scope.
The new code leaves those registers untouched.
The bus probing of the MDIO driver uses a two-stage approach.
1. The device tree "ethernet-ports" node is scanned to build a mapping
between ports and PHYs.
2. The children of the device tree "controller" are scanned to create
the individual MDIO buses.
The first step already checks the consistency of the PHY and bus nodes
that are linked via the ports. But it might miss a dangling bus child
node that is not linked. Step two simply iterates over all bus child
nodes and might read malformed data from nodes not checked in step one.
Harden this and return a meaningful error message.
Due to its design the MDIO driver needs to set up a port to bus/address
mapping during probing. The "ethernet-ports" subnodes are scanned and
from the "phy-handle" property the MDIO nodes are looked up. In case of
a malformed device tree the driver might produce out-of-bounds accesses.
The PHY address is not checked against the maximum supported address.
Add a sanity check and drop the unneeded MAX_SMI_ADDR define.
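The sanity check described above can be sketched in plain C. This is a minimal userspace model, not the driver code: sketch_map_port(), struct sketch_port_map and SKETCH_PHY_MAX_ADDR are hypothetical names, and the limit of 32 assumes the usual 5-bit MDIO address space.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: MDIO addresses are 5 bits wide, so only 0..31 are valid. */
#define SKETCH_PHY_MAX_ADDR 32

/* Hypothetical port to bus/address mapping entry. */
struct sketch_port_map {
	uint8_t bus;
	uint8_t addr;
};

/*
 * Record the mapping for one port. Both the port index and the PHY
 * address come from the device tree, so both are validated before
 * they are used to index or fill the table.
 */
static int sketch_map_port(struct sketch_port_map *map, unsigned int nports,
			   unsigned int port, unsigned int bus,
			   unsigned int addr)
{
	if (port >= nports)
		return -EINVAL;
	if (addr >= SKETCH_PHY_MAX_ADDR)
		return -EINVAL;

	map[port].bus = (uint8_t)bus;
	map[port].addr = (uint8_t)addr;
	return 0;
}
```

Rejecting the value at the point where it is read keeps every later user of the table free of range checks.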
This function has multiple issues:
- It uses __free low-level cleanups
- It mixes "fwnode" and "of" functions
Convert this to a uniform "of" usage and manual reference counting
cleanup. With that also fix two subtle lookup bugs in the original
code.
	mdio_dn = phy_dn->parent;
	if (mdio_dn->parent != dev->of_node)
		continue;
This skips an API access and therefore misses reference counting.
Additionally in the case of a very buggy device tree, phy_dn might
be a root node. Looking up its grandparent leads to a NULL pointer
access.
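The two lookup bugs can be illustrated with a small userspace model. It is only a sketch under stated assumptions: struct sketch_node and the sketch_* helpers are hypothetical stand-ins for device tree nodes and reference-counted accessors such as of_get_parent() and of_node_put(); it is not the driver's code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device tree node. */
struct sketch_node {
	struct sketch_node *parent;
	int refcount;
};

/* Return the parent with a reference taken, or NULL for a root node. */
static struct sketch_node *sketch_get_parent(struct sketch_node *n)
{
	if (!n || !n->parent)
		return NULL;
	n->parent->refcount++;
	return n->parent;
}

/* Drop a reference; accepts NULL like of_node_put() does. */
static void sketch_put(struct sketch_node *n)
{
	if (n)
		n->refcount--;
}

/* Return 1 if the grandparent of phy is ctrl, 0 otherwise. */
static int sketch_phy_belongs_to(struct sketch_node *phy,
				 struct sketch_node *ctrl)
{
	struct sketch_node *bus, *owner;
	int match;

	bus = sketch_get_parent(phy);
	if (!bus)
		return 0;	/* phy is a root node: no grandparent */

	owner = sketch_get_parent(bus);
	match = owner && owner == ctrl;

	/* Every reference taken above is released again. */
	sketch_put(owner);
	sketch_put(bus);
	return match;
}
```

Going through an accessor at each step means a missing parent shows up as a NULL return instead of a dereference, and the reference counts stay balanced on every path.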
Vikas Gupta [Thu, 4 Jun 2026 16:37:09 +0000 (22:07 +0530)]
bnge: fix context mem iteration
The firmware advertises context memory (backing store) types
through a linked list, with BNGE_CTX_INV serving as the
end-of-list sentinel.
However, the driver incorrectly assumes that the list is strictly
ordered and prematurely terminates traversal when it encounters
an unrecognized type (>=BNGE_CTX_V2_MAX). As a result, any valid
context types that appear later in the chain are silently skipped,
leading to incomplete memory configuration and eventual driver load
failure.
Fix this by traversing the entire list until the BNGE_CTX_INV sentinel
is reached, while safely ignoring only those context types that fall
outside the supported range.
Fixes: 29c5b358f385 ("bng_en: Add backing store support")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Dharmender Garg <dharmender.garg@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
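The traversal rule from the bnge fix (walk until the sentinel, skip unsupported types, never stop early) can be sketched as follows. This is a userspace model under assumptions: the table lookup stands in for the firmware query, and SKETCH_CTX_MAX, SKETCH_CTX_INV and the sketch_* names are hypothetical, not the driver's definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CTX_MAX 4	/* hypothetical count of supported types */
#define SKETCH_CTX_INV 0xffff	/* hypothetical end-of-list sentinel */

/* One firmware answer: the queried type and the next type in the chain. */
struct sketch_fw_entry {
	uint16_t type;
	uint16_t next_type;
};

/* Stand-in for the firmware query: find the entry for a given type. */
static const struct sketch_fw_entry *
sketch_query(const struct sketch_fw_entry *tbl, size_t n, uint16_t type)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].type == type)
			return &tbl[i];
	return NULL;
}

/*
 * Walk the chain starting at 'first'. Unsupported types are skipped,
 * but the walk continues with their next_type. Only the sentinel (or
 * a failed query) ends the loop. The step counter bounds the walk in
 * case the chain contains a cycle.
 */
static unsigned int sketch_walk(const struct sketch_fw_entry *tbl, size_t n,
				uint16_t first, uint8_t *seen)
{
	unsigned int configured = 0;
	uint16_t type = first;
	size_t steps = 0;

	while (type != SKETCH_CTX_INV && steps++ < n) {
		const struct sketch_fw_entry *e = sketch_query(tbl, n, type);

		if (!e)
			break;
		if (type < SKETCH_CTX_MAX) {
			seen[type] = 1;
			configured++;
		}
		type = e->next_type;
	}
	return configured;
}
```

With a chain 0 -> 9 -> 2 -> sentinel, the unsupported type 9 is skipped and type 2 is still configured, which is the case the old early exit missed.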
Bit 16 of the MAC HW Feature1 register reports the DCB (Data Centre
Bridging) feature. Read it so that dma_cap.dcben and the debugfs
report it accurately. Right now it is always reported as being disabled.
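Reading the capability bit amounts to a single mask and shift. The sketch below assumes only what the text states (bit 16 of the MAC HW Feature1 register); the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bit 16 of MAC HW Feature1 reports DCB support (name is hypothetical). */
#define SKETCH_HW_FEAT1_DCBEN (UINT32_C(1) << 16)

/* Return 1 if the DCB feature bit is set in the raw register value. */
static unsigned int sketch_dcben(uint32_t hw_feature1)
{
	return (hw_feature1 & SKETCH_HW_FEAT1_DCBEN) >> 16;
}
```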
Add dma_rmb() barrier after req_id completion check in
ena_com_phc_get_timestamp(). On weakly-ordered architectures,
payload fields may be read before req_id is observed as updated.
Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver")
Closes: https://sashiko.dev/#/patchset/20260430032507.11586-1-akiyano%40amazon.com
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
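The ordering requirement behind the dma_rmb() fix can be shown with a userspace analogue. This is not the driver code and not dma_rmb() itself: it uses a C11 acquire fence to model the same "check the flag, then read the payload" pattern, and struct sketch_phc_resp and sketch_read_timestamp() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical response block written by another agent. */
struct sketch_phc_resp {
	_Atomic uint16_t req_id;	/* updated last by the writer */
	uint64_t timestamp;		/* payload, valid once req_id matches */
};

/*
 * Read the payload only after the completion check, with a read
 * barrier in between. Without the fence, a weakly-ordered CPU may
 * satisfy the timestamp load before the req_id load.
 */
static int sketch_read_timestamp(struct sketch_phc_resp *resp,
				 uint16_t expected, uint64_t *out)
{
	if (atomic_load_explicit(&resp->req_id, memory_order_relaxed) !=
	    expected)
		return -1;	/* not completed yet */

	/* Userspace stand-in for the read barrier after the check. */
	atomic_thread_fence(memory_order_acquire);

	*out = resp->timestamp;
	return 0;
}
```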
ZhaoJinming [Thu, 4 Jun 2026 07:03:52 +0000 (15:03 +0800)]
net: airoha: Add NULL check for of_reserved_mem_lookup() in airoha_qdma_init_hfwd_queues()
of_reserved_mem_lookup() may return NULL if the reserved memory region
referenced by the "memory-region" phandle is not found in the reserved
memory table (e.g. due to a misconfigured DTS or a removed
memory-region node). The current code dereferences the returned
pointer without checking for NULL, leading to a kernel NULL pointer
dereference at the following lines:
dma_addr = rmem->base; // line 1156
num_desc = div_u64(rmem->size, buf_size); // line 1160
Add a NULL check after of_reserved_mem_lookup() and return -ENODEV if
the lookup fails, which is consistent with the existing error handling
for of_parse_phandle() failure in the same code block.
Fixes: 3a1ce9e3d01b ("net: airoha: Add the capability to allocate hwfd buffers via reserved-memory")
Cc: stable@vger.kernel.org
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
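The shape of the airoha NULL check can be sketched in a userspace model. The struct and function names are hypothetical; the caller passes in whatever the lookup returned, which may be NULL, and the zero buf_size guard is an extra assumption that is not part of the described fix.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reserved memory region descriptor. */
struct sketch_rmem {
	uint64_t base;
	uint64_t size;
};

/*
 * Derive the DMA address and descriptor count from a region that a
 * lookup returned. A failed lookup is reported as -ENODEV before any
 * field of the region is touched.
 */
static int sketch_init_hfwd(const struct sketch_rmem *rmem, uint64_t buf_size,
			    uint64_t *dma_addr, uint64_t *num_desc)
{
	if (!rmem)
		return -ENODEV;
	if (!buf_size)
		return -EINVAL;	/* extra guard, assumption of this sketch */

	*dma_addr = rmem->base;
	*num_desc = rmem->size / buf_size;
	return 0;
}
```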
Maoyi Xie [Thu, 4 Jun 2026 05:49:49 +0000 (13:49 +0800)]
hsr: broadcast netlink notifications in the device's net namespace
The HSR generic netlink family sets .netnsok = true. HSR devices can
live in network namespaces other than init_net.
Two async notifiers broadcast events with genlmsg_multicast(). They
are hsr_nl_ringerror() and hsr_nl_nodedown(). That helper delivers
only on the default genl socket in init_net. So the events always land
in init_net. The network namespace of the device does not matter.
This has two effects. A listener in the device's own namespace never
sees its own ring error and node down events. A privileged listener in
init_net receives events from HSR devices in other namespaces. The
payload carries the peer node MAC (HSR_A_NODE_ADDR) and the slave port
ifindex (HSR_A_IFINDEX).
Switch both callers to genlmsg_multicast_netns(). Other families with
.netnsok = true already do this. Examples are gtp, ovpn, team,
batman-adv, netdev-genl, ethtool and handshake.
hsr_nl_ringerror() already has the slave port. It uses
dev_net(port->dev). hsr_nl_nodedown() takes the namespace from the
master port via hsr_port_get_hsr().