Will Deacon [Wed, 24 Sep 2025 15:34:56 +0000 (16:34 +0100)]
Merge branch 'for-next/selftests' into for-next/core
* for-next/selftests:
kselftest/arm64: Add lsfe to the hwcaps test
kselftest/arm64: Check that unsupported regsets fail in sve-ptrace
kselftest/arm64: Verify that we reject out of bounds VLs in sve-ptrace
kselftest/arm64/gcs/basic-gcs: Respect parent directory CFLAGS
selftests/arm64: Fix grammatical error in string literals
kselftest/arm64: Add parentheses around sizeof for clarity
kselftest/arm64: Supress warning and improve readability
kselftest/arm64: Remove extra blank line
kselftest/arm64/gcs: Use nolibc's getauxval()
kselftest/arm64/gcs: Correctly check return value when disabling GCS
selftests: arm64: Fix -Waddress warning in tpidr2 test
kselftest/arm64: Log error codes in sve-ptrace
selftests: arm64: Check fread return value in exec_target
Will Deacon [Wed, 24 Sep 2025 15:34:52 +0000 (16:34 +0100)]
Merge branch 'for-next/perf' into for-next/core
* for-next/perf: (29 commits)
perf/dwc_pcie: Fix use of uninitialized variable
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
perf/arm-cmn: Fix CMN S3 DTM offset
perf: arm_spe: Prevent overflow in PERF_IDX2OFF()
coresight: trbe: Prevent overflow in PERF_IDX2OFF()
MAINTAINERS: Remove myself from HiSilicon PMU maintainers
drivers/perf: hisi: Add support for HiSilicon MN PMU driver
drivers/perf: hisi: Add support for HiSilicon NoC PMU
perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
arm64/boot: Enable EL2 requirements for SPE_FEAT_FDS
arm64/boot: Factor out a macro to check SPE version
...
Will Deacon [Wed, 24 Sep 2025 15:34:06 +0000 (16:34 +0100)]
Merge branch 'for-next/misc' into for-next/core
* for-next/misc:
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
arm64: Kconfig: Spell out "ARMv9.4" in menuconfig text
arm64/fpsimd: simplify sme_setup()
Will Deacon [Wed, 24 Sep 2025 15:34:02 +0000 (16:34 +0100)]
Merge branch 'for-next/entry' into for-next/core
* for-next/entry:
arm/syscalls: mark syscall invocation as likely in invoke_syscall
arm64: entry: Switch to generic IRQ entry
arm64: entry: Move arm64_preempt_schedule_irq() into __exit_to_kernel_mode()
arm64: entry: Refactor preempt_schedule_irq() check code
entry: Add arch_irqentry_exit_need_resched() for arm64
arm64: entry: Use preempt_count() and need_resched() helper
arm64: entry: Rework arm64_preempt_schedule_irq()
arm64: entry: Refactor the entry and exit for exceptions from EL1
arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()
Will Deacon [Wed, 24 Sep 2025 15:33:03 +0000 (16:33 +0100)]
Merge branch 'for-next/fixes' into for-next/core
* for-next/fixes:
arm64: ftrace: fix unreachable PLT for ftrace_caller in init_module with CONFIG_DYNAMIC_FTRACE
ACPI/IORT: Fix memory leak in iort_rmr_alloc_sids()
arm64: uapi: Provide correct __BITS_PER_LONG for the compat vDSO
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature
Will Deacon [Fri, 19 Sep 2025 18:40:25 +0000 (19:40 +0100)]
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
Big-endian arm64 configurations are vanishingly rare, yet we still claim
to support them in Linux despite very limited testing or visible
interest. Supporting big-endian adds unnecessary burden to reviewers and
contributors which, without any known active users, is hard to justify.
For example, recent work to improve our futex routines and to implement
nested virtualisation support is non-trivially complicated by having to
support both big- and little-endianness.
Back in 2019 [1], it was claimed that Huawei were using arm64 big-endian
machines in their telecommunication products but I don't know whether
that's still the case and certainly haven't seen any patch contributions
to help support or maintain it.
Make CPU_BIG_ENDIAN depend on BROKEN as an initial deprecation step
towards its removal.
Ilkka Koskinen [Tue, 23 Sep 2025 21:31:36 +0000 (14:31 -0700)]
perf/dwc_pcie: Fix use of uninitialized variable
Fix use of uninitialized variable in group validation code.
Fixes: 71396cfac97d ("perf/dwc_pcie: Support counting multiple lane events in parallel") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202509231223.gZsX6Eio-lkp@intel.com/ Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Will Deacon <will@kernel.org>
Can Peng [Fri, 19 Sep 2025 10:00:42 +0000 (18:00 +0800)]
arm/syscalls: mark syscall invocation as likely in invoke_syscall
The invoke_syscall() function is overwhelmingly called for
valid system call entries. Annotate the main path with likely()
to help the compiler generate better branch prediction hints,
reducing CPU pipeline stalls due to mispredictions.
This is a micro-optimization targeting syscall-heavy workloads [1].
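As a simplified sketch (not the exact kernel source), the annotated dispatch
path looks roughly like this:

  /* Simplified sketch of invoke_syscall()'s dispatch path. */
  if (likely(scno < sc_nr)) {
          syscall_fn_t syscall_fn;

          syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)];
          ret = __invoke_syscall(regs, syscall_fn);
  } else {
          ret = do_ni_syscall(regs, scno);
  }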
Yushan Wang [Fri, 29 Aug 2025 10:14:27 +0000 (18:14 +0800)]
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Some of the HiSilicon V3 PMU hardware is divided into parts, each fulfilling
the job of monitoring a specific part of a device. Add a description of that,
as well as of the newly added ext option for the L3C PMU.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Reviewed-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Will Deacon <will@kernel.org>
Yushan Wang [Fri, 29 Aug 2025 10:14:26 +0000 (18:14 +0800)]
Documentation: hisi-pmu: Fix of minor format error
The inline sysfs paths should be placed in literal blocks to make the
documentation look better.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:25 +0000 (18:14 +0800)]
drivers/perf: hisi: Add support for L3C PMU v3
This patch adds support for L3C PMU v3. The v3 L3C PMU supports an extended
event space which can be controlled in up to 2 extra address spaces with
separate overflow interrupts. The layout of the control/event registers is
kept the same. Together with the original events, the extended events cover
monitoring of all transactions on the L3C.
Extended events are specified with the `ext=[1|2]` option so that the driver
can distinguish them, like below:
perf stat -e hisi_sccl0_l3c0_0/event=<event_id>,ext=1/
Currently only the event option uses config bits [7:0], so there is still
plenty of unused space. Make ext use config bits [17:16] and reserve bits
[15:8] for the event option for future extension.
With the extra counters, the number of counters for a HiSilicon uncore PMU
can reach up to 24, so the counter usage bitmap is extended accordingly.
The hw_perf_event::event_base is initialized to the base MMIO
address of the event and will be used for later control,
overflow handling and counts readout.
We still make use of the uncore PMU framework for handling the events and
for interrupt migration on CPU hotplug. The framework's cpuhp callback will
handle the event migration and the interrupt migration of the original
events; if the PMU supports extended events, the interrupts for the extended
events are migrated to the same CPU chosen by the framework.
A new HID of HISI0215 is used for this version of L3C PMU.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Co-developed-by: Yushan Wang <wangyushan12@huawei.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:24 +0000 (18:14 +0800)]
drivers/perf: hisi: Refactor the event configuration of L3C PMU
The event register is configured using hisi_pmu::base directly since only
one address space is supported by the L3C PMU. This needs to be extended if
the event configuration lives in a different address space. To prepare for
such hardware, extract the event register configuration into a separate
function that uses hw_perf_event::event_base as each event's base address.
Implement a private hisi_uncore_ops::get_event_idx() callback to initialize
event_base in addition to returning the hardware index.
No functional changes intended.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:23 +0000 (18:14 +0800)]
drivers/perf: hisi: Extend the field of tt_core
Currently tt_core uses config1 bits [7:0] and cannot be extended. On some
platforms more than 8 CPUs share the L3 cache. So make tt_core use config2
bits [15:0]; the remaining bits in config2 are reserved for extension.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:22 +0000 (18:14 +0800)]
drivers/perf: hisi: Extract the event filter check of L3C PMU
The L3C PMU has 4 filter options which share perf_event_attr::config1. The
driver checks config1 to see whether a certain event has a filter setting,
which becomes incorrect if other bits in config1 are used for non-filter
options. So instead check whether each filter option is set directly, in a
separate function.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:21 +0000 (18:14 +0800)]
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
Versions 1 and 2 of the L3C PMU also use different HIDs. Make use of
struct acpi_device_id::driver_data for version-specific information rather
than reading the version register. This helps to simplify the probe process
and also makes it a bit easier to extend.
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:20 +0000 (18:14 +0800)]
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
Currently the uncore PMU framework assumes one PMU device only has one
interrupt and registers the interrupt handler on its behalf. It cannot
support a PMU with multiple interrupt resources. An uncore PMU may have
multiple interrupts that can share the same handler. Export
hisi_uncore_pmu_isr() to allow drivers to register the IRQ handler in their
own routines.
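A minimal sketch of how a driver with multiple interrupt lines might reuse
the exported handler (the variable names and probe context are assumptions,
not taken from the patch):

  /* Hypothetical sketch: several IRQs sharing the common handler. */
  for (i = 0; i < pmu_dev->nr_irqs; i++) {
          ret = devm_request_irq(dev, pmu_dev->irqs[i], hisi_uncore_pmu_isr,
                                 IRQF_NOBALANCING | IRQF_NO_THREAD,
                                 dev_name(dev), &pmu_dev->pmu);
          if (ret)
                  return ret;
  }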
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Fri, 29 Aug 2025 10:14:19 +0000 (18:14 +0800)]
drivers/perf: hisi: Relax the event ID check in the framework
The event ID only uses attr::config bits [7:0], but we check the event range
against the whole 64-bit field. This blocks use of the remaining attr::config
bits. Relax the check to consider only bits [7:0].
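For illustration only (the macro and function names are hypothetical, not
from the patch), the relaxed check amounts to masking the event ID before
comparing it against the supported range:

  /* Hypothetical sketch: only bits [7:0] of attr::config hold the event ID. */
  #define HISI_EVENT_ID_MASK      GENMASK_ULL(7, 0)

  static bool hisi_event_id_valid(u64 config, u64 max_event)
  {
          return FIELD_GET(HISI_EVENT_ID_MASK, config) <= max_event;
  }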
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Yushan Wang <wangyushan12@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
This adds a new dynamic PMU to the Perf Events framework to program and
control the Uncore PMUs in Fujitsu chips.
This driver exports formatting and event information to sysfs so that it can
be used by the perf userspace tools with syntaxes such as:
perf stat -e pci_iod0_pci0/ea-pci/ ls
perf stat -e pci_iod0_pci0/event=0x80/ ls
perf stat -e mac_iod0_mac0_ch0/ea-mac/ ls
perf stat -e mac_iod0_mac0_ch0/event=0x80/ ls
arm64: map [_text, _stext) virtual address range non-executable+read-only
Since the referenced fixes commit, the kernel's .text section is only
mapped starting from _stext; the region [_text, _stext) is omitted. As a
result, other vmalloc/vmap allocations may use the virtual addresses
nominally in the range [_text, _stext). This address reuse confuses
multiple things:
1. crash_prepare_elf64_headers() sets up a segment in /proc/vmcore
mapping the entire range [_text, _end) to
[__pa_symbol(_text), __pa_symbol(_end)). Reading an address in
[_text, _stext) from /proc/vmcore therefore gives the incorrect
result.
2. Tools doing symbolization (either by reading /proc/kallsyms or based
on the vmlinux ELF file) will incorrectly identify vmalloc/vmap
allocations in [_text, _stext) as kernel symbols.
In practice, both of these issues affect the drgn debugger.
Specifically, there were cases where the vmap IRQ stacks for some CPUs
were allocated in [_text, _stext). As a result, drgn could not get the
stack trace for a crash in an IRQ handler because the core dump
contained invalid data for the IRQ stack address. The stack addresses
were also symbolized as being in the _text symbol.
Fix this by bringing back the mapping of [_text, _stext), but now make
it non-executable and read-only. This prevents other allocations from
using it while still achieving the original goal of not mapping
unpredictable data as executable. Other than the changed protection,
this is effectively a revert of the fixes commit.
Fixes: e2a073dde921 ("arm64: omit [_text, _stext) from permanent kernel mapping") Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Will Deacon <will@kernel.org>
Update TCR_EL1 register fields as per latest ARM ARM DDI 0487 L.B and while
here drop an explicit sysreg definition SYS_TCR_EL1 from sysreg.h, which is
now redundant.
Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Dev Jain [Mon, 22 Sep 2025 06:41:26 +0000 (12:11 +0530)]
arm64: Enable vmalloc-huge with ptdump
Our goal is to move towards enabling vmalloc-huge by default on arm64 so
as to reduce TLB pressure. Therefore, we need a way to analyze the portion
of block mappings in vmalloc space we can get on a production system; this
can be done through ptdump, but currently we disable vmalloc-huge if
CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
may dereference a bogus address.
To solve this, we need to synchronize ptdump_walk() and ptdump_check_wx()
with pud_free_pmd_page() and pmd_free_pte_page().
Since this race is very unlikely to happen in practice, we do not want to
penalize the vmalloc pagetable tearing path by taking the init_mm
mmap_lock. Therefore, we use static keys. ptdump_walk() and
ptdump_check_wx() are the pagetable walkers; they will enable the static
key - upon observing that, the vmalloc pagetable tearing path will get
patched in with an mmap_read_lock/unlock sequence. A combination of the
patched-in mmap_read_lock/unlock, the acquire semantics of
static_branch_inc(), and the barriers in __flush_tlb_kernel_pgtable()
ensures that ptdump will never get a hold on the address of a freed PMD
or PTE table.
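A rough sketch of the teardown-side hook described above (the key name is
illustrative and the walker side, which bumps the key with static_branch_inc()
and takes init_mm's mmap_lock before walking, is omitted; exact details are in
the patch):

  /* Hypothetical sketch of the patched-in locking in the teardown path. */
  static DEFINE_STATIC_KEY_FALSE(arm64_ptdump_key);

  /* pud_free_pmd_page()/pmd_free_pte_page() style path: */
  if (static_branch_unlikely(&arm64_ptdump_key))
          mmap_read_lock(&init_mm);
  /* ... clear the entry, __flush_tlb_kernel_pgtable(), free the table ... */
  if (static_branch_unlikely(&arm64_ptdump_key))
          mmap_read_unlock(&init_mm);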
We can verify the correctness of the algorithm via the following litmus
test (thanks to James Houghton and Will Deacon):
exists (0:X8=0 /\ 1:X8=0 /\ (* Lock acquisitions succeed *)
0:X9=0xa110c /\ (* P0 sees the valid PUD ...*)
0:X11=0xdead) (* ... but the freed PMD *)
For an approximate written proof of why this algorithm works, please read
the code comment in [1], which is now removed for the sake of simplicity.
The mm selftests pass. No issues were observed while running
test_vmalloc.sh (which stresses the vmalloc subsystem) and
cat /sys/kernel/debug/{kernel_page_tables, check_wx_pages} in a loop in
parallel.
Ryan Roberts [Fri, 19 Sep 2025 14:58:30 +0000 (15:58 +0100)]
arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
Neoverse-V3AE advertises support for BBML2 and is known to not raise
conflict aborts. So add it to the BBML2_NOABORT allow list.
However, just like Neoverse-V3, Neoverse-V3AE r0p0 and r0p1 suffer from
erratum #3053180, for which the workaround is to always observe
break-before-make requirements for affected revisions. Therefore only
add to the allow list from r0p2 onwards.
For more details see Software Developer Errata Notice (SDEN) document:
Neoverse V3AE (MP172) SDEN v9.0, erratum 3053180
https://developer.arm.com/documentation/SDEN-2615521/9-0/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Enable the workaround for Neoverse-V3AE, and document this.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
... in section A.6.1 ("MIDR_EL1, Main ID Register").
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ryan Roberts [Wed, 17 Sep 2025 19:02:10 +0000 (12:02 -0700)]
arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs
The kernel linear mapping is painted at a very early stage of system boot,
before the cpufeatures have been finalized. So the linear mapping is
determined by the capability of the boot CPU only. If the boot CPU supports
BBML2, large block mappings will be used for the linear mapping.
But the secondary CPUs may not support BBML2, so once cpufeatures are
finalized on all CPUs, repaint the linear mapping if large block mappings
were used and the secondary CPUs don't support BBML2.
If the boot CPU doesn't support BBML2, or the secondary CPUs have the same
BBML2 capability as the boot CPU, repainting the linear mapping is not
needed.
Repainting is implemented by the boot CPU, which we know supports BBML2,
so it is safe for the live mapping size to change for this CPU. The
linear map region is walked using the pagewalk API and any discovered
large leaf mappings are split to pte mappings using the existing helper
functions. Since the repainting is performed inside of a stop_machine(),
we must use GFP_ATOMIC to allocate the extra intermediate pgtables. But
since we are still early in boot, it is expected that there is plenty of
memory available so we will never need to sleep for reclaim, and so
GFP_ATOMIC is acceptable here.
The secondary CPUs are all put into a waiting area with the idmap in
TTBR0 and reserved map in TTBR1 while this is performed since they
cannot be allowed to observe any size changes on the live mappings. Some
of this infrastructure is reused from the kpti case. Specifically we
share the same flag (was __idmap_kpti_flag, now idmap_kpti_bbml2_flag)
since it means we don't have to reserve any extra pgtable memory to
idmap the extra flag.
Co-developed-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Will Deacon [Fri, 19 Sep 2025 13:56:43 +0000 (14:56 +0100)]
arm64: Kconfig: Spell out "ARMv9.4" in menuconfig text
The menuconfig entries to configure various architectural features are
all formatted as "ARMvx.y architecture features" with the unusual
exception of 9.4, which omits the "ARM" prefix.
Add the "ARM" prefix to the menuconfig entry for the ARMv9.4
architectural features.
Add support for ACPI CCEL by handling the EfiACPIMemoryNVS type memory.
As per the UEFI specification, NVS memory is reserved for firmware use even
after exiting boot services. Thus map the region as read-only.
Cc: Sami Mujawar <sami.mujawar@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Gavin Shan <gshan@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Sami Mujawar <sami.mujawar@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
arm64: realm: ioremap: Allow mapping memory as encrypted
For ioremap(), so far we only checked whether the target was a device
(RIPAS_DEV) to choose an encrypted vs decrypted mapping. However, we may have
firmware-reserved memory regions exposed to the OS (e.g., EFI Coco Secret
Securityfs, ACPI CCEL).
We need to make sure that anything that is RIPAS_RAM (i.e., guest protected
memory with RMM guarantees) is also mapped as encrypted.
Rephrasing the above, anything that is not RIPAS_EMPTY is guaranteed to be
protected by the RMM. Thus we choose an encrypted mapping for anything that
is not RIPAS_EMPTY. While at it, rename the helper function
Yang Shi [Wed, 17 Sep 2025 19:02:09 +0000 (12:02 -0700)]
arm64: mm: support large block mapping when rodata=full
When rodata=full is specified, the kernel linear mapping has to be mapped at
PTE level, since large block mappings can't be split due to the
break-before-make rule on arm64.
This resulted in a couple of problems:
- performance degradation
- more TLB pressure
- memory waste for kernel page table
With FEAT_BBM level 2 support, splitting a large block mapping into smaller
ones no longer requires making the page table entry invalid. This allows the
kernel to split large block mappings on the fly.
Add kernel page table split support and use large block mapping by
default when FEAT_BBM level 2 is supported for rodata=full. When
changing permissions for kernel linear mapping, the page table will be
split to smaller size.
Machines without FEAT_BBM level 2 fall back to a PTE-mapped kernel linear
mapping when rodata=full.
With this we saw significant performance boost with some benchmarks and
much less memory consumption on my AmpereOne machine (192 cores, 1P)
with 256GB memory.
* Memory use after boot
Before:
MemTotal: 258988984 kB
MemFree: 254821700 kB
Around 500MB more memory is free to use. The larger the machine, the more
memory is saved.
* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure. With
this patchset, ops/sec increased by around 3.5% and P99 latency was reduced
by around 9.6%.
The gain mainly came from reduced kernel TLB misses. The kernel TLB
MPKI is reduced by 28.5%.
The benchmark data is now on par with rodata=on too.
* Disk encryption (dm-crypt) benchmark
Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
disk encryption (by dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap \
--randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1 \
--ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1 \
--group_reporting --thread --name=iops-test-job --eta-newline=1 \
--size 100G
IOPS increased by 90% - 150% (the variance is high, but the worst result of
the good case is around 90% higher than the best result of the bad case).
Bandwidth increased and the average completion latency dropped
proportionally.
* Sequential file read
Reading a 100G file sequentially on XFS (xfs_io read with the page cache
populated), bandwidth increased by 150%.
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Will Deacon <will@kernel.org>
Dev Jain [Wed, 17 Sep 2025 19:02:07 +0000 (12:02 -0700)]
arm64: Enable permission change on arm64 kernel block mappings
This patch paves the way to enabling huge mappings in vmalloc space and the
linear map by default on arm64. For this we must ensure that we can handle
any permission games on the kernel (init_mm) pagetable. Previously,
__change_memory_common() used apply_to_page_range(), which does not support
changing permissions for block mappings. We move away from this by using the
pagewalk API, similar to what riscv does right now. It is the responsibility
of the caller to ensure that the range over which permissions are being
changed falls on leaf mapping boundaries. For systems with BBML2, this will
be handled in future patches by dynamically splitting the mappings when
required.
Unlike apply_to_page_range(), the pagewalk API currently requires the
init_mm.mmap_lock to be held. To avoid the unnecessary bottleneck of the
mmap_lock for our use case, this patch extends this generic API to be used
locklessly, so as to retain the existing behaviour for changing permissions.
Apart from this reason, it is noted at [1] that KFENCE can manipulate kernel
pgtable entries during softirqs. It does this by calling set_memory_valid()
-> __change_memory_common(). This being a non-sleepable context, we cannot
take the init_mm mmap lock.
Add comments to highlight the conditions under which we can use the
lockless variant - no underlying VMA, and the user having exclusive
control over the range, thus guaranteeing no concurrent access.
We require that the start and end of a given range do not partially
overlap block mappings, or cont mappings. Return -EINVAL in case a
partial block mapping is detected in any of the PGD/P4D/PUD/PMD levels;
add a corresponding comment in update_range_prot() to warn that
eliminating such a condition is the responsibility of the caller.
Note that, the pte level callback may change permissions for a whole
contpte block, and that will be done one pte at a time, as opposed to an
atomic operation for the block mappings. This is fine as any access will
decode either the old or the new permission until the TLBI.
apply_to_page_range() currently performs all pte level callbacks while
in lazy mmu mode. Since arm64 can optimize performance by batching
barriers when modifying kernel pgtables in lazy mmu mode, we would like
to continue to benefit from this optimisation. Unfortunately
walk_kernel_page_table_range() does not use lazy mmu mode. However,
since the pagewalk framework is not allocating any memory, we can safely
bracket the whole operation inside lazy mmu mode ourselves. Therefore,
wrap the call to walk_kernel_page_table_range() with the lazy MMU
helpers.
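Schematically, and only as a sketch (the walker's argument list and the ops
name are illustrative assumptions):

  /* Sketch: bracket the kernel pagetable walk in lazy MMU mode. */
  arch_enter_lazy_mmu_mode();
  ret = walk_kernel_page_table_range(start, start + size,
                                     &pageattr_ops, NULL, &data);
  arch_leave_lazy_mmu_mode();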
Yang Shi [Wed, 17 Sep 2025 19:02:08 +0000 (12:02 -0700)]
arm64: cpufeature: add AmpereOne to BBML2 allow list
AmpereOne supports BBML2 without conflict aborts; add it to the allow list.
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Will Deacon <will@kernel.org>
Robin Murphy [Thu, 18 Sep 2025 16:25:31 +0000 (17:25 +0100)]
perf/arm-cmn: Fix CMN S3 DTM offset
CMN S3's DTM offset is different between r0px and r1p0, and it turns out this
was not an error in the earlier documentation, but does actually exist in the
design. Lovely.
Cc: stable@vger.kernel.org Fixes: 0dc2f4963f7e ("perf/arm-cmn: Support CMN S3") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
We currently have duplicate definitions for ARM_CPU_PART_CORTEX_X1C and
MIDR_CORTEX_X1C as a result of commits:
58d245e03c324d08 ("arm64: cputype: Add Cortex-X1C definitions")
efe676a1a7554219 ("arm64: proton-pack: Add new CPUs 'k' values for branch mitigation")
Due to inconsistent sorting when adding entries, there was no textual
conflict between the two patches.
Delete the duplicate definitions added by the latter commit.
The definitions in general are largely (but not entirely) in order of
the MIDR_EL1.PartNum value rather than by CPU name, and the remaining
Cortex-X1C definitions appear later in the list.
For now I haven't sorted the remaining MIDR definitions to minimize
churn. I intend to perform some larger cleanup of these in the near
future which should supersede that anyhow.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Leo Yan [Wed, 17 Sep 2025 17:41:39 +0000 (18:41 +0100)]
perf: arm_spe: Prevent overflow in PERF_IDX2OFF()
Cast nr_pages to unsigned long to avoid overflow when handling large
AUX buffer sizes (>= 2 GiB).
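For reference, the macro pattern being fixed looks roughly like this (a
sketch, not the exact driver source):

  /* Sketch: without the cast, nr_pages << PAGE_SHIFT is a 32-bit
   * computation and overflows for AUX buffers of 2 GiB or more. */
  #define PERF_IDX2OFF(idx, buf) \
          ((idx) % ((unsigned long)(buf)->nr_pages << PAGE_SHIFT))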
Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension") Signed-off-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Thu, 14 Aug 2025 09:16:22 +0000 (17:16 +0800)]
MAINTAINERS: Remove myself from HiSilicon PMU maintainers
Remove myself as I'm leaving HiSilicon and will no longer be in a position to
maintain this. Thanks for the journey.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
Junhao He [Thu, 14 Aug 2025 09:16:21 +0000 (17:16 +0800)]
drivers/perf: hisi: Add support for HiSilicon MN PMU driver
MN (Miscellaneous Node) is a hybrid node in ARM CHI. It broadcasts the
following two types of requests: DVM operations and PCIe configuration.
MN PMU devices exist on both the SCCL and the SICL, so the MN PMU devices
are named after the SCL (super cluster) ID.
The MN PMU driver uses the HiSilicon uncore PMU framework, and only the
event parameter is supported.
Signed-off-by: Junhao He <hejunhao3@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Thu, 14 Aug 2025 09:16:20 +0000 (17:16 +0800)]
drivers/perf: hisi: Add support for HiSilicon NoC PMU
Add support for the HiSilicon NoC (Network on Chip) PMU, which is used to
monitor the events on the system bus. The PMU device is named after the SCL
ID (either Super CPU cluster or Super IO cluster) and the index ID, similar
to other HiSilicon uncore PMUs. The following PMU formats are provided
besides the event:
- ch: the transaction channel (data, request, response, etc) which
can be used to filter the counting.
- tt_en: tracetag filtering enable. As with other HiSilicon uncore PMUs, the
  NoC PMU supports counting only the transactions with a tracetag.
The NoC PMU doesn't have an interrupt to indicate counter overflow. However,
it has a 64-bit counter which is large enough that it's nearly impossible to
overflow.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Wed, 20 Aug 2025 08:45:33 +0000 (16:45 +0800)]
perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
PMCCNTR_EL0 is preferred for counting CPU_CYCLES under certain conditions.
Factor out the condition check into a separate function for further
extension and add documentation for better understanding.
No functional changes intended.
Reviewed-by: James Clark <james.clark@linaro.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Will Deacon <will@kernel.org>
James Clark [Mon, 1 Sep 2025 12:40:35 +0000 (13:40 +0100)]
arm64/boot: Enable EL2 requirements for SPE_FEAT_FDS
SPE data source filtering (optional from Armv8.8) requires that traps to
the filter register PMSDSFR be disabled. Document the requirements and
disable the traps if the feature is present.
Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
James Clark [Mon, 1 Sep 2025 12:40:34 +0000 (13:40 +0100)]
arm64/boot: Factor out a macro to check SPE version
We check the version of SPE twice, and we'll add one more check in the next
commit, so factor out a macro to do this. Change the magic number 3 to the
actual SPE version define (V1p2) to make it more readable.
No functional changes intended.
Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
James Clark [Mon, 1 Sep 2025 12:40:33 +0000 (13:40 +0100)]
perf: arm_spe: Add support for FEAT_SPE_EFT extended filtering
FEAT_SPE_EFT (optional from Armv9.4) adds mask bits for the existing
load, store and branch filters. It also adds two new filter bits for
SIMD and floating point with their own associated mask bits. The current
filters only allow OR filtering on samples that are load OR store etc,
and the new mask bits allow setting part of the filter to an AND, for
example filtering samples that are store AND SIMD. With mask bits set to
0, the OR behavior is preserved, so unless any masks are explicitly set, old
filters behave the same.
Add them all and make them behave the same way as existing format bits:
hidden, and returning EOPNOTSUPP if set when the feature doesn't exist.
Reviewed-by: Leo Yan <leo.yan@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
Leo Yan [Mon, 1 Sep 2025 12:40:32 +0000 (13:40 +0100)]
perf: arm_spe: Expose event filter
Expose an "event_filter" entry in the caps folder to inform user space
about which events can be filtered.
Change the return type of arm_spe_pmu_cap_get() from u32 to u64 to
accommodate the added event filter entry.
Signed-off-by: Leo Yan <leo.yan@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
James Clark [Mon, 1 Sep 2025 12:40:31 +0000 (13:40 +0100)]
perf: arm_spe: Support FEAT_SPEv1p4 filters
FEAT_SPEv1p4 (optional from Armv8.8) adds some new filter bits and also makes
some previously available bits unavailable again, e.g.:
E[30], bit [30]
When FEAT_SPEv1p4 is _not_ implemented ...
Continuing to hard code the valid filter bits for each version isn't
scalable, and it also doesn't work for filter bits that aren't related
to SPE version. For example most bits have a further condition:
E[15], bit [15]
When ... and filtering on event 15 is supported:
Whether "filtering on event 15" is implemented or not is only
discoverable from the TRM of that specific CPU or by probing
PMSEVFR_EL1.
Instead of hard coding them, write all 1s to the PMSEVFR_EL1 register
and read it back to discover the RES0 bits. Unsupported bits are RAZ/WI
so should read as 0s.
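A minimal sketch of that probing approach (the surrounding code and field
name are illustrative):

  /* Sketch: discover implemented filter bits by probing for RES0 bits. */
  write_sysreg_s(U64_MAX, SYS_PMSEVFR_EL1);
  isb();
  spe_pmu->pmsevfr_res0 = ~read_sysreg_s(SYS_PMSEVFR_EL1);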
For any hardware that doesn't strictly follow RAZ/WI for unsupported
filters: Any bits that should have been supported in a specific SPE
version but now incorrectly appear to be RES0 wouldn't have worked
anyway, so it's better to fail to open events that request them rather
than behaving unexpectedly. Bits that aren't implemented but also aren't
RAZ/WI will be incorrectly reported as supported, but allowing them to
be used is harmless.
Testing on N1SDP shows the probed RES0 bits to be the same as the hard
coded ones. The FVP with SPEv1p4 shows only additional new RES0 bits,
i.e. no previously hard coded RES0 bits are missing.
Tested-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
James Clark [Mon, 1 Sep 2025 12:40:30 +0000 (13:40 +0100)]
arm64: sysreg: Add new PMSFCR_EL1 fields and PMSDSFR_EL1 register
Add new fields and register that are introduced for the features
FEAT_SPE_EFT (extended filtering) and FEAT_SPE_FDS (data source
filtering).
Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
Ilkka Koskinen [Thu, 28 Aug 2025 22:35:19 +0000 (15:35 -0700)]
perf/dwc_pcie: Support counting multiple lane events in parallel
While the DesignWare PCIe PMU allows counting only one time-based event at a
time, it allows counting all the lane events simultaneously.
After this patch one is able to count a group of lane events:
$ perf stat -e '{dwc_rootport/tx_memory_write,lane=1/,dwc_rootport/rx_memory_read,lane=0/}' dd if=/dev/nvme0n1 of=/dev/null bs=1M count=1
Earlier the events wouldn't have been counted successfully.
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Will Deacon <will@kernel.org>
Xu Yang [Thu, 21 Aug 2025 11:01:52 +0000 (19:01 +0800)]
MAINTAINERS: include fsl_imx9_ddr_perf.c and some perf metric files
fsl_imx9_ddr_perf.c and some perf metric files under
tools/perf/pmu-events/arch/arm64/freescale/ are missing from MAINTAINERS.
Add them and add me as another maintainer.
Reviewed-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Signed-off-by: Will Deacon <will@kernel.org>
Xu Yang [Thu, 21 Aug 2025 11:01:50 +0000 (19:01 +0800)]
perf: imx_perf: add support for i.MX94 platform
Add compatible string and related devtype for i.MX94 platform.
Reviewed-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Signed-off-by: Will Deacon <will@kernel.org>
Xu Yang [Thu, 21 Aug 2025 11:01:49 +0000 (19:01 +0800)]
dt-bindings: perf: fsl-imx-ddr: Add a compatible string fsl,imx94-ddr-pmu for i.MX94
i.MX94 has a DDR Performance Monitor Unit which is compatible with i.MX93.
Add a compatible string for i.MX94.
Reviewed-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Signed-off-by: Will Deacon <will@kernel.org>
Mark Brown [Wed, 20 Aug 2025 18:29:04 +0000 (19:29 +0100)]
kselftest/arm64: Check that unsupported regsets fail in sve-ptrace
Add a test which verifies that NT_ARM_SVE and NT_ARM_SSVE reads and writes
are rejected as expected when the relevant architecture feature is not
supported.
Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Mark Brown [Wed, 20 Aug 2025 18:29:03 +0000 (19:29 +0100)]
kselftest/arm64: Verify that we reject out of bounds VLs in sve-ptrace
We do not currently have a test that asserts that we reject attempts to set a
vector length smaller than SVE_VL_MIN or larger than SVE_VL_MAX; add one,
since that is our current behaviour.
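A stripped-down sketch of such a check from the tracer's point of view (error
handling and setup are omitted; the actual test lives in sve-ptrace.c):

  /* Sketch: setting an out-of-range vector length should be rejected. */
  struct user_sve_header header = {
          .size = sizeof(header),
          .flags = SVE_PT_REGS_SVE,
          .vl = SVE_VL_MAX * 2,   /* deliberately out of bounds */
  };
  struct iovec iov = { .iov_base = &header, .iov_len = sizeof(header) };

  if (ptrace(PTRACE_SETREGSET, child, NT_ARM_SVE, &iov) == 0)
          ksft_test_result_fail("out of bounds VL accepted\n");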
Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
basic-gcs has its own make rule to handle the special compiler invocation to
build against nolibc. This rule does not respect the $(CFLAGS) passed by the
Makefile from the parent directory. However, these $(CFLAGS) set up the
include path to pick up the UAPI headers from the current kernel.
Due to this, the asm/hwcap.h header is taken from the toolchain instead of
the UAPI headers, and the definition of HWCAP_GCS is not found.
Restructure the rule for basic-gcs to respect the $(CFLAGS).
Also drop those options which are already provided by $(CFLAGS).
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Closes: https://lore.kernel.org/lkml/CA+G9fYv77X+kKz2YT6xw7=9UrrotTbQ6fgNac7oohOg8BgGvtw@mail.gmail.com/ Fixes: a985fe638344 ("kselftest/arm64/gcs: Use nolibc's getauxval()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
As per the admin-guide documentation, "rodata=on" should be the default
across platforms. Documentation/admin-guide/kernel-parameters.txt describes
these options as:
rodata= [KNL,EARLY]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
full Mark read-only kernel memory and aliases as read-only
[arm64]
But on the arm64 platform, RODATA_FULL_DEFAULT_ENABLED is enabled by default,
so "rodata=full" is the default instead.
For parity with other architectures, namely x86, rework 'rodata=on' to
match the current "full" behaviour and replace 'rodata=full' with a new
'rodata=noalias' option which retains writable aliases in the direct map
for memory regions outside of the kernel image.
Sam Edwards [Thu, 4 Sep 2025 00:52:09 +0000 (17:52 -0700)]
arm64: mm: Represent physical memory with phys_addr_t and resource_size_t
This is a type-correctness cleanup to MMU/boot code that replaces
several instances of void * and u64 with phys_addr_t (to represent
addresses) and resource_size_t (to represent sizes) to emphasize that
the code in question concerns physical memory specifically.
The rationale for this change is to improve clarity and readability in
a few modules that handle both types (physical and virtual) of address
and differentiation is essential.
I have left u64 in cases where the address may be either physical or
virtual, where the address is exclusively virtual but used in heavy
pointer arithmetic, and in cases I may have overlooked. I do not
necessarily consider u64 the ideal type in those situations, but it
avoids breaking existing semantics in this cleanup.
This patch provably has no effect at runtime: I have verified that
.text of vmlinux is identical after this change.
Signed-off-by: Sam Edwards <CFSworks@gmail.com> Signed-off-by: Will Deacon <will@kernel.org>
Sam Edwards [Thu, 4 Sep 2025 00:52:08 +0000 (17:52 -0700)]
arm64: mm: Make map_fdt() return mapped pointer
Currently map_fdt() accepts a physical address and relies on the caller
to keep using the same value after mapping, since the implementation
happens to install an identity mapping. This obscures the fact that the
usable pointer is defined by the mapping, not by the input value. Since
the mapping determines pointer validity, it is more natural to produce
the pointer at mapping time.
Change map_fdt() to return a void * pointing to the mapped FDT. This
clarifies the data flow, removes the implicit identity assumption, and
prepares for making map_fdt() accept a phys_addr_t in a follow-up
change.
Signed-off-by: Sam Edwards <CFSworks@gmail.com> Signed-off-by: Will Deacon <will@kernel.org>
Sam Edwards [Thu, 4 Sep 2025 00:52:07 +0000 (17:52 -0700)]
arm64: mm: Cast start/end markers to char *, not u64
There are a few memset() calls in map_kernel.c that cast marker-symbol
addresses to u64 in order to perform pointer subtraction (range size
computation).
Cast them with (char *) instead, aligning with idiomatic C pointer
arithmetic.
This patch provably has no effect at runtime: I have verified that
.text of vmlinux is identical after this change.
Signed-off-by: Sam Edwards <CFSworks@gmail.com> Signed-off-by: Will Deacon <will@kernel.org>
Mark Brown [Mon, 18 Aug 2025 19:21:18 +0000 (20:21 +0100)]
arm64/hwcap: Add hwcap for FEAT_LSFE
FEAT_LSFE (Large System Float Extension), providing atomic floating point
memory operations, is optional from v9.5. This feature adds no new
architectural state and we have no immediate use for it in the kernel, so
simply provide a hwcap for it to support discovery by userspace.
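A small, runnable userspace sketch of hwcap discovery; the HWCAP word and bit
name for FEAT_LSFE come from asm/hwcap.h, so no specific constant is assumed
here:

  #include <stdio.h>
  #include <sys/auxv.h>

  int main(void)
  {
          /* FEAT_LSFE's hwcap bit lives in whichever AT_HWCAPn word
           * asm/hwcap.h defines it in; test it against that word. */
          printf("AT_HWCAP=0x%lx AT_HWCAP2=0x%lx\n",
                 getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
          return 0;
  }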
Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:33 +0000 (11:06 +0800)]
arm64: entry: Switch to generic IRQ entry
Currently, x86, RISC-V and LoongArch use the generic entry code, which makes
maintainers' work easier and the code more elegant. Start converting arm64 to
use the generic entry infrastructure from kernel/entry/* by switching it to
the generic IRQ entry code, which removes 100+ lines of duplicate code. arm64
will completely switch to the generic entry code in a later series.
The changes are below:
- Remove *enter_from/exit_to_kernel_mode(), and wrap with generic
irqentry_enter/exit() as their code and functionality are almost
identical.
- Define ARCH_EXIT_TO_USER_MODE_WORK and implement
arch_exit_to_user_mode_work() to check arm64-specific thread flags
"_TIF_MTE_ASYNC_FAULT" and "_TIF_FOREIGN_FPSTATE".
So also remove *enter_from/exit_to_user_mode(), and wrap with
generic enter_from/exit_to_user_mode() because they are
exactly the same.
- Remove arm64_enter/exit_nmi() and use generic irqentry_nmi_enter/exit()
because they're exactly the same, so the temporary arm64 version
irqentry_state can also be removed.
- Remove PREEMPT_DYNAMIC code, as generic irqentry_exit_cond_resched()
has the same functionality.
- Implement arch_irqentry_exit_need_resched() with
arm64_preempt_schedule_irq() for arm64 which will allow arm64 to do
its architecture specific checks.
Tested-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Suggested-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:32 +0000 (11:06 +0800)]
arm64: entry: Move arm64_preempt_schedule_irq() into __exit_to_kernel_mode()
The arm64 entry code only preempts a kernel context upon a return from
a regular IRQ exception. The generic entry code may preempt a kernel
context for any exception return where irqentry_exit() is used, and so
may preempt other exceptions such as faults.
In preparation for moving arm64 over to the generic entry code, align
arm64 with the generic behaviour by calling
arm64_preempt_schedule_irq() from exit_to_kernel_mode(). To make this
possible, arm64_preempt_schedule_irq()
and dynamic/raw_irqentry_exit_cond_resched() are moved earlier in
the file, with no changes.
As Mark pointed out, this change will have the following 2 key impact:
- " We'll preempt even without taking a "real" interrupt. That
shouldn't result in preemption that wasn't possible before,
but it does change the probability of preempting at certain points,
and might have a performance impact, so probably warrants a
benchmark."
- " We will not preempt when taking interrupts from a region of kernel
code where IRQs are enabled but RCU is not watching, matching the
behaviour of the generic entry code.
This has the potential to introduce livelock if we can ever have a
screaming interrupt in such a region, so we'll need to go figure out
whether that's actually a problem.
Having this as a separate patch will make it easier to test/bisect
for that specifically."
Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
To align the structure of the code with irqentry_exit_cond_resched() from
the generic entry code, hoist the need_irq_preemption() and IS_ENABLED()
checks earlier. Different preemption check functions are defined based on
whether dynamic preemption is enabled.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:30 +0000 (11:06 +0800)]
entry: Add arch_irqentry_exit_need_resched() for arm64
Compared to the generic entry code, ARM64 does additional checks
when deciding to reschedule on return from interrupt. So introduce
arch_irqentry_exit_need_resched() in the need_resched()
condition of the generic raw_irqentry_exit_cond_resched(), with
a NOP default. This will allow ARM64 to implement the architecture
specific version for switching over to the generic entry code.
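A rough sketch of how such a hook slots into the generic resched check
(simplified; the exact wiring is in the generic entry code):

  /* Generic side (sketch): architectures may override the default. */
  static inline bool arch_irqentry_exit_need_resched(void) { return true; }

  /* Inside raw_irqentry_exit_cond_resched() (sketch): */
  if (!preempt_count() && need_resched() &&
      arch_irqentry_exit_need_resched())
          preempt_schedule_irq();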
Suggested-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Suggested-by: Kevin Brodsky <kevin.brodsky@arm.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:29 +0000 (11:06 +0800)]
arm64: entry: Use preempt_count() and need_resched() helper
The generic entry code uses the preempt_count() and need_resched() helpers to
check whether it should do preempt_schedule_irq(). Currently, arm64 uses its
own check logic, namely "READ_ONCE(current_thread_info()->preempt_count) == 0",
which is equivalent to "preempt_count() == 0 && need_resched()".
In preparation for moving arm64 over to the generic entry code, use these
helpers to replace arm64's own code and move it earlier.
No functional changes.
Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:28 +0000 (11:06 +0800)]
arm64: entry: Rework arm64_preempt_schedule_irq()
The generic entry code has the form:
| raw_irqentry_exit_cond_resched()
| {
| if (!preempt_count()) {
| ...
| if (need_resched())
| preempt_schedule_irq();
| }
| }
In preparation for moving arm64 over to the generic entry code, align
the structure of the arm64 code with raw_irqentry_exit_cond_resched() from
the generic entry code.
Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:27 +0000 (11:06 +0800)]
arm64: entry: Refactor the entry and exit for exceptions from EL1
The generic entry code uses irqentry_state_t to track lockdep and RCU
state across exception entry and return. For historical reasons, arm64
embeds similar fields within its pt_regs structure.
In preparation for moving arm64 over to the generic entry code, pull
these fields out of arm64's pt_regs, and use a separate structure,
matching the style of the generic entry code.
No functional changes.
Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Jinjie Ruan [Fri, 15 Aug 2025 03:06:26 +0000 (11:06 +0800)]
arm64: ptrace: Replace interrupts_enabled() with regs_irqs_disabled()
The generic entry code expects architecture code to provide
regs_irqs_disabled(regs) function, but arm64 does not have this and
provides interrupts_enabled(regs), which has the opposite polarity.
In preparation for moving arm64 over to the generic entry code, replace
arm64's interrupts_enabled() with regs_irqs_disabled() and update its callers
under arch/arm64.
For the moment, a definition of interrupts_enabled() is provided for
the GICv3 driver. Once arch/arm implement regs_irqs_disabled(), this
can be removed.
Delete the fast_interrupts_enabled() macro as it is unused and we
don't want any new users to show up.
No functional changes.
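For illustration, the new helper amounts to checking the I bit in the saved
PSTATE (a sketch; see the patch for the exact definition):

  /* Sketch: inverse-polarity replacement for interrupts_enabled(). */
  static inline bool regs_irqs_disabled(const struct pt_regs *regs)
  {
          return regs->pstate & PSR_I_BIT;
  }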
Reviewed-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Fuad Tabba [Fri, 29 Aug 2025 09:51:43 +0000 (10:51 +0100)]
arm64: sysreg: Add validation checks to sysreg header generation script
The gen_sysreg.awk script processes the system register specification in
the sysreg text file to generate C macro definitions. The current script
will silently accept certain errors in the specification file, leading
to incorrect header generation.
For example, a Sysreg or SysregFields can be accidentally duplicated,
causing its macros to be emitted twice. An Enum can contain duplicate
values for different items, which is architecturally incorrect.
Add checks to catch these errors at build time. The script now tracks
all seen Sysreg and SysregFields definitions and checks for duplicates.
It also tracks values within each Enum block to ensure entries are
unique.
Acked-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Fuad Tabba [Fri, 29 Aug 2025 09:51:42 +0000 (10:51 +0100)]
arm64: sysreg: Correct sign definitions for EIESB and DoubleLock
The `ID_AA64MMFR4_EL1.EIESB` field is an unsigned enumeration, but was
incorrectly defined as a `SignedEnum` when introduced in commit cfc680bb04c5
("arm64: sysreg: Add layout for ID_AA64MMFR4_EL1"). This is corrected to
`UnsignedEnum`.
Conversely, the `ID_AA64DFR0_EL1.DoubleLock` field is a signed enumeration,
but was incorrectly defined as an `UnsignedEnum`; it wasn't correctly set
when annotated as such in commit ad16d4cf0b4f ("arm64/sysreg: Initial
unsigned annotations for ID registers"). This is corrected to `SignedEnum`.
Signed-off-by: Fuad Tabba <tabba@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Fuad Tabba [Fri, 29 Aug 2025 09:51:41 +0000 (10:51 +0100)]
arm64: sysreg: Fix and tidy up sysreg field definitions
Fix the value of ID_PFR1_EL1.Security NSACR_RFR to be 0b0010, as per
DDI0601/2025-06, which wasn't correctly set when introduced in commit 1224308075f1 ("arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation").
While at it, remove redundant definitions of CPACR_EL12 and
RCWSMASK_EL1 and fix some typos.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Nikola Z. Ivanov [Tue, 26 Aug 2025 21:49:13 +0000 (00:49 +0300)]
selftests/arm64: Fix grammatical error in string literals
Fix grammatical error in <past tense verb> + <infinitive>
construct related to memory allocation checks.
In essence change "Failed to allocated" to "Failed to allocate".
Signed-off-by: Nikola Z. Ivanov <zlatistiv@gmail.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Will Deacon <will@kernel.org>
Vivek Yadav [Sun, 24 Aug 2025 06:14:01 +0000 (23:14 -0700)]
kselftest/arm64: Supress warning and improve readability
The comment was correct, but the `checkpatch` script flagged it with a
warning as shown in the output section. The comment is slightly modified to
improve readability, which also suppresses the warning.
Thomas Weißschuh [Thu, 21 Aug 2025 15:13:02 +0000 (17:13 +0200)]
kselftest/arm64/gcs: Correctly check return value when disabling GCS
The return value was not assigned to 'ret', so the check afterwards
does not do anything.
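The shape of the fix, roughly (a sketch only; the prctl arguments and nolibc
syscall wrapper shown here are illustrative):

  /* Sketch: actually capture the return value before checking it. */
  ret = my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0);
  if (ret != 0)
          return false;   /* disabling GCS failed */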
Fixes: 3d37d4307e0f ("kselftest/arm64: Add very basic GCS test program") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
selftests: arm64: Fix -Waddress warning in tpidr2 test
Thanks to -Waddress, the compiler warns that the ksft_test_result()
invocations in the arm64 tpidr2 selftest are always true. Oops.
Fix the test by, err, actually running the test functions.
Fixes: 6d80cb73131d ("kselftest/arm64: Convert tpidr2 test to use kselftest.h") Signed-off-by: Bala-Vignesh-Reddy <reddybalavignesh9979@gmail.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
selftests: arm64: Check fread return value in exec_target
Fix -Wunused-result warning generated when compiled with gcc 13.3.0,
by checking fread's return value and handling errors, preventing
potential failures when reading from stdin.
Fixes compiler warning:
warning: ignoring return value of 'fread' declared with attribute
'warn_unused_result' [-Wunused-result]
Fixes: 806a15b2545e ("kselftests/arm64: add PAuth test for whether exec() changes keys") Signed-off-by: Bala-Vignesh-Reddy <reddybalavignesh9979@gmail.com> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Mark Brown [Wed, 23 Jul 2025 12:27:45 +0000 (13:27 +0100)]
arm64/sme: Drop inaccurate documentation of streaming mode switches
The SME ABI documentation contains an inaccurate description of the
architectural streaming mode entry/exit behaviour, just remove it since
this is better documented by the architecture or with the rest of the
documentation for the specific software interfaces concerned.
Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
arm64: ftrace: fix unreachable PLT for ftrace_caller in init_module with CONFIG_DYNAMIC_FTRACE
On arm64, it has been possible for a module's sections to be placed more
than 128M away from each other since commit:
commit 3e35d303ab7d ("arm64: module: rework module VA range selection")
Due to this, an ftrace callsite in a module's .init.text section can be
out of branch range for the module's ftrace PLT entry (in the module's
.text section). Any attempt to enable tracing of that callsite will
result in a BRK being patched into the callsite, resulting in a fatal
exception when the callsite is later executed.
Fix this by adding an additional trampoline for .init.text, which will
be within range.
No additional trampolines are necessary due to the way a given
module's executable sections are packed together. Any executable
section beginning with ".init" will be placed in MOD_INIT_TEXT,
and any other executable section, including those beginning with ".exit",
will be placed in MOD_TEXT.
Fixes: 3e35d303ab7d ("arm64: module: rework module VA range selection") Cc: <stable@vger.kernel.org> # 6.5.x Signed-off-by: panfan <panfan@qti.qualcomm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20250905032236.3220885-1-panfan@qti.qualcomm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Miaoqian Lin [Thu, 28 Aug 2025 11:22:43 +0000 (19:22 +0800)]
ACPI/IORT: Fix memory leak in iort_rmr_alloc_sids()
If krealloc_array() fails in iort_rmr_alloc_sids(), the function returns
NULL but does not free the original 'sids' allocation. This results in a
memory leak since the caller overwrites the original pointer with the
NULL return value.
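A condensed sketch of the fix pattern (variable names are illustrative, not
lifted from the driver):

  /* Sketch: don't lose the original buffer when krealloc_array() fails. */
  new_sids = krealloc_array(sids, count + new_count,
                            sizeof(*new_sids), GFP_KERNEL);
  if (!new_sids) {
          kfree(sids);    /* free the original allocation on failure */
          return NULL;
  }
  sids = new_sids;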
Fixes: 491cf4a6735a ("ACPI/IORT: Add support to retrieve IORT RMR reserved regions") Cc: <stable@vger.kernel.org> # 6.0.x Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/20250828112243.61460-1-linmq006@gmail.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Thomas Weißschuh [Thu, 21 Aug 2025 07:56:44 +0000 (09:56 +0200)]
arm64: uapi: Provide correct __BITS_PER_LONG for the compat vDSO
The generic vDSO library uses the UAPI headers. On arm64 __BITS_PER_LONG is
always '64' even when used from the compat vDSO. In that case __GENMASK()
does an illegal bitshift, invoking undefined behaviour.
Change __BITS_PER_LONG to also work when used from the compat vDSO.
To not confuse real userspace, only do this when building the kernel.
Reported-by: John Stultz <jstultz@google.com> Closes: https://lore.kernel.org/lkml/CANDhNCqvKOc9JgphQwr0eDyJiyG4oLFS9R8rSFvU0fpurrJFDg@mail.gmail.com/ Fixes: cd3557a7618b ("vdso/gettimeofday: Add support for auxiliary clocks") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250821-vdso-arm64-compat-bitsperlong-v1-1-700bcabe7732@linutronix.de Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown [Tue, 12 Aug 2025 14:49:27 +0000 (15:49 +0100)]
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
In fp-ptrace, when allocating a buffer to write SVE register data, we open
code the addition of the header size to the VL-dependent register data size,
which led to an underallocation bug when we cut'n'pasted the code for FPSIMD
format writes. Use the SVE_PT_SIZE() macro that the kernel UAPI provides for
this.
Linus Torvalds [Sun, 10 Aug 2025 05:51:37 +0000 (08:51 +0300)]
Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fixes from Borislav Petkov:
- Remove an obsolete comment and fix spelling
* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Remove obsolete comment from takedown_cpu()
smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment