Catalin Marinas [Fri, 28 Nov 2025 15:47:53 +0000 (15:47 +0000)]
Merge branch 'for-next/sysreg' into for-next/core
* for-next/sysreg:
: arm64 sysreg updates/cleanups
arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
KVM: arm64: selftests: Consider all 7 possible levels of cache
KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
arm64/sysreg: Add ICH_VMCR_EL2
arm64/sysreg: Move generation of RES0/RES1/UNKN to function
arm64/sysreg: Support feature-specific fields with 'Prefix' descriptor
arm64/sysreg: Fix checks for incomplete sysreg definitions
arm64/sysreg: Replace TCR_EL1 field macros
Catalin Marinas [Fri, 28 Nov 2025 15:47:12 +0000 (15:47 +0000)]
Merge branches 'for-next/misc', 'for-next/kselftest', 'for-next/efi-preempt', 'for-next/assembler-macro', 'for-next/typos', 'for-next/sme-ptrace-disable', 'for-next/local-tlbi-page-reused', 'for-next/mpam', 'for-next/acpi' and 'for-next/documentation', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
perf: arm_spe: Add support for filtering on data source
perf: Add perf_event_attr::config4
perf/imx_ddr: Add support for PMU in DB (system interconnects)
perf/imx_ddr: Get and enable optional clks
perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
arch_topology: Provide a stub topology_core_has_smt() for !CONFIG_GENERIC_ARCH_TOPOLOGY
perf/arm-ni: Fix and optimise register offset calculation
perf: arm_pmuv3: Add new Cortex and C1 CPU PMUs
perf: arm_cspmu: fix error handling in arm_cspmu_impl_unregister()
perf/arm-ni: Add NoC S3 support
perf/arm_cspmu: nvidia: Add pmevfiltr2 support
perf/arm_cspmu: nvidia: Add revision id matching
perf/arm_cspmu: Add pmpidr support
perf/arm_cspmu: Add callback to reset filter config
perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
* for-next/misc:
: Miscellaneous patches
arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
arm64: mm: use untagged address to calculate page index
arm64: mm: make linear mapping permission update more robust for partial range
arm64/mm: Elide TLB flush in certain pte protection transitions
arm64/mm: Rename try_pgd_pgtable_alloc_init_mm
arm64/mm: Allow __create_pgd_mapping() to propagate pgtable_alloc() errors
arm64: add unlikely hint to MTE async fault check in el0_svc_common
arm64: acpi: add newline to deferred APEI warning
arm64: entry: Clean out some indirection
arm64/mm: Ensure PGD_SIZE is aligned to 64 bytes when PA_BITS = 52
arm64/mm: Drop cpu_set_[default|idmap]_tcr_t0sz()
arm64: remove unused ARCH_PFN_OFFSET
arm64: use SOFTIRQ_ON_OWN_STACK for enabling softirq stack
arm64: Remove assertion on CONFIG_VMAP_STACK
* for-next/efi-preempt:
: arm64: Make EFI calls preemptible
arm64/efi: Call EFI runtime services without disabling preemption
arm64/efi: Move uaccess en/disable out of efi_set_pgd()
arm64/efi: Drop efi_rt_lock spinlock from EFI arch wrapper
arm64/fpsimd: Permit kernel mode NEON with IRQs off
arm64/fpsimd: Don't warn when EFI execution context is preemptible
efi/runtime-wrappers: Keep track of the efi_runtime_lock owner
efi: Add missing static initializer for efi_mm::cpus_allowed_lock
* for-next/assembler-macro:
: arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers
arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-uapi headers
arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers
* for-next/typos:
: Random typo/spelling fixes
arm64: Fix double word in comments
arm64: Fix typos and spelling errors in comments
* for-next/sme-ptrace-disable:
: Support disabling streaming mode via ptrace on SME only systems
kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
kselftest/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
arm64/sme: Support disabling streaming mode via ptrace on SME only systems
* for-next/local-tlbi-page-reused:
: arm64, mm: avoid TLBI broadcast if page reused in write fault
arm64, tlbflush: don't TLBI broadcast if page reused in write fault
mm: add spurious fault fixing support for huge pmd
* for-next/mpam: (34 commits)
: Basic Arm MPAM driver (more to follow)
MAINTAINERS: new entry for MPAM Driver
arm_mpam: Add kunit tests for props_mismatch()
arm_mpam: Add kunit test for bitmap reset
arm_mpam: Add helper to reset saved mbwu state
arm_mpam: Use long MBWU counters if supported
arm_mpam: Probe for long/lwd mbwu counters
arm_mpam: Consider overflow in bandwidth counter state
arm_mpam: Track bandwidth counter state for power management
arm_mpam: Add mpam_msmon_read() to read monitor value
arm_mpam: Add helpers to allocate monitors
arm_mpam: Probe and reset the rest of the features
arm_mpam: Allow configuration to be applied and restored during cpu online
arm_mpam: Use a static key to indicate when mpam is enabled
arm_mpam: Register and enable IRQs
arm_mpam: Extend reset logic to allow devices to be reset any time
arm_mpam: Add a helper to touch an MSC from any CPU
arm_mpam: Reset MSC controls from cpuhp callbacks
arm_mpam: Merge supported features during mpam_enable() into mpam_class
arm_mpam: Probe the hardware features resctrl supports
arm_mpam: Add helpers for managing the locking around the mon_sel registers
...
* for-next/acpi:
: arm64 acpi updates
ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
* for-next/documentation:
: arm64 Documentation updates
Documentation/arm64: Fix the typo of register names
Ben Horgan [Tue, 25 Nov 2025 10:12:07 +0000 (10:12 +0000)]
KVM: arm64: selftests: Consider all 7 possible levels of cache
In test_clidr(), if an empty cache level is not found then the TEST_ASSERT
will not fire. Fix this by considering all 7 possible levels when iterating
through the hierarchy. Found by inspection.
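The fix amounts to iterating the full range; a standalone sketch with
illustrative names (the selftest's actual variables differ):
  #define MAX_CACHE_LEVELS 7    /* CLIDR_EL1 can describe levels 1..7 */

  /* Return the first level with no cache, or 0 if every level is populated. */
  static int first_empty_level(const unsigned int ctype[MAX_CACHE_LEVELS])
  {
          for (int level = 1; level <= MAX_CACHE_LEVELS; level++) {
                  if (ctype[level - 1] == 0)      /* 0b000: no cache here */
                          return level;
          }
          return 0;       /* not found: the caller's TEST_ASSERT can fire */
  }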
Signed-off-by: Ben Horgan <ben.horgan@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ben Horgan [Tue, 25 Nov 2025 10:12:06 +0000 (10:12 +0000)]
KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
ARM64_FEATURE_FIELD_BITS is set to 4 but not all ID register fields are 4
bits wide. See for instance ID_AA64SMFR0_EL1. The last user of this define
is the set_id_regs selftest. Its logic assumes the fields aren't a single
bit wide; assert that's the case and stop using the define. As there are
no more users, ARM64_FEATURE_FIELD_BITS is removed from the arm64 tools
sysreg.h header. A separate commit removes it from the kernel version of
the header.
Signed-off-by: Ben Horgan <ben.horgan@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Seongsu Park [Wed, 26 Nov 2025 02:10:25 +0000 (11:10 +0900)]
arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
The ATOMIC_FETCH_OP_AND and ATOMIC64_FETCH_OP_AND macros accept 'mb' and
'cl' parameters but never use them in their implementation. These macros
simply delegate to the corresponding andnot functions, which handle the
actual atomic operations and memory barriers.
Signed-off-by: Seongsu Park <sgsu.park@samsung.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Zenon Xiu [Tue, 11 Nov 2025 09:35:39 +0000 (17:35 +0800)]
Documentation/arm64: Fix the typo of register names
The register names 'HWFGWTR_EL2' and 'HWFGRTR_EL2' are wrong; they should be 'HFGWTR_EL2' and 'HFGRTR_EL2'.
The register descriptions can be found on the Arm website:
https://developer.arm.com/documentation/ddi0601/2025-09/AArch64-Registers/HFGWTR-EL2--Hypervisor-Fine-Grained-Write-Trap-Register
https://developer.arm.com/documentation/ddi0601/2025-09/AArch64-Registers/HFGRTR-EL2--Hypervisor-Fine-Grained-Read-Trap-Register?lang=en
James Clark [Tue, 11 Nov 2025 11:37:56 +0000 (11:37 +0000)]
perf: arm_spe: Add support for filtering on data source
FEAT_SPE_FDS adds the ability to filter on the data source of packets.
Like the other existing filters, enable filtering with PMSFCR_EL1.FDS
when any of the filter bits are set.
Each bit position of the 64 bit filter maps to numerical data sources
0-63 described by bits[0:5] in the data source packet (although the full
range of data source is 16 bits so higher value data sources can't be
filtered on). The filter is an OR of all the filter bits, so for example
enabling only filter bits 0 and 3 includes only packets from data sources
0 or 3.
Invert the filter given by userspace so that the default value of 0 is
equivalent to including all values (no filtering). This allows us to
skip adding a new format bit to enable filtering and still support
excluding all data sources which would have been a filter value of 0 if
not for the inversion.
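A minimal sketch of that inversion (function and parameter names are
illustrative):
  #include <stdint.h>

  /* A user value of 0 (the default) must mean "include everything", so
   * the value programmed into the hardware filter is the inverse. */
  static uint64_t spe_dsfilter_to_reg(uint64_t attr_filter)
  {
          return ~attr_filter;
  }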
Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
James Clark [Tue, 11 Nov 2025 11:37:55 +0000 (11:37 +0000)]
perf: Add perf_event_attr::config4
Arm FEAT_SPE_FDS adds the ability to filter on the data source of a
packet using another 64 bits of event filtering control. As the existing
perf_event_attr::configN fields are all used up for the SPE PMU, an
additional field is needed. Add a new 'config4' field.
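A hedged usage sketch, assuming the new field sits alongside
config1..config3 in struct perf_event_attr from an updated header;
spe_pmu_type and ds_filter are placeholders:
  #include <stdint.h>
  #include <string.h>
  #include <linux/perf_event.h>

  static void setup_spe_attr(struct perf_event_attr *attr,
                             uint32_t spe_pmu_type, uint64_t ds_filter)
  {
          memset(attr, 0, sizeof(*attr));
          attr->size = sizeof(*attr);
          attr->type = spe_pmu_type;      /* dynamic PMU type from sysfs */
          attr->config4 = ds_filter;      /* new 64-bit filter control */
  }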
Reviewed-by: Leo Yan <leo.yan@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
Joakim Zhang [Mon, 3 Nov 2025 21:48:31 +0000 (16:48 -0500)]
perf/imx_ddr: Add support for PMU in DB (system interconnects)
There is a PMU in the DB (system interconnect) which has the same function
as the PMU in the DDR subsystem; the difference is that the DB PMU only
supports the cycles, axid-read and axid-write events.
e.g.
perf stat -a -e imx8_db0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD,axi_port=0xPP,axi_channel=0xH/ cmd
perf stat -a -e imx8_db0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD,axi_port=0xPP,axi_channel=0xH/ cmd
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Will Deacon <will@kernel.org>
Cai Xinchen [Thu, 20 Nov 2025 01:30:25 +0000 (01:30 +0000)]
arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
Commit e7bafbf717775 ("arm64: mm: Add top-level dispatcher for
internal mem_encrypt API") added ARCH_HAS_MEM_ENCRYPT. Commit
42be24a4178fe ("arm64: Enable memory encrypt for Realms") then added a
duplicate config. Remove the duplicate.
Fixes: 42be24a4178f ("arm64: Enable memory encrypt for Realms") Signed-off-by: Cai Xinchen <caixinchen1@huawei.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Yang Shi
arm64: mm: use untagged address to calculate page index
This is because the KASAN tagged address was used when calculating the
page index; the untagged address should be used instead.
Fixes: 37cb0aab9068 ("arm64: mm: make linear mapping permission update more robust for partial range") Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ben Horgan [Wed, 19 Nov 2025 12:23:04 +0000 (12:23 +0000)]
MAINTAINERS: new entry for MPAM Driver
Create a maintainer entry for the new MPAM Driver. Add myself and
James Morse as maintainers. James created the driver and I have
taken up the later versions of his series.
Cc: James Morse <james.morse@arm.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:23:01 +0000 (12:23 +0000)]
arm_mpam: Add helper to reset saved mbwu state
resctrl expects to reset the bandwidth counters when the filesystem
is mounted.
To allow this, add a helper that clears the saved mbwu state. Instead
of cross calling to each CPU that can access the component MSC to
write to the counter, set a flag that causes the state to be zeroed on
the next read. This is easily done by forcing a configuration update.
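Conceptually (struct and field names here are assumptions, not the
driver's):
  #include <stdbool.h>
  #include <stdint.h>

  struct mbwu_state_sketch {
          uint64_t correction;            /* accumulated counter correction */
          bool reset_on_next_read;        /* zero the state on the next read */
  };

  /* No cross call to the MSC's CPUs: flag the state and let the next
   * read (forced by a configuration update) zero it. */
  static void mbwu_reset_sketch(struct mbwu_state_sketch *s)
  {
          s->reset_on_next_read = true;
  }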
Signed-off-by: James Morse <james.morse@arm.com> Cc: Peter Newman <peternewman@google.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Rohit Mathew [Wed, 19 Nov 2025 12:23:00 +0000 (12:23 +0000)]
arm_mpam: Use long MBWU counters if supported
Now that the larger counter sizes are probed, make use of them.
Callers of mpam_msmon_read() may not know (or care!) about the different
counter sizes. Allow them to specify mpam_feat_msmon_mbwu and have the
driver pick the counter to use.
Only 32-bit accesses to the MSC are required to be supported by the
spec, but these registers are 64 bits. The lower half may overflow
into the upper half between two 32-bit reads. To avoid this, use a
helper that re-reads the top half to check for overflow.
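The pattern is the classic high-low-high read; a sketch with
hypothetical 32-bit accessors for the two register halves:
  #include <stdint.h>

  /* read_hi()/read_lo() stand in for 32-bit MMIO reads of one 64-bit counter. */
  static uint64_t read64_nonatomic(uint32_t (*read_hi)(void),
                                   uint32_t (*read_lo)(void))
  {
          uint32_t hi, lo;

          do {
                  hi = read_hi();
                  lo = read_lo();
          } while (hi != read_hi());      /* low half carried over: retry */

          return ((uint64_t)hi << 32) | lo;
  }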
Signed-off-by: Rohit Mathew <rohit.mathew@arm.com>
[morse: merged multiple patches from Rohit, added explicit counter selection ] Signed-off-by: James Morse <james.morse@arm.com> Cc: Peter Newman <peternewman@google.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Rohit Mathew [Wed, 19 Nov 2025 12:22:59 +0000 (12:22 +0000)]
arm_mpam: Probe for long/lwd mbwu counters
MPAM v0.1 and versions above v1.0 support an optional long counter for
memory bandwidth monitoring. The MPAMF_MBWUMON_IDR register has fields
indicating support for long counters.
Probe these feature bits.
The mpam_feat_msmon_mbwu feature is used to indicate that bandwidth
monitors are supported. Instead of muddling this with the size of the
bandwidth monitors, add an explicit 31-bit counter feature.
Signed-off-by: Rohit Mathew <rohit.mathew@arm.com>
[ morse: Added 31bit counter feature to simplify later logic ] Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ben Horgan [Wed, 19 Nov 2025 12:22:58 +0000 (12:22 +0000)]
arm_mpam: Consider overflow in bandwidth counter state
Use the overflow status bit to track overflow on each bandwidth counter
read and add the counter size to the correction when overflow is detected.
This assumes that only a single overflow has occurred since the last read
of the counter. Overflow interrupts, on hardware that supports them, could
be used to remove this limitation.
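As a sketch, the correction under that single-overflow assumption
(counter_bits is the probed counter width, e.g. 31 for the default
MBWU counters):
  #include <stdint.h>

  static uint64_t mbwu_delta(uint64_t prev, uint64_t now,
                             unsigned int counter_bits, int overflow_seen)
  {
          if (overflow_seen)
                  now += 1ULL << counter_bits;    /* undo one wrap */
          return now - prev;
  }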
James Morse [Wed, 19 Nov 2025 12:22:57 +0000 (12:22 +0000)]
arm_mpam: Track bandwidth counter state for power management
Bandwidth counters need to run continuously to correctly reflect the
bandwidth.
Save the counter state when the hardware is reset due to CPU hotplug.
Add struct mbwu_state to track the bandwidth counter. Support for
tracking overflow with the same structure will be added in a
subsequent commit.
Cc: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:56 +0000 (12:22 +0000)]
arm_mpam: Add mpam_msmon_read() to read monitor value
Reading a monitor involves configuring what you want to monitor, and
reading the value. Components made up of multiple MSC may need values
from each MSC. MSCs may take time to configure, returning 'not ready'.
The maximum 'not ready' time should have been provided by firmware.
Add mpam_msmon_read() to hide all this. If (one of) the MSC returns
not ready, then wait the full timeout value before trying again.
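The retry flow, sketched with stand-in callbacks (the driver's real
interfaces differ):
  #include <stdbool.h>
  #include <stdint.h>

  /* try_read() stands in for one hardware read attempt; wait_max_not_ready()
   * waits the firmware-provided maximum 'not ready' time. */
  static int msmon_read_sketch(bool (*try_read)(uint64_t *val),
                               void (*wait_max_not_ready)(void), uint64_t *val)
  {
          if (try_read(val))
                  return 0;
          wait_max_not_ready();   /* wait the full timeout before retrying */
          if (try_read(val))
                  return 0;
          return -1;              /* still not ready */
  }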
CC: Shanker Donthineni <sdonthineni@nvidia.com> Cc: Shaopeng Tan (Fujitsu) <tan.shaopeng@fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:55 +0000 (12:22 +0000)]
arm_mpam: Add helpers to allocate monitors
MPAM's MSC support a number of monitors, each of which supports
bandwidth counters, or cache-storage-utilisation counters. To use
a counter, a monitor needs to be configured. Add helpers to allocate
and free CSU or MBWU monitors.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:54 +0000 (12:22 +0000)]
arm_mpam: Probe and reset the rest of the features
MPAM supports more features than are going to be exposed to resctrl.
For partids other than 0, the reset values of these controls aren't
known.
Discover the rest of the features so they can be reset to avoid any
side effects when resctrl is in use.
PARTID narrowing allows MSC/RIS to support less configuration space than
is usable. If this feature is found on a class of device we are likely
to use, then reduce the partid_max to make it usable. This allows us
to map a PARTID to itself.
CC: Rohit Mathew <Rohit.Mathew@arm.com> CC: Zeng Heng <zengheng4@huawei.com> CC: Dave Martin <Dave.Martin@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:53 +0000 (12:22 +0000)]
arm_mpam: Allow configuration to be applied and restored during cpu online
When CPUs come online the MSC's original configuration should be restored.
Add struct mpam_config to hold the configuration. For each component, this
has a bitmap of features that have been changed from the reset values. The
mpam_config is also used on RIS reset where all bits are set to ensure all
features are reset.
Once the maximum partid is known, allocate a configuration array for each
component, and reprogram each RIS configuration from this.
CC: Dave Martin <Dave.Martin@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Cc: Shaopeng Tan <tan.shaopeng@fujitsu.com> Cc: Peter Newman <peternewman@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:52 +0000 (12:22 +0000)]
arm_mpam: Use a static key to indicate when mpam is enabled
Once all the MSC have been probed, the system wide usable number of
PARTID is known and the configuration arrays can be allocated.
After this point, checking all the MSC have been probed is pointless,
and the cpuhp callbacks should restore the configuration, instead of
just resetting the MSC.
Add a static key to enable this behaviour. This will also allow MPAM
to be disabled in response to an error, and the architecture code to
enable/disable the context switch of the MPAM system registers.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:51 +0000 (12:22 +0000)]
arm_mpam: Register and enable IRQs
Register and enable error IRQs. All the MPAM error interrupts indicate a
software bug, e.g. out of range partid. If the error interrupt is ever
signalled, attempt to disable MPAM.
Only the irq handler accesses the MPAMF_ESR register, so no locking is
needed. The work to disable MPAM after an error needs to happen in process
context as it takes a mutex. It also unregisters the interrupts, meaning
it can't be done from the threaded part of a threaded interrupt.
Instead, mpam_disable() gets scheduled.
Enabling the IRQs in the MSC may involve cross calling to a CPU that
can access the MSC.
Once the IRQ is requested, the mpam_disable() path can be called
asynchronously, which will walk structures sized by max_partid. Ensure
this size is fixed before the interrupt is requested.
CC: Rohit Mathew <rohit.mathew@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Rohit Mathew <rohit.mathew@arm.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:50 +0000 (12:22 +0000)]
arm_mpam: Extend reset logic to allow devices to be reset any time
cpuhp callbacks aren't the only time the MSC configuration may need to
be reset. Resctrl has an API call to reset a class.
If an MPAM error interrupt arrives it indicates the driver has
misprogrammed an MSC. The safest thing to do is reset all the MSCs
and disable MPAM.
Add a helper to reset RIS via their class. Call this from mpam_disable(),
which can be scheduled from the error interrupt handler.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:49 +0000 (12:22 +0000)]
arm_mpam: Add a helper to touch an MSC from any CPU
Resetting RIS entries from the cpuhp callback is easy as the
callback occurs on the correct CPU. This won't be true for any other
caller that wants to reset or configure an MSC.
Add a helper that schedules the provided function if necessary.
Callers should take the cpuhp lock to prevent the cpuhp callbacks from
changing the MSC state.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:48 +0000 (12:22 +0000)]
arm_mpam: Reset MSC controls from cpuhp callbacks
When a CPU comes online, it may bring a newly accessible MSC with
it. Only the default partid has its value reset by hardware, and
even then the MSC might not have been reset since its config was
previously dirtied, e.g. by kexec.
Any in-use partid must have its configuration restored, or reset.
In-use partids may be held in caches and evicted later.
MSC are also reset when CPUs are taken offline to cover cases where
firmware doesn't reset the MSC over reboot using UEFI, or kexec
where there is no firmware involvement.
If the configuration for a RIS has not been touched since it was
brought online, it does not need resetting again.
To reset, write the maximum values for all discovered controls.
CC: Rohit Mathew <Rohit.Mathew@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:47 +0000 (12:22 +0000)]
arm_mpam: Merge supported features during mpam_enable() into mpam_class
To make a decision about whether to expose an mpam class as
a resctrl resource we need to know its overall supported
features and properties.
Once we've probed all the resources, we can walk the tree
and produce overall values by merging the bitmaps. This
eliminates features that are only supported by some MSC
that make up a component or class.
If bitmap properties are mismatched within a component we
cannot support the mismatched feature.
Care has to be taken as vMSC may hold mismatched RIS.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:45 +0000 (12:22 +0000)]
arm_mpam: Add helpers for managing the locking around the mon_sel registers
The MSC MON_SEL register needs to be accessed from hardirq for the overflow
interrupt, and when taking an IPI to access these registers on platforms
where MSC are not accessible from every CPU. This makes an irqsave
spinlock the obvious lock to protect these registers. On systems with SCMI
or PCC mailboxes it must be able to sleep, meaning a mutex must be used.
The SCMI or PCC platforms can't support an overflow interrupt, and
can't access the registers from hardirq context.
Clearly these two can't exist for one MSC at the same time.
Add helpers for the MON_SEL locking. For now, use an irqsave spinlock and
only support 'real' MMIO platforms.
In the future this lock will be split in two allowing SCMI/PCC platforms
to take a mutex. Because there are contexts where the SCMI/PCC platforms
can't make an access, mpam_mon_sel_lock() needs to be able to fail. Do
this now, so that all the error handling on these paths is present. This
allows the relevant paths to fail if they are needed on a platform where
this isn't possible, instead of having to make explicit checks of the
interface type.
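A sketch of the fallible lock helper; the struct layout and interface
enum are assumptions:
  #include <linux/spinlock.h>

  struct mpam_msc_sketch {
          enum { MPAM_IFACE_MMIO, MPAM_IFACE_PCC } iface;
          spinlock_t mon_sel_lock;
          unsigned long mon_sel_flags;
  };

  /* Returns false if this MSC can't be locked from the current context. */
  static bool mpam_mon_sel_lock_sketch(struct mpam_msc_sketch *msc)
  {
          if (msc->iface != MPAM_IFACE_MMIO)      /* SCMI/PCC: not yet supported */
                  return false;
          spin_lock_irqsave(&msc->mon_sel_lock, msc->mon_sel_flags);
          return true;
  }
Callers then handle the failure explicitly, e.g. by returning an error
instead of touching the registers.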
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:44 +0000 (12:22 +0000)]
arm_mpam: Probe hardware to find the supported partid/pmg values
CPUs can generate traffic with a range of PARTID and PMG values,
but each MSC may also have its own maximum size for these fields.
Before MPAM can be used, the driver needs to probe each RIS on
each MSC, to find the system-wide smallest value that can be used.
The limits from requestors (e.g. CPUs) also need taking into account.
While doing this, RIS entries that firmware didn't describe are created
under MPAM_CLASS_UNKNOWN.
This adds the low level MSC write accessors.
While we're here, implement the mpam_register_requestor() call
for the arch code to register the CPU limits. Future callers of this
will tell us about the SMMU and ITS.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:43 +0000 (12:22 +0000)]
arm_mpam: Add cpuhp callbacks to probe MSC hardware
Because an MSC can only be accessed from the CPUs in its cpu-affinity
set, we need to be running on one of those CPUs to probe the MSC
hardware.
Do this work in the cpuhp callback. Probing the hardware only happens
before MPAM is enabled; as each CPU's online call is made, walk all the
MSCs and probe those we can reach that haven't already been probed.
This adds the low-level MSC register read accessors.
Once all MSCs reported by the firmware have been probed from a CPU in
their respective cpu-affinity set, the probe-time cpuhp callbacks are
replaced. The replacement callbacks will ultimately need to handle
save/restore of the runtime MSC state across power transitions, but for
now there is nothing to do in them: so do nothing.
The architecture's context switch code will be enabled by a static-key,
this can be set by mpam_enable(), but must be done from process context,
not a cpuhp callback because both take the cpuhp lock.
Whenever a new MSC has been probed, the mpam_enable() work is scheduled
to test if all the MSCs have been probed. If probing fails, mpam_disable()
is scheduled to unregister the cpuhp callbacks and free memory.
CC: Lecopzer Chen <lecopzerc@nvidia.com> Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:41 +0000 (12:22 +0000)]
arm_mpam: Add the class and component structures for firmware described ris
An MSC is a container of resources, each identified by their RIS index.
Some RIS are described by firmware to provide their position in the system.
Others are discovered when the driver probes the hardware.
To configure a resource it needs to be found by its class, e.g. 'L2'.
There are two kinds of grouping: a class is a set of components, which
are visible to user-space as there are likely to be multiple instances
of the L2 cache (e.g. one per cluster or package).
Add support for creating and destroying structures to allow a hierarchy
of resources to be created.
Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:40 +0000 (12:22 +0000)]
arm_mpam: Add probe/remove for mpam msc driver and kbuild boiler plate
Probing MPAM is convoluted. MSCs that are integrated with a CPU may
only be accessible from those CPUs, and they may not be online.
Touching the hardware early is pointless as MPAM can't be used until
the system-wide common values for num_partid and num_pmg have been
discovered.
Start with driver probe/remove and mapping the MSC.
Cc: Carl Worth <carl@os.amperecomputing.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:39 +0000 (12:22 +0000)]
ACPI / MPAM: Parse the MPAM table
Add code to parse the arm64 specific MPAM table, looking up the cache
level from the PPTT and feeding the end result into the MPAM driver.
This happens in two stages. Platform devices are created first for the
MSC devices. Once the driver probes it calls acpi_mpam_parse_resources()
to discover the RIS entries the MSC contains.
For now the MPAM hook mpam_ris_create() is stubbed out, but will update
the MPAM driver with optional discovered data about the RIS entries.
CC: Carl Worth <carl@os.amperecomputing.com> Link: https://developer.arm.com/documentation/den0065/3-0bet/?lang=en Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ben Horgan [Wed, 19 Nov 2025 12:22:38 +0000 (12:22 +0000)]
ACPI: Define acpi_put_table cleanup handler and acpi_get_table_pointer() helper
Define a cleanup helper for use with __free to release the acpi table when
the pointer goes out of scope. Also, introduce the helper
acpi_get_table_pointer() to simplify a commonly used pattern involving
acpi_get_table().
These are first used in a subsequent commit.
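The intended pattern looks roughly like this sketch; the exact
acpi_get_table_pointer() signature is an assumption:
  #include <linux/acpi.h>
  #include <linux/cleanup.h>

  DEFINE_FREE(acpi_table, struct acpi_table_header *,
              if (!IS_ERR_OR_NULL(_T)) acpi_put_table(_T))

  static int parse_table_sketch(void)
  {
          struct acpi_table_header *table __free(acpi_table) =
                  acpi_get_table_pointer(ACPI_SIG_MPAM, 0);

          if (IS_ERR(table))
                  return PTR_ERR(table);

          /* ... use table; acpi_put_table() runs when it goes out of scope */
          return 0;
  }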
Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Define a cleanup helper for use with __free to destroy platform devices
automatically when the pointer goes out of scope. This is only intended to
be used in error cases and so should be used with return_ptr() or
no_free_ptr() directly to avoid the automatic destruction on success.
A first use of this is introduced in a subsequent commit.
Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:36 +0000 (12:22 +0000)]
arm64: kconfig: Add Kconfig entry for MPAM
The bulk of the MPAM driver lives outside the arch code because it
largely manages MMIO devices that generate interrupts. The driver
needs a Kconfig symbol to enable it. As MPAM is only found on arm64
platforms, the arm64 tree is the most natural home for the Kconfig
option.
This Kconfig option will later be used by the arch code to enable
or disable the MPAM context-switch code, and to register properties
of CPUs with the MPAM driver.
Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> CC: Dave Martin <dave.martin@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:34 +0000 (12:22 +0000)]
ACPI / PPTT: Find cache level by cache-id
The MPAM table identifies caches by id. The MPAM driver also wants to know
the cache level to determine if the platform is of the shape that can be
managed via resctrl. Cacheinfo has this information, but only for CPUs that
are online.
Waiting for all CPUs to come online is a problem for platforms where
CPUs are brought online late by user-space.
Add a helper that walks every possible cache until it finds the one
identified by cache-id, then returns the level.
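The shape of the lookup, as a standalone sketch over (cache-id, level)
pairs:
  #include <stdint.h>

  struct cache_desc { uint32_t id; int level; };

  /* Walk every possible cache until cache_id matches; 0 if not found. */
  static int level_for_cache_id(const struct cache_desc *caches, int n,
                                uint32_t cache_id)
  {
          for (int i = 0; i < n; i++) {
                  if (caches[i].id == cache_id)
                          return caches[i].level;
          }
          return 0;
  }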
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ben Horgan [Wed, 19 Nov 2025 12:22:33 +0000 (12:22 +0000)]
ACPI / PPTT: Add acpi_pptt_cache_v1_full to use pptt cache as one structure
In actbl2.h, acpi_pptt_cache describes the fields in the original
Cache Type Structure. In PPTT table version 3 a new field was added at the
end, cache_id. This is described in acpi_pptt_cache_v1 but rather than
including all v1 fields it just includes this one.
In lieu of this being fixed in ACPICA, introduce acpi_pptt_cache_v1_full to
contain all the fields of the Cache Type Structure. Update the existing
code to use this new struct. This simplifies the code and removes a
non-standard use of ACPI_ADD_PTR.
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:32 +0000 (12:22 +0000)]
ACPI / PPTT: Stop acpi_count_levels() expecting callers to clear levels
In acpi_count_levels(), the initial value of *levels passed by the
caller is really an implementation detail of acpi_count_levels(), so it
is unreasonable to expect the callers of this function to know what to
pass in for this parameter. The only sensible initial value is 0,
which is what the only upstream caller (acpi_get_cache_info()) passes.
Use a local variable for the starting cache level in acpi_count_levels(),
and pass the result back to the caller via the function return value.
Get rid of the levels parameter, which has no remaining purpose.
Fix acpi_get_cache_info() to match.
Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Wed, 19 Nov 2025 12:22:31 +0000 (12:22 +0000)]
ACPI / PPTT: Add a helper to fill a cpumask from a processor container
The ACPI MPAM table uses the UID of a processor container specified in
the PPTT to indicate the subset of CPUs and cache topology that can
access each MPAM System Component (MSC).
This information is not directly useful to the kernel. The equivalent
cpumask is needed instead.
Add a helper to find the processor container by its id, then walk
the possible CPUs to fill a cpumask with the CPUs that have this
processor container as a parent.
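The walk, sketched in kernel style; parent_container_uid() is a
hypothetical stand-in for the PPTT lookup:
  #include <linux/cpumask.h>
  #include <linux/types.h>

  u32 parent_container_uid(int cpu);      /* hypothetical PPTT lookup */

  static void fill_mask_for_container_sketch(u32 uid, cpumask_t *mask)
  {
          int cpu;

          cpumask_clear(mask);
          /* possible, not online: late-onlined CPUs must be included */
          for_each_possible_cpu(cpu) {
                  if (parent_container_uid(cpu) == uid)
                          cpumask_set_cpu(cpu, mask);
          }
  }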
CC: Dave Martin <dave.martin@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Huang Ying [Fri, 14 Nov 2025 08:54:03 +0000 (16:54 +0800)]
arm64, tlbflush: don't TLBI broadcast if page reused in write fault
A multi-thread customer workload with large memory footprint uses
fork()/exec() to run some external programs every tens seconds. When
running the workload on an arm64 server machine, it's observed that
quite some CPU cycles are spent in the TLB flushing functions. While
running the workload on the x86_64 server machine, it's not. This
causes the performance on arm64 to be much worse than that on x86_64.
During the workload running, after fork()/exec() write-protects all
pages in the parent process, memory writing in the parent process
will cause a write protection fault. Then the page fault handler
will make the PTE/PDE writable if the page can be reused, which is
almost always true in the workload. On arm64, to avoid the write
protection fault on other CPUs, the page fault handler flushes the TLB
globally with TLBI broadcast after changing the PTE/PDE. However, this
isn't always necessary. Firstly, it's safe to leave some stale
read-only TLB entries as long as they will eventually be flushed.
Secondly, it's quite possible that the original read-only PTE/PDEs
aren't cached in remote TLB at all if the memory footprint is large.
In fact, on x86_64, the page fault handler doesn't flush the remote
TLB in this situation, which benefits the performance a lot.
To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable. If there are stale read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stale TLB entries.
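The remote CPU's side of the scheme, as a sketch (helper names are
illustrative; arm64's actual hooks differ):
  #include <linux/mm.h>
  #include <asm/tlbflush.h>

  /* On a CPU that takes a write fault through a stale read-only entry. */
  static bool fixup_spurious_write_fault_sketch(struct vm_area_struct *vma,
                                                unsigned long addr, pte_t pte)
  {
          if (!pte_write(pte))
                  return false;   /* a genuine write-protection fault */
          /* Another CPU already made the entry writable; only the stale
           * local TLB entry needs flushing - no TLBI broadcast. */
          local_flush_tlb_page(vma, addr);
          return true;
  }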
To test the patchset, make usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
support calling fork()/exec() periodically. To mimic the behavior of
the customer workload, run usemem with 4 threads, access 100GB memory,
and call fork()/exec() every 40 seconds. Test results show that with
the patchset the score of usemem improves ~40.6%. The cycles% of TLB
flush functions reduces from ~50.5% to ~0.3% in perf profile.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Huang Ying [Fri, 14 Nov 2025 08:54:02 +0000 (16:54 +0800)]
mm: add spurious fault fixing support for huge pmd
The page faults may be spurious because of the racy access to the page
table. For example, a non-populated virtual page is accessed on 2
CPUs simultaneously, thus the page faults are triggered on both CPUs.
However, it's possible that one CPU (say CPU A) cannot find the reason
for the page fault if the other CPU (say CPU B) has changed the page
table before the PTE is checked on CPU A. Most of the time, the
spurious page faults can be ignored safely. However, if the page
fault is for the write access, it's possible that a stale read-only
TLB entry exists in the local CPU and needs to be flushed on some
architectures. This is called spurious page fault fixing.
In the current kernel, there is spurious fault fixing support for pte,
but not for huge pmd because no architectures need it. But in the
next patch in the series, we will change the write protection fault
handling logic on arm64, so that some stale huge pmd entries may
remain in the TLB. These entries need to be flushed via the huge pmd
spurious fault fixing mechanism.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown [Wed, 15 Oct 2025 17:56:38 +0000 (18:56 +0100)]
kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
On a system which supports SME but not SVE we can now disable streaming
mode via ptrace by writing FPSIMD formatted data through NT_ARM_SVE with a
VL of 0. Extend fp-ptrace to cover rather than skip these cases: relax the
check for SVE writes of FPSIMD format data so it doesn't skip when SME is
supported, and accept 0 as the VL when performing the ptrace write.
Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown [Wed, 15 Oct 2025 17:56:37 +0000 (18:56 +0100)]
kselftest/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
In order to allow exiting streaming mode on systems with SME but not SVE,
we allow writes of FPSIMD format data via NT_ARM_SVE even when SVE is not
supported. Add a test case that covers this to sve-ptrace.
We do not support reads.
Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown [Wed, 15 Oct 2025 17:56:36 +0000 (18:56 +0100)]
arm64/sme: Support disabling streaming mode via ptrace on SME only systems
Currently it is not possible to disable streaming mode via ptrace on SME
only systems; the interface for doing this is to write via NT_ARM_SVE, but
such writes will be rejected on a system without SVE support. Enable this
functionality by allowing userspace to write SVE_PT_REGS_FPSIMD format data
via NT_ARM_SVE with the vector length set to 0 on SME only systems. Such
writes currently error since we require that a vector length is specified
which should minimise the risk that existing software is relying on current
behaviour.
Reads are not supported since I am not aware of any use case for this and
there is some risk that an existing userspace application may be confused if
it reads NT_ARM_SVE on a system without SVE. Existing kernels will return
FPSIMD formatted register state from NT_ARM_SVE if full SVE state is not
stored, for example if the task has not used SVE. Returning a vector length
of 0 would create a risk that software would try to do things like allocate
space for register state with zero sizes, while returning a vector length of
128 bits would look like SVE is supported. It seems safer to just not make
the changes to add read support.
It remains possible for userspace to detect an SME only system via the
ptrace interface alone, since reads of NT_ARM_SSVE and NT_ARM_ZA will
succeed while
reads of NT_ARM_SVE will fail. Read/write access to the FPSIMD registers in
non-streaming mode is available via REGSET_FPR.
sve_set_common() already avoids allocating SVE storage when doing an
FPSIMD formatted write, and allocating SME storage when doing a
NT_ARM_SVE write, so change the function to validate the new case and
skip setting a vector length for it.
The aim is to make a minimally invasive change: no operation that would
previously have succeeded is affected, and we use a previously defined
interface in new circumstances rather than defining completely new ABI.
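A rough userspace sketch of such a write (illustrative only: error handling
omitted, layout per the existing NT_ARM_SVE regset UAPI which this change
reuses):
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <linux/elf.h>		/* NT_ARM_SVE */
#include <asm/ptrace.h>		/* user_sve_header, SVE_PT_REGS_FPSIMD */
/* Sketch: take an SME-only tracee out of streaming mode via ptrace. */
static long exit_streaming_mode(pid_t pid)
{
	struct {
		struct user_sve_header hdr;
		struct user_fpsimd_state fpsimd;	/* FPSIMD format payload */
	} buf;
	struct iovec iov = { .iov_base = &buf, .iov_len = sizeof(buf) };
	memset(&buf, 0, sizeof(buf));
	buf.hdr.size = sizeof(buf);
	buf.hdr.flags = SVE_PT_REGS_FPSIMD;
	buf.hdr.vl = 0;		/* VL of 0: newly accepted on SME-only systems */
	return ptrace(PTRACE_SETREGSET, pid, NT_ARM_SVE, &iov);
}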
Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: David Spickett <david.spickett@linaro.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Sascha Bischoff [Wed, 22 Oct 2025 13:45:37 +0000 (13:45 +0000)]
arm64/sysreg: Add ICH_VMCR_EL2
Add the ICH_VMCR_EL2 register, which is required for the upcoming
GICv5 KVM support. This register has two different field encodings,
depending on whether it is used for GICv3 or GICv5-based VMs. The
GICv5-specific field encodings are generated with a FEAT_GCIE prefix.
This register is already described directly in the GICv3 KVM
code. It will be ported across to use the generated encodings as
part of an upcoming change.
Sascha Bischoff [Wed, 22 Oct 2025 13:45:37 +0000 (13:45 +0000)]
arm64/sysreg: Move generation of RES0/RES1/UNKN to function
The RESx and UNKN define generation happens in two places
(EndSysreg and EndSysregFields) using nearly identical code. Split
this out into a function and call that instead, rather than keeping
the duplicated code.
There are no changes to the generated sysregs as part of this change.
Sascha Bischoff [Wed, 22 Oct 2025 13:45:36 +0000 (13:45 +0000)]
arm64/sysreg: Support feature-specific fields with 'Prefix' descriptor
Some system register field encodings change based on, for example, the
architecture features in use or the context in which the register is
accessed. In order to support these different field encodings,
introduce the Prefix descriptor (Prefix, EndPrefix) for describing
such sysregs.
The Prefix descriptor can be used in the following way:
Sysreg EXAMPLE 0 1 2 3 4
Prefix FEAT_A
Field 63:0 Foo
EndPrefix
Prefix FEAT_B
Field 63:1 Bar
Res0 0
EndPrefix
Field 63:0 Baz
EndSysreg
This will generate a single set of system register encodings (REG_,
SYS_, ...), and then generate three sets of field definitions for the
system register called EXAMPLE. The first set is prefixed by FEAT_A,
e.g. FEAT_A_EXAMPLE_Foo. The second set is prefixed by FEAT_B, e.g.
FEAT_B_EXAMPLE_Bar. The third set is not given a prefix at all,
e.g. EXAMPLE_Baz. For each set, a corresponding set of defines for
Res0, Res1, and Unkn is generated.
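As a rough illustration of the naming scheme only (the exact macro bodies
follow whatever the generator already emits for ordinary fields), the
example above would yield definitions along the lines of:
#define FEAT_A_EXAMPLE_Foo	GENMASK(63, 0)
#define FEAT_B_EXAMPLE_Bar	GENMASK(63, 1)
#define FEAT_B_EXAMPLE_RES0	BIT(0)
#define EXAMPLE_Baz		GENMASK(63, 0)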
The intent for the final prefix-less fields is to describe default or
legacy field encodings. This ensures that prefixed encodings can be
added to already-present sysregs without affecting existing legacy
code. Prefixed fields must be defined before those without a prefix,
and this is checked by the generator. This ensures consistent ordering
within the sysreg definitions.
The Prefix descriptor can be used within Sysreg or SysregFields
blocks. Field, Res0, Res1, Unkn, Raz, SignedEnum and Enum can all be
used within a Prefix block; Fields and Mapping cannot. Fields that
vary with features must be described as part of a SysregFields block
instead. Mappings, which are just a code comment, make little sense in
this context, and have hence not been included.
There are no changes to the generated system register definitions as
part of this change.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Sascha Bischoff [Wed, 22 Oct 2025 13:45:36 +0000 (13:45 +0000)]
arm64/sysreg: Fix checks for incomplete sysreg definitions
The checks for incomplete sysreg definitions were testing whether
next_bit was greater than 0, which is incorrect and missed cases
where bit 0 hasn't been defined for a sysreg. The reason is that
next_bit is -1 (LSB - 1) when all bits have been processed.
Change the checks to use >= 0 instead. Also, set next_bit in Mapping
to -1 instead of 0 to match the new checks.
There are no changes to the generated sysreg definitions as part of
this change, and conveniently no existing definition lacks bit 0.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Yang Shi [Thu, 23 Oct 2025 20:44:28 +0000 (13:44 -0700)]
arm64: mm: make linear mapping permission update more robust for partial range
The commit fcf8dda8cc48 ("arm64: pageattr: Explicitly bail out when changing
permissions for vmalloc_huge mappings") made the permission update for
partial ranges more robust. But the linear mapping permission update
still assumes the whole range is updated, iterating from the first page
all the way to the last page of the area.
Make it more robust by updating the linear mapping permission starting
from the page mapped by the start address, for the given number of
pages (numpages).
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Dev Jain [Fri, 17 Oct 2025 16:02:51 +0000 (21:32 +0530)]
arm64/mm: Elide TLB flush in certain pte protection transitions
Currently arm64 does an unconditional TLB flush in mprotect(). This is not
required in some cases, for example when changing from PROT_NONE to
PROT_READ | PROT_WRITE (a real use case: glibc malloc does this to emulate
growing into the non-main heaps), or when unsetting uffd-wp in a range.
Therefore, implement pte_needs_flush() for arm64, which several other
arches already implement.
Running a userspace program changing permissions back and forth between
PROT_NONE and PROT_READ | PROT_WRITE, and measuring the average time taken
for the none->rw transition, I get a reduction from 3.2 microseconds to
2.85 microseconds, giving a 12.3% improvement.
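The idea can be modelled in a few lines of self-contained C (toy bit
layout, not the arm64 implementation): a flush can be skipped when no
hardware-visible translation could have been cached, or when only
software-managed bits change.
#include <stdbool.h>
#include <stdint.h>
#define PTE_VALID	(1ULL << 0)	/* toy layout, illustration only */
#define PTE_SW_BITS	(0xfULL << 55)	/* bits the hardware walker ignores */
/* Model of the decision: does old -> new require a TLB invalidation? */
static bool pte_needs_flush(uint64_t oldpte, uint64_t newpte)
{
	/* An invalid pte (e.g. PROT_NONE) cannot be cached in the TLB. */
	if (!(oldpte & PTE_VALID))
		return false;
	/* Changes confined to software bits (e.g. uffd-wp) are invisible
	 * to the hardware walker, so no stale TLB entry can result. */
	if (((oldpte ^ newpte) & ~PTE_SW_BITS) == 0)
		return false;
	return true;
}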
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linu Cherian [Fri, 17 Oct 2025 05:14:37 +0000 (10:44 +0530)]
arm64/mm: Rename try_pgd_pgtable_alloc_init_mm
With the BUG_ON in pgd_pgtable_alloc_init_mm moved up to a higher layer,
the gfp flags are the only difference between try_pgd_pgtable_alloc_init_mm
and pgd_pgtable_alloc_init_mm. Hence rename the "try" version
to pgd_pgtable_alloc_init_mm_gfp.
Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Linu Cherian <linu.cherian@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
arm64/mm: Allow __create_pgd_mapping() to propagate pgtable_alloc() errors
arch_add_memory() is used to hotplug memory into a system, but as part
of its implementation it calls __create_pgd_mapping(), which uses
pgtable_alloc() to build intermediate page tables. As this path
was initially only used during early boot, pgtable_alloc() is designed to
BUG_ON() on failure. However, if memory hotplug is attempted while the
system's memory is extremely tight and the allocation fails, the whole
system panics, which is not desirable. Hence update __create_pgd_mapping()
and all its callers to be non-void and propagate -ENOMEM on allocation
failure, allowing the system to fail gracefully.
During early boot we still want an allocation failure to panic the
system, so create a wrapper around __create_pgd_mapping() called
early_create_pgd_mapping() which panics if the return value is
non-zero. All the init calls are updated to use this wrapper rather
than the modified __create_pgd_mapping() to restore that behaviour.
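A sketch of the resulting split (signatures and bodies heavily simplified;
pgtable_alloc() here is a stand-in for the real allocator):
#include <errno.h>
extern void *pgtable_alloc(void);	/* may fail under memory pressure */
extern void panic(const char *msg);
/* Hotplug path: propagate failure instead of BUG_ON(). */
static int __create_pgd_mapping(void)
{
	void *table = pgtable_alloc();
	if (!table)
		return -ENOMEM;	/* lets arch_add_memory() fail gracefully */
	/* ... populate intermediate levels ... */
	return 0;
}
/* Early boot path: an allocation failure here is still fatal. */
static void early_create_pgd_mapping(void)
{
	if (__create_pgd_mapping())
		panic("Failed to create page tables");
}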
Fixes: 4ab215061554 ("arm64: Add memory hotplug support") Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Signed-off-by: Linu Cherian <linu.cherian@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This just replaces all used TCR_EL1 field macros with the tools sysreg
based variants and subsequently drops them from the header
(pgtable-hwdef.h), while retaining the ones used for KVM (represented
via the sysreg tools format).
Cc: Will Deacon <will@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Li Qiang [Fri, 19 Sep 2025 03:33:27 +0000 (11:33 +0800)]
arm64: add unlikely hint to MTE async fault check in el0_svc_common
Add unlikely() hint to the _TIF_MTE_ASYNC_FAULT flag check in
el0_svc_common() since asynchronous MTE faults are expected to be
rare occurrences during normal system call execution.
This optimization helps the compiler improve instruction cache usage
and branch prediction for the common case where no asynchronous
MTE faults are pending, while maintaining correct behavior for
the exceptional case where such faults need to be handled prior
to system call execution.
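Reduced to its essentials, the pattern is (flag value and helpers are
stand-ins, not the real kernel definitions):
#define unlikely(x)		__builtin_expect(!!(x), 0)
#define _TIF_MTE_ASYNC_FAULT	(1UL << 5)	/* stand-in value */
extern void do_mte_async_fault(void);		/* stand-in helpers */
extern void invoke_syscall(void);
void el0_svc_common(unsigned long flags)
{
	/* Rare: a tag check fault reported asynchronously must be handled
	 * before the syscall; unlikely() keeps this off the hot path. */
	if (unlikely(flags & _TIF_MTE_ASYNC_FAULT))
		do_mte_async_fault();
	invoke_syscall();
}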
Signed-off-by: Li Qiang <liqiang01@kylinos.cn> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Thomas Huth [Fri, 10 Oct 2025 13:01:16 +0000 (15:01 +0200)]
arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-uapi headers
While the GCC and Clang compilers already define __ASSEMBLER__
automatically when compiling assembly code, __ASSEMBLY__ is a
macro that only gets defined by the Makefiles in the kernel.
This can be very confusing when switching between userspace
and kernelspace coding, or when dealing with uapi headers that
rather should use __ASSEMBLER__ instead. So let's standardize now
on the __ASSEMBLER__ macro that is provided by the compilers.
This is a mostly mechanical patch (done with a simple "sed -i"
statement), except for the following files where comments with
mis-spelled macros were tweaked manually:
Thomas Huth [Fri, 10 Oct 2025 13:01:15 +0000 (15:01 +0200)]
arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers
__ASSEMBLY__ is only defined by the Makefile of the kernel, so
this is not really useful for uapi headers (unless the userspace
Makefile defines it, too). Let's switch to __ASSEMBLER__ which
gets set automatically by the compiler when compiling assembly
code.
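The guard itself is unchanged in shape; only the macro spelling differs
(illustrative header, not taken from the patch):
/* example.h: one header shared between C and assembly */
#ifndef __ASSEMBLER__		/* provided by the compiler itself */
struct example_regs {
	unsigned long x0;
};
#else
	.macro	save_x0		/* assembler-only definitions */
	.endm
#endif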
Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Walleij [Wed, 5 Nov 2025 14:27:01 +0000 (15:27 +0100)]
arm64: entry: Clean out some indirection
The conversion to generic IRQ entry left some functions
in the EL1 (kernel) IRQ entry path very shallow, so drop
the __inner_functions() where appropriate, saving some
time and stack.
This is not a fix but an optimization.
Drop stale comments about irqentry_enter/exit() while we
are at it.
arm64/mm: Ensure PGD_SIZE is aligned to 64 bytes when PA_BITS = 52
Although the comment clearly states the PGD table's alignment requirement
(when PA_BITS = 52), the subsequent BUILD_BUG_ON() compares the size
to 64 bytes instead. Change it into an actual alignment test.
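That is, roughly (a sketch of the intended check, not the exact hunk):
/* The comment requires 64-byte alignment when PA_BITS == 52, so test
 * alignment rather than comparing the size to 64 bytes: */
BUILD_BUG_ON(!IS_ALIGNED(PGD_SIZE, 64));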
Ard Biesheuvel [Wed, 15 Oct 2025 20:56:42 +0000 (22:56 +0200)]
arm64/efi: Call EFI runtime services without disabling preemption
The only remaining reason why EFI runtime services are invoked with
preemption disabled is the fact that the mm is swapped out behind the
back of the context switching code.
The kernel no longer disables preemption in kernel_neon_begin().
Furthermore, the EFI spec is being clarified to explicitly state that
only baseline FP/SIMD is permitted in EFI runtime service
implementations, and so the existing kernel mode NEON context switching
code is sufficient to preserve and restore the execution context of an
in-progress EFI runtime service call.
Most EFI calls are made from the efi_rts_wq, which is serviced by a
kthread. As kthreads never return to user space, they usually don't have
an mm, and so we can use the existing infrastructure to swap in the
efi_mm while the EFI call is in progress. This is visible to the
scheduler, which will therefore reactivate the selected mm when
switching out the kthread and back in again.
Given that the EFI spec explicitly permits runtime services to be called
with interrupts enabled, firmware code is already required to tolerate
interruptions. So rather than disable preemption, disable only migration
so that EFI runtime services are less likely to cause scheduling delays.
To avoid potential issues where runtime services are interrupted while
polling the secure firmware for async completions, keep migration
disabled so that a runtime service invocation does not resume on a
different CPU from the one it was started on.
Note, though, that the firmware executes at the same privilege level as
the kernel, and is therefore able to disable interrupts altogether.
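Conceptually, the arch wrapper moves from a preempt-disabled section to a
migrate-disabled one. A simplified sketch using the generic kernel
primitives (the real wrapper also switches the mm and FP/SIMD state):
#include <linux/efi.h>
#include <linux/preempt.h>
/* Sketch: pin the call to one CPU without disabling preemption. */
static efi_status_t efi_call_on_one_cpu(efi_status_t (*fn)(void))
{
	efi_status_t status;
	migrate_disable();	/* preemptible, but won't change CPU */
	status = fn();
	migrate_enable();
	return status;
}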
Ard Biesheuvel [Wed, 15 Oct 2025 20:56:41 +0000 (22:56 +0200)]
arm64/efi: Move uaccess en/disable out of efi_set_pgd()
efi_set_pgd() will no longer be called when invoking EFI runtime
services via the efi_rts_wq work queue, but the uaccess en/disable are
still needed when PAN is emulated via TTBR0 switching. So move
these into the callers.
Ard Biesheuvel [Wed, 15 Oct 2025 20:56:40 +0000 (22:56 +0200)]
arm64/efi: Drop efi_rt_lock spinlock from EFI arch wrapper
Since commit
5894cf571e14 ("acpi/prmt: Use EFI runtime sandbox to invoke PRM handlers")
all EFI runtime calls on arm64 are routed via the EFI runtime wrappers,
which are serialized using the efi_runtime_lock semaphore.
This means the efi_rt_lock spinlock in the arm64 arch wrapper code has
become redundant, and can be dropped. For robustness, replace it with an
assert that the EFI runtime lock is in fact held by 'current'.
Ard Biesheuvel [Wed, 15 Oct 2025 20:56:39 +0000 (22:56 +0200)]
arm64/fpsimd: Permit kernel mode NEON with IRQs off
Currently, may_use_simd() will return false when called from a context
where IRQs are disabled. One notable case where this happens is when
calling the ResetSystem() EFI runtime service from the reboot/poweroff
code path. For this case alone, there is a substantial amount of FP/SIMD
support code to handle the corner case where an EFI runtime service is
invoked with IRQs disabled.
The only reason kernel mode SIMD is not allowed when IRQs are disabled
is that re-enabling softirqs in this case produces a noisy diagnostic
when lockdep is enabled. The warning is valid, in the sense that
delivering pending softirqs over the back of the call to
local_bh_enable() is problematic when IRQs are disabled.
While the API lacks a facility to simply mask and unmask softirqs
without triggering their delivery, disabling softirqs is not needed to
begin with when IRQs are disabled, given that softirqs are only ever
taken asynchronously over the back of a hard IRQ.
So dis/enable softirq processing conditionally, based on whether IRQs
are enabled, and relax the check in may_use_simd().
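The conditional dance looks roughly like this (a sketch of the pattern,
with the FP/SIMD work abstracted behind a callback):
#include <linux/bottom_half.h>
#include <linux/irqflags.h>
static void kernel_neon_section(void (*work)(void))
{
	bool use_bh = !irqs_disabled();	/* softirqs only arrive via hard IRQs */
	if (use_bh)
		local_bh_disable();
	work();				/* ... FP/SIMD work ... */
	if (use_bh)
		local_bh_enable();
}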
Ard Biesheuvel [Wed, 15 Oct 2025 20:56:38 +0000 (22:56 +0200)]
arm64/fpsimd: Don't warn when EFI execution context is preemptible
Kernel mode FP/SIMD no longer requires preemption to be disabled, so
only warn on uses of FP/SIMD from preemptible context if the fallback
path is taken for cases where kernel mode NEON would not be allowed
otherwise.
Ard Biesheuvel [Wed, 15 Oct 2025 20:56:37 +0000 (22:56 +0200)]
efi/runtime-wrappers: Keep track of the efi_runtime_lock owner
The EFI runtime wrappers use a file local semaphore to serialize access
to the EFI runtime services. This means that any calls to the arch
wrappers around the runtime services will also be serialized, removing
the need for redundant locking.
For robustness, add a facility that allows those arch wrappers to assert
that the semaphore was taken by the current task.
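One plausible shape for such a facility (names and details hypothetical;
the commit message does not spell out the interface):
#include <linux/bug.h>
#include <linux/sched.h>
#include <linux/semaphore.h>
static struct semaphore efi_runtime_lock;
static struct task_struct *efi_rts_lock_owner;	/* hypothetical name */
static void efi_rts_lock(void)
{
	down(&efi_runtime_lock);
	efi_rts_lock_owner = current;	/* record for arch-side asserts */
}
static void efi_rts_unlock(void)
{
	efi_rts_lock_owner = NULL;
	up(&efi_runtime_lock);
}
/* Arch wrapper: assert instead of taking a redundant spinlock. */
static void assert_efi_rt_access(void)
{
	WARN_ON_ONCE(efi_rts_lock_owner != current);
}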
These TCR_EL1 helpers don't have any other callers. Drop these redundant
indirections completely, making the code more compact and readable.
No functional change.
Cc: Will Deacon <will@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Mon, 3 Nov 2025 16:04:17 +0000 (16:04 +0000)]
kselftest/arm64: Align zt-test register dumps
The zt-test output is awkward to read, as the 'Expected' value isn't
dumped on its own line and isn't aligned with the 'Got' value beneath.
For example:
Admittedly this isn't all that important when the 'Got' value is all
zeroes, but otherwise this would be a major help for identifying which
portion of the 'Got' value is not as expected.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kselftest@vger.kernel.org Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Omar Sandoval [Wed, 8 Oct 2025 23:01:30 +0000 (16:01 -0700)]
arm64: remove unused ARCH_PFN_OFFSET
This is only relevant to the FLATMEM memory model, which isn't an option
since commit 782276b4d0ad ("arm64: Force SPARSEMEM_VMEMMAP as the only
memory management model").
Ryo Takakura [Wed, 5 Nov 2025 07:42:59 +0000 (16:42 +0900)]
arm64: use SOFTIRQ_ON_OWN_STACK for enabling softirq stack
Architectures with HAVE_SOFTIRQ_ON_OWN_STACK use their dedicated
softirq stack when !PREEMPT_RT; this condition is captured by
SOFTIRQ_ON_OWN_STACK.
Let arm64 use SOFTIRQ_ON_OWN_STACK as well to gate its
usage of the stack.
Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Yicong Yang [Wed, 5 Nov 2025 10:38:49 +0000 (18:38 +0800)]
arch_topology: Provide a stub topology_core_has_smt() for !CONFIG_GENERIC_ARCH_TOPOLOGY
The arm_pmu driver uses topology_core_has_smt() to detect an SMT
implementation, which depends on CONFIG_GENERIC_ARCH_TOPOLOGY.
That config is optional on arm platforms, so provide a
!CONFIG_GENERIC_ARCH_TOPOLOGY stub for topology_core_has_smt().
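The stub presumably amounts to something like this (a sketch; the real
signature may differ):
#ifndef CONFIG_GENERIC_ARCH_TOPOLOGY
/* Without generic topology there is no SMT information; report none. */
static inline bool topology_core_has_smt(int cpu)
{
	return false;
}
#endif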
Fixes: c3d78c34ad00 ("perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511041757.vuCGOmFc-lkp@intel.com/ Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Yicong Yang <yangyccccc@gmail.com> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
Robin Murphy [Tue, 4 Nov 2025 18:29:40 +0000 (18:29 +0000)]
perf/arm-ni: Fix and optimise register offset calculation
LKP points out an operator precedence oversight in the new NoC S3
support that, annoyingly, my local W=1 build didn't flag. In fixing
that, we can also take the similarly-missed opportunity to cache the
version check itself at event_init time.
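The commit doesn't quote the expression, but the classic shape of this
class of bug is:
/* Intended: index scaled into a register offset, then added to a base. */
reg = base + (idx << shift);
/* Written: '+' binds tighter than '<<', so the sum gets shifted. */
reg = base + idx << shift;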
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511041749.ok8zDP6u-lkp@intel.com/ Fixes: 8fa08f8835e5 ("perf/arm-ni: Add NoC S3 support") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Ma Ke [Wed, 22 Oct 2025 11:53:25 +0000 (19:53 +0800)]
perf: arm_cspmu: fix error handling in arm_cspmu_impl_unregister()
driver_find_device() calls get_device() to increment the reference
count once a matching device is found. device_release_driver()
releases the driver, but it does not decrease the reference count that
was incremented by driver_find_device(). At the end of the loop, there
is no put_device() to balance the reference count. To avoid reference
count leakage, add put_device() to decrease the reference count.
Found by code review.
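The balanced pattern looks like this (a sketch of the driver-core idiom;
drv and match_fn stand in for the real arguments):
struct device *dev;
while ((dev = driver_find_device(drv, NULL, NULL, match_fn))) {
	device_release_driver(dev);
	put_device(dev);	/* drop the ref driver_find_device() took */
}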
Cc: stable@vger.kernel.org Fixes: bfc653aa89cb ("perf: arm_cspmu: Separate Arm and vendor module") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Signed-off-by: Will Deacon <will@kernel.org>
Robin Murphy [Mon, 20 Oct 2025 11:44:19 +0000 (12:44 +0100)]
perf/arm-ni: Add NoC S3 support
NoC S3 and its SI L1 sibling look largely similar to their predecessors,
but add the notion of subfeatures to the discovery process, which we now
use to find the event muxes for each device node. Plus, as ever, more
mildly annoying shuffling around of some of the PMU registers (this time
it's the counters...)
Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Besar Wicaksono [Tue, 30 Sep 2025 00:26:04 +0000 (00:26 +0000)]
perf/arm_cspmu: nvidia: Add pmevfiltr2 support
Support NVIDIA PMU that utilizes the optional event filter2 register.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org>
Besar Wicaksono [Tue, 30 Sep 2025 00:26:03 +0000 (00:26 +0000)]
perf/arm_cspmu: nvidia: Add revision id matching
Distinguish NVIDIA devices by the revision and variant bits
in the PMIIDR register, in addition to the product id.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org>
Besar Wicaksono [Tue, 30 Sep 2025 00:26:02 +0000 (00:26 +0000)]
perf/arm_cspmu: Add pmpidr support
The PMIIDR value is composed from the values in the PMPIDR registers.
We can use the PMPIDR registers as an alternative for device
identification on systems that do not implement PMIIDR.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org>
Besar Wicaksono [Tue, 30 Sep 2025 00:26:01 +0000 (00:26 +0000)]
perf/arm_cspmu: Add callback to reset filter config
An implementer may need to reset a filter config when
stopping a counter, so add a callback for this.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org>
Yicong Yang [Mon, 22 Sep 2025 03:30:10 +0000 (11:30 +0800)]
perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count the
processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
one of the SMT siblings is not idle on a multi-threaded implementation.
So don't use it on SMT cores.
Introduce topology_core_has_smt() to detect an SMT implementation and
cache it in arm_pmu::has_smt during allocation.
When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this
patch we'll get:
[root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
Performance counter stats for 'CPU(s) 2-3':
With this patch the idle state of CPU3 is observed as expected:
[root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
Performance counter stats for 'CPU(s) 2-3':
Linus Torvalds [Sun, 12 Oct 2025 20:27:56 +0000 (13:27 -0700)]
Merge tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
"One revert because of a regression in the I2C core which has sadly not
showed up during its time in -next"
* tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
Revert "i2c: boardinfo: Annotate code used in init phase only"
Linus Torvalds [Sun, 12 Oct 2025 15:45:52 +0000 (08:45 -0700)]
Merge tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Skip interrupt ID 0 in sifive-plic during suspend/resume because
ID 0 is reserved and accessing reserved register space could result
in undefined behavior
- Fix a function's retval check in aspeed-scu-ic
* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
Linus Torvalds [Sat, 11 Oct 2025 23:06:04 +0000 (16:06 -0700)]
Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw from trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
exercise writes to the trace_marker file, but lacks any tests for the
trace_marker_raw file (this needs to be remedied). Syzbot found two
issues with the update to the trace_marker_raw file:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_marker and trace_marker_raw files.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning when writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
struct {
struct trace_entry ent;
int id;
char buf[];
};
The size of this structure is reserved on the ring buffer with:
size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
memcpy(&entry->id, buf, cnt);
This used to be a copy_from_user_nofault(), but converting it to
a memcpy() triggers the fortify-string code and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
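In other words (a sketch of the arithmetic; whether the committed fix uses
exactly this expression is not shown here):
/* cnt counts the user payload, which already contains the 4-byte id. */
size = sizeof(*entry) + cnt;				/* 4 bytes too many */
size = sizeof(*entry) + cnt - sizeof(entry->id);	/* exact reservation */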
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
Linus Torvalds [Sat, 11 Oct 2025 22:47:12 +0000 (15:47 -0700)]
Merge tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:
- Fix UAPI types check in headers_check.pl
- Only enable -Werror for hostprogs with CONFIG_WERROR / W=e
- Ignore fsync() error when output of gen_init_cpio is a pipe
- Several little build fixes for recent modules.builtin.modinfo series
* tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections
kbuild: Add '.rel.*' strip pattern for vmlinux
kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux
gen_init_cpio: Ignore fsync() returning EINVAL on pipes
scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs
kbuild: uapi: Strip comments before size type check
Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Closes: https://lore.kernel.org/r/29ec0082-4dd4-4120-acd2-44b35b4b9487@oss.qualcomm.com Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>