s390/disassembler: Remove duplicate instruction format RSY_RDRU
Instruction format RSY_RDRU is a duplicate of RSY_RURD2. Use the latter,
as it follows the s390-specific conventions for instruction format
naming used in binutils.
s390/boot: Use boot_printk() instead of sclp_early_printk()
Consistently use boot_printk() everywhere instead of sclp_early_printk() at
some places. For some places it was required (e.g. als.c), in order to stay
in code compiled for the same architecture level, for other places it is
not obvious why sclp_early_printk() was used instead of
decompressor_printk().
Given that the whole decompressor code is compiled for the same
architecture level, there is no requirement left to use different
printk functions.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390/boot: Compile all files with the same march flag
Only a couple of files of the decompressor are compiled with the
minimum architecture level. This is problematic for potential function
calls between compile units, especially if a target function is within
a compile until compiled for a higher architecture level, since that
may lead to an unexpected operation exception.
Therefore compile all files of the decompressor for the same (minimum)
architecture level.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Replace CONFIG_HAVE_MARCH_*_FEATURES with MARCH_HAS_*_FEATURES
everywhere so code gets compiled correctly depending on if the
target is the kernel or the decompressor.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Provide MARCH_HAS_*_FEATURES defines which are supposed to be used
everywhere instead of the CONFIG_HAVE_MARCH_*_FEATURES defines.
Various header files contain code which depend on the
CONFIG_HAVE_MARCH_*_FEATURES defines, allowing for compile time
optimizations. If such code is used within the decompressor wrong code may
be generated (the compiler may generate instructions which are not
available for the minimum architecture level of the decompressor).
Therefore provide a new header file with MARCH_HAS_*_FEATURES defines,
which are only available if __DECOMPRESSOR is not defined. This way code
generation for the kernel image is still optimized depending on
CONFIG_HAVE_MARCH_*_FEATURES, while code generated for the decompressor is
compiled for the minimum architecture level.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390/facility: Disable compile time optimization for decompressor code
Disable compile time optimizations of test_facility() for the
decompressor. The decompressor should not contain any optimized code
depending on the architecture level set the kernel image is compiled
for to avoid unexpected operation exceptions.
Add a __DECOMPRESSOR check to test_facility() to enforce that
facilities are always checked during runtime for the decompressor.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The decompressor code is partially compiled with march=z900 so it is
possible to print an error message in case a kernel is booted on a
machine which misses facilities to execute the kernel.
Given that the decompressor code also includes header files from the
core kernel this causes problems for inline assemblies and other code
where the minimum assumed architecture level is set to z10 in the
meantime. If such code is also used in the decompressor (e.g. inline
functions) z900 support must be implemented again.
In order to avoid this and to keep things simple just raise the
minimum architecture level to z10 for the decompressor just like for
the kernel.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The bss section of the decompressor is part of the compressed kernel image
since commit 980d5f9ab36b ("s390/boot: enable .bss section for compressed
kernel").
Remove a now incorrect comment that states that the bss section must not be
accessed.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Since commit "s390/sha3: Support sha3 performance enhancements"
the selftests of the sha3_256_s390 and sha3_512_s390 kernel digests
sometimes fail with:
alg: shash: sha3-256-s390 test failed (wrong result) on test vector 3,
cfg="import/export"
alg: self-tests for sha3-256 using sha3-256-s390 failed (rc=-22)
or with
alg: ahash: sha3-256-s390 test failed (wrong result) on test vector 3,
cfg="digest misaligned splits crossing pages"
alg: self-tests for sha3-256 using sha3-256-s390 failed (rc=-22)
The first failure is because the newly introduced context field
'first_message_part' is not copied during export and import operations.
Because of that the value of 'first_message_part' is more or less random
after an import into a newly allocated context and may or may not fit to
the state of the imported SHA3 operation, causing an invalid hash when it
does not fit.
Save the 'first_message_part' field in the currently unused field 'partial'
of struct sha3_state, even though the meaning of 'partial' is not exactly
the same as 'first_message_part'. For the caller the returned state blob
is opaque and it must only be ensured that the state can be imported later
on by the module that exported it.
The second failure is when on entry of s390_sha_update() the flag
'first_message_part' is on, and kimd is called in the first 'if (index)'
block as well as in the second 'if (len >= bsize)' block. In this case,
the 'first_message_part' is turned off after the first kimd, but the
function code incorrectly retains the NIP flag. Reset the NIP flag after
the first kimd unconditionally besides turning 'first_message_part' off.
s390/pkey: Add AES xts and HMAC clear key token support
Add support for deriving protected keys from clear key token for
AES xts and HMAC keys via PCKMO instruction. Add support for
protected key generation and unwrap of protected key tokens for
these key types. Furthermore 4 new sysfs attributes are introduced:
s390/mm: Add cond_resched() to cmm_alloc/free_pages()
Adding/removing large amount of pages at once to/from the CMM balloon
can result in rcu_sched stalls or workqueue lockups, because of busy
looping w/o cond_resched().
Prevent this by adding a cond_resched(). cmm_free_pages() holds a
spin_lock while looping, so it cannot be added directly to the existing
loop. Instead, introduce a wrapper function that operates on maximum 256
pages at once, and add it there.
Add three counters to follow and understand hiperdispatch behavior;
* adjustment_count (amount of capacity adjustments triggered)
* greedy_time_ms (time spent while all cpus are on high capacity)
* conservative_time_ms (time spent while only entitled cpus are on high
capacity)
These counters can be found under /sys/kernel/debug/s390/hiperdispatch/
Time counters are in <msec> format and only cover the time spent
when hiperdispatch is active.
Add two attributes for debug purposes. They can be found under;
/sys/devices/system/cpu/hiperdispatch/
* hd_stime_threshold : allows user to adjust steal time threshold
* hd_delay_factor : allows user to adjust delay factor of hiperdispatch
work (after topology updates, delayed work is
always delayed extra by this factor)
hd_stime_threshold can have values between 0-100 as it represents a
percentage value.
hd_delay_factor can have values greater than 1. It is multiplied with
the default delay to achieve a longer interval, pushing back the next
hiperdispatch adjustment after a topology update.
Ex:
if delay interval is 250ms and the delay factor is 4;
delayed interval is now 1000ms(1sec). After each capacity adjustment
or topology change, work has a delayed interval of 1 sec for one
interval.
Expose hiperdispatch controls via sysctl. The user can now toggle
hiperdispatch via assigning 0 or 1 to s390.hiperdispatch attribute.
When hiperdipatch is toggled on, it tries to adjust CPU capacities,
while system is in vertical polarization to gain performance benefits
from different CPU polarizations. Disabling hiperdispatch reverts the
CPU capacities to their default (HIGH_CAPACITY) and stops the dynamic
adjustments.
Introduce a kconfig option HIPERDISPATCH_ON which allows users to
use hiperdispatch by default on vertical polarization. Using the
sysctl attribute s390.hiperdispatch would overwrite this behavior.
Mete Durlu [Mon, 12 Aug 2024 11:39:36 +0000 (13:39 +0200)]
s390/hiperdispatch: Add trace events
Add trace events to debug hiperdispatch behavior and track domain
rebuilding. Two events provide information about the decision making of
hiperdispatch and the adjustments made.
Mete Durlu [Mon, 12 Aug 2024 11:39:35 +0000 (13:39 +0200)]
s390/hiperdispatch: Add steal time averaging
The measurements done by hiperdispatch can have sudden spikes and dips
during run time. To prevent these outliers effecting the decision making
process and causing adjustment overhead, use weighted average of the
steal time.
Mete Durlu [Mon, 12 Aug 2024 11:39:34 +0000 (13:39 +0200)]
s390/hiperdispatch: Introduce hiperdispatch
When LPAR is in vertical polarization, CPUs get different polarization
values, namely vertical high, vertical medium and vertical low. These
values represent the likelyhood of the CPU getting physical runtime.
Vertical high CPUs will always get runtime and others get varying
runtime depending on the load the CEC is under.
Vertical high and vertical medium CPUs are considered the CPUs which the
current LPAR has the entitlement to run on. The vertical lows are on the
other hand are borrowed CPUs which would only be given to the LPAR by
hipervisor when the other LPARs are not utilizing them.
Using the CPU capacities, hint linux scheduler when it should prioritise
vertical high and vertical medium CPUs over vertical low CPUs.
By tracking various system statistics hiperdispatch determines when to
adjust cpu capacities.
After each adjustment, rebuilding of scheduler domains is necessary to
notify the scheduler about capacity changes but since this operation is
costly it should be done as sparsely as possible.
Mete Durlu [Mon, 12 Aug 2024 11:39:33 +0000 (13:39 +0200)]
s390/smp: Add cpu capacities
Linux scheduler allows architectures to assign capacity values to
individual CPUs. This hints scheduler the performance difference between
CPUs and allows more efficient task distribution them. Implement
helper methods to set and get CPU capacities for s390. This is
particularly helpful in vertical polarization configurations of LPARs.
On vertical polarization an LPARs CPUs can get different polarization
values depending on the CEC configuration. CPUs with different
polarization values can perform different from each other, using CPU
capacities this can be reflected to linux scheduler.
Tobias Huschle [Mon, 12 Aug 2024 11:39:32 +0000 (13:39 +0200)]
s390/topology: Add config option to switch to vertical during boot
By default, all systems on s390 start in horizontal cpu polarization.
Selecting the new config option SCHED_TOPOLOGY_VERTICAL allows to build
a kernel that switches to vertical polarization during boot.
Tobias Huschle [Mon, 12 Aug 2024 11:39:31 +0000 (13:39 +0200)]
s390/topology: Add sysctl handler for polarization
Provide an additional path to set the polarization of the system, such
that a user no longer relies on the sysfs interface only and is able
configure the polarization for every reboot via sysctl control files.
The new sysctl can be set as follows:
- s390.polarization=0 for horizontal polarization
- s390.polarization=1 for vertical polarization
Tobias Huschle [Mon, 12 Aug 2024 11:39:30 +0000 (13:39 +0200)]
s390/wti: Add debugfs file to display missed grace periods per cpu
Introduce a new debug file which allows to determine how many warning
track grace periods were missed on each CPU.
The new file can be found as /sys/kernel/debug/s390/wti
It is formatted as:
CPU0 CPU1 [...] CPUx
xyz xyz [...] xyz
Tobias Huschle [Mon, 12 Aug 2024 11:39:29 +0000 (13:39 +0200)]
s390/wti: Add wti accounting for missed grace periods
A virtual CPU that has received a warning-track interrupt may fail to
acknowledge the interrupt within the warning-track grace period.
While this is usually not a problem, it will become necessary to
investigate if there is a large number of such missed warning-track
interrupts. Therefore, it is necessary to track these events.
The information is tracked through the s390 debug facility and can be
found under /sys/kernel/debug/s390dbf/wti/.
The hex_ascii output is formatted as:
<pid> <symbol>
The values pid and current psw are collected when a warning track
interrupt is received. Symbol is either the kernel symbol matching the
collected psw or redacted to <user> when running in user space.
Each line represents the currently executing process when a warning
track interrupt was received which was then not acknowledged within its
grace period.
Tobias Huschle [Mon, 12 Aug 2024 11:39:28 +0000 (13:39 +0200)]
s390/wti: Prepare graceful CPU pre-emption on wti reception
When a warning track interrupt is received, the kernel has only a very
limited amount of time to make sure, that the CPU can be yielded as
gracefully as possible before being pre-empted by the hypervisor.
The interrupt handler for the wti therefore unparks a kernel thread
which has being created on boot re-using the CPU hotplug kernel thread
infrastructure. These threads exist per CPU and are assigned the
highest possible real-time priority. This makes sure, that said threads
will execute as soon as possible as the scheduler should pre-empt any
other running user tasks to run the real-time thread.
Furthermore, the interrupt handler disables all I/O interrupts to
prevent additional interrupt processing on the soon-preempted CPU.
Interrupt handlers are likely to take kernel locks, which in the worst
case, will be kept while the interrupt handler is pre-empted from itself
underlying physical CPU. In that case, all tasks or interrupt handlers
on other CPUs would have to wait for the pre-empted CPU being dispatched
again. By preventing further interrupt processing, this risk is
minimized.
Once the CPU gets dispatched again, the real-time kernel thread regains
control, reenables interrupts and parks itself again.
Tobias Huschle [Mon, 12 Aug 2024 11:39:27 +0000 (13:39 +0200)]
s390/wti: Introduce infrastructure for warning track interrupt
The warning-track interrupt (wti) provides a notification that the
receiving CPU will be pre-empted from its physical CPU within a short
time frame. This time frame is called grace period and depends on the
machine type. Giving up the CPU on time may prevent a task to get stuck
while holding a resource.
Vasily Gorbik [Wed, 28 Aug 2024 17:07:00 +0000 (19:07 +0200)]
s390/ftrace: Avoid extra serialization for graph caller patching
The only context where ftrace_enable_ftrace_graph_caller()
or ftrace_disable_ftrace_graph_caller() is called also calls
ftrace_arch_code_modify_post_process(), which already performs
text_poke_sync_lock().
Vasily Gorbik [Wed, 28 Aug 2024 17:06:58 +0000 (19:06 +0200)]
s390/ftrace: Use get/copy_from_kernel_nofault consistently
Use get/copy_from_kernel_nofault to access the kernel text consistently.
Replace memcmp() in ftrace_init_nop() to ensure that in case of
inconsistencies in the 'mcount' table, the kernel reports a failure
instead of potentially crashing.
When sequential instruction fetching facility is present,
certain guarantees are provided for code patching. In particular,
atomic overwrites within 8 aligned bytes is safe from an
instruction-fetching point of view.
Sven Schnelle [Wed, 28 Aug 2024 11:50:04 +0000 (13:50 +0200)]
s390/entry: Unify save_area_sync and save_area_async
In the past two save areas existed because interrupt handlers
and system call / program check handlers where entered with
interrupts enabled. To prevent a handler from overwriting the
save areas from the previous handler, interrupts used the async
save area, while system call and program check handler used the
sync save area.
Since the removal of critical section cleanup from entry.S, handlers are
entered with interrupts disabled. When the interrupts are re-enabled,
the save area is no longer need. Therefore merge both save areas into one.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Note this only happens when the very first random data providing
crypto card appears via hot plug in the system AND is in disabled
state ("deconfig"). Then the initial pull of random data fails and
a re-scan of the AP bus is triggered while already in the middle
of an AP bus scan caused by the appearing new hardware.
The fix is relatively simple once the scenario us understood:
The AP bus force rescan function will immediately return if there
is currently an AP bus scan running with the very same thread id.
Fixes: eacf5b3651c5 ("s390/ap: introduce mutex to lock the AP bus scan") Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
On newer machines the SHA3 performance of CPACF instructions KIMD and
KLMD can be enhanced by using additional modifier bits. This allows the
application to omit initializing the ICV, but also affects the internal
processing of the instructions. Performance is mostly gained when
processing short messages.
The new CPACF feature is backwards compatible with older machines, i.e.
the new modifier bits are ignored on older machines. However, to save the
ICV initialization, the application must detect the MSA level and omit
the ICV initialization only if this feature is supported.
s390/pkey: Add function to enforce pkey handler modules load
There is a use case during early boot with an secure key encrypted
root file system where the paes cipher may try to derive a protected
key from secure key while the AP bus is still in the process of
scanning the bus and building up the zcrypt device drivers. As the
detection of CEX cards also triggers the modprobe of the pkey handler
modules, these modules may come into existence too late.
Yet another use case happening during early boot is for use of an
protected key encrypted swap file(system). There is an ephemeral
protected key read via sysfs to set up the swap file. But this only
works when the pkey_pckmo module is already in - which may happen at a
later time as the load is triggered via CPU feature.
This patch introduces a new function pkey_handler_request_modules()
and invokes it which unconditional tries to load in the pkey handler
modules. This function is called for the in-kernel API to derive a
protected key from whatever and in the sysfs API when the first
attempt to simple invoke the handler function failed.
s390/pkey: Add slowpath function to CCA and EP11 handler
For some keys there exists an alternative but usually slower
path to convert the key material into a protected key.
This patch introduces a new handler function
slowpath_key_to_protkey()
which provides this alternate path for the CCA and EP11
handler code. With that even the knowledge about how
and when this can be used within the pkey API code can
be removed. So now the pkey API just tries the primary
way and if that fails simple tries the alternative way.
s390/pkey: Introduce pkey base with handler registry and handler modules
Introduce pkey base kernel code with a simple pkey handler registry.
Regroup the pkey code into these kernel modules:
- pkey is the pkey api supporting the ioctls, sysfs and in-kernel api.
Also the pkey base code which offers the handler registry and
handler wrapping invocation functions is integrated there. This
module is automatically loaded in via CPU feature if the MSA feature
is available.
- pkey-cca is the CCA related handler code kernel module a offering
CCA specific implementation for pkey. This module is loaded in
via MODULE_DEVICE_TABLE when a CEX[4-8] card becomes available.
- pkey-ep11 is the EP11 related handler code kernel module offering an
EP11 specific implementation for pkey. This module is loaded in via
MODULE_DEVICE_TABLE when a CEX[4-8] card becomes available.
- pkey-pckmo is the PCKMO related handler code kernel module. This
module is loaded in via CPU feature if the MSA feature is available,
but on init a check for availability of the pckmo instruction is
performed.
The handler modules register via a pkey_handler struct at the pkey
base code and the pkey customer (that is currently the pkey api code
fetches a handler via pkey handler registry functions and calls the
unified handler functions via the pkey base handler functions.
As a result the pkey-cca, pkey-ep11 and pkey-pckmo modules get
independent from each other and it becomes possible to write new
handlers which offer another kind of implementation without implicit
dependencies to other handler implementations and/or kernel device
drivers.
For each of these 4 kernel modules there is an individual Kconfig
entry: CONFIG_PKEY for the base and api, CONFIG_PKEY_CCA for the PKEY
CCA support handler, CONFIG_PKEY_EP11 for the EP11 support handler and
CONFIG_PKEY_PCKMO for the pckmo support. The both CEX related handler
modules (PKEY CCA and PKEY EP11) have a dependency to the zcrypt api
of the zcrypt device driver.
s390/pkey: Unify pkey cca, ep11 and pckmo functions signatures
As a preparation step for introducing a common function API
between the pkey API module and the handlers (that is the
cca, ep11 and pckmo code) this patch unifies the functions
signatures exposed by the handlers and reworks all the
invocation code of these functions.
s390/pkey: Rework and split PKEY kernel module code
This is a huge rework of all the pkey kernel module code.
The goal is to split the code into individual parts with
a dedicated calling interface:
- move all the sysfs related code into pkey_sysfs.c
- all the CCA related code goes to pkey_cca.c
- the EP11 stuff has been moved to pkey_ep11.c
- the PCKMO related code is now in pkey_pckmo.c
The CCA, EP11 and PCKMO code may be seen as "handlers" with
a similar calling interface. The new header file pkey_base.h
declares this calling interface. The remaining code in
pkey_api.c handles the ioctl, the pkey module things and the
"handler" independent code on top of the calling interface
invoking the handlers.
This regrouping of the code will be the base for a real
pkey kernel module split into a pkey base module which acts
as a dispatcher and handler modules providing their service.
Split the very huge ioctl handling function pkey_unlocked_ioctl()
into individual functions per each IOCTL command.
There is no change in functional code coming with this patch.
The work is a simple copy-and-paste with the goal to have
the functionality absolutely untouched.
Mete Durlu [Mon, 26 Aug 2024 12:22:48 +0000 (14:22 +0200)]
s390/hypfs_diag: Remove unused dentry variable
Remove leftover dentry variable after hypfs refactoring.
Before 2fcb3686e160, hypfs_diag.c and other hypfs files were using
debugfs_create_file() explicitly for creating debugfs files and
were storing the returned pointer.
After the refactor, common debugfs file operations and also the
related dentry pointers have been moved into hypfs_dbfs.c and
redefined as new common mechanisms.
Therefore the dentry variable and the debugfs_remove() function
calls in hypfs_diag.c are now redundant.
Current code is not effected since the dentry pointer in
hypfs_diag is implicitly assigned to NULL and debugfs_remove()
returns without an error if the passed pointer is NULL.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Vasily Gorbik [Mon, 26 Aug 2024 21:16:06 +0000 (23:16 +0200)]
s390: Always enable EXPOLINE_EXTERN if supported
Since commit ba05b39d54ee ("s390/expoline: Make modules use kernel
expolines"), there is no longer any reason not to use
CONFIG_EXPOLINE_EXTERN when supported by the compiler.
On the positive side:
- there is only a single set of expolines generated and used by both the
kernel code and modules,
- it eliminates expolines "comdat" sections, which can confuse tools
like kpatch.
Always enable EXPOLINE_EXTERN if supported by the compiler.
Jens Remus [Fri, 23 Aug 2024 10:05:15 +0000 (12:05 +0200)]
s390/disassembler: Update instruction mnemonics to latest spec
Over the course of CPU generations a few instructions got extended,
changing their base mnemonic, while keeping the former as an extended
mnemonic. Update the instruction mnemonics in the disassembler to their
latest base mnemonic as documented in the latest IBM z/Architecture
Principles of Operation specification [1].
With the IBM z14 the base mnemonics of the following vector instructions
have been changed:
- Vector FP Load Lengthened (VFLL)
- Vector FP Load Rounded (VFLR)
With Message-Security-Assist Extension 5 Perform Pseudorandom Number
Operation (PPNO) has been renamed to Perform Random Number Operation
(PRNO).
With Vector Enhancements Facility 2 the base mnemonics of the following
vector instructions have been changed:
- Vector FP Convert from Fixed (VCFPS)
- Vector FP Convert from Logical (VCFPL)
- Vector FP Convert to Fixed (VCSFP)
- Vector FP Convert to Logical (VCLFP)
[1] IBM z/Architecture Principles of Operation, SA22-7832-13, IBM z16,
https://publibfp.dhe.ibm.com/epubs/pdf/a227832d.pdf
Vasily Gorbik [Sat, 24 Aug 2024 00:14:04 +0000 (02:14 +0200)]
s390/ftrace: Avoid calling unwinder in ftrace_return_address()
ftrace_return_address() is called extremely often from
performance-critical code paths when debugging features like
CONFIG_TRACE_IRQFLAGS are enabled. For example, with debug_defconfig,
ftrace selftests on my LPAR currently execute ftrace_return_address()
as follows:
ftrace_return_address(0) - 0 times (common code uses __builtin_return_address(0) instead)
ftrace_return_address(1) - 2,986,805,401 times (with this patch applied)
ftrace_return_address(2) - 140 times
ftrace_return_address(>2) - 0 times
The use of __builtin_return_address(n) was replaced by return_address()
with an unwinder call by commit cae74ba8c295 ("s390/ftrace:
Use unwinder instead of __builtin_return_address()") because
__builtin_return_address(n) simply walks the stack backchain and doesn't
check for reaching the stack top. For shallow stacks with fewer than
"n" frames, this results in reads at low addresses and random
memory accesses.
While calling the fully functional unwinder "works", it is very slow
for this purpose. Moreover, potentially following stack switches and
walking past IRQ context is simply wrong thing to do for
ftrace_return_address().
Reimplement return_address() to essentially be __builtin_return_address(n)
with checks for reaching the stack top. Since the ftrace_return_address(n)
argument is always a constant, keep the implementation in the header,
allowing both GCC and Clang to unroll the loop and optimize it to the
bare minimum.
Jens Remus [Thu, 22 Aug 2024 15:30:21 +0000 (17:30 +0200)]
s390/build: Avoid relocation information in final vmlinux
Since commit 778666df60f0 ("s390: compile relocatable kernel without
-fPIE") the kernel vmlinux ELF file is linked with --emit-relocs to
preserve all relocations, so that all absolute relocations can be
extracted using the 'relocs' tool to adjust them during boot.
Port and adapt Petr Pavlu's x86 commit 9d9173e9ceb6 ("x86/build: Avoid
relocation information in final vmlinux") to s390 to strip all
relocations from the final vmlinux ELF file to optimize its size.
Following is his original commit message with minor adaptions for s390:
The Linux build process on s390 roughly consists of compiling all input
files, statically linking them into a vmlinux ELF file, and then taking
and turning this file into an actual bzImage bootable file.
vmlinux has in this process two main purposes:
1) It is an intermediate build target on the way to produce the final
bootable image.
2) It is a file that is expected to be used by debuggers and standard
ELF tooling to work with the built kernel.
For the second purpose, a vmlinux file is typically collected by various
package build recipes, such as distribution spec files, including the
kernel's own tar-pkg target.
When building the kernel vmlinux contains also relocation information
produced by using the --emit-relocs linker option. This is utilized by
subsequent build steps to create relocs.S and produce a relocatable
image. However, the information is not needed by debuggers and other
standard ELF tooling.
The issue is then that the collected vmlinux file and hence distribution
packages end up unnecessarily large because of this extra data. The
following is a size comparison of vmlinux v6.10 with and without the
relocation information:
Optimize a resulting vmlinux by adding a postlink step that splits the
relocation information into relocs.S and then strips it from the vmlinux
binary.
Vasily Gorbik [Wed, 21 Aug 2024 18:06:16 +0000 (20:06 +0200)]
s390/ftrace: Use kernel ftrace trampoline for modules
Now that both the kernel modules area and the kernel image itself are
located within 4 GB, there is no longer a need to maintain a separate
ftrace_plt trampoline. Use the existing trampoline in the kernel.
Heiko Carstens [Thu, 1 Aug 2024 15:16:12 +0000 (17:16 +0200)]
s390/early: Dump register contents and call trace for early crashes
If the early program check handler cannot resolve a program check dump
register contents and a call trace to the console before loading a disabled
wait psw. This makes debugging much easier.
Emit an extra message with early_printk() for cases where regular printk()
via the early console is not yet working so that at least some information
is available.
Rework debug messages:
- Remove most of the debug_sprintf_event() invocations.
- Do not split string format statements
- Remove colon after function name.
Thomas Richter [Thu, 8 Aug 2024 08:58:57 +0000 (10:58 +0200)]
s390/cpum_sf: Ignore qsi() return code
qsi() executes the instruction qsi (query sample information)
and stores the result of the query in a sample information block
pointed to by the function argument. The instruction does not
change the condition code register. The return code is always
zero. No need to check for errors. Remove now unreferenced
macros PMC_FAILURE and RS_INIT_FAILURE_QSI.
Thomas Richter [Thu, 8 Aug 2024 08:58:56 +0000 (10:58 +0200)]
s390/cpum_sf: Ignore lsctl() return code in sf_disable()
sf_disable() returns the condition code of instruction lsctl (load
sampling controls). However the parameter to lsctl() in
sf_disable() is a sample control block containing
all zeroes. This invocation of lsctl() does not fail and returns
always zero even when there is no authorization for sampling
on the machine. In short, sampling can be always turned off.
Ignore the return code of sf_disable() and change the function
return to void.
Add missing warning handling to the early program check handler. This
way a warning is printed to the console as soon as the early console
is setup, and the kernel continues to boot.
Before this change a disabled wait psw was loaded instead and the
machine was silently stopped without giving an idea about what
happened.
s390/entry: Make early program check handler relocated lowcore aware
Add the missing pieces so the early program check handler also works
with a relocated lowcore. Right now the result of an early program
check in case of a relocated lowcore would be a program check loop.
Generate the address marker array dynamically instead of modifying a large
static array at kernel startup. Each marker is added twice to the array:
with and without a "start" indicator. This way the code and logic stays
similar to other architectures.
Thomas Richter [Thu, 25 Jul 2024 12:13:06 +0000 (14:13 +0200)]
s390/cpum_sf: Use variable name cpuhw consistently
All functions but setup_pmc_cpu() use a local variable named
cpuhw to refer to struct cpu_hw_sf.
In setup_pmc_cpu() rename variable cpusf to cpuhw. This makes
the naming scheme consistent with all other functions.
No functional change.
Thomas Richter [Mon, 1 Jul 2024 08:59:09 +0000 (10:59 +0200)]
s390/cpum_cf: Move defines from header file to source file
The macros PERF_CPUM_CF_MAX_CTR and PERF_EVENT_CPUM_CF_DIAG
are used in only one source file arch/s390/kernel/perf_cpum_cf.c.
Move these defines from the header file
arch/s390/include/asm/perf_event.h to the only user.
No functional change.
Thomas Richter [Thu, 27 Jun 2024 07:45:48 +0000 (09:45 +0200)]
s390/cpum_sf: Move defines from header file to source file
Some defines in common header file arch/s390/include/asm/perf_event.h
are only used in one source file arch/s390/kernel/perf_cpum_sf.c.
Move these defines from header to source file.
No functional change.
Thomas Richter [Tue, 25 Jun 2024 10:36:21 +0000 (12:36 +0200)]
s390/cpum_sf: Rename macro to consistent prefix
Rename macro SAMPLE_FREQ_MODE to SAMPL_FREQ_MODE to make its
prefix consistent with all other macro starting with prefix
SAMPL_XXX.
No functional change.
Thomas Richter [Thu, 27 Jun 2024 07:02:06 +0000 (09:02 +0200)]
s390/cpum_sf: Remove unused defines REG_NONE and REG_OVERFLOW
Member hw_perf_event::reg.reg is set but never used, so remove it.
Defines REG_NONE and REG_OVERFLOW are not referenced anymore.
The initialization to zero takes place in function
perf_event_alloc() where
...
event = kmem_cache_alloc_node(perf_event_cache,
GFP_KERNEL | __GFP_ZERO, node);
...
makes sure memory allocated for the event is zero'ed.
This is done in the kernel's common code in kernel/events/core.c
The struct perf_event contains member hw_perf_event as in
Tetsuo Handa [Sun, 4 Aug 2024 09:48:10 +0000 (18:48 +0900)]
profiling: remove profile=sleep support
The kernel sleep profile is no longer working due to a recursive locking
bug introduced by commit 42a20f86dc19 ("sched: Add wrapper for get_wchan()
to keep task blocked")
Booting with the 'profile=sleep' kernel command line option added or
executing
# echo -n sleep > /sys/kernel/profiling
after boot causes the system to lock up.
Lockdep reports
kthreadd/3 is trying to acquire lock: ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70
but task is already holding lock: ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370
However, since nobody noticed this regression for more than two years,
let's remove 'profile=sleep' support based on the assumption that nobody
needs this functionality.
Linus Torvalds [Sun, 4 Aug 2024 15:57:08 +0000 (08:57 -0700)]
Merge tag 'x86-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Prevent a deadlock on cpu_hotplug_lock in the aperf/mperf driver.
A recent change in the ACPI code which consolidated code pathes moved
the invocation of init_freq_invariance_cppc() to be moved to a CPU
hotplug handler. The first invocation on AMD CPUs ends up enabling a
static branch which dead locks because the static branch enable tries
to acquire cpu_hotplug_lock but that lock is already held write by
the hotplug machinery.
Use static_branch_enable_cpuslocked() instead and take the hotplug
lock read for the Intel code path which is invoked from the
architecture code outside of the CPU hotplug operations.
- Fix the number of reserved bits in the sev_config structure bit field
so that the bitfield does not exceed 64 bit.
- Add missing Zen5 model numbers
- Fix the alignment assumptions of pti_clone_pgtable() and
clone_entry_text() on 32-bit:
The code assumes PMD aligned code sections, but on 32-bit the kernel
entry text is not PMD aligned. So depending on the code size and
location, which is configuration and compiler dependent, entry text
can cross a PMD boundary. As the start is not PMD aligned adding PMD
size to the start address is larger than the end address which
results in partially mapped entry code for user space. That causes
endless recursion on the first entry from userspace (usually #PF).
Cure this by aligning the start address in the addition so it ends up
at the next PMD start address.
clone_entry_text() enforces PMD mapping, but on 32-bit the tail might
eventually be PTE mapped, which causes a map fail because the PMD for
the tail is not a large page mapping. Use PTI_LEVEL_KERNEL_IMAGE for
the clone() invocation which resolves to PTE on 32-bit and PMD on
64-bit.
- Zero the 8-byte case for get_user() on range check failure on 32-bit
The recend consolidation of the 8-byte get_user() case broke the
zeroing in the failure case again. Establish it by clearing ECX
before the range check and not afterwards as that obvioulsy can't be
reached when the range check fails
* tag 'x86-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uaccess: Zero the 8-byte get_range case on failure on 32-bit
x86/mm: Fix pti_clone_entry_text() for i386
x86/mm: Fix pti_clone_pgtable() alignment assumption
x86/setup: Parse the builtin command line before merging
x86/CPU/AMD: Add models 0x60-0x6f to the Zen5 range
x86/sev: Fix __reserved field in sev_config
x86/aperfmperf: Fix deadlock on cpu_hotplug_lock
Linus Torvalds [Sun, 4 Aug 2024 15:50:16 +0000 (08:50 -0700)]
Merge tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for the timer/clocksource code:
- The recent fix to make the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed in testing as some compilers hoist the access
into the non-preemotible section where the pointer is actually
used, but obviously compilers can rightfully invoke it where the
code put it.
Move it into the non-preemptible section right to the actual usage
side to cure it.
- The clocksource watchdog is supposed to emit a warning when the
retry count is greater than one and the number of retries reaches
the limit.
The condition is backwards and warns always when the count is
greater than one. Fixup the condition to prevent spamming dmesg"
* tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
tick/broadcast: Move per CPU pointer access into the atomic section
Linus Torvalds [Sun, 4 Aug 2024 15:46:14 +0000 (08:46 -0700)]
Merge tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- When stime is larger than rtime due to accounting imprecision, then
utime = rtime - stime becomes negative. As this is unsigned math, the
result becomes a huge positive number.
Cure it by resetting stime to rtime in that case, so utime becomes 0.
- Restore consistent state when sched_cpu_deactivate() fails.
When offlining a CPU fails in sched_cpu_deactivate() after the SMT
present counter has been decremented, then the function aborts but
fails to increment the SMT present counter and leaves it imbalanced.
Consecutive operations cause it to underflow. Add the missing fixup
for the error path.
For SMT accounting the runqueue needs to marked online again in the
error exit path to restore consistent state.
* tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
sched/core: Introduce sched_set_rq_on/offline() helper
sched/smt: Fix unbalance sched_smt_present dec/inc
sched/smt: Introduce sched_smt_present_inc/dec() helper
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
Linus Torvalds [Sun, 4 Aug 2024 15:42:18 +0000 (08:42 -0700)]
Merge tag 'perf-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Thomas Gleixner:
- Move the smp_processor_id() invocation back into the non-preemtible
region, so that the result is valid to use
- Add the missing package C2 residency counters for Sierra Forest CPUs
to make the newly added support actually useful
* tag 'perf-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix smp_processor_id()-in-preemptible warnings
perf/x86/intel/cstate: Add pkg C2 residency counter for Sierra Forest
Linus Torvalds [Sun, 4 Aug 2024 15:36:57 +0000 (08:36 -0700)]
Merge tag 'irq-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A couple of fixes for interrupt chip drivers:
- Make sure to skip the clear register space in the MBIGEN driver
when calculating the node register index. Otherwise the clear
register is clobbered and the wrong node registers are accessed.
- Fix a signed/unsigned confusion in the loongarch CPU driver which
converts an error code to a huge "valid" interrupt number.
- Convert the mesion GPIO interrupt controller lock to a raw spinlock
so it works on RT.
- Add a missing static to a internal function in the pic32 EVIC
driver"
* tag 'irq-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mbigen: Fix mbigen node address layout
irqchip/meson-gpio: Convert meson_gpio_irq_controller::lock to 'raw_spinlock_t'
irqchip/irq-pic32-evic: Add missing 'static' to internal function
irqchip/loongarch-cpu: Fix return value of lpic_gsi_to_irq()
Linus Torvalds [Sun, 4 Aug 2024 15:32:31 +0000 (08:32 -0700)]
Merge tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two fixes for locking and jump labels:
- Ensure that the atomic_cmpxchg() conditions are correct and
evaluating to true on any non-zero value except 1. The missing
check of the return value leads to inconsisted state of the jump
label counter.
- Add a missing type conversion in the paravirt spinlock code which
makes loongson build again"
* tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_label: Fix the fix, brown paper bags galore
locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()
arm: dts: arm: versatile-ab: Fix duplicate clock node name
Commit 04f08ef291d4 ("arm/arm64: dts: arm: Use generic clock and
regulator nodenames") renamed nodes and created 2 "clock-24000000" nodes
(at different paths).
The kernel can't handle these duplicate names even though they are at
different paths. Fix this by renaming one of the nodes to "clock-pclk".
This name is aligned with other Arm boards (those didn't have a known
frequency to use in the node name).
Linus Torvalds [Sun, 4 Aug 2024 15:18:40 +0000 (08:18 -0700)]
Merge tag '6.11-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- two reparse point fixes
- minor cleanup
- additional trace point (to help debug a recent problem)
* tag '6.11-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
smb: client: fix FSCTL_GET_REPARSE_POINT against NetApp
smb3: add dynamic tracepoints for shutdown ioctl
cifs: Remove cifs_aio_ctx
smb: client: handle lack of FSCTL_GET_REPARSE_POINT support
Linus Torvalds [Sun, 4 Aug 2024 15:12:33 +0000 (08:12 -0700)]
Merge tag 'media/v6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- two Kconfig fixes
- one fix for the UVC driver addressing probing time detection of a UVC
custom controls
- one fix related to PDF generation
* tag 'media/v6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: v4l: Fix missing tabular column hint for Y14P format
media: intel/ipu6: select AUXILIARY_BUS in Kconfig
media: ipu-bridge: fix ipu6 Kconfig dependencies
media: uvcvideo: Fix custom control mapping probing
Linus Torvalds [Sat, 3 Aug 2024 22:12:56 +0000 (15:12 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"One core change that reverts the double message print patch in sd.c
(it was causing regressions on embedded systems).
The rest are driver fixes in ufs, mpt3sas and mpi3mr"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: exynos: Don't resume FMP when crypto support is disabled
scsi: mpt3sas: Avoid IOMMU page faults on REPORT ZONES
scsi: mpi3mr: Avoid IOMMU page faults on REPORT ZONES
scsi: ufs: core: Do not set link to OFF state while waking up from hibernation
scsi: Revert "scsi: sd: Do not repeat the starting disk message"
scsi: ufs: core: Fix deadlock during RTC update
scsi: ufs: core: Bypass quick recovery if force reset is needed
scsi: ufs: core: Check LSDBS cap when !mcq
Linus Torvalds [Sat, 3 Aug 2024 16:09:25 +0000 (09:09 -0700)]
Merge tag 'xfs-6.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix memory leak when corruption is detected during scrubbing parent
pointers
- Allow SECURE namespace xattrs to use reserved block pool to in order
to prevent ENOSPC
- Save stack space by passing tracepoint's char array to file_path()
instead of another stack variable
- Remove unused parameter in macro XFS_DQUOT_LOGRES
- Replace comma with semicolon in a couple of places
* tag 'xfs-6.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: convert comma to semicolon
xfs: convert comma to semicolon
xfs: remove unused parameter in macro XFS_DQUOT_LOGRES
xfs: fix file_path handling in tracepoints
xfs: allow SECURE namespace xattrs to use reserved block pool
xfs: fix a memory leak
Linus Torvalds [Sat, 3 Aug 2024 16:03:21 +0000 (09:03 -0700)]
Merge tag 'parisc-for-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
- fix unaligned memory accesses when calling BPF functions
- adjust memory size constants to fix possible DMA corruptions
* tag 'parisc-for-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: fix a possible DMA corruption
parisc: fix unaligned accesses in BPF
Linus Torvalds [Sat, 3 Aug 2024 01:12:06 +0000 (18:12 -0700)]
runtime constants: deal with old decrepit linkers
The runtime constants linker script depended on documented linker
behavior [1]:
"If an output section’s name is the same as the input section’s name
and is representable as a C identifier, then the linker will
automatically PROVIDE two symbols: __start_SECNAME and __stop_SECNAME,
where SECNAME is the name of the section. These indicate the start
address and end address of the output section respectively"
to just automatically define the symbol names for the bounds of the
runtime constant arrays.
It turns out that this isn't actually something we can rely on, with old
linkers not generating these automatic symbols. It looks to have been
introduced in binutils-2.29 back in 2017, and we still support building
with versions all the way back to binutils-2.25 (from 2015).
And yes, Oleg actually seems to be using such ancient versions of
binutils.
So instead of depending on the implicit symbols from "section names
match and are representable C identifiers", just do this all manually.
It's not like it causes us any extra pain, we already have to do that
for all the other sections that we use that often have special
characters in them.
Linus Torvalds [Fri, 2 Aug 2024 21:18:31 +0000 (14:18 -0700)]
Merge tag 'io_uring-6.11-20240802' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Two minor tweaks for the NAPI handling, both from Olivier:
- Kill two unused list definitions
- Ensure that multishot NAPI doesn't age away"
* tag 'io_uring-6.11-20240802' of git://git.kernel.dk/linux:
io_uring: remove unused local list heads in NAPI functions
io_uring: keep multishot request NAPI timeout current
Linus Torvalds [Fri, 2 Aug 2024 21:10:11 +0000 (14:10 -0700)]
Merge tag 'thermal-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"These fix a few issues related to the MSI IRQs management in the
int340x thermal driver, fix a thermal core issue that may lead to
missing trip point crossing events and update the thermal core
documentation.
Specifics:
- Fix MSI error path cleanup in int340x, allow it to work with a
subset of thermal MSI IRQs if some of them are not working and make
it free all MSI IRQs on module exit (Srinivas Pandruvada)
- Fix a thermal core issue that may lead to missing trip point
crossing events in some cases when thermal_zone_set_trips() is used
and update the thermal core documentation (Rafael Wysocki)"
* tag 'thermal-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: core: Update thermal zone registration documentation
thermal: trip: Avoid skipping trips in thermal_zone_set_trips()
thermal: intel: int340x: Free MSI IRQ vectors on module exit
thermal: intel: int340x: Allow limited thermal MSI support
thermal: intel: int340x: Fix kernel warning during MSI cleanup
Linus Torvalds [Fri, 2 Aug 2024 17:33:06 +0000 (10:33 -0700)]
Merge tag 'ceph-for-6.11-rc2' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a potential hang in the MDS when cap revocation races with
the client releasing the caps in question, marked for stable"
* tag 'ceph-for-6.11-rc2' of https://github.com/ceph/ceph-client:
ceph: force sending a cap update msg back to MDS for revoke op