Nathan Lynch [Wed, 17 Nov 2021 06:02:58 +0000 (00:02 -0600)]
powerpc/rtas: rtas_busy_delay() improvements
Generally RTAS cannot block, and in PAPR it is required to return control
to the OS within a few tens of microseconds. In order to support operations
which may take longer to complete, many RTAS primitives can return
intermediate -2 ("busy") or 990x ("extended delay") values, which indicate
that the OS should reattempt the same call with the same arguments at some
point in the future.
Current versions of PAPR are less than clear about this, but the intended
meanings of these values in more detail are:
RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation in
order to meet its latency obligation and give the OS the opportunity to
perform other work. RTAS can resume making progress as soon as the OS
reattempts the call.
RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an external
event to occur or for internal contention to resolve before it can complete
the requested operation. The value encodes a non-binding hint as to roughly
how long the OS should wait before calling again, but the OS is allowed to
reattempt the call sooner or even immediately.
Linux of course must take its own CPU scheduling obligations into account
when handling these statuses; e.g. a task which receives an RTAS_BUSY
status should check whether to reschedule before it attempts the RTAS call
again to avoid starving other tasks.
rtas_busy_delay() is a helper function that "consumes" a busy or extended
delay status. Common usage:
int rc;
do {
rc = rtas_call(rtas_token("some-function"), ...);
} while (rtas_busy_delay(rc));
/* convert rc to Linux error value, etc */
If rc is a busy or extended delay status, the caller can rely on
rtas_busy_delay() to perform an appropriate sleep or reschedule and return
nonzero. Other statuses are handled normally by the caller.
The current implementation of rtas_busy_delay() both oversleeps and
overuses the CPU:
* It performs msleep() for all 990x and even when no delay is
suggested (-2), but this is understood to actually sleep for two jiffies
minimum in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms)
appear to be the most common extended delay statuses, and the
oversleeping measurably lengthens DLPAR operations, which perform
many RTAS calls.
* It does not sleep on 990x unless need_resched() is true, causing code
like the loop above to needlessly retry, wasting CPU time.
Alter the logic to align better with the intended meanings:
* When passed RTAS_BUSY, perform cond_resched() and return without
sleeping. The caller should reattempt immediately
* Always sleep when passed an extended delay status, using usleep_range()
for precise shorter sleeps. Limit the sleep time to one second even
though there are higher architected values.
Change rtas_busy_delay()'s return type to bool to better reflect its usage,
and add kernel-doc.
rtas_busy_delay_time() is unchanged, even though it "incorrectly" returns 1
for RTAS_BUSY. There are users of that API with open-coded delay loops in
sensitive contexts that will have to be taken on an individual basis.
Brief results for addition and removal of 5GB memory on a small P9 PowerVM
partition follow. Load was generated with stress-ng --cpu N. For add,
elapsed time is greatly reduced without significant change in the number of
RTAS calls or time spent on CPU. For remove, elapsed time is modestly
reduced, with significant reductions in RTAS calls and time spent on CPU.
This code supports functions from Power4-era servers that are not present
on targets currently supported by arch/powerpc. System manuals from this
time have this description:
Scan Dump data is a set of chip data that the service processor gathers
after a system malfunction. It consists of chip scan rings, chip trace
arrays, and Scan COM (SCOM) registers. This data is stored in the
scan-log partition of the system’s Nonvolatile Random Access
Memory (NVRAM).
PowerVM partition firmware development doesn't recognize the associated
function call or property, and they don't see any references to them in
their codebase. It seems to have been specific to non-virtualized pseries.
References:
Historical Linux commit from February 2003 (interesting to note this seems
to be the source of non-GPL exports for rtas_call etc):
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=f92e361842d5251e50562b09664082dcbd0548bb
IntelliStation and pSeries docs which refer to the feature:
http://ps-2.retropc.se/basil.holloway/ALL%20PDF/380635.pdf
http://ps-2.kev009.com/rs6000/manuals/p/p615-6C3-6E3/6C3_and_6E3_Users_Guide_SA38-0629.pdf
Nathan Lynch [Tue, 16 Nov 2021 21:58:06 +0000 (15:58 -0600)]
powerpc/rtas: kernel-doc fixes
Fix the following issues reported by kernel-doc:
$ scripts/kernel-doc -v -none arch/powerpc/kernel/rtas.c
arch/powerpc/kernel/rtas.c:810: info: Scanning doc for function rtas_activate_firmware
arch/powerpc/kernel/rtas.c:818: warning: contents before sections
arch/powerpc/kernel/rtas.c:841: info: Scanning doc for function rtas_call_reentrant
arch/powerpc/kernel/rtas.c:893: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Find a specific pseries error log in an RTAS extended event log.
Christophe Leroy [Mon, 15 Nov 2021 10:12:22 +0000 (11:12 +0100)]
powerpc/code-patching: Improve verification of patchability
Today, patch_instruction() assumes that it is called exclusively on
valid addresses, and only checks that it is not called on an init
address after init section has been freed.
Improve verification by calling kernel_text_address() instead.
kernel_text_address() already includes a verification of
initmem release.
Hari Bathini [Tue, 12 Oct 2021 12:30:56 +0000 (18:00 +0530)]
bpf ppc32: Access only if addr is kernel address
With KUAP enabled, any kernel code which wants to access userspace
needs to be surrounded by disable-enable KUAP. But that is not
happening for BPF_PROBE_MEM load instruction. Though PPC32 does not
support read protection, considering the fact that PTR_TO_BTF_ID
(which uses BPF_PROBE_MEM mode) could either be a valid kernel pointer
or NULL but should never be a pointer to userspace address, execute
BPF_PROBE_MEM load only if addr is kernel address, otherwise set
dst_reg=0 and move on.
This will catch NULL, valid or invalid userspace pointers. Only bad
kernel pointer will be handled by BPF exception table.
[Alexei suggested for x86]
Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-9-hbathini@linux.ibm.com
Hari Bathini [Tue, 12 Oct 2021 12:30:55 +0000 (18:00 +0530)]
bpf ppc32: Add BPF_PROBE_MEM support for JIT
BPF load instruction with BPF_PROBE_MEM mode can cause a fault
inside kernel. Append exception table for such instructions
within BPF program.
Unlike other archs which uses extable 'fixup' field to pass dest_reg
and nip, BPF exception table on PowerPC follows the generic PowerPC
exception table design, where it populates both fixup and extable
sections within BPF program. fixup section contains 3 instructions,
first 2 instructions clear dest_reg (lower & higher 32-bit registers)
and last instruction jumps to next instruction in the BPF code.
extable 'insn' field contains relative offset of the instruction and
'fixup' field contains relative offset of the fixup entry. Example
layout of BPF program with extable present:
Ravi Bangoria [Tue, 12 Oct 2021 12:30:54 +0000 (18:00 +0530)]
bpf ppc64: Access only if addr is kernel address
On PPC64 with KUAP enabled, any kernel code which wants to
access userspace needs to be surrounded by disable-enable KUAP.
But that is not happening for BPF_PROBE_MEM load instruction.
So, when BPF program tries to access invalid userspace address,
page-fault handler considers it as bad KUAP fault:
Kernel attempted to read user page (d0000000) - exploit attempt? (uid: 0)
Considering the fact that PTR_TO_BTF_ID (which uses BPF_PROBE_MEM
mode) could either be a valid kernel pointer or NULL but should
never be a pointer to userspace address, execute BPF_PROBE_MEM load
only if addr is kernel address, otherwise set dst_reg=0 and move on.
This will catch NULL, valid or invalid userspace pointers. Only bad
kernel pointer will be handled by BPF exception table.
[Alexei suggested for x86]
Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211012123056.485795-7-hbathini@linux.ibm.com
Ravi Bangoria [Tue, 12 Oct 2021 12:30:53 +0000 (18:00 +0530)]
bpf ppc64: Add BPF_PROBE_MEM support for JIT
BPF load instruction with BPF_PROBE_MEM mode can cause a fault
inside kernel. Append exception table for such instructions
within BPF program.
Unlike other archs which uses extable 'fixup' field to pass dest_reg
and nip, BPF exception table on PowerPC follows the generic PowerPC
exception table design, where it populates both fixup and extable
sections within BPF program. fixup section contains two instructions,
first instruction clears dest_reg and 2nd jumps to next instruction
in the BPF code. extable 'insn' field contains relative offset of
the instruction and 'fixup' field contains relative offset of the
fixup entry. Example layout of BPF program with extable present:
The EEH recovery logic in eeh_handle_normal_event() has some pretty strange
flow control. If we remove all the actual recovery logic we're left with
the following skeleton:
Most of the "if () { ... }" blocks above change "result" to
PCI_ERS_RESULT_DISCONNECTED if an error occurs in that recovery step. This
makes the control flow a bit confusing since it breaks the early-exit
pattern that is generally used in Linux. In any case we end up handling the
error in the final else block so why not just jump there directly? Doing so
also allows us to de-indent a bunch of code.
No functional changes.
[dja: rebase on top of linux-next + my preceeding refactor,
move clearing the EEH_DEV_NO_HANDLER bit above the first goto so that
it is always clear in the error handler code as it was before.]
Daniel Axtens [Fri, 15 Oct 2021 07:06:27 +0000 (18:06 +1100)]
powerpc/eeh: Small refactor of eeh_handle_normal_event()
The control flow of eeh_handle_normal_event() is a bit tricky.
Break out one of the error handling paths - rather than be in an else
block, we'll make it part of the regular body of the function and put a
'goto out;' in the true limb of the if.
StoreEOI is activated by default on platforms supporting the feature
(POWER10) and will be used as soon as firmware advertises its
availability. The kernel parameter provides a way to deactivate its
use. It can be still be reactivated through debugfs.
The XIVE driver under Linux uses a single interrupt priority and only
one event queue is configured per CPU. Expose the contents under
a 'xive/eqs/cpuX' debugfs file.
powerpc/xive: Rename the 'cpus' debugfs file to 'ipis'
and remove the EQ entries output which is not very useful since only
the next two events of the queue are taken into account. We will
improve the dump of the EQ in the next patches.
StoreEOI (the capability to EOI with a store) requires load-after-store
ordering in some cases to be reliable. P10 introduced a new offset for
load operations to enforce correct ordering and the XIVE driver has
the required support since kernel 5.8, commit b1f9be9392f0
("powerpc/xive: Enforce load-after-store ordering when StoreEOI is active")
Since skiboot v7, StoreEOI support is advertised on P10 with a new flag
on the PowerNV platform. See skiboot commit 4bd7d84afe46 ("xive/p10:
Introduce a new OPAL_XIVE_IRQ_STORE_EOI2 flag"). When detected,
activate the feature.
Nicholas Piggin [Mon, 3 May 2021 13:02:43 +0000 (23:02 +1000)]
powerpc/powernv: Remove POWER9 PVR version check for entry and uaccess flushes
These aren't necessarily POWER9 only, and it's not to say some new
vulnerability may not get discovered on other processors for which
we would like the flexibility of having the workaround enabled by
firmware.
Remove the restriction that the workarounds only apply to POWER9.
However POWER7 and POWER8 are not affected, and they may not have
older firmware that does not advertise this, so clear these workarounds
manually.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au>
[mpe: Incorporate changes from Nick, reword comment slightly.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210503130243.891868-5-npiggin@gmail.com
Michael Ellerman [Thu, 25 Nov 2021 00:23:59 +0000 (11:23 +1100)]
Merge branch 'topic/ppc-kvm' into next
This merge's Nick's big P9 KVM series, original cover letter follows:
KVM: PPC: Book3S HV P9: entry/exit optimisations
This reduces radix guest full entry/exit latency on POWER9 and POWER10
by 2x.
Nested HV guests should see smaller improvements in their L1 entry/exit,
but this is also combined with most L0 speedups also applying to nested
entry. nginx localhost throughput test in a SMP nested guest is improved
about 10% (in a direct guest it doesn't change much because it uses XIVE
for IPIs) when L0 and L1 are patched.
- Test SPR values to avoid mtSPRs where possible. mtSPRs are expensive.
- Reduce mftb. mftb is expensive.
- Demand fault certain facilities to avoid saving and/or restoring them
(at the cost of fault when they are used, but this is mitigated over
a number of entries, like the facilities when context switching
processes). PM, TM, and EBB so far.
- Defer some sequences that are made just in case a guest is interrupted
in the middle of a critical section to the case where the guest is
scheduled on a different CPU, rather than every time (at the cost of
an extra IPI in this case). Namely the tlbsync sequence for radix with
GTSE, which is very expensive.
- Reduce locking, barriers, atomics related to the vcpus-per-vcore > 1
handling that the P9 path does not require.
On POWER9 and newer, rather than the complex HMI synchronisation and
subcore state, have each thread un-apply the guest TB offset before
calling into the early HMI handler.
This allows the subcore state to be avoided, including subcore enter
/ exit guest, which includes an expensive divide that shows up
slightly in profiles.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:30 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Stop using vc->dpdes
The P9 path uses vc->dpdes only for msgsndp / SMT emulation. This adds
an ordering requirement between vcpu->doorbell_request and vc->dpdes for
no real benefit. Use vcpu->doorbell_request directly.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:27 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Avoid cpu_in_guest atomics on entry and exit
cpu_in_guest is set to determine if a CPU needs to be IPI'ed to exit
the guest and notice the need_tlb_flush bit.
This can be implemented as a global per-CPU pointer to the currently
running guest instead of per-guest cpumasks, saving 2 atomics per
entry/exit. P7/8 doesn't require cpu_in_guest, nor does a nested HV
(only the L0 does), so move it to the P9 HV path.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:25 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Avoid changing MSR[RI] in entry and exit
kvm_hstate.in_guest provides the equivalent of MSR[RI]=0 protection,
and it covers the existing MSR[RI]=0 section in late entry and early
exit, so clearing and setting MSR[RI] in those cases does not
actually do anything useful.
Remove the RI manipulation and replace it with comments. Make the
in_guest memory accesses a bit closer to a proper critical section
pattern. This speeds up guest entry/exit performance.
This also removes the MSR[RI] warnings which aren't very interesting
and would cause crashes if they hit due to causing an interrupt in
non-recoverable code.
slbmfee/slbmfev instructions are very expensive, moreso than a regular
mfspr instruction, so minimising them significantly improves hash guest
exit performance. The slbmfev is only required if slbmfee found a valid
SLB entry.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:22 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV Nested: Avoid extra mftb() in nested entry
mftb() is expensive and one can be avoided on nested guest dispatch.
If the time checking code distinguishes between the L0 timer and the
nested HV timer, then both can be tested in the same place with the
same mftb() value.
This also nicely illustrates the relationship between the L0 and nested
HV timers.
Use the existing TLB flushing logic to IPI the previous CPU and run the
necessary barriers before running a guest vCPU on a new physical CPU,
to do the necessary radix GTSE barriers for handling the case of an
interrupted guest tlbie sequence.
This requires the vCPU TLB flush sequence that is currently just done
on one thread, to be expanded to ensure the other threads execute a
ptesync, because causing them to exit the guest will no longer cause a
ptesync by itself.
This results in more IPIs than the TLB flush logic requires, but it's
a significant win for common case scheduling when the vCPU remains on
the same physical CPU.
This saves about 520 cycles (nearly 10%) on a guest entry+exit micro
benchmark on a POWER9.
Tighten up partition switching code synchronisation and comments.
In particular, hwsync ; isync is required after the last access that is
performed in the context of a partition, before the partition is
switched away from.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:16 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Use Linux SPR save/restore to manage some host SPRs
Linux implements SPR save/restore including storage space for registers
in the task struct for process context switching. Make use of this
similarly to the way we make use of the context switching fp/vec save
restore.
This improves code reuse, allows some stack space to be saved, and helps
with avoiding VRSAVE updates if they are not required.
Use HFSCR facility disabling to implement demand faulting for TM, with
a hysteresis counter similar to the load_fp etc counters in context
switching that implement the equivalent demand faulting for userspace
facilities.
This speeds up guest entry/exit by avoiding the register save/restore
when a guest is not frequently using them. When a guest does use them
often, there will be some additional demand fault overhead, but these
are not commonly used facilities.
Use HFSCR facility disabling to implement demand faulting for EBB, with
a hysteresis counter similar to the load_fp etc counters in context
switching that implement the equivalent demand faulting for userspace
facilities.
This speeds up guest entry/exit by avoiding the register save/restore
when a guest is not frequently using them. When a guest does use them
often, there will be some additional demand fault overhead, but these
are not commonly used facilities.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:11 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Switch PMU to guest as late as possible
This moves PMU switch to guest as late as possible in entry, and switch
back to host as early as possible at exit. This helps the host get the
most perf coverage of KVM entry/exit code as possible.
This is slightly suboptimal for SPR scheduling point of view when the
PMU is enabled, but when perf is disabled there is no real difference.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:07 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Move host OS save/restore functions to built-in
Move the P9 guest/host register switching functions to the built-in
P9 entry code, and export it for nested to use as well.
This allows more flexibility in scheduling these supervisor privileged
SPR accesses with the HV privileged and PR SPR accesses in the low level
entry code.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:05 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Juggle SPR switching around
This juggles SPR switching on the entry and exit sides to be more
symmetric, which makes the next refactoring patch possible with no
functional change.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:01 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV P9: Move TB updates
Move the TB updates between saving and loading guest and host SPRs,
to improve scheduling by keeping issue-NTC operations together as
much as possible.
Nicholas Piggin [Tue, 23 Nov 2021 09:52:00 +0000 (19:52 +1000)]
KVM: PPC: Book3S HV: Change dec_expires to be relative to guest timebase
Change dec_expires to be relative to the guest timebase, and allow
it to be moved into low level P9 guest entry functions, to improve
SPR access scheduling.
Moving the mtmsrd after the host SPRs are saved and before the guest
SPRs start to be loaded can prevent an SPR scoreboard stall (because
the mtmsrd is L=1 type which does not cause context synchronisation.
This is also now more convenient to combined with the mtmsrd L=0
instruction to enable facilities just below, but that is not done yet.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:57 +0000 (19:51 +1000)]
KVM: PPC: Book3S HV P9: Reduce mtmsrd instructions required to save host SPRs
This reduces the number of mtmsrd required to enable facility bits when
saving/restoring registers, by having the KVM code set all bits up front
rather than using individual facility functions that set their particular
MSR bits.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:53 +0000 (19:51 +1000)]
KVM: PPC: Book3S HV P9: Demand fault PMU SPRs when marked not inuse
The pmcregs_in_use field in the guest VPA can not be trusted to reflect
what the guest is doing with PMU SPRs, so the PMU must always be managed
(stopped) when exiting the guest, and SPR values set when entering the
guest to ensure it can't cause a covert channel or otherwise cause other
guests or the host to misbehave.
So prevent guest access to the PMU with HFSCR[PM] if pmcregs_in_use is
clear, and avoid the PMU SPR access on every partition switch. Guests
that set pmcregs_in_use incorrectly or when first setting it and using
the PMU will take a hypervisor facility unavailable interrupt that will
bring in the PMU SPRs.
Rather than guest/host save/retsore functions, implement context switch
functions that take care of details like the VPA update for nested.
The reason to split these kind of helpers into explicit save/load
functions is mainly to schedule SPR access nicely, but PMU is a special
case where the load requires mtSPR (to stop counters) and other
difficulties, so there's less possibility to schedule those nicely. The
SPR accesses also have side-effects if the PMU is running, and in later
changes we keep the host PMU running as long as possible so this code
can be better profiled, which also complicates scheduling.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:50 +0000 (19:51 +1000)]
powerpc/64s: Implement PMU override command line option
It can be useful in simulators (with very constrained environments)
to allow some PMCs to run from boot so they can be sampled directly
by a test harness, rather than having to run perf.
A previous change freezes counters at boot by default, so provide
a boot time option to un-freeze (plus a bit more flexibility).
Nicholas Piggin [Tue, 23 Nov 2021 09:51:49 +0000 (19:51 +1000)]
powerpc/64s: Always set PMU control registers to frozen/disabled when not in use
KVM PMU management code looks for particular frozen/disabled bits in
the PMU registers so it knows whether it must clear them when coming
out of a guest or not. Setting this up helps KVM make these optimisations
without getting confused. Longer term the better approach might be to
move guest/host PMU switching to the perf subsystem.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:48 +0000 (19:51 +1000)]
KVM: PPC: Book3S HV: Don't always save PMU for guest capable of nesting
Provide a config option that controls the workaround added by commit 63279eeb7f93 ("KVM: PPC: Book3S HV: Always save guest pmu for guest
capable of nesting"). The option defaults to y for now, but is expected
to go away within a few releases.
Nested capable guests running with the earlier commit 178266389794
("KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest
SPRs are live") will now indicate the PMU in-use status of their guests,
which means the parent does not need to unconditionally save the PMU for
nested capable guests.
After this latest round of performance optimisations, this option costs
about 540 cycles or 10% entry/exit performance on a POWER9 nested-capable
guest.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
References: 178266389794 ("KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest SPRs are live") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211123095231.1036501-11-npiggin@gmail.com
Nicholas Piggin [Tue, 23 Nov 2021 09:51:47 +0000 (19:51 +1000)]
powerpc/64s: Keep AMOR SPR a constant ~0 at runtime
This register controls supervisor SPR modifications, and as such is only
relevant for KVM. KVM always sets AMOR to ~0 on guest entry, and never
restores it coming back out to the host, so it can be kept constant and
avoid the mtSPR in KVM guest entry.
HV interrupts may be taken with the MMU enabled when radix guests are
running. Enable LPCR[HAIL] on ISA v3.1 processors for radix guests.
Make this depend on the host LPCR[HAIL] being enabled. Currently that is
always enabled, but having this test means any issue that might require
LPCR[HAIL] to be disabled in the host will not have to be duplicated in
KVM.
This optimisation takes 1380 cycles off a NULL hcall entry+exit micro
benchmark on a POWER10.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:45 +0000 (19:51 +1000)]
powerpc/time: add API for KVM to re-arm the host timer/decrementer
Rather than have KVM look up the host timer and fiddle with the
irq-work internal details, have the powerpc/time.c code provide a
function for KVM to re-arm the Linux timer code when exiting a
guest.
This is implementation has an improvement over existing code of
marking a decrementer interrupt as soft-pending if a timer has
expired, rather than setting DEC to a -ve value, which tended to
cause host timers to take two interrupts (first hdec to exit the
guest, then the immediate dec).
Nicholas Piggin [Tue, 23 Nov 2021 09:51:44 +0000 (19:51 +1000)]
KVM: PPC: Book3S HV P9: Reduce mftb per guest entry/exit
mftb is serialising (dispatch next-to-complete) so it is heavy weight
for a mfspr. Avoid reading it multiple times in the entry or exit paths.
A small number of cycles delay to timers is tolerable.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:43 +0000 (19:51 +1000)]
KVM: PPC: Book3S HV P9: Use large decrementer for HDEC
On processors that don't suppress the HDEC exceptions when LPCR[HDICE]=0,
this could help reduce needless guest exits due to leftover exceptions on
entering the guest.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:42 +0000 (19:51 +1000)]
KVM: PPC: Book3S HV P9: Use host timer accounting to avoid decrementer read
There is no need to save away the host DEC value, as it is derived
from the host timer subsystem which maintains the next timer time,
so it can be restored from there.
Nicholas Piggin [Tue, 23 Nov 2021 09:51:41 +0000 (19:51 +1000)]
KMV: PPC: Book3S HV P9: Use set_dec to set decrementer to host
The host Linux timer code arms the decrementer with the value
'decrementers_next_tb - current_tb' using set_dec(), which stores
val - 1 on Book3S-64, which is not quite the same as what KVM does
to re-arm the host decrementer when exiting the guest.
This shouldn't be a significant change, but it makes the logic match
and avoids this small extra change being brought into the next patch.
The POWER9 ERAT flush instruction is a SLBIA with IH=7, which is a
reserved value on POWER7/8. On POWER8 this invalidates the SLB entries
above index 0, similarly to SLBIA IH=0.
If the SLB entries are invalidated, and then the guest is bypassed, the
host SLB does not get re-loaded, so the bolted entries above 0 will be
lost. This can result in kernel stack access causing a SLB fault.
Kernel stack access causing a SLB fault was responsible for the infamous
mega bug (search "Fix SLB reload bug"). Although since commit 48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C") that
starts using the kernel stack in the SLB miss handler, it might only
result in an infinite loop of SLB faults. In any case it's a bug.
Fix this by only executing the instruction on >= POWER9 where IH=7 is
defined not to invalidate the SLB. POWER7/8 don't require this ERAT
flush.
Fixes: 500871125920 ("KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211119031627.577853-1-npiggin@gmail.com
Linus Torvalds [Sun, 21 Nov 2021 19:25:19 +0000 (11:25 -0800)]
Merge tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Move the command line preparation and the early command line parsing
earlier so that the command line parameters which affect
early_reserve_memory(), e.g. efi=nosftreserve, are taken into
account. This was broken when the invocation of
early_reserve_memory() was moved recently.
- Use an atomic type for the SGX page accounting, which is read and
written locklessly, to plug various race conditions related to it.
* tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Fix free page accounting
x86/boot: Pull up cmdline preparation and early param parsing
Linus Torvalds [Sun, 21 Nov 2021 19:17:50 +0000 (11:17 -0800)]
Merge tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Thomas Gleixner:
- Remove unneded PEBS disabling when taking LBR snapshots to prevent an
unchecked MSR access error.
- Fix IIO event constraints for Snowridge and Skylake server chips.
* tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/perf: Fix snapshot_branch_stack warning in VM
perf/x86/intel/uncore: Fix IIO event constraints for Snowridge
perf/x86/intel/uncore: Fix IIO event constraints for Skylake Server
perf/x86/intel/uncore: Fix filter_tid mask for CHA events on Skylake Server
Linus Torvalds [Sun, 21 Nov 2021 18:26:35 +0000 (10:26 -0800)]
Merge tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc fixes from Michael Ellerman:
- Fix a bug in copying of sigset_t for 32-bit systems, which caused X
to not start.
- Fix handling of shared LSIs (rare) with the xive interrupt controller
(Power9/10).
- Fix missing TOC setup in some KVM code, which could result in oopses
depending on kernel data layout.
- Fix DMA mapping when we have persistent memory and only one DMA
window available.
- Fix further problems with STRICT_KERNEL_RWX on 8xx, exposed by a
recent fix.
- A couple of other minor fixes.
Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Cédric Le Goater,
Christian Zigotzky, Christophe Leroy, Daniel Axtens, Finn Thain, Greg
Kurz, Masahiro Yamada, Nicholas Piggin, and Uwe Kleine-König.
* tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/xive: Change IRQ domain to a tree domain
powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX
powerpc/signal32: Fix sigset_t copy
powerpc/book3e: Fix TLBCAM preset at boot
powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window
powerpc/pseries/ddw: simplify enable_ddw()
powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
powerpc/pseries: Fix numa FORM2 parsing fallback code
powerpc/pseries: rename numa_dist_table to form2_distances
powerpc: clean vdso32 and vdso64 directories
powerpc/83xx/mpc8349emitx: Drop unused variable
KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()
Linus Torvalds [Sat, 20 Nov 2021 21:17:24 +0000 (13:17 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"15 patches.
Subsystems affected by this patch series: ipc, hexagon, mm (swap,
slab-generic, kmemleak, hugetlb, kasan, damon, and highmem), and proc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
proc/vmcore: fix clearing user buffer by properly using clear_user()
kmap_local: don't assume kmap PTEs are linear arrays in memory
mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
kasan: test: silence intentional read overflow warnings
hugetlb, userfaultfd: fix reservation restore on userfaultfd error
hugetlb: fix hugetlb cgroup refcounting during mremap
mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
hexagon: ignore vmlinux.lds
hexagon: clean up timer-regs.h
hexagon: export raw I/O routines for modules
mm: emit the "free" trace report before freeing memory in kmem_cache_free()
shm: extend forced shm destroy to support objects from several IPC nses
ipc: WARN if trying to remove ipc object which is absent
mm/swap.c:put_pages_list(): reinitialise the page list
Linus Torvalds [Sat, 20 Nov 2021 19:05:10 +0000 (11:05 -0800)]
Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Flip a cap check to avoid a selinux error (Alistair)
- Fix for a regression this merge window where we can miss a queue ref
put (me)
- Un-mark pstore-blk as broken, as the condition that triggered that
change has been rectified (Kees)
- Queue quiesce and sync fixes (Ming)
- FUA insertion fix (Ming)
- blk-cgroup error path put fix (Yu)
* tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block:
blk-mq: don't insert FUA request with data into scheduler queue
blk-cgroup: fix missing put device in error path from blkg_conf_pref()
block: avoid to quiesce queue in elevator_init_mq
Revert "mark pstore-blk as broken"
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
block: fix missing queue put in error path
block: Check ADMIN before NICE for IOPRIO_CLASS_RT
Linus Torvalds [Sat, 20 Nov 2021 18:59:03 +0000 (10:59 -0800)]
Merge tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"There is an ACPI stubs fix which is ACKed by the ACPI maintainer for
merging through my tree.
One item stand out and that is that I delete the <linux/sdb.h> header
that is used by nothing. I deleted this subsystem (through the GPIO
tree) a while back so I feel responsible for tidying up the floor.
Other than that it is the usual mistakes, a bit noisy around build
issue and Kconfig then driver fixes.
Specifics:
- Fix some stubs causing compile issues for ACPI.
- Fix some wakeups on AMD IRQs shared between GPIO and SCI.
- Fix a build warning in the Tegra driver.
- Fix a Kconfig issue in the Qualcomm driver.
- Add a missing include the RALink driver.
- Return a valid type for the Apple pinctrl IRQs.
- Implement some Qualcomm SDM845 dual-edge errata.
- Remove the unused <linux/sdb.h> header. (The subsystem was once
deleted by the pinctrl maintainer...)
- Fix a duplicate initialized in the Tegra driver.
- Fix register offsets for UFS and SDC in the Qualcomm SM8350 driver"
* tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: qcom: sm8350: Correct UFS and SDC offsets
pinctrl: tegra194: remove duplicate initializer again
Remove unused header <linux/sdb.h>
pinctrl: qcom: sdm845: Enable dual edge errata
pinctrl: apple: Always return valid type in apple_gpio_irq_type
pinctrl: ralink: include 'ralink_regs.h' in 'pinctrl-mt7620.c'
pinctrl: qcom: fix unmet dependencies on GPIOLIB for GPIOLIB_IRQCHIP
pinctrl: tegra: Return const pointer from tegra_pinctrl_get_group()
pinctrl: amd: Fix wakeups when IRQ is shared with SCI
ACPI: Add stubs for wakeup handler functions
Linus Torvalds [Sat, 20 Nov 2021 18:55:50 +0000 (10:55 -0800)]
Merge tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Add missing Kconfig option for ftrace direct multi sample, so it can
be compiled again, and also add s390 support for this sample.
- Update Christian Borntraeger's email address.
- Various fixes for memory layout setup. Besides other this makes it
possible to load shared DCSS segments again.
- Fix copy to user space of swapped kdump oldmem.
- Remove -mstack-guard and -mstack-size compile options when building
vdso binaries. This can happen when CONFIG_VMAP_STACK is disabled and
results in broken vdso code which causes more or less random
exceptions. Also remove the not needed -nostdlib option.
- Fix memory leak on cpu hotplug and return code handling in kexec
code.
- Wire up futex_waitv system call.
- Replace snprintf with sysfs_emit where appropriate.
* tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
ftrace/samples: add s390 support for ftrace direct multi sample
ftrace/samples: add missing Kconfig option for ftrace direct multi sample
MAINTAINERS: update email address of Christian Borntraeger
s390/kexec: fix memory leak of ipl report buffer
s390/kexec: fix return code handling
s390/dump: fix copying to user-space of swapped kdump oldmem
s390: wire up sys_futex_waitv system call
s390/vdso: filter out -mstack-guard and -mstack-size
s390/vdso: remove -nostdlib compiler flag
s390: replace snprintf in show functions with sysfs_emit
s390/boot: simplify and fix kernel memory layout setup
s390/setup: re-arrange memblock setup
s390/setup: avoid using memblock_enforce_memory_limit
s390/setup: avoid reserving memory above identity mapping
Linus Torvalds [Sat, 20 Nov 2021 18:47:16 +0000 (10:47 -0800)]
Merge tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small cifs/smb3 fixes: two to address minor coverity issues and
one cleanup"
* tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: introduce cifs_ses_mark_for_reconnect() helper
cifs: protect srv_count with cifs_tcp_ses_lock
cifs: move debug print out of spinlock
proc/vmcore: fix clearing user buffer by properly using clear_user()
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":
Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().
To fix, properly use clear_user() when we're dealing with a user buffer.
Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Philipp Rudo <prudo@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ard Biesheuvel [Sat, 20 Nov 2021 00:43:55 +0000 (16:43 -0800)]
kmap_local: don't assume kmap PTEs are linear arrays in memory
The kmap_local conversion broke the ARM architecture, because the new
code assumes that all PTEs used for creating kmaps form a linear array
in memory, and uses array indexing to look up the kmap PTE belonging to
a certain kmap index.
On ARM, this cannot work, not only because the PTE pages may be
non-adjacent in memory, but also because ARM/!LPAE interleaves hardware
entries and extended entries (carrying software-only bits) in a way that
is not compatible with array indexing.
Fortunately, this only seems to affect configurations with more than 8
CPUs, due to the way the per-CPU kmap slots are organized in memory.
Work around this by permitting an architecture to set a Kconfig symbol
that signifies that the kmap PTEs do not form a lineary array in memory,
and so the only way to locate the appropriate one is to walk the page
tables.
Link: https://lore.kernel.org/linux-arm-kernel/20211026131249.3731275-1-ardb@kernel.org/ Link: https://lkml.kernel.org/r/20211116094737.7391-1-ardb@kernel.org Fixes: 2a15ba82fa6c ("ARM: highmem: Switch to generic kmap atomic") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reported-by: Quanyang Wang <quanyang.wang@windriver.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SeongJae Park [Sat, 20 Nov 2021 00:43:52 +0000 (16:43 -0800)]
mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
DAMON debugfs is supposed to protect dbgfs_ctxs, dbgfs_nr_ctxs, and
dbgfs_dirs using damon_dbgfs_lock. However, some of the code is
accessing the variables without the protection. This fixes it by
protecting all such accesses.
Link: https://lkml.kernel.org/r/20211110145758.16558-3-sj@kernel.org Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SeongJae Park [Sat, 20 Nov 2021 00:43:49 +0000 (16:43 -0800)]
mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
Patch series "DAMON fixes".
This patch (of 2):
DAMON users can trigger below warning in '__alloc_pages()' by invoking
write() to some DAMON debugfs files with arbitrarily high count
argument, because DAMON debugfs interface allocates some buffers based
on the user-specified 'count'.
As done in commit d73dad4eb5ad ("kasan: test: bypass __alloc_size
checks") for __write_overflow warnings, also silence some more cases
that trip the __read_overflow warnings seen in 5.16-rc1[1]:
In file included from include/linux/string.h:253,
from include/linux/bitmap.h:10,
from include/linux/cpumask.h:12,
from include/linux/mm_types_task.h:14,
from include/linux/mm_types.h:5,
from include/linux/page-flags.h:13,
from arch/arm64/include/asm/mte.h:14,
from arch/arm64/include/asm/pgtable.h:12,
from include/linux/pgtable.h:6,
from include/linux/kasan.h:29,
from lib/test_kasan.c:10:
In function 'memcmp',
inlined from 'kasan_memcmp' at lib/test_kasan.c:897:2:
include/linux/fortify-string.h:263:25: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
263 | __read_overflow();
| ^~~~~~~~~~~~~~~~~
In function 'memchr',
inlined from 'kasan_memchr' at lib/test_kasan.c:872:2:
include/linux/fortify-string.h:277:17: error: call to '__read_overflow' declared with attribute error: detected read beyond size of object (1st parameter)
277 | __read_overflow();
| ^~~~~~~~~~~~~~~~~