Leonardo Bras [Tue, 17 Aug 2021 06:39:28 +0000 (03:39 -0300)]
powerpc/pseries/iommu: Make use of DDW for indirect mapping
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.
As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.
By using DDW, indirect mapping can get more TCEs than available for the
default DMA window, and also get access to using much larger pagesizes
(16MB as implemented in qemu vs 4k from default DMA window), causing a
significant increase on the maximum amount of memory that can be IOMMU
mapped at the same time.
Indirect mapping will only be used if direct mapping is not a
possibility.
For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.
Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.
Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.
Leonardo Bras [Tue, 17 Aug 2021 06:39:27 +0000 (03:39 -0300)]
powerpc/pseries/iommu: Find existing DDW with given property name
At the moment pseries stores information about created directly mapped
DDW window in DIRECT64_PROPNAME.
With the objective of implementing indirect DMA mapping with DDW, it's
necessary to have another propriety name to make sure kexec'ing into older
kernels does not break, as it would if we reuse DIRECT64_PROPNAME.
In order to have this, find_existing_ddw_windows() needs to be able to
look for different property names.
Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and calls it with current property name.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-10-leobras.c@gmail.com
Leonardo Bras [Tue, 17 Aug 2021 06:39:26 +0000 (03:39 -0300)]
powerpc/pseries/iommu: Update remove_dma_window() to accept property name
Update remove_dma_window() so it can be used to remove DDW with a given
property name.
This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.
Also, add return values to it so we can check if the property was found
while removing the active DDW. This allows skipping the remaining property
names while reducing the impact of multiple property names.
Leonardo Bras [Tue, 17 Aug 2021 06:39:25 +0000 (03:39 -0300)]
powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
Add a new helper _iommu_table_setparms(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.
Also, setting tbl->it_ops was happening outsite iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before used, declare iommu_table_lpar_multi_ops and
iommu_table_pseries_ops to before their respective iommu_table_setparms*().
Leonardo Bras [Tue, 17 Aug 2021 06:39:24 +0000 (03:39 -0300)]
powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocation and filling the property, letting it ready for
of_property_add(), which now occurs in sequence.
This created an opportunity to reorganize the second part of enable_ddw():
Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().
With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().
This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks the most expensive operation, only if
everything else succeeds.
Also, the error path got remove_ddw() replaced by a new helper
__remove_dma_window(), which only removes the new DDW with an rtas-call.
For this, a new helper clean_dma_window() was needed to clean anything
that could left if walk_system_ram_range() fails.
Leonardo Bras [Tue, 17 Aug 2021 06:39:23 +0000 (03:39 -0300)]
powerpc/pseries/iommu: Allow DDW windows starting at 0x00
enable_ddw() currently returns the address of the DMA window, which is
considered invalid if has the value 0x00.
Also, it only considers valid an address returned from find_existing_ddw
if it's not 0x00.
Changing this behavior makes sense, given the users of enable_ddw() only
need to know if direct mapping is possible. It can also allow a DMA window
starting at 0x00 to be used.
This will be helpful for using a DDW with indirect mapping, as the window
address will be different than 0x00, but it will not map the whole
partition.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-6-leobras.c@gmail.com
There are two functions creating direct_window_list entries in a
similar way, so create a ddw_list_new_entry() to avoid duplicity and
simplify those functions.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-5-leobras.c@gmail.com
Leonardo Bras [Tue, 17 Aug 2021 06:39:20 +0000 (03:39 -0300)]
powerpc/kernel/iommu: Add new iommu_table_in_use() helper
Having a function to check if the iommu table has any allocation helps
deciding if a tbl can be reset for using a new DMA window.
It should be enough to replace all instances of !bitmap_empty(tbl...).
iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.
Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.
Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.
In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.
IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show
a RPN of 52-bit, and considers a 12-bit pageshift, so there should be
no need of using TCE_RPN_MASK, which masks out any bit after 40 in rpn.
It's usage removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().
Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers not always have a
tbl struct, so adding a tceshift parameter seems the right thing to do.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-2-leobras.c@gmail.com
powerpc/numa: Update cpu_cpu_map on CPU online/offline
cpu_cpu_map holds all the CPUs in the DIE. However in PowerPC, when
onlining/offlining of CPUs, this mask doesn't get updated. This mask
is however updated when CPUs are added/removed. So when both
operations like online/offline of CPUs and adding/removing of CPUs are
done simultaneously, then cpumaps end up broken.
powerpc/numa: Print debug statements only when required
Currently, a debug message gets printed every time an attempt to
add(remove) a CPU. However this is redundant if the CPU is already added
(removed) from the node.
powerpc supported numa=debug which is not documented. This option was
used to print early debug output. However something more flexible can be
achieved by using CONFIG_DYNAMIC_DEBUG.
Hence drop dbg (and numa=debug) in favour of pr_debug
Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Rebase on to powerpc/next form2 affinity changes] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210826100521.412639-2-srikar@linux.vnet.ibm.com
powerpc/smp: Enable CACHE domain for shared processor
Currently CACHE domain is not enabled on shared processor mode PowerVM
LPARS. On PowerVM systems, 'ibm,thread-group' device-tree property 2
under cpu-device-node indicates which all CPUs share L2-cache. However
'ibm,thread-group' device-tree property 2 is a relatively new property.
In absence of 'ibm,thread-group' property 2, 'l2-cache' device property
under cpu-device-node could help system to identify CPUs sharing L2-cache.
However this property is not exposed by PhyP in shared processor mode
configurations.
In absence of properties that inform OS about which CPUs share L2-cache,
fallback on core boundary.
Here are some stats from Power9 shared LPAR with the changes.
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 3
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 16-23
NUMA node1 CPU(s): 0-15,24-31
Physical sockets: 2
Physical chips: 1
Physical cores/chip: 10
Before patch
$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
powerpc/smp: Update cpu_core_map on all PowerPc systems
lscpu() uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.
While optimizing the powerpc bootup path, Commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask"). it was found that
updating cpu_core_mask() ended up taking a lot of time. It was thought
that on Powerpc, cpu_core_mask() would always be same as
cpu_cpu_mask() i.e number of sockets will always be equal to number of
nodes. As an optimization, cpu_core_mask() was made a snapshot of
cpu_cpu_mask().
However that was found to be false with PowerPc KVM guests, where each
node could have more than one socket. So with Commit c47f892d7aa6
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id but in an optimized way using some mask manipulations
and chip_id caching.
However on non-PowerNV and non-pseries KVM guests (i.e not
implementing cpu_to_chip_id(), continued to use a copy of
cpu_cpu_mask().
There are two issues that were noticed on such systems
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 3 <--------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
2. Currently cpu_cpu_mask is updated when a core is
added/removed. However its not updated when smt mode switching or on
CPUs are explicitly offlined. However all other percpu masks are
updated to ensure only active/online CPUs are in the masks.
This results in build_sched_domain traces since there will be CPUs in
cpu_cpu_mask() but those CPUs are not present in SMT / CACHE / MC /
NUMA domains. A loop of threads running smt mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be update at smt mode switch.
This will have impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask. Different CPUs within the same socket will
end up having different cpu_core_masks since they are snapshots at
different points of time. This means when lscpu will start reporting
many more sockets than the actual number of sockets/ nodes / chips.
Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
cpu_cpu_mask is updated. This would a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
pointing to cpu_cpu_mask. However percpu cpumask pointer is frowned
upon and we need a clean way to handle PowerPc KVM guest which is
not a snapshot.
C. Update cpu_core_masks all PowerPc systems like in PowerPc KVM
guests using mask manipulations. This approach is relatively simple
and unifies with the existing code.
D. On top of 3, we could also resurrect get_physical_package_id which
could return a nid for the said CPU. However this is not needed at this
time.
Option C is the preferred approach for now.
While this is somewhat a revert of Commit 4ca234a9cbd7 ("powerpc/smp:
Stop updating cpu_core_mask").
1. Plain revert has some conflicts
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if chip_id doesn't exist.
This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.
With the fix, lscpu o/p would be
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 2 <--------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
Current code assumes that num_possible_cpus() is always greater than
threads_per_core. However this may not be true when using nr_cpus=2 or
similar options. Handle the case where num_possible_cpus() is not an
exact multiple of threads_per_core.
Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup") Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Debugged-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210826100401.412519-2-srikar@linux.vnet.ibm.com
Christophe Leroy [Wed, 25 Aug 2021 13:34:45 +0000 (13:34 +0000)]
powerpc: Redefine HMT_xxx macros as empty on PPC32
HMT_xxx macros are macros for adjusting thread priority
(hardware multi-threading) are macros inherited from PPC64
via commit 5f7c690728ac ("[PATCH] powerpc: Merged ppc_asm.h")
Those instructions are pointless on PPC32, but some common
fonctions like arch_cpu_idle() use them.
So make them empty on PPC32 to avoid those instructions.
Michael Ellerman [Thu, 26 Aug 2021 14:49:06 +0000 (00:49 +1000)]
Merge changes from Paul Gortmaker
Merge the changes to retire the legacy WR sbc8548 and sbc8641 platforms
from Paul. These were sent as a pull request, but I rebased them onto
rc2 so as not to pull too many unrelated changes in to my next.
Description from Paul's pull request follows:
In v2.6.27 (2008, 917f0af9e5a9) the sbc8260 support was implicitly
retired by not being carried forward through the ppc --> powerpc
device tree transition.
Then, in v3.6 (2012, b048b4e17cbb) we retired the support for the
sbc8560 boards.
Next, in v4.18 (2017, 3bc6cf5a86e5) we retired the support for the
2006 vintage sbc834x boards.
The sbc8548 and sbc8641d boards were maybe 1-2 years newer than the
sbc834x boards, but it is also 3+ years later, so it makes sense to
now retire them as well - which is what is done here.
These two remaining WR boards were based on the Freescale MPC8548-CDS
and the MPC8641D-HPCN reference board implementations. Having had the
chance to use these and many other Fsl ref boards, I know this: The
Freescale reference boards were typically produced in limited quantity
and primarily available to BSP developers and hardware designers, and
not likely to have found a 2nd life with hobbyists and/or collectors.
It was good to have that BSP code subjected to mainline review and
hence also widely available back in the day. But given the above, we
should probably also be giving serious consideration to retiring
additional similar age/type reference board platforms as well.
I've always felt it is important for us to be proactive in retiring
old code, since it has a genuine non-zero carrying cost, as described
in the 930d52c012b8 merge log. But for the here and now, we just
clean up the remaining BSP code that I had added for SBC platforms.
Paul Gortmaker [Thu, 7 Jan 2021 18:45:32 +0000 (13:45 -0500)]
powerpc: retire sbc8641d board support
The support was for this was added to mainline over 12 years ago, in
v2.6.26 [4e8aae89a35d] just around the ppc --> powerpc migration.
I believe the board was introduced shortly after the sbc8548 board,
making it roughly a 14 year old platform - with the CPU speed and
memory size typical for that era.
I haven't had one of these boards for several years, and availability
was discontinued several years before that.
Given that, there is no point in adding a burden to testing coverage
that builds all possible defconfigs, so it makes sense to remove it.
Of course it will remain in the git history forever, for anyone who
happens to find a functional board and wants to tinker with it.
Acked-by: Scott Wood <oss@buserror.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Gortmaker [Thu, 7 Jan 2021 13:40:38 +0000 (08:40 -0500)]
powerpc: retire sbc8548 board support
The support was for this was mainlined 13 years ago, in v2.6.25
[0e0fffe88767] just around the ppc --> powerpc migration.
I believe the board was introduced a year or two before that, so it
is roughly a 15 year old platform - with the CPU speed and memory size
that was typical for that era.
I haven't had one of these boards for several years, and availability
was discontinued several years before that.
Given that, there is no point in adding a burden to testing coverage
that builds all possible defconfigs, so it makes sense to remove it.
Of course it will remain in the git history forever, for anyone who
happens to find a functional board and wants to tinker with it.
Acked-by: Scott Wood <oss@buserror.net> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 23 Aug 2021 08:24:20 +0000 (08:24 +0000)]
powerpc: Remove MSR_PR check in interrupt_exit_{user/kernel}_prepare()
In those hot functions that are called at every interrupt, any saved
cycle is worth it.
interrupt_exit_user_prepare() and interrupt_exit_kernel_prepare() are
called from three places:
- From entry_32.S
- From interrupt_64.S
- From interrupt_exit_user_restart() and interrupt_exit_kernel_restart()
In entry_32.S, there are inambiguously called based on MSR_PR:
interrupt_return_\srr\():
ld r4,_MSR(r1)
andi. r0,r4,MSR_PR
beq interrupt_return_\srr\()_kernel
interrupt_return_\srr\()_user: /* make backtraces match the _kernel variant */
addi r3,r1,STACK_FRAME_OVERHEAD
bl interrupt_exit_user_prepare
...
interrupt_return_\srr\()_kernel:
addi r3,r1,STACK_FRAME_OVERHEAD
bl interrupt_exit_kernel_prepare
In interrupt_exit_user_restart() and interrupt_exit_kernel_restart(),
MSR_PR is verified respectively by BUG_ON(!user_mode(regs)) and
BUG_ON(user_mode(regs)) prior to calling interrupt_exit_user_prepare()
and interrupt_exit_kernel_prepare().
The verification in interrupt_exit_user_prepare() and
interrupt_exit_kernel_prepare() are therefore useless and can be removed.
Xiongwei Song [Sat, 7 Aug 2021 01:02:38 +0000 (09:02 +0800)]
powerpc: Add dear as a synonym for pt_regs.dar register
Create an anonymous union for dar and dear regsiters, we can reference
dear to get the effective address when CONFIG_4xx=y or CONFIG_BOOKE=y.
Otherwise, reference dar. This makes code more clear.
Xiongwei Song [Sat, 7 Aug 2021 01:02:36 +0000 (09:02 +0800)]
powerpc: Add esr as a synonym for pt_regs.dsisr
Create an anonymous union for dsisr and esr regsiters, we can reference
esr to get the exception detail when CONFIG_4xx=y or CONFIG_BOOKE=y.
Otherwise, reference dsisr. This makes code more clear.
Jordan Niethe [Thu, 29 Jul 2021 04:13:17 +0000 (14:13 +1000)]
selftests: Skip TM tests on synthetic TM implementations
Transactional Memory was removed from the architecture in ISA v3.1. For
threads running in P8/P9 compatibility mode on P10 a synthetic TM
implementation is provided. In this implementation, tbegin. always sets
cr0 eq meaning the abort handler is always called. This is not an issue
as users of TM are expected to have a fallback non transactional way to
make forward progress in the abort handler. The TEXASR indicates if a
transaction failure is due to a synthetic implementation.
Some of the TM self tests need a non-degenerate TM implementation for
their testing to be meaningful so check for a synthetic implementation
and skip the test if so.
Jordan Niethe [Thu, 29 Jul 2021 04:13:16 +0000 (14:13 +1000)]
selftests/powerpc: Add missing clobbered register to to ptrace TM tests
ISA v3.1 removes TM but includes a synthetic implementation for
backwards compatibility. With this implementation, the tests
ptrace-tm-spd-gpr and ptrace-tm-gpr should never be able to make any
forward progress and eventually should be killed by the timeout.
Instead on a P10 running in P9 mode, ptrace_tm_gpr fails like so:
test: ptrace_tm_gpr
tags: git_version:unknown
Starting the child
...
...
GPR[27]: 1 Expected: 2
GPR[28]: 1 Expected: 2
GPR[29]: 1 Expected: 2
GPR[30]: 1 Expected: 2
GPR[31]: 1 Expected: 2
[FAIL] Test FAILED on line 98
failure: ptrace_tm_gpr
selftests: ptrace-tm-gpr [FAIL]
The problem is in the inline assembly of the child. r0 is loaded with a
value in the child's transaction abort handler but this register is not
included in the clobbers list. This means it is possible that this
statement:
cptr[1] = 0;
which is meant to signal the parent to wait may actually use the value
placed into r0 by the inline assembly incorrectly signal the parent to
continue.
By inspection the same problem is present in ptrace-tm-spd-gpr.
Adding r0 to the clobbbers list makes the test fail correctly via a
timeout on a P10 running in P8/P9 compatibility mode.
Kajol Jain [Wed, 18 Aug 2021 17:15:56 +0000 (22:45 +0530)]
powerpc/perf: Fix the check for SIAR value
Incase of random sampling, there can be scenarios where
Sample Instruction Address Register(SIAR) may not latch
to the sampled instruction and could result in
the value of 0. In these scenarios it is preferred to
return regs->nip. These corner cases are seen in the
previous generation (p9) also.
Patch adds the check for SIAR value along with regs_use_siar
and siar_valid checks so that the function will return
regs->nip incase SIAR is zero.
Patch drops the code under PPMU_P10_DD1 flag check
which handles SIAR 0 case only for Power10 DD1.
Nicholas Piggin [Wed, 11 Aug 2021 16:00:43 +0000 (02:00 +1000)]
KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest SPRs are live
After the L1 saves its PMU SPRs but before loading the L2's PMU SPRs,
switch the pmcregs_in_use field in the L1 lppaca to the value advertised
by the L2 in its VPA. On the way out of the L2, set it back after saving
the L2 PMU registers (if they were in-use).
This transfers the PMU liveness indication between the L1 and L2 at the
points where the registers are not live.
This fixes the nested HV bug for which a workaround was added to the L0
HV by commit 63279eeb7f93a ("KVM: PPC: Book3S HV: Always save guest pmu
for guest capable of nesting"), which explains the problem in detail.
That workaround is no longer required for guests that include this bug
fix.
Fixes: 360cae313702 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Link: https://lore.kernel.org/r/20210811160134.904987-10-npiggin@gmail.com
Fabiano Rosas [Wed, 11 Aug 2021 16:00:41 +0000 (02:00 +1000)]
KVM: PPC: Book3S HV Nested: Stop forwarding all HFUs to L1
If the nested hypervisor has no access to a facility because it has
been disabled by the host, it should also not be able to see the
Hypervisor Facility Unavailable that arises from one of its guests
trying to access the facility.
This patch turns a HFU that happened in L2 into a Hypervisor Emulation
Assistance interrupt and forwards it to L1 for handling. The ones that
happened because L1 explicitly disabled the facility for L2 are still
let through, along with the corresponding Cause bits in the HFSCR.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
[np: move handling into kvmppc_handle_nested_exit] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210811160134.904987-8-npiggin@gmail.com
Nicholas Piggin [Wed, 11 Aug 2021 16:00:40 +0000 (02:00 +1000)]
KVM: PPC: Book3S HV Nested: Make nested HFSCR state accessible
When the L0 runs a nested L2, there are several permutations of HFSCR
that can be relevant. The HFSCR that the L1 vcpu L1 requested, the
HFSCR that the L1 vcpu may use, and the HFSCR that is actually being
used to run the L2.
The L1 requested HFSCR is not accessible outside the nested hcall
handler, so copy that into a new kvm_nested_guest.hfscr field.
The permitted HFSCR is taken from the HFSCR that the L1 runs with,
which is also not accessible while the hcall is being made. Move
this into a new kvm_vcpu_arch.hfscr_permitted field.
These will be used by the next patch to improve facility handling
for nested guests, and later by facility demand faulting patches.
As one of the arguments of the H_ENTER_NESTED hypercall, the nested
hypervisor (L1) prepares a structure containing the values of various
hypervisor-privileged registers with which it wants the nested guest
(L2) to run. Since the nested HV runs in supervisor mode it needs the
host to write to these registers.
To stop a nested HV manipulating this mechanism and using a nested
guest as a proxy to access a facility that has been made unavailable
to it, we have a routine that sanitises the values of the HV registers
before copying them into the nested guest's vcpu struct.
However, when coming out of the guest the values are copied as they
were back into L1 memory, which means that any sanitisation we did
during guest entry will be exposed to L1 after H_ENTER_NESTED returns.
This patch alters this sanitisation to have effect on the vcpu->arch
registers directly before entering and after exiting the guest,
leaving the structure that is copied back into L1 unchanged (except
when we really want L1 to access the value, e.g the Cause bits of
HFSCR).
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Link: https://lore.kernel.org/r/20210811160134.904987-6-npiggin@gmail.com
Have the TM softpatch emulation code set up the HFAC interrupt and
return -1 in case an instruction was executed with HFSCR bits clear,
and have the interrupt exit handler fall through to the HFAC handler.
When the L0 is running a nested guest, this ensures the HFAC interrupt
is correctly passed up to the L1.
The "direct guest" exit handler will turn these into PROGILL program
interrupts so functionality in practice will be unchanged. But it's
possible an L1 would want to handle these in a different way.
Also rearrange the FAC interrupt emulation code to match the HFAC format
while here (mainly, adding the FSCR_INTR_CAUSE mask).
The softpatch interrupt sets HSRR0 to the faulting instruction +4, so
it should subtract 4 for the faulting instruction address in the case
it is a TM softpatch interrupt (the instruction was not executed) and
it was not emulated.
Nicholas Piggin [Wed, 11 Aug 2021 16:00:35 +0000 (02:00 +1000)]
KVM: PPC: Book3S HV: Initialise vcpu MSR with MSR_ME
It is possible to create a VCPU without setting the MSR before running
it, which results in a warning in kvmhv_vcpu_entry_p9() that MSR_ME is
not set. This is pretty harmless because the MSR_ME bit is added to
HSRR1 before HRFID to guest, and a normal qemu guest doesn't hit it.
Fabiano Rosas [Thu, 5 Aug 2021 21:26:16 +0000 (18:26 -0300)]
KVM: PPC: Book3S HV: Stop exporting symbols from book3s_64_mmu_radix
The book3s_64_mmu_radix.o object is not part of the KVM builtins and
all the callers of the exported symbols are in the same kvm-hv.ko
module so we should not need to export any symbols.
Fabiano Rosas [Thu, 5 Aug 2021 21:26:15 +0000 (18:26 -0300)]
KVM: PPC: Book3S HV: Add sanity check to copy_tofrom_guest
Both paths into __kvmhv_copy_tofrom_guest_radix ensure that we arrive
with an effective address that is smaller than our total addressable
space and addresses quadrant 0.
- The H_COPY_TOFROM_GUEST hypercall path rejects the call with
H_PARAMETER if the effective address has any of the twelve most
significant bits set.
- The kvmhv_copy_tofrom_guest_radix path clears the top twelve bits
before calling the internal function.
Although the callers make sure that the effective address is sane, any
future use of the function is exposed to a programming error, so add a
sanity check.
The __kvmhv_copy_tofrom_guest_radix function was introduced along with
nested HV guest support. It uses the platform's Radix MMU quadrants to
provide a nested hypervisor with fast access to its nested guests
memory (H_COPY_TOFROM_GUEST hypercall). It has also since been added
as a fast path for the kvmppc_ld/st routines which are used during
instruction emulation.
The commit def0bfdbd603 ("powerpc: use probe_user_read() and
probe_user_write()") changed the low level copy function from
raw_copy_from_user to probe_user_read, which adds a check to
access_ok. In powerpc that is:
and TASK_SIZE_MAX is 0x0010000000000000UL for 64-bit, which means that
setting the two MSBs of the effective address (which correspond to the
quadrant) now cause access_ok to reject the access.
This was not caught earlier because the most common code path via
kvmppc_ld/st contains a fallback (kvm_read_guest) that is likely to
succeed for L1 guests. For nested guests there is no fallback.
Another issue is that probe_user_read (now __copy_from_user_nofault)
does not return the number of bytes not copied in case of failure, so
the destination memory is not being cleared anymore in
kvmhv_copy_from_guest_radix:
ret = kvmhv_copy_tofrom_guest_radix(vcpu, eaddr, to, NULL, n);
if (ret > 0) <-- always false!
memset(to + (n - ret), 0, ret);
This patch fixes both issues by skipping access_ok and open-coding the
low level __copy_to/from_user_inatomic.
Fixes: def0bfdbd603 ("powerpc: use probe_user_read() and probe_user_write()") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210805212616.2641017-2-farosas@linux.ibm.com
Cédric Le Goater [Mon, 23 Aug 2021 09:00:38 +0000 (11:00 +0200)]
powerpc/prom: Fix unused variable ‘reserve_map’ when CONFIG_PPC32 is not set
This fixes a compile error with W=1.
arch/powerpc/kernel/prom.c: In function ‘early_reserve_mem’:
arch/powerpc/kernel/prom.c:625:10: error: variable ‘reserve_map’ set but not used [-Werror=unused-but-set-variable]
__be64 *reserve_map;
^~~~~~~~~~~
cc1: all warnings being treated as errors
Christophe Leroy [Mon, 23 Aug 2021 06:45:20 +0000 (06:45 +0000)]
powerpc/syscalls: Remove __NR__exit
__NR__exit is nowhere used. On most architectures it was removed by
commit 135ab6ec8fda ("[PATCH] remove remaining errno and
__KERNEL_SYSCALLS__ references") but not on powerpc.
powerpc removed __KERNEL_SYSCALLS__ in commit 3db03b4afb3e ("[PATCH]
rename the provided execve functions to kernel_execve"), but __NR__exit
was left over.
Kajol Jain [Fri, 13 Aug 2021 08:21:58 +0000 (13:51 +0530)]
powerpc/perf/hv-gpci: Fix counter value parsing
H_GetPerformanceCounterInfo (0xF080) hcall returns the counter data in
the result buffer. Result buffer has specific format defined in the PAPR
specification. One of the fields is counter offset and width of the
counter data returned.
Counter data are returned in a unsigned char array in big endian byte
order. To get the final counter data, the values must be left shifted
byte at a time. But commit 220a0c609ad17 ("powerpc/perf: Add support for
the hv gpci (get performance counter info) interface") made the shifting
bitwise and also assumed little endian order. Because of that, hcall
counters values are reported incorrectly.
In particular this can lead to counters go backwards which messes up the
counter prev vs now calculation and leads to huge counter value
reporting:
Lukas Bulwahn [Thu, 19 Aug 2021 11:39:53 +0000 (13:39 +0200)]
powerpc/kvm: Remove obsolete and unneeded select
Commit a278e7ea608b ("powerpc: Fix compile issue with force DAWR")
selects the non-existing config PPC_DAWR_FORCE_ENABLE for config
KVM_BOOK3S_64_HANDLER. As this commit also introduces a config PPC_DAWR
and this config PPC_DAWR is selected with PPC if PPC64, there is no
need for any further select in the KVM_BOOK3S_64_HANDLER.
Remove an obsolete and unneeded select in config KVM_BOOK3S_64_HANDLER.
The issue was identified with ./scripts/checkkconfigsymbols.py.
Fixes: a278e7ea608b ("powerpc: Fix compile issue with force DAWR") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210819113954.17515-2-lukas.bulwahn@gmail.com
Wan Jiabing [Tue, 23 Mar 2021 06:29:05 +0000 (14:29 +0800)]
powerpc: Remove duplicate includes
interrupt.c: asm/interrupt.h has been included at line 12, so remove the
duplicate one at line 10.
time.c: linux/sched/clock.h has been included at line 33,so remove the
duplicate one at line 56 and move sched/cputime.h under sched including
segament.
Joel Stanley [Tue, 17 Aug 2021 04:54:06 +0000 (14:24 +0930)]
powerpc/config: Renable MTD_PHYSMAP_OF
CONFIG_MTD_PHYSMAP_OF is not longer enabled as it depends on
MTD_PHYSMAP which is not enabled.
This is a regression from commit 642b1e8dbed7 ("mtd: maps: Merge
physmap_of.c into physmap-core.c"), which added the extra dependency.
Add CONFIG_MTD_PHYSMAP=y so this stays in the config, as Christophe said
it is useful for build coverage.
Fixes: 642b1e8dbed7 ("mtd: maps: Merge physmap_of.c into physmap-core.c") Signed-off-by: Joel Stanley <joel@jms.id.au> Acked-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817045407.2445664-3-joel@jms.id.au
Fangrui Song [Fri, 13 Aug 2021 20:05:11 +0000 (13:05 -0700)]
powerpc: Add "-z notext" flag to disable diagnostic
Object files used to link .tmp_vmlinux.kallsyms1 have many
R_PPC64_ADDR64 relocations in non-SHF_WRITE sections. There are many
text relocations (e.g. in .rela___ksymtab_gpl+* and .rela__mcount_loc
sections) in a -pie link and are disallowed by LLD:
ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against local symbol in readonly segment; recompile object files with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output
>>> defined in arch/powerpc/kernel/head_64.o
>>> referenced by arch/powerpc/kernel/head_64.o:(__restart_table+0x10)
Newer GNU ld configured with "--enable-textrel-check=error" will report
an error as well:
Add "-z notext" to suppress the errors. Non-CONFIG_RELOCATABLE builds
use the default -no-pie mode and thus R_PPC64_ADDR64 relocations can be
resolved at link-time.
Reported-by: Itaru Kitayama <itaru.kitayama@riken.jp> Co-developed-by: Bill Wendling <morbo@google.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Bill Wendling <morbo@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210813200511.1905703-1-morbo@google.com
powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
powerpc BUG_ON() and WARN_ON() are based on using twnei instruction.
For catching simple conditions like a variable having value 0, this
is efficient because it does the test and the trap at the same time.
But most conditions used with BUG_ON or WARN_ON are more complex and
forces GCC to format the condition into a 0 or 1 value in a register.
This will usually require 2 to 3 instructions.
The most efficient solution would be to use __builtin_trap() because
GCC is able to optimise the use of the different trap instructions
based on the requested condition, but this is complex if not
impossible for the following reasons:
- __builtin_trap() is a non-recoverable instruction, so it can't be
used for WARN_ON
- Knowing which line of code generated the trap would require the
analysis of DWARF information. This is not a feature we have today.
As mentioned in commit 8d4fbcfbe0a4 ("Fix WARN_ON() on bitfield ops")
the way WARN_ON() is implemented is suboptimal. That commit also
mentions an issue with 'long long' condition. It fixed it for
WARN_ON() but the same problem still exists today with BUG_ON() on
PPC32. It will be fixed by using the generic implementation.
By using the generic implementation, gcc will naturally generate a
branch to the unconditional trap generated by BUG().
As modern powerpc implement zero-cycle branch,
that's even more efficient.
And for the functions using WARN_ON() and its return, the test
on return from WARN_ON() is now also used for the WARN_ON() itself.
On PPC64 we don't want it because we want to be able to use CFAR
register to track how we entered the code that trapped. The CFAR
register would be clobbered by the branch.
A simple test function:
unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}
powerpc/pseries: Add support for FORM2 associativity
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.
powerpc/pseries: Add a helper for form1 cpu distance
This helper is only used with the dispatch trace log collection.
A later patch will add Form2 affinity support and this change helps
in keeping that simpler. Also add a comment explaining we don't expect
the code to be called with FORM0
powerpc/pseries: Consolidate different NUMA distance update code paths
The associativity details of the newly added resourced are collected from
the hypervisor via "ibm,configure-connector" rtas call. Update the numa
distance details of the newly added numa node after the above call.
Instead of updating NUMA distance every time we lookup a node id
from the associativity property, add helpers that can be used
during boot which does this only once. Also remove the distance
update from node id lookup helpers.
Currently, we duplicate parsing code for ibm,associativity and
ibm,associativity-lookup-arrays in the kernel. The associativity array provided
by these device tree properties are very similar and hence can use
a helper to parse the node id and numa distance details.
powerpc: rename powerpc_debugfs_root to arch_debugfs_dir
No functional change in this patch. arch_debugfs_dir is the generic kernel
name declared in linux/debugfs.h for arch-specific debugfs directory.
Architectures like x86/s390 already use the name. Rename powerpc
specific powerpc_debugfs_root to arch_debugfs_dir.
cpufreq: powernv: Fix init_chip_info initialization in numa=off
In the numa=off kernel command-line configuration init_chip_info() loops
around the number of chips and attempts to copy the cpumask of that node
which is NULL for all iterations after the first chip.
Hence, store the cpu mask for each chip instead of derving cpumask from
node while populating the "chips" struct array and copy that to the
chips[i].mask
Fixes: 053819e0bf84 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level") Cc: stable@vger.kernel.org # v4.3+ Reported-by: Shirisha Ganta <shirisha.ganta1@ibm.com> Signed-off-by: Pratik R. Sampat <psampat@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Rename goto label to out_free_chip_cpu_mask] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210728120500.87549-2-psampat@linux.ibm.com
powerpc: wii.dts: Reduce the size of the control area
This is wrong, but needed in order to avoid overlapping ranges with the
OTP area added in the next commit. A refactor of this part of the
device tree is needed: according to Wiibrew[1], this area starts at
0x0d800000 and spans 0x400 bytes (that is, 0x100 32-bit registers),
encompassing PIC and GPIO registers, amongst the ones already exposed in
this device tree, which should become children of the control@d800000
node.
Marc Zyngier [Mon, 2 Aug 2021 16:26:28 +0000 (17:26 +0100)]
powerpc: Bulk conversion to generic_handle_domain_irq()
Wherever possible, replace constructs that match either
generic_handle_irq(irq_find_mapping()) or
generic_handle_irq(irq_linear_revmap()) to a single call to
generic_handle_domain_irq().
KVM: PPC: Book3S HV: XIVE: Add support for automatic save-restore
On P10, the feature doing an automatic "save & restore" of a VCPU
interrupt context is set by default in OPAL. When a VP context is
pulled out, the state of the interrupt registers are saved by the XIVE
interrupt controller under the internal NVP structure representing the
VP. This saves a costly store/load in guest entries and exits.
If OPAL advertises the "save & restore" feature in the device tree,
it should also have set the 'H' bit in the CAM line. Check that when
vCPUs are connected to their ICP in KVM before going any further.
There is no need to use the lockup detector ("noirqdebug") for IPIs.
The ipistorm benchmark measures a ~10% improvement on high systems
when this flag is set.
powerpc/xive: Use XIVE domain under xmon and debugfs
The default domain of the PCI/MSIs is not the XIVE domain anymore. To
list the IRQ mappings under XMON and debugfs, query the IRQ data from
the low level XIVE domain.