powerpc/pci: Move PHB discovery for PCI_DN using platforms
Make powernv, pseries, powermac and maple use ppc_mc.discover_phbs.
These platforms need to be done together because they all depend on
pci_dn's being created from the DT. The pci_dn contains a pointer to
the relevant pci_controller so they need to be created after the
pci_controller structures are available, but before PCI devices are
scanned. Currently this ordering is provided by initcalls and the
sequence is:
1. PHBs are discovered (setup_arch) (early boot, pre-initcalls)
2. pci_dn are created from the unflattended DT (core initcall)
3. PHBs are scanned pcibios_init() (subsys initcall)
The new ppc_md.discover_phbs() function is also a core_initcall so we
can't guarantee ordering between the creation of pci_controllers and
the creation of pci_dn's which require a pci_controller. We could use
the postcore, or core_sync initcall levels, but it's cleaner to just
move the pci_dn setup into the per-PHB inits which occur inside of
.discover_phb() for these platforms. This brings the boot-time path in
line with the PHB hotplug path that is used for pseries DLPAR
operations too.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Squash powermac & maple in to avoid breakage those platforms,
convert memblock allocs to use kmalloc to avoid warnings] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201103043523.916109-2-oohall@gmail.com
On many powerpc platforms the discovery and initalisation of
pci_controllers (PHBs) happens inside of setup_arch(). This is very early
in boot (pre-initcalls) and means that we're initialising the PHB long
before many basic kernel services (slab allocator, debugfs, a real ioremap)
are available.
On PowerNV this causes an additional problem since we map the PHB registers
with ioremap(). As of commit d538aadc2718 ("powerpc/ioremap: warn on early
use of ioremap()") a warning is printed because we're using the "incorrect"
API to setup and MMIO mapping in searly boot. The kernel does provide
early_ioremap(), but that is not intended to create long-lived MMIO
mappings and a seperate warning is printed by generic code if
early_ioremap() mappings are "leaked."
This is all fixable with dumb hacks like using early_ioremap() to setup
the initial mapping then replacing it with a real ioremap later on in
boot, but it does raise the question: Why the hell are we setting up the
PHB's this early in boot?
The old and wise claim it's due to "hysterical rasins." Aside from amused
grapes there doesn't appear to be any real reason to maintain the current
behaviour. Already most of the newer embedded platforms perform PHB
discovery in an arch_initcall and between the end of setup_arch() and the
start of initcalls none of the generic kernel code does anything PCI
related. On powerpc scanning PHBs occurs in a subsys_initcall so it should
be possible to move the PHB discovery to a core, postcore or arch initcall.
This patch adds the ppc_md.discover_phbs hook and a core_initcall stub that
calls it. The core_initcalls are the earliest to be called so this will
any possibly issues with dependency between initcalls. This isn't just an
academic issue either since on pseries and PowerNV EEH init occurs in an
arch_initcall and depends on the pci_controllers being available, similarly
the creation of pci_dns occurs at core_initcall_sync (i.e. between core and
postcore initcalls). These problems need to be addressed seperately.
The pnv_phb->initialized flag is an odd beast. It was added back in 2012 in
commit db1266c85261 ("powerpc/powernv: Skip check on PE if necessary") to
allow devices to be enabled even if the device had not yet been assigned to
a PE. Allowing the device to be enabled before the PE is configured may
cause spurious EEH events since none of the IOMMU context has been setup.
I'm not entirely sure why this was ever necessary. My best guess is that it
was an workaround for a bug or some other undesireable behaviour from the
PCI core. Either way, it's unnecessary now since as of commit dc3d8f85bb57
("powerpc/powernv/pci: Re-work bus PE configuration") we can guarantee that
the PE will be configured before the PCI core will allow drivers to bind to
the device.
It's also worth pointing out that the ->initialized flag is only set in
pnv_pci_ioda_create_dbgfs(). That function has its entire body wrapped
in #ifdef CONFIG_DEBUG_FS. As a result, for kernels built without debugfs
(i.e. petitboot) the other checks in pnv_pci_enable_device_hook() are
bypassed entirely.
Christophe Leroy [Wed, 23 Dec 2020 09:38:48 +0000 (09:38 +0000)]
powerpc/xmon: Enable breakpoints on 8xx
Since commit 4ad8622dc548 ("powerpc/8xx: Implement hw_breakpoint"),
8xx has breakpoints so there is no reason to opt breakpoint logic
out of xmon for the 8xx.
Markus Elfring [Tue, 27 Aug 2019 06:44:20 +0000 (08:44 +0200)]
powerpc/82xx: Delete an unnecessary of_node_put() call in pq2ads_pci_init_irq()
A null pointer would be passed to a call of the function “of_node_put”
immediately after a call of the function “of_find_compatible_node” failed
at one place.
Remove this superfluous function call.
This issue was detected by using the Coccinelle software.
Markus Elfring [Tue, 27 Aug 2019 11:34:02 +0000 (13:34 +0200)]
powerpc/pseries: Delete an unnecessary kfree() call in dlpar_store()
A null pointer would be passed to a call of the function “kfree”
immediately after a call of the function “kstrdup” failed at one place.
Remove this superfluous function call.
This issue was detected by using the Coccinelle software.
Qian Cai [Sun, 10 May 2020 05:15:59 +0000 (01:15 -0400)]
powerpc/mm/book3s64/iommu: fix some RCU-list locks
It is safe to traverse mm->context.iommu_group_mem_list with either
mem_list_mutex or the RCU read lock held. Silence a few RCU-list false
positive warnings and fix a few missing RCU read locks.
arch/powerpc/mm/book3s64/iommu_api.c:330 RCU-list traversed in non-reader section!!
Ganesh Goudar [Thu, 28 Jan 2021 10:41:43 +0000 (16:11 +0530)]
powerpc/mce: Remove per cpu variables from MCE handlers
Access to per-cpu variables requires translation to be enabled on
pseries machine running in hash mmu mode, Since part of MCE handler
runs in realmode and part of MCE handling code is shared between ppc
architectures pseries and powernv, it becomes difficult to manage
these variables differently on different architectures, So have
these variables in paca instead of having them as per-cpu variables
to avoid complications.
Ganesh Goudar [Thu, 28 Jan 2021 10:41:42 +0000 (16:11 +0530)]
powerpc/mce: Reduce the size of event arrays
Maximum recursive depth of MCE is 4, Considering the maximum depth
allowed reduce the size of event to 10 from 100. This saves us ~19kB
of memory and has no fatal consequences.
Pingfan Liu [Thu, 22 Oct 2020 06:51:19 +0000 (14:51 +0800)]
powerpc/time: Enable sched clock for irqtime
When CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_VIRT_CPU_ACCOUNTING_GEN, powerpc
does not enable "sched_clock_irqtime" and can not utilize irq time
accounting.
Like x86, powerpc does not use the sched_clock_register() interface. So it
needs an dedicated call to enable_sched_clock_irqtime() to enable irq time
accounting.
The "ibm,arch-vec-5-platform-support" property is a list of pairs of
bytes representing the options and values supported by the platform
firmware. At boot time, Linux scans this list and activates the
available features it recognizes : Radix and XIVE.
A recent change modified the number of entries to loop on and 8 bytes,
4 pairs of { options, values } entries are always scanned. This is
fine on KVM but not on PowerVM which can advertises less. As a
consequence on this platform, Linux reads extra entries pointing to
random data, interprets these as available features and tries to
activate them, leading to a firmware crash in
ibm,client-architecture-support.
Fix that by using the property length of "ibm,arch-vec-5-platform-support".
Fixes: ab91239942a9 ("powerpc/prom: Remove VLA in prom_check_platform_support()") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210122075029.797013-1-clg@kaod.org
powerpc/eeh: Add a debugfs interface to check if a driver supports recovery
If a PCI device's current driver implements the error handling callbacks
EEH can use them to recover the device after an error occurs. For devices
without the error handling callbacks we recover them by removing the device
and re-scanning it so the PCI core puts the device back into a known good
state.
Currently there's no way for userspace to determine if the driver supports
recovery or not which makes it difficult to write automated tests for EEH.
This patch addressing that by adding a debugfs interface for querying if
a specific device can be recovered or not.
The basic EEH test ignores VFs since we the way the eeh_dev_break debugfs
interface works means that if multiple VFs are enabled we may cause errors
on all them them. However, we can work around that by only enabling a
single VF at a time.
This patch adds some infrastructure for finding SR-IOV capable devices and
enabling / disabling VFs so we can exercise the VF specific EEH recovery
paths. Two new tests are added, one for testing EEH aware devices and one
for EEH un-aware VFs.
powerpc/sstep: Fix incorrect return from analyze_instr()
We currently just percolate the return value from analyze_instr()
to the caller of emulate_step(), especially if it is a -1.
For one particular case (opcode = 4) for instructions that aren't
currently emulated, we are returning 'should not be single-stepped'
while we should have returned 0 which says 'did not emulate, may
have to single-step'.
Fixes: 930d6288a26787 ("powerpc: sstep: Add support for maddhd, maddhdu, maddld instructions") Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.ibm.com> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Reviewed-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/161157999039.64773.14950289716779364766.stgit@thinktux.local
Nicholas Piggin [Mon, 18 Jan 2021 12:34:51 +0000 (22:34 +1000)]
powerpc: Always enable queued spinlocks for 64s, disable for others
Queued spinlocks have shown to have good performance and fairness
properties even on smaller (2 socket) POWER systems. This selects them
automatically for 64s. For other platforms they are de-selected, the
standard spinlock is far simpler and smaller code, and single chips
with a handful of cores is unlikely to show any improvement.
CONFIG_EXPERT still allows this to be changed, e.g., to help debug
performance or correctness issues.
Cédric Le Goater [Sat, 12 Dec 2020 14:27:07 +0000 (15:27 +0100)]
powerpc/vas: Fix IRQ name allocation
The VAS device allocates a generic interrupt to handle page faults but
the IRQ name doesn't show under /proc. This is because it's on
stack. Allocate the name.
KVM: PPC: Make the VMX instruction emulation routines static
These are only used locally. It fixes these W=1 compile errors :
../arch/powerpc/kvm/powerpc.c:1521:5: error: no previous prototype for ‘kvmppc_get_vmx_dword’ [-Werror=missing-prototypes]
1521 | int kvmppc_get_vmx_dword(struct kvm_vcpu *vcpu, int index, u64 *val)
| ^~~~~~~~~~~~~~~~~~~~
../arch/powerpc/kvm/powerpc.c:1539:5: error: no previous prototype for ‘kvmppc_get_vmx_word’ [-Werror=missing-prototypes]
1539 | int kvmppc_get_vmx_word(struct kvm_vcpu *vcpu, int index, u64 *val)
| ^~~~~~~~~~~~~~~~~~~
../arch/powerpc/kvm/powerpc.c:1557:5: error: no previous prototype for ‘kvmppc_get_vmx_hword’ [-Werror=missing-prototypes]
1557 | int kvmppc_get_vmx_hword(struct kvm_vcpu *vcpu, int index, u64 *val)
| ^~~~~~~~~~~~~~~~~~~~
../arch/powerpc/kvm/powerpc.c:1575:5: error: no previous prototype for ‘kvmppc_get_vmx_byte’ [-Werror=missing-prototypes]
1575 | int kvmppc_get_vmx_byte(struct kvm_vcpu *vcpu, int index, u64 *val)
| ^~~~~~~~~~~~~~~~~~~
Fixes: acc9eb9305fe ("KVM: PPC: Reimplement LOAD_VMX/STORE_VMX instruction mmio emulation with analyse_instr() input") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210104143206.695198-19-clg@kaod.org
../arch/powerpc/mm/book3s64/slb.c:380:6: error: no previous prototype for ‘preload_new_slb_context’ [-Werror=missing-prototypes]
380 | void preload_new_slb_context(unsigned long start, unsigned long sp)
| ^~~~~~~~~~~~~~~~~~~~~~~
../arch/powerpc/mm/book3s64/hash_utils.c:1867:6: error: no previous prototype for ‘hpte_insert_repeating’ [-Werror=missing-prototypes]
1867 | long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
| ^~~~~~~~~~~~~~~~~~~~~
../arch/powerpc/mm/book3s64/hash_utils.c:1515:5: error: no previous prototype for ‘__hash_page’ [-Werror=missing-prototypes]
1515 | int __hash_page(unsigned long trap, unsigned long ea, unsigned long dsisr,
| ^~~~~~~~~~~
../arch/powerpc/mm/book3s64/hash_utils.c:1850:6: error: no previous prototype for ‘low_hash_fault’ [-Werror=missing-prototypes]
1850 | void low_hash_fault(struct pt_regs *regs, unsigned long address, int rc)
| ^~~~~~~~~~~~~~
Commit 650b55b707fd ("powerpc: Add prefixed instructions to instruction
data type") removed the use of patch_imm32_load_insns(). Clean it up
to fix this W=1 compile error :
../arch/powerpc/kernel/optprobes.c:149:6: error: no previous prototype for ‘patch_imm32_load_insns’ [-Werror=missing-prototypes]
149 | void patch_imm32_load_insns(unsigned int val, kprobe_opcode_t *addr)
Fixes: 650b55b707fd ("powerpc: Add prefixed instructions to instruction data type") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210104143206.695198-11-clg@kaod.org
powerpc/pseries/ras: Make init_ras_hotplug_IRQ() static
init_ras_hotplug_IRQ() is a local routine used by a machine init call
and it doesn't need to be external.
It fixes this W=1 compile error:
../arch/powerpc/platforms/pseries/ras.c:125:12: error: no previous prototype for ‘init_ras_hotplug_IRQ’ [-Werror=missing-prototypes]
125 | int __init init_ras_hotplug_IRQ(void)
| ^~~~~~~~~~~~~~~~~~~~
powerpc/pseries/eeh: Make pseries_pcibios_bus_add_device() static
pseries_pcibios_bus_add_device() is a local routine defining the
pcibios_bus_add_device() handler of the pseries machine in
eeh_pseries_init(). It doesn't need to be external.
It fixes this W=1 compile error:
../arch/powerpc/platforms/pseries/eeh_pseries.c:46:6: error: no previous prototype for ‘pseries_pcibios_bus_add_device’ [-Werror=missing-prototypes]
46 | void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: dae7253f9f78 ("powerpc/pseries: Add pseries SR-IOV Machine dependent calls") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210104143206.695198-4-clg@kaod.org
The last use of 'status' was removed in 2012. Remove the variable to
fix this W=1 compile error.
../arch/powerpc/platforms/pseries/ras.c: In function ‘ras_epow_interrupt’:
../arch/powerpc/platforms/pseries/ras.c:318:6: error: variable ‘status’ set but not used [-Werror=unused-but-set-variable]
318 | int status;
| ^~~~~~
Kajol Jain [Mon, 28 Dec 2020 08:52:04 +0000 (14:22 +0530)]
powerpc/perf/hv-24x7: Dont create sysfs event files for dummy events
The hv_24x7 performance monitoring unit creates a list of supported
events from the event catalog obtained via HCALL. The hv_24x7 catalog
could also contain invalid or dummy events with names like RESERVED*.
These events do not have any hardware counters backing them. Add a
check to string compare the event names to filter such events out.
Michael Ellerman [Thu, 17 Dec 2020 00:53:06 +0000 (11:53 +1100)]
powerpc/64s/kuap: Use mmu_has_feature()
In commit 8150a153c013 ("powerpc/64s: Use early_mmu_has_feature() in
set_kuap()") we switched the KUAP code to use early_mmu_has_feature(),
to avoid a bug where we called set_kuap() before feature patching had
been done, leading to recursion and crashes.
That path, which called probe_kernel_read() from printk(), has since
been removed, see commit 2ac5a3bf7042 ("vsprintf: Do not break early
boot with probing addresses").
Additionally probe_kernel_read() no longer invokes any KUAP routines,
since commit fe557319aa06 ("maccess: rename probe_kernel_{read,write}
to copy_{from,to}_kernel_nofault") and c33165253492 ("powerpc: use
non-set_fs based maccess routines").
So it should now be safe to use mmu_has_feature() in the KUAP
routines, because we shouldn't invoke them prior to feature patching.
This is essentially a revert of commit 8150a153c013 ("powerpc/64s: Use
early_mmu_has_feature() in set_kuap()"), but we've since added a
second usage of early_mmu_has_feature() in get_kuap(), so we convert
that to use mmu_has_feature() as well.
Linus Torvalds [Sat, 2 Jan 2021 20:22:46 +0000 (12:22 -0800)]
Merge tag 's390-5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 cleanups from Vasily Gorbik:
"Update defconfigs and sort config select list"
* tag 's390-5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/Kconfig: sort config S390 select list once again
s390: update defconfigs
Linus Torvalds [Sat, 2 Jan 2021 19:53:05 +0000 (11:53 -0800)]
Merge tag 'pm-5.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a crash in intel_pstate during resume from suspend-to-RAM
that may occur after recent changes and two resource leaks in error
paths in the operating performance points (OPP) framework, add a new
C-states table to intel_idle and update the cpuidle MAINTAINERS entry
to cover the governors too.
Specifics:
- Fix recently introduced crash in the intel_pstate driver that
occurs if scale-invariance is disabled during resume from
suspend-to-RAM due to inconsistent changes of APERF or MPERF MSR
values made by the platform firmware (Rafael Wysocki).
- Fix a memory leak and add a missing clk_put() in error paths in the
OPP framework (Quanyang Wang, Viresh Kumar).
- Add new C-states table for SnowRidge processors to the intel_idle
driver (Artem Bityutskiy).
- Update the MAINTAINERS entry for cpuidle to make it clear that the
governors are covered by it too (Lukas Bulwahn)"
* tag 'pm-5.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_idle: add SnowRidge C-state table
cpufreq: intel_pstate: Fix fast-switch fallback path
opp: Call the missing clk_put() on error
opp: fix memory leak in _allocate_opp_table
MAINTAINERS: include governors into CPU IDLE TIME MANAGEMENT FRAMEWORK
Linus Torvalds [Fri, 1 Jan 2021 20:58:07 +0000 (12:58 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).
The big core two fixes are for power management ("block: Do not accept
any requests while suspended" and "block: Fix a race in the runtime
power management code") which finally sorts out the resume problems
we've occasionally been having.
To make the resume fix, there are seven necessary precursors which
effectively renames REQ_PREEMPT to REQ_PM, so every "special" request
in block is automatically a power management exempt one.
All of the non-PM preempt cases are removed except for the one in the
SCSI Parallel Interface (spi) domain validation which is a genuine
case where we have to run requests at high priority to validate the
bus so this becomes an autopm get/put protected request"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
scsi: cxgb4i: Fix TLS dependency
scsi: ufs: Un-inline ufshcd_vops_device_reset function
scsi: ufs: Re-enable WriteBooster after device reset
scsi: ufs-mediatek: Use correct path to fix compile error
scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
scsi: block: Do not accept any requests while suspended
scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
scsi: block: Introduce BLK_MQ_REQ_PM
scsi: block: Fix a race in the runtime power management code
scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
scsi: ufs-pci: Fix restore from S4 for Intel controllers
scsi: ufs-mediatek: Keep VCC always-on for specific devices
scsi: ufs: Allow regulators being always-on
scsi: ufs: Clear UAC for RPMB after ufshcd resets
...
Linus Torvalds [Fri, 1 Jan 2021 20:49:09 +0000 (12:49 -0800)]
Merge tag 'block-5.11-2021-01-01' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Two minor block fixes from this last week that should go into 5.11:
- Add missing NOWAIT debugfs definition (Andres)
- Fix kerneldoc warning introduced this merge window (Randy)"
* tag 'block-5.11-2021-01-01' of git://git.kernel.dk/linux-block:
block: add debugfs stanza for QUEUE_FLAG_NOWAIT
fs: block_dev.c: fix kernel-doc warnings from struct block_device changes
Linus Torvalds [Fri, 1 Jan 2021 20:29:49 +0000 (12:29 -0800)]
Merge tag 'io_uring-5.11-2021-01-01' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few fixes that should go into 5.11, all marked for stable as well:
- Fix issue around identity COW'ing and users that share a ring
across processes
- Fix a hang associated with unregistering fixed files (Pavel)
- Move the 'process is exiting' cancelation a bit earlier, so
task_works aren't affected by it (Pavel)"
* tag 'io_uring-5.11-2021-01-01' of git://git.kernel.dk/linux-block:
kernel/io_uring: cancel io_uring before task works
io_uring: fix io_sqe_files_unregister() hangs
io_uring: add a helper for setting a ref node
io_uring: don't assume mm is constant across submits
Linus Torvalds [Mon, 28 Dec 2020 19:40:22 +0000 (11:40 -0800)]
depmod: handle the case of /sbin/depmod without /sbin in PATH
Commit 436e980e2ed5 ("kbuild: don't hardcode depmod path") stopped
hard-coding the path of depmod, but in the process caused trouble for
distributions that had that /sbin location, but didn't have it in the
PATH (generally because /sbin is limited to the super-user path).
Work around it for now by just adding /sbin to the end of PATH in the
depmod.sh script.
Pavel Begunkov [Wed, 30 Dec 2020 21:34:16 +0000 (21:34 +0000)]
kernel/io_uring: cancel io_uring before task works
For cancelling io_uring requests it needs either to be able to run
currently enqueued task_works or having it shut down by that moment.
Otherwise io_uring_cancel_files() may be waiting for requests that won't
ever complete.
Go with the first way and do cancellations before setting PF_EXITING and
so before putting the task_work infrastructure into a transition state
where task_work_run() would better not be called.
Pavel Begunkov [Wed, 30 Dec 2020 21:34:15 +0000 (21:34 +0000)]
io_uring: fix io_sqe_files_unregister() hangs
io_sqe_files_unregister() uninterruptibly waits for enqueued ref nodes,
however requests keeping them may never complete, e.g. because of some
userspace dependency. Make sure it's interruptible otherwise it would
hang forever.
Linus Torvalds [Wed, 30 Dec 2020 20:02:12 +0000 (12:02 -0800)]
Merge tag 'ceph-for-5.11-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A fix for an edge case in MClientRequest encoding and a couple of
trivial fixups for the new msgr2 support"
* tag 'ceph-for-5.11-rc2' of git://github.com/ceph/ceph-client:
libceph: add __maybe_unused to DEFINE_MSGR2_FEATURE
libceph: align session_key and con_secret to 16 bytes
libceph: fix auth_signature buffer allocation in secure mode
ceph: reencode gid_list when reconnecting
Artem Bityutskiy [Sun, 27 Dec 2020 10:11:16 +0000 (12:11 +0200)]
intel_idle: add SnowRidge C-state table
Add C-state table for the SnowRidge SoC which is found on Intel Jacobsville
platforms.
The following has been changed.
1. C1E latency changed from 10us to 15us. It was measured using the
open source "wult" tool (the "nic" method, 15us is the 99.99th
percentile).
2. C1E power break even changed from 20us to 25us, which may result
in less C1E residency in some workloads.
3. C6 latency changed from 50us to 130us. Measured the same way as C1E.
The C6 C-state is supported only by some SnowRidge revisions, so add a C-state
table commentary about this.
On SnowRidge, C6 support is enumerated via the usual mechanism: "mwait" leaf of
the "cpuid" instruction. The 'intel_idle' driver does check this leaf, so even
though C6 is present in the table, the driver will only use it if the CPU does
support it.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When sugov_update_single_perf() falls back to the "frequency"
path due to the missing scale-invariance, it will call
cpufreq_driver_fast_switch() via sugov_fast_switch()
and the driver's ->fast_switch() callback will be invoked,
so it must not be NULL.
However, after commit a365ab6b9dfb ("cpufreq: intel_pstate: Implement
the ->adjust_perf() callback") intel_pstate sets ->fast_switch() to
NULL when it is going to use intel_cpufreq_adjust_perf(), which is a
mistake, because on x86 the scale-invariance may be turned off
dynamically, so modify it to retain the original ->adjust_perf()
callback pointer.
Fixes: a365ab6b9dfb ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback") Reported-by: Kenneth R. Crudup <kenny@panix.com> Tested-by: Kenneth R. Crudup <kenny@panix.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge branch 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull operating performance points (OPP) framework fixes for 5.11-rc2
from Viresh Kumar:
"This contains two patches to fix freeing of resources in error paths."
* 'opp/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
opp: Call the missing clk_put() on error
opp: fix memory leak in _allocate_opp_table
Randy Dunlap [Tue, 29 Dec 2020 03:47:06 +0000 (19:47 -0800)]
fs: block_dev.c: fix kernel-doc warnings from struct block_device changes
Fix new kernel-doc warnings in fs/block_dev.c:
../fs/block_dev.c:1066: warning: Excess function parameter 'whole' description in 'bd_abort_claiming'
../fs/block_dev.c:1837: warning: Function parameter or member 'dev' not described in 'lookup_bdev'
Fixes: 4e7b5671c6a8 ("block: remove i_bdev") Fixes: 37c3fc9abb25 ("block: simplify the block device claiming interface") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: linux-fsdevel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Tue, 29 Dec 2020 23:45:49 +0000 (15:45 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"16 patches
Subsystems affected by this patch series: mm (selftests, hugetlb,
pagecache, mremap, kasan, and slub), kbuild, checkpatch, misc, and
lib"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: slub: call account_slab_page() after slab page initialization
zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
lib/zlib: fix inflating zlib streams on s390
lib/genalloc: fix the overflow when size is too big
kdev_t: always inline major/minor helper functions
sizes.h: add SZ_8G/SZ_16G/SZ_32G macros
local64.h: make <asm/local64.h> mandatory
kasan: fix null pointer dereference in kasan_record_aux_stack
mm: generalise COW SMC TLB flushing race comment
mm/mremap.c: fix extent calculation
mm: memmap defer init doesn't work as expected
mm: add prototype for __add_to_page_cache_locked()
checkpatch: prefer strscpy to strlcpy
Revert "kbuild: avoid static_assert for genksyms"
mm/hugetlb: fix deadlock in hugetlb_cow error path
selftests/vm: fix building protection keys test
Roman Gushchin [Tue, 29 Dec 2020 23:15:07 +0000 (15:15 -0800)]
mm: slub: call account_slab_page() after slab page initialization
It's convenient to have page->objects initialized before calling into
account_slab_page(). In particular, this information can be used to
pre-alloc the obj_cgroup vector.
Let's call account_slab_page() a bit later, after the initialization of
page->objects.
This commit doesn't bring any functional change, but is required for
further optimizations.
[akpm@linux-foundation.org: undo changes needed by forthcoming mm-memcg-slab-pre-allocate-obj_cgroups-for-slab-caches-with-slab_account.patch]
Link: https://lkml.kernel.org/r/20201110195753.530157-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Tue, 29 Dec 2020 23:15:04 +0000 (15:15 -0800)]
zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
In commit 11fb479ff5d9 ("zlib: export S390 symbols for zlib modules"), I
added EXPORT_SYMBOL()s to dfltcc_inflate.c but then Mikhail said that
these should probably be in dfltcc_syms.c with the other
EXPORT_SYMBOL()s.
However, that is contrary to the current kernel style, which places
EXPORT_SYMBOL() immediately after the function that it applies to, so
move all EXPORT_SYMBOL()s to their respective function locations and
drop the dfltcc_syms.c file. Also move MODULE_LICENSE() from the
deleted file to dfltcc.c.
Ilya Leoshkevich [Tue, 29 Dec 2020 23:15:01 +0000 (15:15 -0800)]
lib/zlib: fix inflating zlib streams on s390
Decompressing zlib streams on s390 fails with "incorrect data check"
error.
Userspace zlib checks inflate_state.flags in order to byteswap checksums
only for zlib streams, and s390 hardware inflate code, which was ported
from there, tries to match this behavior. At the same time, kernel zlib
does not use inflate_state.flags, so it contains essentially random
values. For many use cases either zlib stream is zeroed out or checksum
is not used, so this problem is masked, but at least SquashFS is still
affected.
Fix by always passing a checksum to and from the hardware as is, which
matches zlib_inflate()'s expectations.
Randy Dunlap [Tue, 29 Dec 2020 23:14:49 +0000 (15:14 -0800)]
local64.h: make <asm/local64.h> mandatory
Make <asm-generic/local64.h> mandatory in include/asm-generic/Kbuild and
remove all arch/*/include/asm/local64.h arch-specific files since they
only #include <asm-generic/local64.h>.
This fixes build errors on arch/c6x/ and arch/nios2/ for
block/blk-iocost.c.
Build-tested on 21 of 25 arch-es. (tools problems on the others)
Yes, we could even rename <asm-generic/local64.h> to
<linux/local64.h> and change all #includes to use
<linux/local64.h> instead.
Link: https://lkml.kernel.org/r/20201227024446.17018-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicholas Piggin [Tue, 29 Dec 2020 23:14:43 +0000 (15:14 -0800)]
mm: generalise COW SMC TLB flushing race comment
I'm not sure if I'm completely missing something here, but AFAIKS the
reference to the mysterious "COW SMC race" confuses the issue. The
original changelog and mailing list thread didn't help me either.
This SMC race is where the problem was detected, but isn't the general
problem bigger and more obvious: that the new PTE could be picked up at
any time by any TLB while entries for the old PTE exist in other TLBs
before the TLB flush takes effect?
The case where the iTLB and dTLB of a CPU are pointing at different pages
is an interesting one but follows from the general problem.
The other (minor) thing with the comment I think it makes it a bit clearer
to say what the old code was doing (i.e., it avoids the race as opposed to
what?).
References: 4ce072f1faf29 ("mm: fix a race condition under SMC + COW") Link: https://lkml.kernel.org/r/20201215121119.351650-1-npiggin@gmail.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <sbsiddha@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baoquan He [Tue, 29 Dec 2020 23:14:37 +0000 (15:14 -0800)]
mm: memmap defer init doesn't work as expected
VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.
Before the commit:
[0.033176] Normal zone: 1445888 pages used for memmap
[0.033176] Normal zone: 89391104 pages, LIFO batch:63
[0.035851] ACPI: PM-Timer IO Port: 0x448
With commit
[0.026874] Normal zone: 1445888 pages used for memmap
[0.026875] Normal zone: 89391104 pages, LIFO batch:63
[2.028450] ACPI: PM-Timer IO Port: 0x448
The root cause is the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do memmap init of one whole zone,
to initialize all low zones of one numa node, but defer memmap init of
the last zone in that numa node. However, since commit 73a6e474cb376,
function memmap_init() is adapted to iterater over memblock regions
inside one zone, then call memmap_init_zone() to do memmap init for each
region.
E.g, on VMware's system, the memory layout is as below, there are two
memory regions in node 2. The current code will mistakenly initialize the
whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
iniatialize only one memmory section on the 2nd region [mem
0x10000000000-0x1033fffffff]. In fact, we only expect to see that there's
only one memory section's memmap initialized. That's why more time is
costed at the time.
Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether defer need be taken in zone wide.
Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: Rahul Gopakumar <gopakumarr@vmware.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>