Since Qualcomm device-trees already use SoC specific compatibles for
describing the 'sdhci-msm' nodes, it makes sense to add the support for the
same in the driver as well.
Keep the old deprecated compatible strings still in the driver, to ensure
backward compatibility with older device-trees.
dt-bindings: mmc: brcm,sdhci-brcmstb: cleanup example
Cleanup indentation and order of entries in example DTS. The most
important when reading the DTS are compatible and reg. By convention
they are usually to first entries. No functional change.
Al Cooper [Wed, 27 Apr 2022 18:08:51 +0000 (14:08 -0400)]
mmc: sdhci-brcmstb: Enable Clock Gating to save power
Enabling this feature will allow the controller to stop the bus
clock when the bus is idle. The feature is not part of the standard
and is unique to newer Arasan cores and is enabled with a bit in a
vendor specific register. This feature will only be enabled for
non-removable devices because they don't switch the voltage and
clock gating breaks SD Card volatge switching.
Signed-off-by: Al Cooper <alcooperx@gmail.com> Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20220427180853.35970-3-kdasu.kdev@gmail.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Al Cooper [Wed, 27 Apr 2022 18:08:50 +0000 (14:08 -0400)]
mmc: sdhci-brcmstb: Re-organize flags
Re-organize the flags by basing the bit names on the flag that they
apply to. Also change the "flags" member in the "brcmstb_match_priv"
struct to const.
Signed-off-by: Al Cooper <alcooperx@gmail.com> Signed-off-by: Kamal Dasu <kdasu.kdev@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20220427180853.35970-2-kdasu.kdev@gmail.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Newer variants of the MMC controller support a 34-bit physical address
space by using word addresses instead of byte addresses. However, the
code truncates the DMA descriptor address to 32 bits before applying the
shift. This breaks DMA for descriptors allocated above the 32-bit limit.
Fixes: 3536b82e5853 ("mmc: sunxi: add support for A100 mmc controller") Signed-off-by: Samuel Holland <samuel@sholland.org> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Jernej Skrabec <jernej.skrabec@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220424231751.32053-1-samuel@sholland.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Bean Huo [Sat, 23 Apr 2022 22:16:23 +0000 (00:16 +0200)]
mmc: core: Allows to override the timeout value for ioctl() path
Occasionally, user-land applications initiate longer timeout values for certain commands
through ioctl() system call. But so far we are still using a fixed timeout of 10 seconds
in mmc_poll_for_busy() on the ioctl() path, even if a custom timeout is specified in the
userspace application. This patch allows custom timeout values to override this default
timeout values on the ioctl path.
drivers: mmc: sdhci_am654: Add the quirk to set TESTCD bit
The ARASAN MMC controller on Keystone 3 class of devices need the SDCD
line to be connected for proper functioning. Similar to the issue pointed
out in sdhci-of-arasan.c driver, commit 3794c542641f ("mmc:
sdhci-of-arasan: Set controller to test mode when no CD bit").
In cases where this can't be connected, add a quirk to force the
controller into test mode and set the TESTCD bit. Use the flag
"ti,fails-without-test-cd", to implement this above quirk when required.
dt-bindings: mmc: sdhci-am654: Add flag to force setting of TESTCD bit
The ARASAN MMC controller on Keystone 3 class of devices needs the SDCD
line to be connected for proper functioning. Similar to the issue pointed
out in sdhci-of-arasan.c driver, commit 3794c542641f ("mmc:
sdhci-of-arasan: Set controller to test mode when no CD bit").
In cases where SDCD line is not connected, adding
"ti,fails-without-test-cd" in the DT node helps to indicate the
controller, that the SDCD line has been pulled down, using the TESTCD bit.
Chris Packham [Tue, 19 Apr 2022 02:46:11 +0000 (14:46 +1200)]
dt-bindings: mmc: convert sdhci-dove to JSON schema
Convert the sdhci-dove binding to JSON schema. The optional clocks
property was not in the original binding document but has been in the
dove.dtsi since commit 5b03df9ace68 ("ARM: dove: switch to DT clock
providers").
dt-bindings: mmc: Add small binding note on level shifters
The VQMMC is often provided by a level shifter, so drop a small note
in the bindings that this can be the case and how that is done.
It is helpful information since this is pretty common.
Ben Chuang [Thu, 14 Apr 2022 09:49:45 +0000 (17:49 +0800)]
mmc: sdhci-pci-gli: A workaround to allow GL9755 to enter ASPM L1.2
When GL9755 enters ASPM L1 sub-states, it will stay at L1.1 and will not
enter L1.2. The workaround is to toggle PM state to allow GL9755 to enter
ASPM L1.2.
mmc: jz4740: Apply DMA engine limits to maximum segment size
Do what is done in other DMA-enabled MMC host drivers (cf. host/mmci.c) and
limit the maximum segment size based on the DMA engine's capabilities. This
is needed to avoid warnings like the following with CONFIG_DMA_API_DEBUG=y.
The SDHC controller in the imx8mn and imx8mp have the same controller
as the imx8mm which is slightly different than that of the imx7d.
Using the fallback of the imx8mm enables the controllers to support
HS400-ES which is not available on the imx7d. After discussion with NXP,
it turns out that the imx8qm should fall back to the imx8qxp, because
those have some additional flags not present in the imx8mm.
Mark the current state of the fallbacks as deprecated, and add the
proper fallbacks so in the future, the deprecated combination can be
removed and prevent any future devices from using the wrong fallback.
Wolfram Sang [Fri, 8 Apr 2022 08:00:44 +0000 (10:00 +0200)]
mmc: improve API to make clear hw_reset callback is for cards
To make it unambiguous that the hw_reset callback is for cards and not
for controllers, we add 'card' to the callback name and convert all
users in one go. We keep the argument as mmc_host, though, because the
callback is used very early when mmc_card is not yet populated.
Wolfram Sang [Fri, 8 Apr 2022 08:00:43 +0000 (10:00 +0200)]
mmc: core: improve API to make clear that mmc_sw_reset is for cards
To make it unambiguous that mmc_sw_reset() is for cards and not for
controllers, we make the function argument mmc_card instead of mmc_host.
There are no users to convert currently.
The driver, OMAP specific, now omits clk_prepare/unprepare() steps, not
supported by OMAP custom implementation of clock API. However, non-CCF
stubs of those functions exist for use on such platforms until converted
to CCF.
Update the driver to be compatible with CCF implementation of clock API.
Sergey Shtylyov [Wed, 30 Mar 2022 21:09:07 +0000 (00:09 +0300)]
mmc: core: block: fix sloppy typing in mmc_blk_ioctl_multi_cmd()
Despite mmc_ioc_multi_cmd::num_of_cmds is a 64-bit field, its maximum
value is limited to MMC_IOC_MAX_CMDS (only 255); using a 64-bit local
variable to hold a copy of that field leads to gcc generating ineffective
loop code: despite the source code using an *int* variable for the loop
counters, the 32-bit object code uses 64-bit unsigned counters. Also,
gcc has to drop the most significant word of that 64-bit variable when
calling kcalloc() and assigning to mmc_queue_req::ioc_count anyway.
Using the *unsigned int* variable instead results in a better code.
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Chris Packham [Tue, 29 Mar 2022 22:05:44 +0000 (11:05 +1300)]
dt-bindings: mmc: xenon: Convert to JSON schema
Convert the marvell,xenon-sdhci binding to JSON schema. Currently the
in-tree dts files don't validate because they use sdhci@ instead of mmc@
as required by the generic mmc-controller schema.
The compatible "marvell,sdhci-xenon" was not documented in the old
binding but it accompanies the of "marvell,armada-3700-sdhci" in the
armada-37xx SoC dtsi so this combination is added to the new binding
document.
Yann Gautier [Mon, 28 Mar 2022 14:51:14 +0000 (16:51 +0200)]
mmc: mmci: stm32: use a buffer for unaligned DMA requests
In SDIO mode, the sg list for requests can be unaligned with what the
STM32 SDMMC internal DMA can support. In that case, instead of failing,
use a temporary bounce buffer to copy from/to the sg list.
This buffer is limited to 1MB. But for that we need to also limit
max_req_size to 1MB. It has not shown any throughput penalties for
SD-cards or eMMC.
Wolfram Sang [Sun, 20 Mar 2022 12:30:16 +0000 (13:30 +0100)]
mmc: renesas_sdhi: make 'dmac_only_one_rx' a quirk
After Shimoda-san's much appreciated refactoring of the quirk handling,
we can convert now 'dmac_only_one_rx' from an ugly global flag to a
regular quirk. This makes quirk handling more consistent and easier to
maintain. After this patch, soc_dma_quirks is completely gone, hooray!
Wolfram Sang [Sun, 20 Mar 2022 12:30:15 +0000 (13:30 +0100)]
mmc: renesas_sdhi: make 'fixed_addr_mode' a quirk
After Shimoda-san's much appreciated refactoring of the quirk handling,
we can convert now the 'fixed_addr_mode' from an ugly global flag to a
regular quirk. This makes quirk handling more consistent and easier to
maintain.
Wolfram Sang [Sun, 20 Mar 2022 12:30:14 +0000 (13:30 +0100)]
mmc: renesas_sdhi: remove a stale comment
The whitelist has been refactored away with a0fb3fc8af01 ("mmc:
renesas_sdhi: remove whitelist for internal DMAC") so the comment
doesn't make any sense anymore.
Wolfram Sang [Sun, 20 Mar 2022 12:30:12 +0000 (13:30 +0100)]
mmc: renesas_sdhi: R-Car D3 also has no HS400
It is not explicitly expressed in the docs, but the needed data strobe
pin is indeed missing for D3. The BSP disables HS400 as well. This means
a little refactoring to reuse an already existing setup.
Brian Norris [Fri, 22 Apr 2022 17:08:53 +0000 (10:08 -0700)]
mmc: core: Set HS clock speed before sending HS CMD13
Way back in commit 4f25580fb84d ("mmc: core: changes frequency to
hs_max_dtr when selecting hs400es"), Rockchip engineers noticed that
some eMMC don't respond to SEND_STATUS commands very reliably if they're
still running at a low initial frequency. As mentioned in that commit,
JESD84-B51 P49 suggests a sequence in which the host:
1. sets HS_TIMING
2. bumps the clock ("<= 52 MHz")
3. sends further commands
It doesn't exactly require that we don't use a lower-than-52MHz
frequency, but in practice, these eMMC don't like it.
The aforementioned commit tried to get that right for HS400ES, although
it's unclear whether this ever truly worked as committed into mainline,
as other changes/refactoring adjusted the sequence in conflicting ways:
08573eaf1a70 ("mmc: mmc: do not use CMD13 to get status after speed mode
switch")
53e60650f74e ("mmc: core: Allow CMD13 polling when switching to HS mode
for mmc")
In any case, today we do step 3 before step 2. Let's fix that, and also
apply the same logic to HS200/400, where this eMMC has problems too.
Resolves errors like this seen when booting some RK3399 Gru/Scarlet
systems:
Ealier versions of this patch bumped to 200MHz/HS200 speeds too early,
which caused issues on, e.g., qcom-msm8974-fairphone-fp2. (Thanks for
the report Luca!) After a second look, it appears that aligns with
JESD84 / page 45 / table 28, so we need to keep to lower (HS / 52 MHz)
rates first.
Fixes: 08573eaf1a70 ("mmc: mmc: do not use CMD13 to get status after speed mode switch") Fixes: 53e60650f74e ("mmc: core: Allow CMD13 polling when switching to HS mode for mmc") Fixes: 4f25580fb84d ("mmc: core: changes frequency to hs_max_dtr when selecting hs400es") Cc: Shawn Lin <shawn.lin@rock-chips.com> Link: https://lore.kernel.org/linux-mmc/11962455.O9o76ZdvQC@g550jk/ Reported-by: Luca Weiss <luca@z3ntu.xyz> Signed-off-by: Brian Norris <briannorris@chromium.org> Tested-by: Luca Weiss <luca@z3ntu.xyz> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220422100824.v4.1.I484f4ee35609f78b932bd50feed639c29e64997e@changeid Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Merge tag 'perf_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Add Sapphire Rapids CPU support
- Fix a perf vmalloc-ed buffer mapping error (PERF_USE_VMALLOC in use)
* tag 'perf_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/cstate: Add SAPPHIRERAPIDS_X CPU support
perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled
Merge tag 'edac_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC fix from Borislav Petkov:
- Read the reported error count from the proper register on
synopsys_edac
* tag 'edac_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/synopsys: Read the error count from the correct register
kvmalloc: use vmalloc_huge for vmalloc allocations
Since commit 559089e0a93d ("vmalloc: replace VM_NO_HUGE_VMAP with
VM_ALLOW_HUGE_VMAP"), the use of hugepage mappings for vmalloc is an
opt-in strategy, because it caused a number of problems that weren't
noticed until x86 enabled it too.
One of the issues was fixed by Nick Piggin in commit 3b8000ae185c
("mm/vmalloc: huge vmalloc backing pages should be split rather than
compound"), but I'm still worried about page protection issues, and
VM_FLUSH_RESET_PERMS in particular.
However, like the hash table allocation case (commit f2edd118d02d:
"page_alloc: use vmalloc_huge for large system hash"), the use of
kvmalloc() should be safe from any such games, since the returned
pointer might be a SLUB allocation, and as such no user should
reasonably be using it in any odd ways.
We also know that the allocations are fairly large, since it falls back
to the vmalloc case only when a kmalloc() fails. So using a hugepage
mapping seems both safe and relevant.
This patch does show a weakness in the opt-in strategy: since the opt-in
flag is in the 'vm_flags', not the usual gfp_t allocation flags, very
few of the usual interfaces actually expose it.
That's not much of an issue in this case that already used one of the
fairly specialized low-level vmalloc interfaces for the allocation, but
for a lot of other vmalloc() users that might want to opt in, it's going
to be very inconvenient.
We'll either have to fix any compatibility problems, or expose it in the
gfp flags (__GFP_COMP would have made a lot of sense) to allow normal
vmalloc() users to use hugepage mappings. That said, the cases that
really matter were probably already taken care of by the hash tabel
allocation.
Song Liu [Fri, 15 Apr 2022 16:44:11 +0000 (09:44 -0700)]
page_alloc: use vmalloc_huge for large system hash
Use vmalloc_huge() in alloc_large_system_hash() so that large system
hash (>= PMD_SIZE) could benefit from huge pages.
Note that vmalloc_huge only allocates huge pages for systems with
HAVE_ARCH_HUGE_VMALLOC.
Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
- cap maximum sector size reported to avoid mount problems
- reference count fix
- fix filename rename race
* tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION
ksmbd: increment reference count of parent fp
ksmbd: remove filename in ksmbd_file
Merge tag 'arc-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- Assorted fixes
* tag 'arc-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: remove redundant READ_ONCE() in cmpxchg loop
ARC: atomic: cleanup atomic-llsc definitions
arc: drop definitions of pgd_index() and pgd_offset{, _k}() entirely
ARC: dts: align SPI NOR node name with dtschema
ARC: Remove a redundant memset()
ARC: fix typos in comments
ARC: entry: fix syscall_trace_exit argument
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"One fix for an information leak caused by copying a buffer to
userspace without checking for error first in the sr driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sr: Do not leak information in ioctl
Merge tag 'for-linus-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"A simple cleanup patch and a refcount fix for Xen on Arm"
* tag 'for-linus-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
arm/xen: Fix some refcount leaks
xen: Convert kmap() to kmap_local_page()
Merge tag 'drm-fixes-2022-04-23' of git://anongit.freedesktop.org/drm/drm
Pull more drm fixes from Dave Airlie:
"Maarten was away, so Maxine stepped up and sent me the drm-fixes
merge, so no point leaving it for another week.
The big change is an OF revert around bridge/panels, it may have some
driver fallout, but hopefully this revert gets them shook out in the
next week easier.
Otherwise it's a bunch of locking/refcounts across drivers, a radeon
dma_resv logic fix and some raspberry pi panel fixes.
panel:
- revert of patch that broke panel/bridge issues
dma-buf:
- remove unused header file.
amdgpu:
- partial revert of locking change
radeon:
- fix dma_resv logic inversion
panel:
- pi touchscreen panel init fixes
vc4:
- build fix
- runtime pm refcount fix
vmwgfx:
- refcounting fix"
* tag 'drm-fixes-2022-04-23' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu: partial revert "remove ctx->lock" v2
Revert "drm: of: Lookup if child node has panel or bridge"
Revert "drm: of: Properly try all possible cases for bridge/panel detection"
drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
drm/vmwgfx: Fix gem refcounting and memory evictions
drm/vc4: Fix build error when CONFIG_DRM_VC4=y && CONFIG_RASPBERRYPI_FIRMWARE=m
drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare
drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised
dma-buf-map: remove renamed header file
drm/radeon: fix logic inversion in radeon_sync_resv
Merge tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Just two small regression fixes for bcache"
* tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block:
bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook()
bcache: put bch_bio_map() back to correct location in journal_write_unlocked()
Merge tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just two small fixes - one fixing a potential leak for the iovec for
larger requests added in this cycle, and one fixing a theoretical leak
with CQE_SKIP and IOPOLL"
* tag 'io_uring-5.18-2022-04-22' of git://git.kernel.dk/linux-block:
io_uring: fix leaks on IOPOLL and CQE_SKIP
io_uring: free iovec if file assignment fails
Merge tag 'perf-tools-fixes-for-v5.18-2022-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix header include for LLVM >= 14 when building with libclang.
- Allow access to 'data_src' for auxtrace in 'perf script' with ARM SPE
perf.data files, fixing processing data with such attributes.
- Fix error message for test case 71 ("Convert perf time to TSC") on
s390, where it is not supported.
* tag 'perf-tools-fixes-for-v5.18-2022-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf test: Fix error message for test case 71 on s390, where it is not supported
perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
perf script: Always allow field 'data_src' for auxtrace
perf clang: Fix header include for LLVM >= 14
Randy Dunlap [Sat, 23 Apr 2022 03:25:17 +0000 (20:25 -0700)]
sparc: cacheflush_32.h needs struct page
Add a struct page forward declaration to cacheflush_32.h.
Fixes this build warning:
CC drivers/crypto/xilinx/zynqmp-sha.o
In file included from arch/sparc/include/asm/cacheflush.h:11,
from include/linux/cacheflush.h:5,
from drivers/crypto/xilinx/zynqmp-sha.c:6:
arch/sparc/include/asm/cacheflush_32.h:38:37: warning: 'struct page' declared inside parameter list will not be visible outside of this definition or declaration
38 | void sparc_flush_page_to_ram(struct page *page);
Exposed by commit 0e03b8fd2936 ("crypto: xilinx - Turn SHA into a
tristate and allow COMPILE_TEST") but not Fixes: that commit because the
underlying problem is older.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David S. Miller <davem@davemloft.net> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: sparclinux@vger.kernel.org Acked-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Airlie [Sat, 23 Apr 2022 05:00:33 +0000 (15:00 +1000)]
Merge tag 'drm-misc-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Two fixes for the raspberrypi panel initialisation, one fix for a logic
inversion in radeon, a build and pm refcounting fix for vc4, two reverts
for drm_of_get_bridge that caused a number of regression and a locking
regression for amdgpu.
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix some syzbot-detected bugs, as well as other bugs found by I/O
injection testing.
Change ext4's fallocate to consistently drop set[ug]id bits when an
fallocate operation might possibly change the user-visible contents of
a file.
Also, improve handling of potentially invalid values in the the
s_overhead_cluster superblock field to avoid ext4 returning a negative
number of free blocks"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: fix a potential race while discarding reserved buffers after an abort
ext4: update the cached overhead value in the superblock
ext4: force overhead calculation if the s_overhead_cluster makes no sense
ext4: fix overhead calculation to account for the reserved gdt blocks
ext4, doc: fix incorrect h_reserved size
ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
ext4: fix use-after-free in ext4_search_dir
ext4: fix bug_on in start_this_handle during umount filesystem
ext4: fix symlink file size not match to file content
ext4: fix fallocate to use file_modified to update permissions consistently
Merge tag 'ata-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA fix from Damien Le Moal:
"A single fix to avoid a NULL pointer dereference in the pata_marvell
driver with adapters not supporting DMA, from Zheyu"
* tag 'ata-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: pata_marvell: Check the 'bmdma_addr' beforing reading
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"The main and larger change here is a workaround for AMD's lack of
cache coherency for encrypted-memory guests.
I have another patch pending, but it's waiting for review from the
architecture maintainers.
RISC-V:
- Remove 's' & 'u' as valid ISA extension
- Do not allow disabling the base extensions 'i'/'m'/'a'/'c'
x86:
- Fix NMI watchdog in guests on AMD
- Fix for SEV cache incoherency issues
- Don't re-acquire SRCU lock in complete_emulated_io()
- Avoid NULL pointer deref if VM creation fails
- Fix race conditions between APICv disabling and vCPU creation
- Bugfixes for disabling of APICv
- Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
selftests:
- Do not use bitfields larger than 32-bits, they differ between GCC
and clang"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: selftests: introduce and use more page size-related constants
kvm: selftests: do not use bitfields larger than 32-bits for PTEs
KVM: SEV: add cache flush to solve SEV cache incoherency issues
KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
KVM: SVM: Simplify and harden helper to flush SEV guest page(s)
KVM: selftests: Silence compiler warning in the kvm_page_table_test
KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
KVM: SPDX style and spelling fixes
KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref
KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused
KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy
KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()
RISC-V: KVM: Restrict the extensions that can be disabled
RISC-V: KVM: Remove 's' & 'u' as valid ISA extension
Thomas Richter [Wed, 20 Apr 2022 06:29:21 +0000 (08:29 +0200)]
perf test: Fix error message for test case 71 on s390, where it is not supported
Test case 71 'Convert perf time to TSC' is not supported on s390.
Subtest 71.1 is skipped with the correct message, but subtest 71.2 is
not skipped and fails.
The root cause is function evlist__open() called from
test__perf_time_to_tsc(). evlist__open() returns -ENOENT because the
event cycles:u is not supported by the selected PMU, for example
platform s390 on z/VM or an x86_64 virtual machine.
The PMU driver returns -ENOENT in this case. This error is leads to the
failure.
Fix this by returning TEST_SKIP on -ENOENT.
Output before:
71: Convert perf time to TSC:
71.1: TSC support: Skip (This architecture does not support)
71.2: Perf time to TSC: FAILED!
Output after:
71: Convert perf time to TSC:
71.1: TSC support: Skip (This architecture does not support)
71.2: Perf time to TSC: Skip (perf_read_tsc_conversion is not supported)
This also happens on an x86_64 virtual machine:
# uname -m
x86_64
$ ./perf test -F 71
71: Convert perf time to TSC :
71.1: TSC support : Ok
71.2: Perf time to TSC : FAILED!
$
Committer testing:
Continues to work on x86_64:
$ perf test 71
71: Convert perf time to TSC :
71.1: TSC support : Ok
71.2: Perf time to TSC : Ok
$
Fixes: 290fa68bdc458863 ("perf test tsc: Fix error message when not supported") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Chengdong Li <chengdongli@tencent.com> Cc: chengdongli@tencent.com Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Link: https://lore.kernel.org/r/20220420062921.1211825-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Leo Yan [Thu, 14 Apr 2022 12:32:01 +0000 (20:32 +0800)]
perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
Since commit bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem
info is not available") "perf mem report" and "perf report --mem-mode"
don't report result if the PERF_SAMPLE_DATA_SRC bit is missed in sample
type.
The commit ffab487052054162 ("perf: arm-spe: Fix perf report
--mem-mode") partially fixes the issue. It adds PERF_SAMPLE_DATA_SRC
bit for Arm SPE event, this allows the perf data file generated by
kernel v5.18-rc1 or later version can be reported properly.
On the other hand, perf tool still fails to be backward compatibility
for a data file recorded by an older version's perf which contains Arm
SPE trace data. This patch is a workaround in reporting phase, when
detects ARM SPE PMU event and without PERF_SAMPLE_DATA_SRC bit, it will
force to set the bit in the sample type and give a warning info.
Fixes: bb30acae4c4dacfa ("perf report: Bail out --mem-mode if mem info is not available") Reviewed-by: James Clark <james.clark@arm.com> Signed-off-by: Leo Yan <leo.yan@linaro.org> Tested-by: German Gomez <german.gomez@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Link: https://lore.kernel.org/r/20220414123201.842754-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Leo Yan [Sun, 17 Apr 2022 11:48:37 +0000 (19:48 +0800)]
perf script: Always allow field 'data_src' for auxtrace
If use command 'perf script -F,+data_src' to dump memory samples with
Arm SPE trace data, it reports error:
# perf script -F,+data_src
Samples for 'dummy:u' event do not have DATA_SRC attribute set. Cannot print 'data_src' field.
This is because the 'dummy:u' event is absent DATA_SRC bit in its sample
type, so if a file contains AUX area tracing data then always allow
field 'data_src' to be selected as an option for perf script.
Fixes: e55ed3423c1bb29f ("perf arm-spe: Synthesize memory event") Signed-off-by: Leo Yan <leo.yan@linaro.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: German Gomez <german.gomez@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20220417114837.839896-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit 5467801f1fcb ("gpio: Restrict usage of GPIO chip irq members
before initialization") attempted to fix a race condition that lead to a
NULL pointer, but in the process caused a regression for _AEI/_EVT
declared GPIOs.
This manifests in messages showing deferred probing while trying to
allocate IRQs like so:
amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0000 to IRQ, err -517
amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x002C to IRQ, err -517
amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x003D to IRQ, err -517
[ .. more of the same .. ]
The code for walking _AEI doesn't handle deferred probing and so this
leads to non-functional GPIO interrupts.
Fix this issue by moving the call to `acpi_gpiochip_request_interrupts`
to occur after gc->irc.initialized is set.
Merge tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes Palmer Dabbelt:
- A pair of build fixes for the recent cpuidle driver
- A fix for systems without sv57 that manifests as a crash
early in boot
* tag 'riscv-for-linus-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: cpuidle: fix Kconfig select for RISCV_SBI_CPUIDLE
RISC-V: mm: Fix set_satp_mode() for platform not having Sv57
cpuidle: riscv: support non-SMP config
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"There's no real pattern to the fixes, but the main one fixes our
pmd_leaf() definition to resolve a NULL dereference on the migration
path.
- Fix PMU event validation in the absence of any event counters
- Fix allmodconfig build using clang in conjunction with binutils
- Fix definitions of pXd_leaf() to handle PROT_NONE entries
- More typo fixes"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mm: fix p?d_leaf()
arm64: fix typos in comments
arm64: Improve HAVE_DYNAMIC_FTRACE_WITH_REGS selection for clang
arm_pmu: Validate single/group leader events
Miaoqian Lin [Wed, 20 Apr 2022 01:49:13 +0000 (01:49 +0000)]
arm/xen: Fix some refcount leaks
The of_find_compatible_node() function returns a node pointer with
refcount incremented, We should use of_node_put() on it when done
Add the missing of_node_put() to release the refcount.
Fixes: 9b08aaa3199a ("ARM: XEN: Move xen_early_init() before efi_init()") Fixes: b2371587fe0c ("arm/xen: Read extended regions from DT and init Xen resource") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Merge tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray
Pull xarray fixes from Matthew Wilcox:
"Syzbot found a nasty race between large page splitting and page
lookup. Details in the commit log, but fortunately it has a reliable
reproducer. I thought it better to send this one to you straight away.
Also fix the test suite build for kmem_cache_alloc_lru()"
* tag 'xarray-5.18a' of git://git.infradead.org/users/willy/xarray:
XArray: Disallow sibling entries of nodes
tools: Add kmem_cache_alloc_lru()
Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four fixes, two of them for stable:
- fcollapse fix
- reconnect lock fix
- DFS oops fix
- minor cleanup patch"
* tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: destage any unwritten data to the server before calling copychunk_write
cifs: use correct lock type in cifs_reconnect()
cifs: fix NULL ptr dereference in refresh_mounts()
cifs: Use kzalloc instead of kmalloc/memset
Merge tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull mount_setattr fix from Christian Brauner:
"The recent cleanup in e257039f0fc7 ("mount_setattr(): clean the
control flow and calling conventions") switched the mount attribute
codepaths from do-while to for loops as they are more idiomatic when
walking mounts.
However, we did originally choose do-while constructs because if we
request a mount or mount tree to be made read-only we need to hold
writers in the following way: The mount attribute code will grab
lock_mount_hash() and then call mnt_hold_writers() which will
_unconditionally_ set MNT_WRITE_HOLD on the mount.
Any callers that need write access have to call mnt_want_write(). They
will immediately see that MNT_WRITE_HOLD is set on the mount and the
caller will then either spin (on non-preempt-rt) or wait on
lock_mount_hash() (on preempt-rt).
The fact that MNT_WRITE_HOLD is set unconditionally means that once
mnt_hold_writers() returns we need to _always_ pair it with
mnt_unhold_writers() in both the failure and success paths.
The do-while constructs did take care of this. But Al's change to a
for loop in the failure path stops on the first mount we failed to
change mount attributes _without_ going into the loop to call
mnt_unhold_writers().
This in turn means that once we failed to make a mount read-only via
mount_setattr() - i.e. there are already writers on that mount - we
will block any writers indefinitely. Fix this by ensuring that the for
loop always unsets MNT_WRITE_HOLD including the first mount we failed
to change to read-only. Also sprinkle a few comments into the cleanup
code to remind people about what is happening including myself. After
all, I didn't catch it during review.
This is only relevant on mainline and was reported by syzbot. Details
about the syzbot reports are all in the commit message"
* tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: unset MNT_WRITE_HOLD on failure
Merge tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"At this time, the majority of changes are for pending ASoC fixes while
a few usual HD-audio and USB-audio quirks are found.
Almost all patches are small device-specific fixes, and nothing
worrisome stands out, so far"
* tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits)
ALSA: hda/realtek: Add quirk for Clevo NP70PNP
ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDs
ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook 845/865 G9
ALSA: usb-audio: Clear MIDI port active flag after draining
ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX.
ALSA: hda/i915: Fix one too many pci_dev_put()
ALSA: hda/hdmi: add HDMI codec VID for Raptorlake-P
ALSA: hda/hdmi: fix warning about PCM count when used with SOF
sound/oss/dmasound: fix 'dmasound_setup' defined but not used
firmware: cs_dsp: Fix overrun of unterminated control name string
ASoC: codecs: Fix an error handling path in (rx|tx|va)_macro_probe()
ASoC: Intel: sof_es8336: Add a quirk for Huawei Matebook D15
ASoC: Intel: sof_es8336: add a quirk for headset at mic1 port
ASoC: Intel: sof_es8336: support a separate gpio to control headphone
ASoC: Intel: sof_es8336: simplify speaker gpio naming
ASoC: wm8731: Disable the regulator when probing fails
ASoC: Intel: soc-acpi: correct device endpoints for max98373
ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
ASoC: SOF: topology: Fix memory leak in sof_control_load()
ASoC: SOF: topology: cleanup dailinks on widget unload
...
There is a race between xas_split() and xas_load() which can result in
the wrong page being returned, and thus data corruption. Fortunately,
it's hard to hit (syzbot took three months to find it) and often guarded
with VM_BUG_ON().
The anatomy of this race is:
thread A thread B
order-9 page is stored at index 0x200
lookup of page at index 0x274
page split starts
load of sibling entry at offset 9
stores nodes at offsets 8-15
load of entry at offset 8
The entry at offset 8 turns out to be a node, and so we descend into it,
and load the page at index 0x234 instead of 0x274. This is hard to fix
on the split side; we could replace the entire node that contains the
order-9 page instead of replacing the eight entries. Fixing it on
the lookup side is easier; just disallow sibling entries that point
to nodes. This cannot ever be a useful thing as the descent would not
know the correct offset to use within the new node.
The test suite continues to pass, but I have not added a new test for
this bug.
Reported-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com Tested-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Turn kmem_cache_alloc() into a wrapper around kmem_cache_alloc_lru().
Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Li Wang <liwang@redhat.com>
Subsystems affected by this patch series: mm (memory-failure, memcg,
userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
kcov: don't generate a warning on vm_insert_page()'s failure
MAINTAINERS: add Vincenzo Frascino to KASAN reviewers
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
selftest/vm: add skip support to mremap_test
selftest/vm: support xfail in mremap_test
selftest/vm: verify remap destination address in mremap_test
selftest/vm: verify mmap addr in mremap_test
mm, hugetlb: allow for "high" userspace addresses
userfaultfd: mark uffd_wp regardless of VM_WRITE flag
memcg: sync flush only if periodic flush is delayed
mm/memory-failure.c: skip huge_zero_page in memory_failure()
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
Nicholas Piggin [Fri, 22 Apr 2022 06:01:05 +0000 (16:01 +1000)]
mm/vmalloc: huge vmalloc backing pages should be split rather than compound
Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
in order to allow the sub-pages to be refcounted by callers such as
"remap_vmalloc_page [sic]" (remap_vmalloc_range).
However a similar problem exists for other struct page fields callers
use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
not only refcounts it but uses ->lru, ->mapping, ->index.
This is not compatible with compound sub-pages, and can cause bad page
state issues like
The correct approach is to use split high-order pages for the huge
vmalloc backing. These allow callers to treat them in exactly the same
way as individually-allocated order-0 pages.
Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Cc: Song Liu <songliubraving@fb.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 22 Apr 2022 06:00:33 +0000 (14:00 +0800)]
arm64: mm: fix p?d_leaf()
The pmd_leaf() is used to test a leaf mapped PMD, however, it misses
the PROT_NONE mapped PMD on arm64. Fix it. A real world issue [1]
caused by this was reported by Qian Cai. Also fix pud_leaf().
Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Extra quiet after Easter, only have minor i915 and msm pulls. However
I haven't seen a PR from our misc tree in a little while, I've cc'ed
all the suspects. Once that unblocks I expect a bit larger bunch of
patches to arrive.
Otherwise as I said, one msm revert and two i915 fixes.
msm:
- revert iommu change that broke some platforms.
i915:
- Unset enable_psr2_sel_fetch if PSR2 detection fails
- Fix to detect when VRR is turned off from panel settings"
* tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm:
drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
drm/msm: Revert "drm/msm: Stop using iommu_present()"
drm/i915/display/vrr: Reset VRR capable property on a long hpd
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
In some cases it is possible for mmu_interval_notifier_remove() to race
with mn_tree_inv_end() allowing it to return while the notifier data
structure is still in use. Consider the following sequence:
As the wait_event() condition is true it will return immediately. This
can lead to use-after-free type errors if the caller frees the data
structure containing the interval notifier subscription while it is
still on a deferred list. Fix this by taking the appropriate lock when
reading invalidate_seq to ensure proper synchronisation.
I observed this whilst running stress testing during some development.
You do have to be pretty unlucky, but it leads to the usual problems of
use-after-free (memory corruption, kernel crash, difficult to diagnose
WARN_ON, etc).
Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier") Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aleksandr Nogikh [Thu, 21 Apr 2022 23:36:07 +0000 (16:36 -0700)]
kcov: don't generate a warning on vm_insert_page()'s failure
vm_insert_page()'s failure is not an unexpected condition, so don't do
WARN_ONCE() in such a case.
Instead, print a kernel message and just return an error code.
This flaw has been reported under an OOM condition by sysbot [1].
The message is mainly for the benefit of the test log, in this case the
fuzzer's log so that humans inspecting the log can figure out what was
going on. KCOV is a testing tool, so I think being a little more chatty
when KCOV unexpectedly is about to fail will save someone debugging
time.
We don't want the WARN, because it's not a kernel bug that syzbot should
report, and failure can happen if the fuzzer tries hard enough (as
above).
Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com Fixes: b3d7fe86fbd0 ("kcov: properly handle subsequent mmap calls"), Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper. This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.
A race can occur between exit_mm and the oom reaper that allows the oom
reaper to free the memory of the futex robust list before the exit path
has handled the futex death:
selftest/vm: verify remap destination address in mremap_test
Because mremap does not have a MAP_FIXED_NOREPLACE flag, it can destroy
existing mappings. This causes a segfault when regions such as text are
remapped and the permissions are changed.
Verify the requested mremap destination address does not overlap any
existing mappings by using mmap's MAP_FIXED_NOREPLACE flag. Keep
incrementing the destination address until a valid mapping is found or
fail the current test once the max address is reached.
Avoid calling mmap with requested addresses that are less than the
system's mmap_min_addr. When run as root, mmap returns EACCES when
trying to map addresses < mmap_min_addr. This is not one of the error
codes for the condition to retry the mmap in the test.
Rather than arbitrarily retrying on EACCES, don't attempt an mmap until
addr > vm.mmap_min_addr.
Add a munmap call after an alignment check as the mappings are retained
after the retry and can reach the vm.max_map_count sysctl.
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high"
userspace addresses") for hugetlb.
This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).
Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of hugetlb_get_unmapped_area() function.
However, arm64 uses the generic version of that function.
So take into account arch_get_mmap_base() and arch_get_mmap_end() in
hugetlb_get_unmapped_area(). To allow that, move those two macros out
of mm/mmap.c into include/linux/sched/mm.h
If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.
For the time being, only ARM64 is affected by this change.
Catalin (ARM64) said
"We should have fixed hugetlb_get_unmapped_area() as well when we added
support for 52-bit VA. The reason for commit f6795053dac8 was to
prevent normal mmap() from returning addresses above 48-bit by default
as some user-space had hard assumptions about this.
It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
but I doubt anyone would notice. It's more likely that the current
behaviour would cause issues, so I'd rather have them consistent.
Basically when arm64 gained support for 52-bit addresses we did not
want user-space calling mmap() to suddenly get such high addresses,
otherwise we could have inadvertently broken some programs (similar
behaviour to x86 here). Hence we added commit f6795053dac8. But we
missed hugetlbfs which could still get such high mmap() addresses. So
in theory that's a potential regression that should have bee addressed
at the same time as commit f6795053dac8 (and before arm64 enabled
52-bit addresses)"
Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> [5.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nadav Amit [Thu, 21 Apr 2022 23:35:43 +0000 (16:35 -0700)]
userfaultfd: mark uffd_wp regardless of VM_WRITE flag
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
currently only marked as write-protected if the VMA has VM_WRITE flag
set. This seems incorrect or at least would be unexpected by the users.
Consider the following sequence of operations that are being performed
on a certain page:
At this point the user would expect to still get UFFD notification when
the page is accessed for write, but the user would not get one, since
the PTE was not marked as UFFD_WP during UFFDIO_COPY.
Fix it by always marking PTEs as UFFD_WP regardless on the
write-permission in the VMA flags.
Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg: sync flush only if periodic flush is delayed
Daniel Dao has reported [1] a regression on workloads that may trigger a
lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flush are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than batch value within short amount of
time. Since the rstat flush can happen in the performance critical
codepaths like page faults, such workload can suffer greatly.
This patch fixes this regression by making the rstat flushing
conditional in the performance critical codepaths. More specifically,
the kernel relies on the async periodic rstat flusher to flush the stats
and only if the periodic flusher is delayed by more than twice the
amount of its normal time window then the kernel allows rstat flushing
from the performance critical codepaths.
Now the question: what are the side-effects of this change? The worst
that can happen is the refault codepath will see 4sec old lruvec stats
and may cause false (or missed) activations of the refaulted page which
may under-or-overestimate the workingset size. Though that is not very
concerning as the kernel can already miss or do false activations.
There are two more codepaths whose flushing behavior is not changed by
this patch and we may need to come to them in future. One is the
writeback stats used by dirty throttling and second is the deactivation
heuristic in the reclaim. For now keeping an eye on them and if there
is report of regression due to these codepaths, we will reevaluate then.
Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Daniel Dao <dqminh@cloudflare.com> Tested-by: Ivan Babrou <ivan@cloudflare.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Frank Hofmann <fhofmann@cloudflare.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>