Loic Poulain [Wed, 7 May 2025 13:45:38 +0000 (15:45 +0200)]
mmc: core: Scan the eMMC boot areas for partition table
It appears that some vendors provision the boot areas with valid part
tables (GPT) in order to have identifiable partitions for device and
firmware specific data, such has the qualcomm CDT (Qualcomm Config
Data Table). Additionally, these boot areas can be utilized to host
device-specific IDs, calibration data, and other critical information.
Charan Pedumuru [Wed, 7 May 2025 06:29:36 +0000 (06:29 +0000)]
dt-binding: mmc: microchip,sdhci-pic32: convert text based binding to json schema
Update text binding to YAML.
Changes during conversion:
Add appropriate include statements for interrupts and clock-names
to resolve errors identified by `dt_binding_check` and `dtbs_check`.
Randy Dunlap [Thu, 24 Apr 2025 03:46:10 +0000 (20:46 -0700)]
mmc: sdhci-esdhc-imx: fix defined but not used warnings
Fix warnings when CONFIG_PM=y and CONFIG_PM_SLEEP is not set by
surrounding the 2 functions with #ifdef CONFIG_PM_SLEEP.
drivers/mmc/host/sdhci-esdhc-imx.c:1659:13: warning: 'sdhc_esdhc_tuning_restore' defined but not used [-Wunused-function]
1659 | static void sdhc_esdhc_tuning_restore(struct sdhci_host *host)
drivers/mmc/host/sdhci-esdhc-imx.c:1637:13: warning: 'sdhc_esdhc_tuning_save' defined but not used [-Wunused-function]
1637 | static void sdhc_esdhc_tuning_save(struct sdhci_host *host)
Fixes: 3d1eea493894 ("mmc: sdhci-esdhc-imx: Save tuning value when card stays powered in suspend") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Haibo Chen <haibo.chen@nxp.com> Link: https://lore.kernel.org/r/20250424034610.441532-1-rdunlap@infradead.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
mmc: sdhci-of-dwcmshc: add PD workaround on RK3576
RK3576's power domains have a peculiar design where the PD_NVM power
domain, of which the sdhci controller is a part, seemingly does not have
idempotent runtime disable/enable. The end effect is that if PD_NVM gets
turned off by the generic power domain logic because all the devices
depending on it are suspended, then the next time the sdhci device is
unsuspended, it'll hang the SoC as soon as it tries accessing the CQHCI
registers.
RK3576's UFS support needed a new dev_pm_genpd_rpm_always_on function
added to the generic power domains API to handle what appears to be a
similar hardware design.
Use this new function to ask for the same treatment in the sdhci
controller by giving rk3576 its own platform data with its own postinit
function. The benefit of doing this instead of marking the power domains
always on in the power domain core is that we only do this if we know
the platform we're running on actually uses the sdhci controller. For
others, keeping PD_NVM always on would be a waste, as they won't run
into this specific issue. The only other IP in PD_NVM that could be
affected is FSPI0. If it gets a mainline driver, it will probably want
to do the same thing.
Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com> Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com> Fixes: cfee1b507758 ("pmdomain: rockchip: Add support for RK3576 SoC") Cc: <stable@vger.kernel.org> # v6.15+ Link: https://lore.kernel.org/r/20250423-rk3576-emmc-fix-v3-1-0bf80e29967f@collabora.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Philipp Stanner [Thu, 17 Apr 2025 09:27:43 +0000 (11:27 +0200)]
mmc: cavium-thunderx: Use non-hybrid PCI devres API
cavium-thunderx enables its PCI device with pcim_enable_device(). This,
implicitly, switches the function pci_request_regions() into managed
mode, where it becomes a devres function.
The PCI subsystem wants to remove this hybrid nature from its
interfaces. To do so, users of the aforementioned combination of
functions must be ported to non-hybrid functions.
Moreover, since both functions are already managed in this driver, the
calls to pci_release_regions() are unnecessary.
Remove the calls to pci_release_regions().
Replace the call to sometimes-managed pci_request_regions() with one to
the always-managed pcim_request_all_regions().
dt-bindings: mmc: mtk-sd: Add support for Dimensity 1200 MT6893
Add a compatible for the MediaTek Dimensity 1200 (MT6893) SoC.
All of the MMC/SD controllers in this chip are compatible with
the ones found in MT8183, but do also make use of an optional
crypto clock when enabling HW disk encryption.
Axe Yang [Fri, 11 Apr 2025 05:40:25 +0000 (13:40 +0800)]
mmc: mtk-sd: Add condition to enable 'single' burst type
This change add a condition for 'single' burst type selection.
Read AXI_LEN field from EMMC50_CFG2(AHB2AXI wrapper) register, if the
value is not 0, it means the HWIP is using AXI as AMBA bus, which do
not support 'single' burst type.
Suggested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Axe Yang <axe.yang@mediatek.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20250411054134.31822-1-axe.yang@mediatek.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Luke Wang [Wed, 9 Apr 2025 07:55:50 +0000 (15:55 +0800)]
mmc: sdhci-esdhc-imx: switch standard tuning to manual tuning
Current standard tuning has some limitations:
1. Standard tuning only try 40 times to find first pass window, but this
pass window maybe not the best pass window.
2. Sometimes there are two tuning pass windows and the gap between
those two windows may only have one cell. If tuning step > 1, the gap may
just be skipped and host assumes those two windows as a continuous
windows. This will cause a bad delay cell near the gap to be selected.
3. Standard tuning logic need to detect at least one success and failure
to pass the tuning. If all cells in the tuning window pass, the hardware
will not set the SDHCI_CTRL_TUNED_CLK bit, causing tuning failed.
4. Standard tuning logic only check the CRC, do not really compare the data
pattern. If data pins are connected incorrectly, standard will not detect
this kind of issue.
Switch to manual tuning to avoid those limitations
mmc: sdhci-esdhc-imx: widen auto-tuning window for manual tuning
Expand the auto-tuning window width from 0 to 3 for manual tuning to
account for sampling point shifts caused by temperature change. This change
is based on hardware recommendation, providing enough margin for the
auto-tuning logic to locate valid sampling points.
When config the manual tuning final sample delay, need deduct the auto
tuning window width according to the IP logic.
mmc: sdhci-esdhc-imx: widen auto-tuning window for standard tuning
Expand the auto-tuning window width from 2 to 3 for standard tuning to
account for sampling point shifts caused by temperature change. This change
is based on hardware recommendation, providing 50% more margin for the
auto-tuning logic to locate valid sampling points.
mmc: sdhci-esdhc-imx: explicitly reset tuning circuit via RSTT bit
According to the i.MX Reference Manual, the RSTT bit (SYS_CTRL[28]) is
designed to reset the tuning circuit. While the Reference Manual states
that clearing EXECUTE_TUNING bit from 1 to 0 in AUTOCMD12_ERR_STATUS
can also set RSTT, this mechanism only works when the original
EXECUTE_TUNING bit was 1. When the bit is already 0, the tuning circuit
reset will not be triggered.
This explicit reset approach strengthens the tuning reliability and
aligns with the Reference Manual recommendations.
Wolfram Sang [Tue, 1 Apr 2025 09:58:45 +0000 (11:58 +0200)]
mmc: rename mmc_can_trim() to mmc_card_can_trim()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
Wolfram Sang [Tue, 1 Apr 2025 09:58:44 +0000 (11:58 +0200)]
mmc: rename mmc_can_sleep() to mmc_card_can_sleep()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
Wolfram Sang [Tue, 1 Apr 2025 09:58:43 +0000 (11:58 +0200)]
mmc: rename mmc_can_secure_erase_trim() to mmc_card_can_secure_erase_trim()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
Wolfram Sang [Tue, 1 Apr 2025 09:58:42 +0000 (11:58 +0200)]
mmc: rename mmc_can_sanitize() to mmc_card_can_sanitize()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
Wolfram Sang [Tue, 1 Apr 2025 09:58:41 +0000 (11:58 +0200)]
mmc: rename mmc_can_reset() to mmc_card_can_reset()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here. Conversion was simplified by
inverting the logic.
Wolfram Sang [Tue, 1 Apr 2025 09:58:39 +0000 (11:58 +0200)]
mmc: rename mmc_can_ext_csd() to mmc_card_can_ext_csd()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
Wolfram Sang [Tue, 1 Apr 2025 09:58:38 +0000 (11:58 +0200)]
mmc: rename mmc_can_erase() to mmc_card_can_erase()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
Wolfram Sang [Tue, 1 Apr 2025 09:58:37 +0000 (11:58 +0200)]
mmc: rename mmc_can_discard() to mmc_card_can_discard()
mmc_can_* functions sometimes relate to the card and sometimes to the host.
Make it obvious by renaming this function to include 'card'. Also, convert to
proper bool type while we are here.
mmc: mtk-sd: Do single write in function msdc_new_tx_setting
Instead of reading and writing the LOOP_TEST_CONTROL register for
each set or cleared bit, read it once, modify the contents in a
local variable, and then write once.
mmc: mtk-sd: Aggregate writes for MSDC_PATCH_BIT1/2 setup
Instead of continuously reading and writing to the patch bit 1/2
registers, prepare the final values to write to those and write
just once per register during the setup phase.
This makes the driver slightly smaller and also slightly improves
the execution time of the msdc_init_hw function, called not only
at probe time, but also when resuming from system suspend.
mmc: mtk-sd: Clarify patch bit register initialization and layout
In preparation for cleaning up the msdc_init_hw register setting
for the patch bit registers, add the necessary definitions for
bits in the MSDC_PATCH_BIT and MSDC_PATCH_BIT1 registers and use
them in place of "magic numbers" writes during initialization.
Luke Wang [Fri, 28 Mar 2025 11:25:17 +0000 (19:25 +0800)]
mmc: sdhci-esdhc-imx: Save tuning value when card stays powered in suspend
For SoCs like i.MX6UL(L/Z) and i.MX7D, USDHC powers off completely during
system power management (PM), causing the internal tuning status to be
lost. To address this, save the tuning value when system suspend and
restore it for any command issued after system resume when re-tuning is
held.
A typical case involves SDIO WiFi devices with the MMC_PM_KEEP_POWER and
MMC_PM_WAKE_SDIO_IRQ flag, which retain power during system PM. To
conserve power, WiFi switches to 1-bit mode and restores 4-bit mode upon
resume. As per the specification, tuning commands are not supported in
1-bit mode. When sending CMD52 to restore 4-bit mode, re-tuning must be
held. However, CMD52 still requires a correct sample point to avoid CRC
errors, necessitating preservation of the previous tuning value.
mmc: core: Add support for graceful host removal for SD
An mmc host driver may allow to unbind from its corresponding host device.
If an SD card is attached to the host, the mmc core will just try to cut
the power for it, without obeying to the SD spec that potentially may
damage the card.
Let's fix this problem by implementing a graceful power-down of the card at
host removal.
mmc: core: Add support for graceful host removal for eMMC
An mmc host driver may allow to unbind from its corresponding host device.
If an eMMC card is attached to the host, the mmc core will just try to cut
the power for it, without obeying to the eMMC spec.
Potentially this may damage the card and it may also prevent us from
successfully doing a re-initialization of it, which would typically happen
if/when we try to re-bind the mmc host driver.
To fix these problems, let's implement a graceful power-down of the card at
host removal.
Reported-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250407152759.25160-5-ulf.hansson@linaro.org
mmc: core: Convert into an enum for the poweroff-type for eMMC
Currently we are only distinguishing between the suspend and the shutdown
scenarios, which make a bool sufficient as in-parameter to the various PM
functions for eMMC. However, to prepare for adding support for another
scenario in a subsequent change, let's convert into using an enum.
Suggested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250407152759.25160-4-ulf.hansson@linaro.org
mmc: core: Further avoid re-storing power to the eMMC before a shutdown
To manage a graceful power-off of the eMMC card during platform shutdown,
the prioritized option is to use the poweroff-notification command, if the
eMMC card supports it.
During a suspend request we may decide to fall back to use the sleep
command, instead of the poweroff-notification, unless the mmc host supports
a complete power-cycle of the eMMC. For this reason, we may need to restore
power and re-initialize the card, if it remains suspended when a shutdown
request is received.
However, the current condition to restore power and re-initialize the card
doesn't take into account MMC_CAP2_FULL_PWR_CYCLE_IN_SUSPEND properly. This
may lead to doing a re-initialization when it really isn't needed, as the
eMMC may already have been powered-off using the poweroff-notification
command. Let's fix the condition to avoid this.
Erick Shepherd [Mon, 31 Mar 2025 22:13:37 +0000 (17:13 -0500)]
mmc: Add quirk to disable DDR50 tuning
Adds the MMC_QUIRK_NO_UHS_DDR50_TUNING quirk and updates
mmc_execute_tuning() to return 0 if that quirk is set. This fixes an
issue on certain Swissbit SD cards that do not support DDR50 tuning
where tuning requests caused I/O errors to be thrown.
Wolfram Sang [Mon, 31 Mar 2025 06:43:17 +0000 (08:43 +0200)]
mmc: renesas_sdhi: improve registering irqs
The probe() function is convoluted enough, so merge sanity checks for
number of irqs into one place. Also, change the error code for 'no irq'
because ENXIO will not print a warning from the driver core.
Lad Prabhakar [Wed, 26 Mar 2025 14:39:37 +0000 (14:39 +0000)]
dt-bindings: mmc: renesas,sdhi: Document RZ/V2N support
Add SDHI bindings for the Renesas RZ/V2N (a.k.a R9A09G056) SoC. Use
`renesas,sdhi-r9a09g057` as a fallback since the SD/MMC block on
RZ/V2N is identical to the one on RZ/V2H(P), allowing reuse of the
existing driver without modifications.
dt-bindings: mmc: marvell,xenon-sdhci: Allow "dma-coherent" and "iommus"
The Marvell xenon-sdhci block can be cache-coherent and needs the
"dma-coherent" property. It can also be behind an IOMMU and needs the
"iommus" property.
Linus Torvalds [Sun, 11 May 2025 18:30:13 +0000 (11:30 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM:
- Avoid use of uninitialized memcache pointer in user_mem_abort()
- Always set HCR_EL2.xMO bits when running in VHE, allowing
interrupts to be taken while TGE=0 and fixing an ugly bug on
AmpereOne that occurs when taking an interrupt while clearing the
xMO bits (AC03_CPU_36)
- Prevent VMMs from hiding support for AArch64 at any EL virtualized
by KVM
- Save/restore the host value for HCRX_EL2 instead of restoring an
incorrect fixed value
- Make host_stage2_set_owner_locked() check that the entire requested
range is memory rather than just the first page
RISC-V:
- Add missing reset of smstateen CSRs
x86:
- Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid
causing problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to
sanitize the VMCB as its state is undefined after SHUTDOWN,
emulating INIT is the least awful choice).
- Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
KVM doesn't goof a sanity check in the future.
- Free obsolete roots when (re)loading the MMU to fix a bug where
pre-faulting memory can get stuck due to always encountering a
stale root.
- When dumping GHCB state, use KVM's snapshot instead of the raw GHCB
page to print state, so that KVM doesn't print stale/wrong
information.
- When changing memory attributes (e.g. shared <=> private), add
potential hugepage ranges to the mmu_invalidate_range_{start,end}
set so that KVM doesn't create a shared/private hugepage when the
the corresponding attributes will become mixed (the attributes are
commited *after* KVM finishes the invalidation).
- Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM
has at least one active VM. Effectively BP_SPEC_REDUCE when KVM is
loaded led to very measurable performance regressions for non-KVM
workloads"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions
KVM: arm64: Fix memory check in host_stage2_set_owner_locked()
KVM: arm64: Kill HCRX_HOST_FLAGS
KVM: arm64: Properly save/restore HCRX_EL2
KVM: arm64: selftest: Don't try to disable AArch64 support
KVM: arm64: Prevent userspace from disabling AArch64 support at any virtualisable EL
KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
KVM: arm64: Fix uninitialized memcache pointer in user_mem_abort()
KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing
KVM: SVM: Update dump_ghcb() to use the GHCB snapshot fields
KVM: RISC-V: reset smstateen CSRs
KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()
KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run()
KVM: SVM: Forcibly leave SMM mode on SHUTDOWN interception
Linus Torvalds [Sun, 11 May 2025 17:33:25 +0000 (10:33 -0700)]
Merge tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc timers fixes from Ingo Molnar:
- Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks
- Work around absolute relocations into vDSO code that GCC erroneously
emits in certain arm64 build environments
- Fix a false positive lockdep warning in the i8253 clocksource driver
* tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/i8253: Use raw_spinlock_irqsave() in clockevent_i8253_disable()
arm64: vdso: Work around invalid absolute relocations from GCC
timekeeping: Prevent coarse clocks going backwards
Linus Torvalds [Sun, 11 May 2025 17:29:29 +0000 (10:29 -0700)]
Merge tag 'input-for-v6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- Synaptics touchpad on multiple laptops (Dynabook Portege X30L-G,
Dynabook Portege X30-D, TUXEDO InfinityBook Pro 14 v5, Dell Precision
M3800, HP Elitebook 850 G1) switched from PS/2 to SMBus mode
- a number of new controllers added to xpad driver: HORI Drum
controller, PowerA Fusion Pro 4, PowerA MOGA XP-Ultra controller,
8BitDo Ultimate 2 Wireless Controller, 8BitDo Ultimate 3-mode
Controller, Hyperkin DuchesS Xbox One controller
- fixes to xpad driver to properly handle Mad Catz JOYTECH NEO SE
Advanced and PDP Mirror's Edge Official controllers
- fixes to xpad driver to properly handle "Share" button on some
controllers
- a fix for device initialization timing and for waking up the
controller in cyttsp5 driver
- a fix for hisi_powerkey driver to properly wake up from s2idle state
- other assorted cleanups and fixes
* tag 'input-for-v6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: xpad - fix xpad_device sorting
Input: xpad - add support for several more controllers
Input: xpad - fix Share button on Xbox One controllers
Input: xpad - fix two controller table values
Input: hisi_powerkey - enable system-wakeup for s2idle
Input: synaptics - enable InterTouch on Dell Precision M3800
Input: synaptics - enable InterTouch on TUXEDO InfinityBook Pro 14 v5
Input: synaptics - enable InterTouch on Dynabook Portege X30L-G
Input: synaptics - enable InterTouch on Dynabook Portege X30-D
Input: synaptics - enable SMBus for HP Elitebook 850 G1
Input: mtk-pmic-keys - fix possible null pointer dereference
Input: xpad - add support for 8BitDo Ultimate 2 Wireless Controller
Input: cyttsp5 - fix power control issue on wakeup
MAINTAINERS: .mailmap: update Mattijs Korpershoek's email address
dt-bindings: mediatek,mt6779-keypad: Update Mattijs' email address
Input: stmpe-ts - use module alias instead of device table
Input: cyttsp5 - ensure minimum reset pulse width
Input: sparcspkr - avoid unannotated fall-through
input/joystick: magellan: Mark __nonstring look-up table
Linus Torvalds [Sun, 11 May 2025 17:23:53 +0000 (10:23 -0700)]
Merge tag 'fixes-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fixes from Mike Rapoport:
- Mark set_high_memory() as __init to fix section mismatch
- Accept memory allocated in memblock_double_array() to mitigate crash
of SNP guests
* tag 'fixes-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: Accept allocated memory before use in memblock_double_array()
mm,mm_init: Mark set_high_memory as __init
Vicki Pfau [Sun, 11 May 2025 05:59:25 +0000 (22:59 -0700)]
Input: xpad - fix Share button on Xbox One controllers
The Share button, if present, is always one of two offsets from the end of the
file, depending on the presence of a specific interface. As we lack parsing for
the identify packet we can't automatically determine the presence of that
interface, but we can hardcode which of these offsets is correct for a given
controller.
More controllers are probably fixable by adding the MAP_SHARE_BUTTON in the
future, but for now I only added the ones that I have the ability to test
directly.
Vicki Pfau [Fri, 28 Mar 2025 23:43:36 +0000 (16:43 -0700)]
Input: xpad - fix two controller table values
Two controllers -- Mad Catz JOYTECH NEO SE Advanced and PDP Mirror's
Edge Official -- were missing the value of the mapping field, and thus
wouldn't detect properly.
Ulf Hansson [Thu, 6 Mar 2025 11:50:21 +0000 (12:50 +0100)]
Input: hisi_powerkey - enable system-wakeup for s2idle
To wake up the system from s2idle when pressing the power-button, let's
convert from using pm_wakeup_event() to pm_wakeup_dev_event(), as it allows
us to specify the "hard" in-parameter, which needs to be set for s2idle.
Linus Torvalds [Sat, 10 May 2025 22:50:56 +0000 (15:50 -0700)]
Merge tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"22 hotfixes. 13 are cc:stable and the remainder address post-6.14
issues or aren't considered necessary for -stable kernels.
About half are for MM. Five OCFS2 fixes and a few MAINTAINERS updates"
* tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
mm: fix folio_pte_batch() on XEN PV
nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()
mm/hugetlb: copy the CMA flag when demoting
mm, swap: fix false warning for large allocation with !THP_SWAP
selftests/mm: fix a build failure on powerpc
selftests/mm: fix build break when compiling pkey_util.c
mm: vmalloc: support more granular vrealloc() sizing
tools/testing/selftests: fix guard region test tmpfs assumption
ocfs2: stop quota recovery before disabling quotas
ocfs2: implement handshaking with ocfs2 recovery thread
ocfs2: switch osb->disable_recovery to enum
mailmap: map Uwe's BayLibre addresses to a single one
MAINTAINERS: add mm THP section
mm/userfaultfd: fix uninitialized output field for -EAGAIN race
selftests/mm: compaction_test: support platform with huge mount of memory
MAINTAINERS: add core mm section
ocfs2: fix panic in failed foilio allocation
mm/huge_memory: fix dereferencing invalid pmd migration entry
MAINTAINERS: add reverse mapping section
x86: disable image size check for test builds
...
Linus Torvalds [Sat, 10 May 2025 16:53:11 +0000 (09:53 -0700)]
Merge tag 'driver-core-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fix from Greg KH:
"Here is a single driver core fix for a regression for platform devices
that is a regression from a change that went into 6.15-rc1 that
affected Pixel devices. It has been in linux-next for over a week with
no reported problems"
* tag 'driver-core-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
platform: Fix race condition during DMA configure at IOMMU probe time
Linus Torvalds [Sat, 10 May 2025 16:18:05 +0000 (09:18 -0700)]
Merge tag 'usb-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 6.15-rc6. Included in here
are:
- typec driver fixes
- usbtmc ioctl fixes
- xhci driver fixes
- cdnsp driver fixes
- some gadget driver fixes
Nothing really major, just all little stuff that people have reported
being issues. All of these have been in linux-next this week with no
reported issues"
* tag 'usb-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
xhci: dbc: Avoid event polling busyloop if pending rx transfers are inactive.
usb: xhci: Don't trust the EP Context cycle bit when moving HW dequeue
usb: usbtmc: Fix erroneous generic_read ioctl return
usb: usbtmc: Fix erroneous wait_srq ioctl return
usb: usbtmc: Fix erroneous get_stb ioctl error returns
usb: typec: tcpm: delay SNK_TRY_WAIT_DEBOUNCE to SRC_TRYWAIT transition
USB: usbtmc: use interruptible sleep in usbtmc_read
usb: cdnsp: fix L1 resume issue for RTL_REVISION_NEW_LPM version
usb: typec: ucsi: displayport: Fix NULL pointer access
usb: typec: ucsi: displayport: Fix deadlock
usb: misc: onboard_usb_dev: fix support for Cypress HX3 hubs
usb: uhci-platform: Make the clock really optional
usb: dwc3: gadget: Make gadget_wakeup asynchronous
usb: gadget: Use get_status callback to set remote wakeup capability
usb: gadget: f_ecm: Add get_status callback
usb: host: tegra: Prevent host controller crash when OTG port is used
usb: cdnsp: Fix issue with resuming from L1
usb: gadget: tegra-xudc: ACK ST_RC after clearing CTRL_RUN
Linus Torvalds [Sat, 10 May 2025 16:08:19 +0000 (09:08 -0700)]
Merge tag 'staging-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are three small staging driver fixes for 6.15-rc6. These are:
- bcm2835-camera driver fix
- two axis-fifo driver fixes
All of these have been in linux-next for a few weeks with no reported
issues"
* tag 'staging-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: axis-fifo: Remove hardware resets for user errors
staging: axis-fifo: Correct handling of tx_fifo_depth for size validation
staging: bcm2835-camera: Initialise dev in v4l2_dev
Linus Torvalds [Sat, 10 May 2025 15:52:41 +0000 (08:52 -0700)]
Merge tag 'i2c-for-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
- omap: use correct function to read from device tree
- MAINTAINERS: remove Seth from ISMT maintainership
* tag 'i2c-for-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: Remove entry for Seth Heasley
i2c: omap: fix deprecated of_property_read_bool() use
Linus Torvalds [Sat, 10 May 2025 15:44:36 +0000 (08:44 -0700)]
Merge tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A fix for the xenbus driver allowing to use a PVH Dom0 with
Xenstore running in another domain
- A fix for the xenbus driver addressing a rare race condition
resulting in NULL dereferences and other problems
- A fix for the xen-swiotlb driver fixing a problem seen on Arm
platforms
* tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xenbus: Use kref to track req lifetime
xenbus: Allow PVH dom0 a non-local xenstore
xen: swiotlb: Use swiotlb bouncing if kmalloc allocation demands it
Linus Torvalds [Sat, 10 May 2025 15:36:07 +0000 (08:36 -0700)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount fixes from Al Viro:
"A couple of races around legalize_mnt vs umount (both fairly old and
hard to hit) plus two bugs in move_mount(2) - both around 'move
detached subtree in place' logics"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix IS_MNT_PROPAGATING uses
do_move_mount(): don't leak MNTNS_PROPAGATING on failures
do_umount(): add missing barrier before refcount checks in sync case
__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
Paolo Bonzini [Sat, 10 May 2025 15:11:06 +0000 (11:11 -0400)]
Merge tag 'kvm-x86-fixes-6.15-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.15-rcN
- Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid causing
problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to sanitize the
VMCB as its state is undefined after SHUTDOWN, emulating INIT is the
least awful choice).
- Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
KVM doesn't goof a sanity check in the future.
- Free obsolete roots when (re)loading the MMU to fix a bug where
pre-faulting memory can get stuck due to always encountering a stale
root.
- When dumping GHCB state, use KVM's snapshot instead of the raw GHCB page
to print state, so that KVM doesn't print stale/wrong information.
- When changing memory attributes (e.g. shared <=> private), add potential
hugepage ranges to the mmu_invalidate_range_{start,end} set so that KVM
doesn't create a shared/private hugepage when the the corresponding
attributes will become mixed (the attributes are commited *after* KVM
finishes the invalidation).
- Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM has at
least one active VM. Effectively BP_SPEC_REDUCE when KVM is loaded led
to very measurable performance regressions for non-KVM workloads.
Paolo Bonzini [Sat, 10 May 2025 15:10:02 +0000 (11:10 -0400)]
Merge tag 'kvmarm-fixes-6.15-3' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.15, round #3
- Avoid use of uninitialized memcache pointer in user_mem_abort()
- Always set HCR_EL2.xMO bits when running in VHE, allowing interrupts
to be taken while TGE=0 and fixing an ugly bug on AmpereOne that
occurs when taking an interrupt while clearing the xMO bits
(AC03_CPU_36)
- Prevent VMMs from hiding support for AArch64 at any EL virtualized by
KVM
- Save/restore the host value for HCRX_EL2 instead of restoring an
incorrect fixed value
- Make host_stage2_set_owner_locked() check that the entire requested
range is memory rather than just the first page
Linus Torvalds [Fri, 9 May 2025 23:45:21 +0000 (16:45 -0700)]
Merge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix dentry leak which can cause umount crash
- Add warning for parse contexts error on compounded operation
* tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Avoid race in open_cached_dir with lease breaks
smb3 client: warn when parse contexts returns error on compounded operation
Al Viro [Thu, 8 May 2025 19:35:51 +0000 (15:35 -0400)]
fix IS_MNT_PROPAGATING uses
propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself. What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.
When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).
Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'. Trying to move_mount() our detached tree on subdirectory
in A should have
* moved B' on that subdirectory in A
* skipped the corresponding subdirectory in B' itself
* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.
IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount(). The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.
Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 29 Apr 2025 01:43:23 +0000 (21:43 -0400)]
do_move_mount(): don't leak MNTNS_PROPAGATING on failures
as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.
Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Tue, 29 Apr 2025 03:56:14 +0000 (23:56 -0400)]
do_umount(): add missing barrier before refcount checks in sync case
do_umount() analogue of the race fixed in 119e1ef80ecf "fix
__legitimize_mnt()/mntput() race". Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment. Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit). Still a bug...
Fixes: 48a066e72d97 ("RCU'd vfsmounts") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sun, 27 Apr 2025 19:41:51 +0000 (15:41 -0400)]
__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().
Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Fri, 9 May 2025 19:41:34 +0000 (12:41 -0700)]
Merge tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes, bit bigger than last week, but overall amdgpu/xe
with some ivpu bits and a random few fixes, and dropping the
ttm_backup struct which wrapped struct file and was recently
frowned at.
xe:
- Prevent PF queue overflow
- Hold all forcewake during mocs test
- Remove GSC flush on reset path
- Fix forcewake put on error path
- Fix runtime warning when building without svm
i915:
- Fix oops on resume after disconnecting DP MST sinks during suspend
- Fix SPLC num_waiters refcounting
ivpu:
- Increase timeouts
- Fix deadlock in cmdq ioctl
- Unlock mutices in correct order
v3d:
- Avoid memory leak in job handling"
* tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel: (32 commits)
drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation
drm/xe: Add config control for svm flush work
drm/xe: Release force wake first then runtime power
drm/xe/gsc: do not flush the GSC worker from the reset path
drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs
drm/xe: Add page queue multiplier
drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush
drm/amdgpu: fix pm notifier handling
Revert "drm/amd: Stop evicting resources on APUs in suspend"
drm/amdgpu/vcn: using separate VCN1_AON_SOC offset
drm/amd/display: Fix wrong handling for AUX_DEFER case
drm/amd/display: Copy AUX read reply data whenever length > 0
drm/amd/display: Remove incorrect checking in dmub aux handler
drm/amd/display: Fix the checking condition in dmub aux handling
drm/amd/display: Shift DMUB AUX reply command if necessary
drm/amd/display: Call FP Protect Before Mode Programming/Mode Support
...
Dave Airlie [Fri, 9 May 2025 19:02:38 +0000 (05:02 +1000)]
Merge tag 'drm-xe-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Prevent PF queue overflow
- Hold all forcewake during mocs test
- Remove GSC flush on reset path
- Fix forcewake put on error path
- Fix runtime warning when building without svm
Linus Torvalds [Fri, 9 May 2025 18:30:26 +0000 (11:30 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Move the arm64_use_ng_mappings variable from the .bss to the .data
section as it is accessed very early during boot with the MMU off and
before the .bss has been initialised.
This could lead to incorrect idmap page table"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation
Linus Torvalds [Fri, 9 May 2025 18:17:50 +0000 (11:17 -0700)]
Merge tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- The compressed half-word misaligned access instructions (c.lhu, c.lh,
and c.sh) from the Zcb extension are now properly emulated
- A series of fixes to properly emulate permissions while handling
userspace misaligned accesses
- A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the
envcfg CSR on systems that don't support that CSR, and to report
those failures up to userspace
- The .rela.dyn section is no longer stripped from vmlinux, as it is
necessary to relocate the kernel under some conditions (including
kexec)
* tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
scripts: Do not strip .rela.dyn section
riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
riscv: misaligned: use get_user() instead of __get_user()
riscv: misaligned: enable IRQs while handling misaligned accesses
riscv: misaligned: factorize trap handling
riscv: misaligned: Add handling for ZCB instructions
Linus Torvalds [Fri, 9 May 2025 17:34:50 +0000 (10:34 -0700)]
Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a regression in this series for loop and read/write iterator
handling
- zone append block update tweak
- remove a broken IO priority test
- NVMe pull request via Christoph:
- unblock ctrl state transition for firmware update (Daniel
Wagner)
* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
block: remove test of incorrect io priority level
nvme: unblock ctrl state transition for firmware update
block: only update request sector if needed
loop: Add sanity check for read/write_iter
Linus Torvalds [Fri, 9 May 2025 16:26:46 +0000 (09:26 -0700)]
Merge tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for linked timeouts arming and firing wrt prep and issue of the
request being managed by the linked timeout
- Fix for a CQE ordering issue between requests with multishot and
using the same buffer group. This is a dumbed down version for this
release and for stable, it'll get improved for v6.16
- Tweak the SQPOLL submit batch size. A previous commit made SQPOLL
manage its own task_work and chose a tiny batch size, bump it from 8
to 32 to fix a performance regression due to that
* tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux:
io_uring/sqpoll: Increase task_work submission batch size
io_uring: ensure deferred completions are flushed for multishot
io_uring: always arm linked timeouts prior to issue
Linus Torvalds [Fri, 9 May 2025 16:09:49 +0000 (09:09 -0700)]
Merge tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules fix from Petr Pavlu:
"A single fix to prevent use of an uninitialized completion pointer
when releasing a module_kobject in specific situations.
This addresses a latent bug exposed by commit f95bbfe18512 ("drivers:
base: handle module_kobject creation")"
* tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: ensure that kobject_put() is safe for module type kobjects
Dave Hansen [Thu, 8 May 2025 22:41:32 +0000 (15:41 -0700)]
x86/mm: Eliminate window where TLB flushes may be inadvertently skipped
tl;dr: There is a window in the mm switching code where the new CR3 is
set and the CPU should be getting TLB flushes for the new mm. But
should_flush_tlb() has a bug and suppresses the flush. Fix it by
widening the window where should_flush_tlb() sends an IPI.
Long Version:
=== History ===
There were a few things leading up to this.
First, updating mm_cpumask() was observed to be too expensive, so it was
made lazier. But being lazy caused too many unnecessary IPIs to CPUs
due to the now-lazy mm_cpumask(). So code was added to cull
mm_cpumask() periodically[2]. But that culling was a bit too aggressive
and skipped sending TLB flushes to CPUs that need them. So here we are
again.
=== Problem ===
The too-aggressive code in should_flush_tlb() strikes in this window:
// Turn on IPIs for this CPU/mm combination, but only
// if should_flush_tlb() agrees:
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
load_new_mm_cr3(need_flush);
// ^ After 'need_flush' is set to false, IPIs *MUST*
// be sent to this CPU and not be ignored.
this_cpu_write(cpu_tlbstate.loaded_mm, next);
// ^ Not until this point does should_flush_tlb()
// become true!
should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3()
and writing to 'loaded_mm', which is a window where they should not be
suppressed. Whoops.
=== Solution ===
Thankfully, the fuzzy "just about to write CR3" window is already marked
with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in
should_flush_tlb() is sufficient to ensure that the CPU is targeted with
an IPI.
This will cause more TLB flush IPIs. But the window is relatively small
and I do not expect this to cause any kind of measurable performance
impact.
Update the comment where LOADED_MM_SWITCHING is written since it grew
yet another user.
Peter Z also raised a concern that should_flush_tlb() might not observe
'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off()
writes them. Add a barrier to ensure that they are observed in the
order they are written.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2] Reported-by: Stephen Dolan <sdolan@janestreet.com> Cc: stable@vger.kernel.org Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our QA team reported a 10%-23%, throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep. While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IO queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing less ios per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profilings charts, and increased
submission latency measured by fio.
There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation. But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput. The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.
In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:
READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth. The average
submission latency is noticeably lowered too. As measured by
fio:
Baseline:
lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
lat (usec): min=26, max=226, avg=86.48, stdev=4.82
If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.
In the baseline, after a stabilization phase, an ordinary submission
looks like:
254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug.
254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32
Ideally, the cap size would at least be the deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size. We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem. Instead, let's just give it a more sensible value
that will allow for more efficient batching. I've tested with different
cap values, and initially proposed to increase the cap to 1024. Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.
Imre Deak [Wed, 7 May 2025 15:19:53 +0000 (18:19 +0300)]
drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation
Determining the SST/MST mode during state computation must be done based
on the output type stored in the CRTC state, which in turn is set once
based on the modeset connector's SST vs. MST type and will not change as
long as the connector is using the CRTC. OTOH the MST mode indicated by
the given connector's intel_dp::is_mst flag can change independently of
the above output type, based on what sink is at any moment plugged to
the connector.
Fix the state computation accordingly.
Cc: Jani Nikula <jani.nikula@intel.com> Fixes: f6971d7427c2 ("drm/i915/mst: adapt intel_dp_mtp_tu_compute_config() for 128b/132b SST") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4607 Reviewed-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://lore.kernel.org/r/20250507151953.251846-1-imre.deak@intel.com
(cherry picked from commit 0f45696ddb2b901fbf15cb8d2e89767be481d59f) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Tom Lendacky [Thu, 8 May 2025 17:24:10 +0000 (12:24 -0500)]
memblock: Accept allocated memory before use in memblock_double_array()
When increasing the array size in memblock_double_array() and the slab
is not yet available, a call to memblock_find_in_range() is used to
reserve/allocate memory. However, the range returned may not have been
accepted, which can result in a crash when booting an SNP guest:
Mitigate this by calling accept_memory() on the memory range returned
before the slab is available.
Prior to v6.12, the accept_memory() interface used a 'start' and 'end'
parameter instead of 'start' and 'size', therefore the accept_memory()
call must be adjusted to specify 'start + size' for 'end' when applying
to kernels prior to v6.12.
Linus Torvalds [Thu, 8 May 2025 21:28:49 +0000 (14:28 -0700)]
Merge tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- Some fixes to help with filesystem analysis: ensure superblock
error count gets written if we go ERO, don't discard the journal
aggressively (so it's available for list_journal -a)
- Fix lost wakeup on arm causing us to get stuck when reading btree
nodes
- Fix fsck failing to exit on ctrl-c
- An additional fix for filesystems with misaligned bucket sizes: we
now ensure that allocations are properly aligned
- Setting background target but not promote target will now leave that
data cached on the foreground target, as it used to
- Revert a change to when we allocate the VFS superblock, this was done
for implementing blk_holder_ops but ended up not being needed, and
allocating a superblock and not setting SB_BORN while we do recovery
caused sync() calls and other things to hang
- Assorted fixes for harmless error messages that caused concern to
users
* tag 'bcachefs-2025-05-08' of git://evilpiepirate.org/bcachefs:
bcachefs: Don't aggressively discard the journal
bcachefs: Ensure superblock gets written when we go ERO
bcachefs: Filter out harmless EROFS error messages
bcachefs: journal_shutdown is EROFS, not EIO
bcachefs: Call bch2_fs_start before getting vfs superblock
bcachefs: fix hung task timeout in journal read
bcachefs: Add missing barriers before wake_up_bit()
bcachefs: Ensure proper write alignment
bcachefs: Improve want_cached_ptr()
bcachefs: thread_with_stdio: fix spinning instead of exiting
v2 (Matt):
refine commit message to have more details
add Fixes tag
move the code to xe_svm.h which already have the config
remove a blank line per codestyle suggestion
Fixes: 63f6e480d115 ("drm/xe: Add SVM garbage collector") Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250502170052.1787973-1-shuicheng.lin@intel.com
(cherry picked from commit 9d80698bcd97a5ad1088bcbb055e73fd068895e2) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>