git.ipfire.org Git - thirdparty/linux.git/log
6 days ago Revert "arm64: mm: Unmap kernel data/bss entirely from the linear map"
Will Deacon [Wed, 10 Jun 2026 10:40:23 +0000 (11:40 +0100)] 
Revert "arm64: mm: Unmap kernel data/bss entirely from the linear map"

This reverts commit 63e0b6a5b6934d6a919d1c65ea185303200a1874.

Unmapping the kernel '.bss' appears to break KVM initialisation on some
devices, breaking the boot on popular platforms such as Raspberry Pi 3
and 4.

Revert this change for now so that we can revisit it in future.

Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/all/aicVyebkEMs6w6UV@sirena.co.uk
Link: https://lore.kernel.org/r/a1b27e97-182c-485d-a448-56c19c5de2c2@samsung.com
Signed-off-by: Will Deacon <will@kernel.org>
6 days ago Revert "arm64: mm: Defer remap of linear alias of data/bss"
Will Deacon [Wed, 10 Jun 2026 10:34:39 +0000 (11:34 +0100)] 
Revert "arm64: mm: Defer remap of linear alias of data/bss"

This reverts commit 53205d56212cbff880a77497e25a0e44036d490a.

Unmapping the kernel '.bss' appears to break KVM initialisation on some
devices, breaking the boot on popular platforms such as Raspberry Pi 3
and 4.

Revert this change for now so that we can revisit it in future.

Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/all/aicVyebkEMs6w6UV@sirena.co.uk
Link: https://lore.kernel.org/r/a1b27e97-182c-485d-a448-56c19c5de2c2@samsung.com
Signed-off-by: Will Deacon <will@kernel.org>
6 days ago Merge tag 'usb-serial-7.1-rc8' of ssh://gitolite.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Wed, 10 Jun 2026 10:25:33 +0000 (12:25 +0200)] 
Merge tag 'usb-serial-7.1-rc8' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus

Johan writes:

USB serial fixes for 7.1-rc8

Here is one more buffer overflow fix.

This one has been in linux-next overnight with no reported issues.

* tag 'usb-serial-7.1-rc8' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
  USB: serial: kl5kusb105: fix bulk-out buffer overflow

6 days ago xfs: cleanup xfs_growfs_compute_deltas
Christoph Hellwig [Tue, 9 Jun 2026 07:52:45 +0000 (09:52 +0200)] 
xfs: cleanup xfs_growfs_compute_deltas

xfs_growfs_compute_deltas has an odd calling convention, and looks
very convoluted due to the use of do_div and strangely named and typed
variables.

Rename it, make it return the agcount and let the caller calculate the
delta.  Then internally use the better div_u64_rem helper and descriptive
variable names and types.  Also add a comment describing what the
function is used for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
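
Editorial sketch (not part of the commit): the calling convention described
above, where a helper returns the AG count and leaves the delta to the caller,
might look roughly like the following in userspace C. The minimum AG size, the
function names and the omission of the maximum-AG limit are all assumptions
made for illustration; only div_u64_rem and the idea of returning the agcount
come from the commit message.

```c
#include <stdint.h>

/* Hypothetical minimum AG size for this sketch; not the kernel's constant. */
#define SKETCH_MIN_AG_BLOCKS 64u

/*
 * Userspace stand-in for the kernel's div_u64_rem(): the quotient is
 * returned and the remainder is stored through the pointer.
 */
uint64_t sketch_div_u64_rem(uint64_t dividend, uint32_t divisor,
                            uint32_t *remainder)
{
    *remainder = (uint32_t)(dividend % divisor);
    return dividend / divisor;
}

/*
 * Return the AG count for a file system of *nblocks blocks. If the
 * trailing partial AG would be smaller than the minimum, drop it and
 * trim *nblocks so the caller sees the updated size.
 */
uint64_t sketch_compute_agcount(uint64_t *nblocks, uint32_t agblocks)
{
    uint32_t rem;
    uint64_t agcount = sketch_div_u64_rem(*nblocks, agblocks, &rem);

    if (rem >= SKETCH_MIN_AG_BLOCKS)
        agcount++;                      /* keep the partial last AG */
    else if (rem)
        *nblocks = agcount * agblocks;  /* too small: round down */
    return agcount;
}
```

With this shape the caller derives the delta itself, for example as the
returned count minus the old AG count, and also sees any trimmed block count.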
6 days ago xfs: pass back updated nb from xfs_growfs_compute_deltas
Christoph Hellwig [Tue, 9 Jun 2026 07:52:44 +0000 (09:52 +0200)] 
xfs: pass back updated nb from xfs_growfs_compute_deltas

xfs_growfs_compute_deltas can update nb for corner cases like a number
of blocks that would create a less than minimal sized AG, or running
past the max AG limit.  Pass back the calculated value to the caller,
as it relies on it to calculate the new number of perag structures.

Note that the grown file system size is not affected by this
miscalculation as it uses the passed back delta value.

Fixes: a49b7ff63f98 ("xfs: Refactoring the nagcount and delta calculation")
Cc: stable@vger.kernel.org # v7.0
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 days ago xfs: fix pointer arithmetic error on 32-bit systems
Darrick J. Wong [Wed, 10 Jun 2026 04:57:24 +0000 (21:57 -0700)] 
xfs: fix pointer arithmetic error on 32-bit systems

The translation of the old XFS_BMBT_KEY_ADDR macro into a static
function is not correct on 32-bit systems because the sizeof() argument
went from being a xfs_bmbt_key_t (i.e. a struct) to a (struct
xfs_bmbt_key *) (i.e. a pointer to the same struct).  On 64-bit systems
this turns out ok because they are the same size, but on 32-bit systems
this is catastrophic because they are not the same size.  So far there
have been no complaints, most likely because the xfs developers urge
against running it on 32-bit systems.  But this needs fixing asap.

Cc: stable@vger.kernel.org # v6.12
Fixes: 79124b37400635 ("xfs: replace shouty XFS_BM{BT,DR} macros")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
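
Editorial sketch (not part of the commit): the class of bug described above,
sizeof applied to a pointer type instead of the struct it points to, can be
reproduced in a few lines. The struct and function names here are hypothetical
stand-ins, not the XFS definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte key record, standing in for the on-disk key. */
struct sketch_key {
    uint64_t startoff;
};

/* Correct: the stride between records is sizeof(struct sketch_key). */
size_t key_offset_correct(size_t hdr_len, unsigned int index)
{
    return hdr_len + (index - 1) * sizeof(struct sketch_key);
}

/* Buggy: sizeof of a pointer type, so the stride is the pointer size. */
size_t key_offset_buggy(size_t hdr_len, unsigned int index)
{
    return hdr_len + (index - 1) * sizeof(struct sketch_key *);
}
```

Where pointers are 8 bytes the two functions agree by coincidence; where
pointers are 4 bytes the buggy version computes offsets that land in the
middle of records.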
6 days ago xfs: initialize iomap->flags earlier in xfs_bmbt_to_iomap
Christoph Hellwig [Tue, 9 Jun 2026 07:53:44 +0000 (09:53 +0200)] 
xfs: initialize iomap->flags earlier in xfs_bmbt_to_iomap

Otherwise we lose the IOMAP_IOEND_BOUNDARY assignment for writes to the
first block in a realtime group, and could cause incorrect merges for
such writes.

Fixes: b91afef72471 ("xfs: don't merge ioends across RTGs")
Cc: <stable@vger.kernel.org> # v6.13
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 days ago xfs: only log freed extents for the current RTG in zoned growfs
Christoph Hellwig [Wed, 10 Jun 2026 05:07:21 +0000 (07:07 +0200)] 
xfs: only log freed extents for the current RTG in zoned growfs

Otherwise a power fail or crash during growfs could lead to an
elevated sb_rblocks counter.

Note that the step function is much simpler compared to the classic RT
allocator as zoned RT sections must be aligned to real time group
boundaries.

Fixes: 01b71e64bb87 ("xfs: support growfs on zoned file systems")
Cc: <stable@vger.kernel.org> # v6.15
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 days ago drm/amd/display: use plane color_mgmt_changed to track colorop changes
Melissa Wen [Tue, 9 Jun 2026 10:20:21 +0000 (12:20 +0200)] 
drm/amd/display: use plane color_mgmt_changed to track colorop changes

Ensure the driver tracks changes in any colorop property of a plane
color pipeline by using the same mechanism as CRTC color management, and
update plane color blocks when any colorop property changes. This fixes
an issue observed with gamescope's night mode setting, which is applied
via shaper/3D-LUT updates.

Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block")
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Melissa Wen <melissa.srw@gmail.com>
Link: https://patch.msgid.link/20260609110420.1298352-5-mwen@igalia.com
6 days ago drm/atomic: track individual colorop updates
Melissa Wen [Tue, 9 Jun 2026 10:20:20 +0000 (12:20 +0200)] 
drm/atomic: track individual colorop updates

As we do for CRTC color mgmt properties, use color_mgmt_changed flag to
track any value changes in the color pipeline of a given plane, so that
drivers can update color blocks as soon as plane color pipeline or
individual colorop values change. Since we're here, only announce and
track changes to plane COLOR_PIPELINE prop if its value is actually
changing.

Fixes: 8c5ea1745f4c ("drm/colorop: Add BYPASS property")
Fixes: 7fa3ee8c0a79 ("drm/colorop: Define LUT_1D interpolation")
Fixes: 41651f9d42eb ("drm/colorop: Add 1D Curve subtype")
Fixes: 3410108037d5 ("drm/colorop: Add multiplier type")
Fixes: db971856bbe0 ("drm/colorop: Add 3D LUT support to color pipeline")
Fixes: e5719e7f1900 ("drm/colorop: Add 3x4 CTM type")
Fixes: 99a4e4f08abe ("drm/colorop: Add 1D Curve Custom LUT type")
Fixes: 2afc3184f3b3 ("drm/plane: Add COLOR PIPELINE property")
Reviewed-by: Harry Wentland <harry.wentland@amd.com> #v1
Reviewed-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block")
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Melissa Wen <melissa.srw@gmail.com>
Link: https://patch.msgid.link/20260609110420.1298352-4-mwen@igalia.com
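
Editorial sketch (not part of the commit): the "only announce a change when
the value actually changes" idea can be illustrated with a small setter. The
struct, its fields and the function name are hypothetical; the real DRM state
structures are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified plane state for illustration only. */
struct sketch_plane_state {
    uint64_t color_pipeline;    /* current property value */
    bool color_mgmt_changed;    /* set when a color property changed */
};

/* Update the property and flag a change only if the value differs. */
void sketch_set_color_pipeline(struct sketch_plane_state *state,
                               uint64_t val)
{
    if (state->color_pipeline == val)
        return;                 /* same value: nothing to announce */
    state->color_pipeline = val;
    state->color_mgmt_changed = true;
}
```

A driver that checks the flag would then reprogram its color blocks only for
commits that really alter the pipeline.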
6 days ago drm/colorop: make lut(1/3)d_interpolation props correctly behave as mutable
Melissa Wen [Tue, 9 Jun 2026 10:20:19 +0000 (12:20 +0200)] 
drm/colorop: make lut(1/3)d_interpolation props correctly behave as mutable

As interpolation props are actually mutable props, any changes should be
handled by drm_colorop_state. Move their enums and make them correctly
behave as mutable.

Fixes: 7fa3ee8c0a79 ("drm/colorop: Define LUT_1D interpolation")
Fixes: db971856bbe0 ("drm/colorop: Add 3D LUT support to color pipeline")
Reviewed-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block")
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Melissa Wen <melissa.srw@gmail.com>
Link: https://patch.msgid.link/20260609110420.1298352-3-mwen@igalia.com
6 days ago drm/colorop: Remove read-only comments from interpolation fields
Alex Hung [Tue, 9 Jun 2026 10:20:18 +0000 (12:20 +0200)] 
drm/colorop: Remove read-only comments from interpolation fields

The lut1d_interpolation and lut3d_interpolation fields and their
associated properties were marked as read-only, but userspace
can set them via drm_atomic_colorop_set_property().

Fixes: 7fa3ee8c0a79 ("drm/colorop: Define LUT_1D interpolation")
Fixes: db971856bbe0 ("drm/colorop: Add 3D LUT support to color pipeline")
Reviewed-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block")
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Melissa Wen <melissa.srw@gmail.com>
Link: https://patch.msgid.link/20260609110420.1298352-2-mwen@igalia.com
6 days ago fanotify: allow reporting pidfds for reaped tasks
AnonymeMeow [Sun, 7 Jun 2026 00:33:43 +0000 (08:33 +0800)] 
fanotify: allow reporting pidfds for reaped tasks

Fanotify used to refuse to report pidfds for reaped tasks by applying a
pid_has_task() check before calling pidfd_prepare(). This prevented
userspace from obtaining information about the task.

Register the event pid with pidfs when creating the fanotify event if
pidfd reporting was requested, so pidfd_prepare() can later create a
pidfd for the reaped task.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/linux-fsdevel/20260528-schmuckvoll-heilen-garen-be77b4208671@brauner/
Signed-off-by: AnonymeMeow <anonymemeow@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Link: https://patch.msgid.link/20260607003343.425939-3-anonymemeow@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
6 days ago fanotify: report thread pidfds for FAN_REPORT_TID
AnonymeMeow [Sun, 7 Jun 2026 00:33:42 +0000 (08:33 +0800)] 
fanotify: report thread pidfds for FAN_REPORT_TID

The FAN_REPORT_PIDFD and FAN_REPORT_TID flags used to be mutually
exclusive because, at the time pidfd support was introduced to
fanotify, pidfds could only be created for thread group leaders. Now
that the pidfd API supports thread-specific pidfds via PIDFD_THREAD,
this restriction can be lifted.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: AnonymeMeow <anonymemeow@gmail.com>
Link: https://patch.msgid.link/20260607003343.425939-2-anonymemeow@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
7 days ago drm/i915/gem: Fix phys BO pread/pwrite with offset
Joonas Lahtinen [Wed, 10 Jun 2026 06:03:14 +0000 (09:03 +0300)] 
drm/i915/gem: Fix phys BO pread/pwrite with offset

sg_page() returns a struct page pointer, not a (void *), so the scaling
of pread/pwrite is wrong for phys BOs, and the wrong parts of the BO
would be accessed if a non-zero offset is used.

Last impacted platform with overlay or cursor planes using phys
mapping was Gen3/945G/Lakeport.

Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Fixes: c6790dc22312 ("drm/i915: Wean off drm_pci_alloc/drm_pci_free")
Cc: <stable@vger.kernel.org> # v4.5+
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://patch.msgid.link/20260610060314.26111-1-joonas.lahtinen@linux.intel.com
(cherry picked from commit 3e49a2f85070b2fb672c1e0fdba281a4ea3aebe6)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
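
Editorial sketch (not part of the commit): the scaling problem described
above follows from ordinary C pointer arithmetic, where adding an offset to a
typed pointer advances by offset times the size of the pointed-to type. The
64-byte struct and the function names are hypothetical stand-ins, not the
i915 code.

```c
#include <stddef.h>

/* Hypothetical 64-byte descriptor, standing in for a typed struct. */
struct sketch_desc {
    unsigned char pad[64];
};

/* Bytes covered by 'base + offset' when base is a typed struct pointer. */
size_t typed_advance(struct sketch_desc *base, size_t offset)
{
    return (size_t)((unsigned char *)(base + offset) -
                    (unsigned char *)base);
}

/* Bytes covered when the same offset is applied to a byte pointer. */
size_t byte_advance(void *base, size_t offset)
{
    unsigned char *p = base;

    return (size_t)((p + offset) - p);
}
```

An offset meant as a byte count therefore has to be applied to a byte-sized
pointer type; applying it to the struct pointer multiplies it by the struct
size.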
7 days ago gpiolib: handle gpio-hogs only once
Daniel Drake [Mon, 8 Jun 2026 21:01:08 +0000 (22:01 +0100)] 
gpiolib: handle gpio-hogs only once

Commit d1d564ec49929 ("gpio: move hogs into GPIO core") introduced a
behaviour change that breaks boot on Raspberry Pi 5 when using the
firmware-supplied device tree:

  gpiochip_add_data_with_key: GPIOs 544..575
    (/soc@107c000000/gpio@7d517c00) failed to register, -22
  brcmstb-gpio 107d517c00.gpio: Could not add gpiochip for bank 1
  brcmstb-gpio 107d517c00.gpio: probe with driver brcmstb-gpio failed
    with error -22

gpio-brcmstb registers two gpio_chips against the device tree
node gpio@7d517c00, one for each bank. The firmware-supplied DT includes
a gpio-hog on RP1 RUN, and this gpio-hog is attempted to be applied to
*both* gpio_chips. This succeeds against bank 0 (which hosts the GPIO)
and fails for bank 1 (which does not).

In the previous implementation, failures to apply gpio-hogs were
quietly ignored. In the new code, the error code propagates and causes
probe to fail.

Closely approximate the previous behaviour by using the OF_POPULATED flag
to ensure that each gpio-hog is processed only once. The flag was
previously being set before the gpio-hogs were processed, so as part
of this change, the flag now gets set only after the gpio-hog is actioned.
The handling of gpio-hogs on a DT node with multiple gpio_chips remains a
bit incomplete/unclear, but this at least retains the ability to apply
hogs to the first gpio_chip per node.

Fixes: d1d564ec49929 ("gpio: move hogs into GPIO core")
Signed-off-by: Daniel Drake <dan@reactivated.net>
Link: https://patch.msgid.link/20260608210108.36248-1-dan@reactivated.net
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
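
Editorial sketch (not part of the commit): the "process each hog once, and
set the flag only after it was actioned" behaviour can be modelled with a
boolean standing in for OF_POPULATED. All types, names and the line numbers
used below are hypothetical; the real gpiolib code is not reproduced.

```c
#include <stdbool.h>

/* Hypothetical hog node: one target line plus a "processed" marker. */
struct sketch_hog_node {
    unsigned int line;      /* global line number the hog targets */
    bool populated;         /* stand-in for the OF_POPULATED flag */
};

/* Hypothetical chip covering lines [base, base + ngpio). */
struct sketch_chip {
    unsigned int base;
    unsigned int ngpio;
};

/*
 * Apply a hog to a chip. Returns 0 on success or when the hog was
 * already handled, and -22 (EINVAL) when the line is out of range.
 */
int sketch_apply_hog(const struct sketch_chip *chip,
                     struct sketch_hog_node *hog)
{
    if (hog->populated)
        return 0;           /* already handled by an earlier chip */
    if (hog->line < chip->base || hog->line >= chip->base + chip->ngpio)
        return -22;
    hog->populated = true;  /* mark only after the hog was actioned */
    return 0;
}
```

With two chips registered against the same node, the second chip now skips a
hog that the first chip already claimed instead of failing its probe.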
7 days ago backing-file: fix backing_file_open() kerneldoc parameter
Li Wang [Thu, 28 May 2026 10:42:08 +0000 (18:42 +0800)] 
backing-file: fix backing_file_open() kerneldoc parameter

The kerneldoc for backing_file_open() documented a @user_path argument,
but the function takes const struct file *user_file. The user
path is derived as &user_file->f_path.

Update the @-tag to @user_file and adjust the description accordingly.
Also fix the "reuqested" typo to 'requested' in the old comment.

Signed-off-by: Li Wang <liwang@kylinos.cn>
Link: https://patch.msgid.link/20260528104208.395757-1-liwang@kylinos.cn
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
7 days ago gpio: fix cleanup path on hog failure
Bartosz Golaszewski [Tue, 9 Jun 2026 12:17:50 +0000 (14:17 +0200)] 
gpio: fix cleanup path on hog failure

If gpiochip_hog_lines() successfully processes some hogs but fails on
a later one, the error handling path in gpiochip_add_data_with_key()
jumps directly to err_remove_of_chip. This leaks resources allocated
earlier for ACPI, interrupts and hogs that were successfully processed.
Use the right label in error path.

Closes: https://sashiko.dev/#/patchset/20260608210108.36248-1-dan%40reactivated.net
Fixes: d1d564ec4992 ("gpio: move hogs into GPIO core")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://patch.msgid.link/20260609-gpio-hogs-fixes-v1-2-b4064f8070e7@oss.qualcomm.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
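
Editorial sketch (not part of the commit): in goto-based unwinding, a failure
has to jump to the label that releases everything acquired so far. The
resource flags, the step enum and the err_remove_irq label are hypothetical;
only the err_remove_of_chip label name is taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical resources acquired while adding a chip. */
struct sketch_res {
    bool of_chip;
    bool acpi;
    bool irq;
};

enum sketch_step { STEP_NONE, STEP_ACPI, STEP_HOGS };

/* Acquire resources in order; unwind in reverse order on failure. */
int sketch_add_chip(struct sketch_res *r, enum sketch_step fail_at)
{
    r->of_chip = true;
    if (fail_at == STEP_ACPI)
        goto err_remove_of_chip;    /* only of_chip is held here */
    r->acpi = true;
    r->irq = true;
    if (fail_at == STEP_HOGS)
        goto err_remove_irq;        /* everything is held: undo it all */
    return 0;

err_remove_irq:
    r->irq = false;
    r->acpi = false;
err_remove_of_chip:
    r->of_chip = false;
    return -22;                     /* EINVAL */
}
```

Jumping to the lower label from the hog step would return with the irq and
acpi flags still set, which is the kind of leak the commit describes.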
7 days ago iomap: pass the correct len to fserror_report_io in __iomap_write_begin
Christoph Hellwig [Wed, 10 Jun 2026 05:06:42 +0000 (07:06 +0200)] 
iomap: pass the correct len to fserror_report_io in __iomap_write_begin

len is the size of the (larger) write request; plen is the range for
which the read failed here.

Fixes: a9d573ee88af ("iomap: report file I/O errors to the VFS")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260610050642.1906695-1-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
7 days ago m68k: Correct CONFIG_MVME16x macro name in #endif comment
Ethan Nelson-Moore [Tue, 9 Jun 2026 20:12:08 +0000 (13:12 -0700)] 
m68k: Correct CONFIG_MVME16x macro name in #endif comment

A comment in arch/m68k/kernel/head.S incorrectly refers to
CONFIG_MVME162 and CONFIG_MVME167 instead of CONFIG_MVME16x. Correct it.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://patch.msgid.link/20260609201211.173438-1-enelsonmoore@gmail.com
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
7 days ago rust: make `build_assert` module the home of related macros
Gary Guo [Tue, 9 Jun 2026 14:26:33 +0000 (15:26 +0100)] 
rust: make `build_assert` module the home of related macros

Given the macro scoping rules, all macros are rendered twice: in the
module and at the top level of the kernel crate.

Add `#[doc(hidden)]` to the macro definition and `#[doc(inline)]` to the
re-export inside `build_assert` module so the top-level items are hidden.

[ Sadly, because the definition is hidden, `rustdoc` decides to not list
  them as re-exports in the `prelude` page anymore, even if we refer to
  the not-actually-hidden item.

    - Miguel ]

Acked-by: Danilo Krummrich <dakr@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Alexandre Courbot <acourbot@nvidia.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Acked-by: Boqun Feng <boqun@kernel.org>
Signed-off-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260609142637.373347-1-gary@kernel.org
[ Kept a single declaration in the prelude, and reworded since they
  already had `no_inline`. Removed other imports from `predefine` since
  we now use the prelude. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
7 days ago rust: str: clean unused import for Rust >= 1.98
Miguel Ojeda [Tue, 9 Jun 2026 10:41:52 +0000 (12:41 +0200)] 
rust: str: clean unused import for Rust >= 1.98

Starting with Rust 1.98.0 (expected 2026-08-20), the compiler has changed
how the resolution algorithm works [1] in upstream commit c4d84db5f184
("Resolver: Batched import resolution."), and it now spots:

    error: unused import: `flags::*`
     --> rust/kernel/str.rs:7:9
      |
    7 |         flags::*,
      |         ^^^^^^^^
      |
      = note: `-D unused-imports` implied by `-D warnings`
      = help: to override `-D warnings` add `#[allow(unused_imports)]`

It happens to not be needed because the `prelude::*` already provides
the flags.

Thus clean it up.

Cc: stable@vger.kernel.org # Needed in 6.18.y and later (prelude added to `str`).
Link: https://github.com/rust-lang/rust/pull/145108 [1]
Reviewed-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20260609104152.261145-2-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
7 days ago rust: str: use the "kernel vertical" imports style
Miguel Ojeda [Tue, 9 Jun 2026 10:41:51 +0000 (12:41 +0200)] 
rust: str: use the "kernel vertical" imports style

Convert the imports to use the "kernel vertical" imports style [1].

No functional changes intended.

Link: https://docs.kernel.org/rust/coding-guidelines.html#imports [1]
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260609104152.261145-1-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
7 days ago rust: aref: use the "kernel vertical" imports style
Andreas Hindborg [Thu, 4 Jun 2026 20:11:20 +0000 (22:11 +0200)] 
rust: aref: use the "kernel vertical" imports style

Convert the imports to use the "kernel vertical" imports style [1].

No functional changes intended.

Link: https://docs.kernel.org/rust/coding-guidelines.html#imports [1]
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://patch.msgid.link/20260604-unique-ref-v17-8-7b4c3d2930b9@kernel.org
[ Picked from larger series and reworded. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
7 days ago rust: page: use the "kernel vertical" imports style
Andreas Hindborg [Thu, 4 Jun 2026 20:11:16 +0000 (22:11 +0200)] 
rust: page: use the "kernel vertical" imports style

Convert the imports to use the "kernel vertical" imports style [1].

No functional changes intended.

Link: https://docs.kernel.org/rust/coding-guidelines.html#imports [1]
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://patch.msgid.link/20260604-unique-ref-v17-4-7b4c3d2930b9@kernel.org
[ Picked from larger series and reworded. Adjusted the `error::`
  block too. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
7 days ago xfs: add newly added RTGs to the free pool in growfs
Christoph Hellwig [Wed, 10 Jun 2026 05:07:20 +0000 (07:07 +0200)] 
xfs: add newly added RTGs to the free pool in growfs

When growing a zoned RT section, the newly added RTGs also need to be
tagged as free in the radix tree and added to the nr_free_zones counters.
Call xfs_add_free_zone to do that, otherwise using up the newly added
space will wait for free zones forever.

Fixes: 01b71e64bb87 ("xfs: support growfs on zoned file systems")
Cc: stable@vger.kernel.org # v6.15
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
7 days ago xfs: factor out a xfs_zone_mark_free helper
Christoph Hellwig [Wed, 10 Jun 2026 05:07:19 +0000 (07:07 +0200)] 
xfs: factor out a xfs_zone_mark_free helper

Add a helper for adding a zone to the free pool in preparation of adding
another caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
7 days ago clocksource: move NXP timer selection to drivers/clocksource
Enric Balletbo i Serra [Thu, 14 May 2026 11:14:17 +0000 (13:14 +0200)] 
clocksource: move NXP timer selection to drivers/clocksource

The Kconfig logic for selecting the scheduler clocksource on
NXP Vybrid (VF610) uses a `choice` block restricted to 32-bit ARM. This
prevents 64-bit architectures, such as the NXP S32 family, from enabling
the NXP Periodic Interrupt Timer (PIT) driver (CONFIG_NXP_PIT_TIMER).

Relocate the NXP clocksource selection from arch/arm/mach-imx/Kconfig to
drivers/clocksource/Kconfig. This allows the configuration to be shared
across different architectures.

Update the selection to include support for ARCH_S32 and add a "None"
option restricted to ARCH_S32, since Vybrid lacks the ARM Architected
Timer. The Vybrid Global Timer option is restricted to ARCH_MULTI_V7
SOC_VF610 platforms to prevent it from being visible on Cortex-M4 builds,
which lack the ARM Global Timer hardware.

Fixes: bee33f22d7c3 ("clocksource/drivers/nxp-pit: Add NXP Automotive s32g2 / s32g3 support")
Signed-off-by: Enric Balletbo i Serra <eballetb@redhat.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260514-fix-nxp-timer-v3-1-a3e68fdb505e@redhat.com
7 days ago clocksource/drivers/timer-tegra186: Reserve and service a kernel watchdog
Kartik Rajput [Thu, 7 May 2026 15:45:57 +0000 (21:15 +0530)] 
clocksource/drivers/timer-tegra186: Reserve and service a kernel watchdog

Tegra SoCs support multiple watchdog timers. If the kernel crashes or
hangs before userspace enables a watchdog, the system cannot recover and
may remain bricked, e.g. after a failed OTA update. The driver currently
leaves all watchdogs disabled until userspace configures them.

Reserve the first available watchdog as a kernel-only watchdog for Tegra186
and Tegra234. Arm it during probe (120s timeout) and keep it alive in
the driver IRQ handler. Do not register it to userspace. Other available
watchdogs remain exposed to userspace. This guarantees the system can
reset itself in case of a hang or crash even when userspace never starts.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260507154557.2082697-5-kkartik@nvidia.com
7 days ago clocksource/drivers/timer-tegra186: Register all accessible watchdog timers
Kartik Rajput [Thu, 7 May 2026 15:45:56 +0000 (21:15 +0530)] 
clocksource/drivers/timer-tegra186: Register all accessible watchdog timers

Tegra186+ SoCs expose multiple watchdog timers, but the driver only
registers WDT(0).

Iterate over num_wdts and, for each WDT, check the SCR (firewall) registers
in the TKE block to determine whether Linux has read and write access.
Register the watchdogs that are accessible.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260507154557.2082697-4-kkartik@nvidia.com
7 days ago clocksource/drivers/timer-tegra186: Correct num_wdts for Tegra186 and Tegra234
Kartik Rajput [Thu, 7 May 2026 15:45:55 +0000 (21:15 +0530)] 
clocksource/drivers/timer-tegra186: Correct num_wdts for Tegra186 and Tegra234

On Tegra186 and Tegra234, WDT2 is connected to the Audio Processing
Engine (APE) and cannot be accessed from Linux. Only WDT0 and WDT1
are accessible to Linux.

Update num_wdts from 3 to 2 for both Tegra186 and Tegra234 to reflect
the actual number of watchdogs available to Linux.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260507154557.2082697-3-kkartik@nvidia.com
7 days ago clocksource/drivers/timer-tegra186: Fix support for multiple watchdog instances
Kartik Rajput [Thu, 7 May 2026 15:45:54 +0000 (21:15 +0530)] 
clocksource/drivers/timer-tegra186: Fix support for multiple watchdog instances

Tegra SoCs support multiple watchdogs; currently only one (WDT0) is
used. When multiple watchdogs are registered, tegra186_wdt_enable()
overwrites the TKEIE(x) register, discarding any existing watchdog
interrupt enable bits. As a result, enabling one watchdog inadvertently
disables interrupts for the others.

Fix this by preserving the existing TKEIE(x) value and updating it
using a read-modify-write sequence.

Fixes: 42cee19a9f83 ("clocksource: Add Tegra186 timers support")
Cc: stable@vger.kernel.org
Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260507154557.2082697-2-kkartik@nvidia.com
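
Editorial sketch (not part of the commit): the difference between overwriting
an interrupt-enable register and updating it with a read-modify-write
sequence can be shown with a plain variable standing in for the register. The
bit layout, the variable and the function names are hypothetical and do not
reflect the Tegra hardware.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped TKEIE(x) register. */
uint32_t sketch_tkeie;

/* Hypothetical layout: one enable bit per watchdog, starting at bit 16. */
#define SKETCH_WDT_BIT(i) (UINT32_C(1) << (16 + (i)))

/* Pre-fix behaviour: a plain write discards bits set by other watchdogs. */
void sketch_enable_overwrite(unsigned int wdt)
{
    sketch_tkeie = SKETCH_WDT_BIT(wdt);
}

/* Fixed behaviour: read, OR in the new bit, write back. */
void sketch_enable_rmw(unsigned int wdt)
{
    uint32_t val = sketch_tkeie;

    val |= SKETCH_WDT_BIT(wdt);
    sketch_tkeie = val;
}
```

Enabling watchdog 0 and then watchdog 1 leaves only bit 17 set with the
overwrite variant, but both bits set with the read-modify-write variant.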
7 days ago ptp: ocp: fix resource freeing order
Vadim Fedorenko [Mon, 8 Jun 2026 15:59:52 +0000 (15:59 +0000)] 
ptp: ocp: fix resource freeing order

Commit a60fc3294a37 ("ptp: rework ptp_clock_unregister() to disable
events") added a call to ptp_disable_all_events() which changes the
configuration of pins if they support EXTTS events. In ptp_ocp_detach()
pin resources are freed before ptp_clock_unregister(), which leads to a
use-after-free during driver removal. Fix it by changing the order of
free/unregister calls. To avoid the irq handler running on another core
while the ptp device is unregistering, call synchronize_irq() after HW
is configured to stop producing irqs and no irqs are in-flight.

Fixes: a60fc3294a37 ("ptp: rework ptp_clock_unregister() to disable events")
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260608155952.240304-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days ago tun: zero the whole vnet header in tun_put_user()
Xiang Mei [Sun, 7 Jun 2026 05:44:28 +0000 (22:44 -0700)] 
tun: zero the whole vnet header in tun_put_user()

tun_put_user() declares an on-stack struct virtio_net_hdr_v1_hash_tunnel
without zeroing it. For a non-tunnel skb, virtio_net_hdr_tnl_from_skb()
only initializes the first 10 bytes (sizeof(struct virtio_net_hdr)),
leaving bytes 10..23 (num_buffers and the hash/tunnel fields) as stack
garbage.

An unprivileged user can set the vnet header size to 24 with
TUNSETVNETHDRSZ, so __tun_vnet_hdr_put() copies all 24 bytes of the
partially-initialized struct to userspace, leaking 14 bytes of kernel
stack on every read of a non-tunnel packet.

Fix it the same way tun_get_user() already does by zeroing the whole
header right after declaration.

Fixes: 288f30435132 ("tun: enable gso over UDP tunnel support.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260607054428.3050243-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
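
Editorial sketch (not part of the commit): the leak described above comes from
copying out a 24-byte buffer of which only the first 10 bytes were written.
The sketch below models that with a caller-supplied buffer that is pre-filled
with a marker byte to simulate stale stack contents, since reading truly
uninitialized memory would be undefined behaviour. All names are hypothetical;
the sizes 24 and 10 are the ones given in the commit message.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_HDR_LEN  24  /* full header size named in the commit */
#define SKETCH_BASE_LEN 10  /* bytes the non-tunnel path initializes */

/* Fill only the base header, as the non-tunnel helper would. */
void sketch_fill_base(unsigned char *hdr)
{
    memset(hdr, 0x01, SKETCH_BASE_LEN);
}

/*
 * Build the header in 'hdr' and return how many bytes past the base
 * header still hold stale (non-zero) data that would reach userspace.
 */
size_t sketch_build_hdr(unsigned char *hdr, int zero_first)
{
    size_t i, stale = 0;

    if (zero_first)
        memset(hdr, 0, SKETCH_HDR_LEN);     /* the fix: zero it all */
    sketch_fill_base(hdr);
    for (i = SKETCH_BASE_LEN; i < SKETCH_HDR_LEN; i++)
        if (hdr[i] != 0)
            stale++;
    return stale;
}
```

Without the up-front memset, bytes 10..23 keep whatever the buffer held
before, which matches the 14 leaked bytes the commit message counts.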
7 days ago net/rds: fix NULL deref in rds_ib_send_cqe_handler() on masked atomic completion
Weiming Shi [Sat, 6 Jun 2026 19:24:48 +0000 (12:24 -0700)] 
net/rds: fix NULL deref in rds_ib_send_cqe_handler() on masked atomic completion

rds_ib_xmit_atomic() always programs a masked atomic opcode
(IB_WR_MASKED_ATOMIC_CMP_AND_SWP or IB_WR_MASKED_ATOMIC_FETCH_AND_ADD)
for every RDS atomic cmsg.  But the completion-side switch in
rds_ib_send_unmap_op() only handles the non-masked opcodes, so a masked
atomic completion falls through to default and returns rm == NULL while
send->s_op is left set.  rds_ib_send_cqe_handler() then dereferences the
NULL rm via rm->m_final_op, oopsing in softirq context.  An unprivileged
AF_RDS sendmsg() of an atomic cmsg over an active RDS/IB connection
triggers it; on hardware that natively accepts masked atomics (mlx4,
mlx5) no extra setup is needed.

  RDS/IB: rds_ib_send_unmap_op: unexpected opcode 0xd in WR!
  Oops: general protection fault [#1] SMP KASAN
  KASAN: null-ptr-deref in range [0x0000000000000190-0x0000000000000197]
  RIP: rds_ib_send_cqe_handler+0x25c/0xb10 (net/rds/ib_send.c:282)
  Call Trace:
   <IRQ>
   rds_ib_send_cqe_handler (net/rds/ib_send.c:282)
   poll_scq (net/rds/ib_cm.c:274)
   rds_ib_tasklet_fn_send (net/rds/ib_cm.c:294)
   tasklet_action_common (kernel/softirq.c:943)
   handle_softirqs (kernel/softirq.c:573)
   run_ksoftirqd (kernel/softirq.c:479)
   </IRQ>
  Kernel panic - not syncing: Fatal exception in interrupt

Handle the masked atomic opcodes in the same case as the non-masked
ones: they map to the same struct rds_message.atomic union member, so
the existing container_of()/rds_ib_send_unmap_atomic() body is correct
for them.

Fixes: 20c72bd5f5f9 ("RDS: Implement masked atomic operations")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260606192447.1179255-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
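
Editorial sketch (not part of the commit): the failure mode described above,
a completion-side switch that lacks cases for opcodes the send side can
program, can be modelled as follows. The enum values, the struct and the
handle_masked toggle are hypothetical; the toggle exists only so that both
the pre-fix and post-fix behaviour can be exercised from one function.

```c
#include <stddef.h>

/* Hypothetical opcodes standing in for the IB work request opcodes. */
enum sketch_opcode {
    SK_SEND,
    SK_ATOMIC_CMP_AND_SWP,
    SK_ATOMIC_FETCH_AND_ADD,
    SK_MASKED_ATOMIC_CMP_AND_SWP,
    SK_MASKED_ATOMIC_FETCH_AND_ADD
};

/* Hypothetical message that owns the atomic operation. */
struct sketch_msg {
    int atomic_done;
};

/*
 * Map a completed opcode back to its owning message. Returns NULL for
 * opcodes the switch does not recognise.
 */
struct sketch_msg *sketch_unmap_op(struct sketch_msg *owner,
                                   enum sketch_opcode op,
                                   int handle_masked)
{
    switch (op) {
    case SK_MASKED_ATOMIC_CMP_AND_SWP:
    case SK_MASKED_ATOMIC_FETCH_AND_ADD:
        if (!handle_masked)
            return NULL;    /* pre-fix: treated as an unknown opcode */
        /* fall through */
    case SK_ATOMIC_CMP_AND_SWP:
    case SK_ATOMIC_FETCH_AND_ADD:
        owner->atomic_done = 1;
        return owner;
    default:
        return NULL;
    }
}
```

A caller that dereferences the returned pointer without a NULL check crashes
in the pre-fix case, which is the oops shown in the trace above.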
7 days ago net: guard timestamp cmsgs to real error queue skbs
Kyle Zeng [Sun, 7 Jun 2026 02:18:19 +0000 (19:18 -0700)] 
net: guard timestamp cmsgs to real error queue skbs

skb_is_err_queue() treats PACKET_OUTGOING as the sole marker for an skb
from sk_error_queue. That assumption is not true for AF_PACKET sockets:
outgoing packet taps are also delivered to packet sockets with
skb->pkt_type == PACKET_OUTGOING, but their skb->cb is owned by AF_PACKET
instead of struct sock_exterr_skb.

If such an skb is received with timestamping enabled, the generic
timestamp cmsg path can read AF_PACKET control-buffer state as
sock_exterr_skb::opt_stats. With SO_RXQ_OVFL enabled, the packet drop
counter overlaps opt_stats. An odd drop count makes the path emit
SCM_TIMESTAMPING_OPT_STATS with skb->len and skb->data. For non-linear
skbs this copies past the linear head and can trigger hardened usercopy or
disclose adjacent heap contents.

Keep skb_is_err_queue() local to net/socket.c, but make it verify that
the PACKET_OUTGOING marker is paired with the sock_rmem_free destructor
installed by sock_queue_err_skb(). AF_PACKET receive skbs use normal
receive ownership and no longer pass as error-queue skbs, while legitimate
sk_error_queue entries keep the PACKET_OUTGOING marker and sock_rmem_free
ownership.
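
The pairing check can be pictured with a small user-space model. This is a sketch under stated assumptions, not the code in net/socket.c: struct fake_skb, err_queue_free(), rx_free() and is_err_queue() are hypothetical names, with err_queue_free() standing in for sock_rmem_free.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PACKET_OUTGOING; treated as an opaque marker here. */
#define PKT_OUTGOING 4

/* Minimal model of the two skb fields the check looks at. */
struct fake_skb {
	int pkt_type;
	void (*destructor)(struct fake_skb *skb);
};

/* Stands in for the destructor installed when an skb is queued on the
 * error queue (sock_rmem_free in the commit message). */
void err_queue_free(struct fake_skb *skb)
{
	(void)skb;
}

/* Stands in for normal receive ownership, as used by packet taps. */
void rx_free(struct fake_skb *skb)
{
	(void)skb;
}

/* An skb counts as an error-queue skb only if the marker and the
 * error-queue destructor are both present. */
bool is_err_queue(const struct fake_skb *skb)
{
	return skb->pkt_type == PKT_OUTGOING &&
	       skb->destructor == err_queue_free;
}
```

In this model an outgoing packet tap (PKT_OUTGOING with rx_free) is rejected, while a genuine error-queue entry (PKT_OUTGOING with err_queue_free) is accepted.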

Fixes: 8605330aac5a ("tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs")
Signed-off-by: Kyle Zeng <kylebot@openai.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260607021819.49698-1-kylebot@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days agosctp: validate embedded INIT chunk and address list lengths in cookie
Xin Long [Sun, 7 Jun 2026 23:03:47 +0000 (19:03 -0400)] 
sctp: validate embedded INIT chunk and address list lengths in cookie

sctp_unpack_cookie() only checked that the embedded INIT chunk length
did not exceed the remaining cookie payload, but did not ensure that the
INIT chunk is large enough to contain a complete INIT header.

A malformed COOKIE_ECHO can therefore carry a truncated INIT chunk whose
length field is smaller than sizeof(struct sctp_init_chunk).  Later,
sctp_process_init() accesses INIT parameters unconditionally, which may
lead to out-of-bounds reads.

In addition, raw_addr_list_len is not fully validated against the
remaining cookie payload. When cookie authentication is disabled, an
attacker can supply an oversized raw_addr_list_len and cause
sctp_raw_to_bind_addrs() to read beyond the end of the cookie. The
address parser also lacks sufficient bounds checks for parameter headers
and lengths, allowing malformed address parameters to trigger
out-of-bounds reads.

Fix this by:

- requiring the embedded INIT chunk length to be at least sizeof(struct
  sctp_init_chunk);
- validating that the INIT chunk and raw address list together fit
  within the cookie payload;
- verifying sufficient data exists for each address parameter header and
  payload before parsing it.
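
The three checks listed above could be sketched as follows. This is an illustrative user-space model, not the patch itself: INIT_CHUNK_MIN, cookie_lengths_ok() and addr_params_ok() are hypothetical names, INIT_CHUNK_MIN is an assumed stand-in for sizeof(struct sctp_init_chunk), and parameter padding is ignored for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for sizeof(struct sctp_init_chunk). */
#define INIT_CHUNK_MIN 20u

/* Size of a parameter header: 2-byte type plus 2-byte length. */
#define PARAM_HDR_LEN 4u

/* Check that the embedded INIT chunk is large enough and that the
 * INIT chunk plus the raw address list fit in the cookie payload. */
bool cookie_lengths_ok(size_t payload_len, size_t init_len,
		       size_t raw_addr_list_len)
{
	if (init_len < INIT_CHUNK_MIN)
		return false;
	if (init_len > payload_len)
		return false;
	/* Subtract instead of adding so the sum cannot overflow. */
	if (raw_addr_list_len > payload_len - init_len)
		return false;
	return true;
}

/* Walk type/length/value parameters and verify that each header and
 * each declared payload lies inside the buffer before moving on. */
bool addr_params_ok(const uint8_t *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		size_t plen;

		if (len - off < PARAM_HDR_LEN)
			return false;
		/* The length field is big-endian and covers the header. */
		plen = ((size_t)buf[off + 2] << 8) | buf[off + 3];
		if (plen < PARAM_HDR_LEN || plen > len - off)
			return false;
		off += plen;
	}
	return true;
}
```

A parameter that declares 20 bytes but is followed by only its 4-byte header fails the second check in the loop instead of being read past the end of the buffer.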

Note that sctp_verify_init() must be called after sctp_unpack_cookie()
and before sctp_process_init() when cookie authentication is disabled.
This will be addressed in a separate patch.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/75af23a89adf881a0895d511775e4770da367cbf.1780873427.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days agoip6_vti: set netns_immutable on the fallback device.
Eric Dumazet [Mon, 8 Jun 2026 15:59:18 +0000 (15:59 +0000)] 
ip6_vti: set netns_immutable on the fallback device.

john1988 and Noam Rathaus reported that vti6_init_net() does not set the
netns_immutable flag on the per-netns fallback tunnel device (ip6_vti0).

Other similar tunnel drivers (like ip6_tunnel, sit, ip6_gre, and ip_tunnel)
correctly set this flag during their fallback device initialization to
prevent them from being moved to another network namespace.

Fixes: 61220ab34948 ("vti6: Enable namespace changing")
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://patch.msgid.link/20260608155918.787644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days agosctp: fix uninit-value in __sctp_rcv_asconf_lookup()
Michael Bommarito [Mon, 8 Jun 2026 12:22:34 +0000 (08:22 -0400)] 
sctp: fix uninit-value in __sctp_rcv_asconf_lookup()

__sctp_rcv_asconf_lookup() in net/sctp/input.c only checks that the ASCONF
chunk can hold the ADDIP header and a parameter header, then calls
af->from_addr_param(), which reads the full address (16 bytes for IPv6)
trusting the parameter's declared length.

An unauthenticated peer can send a truncated trailing ASCONF chunk that
declares an IPv6 address parameter but stops after the 4-byte parameter
header; reached from the no-association lookup path, from_addr_param() then
reads uninitialized bytes past the parameter.

Impact: an unauthenticated SCTP peer makes the receive path read up to 16
bytes of uninitialized memory past a truncated ASCONF address parameter.

The sibling __sctp_rcv_init_lookup() bounds parameters with
sctp_walk_params(); this path open-codes the fetch and omits the bound.
Verify the whole address parameter lies within the chunk before
from_addr_param() reads it, the same class of fix as commit 51e5ad549c43
("net: sctp: fix KMSAN uninit-value in sctp_inq_pop").

Fixes: df2185771439 ("[SCTP]: Update association lookup to look at ASCONF chunks as well")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20260608122234.459098-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days agobnxt_en: Fix NULL pointer dereference
Kyle Meyer [Fri, 5 Jun 2026 22:25:24 +0000 (17:25 -0500)] 
bnxt_en: Fix NULL pointer dereference

PCIe errors detected by a Root Port or Downstream Port cause error
recovery services to run on all subordinate devices regardless of
administrative state.

The .error_detected() callback, bnxt_io_error_detected(), disables
and synchronizes IRQs via bnxt_disable_int_sync(), which calls
bnxt_cp_num_to_irq_num() to map completion rings to IRQs using
bp->bnapi.

Since bp->bnapi is allocated on NIC open and freed on NIC close, PCIe
error recovery on a closed NIC can dereference a NULL pointer.

Check if bp->bnapi is NULL before disabling and synchronizing IRQs.

Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.")
Cc: stable@vger.kernel.org
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/aiNM1CY2-StPilxW@hpe.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days agosctp: stream: fully roll back denied add-stream state
Wyatt Feng [Fri, 5 Jun 2026 05:53:42 +0000 (13:53 +0800)] 
sctp: stream: fully roll back denied add-stream state

When ADD_OUT_STREAMS is denied, SCTP only shrinks the queued chunks and
then lowers outcnt. That leaves removed stream metadata behind, so a
later re-add can reuse a stale ext and hit a null-pointer dereference in
the scheduler get path.

Fix the rollback by tearing down the removed stream state the same way
other stream resizes do. Unschedule the current scheduler state, drop
the removed stream ext state with sctp_stream_outq_migrate(), and then
reschedule the remaining streams.

This keeps scheduler-private RR/FC/PRIO lists consistent while fully
rolling back denied outgoing stream additions.

Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/d78954ecd94954653ee299400e98d74a03a6f7d3.1780603399.git.bronzed_45_vested@icloud.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days agoMerge tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 10 Jun 2026 00:20:00 +0000 (17:20 -0700)] 
Merge tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier fixes from Steven Rostedt:

 - Fix reset ordering on per-task destruction

   Reset the task before dropping the slot instead of after, which was
   causing out-of-bound memory accesses.

 - Fix HA monitor synchronization and cleanup

   Ensure synchronous cleanup for HA monitors by running timer callbacks
   in RCU read-side critical sections and using synchronize_rcu() during
   destruction.

 - Avoid armed timers after tasks exit

   Add automatic cleanup for per-task HA monitors to prevent timers from
   firing after task exit.

 - Fix memory ordering for DA/HA monitors

   Fix race conditions during monitor start by using release-acquire
   semantics for the monitoring flag.

 - Fix initialization for DA/HA monitors

   Ensure monitor initialization does not rely on potentially corrupted
   state such as the monitoring flag, which is not reset by all monitor
   types and may have an unknown state in monitors reusing the storage
   (per-task).

 - Fix memory safety in per-task and per-object monitors

   Prevent use-after-free and out-of-bounds access by synchronizing with
   in-flight tracepoint probes using tracepoint_synchronize_unregister()
   before freeing monitor storage or releasing task slots.

 - Adjust monitors for preemptible tracepoints

   Fix monitors that relied on tracepoints disabling preemption.
   Explicitly disable task migration when per-CPU monitors handle events
   to avoid accessing the wrong state and update the opid monitor logic.

 - Fix incorrect __user specifier usage

   Remove __user from a non-pointer variable in the extract_params()
   helper.

 - Fix bugs in the rv tool

   Ensure strings are NUL-terminated, fix substring matching in monitor
   searches, and improve cleanup and exit status handling.

 - Fix several bugs in rvgen

   Fix LTL literal stringification, subparsers' options handling, and
   suffix stripping in dot2k.

* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  verification/rvgen: Fix ltl2k writing True as a literal
  verification/rvgen: Fix options shared among commands
  verification/rvgen: Fix suffix strip in dot2k
  tools/rv: Fix cleanup after failed trace setup
  tools/rv: Fix substring match when listing container monitors
  tools/rv: Fix substring match bug in monitor name search
  tools/rv: Ensure monitor name and desc are NUL-terminated
  rv: Use 0 to check preemption enabled in opid
  rv: Prevent task migration while handling per-CPU events
  rv: Ensure synchronous cleanup for HA monitors
  rv: Add automatic cleanup handlers for per-task HA monitors
  rv: Do not rely on clean monitor when initialising HA
  rv: Fix monitor start ordering and memory ordering for monitoring flag
  rv: Ensure all pending probes terminate on per-obj monitor destroy
  rv: Prevent in-flight per-task handlers from using invalid slots
  rv: Reset per-task DA monitors before releasing the slot
  rv: Fix __user specifier usage in extract_params()

7 days agoMerge tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 10 Jun 2026 00:05:19 +0000 (17:05 -0700)] 
Merge tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull RTLA fix from Steven Rostedt:

 - Fix multi-character short option parsing

   Fix regression in parsing of multiple-character short options
   (eg -p100 /= -p 100/, -un /= -u -n/) caused by getopt_long()
   internal state corruption after a refactoring.

* tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rtla: Fix parsing of multi-character short options

7 days agokconfig: tests: fix typo in comment
Ethan Nelson-Moore [Tue, 9 Jun 2026 02:17:10 +0000 (19:17 -0700)] 
kconfig: tests: fix typo in comment

scripts/kconfig/tests/no_write_if_dep_unmet/__init__.py contains a typo
"COFIG_" for "CONFIG_". Fix it.

Discovered while searching for typos in CONFIG_* variable references.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260609021712.7965-1-enelsonmoore@gmail.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
7 days agospi: dw: fix race between IRQ handler and error handler on SMP
Peng Yang [Mon, 8 Jun 2026 09:58:49 +0000 (17:58 +0800)] 
spi: dw: fix race between IRQ handler and error handler on SMP

On SMP systems, dw_spi_handle_err() can be called from the SPI core
kthread while the IRQ handler is still accessing the FIFO on another
CPU. Resetting the chip via dw_spi_reset_chip() during an active FIFO
read/write causes a bus error.

Fix this by calling disable_irq() before the chip reset, which masks
the IRQ and waits for any in-flight handler to complete via
synchronize_irq(). This ensures no handler is accessing the FIFO when
the reset occurs.

Signed-off-by: Peng Yang <pyangyyd@amazon.com>
Suggested-by: Jonathan Chocron <jonnyc@amazon.com>
Link: https://patch.msgid.link/20260608095849.3446-1-pyangyyd@amazon.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agoASoC: amd: yc: Add DMI quirk for ASUS EXPERTBOOK PM1403CDA
Zhang Heng [Thu, 4 Jun 2026 12:58:15 +0000 (20:58 +0800)] 
ASoC: amd: yc: Add DMI quirk for ASUS EXPERTBOOK PM1403CDA

Add a DMI quirk for the ASUS EXPERTBOOK PM1403CDA fixing the issue
where the internal microphone was not detected.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=221608
Signed-off-by: Zhang Heng <zhangheng@kylinos.cn>
Link: https://patch.msgid.link/20260604125815.42297-1-zhangheng@kylinos.cn
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agospi: meson-spifc: fix runtime PM leak on remove
Ruoyu Wang [Tue, 9 Jun 2026 05:26:47 +0000 (13:26 +0800)] 
spi: meson-spifc: fix runtime PM leak on remove

pm_runtime_get_sync() increments the runtime PM usage counter even when it
returns an error. meson_spifc_remove() uses it to resume the controller
before disabling runtime PM, but never drops the usage counter again.

Balance the get with pm_runtime_put_noidle() after disabling runtime PM,
matching the teardown pattern used by other SPI controller drivers.

Found by static analysis. I do not have hardware to test this.

Fixes: c3e4bc5434d2 ("spi: meson: Add support for Amlogic Meson SPIFC")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
Link: https://patch.msgid.link/20260609052647.5-1-ruoyuw560@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agosoftware node: allow passing reference args to PROPERTY_ENTRY_REF()
Dmitry Torokhov [Sun, 7 Jun 2026 03:51:29 +0000 (20:51 -0700)] 
software node: allow passing reference args to PROPERTY_ENTRY_REF()

When dynamically creating software nodes and properties for subsequent
use with software_node_register(), the current implementation of
PROPERTY_ENTRY_REF() is not suitable because it creates a temporary
instance of struct software_node_ref_args on the stack, which will later
disappear, while software_node_register() only makes a shallow copy of
the properties.

Fix this by allowing the address of a reference arguments structure to
be passed directly into PROPERTY_ENTRY_REF(), so that the caller can
manage the lifetime of the object properly.

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/aiTo4dvKu8pyimHA@google.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
7 days agoregulator: dt-bindings: mt6311: Convert to DT schema
Ninad Naik [Thu, 4 Jun 2026 16:26:24 +0000 (21:56 +0530)] 
regulator: dt-bindings: mt6311: Convert to DT schema

Convert mediatek,mt6311 to DT schema.

Signed-off-by: Ninad Naik <ninadnaik07@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260604162624.644241-1-ninadnaik07@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agoregulator: qcom_smd-regulator: Add PM8019
Mark Brown [Tue, 9 Jun 2026 21:46:07 +0000 (22:46 +0100)] 
regulator: qcom_smd-regulator: Add PM8019

Stephan Gerhold <stephan.gerhold@linaro.org> says:

Add the definitions and dt-bindings for the regulators in PM8019 to allow
controlling them through the RPM firmware. PM8019 is typically used
together with the MDM9607 SoC.

Link: https://patch.msgid.link/20260608-rpm-smd-regulator-pm8019-v1-0-c671388b9ea5@linaro.org
7 days agoregulator: qcom_smd-regulator: Add PM8019
Stephan Gerhold [Mon, 8 Jun 2026 12:05:44 +0000 (14:05 +0200)] 
regulator: qcom_smd-regulator: Add PM8019

Add the definitions for the regulators in PM8019 to allow controlling them
through the RPM firmware. Reading the TYPE/SUBTYPE registers using SPMI
reveals that PM8019 uses a mixture of regulators from PMA8084 (hfsmps,
pldo) and PM8916 (nldo).

Signed-off-by: Stephan Gerhold <stephan@gerhold.net>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260608-rpm-smd-regulator-pm8019-v1-2-c671388b9ea5@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agoregulator: dt-bindings: qcom,smd-rpm-regulator: Add PM8019
Stephan Gerhold [Mon, 8 Jun 2026 12:05:43 +0000 (14:05 +0200)] 
regulator: dt-bindings: qcom,smd-rpm-regulator: Add PM8019

Add the qcom,rpm-pm8019-regulators compatible to allow describing
regulators controlled by the RPM firmware on platforms that use PM8019.

Signed-off-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Link: https://patch.msgid.link/20260608-rpm-smd-regulator-pm8019-v1-1-c671388b9ea5@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agospi: Use named initializers for platform_device_id arrays
Uwe Kleine-König (The Capable Hub) [Thu, 4 Jun 2026 20:55:26 +0000 (22:55 +0200)] 
spi: Use named initializers for platform_device_id arrays

Named initializers are more readable and more robust to changes of the
struct definition. This robustness is relevant for a planned change to
struct platform_device_id replacing .driver_data by an anonymous union.

While touching these arrays, unify spacing and usage of commas.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/3fcd432a505bb1bb7f8ef0fba9162243200b3347.1780606153.git.u.kleine-koenig@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agospi: rzv2h-rspi: Add suspend/resume support
Tommaso Merciai [Mon, 8 Jun 2026 20:25:08 +0000 (22:25 +0200)] 
spi: rzv2h-rspi: Add suspend/resume support

Add suspend/resume support to the rzv2h-rspi driver by implementing
suspend and resume callbacks that delegate to spi_controller_suspend()
and spi_controller_resume() respectively.

Signed-off-by: Tommaso Merciai <tommaso.merciai.xr@bp.renesas.com>
Link: https://patch.msgid.link/20260608202509.3651345-1-tommaso.merciai.xr@bp.renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agospi: qcom-geni: Fix cs_change handling on the last transfer
Viken Dadhaniya [Tue, 9 Jun 2026 08:43:09 +0000 (14:13 +0530)] 
spi: qcom-geni: Fix cs_change handling on the last transfer

TPM TIS SPI probe fails with:

   tpm_tis_spi: probe of spi11.0 failed with error -110

TPM TIS SPI sets cs_change=1 on single-transfer messages to keep CS
asserted across the header, wait-state, and data phases of a transaction.
CS deassertion between these phases violates the TCG SPI flow control
specification.

This bug was introduced by commit b99181cdf9fa ("spi-geni-qcom: remove
manual CS control"), which replaced manual CS control with automatic CS
control via the FRAGMENTATION bit. The FRAGMENTATION bit controls CS
behavior after a transfer: when set to 1, CS remains asserted; when
cleared to 0, CS is deasserted.

The commit correctly sets FRAGMENTATION for non-last transfers with
cs_change=0 to keep CS asserted between chained transfers, but misses the
case where cs_change=1 is set on the last transfer. When cs_change=1 on
the last transfer, the client requests CS to remain asserted after the
message completes, so FRAGMENTATION must be set to 1 in this case as well.

Fix setup_se_xfer() to set FRAGMENTATION when cs_change=1 on the last
transfer.

Also fix the same missing case in setup_gsi_xfer() and correct it to
write 1 instead of the raw bitmask FRAGMENTATION (value 4) to
peripheral.fragmentation. This field is a 1-bit boolean consumed by
gpi_create_spi_tre() via u32_encode_bits(..., TRE_SPI_GO_FRAG). Writing 4
to a 1-bit field causes u32_encode_bits() to mask it to 0, silently
disabling the FRAGMENTATION bit in the GPI TRE regardless of the
cs_change logic.
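
The masking effect can be reproduced in user space with a small model of the encode step. This is a sketch, not the kernel helper: encode_bits() is a hypothetical re-implementation of the general "mask the value, then shift it into the field" idea, and FRAG_FIELD is an assumed 1-bit field whose position is chosen only for illustration.

```c
#include <stdint.h>

/* Raw bitmask value named in the commit message. */
#define FRAGMENTATION 4u

/* Hypothetical 1-bit field; the bit position is an assumption. */
#define FRAG_FIELD (1u << 26)

/* Encode v into the (non-zero, contiguous) bitfield described by
 * 'field': keep only as many low bits of v as the field is wide, then
 * move them to the field's position. */
uint32_t encode_bits(uint32_t v, uint32_t field)
{
	uint32_t lsb = field & (~field + 1u);	/* lowest set bit of field */
	uint32_t mask = field / lsb;		/* field shifted down to bit 0 */

	return (v & mask) * lsb;
}
```

For a 1-bit field the mask is 1, so encode_bits(FRAGMENTATION, FRAG_FIELD) yields 0 because bit 0 of the value 4 is clear, whereas encode_bits(1, FRAG_FIELD) sets the field.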

Fixes: b99181cdf9fa ("spi-geni-qcom: remove manual CS control")
Cc: stable@vger.kernel.org
Reviewed-by: Jonathan Marek <jonathan@marek.ca>
Signed-off-by: Viken Dadhaniya <viken.dadhaniya@oss.qualcomm.com>
Link: https://patch.msgid.link/20260609-fix-spi-fragmentation-bit-logic-v2-1-e18efc255563@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agohwmon: (ina238) Add update_interval_us attribute
Ferdinand Schwenk [Tue, 9 Jun 2026 19:43:12 +0000 (21:43 +0200)] 
hwmon: (ina238) Add update_interval_us attribute

The INA238 family supports eight conversion time steps from 50 us to
4120 us (SQ52206: 66 us to 8230 us). At the millisecond granularity of
update_interval, the four shortest steps (50, 84, 150, 280 us) all
round to the same value and cannot be individually selected.

Add support for the generic update_interval_us attribute, which reports
and programs the same ADC cycle time as update_interval but in
microseconds, giving userspace full access to all conversion time steps.

Both attributes reflect the total cycle time including the active
averaging count: the reported value is the raw conversion time
multiplied by the number of averaged samples, and writes apply the
inverse mapping.
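
The arithmetic behind this can be sketched as below. It is an illustration only, not the driver code: us_to_ms_closest() and total_cycle_us() are hypothetical helpers, and round-to-nearest is an assumption about how a microsecond value ends up in a millisecond attribute.

```c
/* Convert microseconds to milliseconds, rounding to nearest
 * (assumed rounding mode, for illustration). */
unsigned int us_to_ms_closest(unsigned int us)
{
	return (us + 500u) / 1000u;
}

/* Total cycle time: raw conversion time multiplied by the number of
 * averaged samples, as described in the commit message. */
unsigned long total_cycle_us(unsigned int conv_us, unsigned int samples)
{
	return (unsigned long)conv_us * samples;
}
```

Under that assumption the four shortest steps (50, 84, 150 and 280 us) all map to 0 ms with a sample count of 1, so they cannot be told apart through update_interval, while update_interval_us keeps them distinct.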

Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Link: https://lore.kernel.org/r/20260609-hwmon-ina238-update-interval-us-v2-v3-3-016b55567950@advastore.com
[groeck: Fixed some multi-line alignment issues]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
7 days agohwmon: Add update_interval_us chip attribute
Ferdinand Schwenk [Tue, 9 Jun 2026 19:43:11 +0000 (21:43 +0200)] 
hwmon: Add update_interval_us chip attribute

Some hardware monitoring chips support update intervals below one
millisecond. The existing update_interval attribute uses millisecond
granularity, which causes sub-millisecond steps to round to the same
value and become inaccessible from userspace.

Introduce update_interval_us, a companion chip-level attribute that
expresses the same update interval in microseconds. Drivers
implementing this attribute should also implement update_interval for
compatibility with millisecond-based userspace interfaces.

Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Link: https://lore.kernel.org/r/20260609-hwmon-ina238-update-interval-us-v2-v3-2-016b55567950@advastore.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
7 days agohwmon: (ina238) Add support for samples and update_interval
Ferdinand Schwenk [Tue, 9 Jun 2026 19:43:10 +0000 (21:43 +0200)] 
hwmon: (ina238) Add support for samples and update_interval

Expose INA238 ADC averaging count (AVG) and conversion timing
(VBUSCT/VSHCT/VTCT) through chip-level hwmon attributes:

  chip/samples
  chip/update_interval

Use per-chip conversion-time lookup tables so the same helpers work
for INA228/INA237/INA238/INA700/INA780 and SQ52206. Cache ADC_CONFIG
in driver data and update it on writes to avoid extra register reads
during read-modify-write updates.

Report update_interval in milliseconds as required by the hwmon ABI.
Compute it from raw ADC cycle time multiplied by the active averaging
count, and apply the inverse mapping on writes so programmed conversion
time tracks the selected sample count.

Clamp user-provided update_interval before unit scaling to prevent
overflow in arithmetic conversions.

Also combine chip attributes in HWMON_CHANNEL_INFO using a bitwise OR
for a single logical chip channel.

Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com>
Link: https://lore.kernel.org/r/20260609-hwmon-ina238-update-interval-us-v2-v3-1-016b55567950@advastore.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
7 days agoiommu/dma: Do not try to iommu_map a 0 length region in swiotlb
Jason Gunthorpe [Mon, 8 Jun 2026 18:10:04 +0000 (15:10 -0300)] 
iommu/dma: Do not try to iommu_map a 0 length region in swiotlb

iommu_dma_iova_link_swiotlb() processes an unaligned mapping in three
parts: the head, the middle and the trailer. If the middle is empty because
there are no aligned pages, it calls down to iommu_map() with a size of 0,
which the iommupt implementation rejects as illegal.

It then tries to do an error unwind and starts from the wrong spot,
corrupting the mapping so that the eventual destruction triggers a WARN_ON.

Check for a 0 length and skip the mapping in that case, and use offset
rather than 0 as the starting point of the unlink.
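
How an empty middle arises can be seen from a simplified model of the three-way split. This is a sketch under stated assumptions, not the kernel function: struct split and split_unaligned() are hypothetical, and the granule is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Sizes of the three parts of an unaligned range. */
struct split {
	size_t head;	/* unaligned bytes before the first granule boundary */
	size_t middle;	/* whole granules in between */
	size_t tail;	/* unaligned bytes after the last granule boundary */
};

/* Split [phys, phys + size) at granule boundaries. */
struct split split_unaligned(uint64_t phys, size_t size, size_t granule)
{
	struct split s = { 0, 0, 0 };
	size_t start_pad = (size_t)(phys & (granule - 1));
	size_t end_pad = (size_t)((phys + size) & (granule - 1));
	size_t rest;

	if (start_pad) {
		s.head = granule - start_pad;
		if (s.head > size)
			s.head = size;
	}
	rest = size - s.head;
	if (rest) {
		s.tail = end_pad;
		s.middle = rest - s.tail;
	}
	return s;
}
```

For example, a 0x1000-byte buffer starting at 0x1800 with a 0x1000 granule splits into a 0x800 head, a 0x800 tail and a middle of 0, which is the case where the middle mapping has to be skipped.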

This is frequently triggered by using some kinds of thunderbolt NVMe
drives that trigger forced SWIOTLB for unaligned memory. NVMe seems to
pass in oddly aligned buffers for the passthrough commands from smartctl
that hit this condition.

Cc: stable@vger.kernel.org
Fixes: 433a76207dcf ("dma-mapping: Implement link/unlink ranges API")
Reported-by: Mark Lord <mlord@pobox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/0-v1-8536728bc89f+469-swiotlb_warn_jgg@nvidia.com
7 days agodrm/vc4: fix krealloc() memory leak
Alexander A. Klimov [Sat, 6 Jun 2026 12:38:10 +0000 (14:38 +0200)] 
drm/vc4: fix krealloc() memory leak

Don't just overwrite the original pointer passed to krealloc()
with its return value without checking the latter:

    MEM = krealloc(MEM, SZ, GFP);

If krealloc() returns NULL, that erases the pointer
to the still allocated memory, hence leaks this memory.
Instead, use a temporary variable, check it's not NULL
and only then assign it to the original pointer:

    TMP = krealloc(MEM, SZ, GFP);
    if (!TMP) return;
    MEM = TMP;

While at it, use krealloc_array().

Fixes: 6d45c81d229d ("drm/vc4: Add support for branching in shader validation.")
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Link: https://patch.msgid.link/20260606123817.37222-1-grandmaster@al2klimov.de
7 days agoi2c: qcom-geni: Use pm_runtime_force_{suspend,resume} helpers
Praveen Talari [Wed, 20 May 2026 07:14:29 +0000 (12:44 +0530)] 
i2c: qcom-geni: Use pm_runtime_force_{suspend,resume} helpers

The driver carries custom system suspend/resume handling that manually
tracks a suspended state and conditionally calls
geni_i2c_runtime_suspend() from the noirq suspend path, then adjusts
runtime PM state by hand. This duplicates PM core behavior and adds
unnecessary complexity.

Drop the manual state tracking and switch to pm_runtime_force_suspend()
and pm_runtime_force_resume() for system sleep. These helpers already
perform the required checks, call the runtime PM callbacks when needed,
and keep runtime PM state transitions consistent.

Reviewed-by: Mukesh Kumar Savaliya <mukesh.savaliya@oss.qualcomm.com>
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260520-use_pm_runtime_apis-v1-1-6a5238fc6cb6@oss.qualcomm.com
7 days agoregulator: mt6359: Fix vbbck default internal supply name
Chen-Yu Tsai [Tue, 9 Jun 2026 08:36:27 +0000 (16:36 +0800)] 
regulator: mt6359: Fix vbbck default internal supply name

This issue was pointed out by Sashiko.

vbbck is fed internally from vio18. For the MT6359, the default supply
name was incorrectly set as "VIO18", instead of the supply's default
"VIO18". In practice this still works, but it causes the regulator
description copy and replace to always happen. For the MT6359P the
name is correct.

Fix the supply name for MT6359 so that both instances are the same and
correct. Also copy the comment about the internal supply from the MT6359
list to the MT6359P list.

Fixes: 10be8fc1d534 ("regulator: mt6359: Add regulator supply names")
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://patch.msgid.link/20260609083630.1600070-1-wenst@chromium.org
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agos390/ap: Fix locking issue in SE bind and associate sysfs functions
Harald Freudenberger [Wed, 3 Jun 2026 13:04:56 +0000 (15:04 +0200)] 
s390/ap: Fix locking issue in SE bind and associate sysfs functions

Revisit and reorganize the locking and lock coverage of the
aq->lock spinlock as used in the two sysfs functions
se_bind_store() and se_associate_store().

A kernel run reported a possible deadlock situation, caused by
holding the spinlock (aq->lock) while triggering a uevent.
The fix rearranges the code protected by the spinlock by excluding
the uevent invocation, which does not require protection.

Additionally, the start of the protected region is moved earlier
to cover more lines, ensuring a consistent view of the AP queue
state between reading and updating its struct fields.

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
7.1.0-20260601.rc6.git12.516b5dbd4d4a.300.fc44.s390x+debug #1 Not tainted
-----------------------------------------------------
setupseguest.sh/11034 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
000001c991f498e8 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x5a/0x6d0
and this task is already holding:
000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
which would create a new lock dependency:
 (&aq->lock){+.-.}-{2:2} -> (fs_reclaim){+.+.}-{0:0}
but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&aq->lock){+.-.}-{2:2}
... which became SOFTIRQ-irq-safe at:
  __lock_acquire+0x5ae/0x15a0
  lock_acquire+0x14c/0x400
  _raw_spin_lock_bh+0x58/0xb0
  ap_tasklet_fn+0x72/0xd0
  tasklet_action_common+0x174/0x1b0
  handle_softirqs+0x180/0x5c0
  irq_exit_rcu+0x196/0x200
  do_ext_irq+0x12a/0x4d0
  ext_int_handler+0xc6/0xf0
  folio_zero_user+0x1c6/0x240
  folio_zero_user+0x182/0x240
  vma_alloc_anon_folio_pmd+0xa0/0x1d0
  __do_huge_pmd_anonymous_page+0x3a/0x200
  __handle_mm_fault+0x56c/0x590
  handle_mm_fault+0xa2/0x370
  do_exception+0x292/0x590
  __do_pgm_check+0x136/0x3e0
  pgm_check_handler+0x114/0x160
to a SOFTIRQ-irq-unsafe lock:
 (fs_reclaim){+.+.}-{0:0}
... which became SOFTIRQ-irq-unsafe at:
...
  __lock_acquire+0x5ae/0x15a0
  lock_acquire+0x14c/0x400
  __fs_reclaim_acquire+0x44/0x50
  fs_reclaim_acquire+0xbe/0x100
  fs_reclaim_correct_nesting+0x20/0x70
  dotest+0x5e/0x148
  locking_selftest+0x2854/0x2a88
  start_kernel+0x3b2/0x4f0
  startup_continue+0x2e/0x40
other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
       local_irq_disable();
       lock(&aq->lock);
       lock(fs_reclaim);
  <Interrupt>
    lock(&aq->lock);
 *** DEADLOCK ***
4 locks held by setupseguest.sh/11034:
 #0: 000000c485d01440 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0x2fc/0x380
 #1: 000000c4d2283288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x12a/0x270
 #2: 000000c4a1830e48 (kn->active#172){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1e/0x270
 #3: 000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&aq->lock){+.-.}-{2:2} {
   HARDIRQ-ON-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    _raw_spin_lock_bh+0x58/0xb0
    ap_queue_init_state+0x2e/0x50
    ap_scan_domains+0x5d6/0x620
    ap_scan_adapter+0x4c0/0x810
    ap_scan_bus+0x70/0x350
    ap_scan_bus_wq_callback+0x56/0x80
    process_one_work+0x2ba/0x820
    worker_thread+0x21a/0x400
    kthread+0x164/0x190
    __ret_from_fork+0x4c/0x340
    ret_from_fork+0xa/0x30
   IN-SOFTIRQ-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    _raw_spin_lock_bh+0x58/0xb0
    ap_tasklet_fn+0x72/0xd0
    tasklet_action_common+0x174/0x1b0
    handle_softirqs+0x180/0x5c0
    irq_exit_rcu+0x196/0x200
    do_ext_irq+0x12a/0x4d0
    ext_int_handler+0xc6/0xf0
    folio_zero_user+0x1c6/0x240
    folio_zero_user+0x182/0x240
    vma_alloc_anon_folio_pmd+0xa0/0x1d0
    __do_huge_pmd_anonymous_page+0x3a/0x200
    __handle_mm_fault+0x56c/0x590
    handle_mm_fault+0xa2/0x370
    do_exception+0x292/0x590
    __do_pgm_check+0x136/0x3e0
    pgm_check_handler+0x114/0x160
   INITIAL USE at:
   __lock_acquire+0x5ae/0x15a0
   lock_acquire+0x14c/0x400
   _raw_spin_lock_bh+0x58/0xb0
   ap_queue_init_state+0x2e/0x50
   ap_scan_domains+0x5d6/0x620
   ap_scan_adapter+0x4c0/0x810
   ap_scan_bus+0x70/0x350
   ap_scan_bus_wq_callback+0x56/0x80
   process_one_work+0x2ba/0x820
   worker_thread+0x21a/0x400
   kthread+0x164/0x190
   __ret_from_fork+0x4c/0x340
   ret_from_fork+0xa/0x30
 }
 ... key      at: [<000001c9936e8aa0>] __key.7+0x0/0x10
the dependencies between the lock to be acquired
 and SOFTIRQ-irq-unsafe lock:
-> (fs_reclaim){+.+.}-{0:0} {
   HARDIRQ-ON-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    __fs_reclaim_acquire+0x44/0x50
    fs_reclaim_acquire+0xbe/0x100
    fs_reclaim_correct_nesting+0x20/0x70
    dotest+0x5e/0x148
    locking_selftest+0x2854/0x2a88
    start_kernel+0x3b2/0x4f0
    startup_continue+0x2e/0x40
   SOFTIRQ-ON-W at:
    __lock_acquire+0x5ae/0x15a0
    lock_acquire+0x14c/0x400
    __fs_reclaim_acquire+0x44/0x50
    fs_reclaim_acquire+0xbe/0x100
    fs_reclaim_correct_nesting+0x20/0x70
    dotest+0x5e/0x148
    locking_selftest+0x2854/0x2a88
    start_kernel+0x3b2/0x4f0
    startup_continue+0x2e/0x40
   INITIAL USE at:
   __lock_acquire+0x5ae/0x15a0
   lock_acquire+0x14c/0x400
   __fs_reclaim_acquire+0x44/0x50
   fs_reclaim_acquire+0xbe/0x100
   fs_reclaim_correct_nesting+0x20/0x70
   dotest+0x5e/0x148
   locking_selftest+0x2854/0x2a88
   start_kernel+0x3b2/0x4f0
   startup_continue+0x2e/0x40
 }
 ... key      at: [<000001c991f498e8>] __fs_reclaim_map+0x0/0x30
 ... acquired at:
   check_prev_add+0x178/0xf40
   __lock_acquire+0x12aa/0x15a0
   lock_acquire+0x14c/0x400
   __fs_reclaim_acquire+0x44/0x50
   fs_reclaim_acquire+0xbe/0x100
   __kmalloc_cache_noprof+0x5a/0x6d0
   kobject_uevent_env+0xd4/0x420
   ap_send_se_bind_uevent+0x48/0x70
   se_bind_store+0x146/0x3a0
   kernfs_fop_write_iter+0x18c/0x270
   vfs_write+0x23c/0x380
   ksys_write+0x88/0x120
   __do_syscall+0x170/0x750
   system_call+0x72/0x90
stack backtrace:
CPU: 6 UID: 0 PID: 11034 Comm: setupseguest.sh Not tainted 7.1.0-20260601.rc6.git2.516b5dbd4d4a.300.fc44.s390x+debug #1 PREEMPT
Hardware name: IBM 9175 ME1 701 (KVM/Linux)
Call Trace:
 [<000001c98ffa0a7e>] dump_stack_lvl+0xae/0x108
 [<000001c9900a6d7a>] print_bad_irq_dependency+0x47a/0x480
 [<000001c9900a7184>] check_irq_usage+0x404/0x4c0
 [<000001c9900a73b8>] check_prev_add+0x178/0xf40
 [<000001c9900aaf1a>] __lock_acquire+0x12aa/0x15a0
 [<000001c9900ab35c>] lock_acquire+0x14c/0x400
 [<000001c9903be454>] __fs_reclaim_acquire+0x44/0x50
 [<000001c9903be51e>] fs_reclaim_acquire+0xbe/0x100
 [<000001c9903cf4ca>] __kmalloc_cache_noprof+0x5a/0x6d0
 [<000001c9910ca9d4>] kobject_uevent_env+0xd4/0x420
 [<000001c990d84098>] ap_send_se_bind_uevent+0x48/0x70
 [<000001c990d87416>] se_bind_store+0x146/0x3a0
 [<000001c99057da7c>] kernfs_fop_write_iter+0x18c/0x270
 [<000001c99047712c>] vfs_write+0x23c/0x380
 [<000001c990477438>] ksys_write+0x88/0x120
 [<000001c9910f64e0>] __do_syscall+0x170/0x750
 [<000001c99110a412>] system_call+0x72/0x90
INFO: lockdep is turned off.

Fixes: 4179c3984227 ("s390/ap: Implement SE bind and associate uevents")
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Suggested-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Finn Callies <fcallies@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
7 days agoASoC: SOF: amd: set ipc flags to zero
Vijendar Mukunda [Tue, 9 Jun 2026 16:08:45 +0000 (21:38 +0530)] 
ASoC: SOF: amd: set ipc flags to zero

As per design, set IPC conf structure flags to zero during acp init
sequence.

Link: https://github.com/thesofproject/linux/pull/5642
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Tested-by: Umang Jain <uajain@igalia.com>
Link: https://patch.msgid.link/20260609160938.3717513-2-Vijendar.Mukunda@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agoASoC: SOF: amd: fix for ipc flags check
Vijendar Mukunda [Tue, 9 Jun 2026 16:08:44 +0000 (21:38 +0530)] 
ASoC: SOF: amd: fix for ipc flags check

Firmware will set dsp_ack to 1 when firmware sends response for the IPC
command issued by host. Similarly dsp_msg flag will be updated to 1.

During ACP D0 entry, the value read from the sof_dsp_ack_write scratch
flag can be uninitialized. A non-zero garbage value is treated as a
pending DSP IPC ack before SOF_FW_BOOT_COMPLETE, causing a spurious
"IPC reply before FW_BOOT_COMPLETE" log.

Fix the condition checks for ipc flags.

Fixes: 738a2b5e2cc9 ("ASoC: SOF: amd: Add IPC support for ACP IP block")
Link: https://github.com/thesofproject/linux/pull/5642
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Tested-by: Umang Jain <uajain@igalia.com>
Link: https://patch.msgid.link/20260609160938.3717513-1-Vijendar.Mukunda@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
7 days agobtrfs: fix use-after-free after relocation failure with concurrent COW
Filipe Manana [Fri, 5 Jun 2026 15:15:37 +0000 (16:15 +0100)] 
btrfs: fix use-after-free after relocation failure with concurrent COW

If we get a failure during relocation, before we update all the extent
buffers that have file extent items pointing to extents from the block
group being relocated, we can trigger a use-after-free on the reloc
control structure (fs_info->reloc_control) if we have a concurrent task
that is COWing a subvolume leaf.

This happens like this:

1) Relocation of data block group X starts;

2) Relocation changes its state to UPDATE_DATA_PTRS;

3) A task doing a rename for example, COWs leaf A from a subvolume tree
   and ends up at btrfs_reloc_cow_block() and extracts fs_info->reloc_ctl
   into a local variable, which it then passes to replace_file_extents();

4) The relocation task gets an error and under the label 'out_put_bg' in
   btrfs_relocate_block_group() calls free_reloc_control(), which frees
   the reloc control structure that the rename task is using;

5) The rename task triggers a use-after-free on the reloc control
   structure that was just freed.

Syzbot reported this recently, with the following stack trace:

   [   88.389822][ T5325] BTRFS error (device loop0 state A): Transaction aborted (error -5)
   [   88.389842][ T5325] BTRFS: error (device loop0 state A) in cleanup_transaction:2067: errno=-5 IO failure
   [   88.389864][ T5325] BTRFS info (device loop0 state EA): forced readonly
   [   88.392277][ T5324] BTRFS: error (device loop0 state EA) in btrfs_sync_log:3572: errno=-5 IO failure
   [   88.396630][ T5325] BTRFS info (device loop0 state EA): balance: ended with status: -5
   [   88.400135][ T5346] ==================================================================
   [   88.400148][ T5346] BUG: KASAN: slab-use-after-free in replace_file_extents+0x85f/0x1590
   [   88.400288][ T5346] Read of size 8 at addr ffff888012312010 by task syz.0.0/5346
   [   88.400299][ T5346]
   [   88.400306][ T5346] CPU: 0 UID: 0 PID: 5346 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full)
   [   88.400319][ T5346] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
   [   88.400325][ T5346] Call Trace:
   [   88.400331][ T5346]  <TASK>
   [   88.400336][ T5346]  dump_stack_lvl+0xe8/0x150
   [   88.400351][ T5346]  print_address_description+0x55/0x1e0
   [   88.400364][ T5346]  ? replace_file_extents+0x85f/0x1590
   [   88.400378][ T5346]  print_report+0x58/0x70
   [   88.400389][ T5346]  kasan_report+0x117/0x150
   [   88.400405][ T5346]  ? replace_file_extents+0x85f/0x1590
   [   88.400420][ T5346]  replace_file_extents+0x85f/0x1590
   [   88.400440][ T5346]  ? __pfx_replace_file_extents+0x10/0x10
   [   88.400452][ T5346]  ? update_ref_for_cow+0xa71/0x1270
   [   88.400473][ T5346]  btrfs_force_cow_block+0xa4d/0x2450
   [   88.400492][ T5346]  ? __pfx_btrfs_force_cow_block+0x10/0x10
   [   88.400508][ T5346]  ? __pfx_btrfs_get_32+0x10/0x10
   [   88.400523][ T5346]  btrfs_cow_block+0x3c4/0xa90
   [   88.400542][ T5346]  push_leaf_left+0x2ac/0x4a0
   [   88.400561][ T5346]  split_leaf+0xd16/0x12e0
   [   88.400574][ T5346]  ? btrfs_bin_search+0x924/0xc70
   [   88.400592][ T5346]  ? __pfx_split_leaf+0x10/0x10
   [   88.400602][ T5346]  ? leaf_space_used+0x177/0x1e0
   [   88.400618][ T5346]  ? btrfs_leaf_free_space+0x14a/0x2f0
   [   88.400634][ T5346]  btrfs_search_slot+0x2641/0x2d20
   [   88.400654][ T5346]  ? __pfx_btrfs_search_slot+0x10/0x10
   [   88.400669][ T5346]  ? rcu_is_watching+0x15/0xb0
   [   88.400681][ T5346]  ? trace_kmem_cache_alloc+0x29/0xe0
   [   88.400694][ T5346]  btrfs_insert_empty_items+0x9c/0x190
   [   88.400711][ T5346]  btrfs_insert_inode_ref+0x229/0xcb0
   [   88.400724][ T5346]  ? __pfx_btrfs_insert_inode_ref+0x10/0x10
   [   88.400736][ T5346]  ? __pfx_btrfs_qgroup_convert_reserved_meta+0x10/0x10
   [   88.400751][ T5346]  ? btrfs_record_root_in_trans+0x124/0x180
   [   88.400767][ T5346]  ? start_transaction+0x8a0/0x1820
   [   88.400778][ T5346]  ? btrfs_set_inode_index+0x5e/0x100
   [   88.400787][ T5346]  btrfs_rename2+0x17bb/0x40d0
   [   88.400800][ T5346]  ? check_noncircular+0xda/0x150
   [   88.400814][ T5346]  ? add_lock_to_list+0xc7/0x100
   [   88.400828][ T5346]  ? __pfx_btrfs_rename2+0x10/0x10
   [   88.400842][ T5346]  ? lockdep_hardirqs_on+0x7a/0x110
   [   88.400901][ T5346]  ? lock_acquire+0x221/0x350
   [   88.400915][ T5346]  ? down_write_nested+0x174/0x210
   [   88.400931][ T5346]  ? __pfx_down_write_nested+0x10/0x10
   [   88.400941][ T5346]  ? do_raw_spin_unlock+0x4d/0x210
   [   88.400952][ T5346]  ? try_break_deleg+0x5b/0x180
   [   88.400963][ T5346]  ? __pfx_btrfs_rename2+0x10/0x10
   [   88.400973][ T5346]  vfs_rename+0xa96/0xeb0
   [   88.400992][ T5346]  ? __pfx_vfs_rename+0x10/0x10
   [   88.401010][ T5346]  ovl_fill_super+0x46b7/0x5e20
   [   88.401030][ T5346]  ? __pfx_ovl_fill_super+0x10/0x10
   [   88.401042][ T5346]  ? xas_create+0x1902/0x1b90
   [   88.401060][ T5346]  ? __pfx___mutex_trylock_common+0x10/0x10
   [   88.401076][ T5346]  ? trace_contention_end+0x3d/0x140
   [   88.401094][ T5346]  ? shrinker_register+0x124/0x230
   [   88.401111][ T5346]  ? __mutex_unlock_slowpath+0x1be/0x6f0
   [   88.401127][ T5346]  ? shrinker_register+0x61/0x230
   [   88.401143][ T5346]  ? __pfx___mutex_lock+0x10/0x10
   [   88.401158][ T5346]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
   [   88.401177][ T5346]  ? __raw_spin_lock_init+0x45/0x100
   [   88.401196][ T5346]  ? sget_fc+0x962/0xa40
   [   88.401208][ T5346]  ? __pfx_set_anon_super_fc+0x10/0x10
   [   88.401222][ T5346]  ? __pfx_ovl_fill_super+0x10/0x10
   [   88.401241][ T5346]  get_tree_nodev+0xbb/0x150
   [   88.401257][ T5346]  vfs_get_tree+0x92/0x2a0
   [   88.401272][ T5346]  do_new_mount+0x341/0xd30
   [   88.401283][ T5346]  ? apparmor_capable+0x126/0x170
   [   88.401301][ T5346]  ? __pfx_do_new_mount+0x10/0x10
   [   88.401311][ T5346]  ? ns_capable+0x89/0xe0
   [   88.401322][ T5346]  ? path_mount+0x690/0x10e0
   [   88.401333][ T5346]  ? user_path_at+0xd4/0x160
   [   88.401346][ T5346]  __se_sys_mount+0x31d/0x420
   [   88.401358][ T5346]  ? __pfx___se_sys_mount+0x10/0x10
   [   88.401370][ T5346]  ? __x64_sys_mount+0x20/0xc0
   [   88.401381][ T5346]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401391][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401403][ T5346]  ? trace_irq_disable+0x3b/0x140
   [   88.401413][ T5346]  ? clear_bhb_loop+0x40/0x90
   [   88.401421][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401429][ T5346] RIP: 0033:0x7fa1ff79ce59
   [   88.401436][ T5346] Code: ff c3 66 (...)
   [   88.401443][ T5346] RSP: 002b:00007fa2005affe8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
   [   88.401456][ T5346] RAX: ffffffffffffffda RBX: 00007fa1ffa16180 RCX: 00007fa1ff79ce59
   [   88.401464][ T5346] RDX: 0000200000000100 RSI: 0000200000002240 RDI: 0000000000000000
   [   88.401474][ T5346] RBP: 00007fa1ff832d6f R08: 0000200000000440 R09: 0000000000000000
   [   88.401481][ T5346] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [   88.401488][ T5346] R13: 00007fa1ffa16218 R14: 00007fa1ffa16180 R15: 00007ffc734fba78
   [   88.401500][ T5346]  </TASK>
   [   88.401506][ T5346]
   [   88.401510][ T5346] Allocated by task 5325:
   [   88.401516][ T5346]  kasan_save_track+0x3e/0x80
   [   88.401529][ T5346]  __kasan_kmalloc+0x93/0xb0
   [   88.401542][ T5346]  __kmalloc_cache_noprof+0x31c/0x660
   [   88.401554][ T5346]  btrfs_relocate_block_group+0x217/0xc40
   [   88.401568][ T5346]  btrfs_relocate_chunk+0x115/0x820
   [   88.401577][ T5346]  __btrfs_balance+0x1db0/0x2ae0
   [   88.401587][ T5346]  btrfs_balance+0xaf3/0x11b0
   [   88.401596][ T5346]  btrfs_ioctl_balance+0x3d3/0x610
   [   88.401612][ T5346]  __se_sys_ioctl+0xfc/0x170
   [   88.401626][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401640][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401650][ T5346]
   [   88.401653][ T5346] Freed by task 5325:
   [   88.401659][ T5346]  kasan_save_track+0x3e/0x80
   [   88.401671][ T5346]  kasan_save_free_info+0x46/0x50
   [   88.401680][ T5346]  __kasan_slab_free+0x5c/0x80
   [   88.401692][ T5346]  kfree+0x1c5/0x640
   [   88.401703][ T5346]  btrfs_relocate_block_group+0x95d/0xc40
   [   88.401715][ T5346]  btrfs_relocate_chunk+0x115/0x820
   [   88.401724][ T5346]  __btrfs_balance+0x1db0/0x2ae0
   [   88.401733][ T5346]  btrfs_balance+0xaf3/0x11b0
   [   88.401742][ T5346]  btrfs_ioctl_balance+0x3d3/0x610
   [   88.401757][ T5346]  __se_sys_ioctl+0xfc/0x170
   [   88.401770][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401785][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401795][ T5346]
   [   88.401798][ T5346] The buggy address belongs to the object at ffff888012312000
   [   88.401798][ T5346]  which belongs to the cache kmalloc-2k of size 2048
   [   88.401807][ T5346] The buggy address is located 16 bytes inside of
   [   88.401807][ T5346]  freed 2048-byte region [ffff888012312000, ffff888012312800)
   [   88.401819][ T5346]
   [   88.401822][ T5346] The buggy address belongs to the physical page:
   [   88.401829][ T5346] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12310
   [   88.401840][ T5346] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   [   88.401849][ T5346] flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
   [   88.401860][ T5346] page_type: f5(slab)
   [   88.401871][ T5346] raw: 00fff00000000040 ffff88801ac42000 dead000000000100 dead000000000122
   [   88.401881][ T5346] raw: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
   [   88.401892][ T5346] head: 00fff00000000040 ffff88801ac42000 dead000000000100 dead000000000122
   [   88.401902][ T5346] head: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
   [   88.401913][ T5346] head: 00fff00000000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
   [   88.401923][ T5346] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
   [   88.401929][ T5346] page dumped because: kasan: bad access detected
   [   88.401935][ T5346] page_owner tracks the page as allocated
   [   88.401941][ T5346] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 9, tgid 9 (kworker/0:0), ts 83905464494, free_ts 83674944822
   [   88.401961][ T5346]  post_alloc_hook+0x231/0x280
   [   88.401975][ T5346]  get_page_from_freelist+0x24ba/0x2540
   [   88.401990][ T5346]  __alloc_frozen_pages_noprof+0x18d/0x380
   [   88.402004][ T5346]  allocate_slab+0x77/0x660
   [   88.402019][ T5346]  refill_objects+0x339/0x3d0
   [   88.402033][ T5346]  __pcs_replace_empty_main+0x321/0x720
   [   88.402043][ T5346]  __kmalloc_node_track_caller_noprof+0x572/0x7b0
   [   88.402055][ T5346]  __alloc_skb+0x2c1/0x7d0
   [   88.402067][ T5346]  mld_newpack+0x14c/0xc90
   [   88.402080][ T5346]  add_grhead+0x5a/0x2a0
   [   88.402093][ T5346]  add_grec+0x1452/0x1740
   [   88.402105][ T5346]  mld_ifc_work+0x6e6/0xe70
   [   88.402116][ T5346]  process_scheduled_works+0xb5d/0x1860
   [   88.402127][ T5346]  worker_thread+0xa53/0xfc0
   [   88.402138][ T5346]  kthread+0x389/0x470
   [   88.402150][ T5346]  ret_from_fork+0x514/0xb70
   [   88.402161][ T5346] page last free pid 5282 tgid 5282 stack trace:
   [   88.402168][ T5346]  __free_frozen_pages+0xbc7/0xd30
   [   88.402180][ T5346]  __slab_free+0x274/0x2c0
   [   88.402191][ T5346]  qlist_free_all+0x99/0x100
   [   88.402201][ T5346]  kasan_quarantine_reduce+0x148/0x160
   [   88.402211][ T5346]  __kasan_slab_alloc+0x22/0x80
   [   88.402221][ T5346]  __kmalloc_cache_noprof+0x2ba/0x660
   [   88.402231][ T5346]  kernfs_fop_open+0x3f0/0xda0
   [   88.402253][ T5346]  do_dentry_open+0x785/0x14e0
   [   88.402262][ T5346]  vfs_open+0x3b/0x340
   [   88.402270][ T5346]  path_openat+0x2e08/0x3860
   [   88.402281][ T5346]  do_file_open+0x23e/0x4a0
   [   88.402292][ T5346]  do_sys_openat2+0x113/0x200
   [   88.402300][ T5346]  __x64_sys_openat+0x138/0x170
   [   88.402309][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.402326][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.402336][ T5346]
   [   88.402339][ T5346] Memory state around the buggy address:
   [   88.402345][ T5346]  ffff888012311f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   [   88.402352][ T5346]  ffff888012311f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   [   88.402359][ T5346] >ffff888012312000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402365][ T5346]                          ^
   [   88.402370][ T5346]  ffff888012312080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402380][ T5346]  ffff888012312100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402385][ T5346] ==================================================================

Fix this by:

1) Making the reloc control structure ref counted;

2) Making every place that accesses fs_info->reloc_ctl outside the relocation
   code, which at the moment is only replace_file_extents() and
   btrfs_init_reloc_root(), get a reference count on the structure.
   There's also btrfs_update_reloc_root() that is called outside the
   relocation code, but this case is safe because it's only called in
   the transaction commit path while under the fs_info->reloc_mutex
   protection, but nevertheless grab a reference to make the code more
   consistent and avoid false alerts from AI reviews;

3) Adding a spinlock to protect fs_info->reloc_ctl, since we can not take the
   fs_info->reloc_mutex as that would cause a deadlock since that lock is
   taken in the transaction commit path. That spinlock is taken before
   setting fs_info->reloc_ctl to an allocated structure, setting it to
   NULL and reading fs_info->reloc_ctl;

4) Making sure the structure is freed only when its reference count drops to
   zero.
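
The four steps above can be illustrated with a small userspace sketch. This
is not the btrfs code: struct fs_ctx, struct reloc_ctl, the helper names and
the freed_count counter are all hypothetical, and a pthread mutex stands in
for the new spinlock. It only shows the pattern of taking a reference while
holding the lock that protects the shared pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the reloc control structure. */
struct reloc_ctl {
	atomic_int refs;
	int stage;
};

/* Hypothetical stand-in for fs_info; the mutex plays the spinlock's role. */
struct fs_ctx {
	pthread_mutex_t lock;
	struct reloc_ctl *reloc_ctl;
};

static int freed_count; /* counts frees so the sketch can be checked */

static struct reloc_ctl *reloc_ctl_alloc(void)
{
	struct reloc_ctl *rc = calloc(1, sizeof(*rc));

	if (rc)
		atomic_init(&rc->refs, 1); /* the owner's reference */
	return rc;
}

static void reloc_ctl_put(struct reloc_ctl *rc)
{
	/* fetch_sub returns the old value; 1 means this was the last ref. */
	int prev = atomic_fetch_sub(&rc->refs, 1);

	assert(prev > 0);
	if (prev == 1) {
		free(rc);
		freed_count++;
	}
}

/* Publish or clear the shared pointer under the lock. */
static void fs_set_reloc_ctl(struct fs_ctx *fs, struct reloc_ctl *rc)
{
	pthread_mutex_lock(&fs->lock);
	fs->reloc_ctl = rc;
	pthread_mutex_unlock(&fs->lock);
}

/* Read the pointer and take a reference before dropping the lock. */
static struct reloc_ctl *fs_get_reloc_ctl(struct fs_ctx *fs)
{
	struct reloc_ctl *rc;

	pthread_mutex_lock(&fs->lock);
	rc = fs->reloc_ctl;
	if (rc)
		atomic_fetch_add(&rc->refs, 1);
	pthread_mutex_unlock(&fs->lock);
	return rc;
}
```

In this sketch a reader that obtained the structure through
fs_get_reloc_ctl() can keep using it after the owner cleared the pointer and
dropped its own reference; the memory is released only by the last
reloc_ctl_put().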

Reported-by: syzbot+0eea49bba18051dea35e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1df323.bb0696ed.125a22.000a.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: move WARN_ON on unexpected error in __add_tree_block()
Filipe Manana [Fri, 5 Jun 2026 16:25:50 +0000 (17:25 +0100)] 
btrfs: move WARN_ON on unexpected error in __add_tree_block()

There's no point in having the WARN_ON(1) inside the if statement for the
unexpected error. Move it into the if statement's condition, which brings
a couple of benefits:

1) It marks the branch as unlikely, hinting the compiler to generate
   better code;

2) The WARN_ON() produces a stack trace after the dumped leaf and error
   message which can hide that more important information in case we get
   a truncated dmesg/syslog.
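
As an illustration of the pattern only, here is a userspace sketch of a
warning macro used directly as the if condition. The WARN_ON() below is a
simplified stand-in and not the kernel macro; warn_on_impl(), warn_count and
check_lookup_result() are hypothetical names, and __builtin_expect is a
GCC/Clang builtin.

```c
#include <stdio.h>

static int warn_count; /* number of warnings emitted, for the sketch only */

/* Userspace stand-in: report the location and return the truth value. */
static inline int warn_on_impl(int cond, const char *file, int line)
{
	if (cond) {
		warn_count++;
		fprintf(stderr, "WARNING at %s:%d\n", file, line);
	}
	return cond;
}

/* __builtin_expect marks the warning branch as unlikely for the compiler. */
#define WARN_ON(cond) \
	__builtin_expect(warn_on_impl(!!(cond), __FILE__, __LINE__), 0)

/* Hypothetical caller: returns -1 on the unexpected result, 0 otherwise. */
static int check_lookup_result(int ret)
{
	/* Before: if (ret > 0) { report(); WARN_ON(1); return -1; } */
	if (WARN_ON(ret > 0)) {
		fprintf(stderr, "unexpected lookup result %d\n", ret);
		return -1;
	}
	return 0;
}
```

Because the macro evaluates to the tested condition, the branch needs no
separate WARN_ON(1) statement in its body.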

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: move locking into btrfs_get_reloc_bg_bytenr()
Filipe Manana [Fri, 5 Jun 2026 16:07:08 +0000 (17:07 +0100)] 
btrfs: move locking into btrfs_get_reloc_bg_bytenr()

It does not make sense for the single caller to have the responsibility
to lock the relocation mutex before calling the function and then have
the function assert the lock is held. As this is a function in
relocation.c, move the locking details into it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: lzo: reject compressed segment that overflows the compressed input
Weiming Shi [Sun, 7 Jun 2026 05:25:13 +0000 (22:25 -0700)] 
btrfs: lzo: reject compressed segment that overflows the compressed input

lzo_decompress_bio() validates each on-disk segment length seg_len only
against the workspace cbuf size, not against the compressed input size
(compressed_len, the total folio bytes of the bio).  A crafted extent can
carry a segment whose seg_len passes the cbuf check but runs past the end
of the bio, so copy_compressed_segment() walks off the last folio:
get_current_folio() then returns the NULL folio from bio_next_folio(), and
with CONFIG_BTRFS_ASSERT disabled (default) folio_size(NULL) faults.

 BUG: KASAN: null-ptr-deref in lzo_decompress_bio (fs/btrfs/lzo.c:383)
 Read of size 8 at addr 0000000000000000 by task kworker/u8:1/29
 Workqueue: btrfs-endio simple_end_io_work
  kasan_report (mm/kasan/report.c:590)
  lzo_decompress_bio (fs/btrfs/lzo.c:383)
  end_bbio_compressed_read (fs/btrfs/compression.c:1065)
  btrfs_bio_end_io (fs/btrfs/bio.c:135)
  btrfs_check_read_bio (fs/btrfs/bio.c:180 fs/btrfs/bio.c:285)
  simple_end_io_work
  process_one_work
  worker_thread

Reject any segment whose payload would extend beyond compressed_len before
copying it, treating it as corruption like the other on-disk validation
failures in this function.
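
A minimal sketch of such a bounds check, written as standalone C rather than
the actual lzo.c change. segment_fits(), its parameters and SEG_HEADER_LEN
are hypothetical, and the 4-byte per-segment length header is an assumption
of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_HEADER_LEN 4u /* assumed size of the per-segment length header */

/*
 * Return true if a segment whose header starts at offset cur_in and whose
 * payload is seg_len bytes lies entirely inside compressed_len bytes of
 * compressed input.
 */
static bool segment_fits(uint32_t cur_in, uint32_t seg_len,
			 uint32_t compressed_len)
{
	/* The header itself must fit in the remaining input. */
	if (cur_in > compressed_len ||
	    compressed_len - cur_in < SEG_HEADER_LEN)
		return false;
	/* Compare against the remainder so cur_in + seg_len cannot overflow. */
	return seg_len <= compressed_len - cur_in - SEG_HEADER_LEN;
}
```

A caller would treat a false result as on-disk corruption and stop before
copying any payload bytes.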

Reported-by: Xiang Mei <xmei5@asu.edu>
Fixes: a6e66e6f8c1b ("btrfs: rework lzo_decompress_bio() to make it subpage compatible")
Assisted-by: Claude:claude-opus-4-8
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: retry faulting in the pages after a zero sized short direct write
Qu Wenruo [Thu, 4 Jun 2026 00:29:48 +0000 (09:59 +0930)] 
btrfs: retry faulting in the pages after a zero sized short direct write

Currently btrfs_direct_write() will not try to fault in the pages, but
directly fall back to buffered writes, if the first page of the buffer
can not be faulted in.

For example, during generic/362 with nodatasum mount option, there is a
write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
Then we go the following callchain and directly fall back to buffered
IO:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is now zero.
 |
 |- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
 |  @ret is zero, thus not meeting the above retry condition
 |
 |- Fallback to buffered

Just slightly loosen the condition to allow retry faulting in pages after
a zero sized short write.
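
To make the difference concrete, the sketch below compares the condition
quoted in the call chain with one possible loosened form. Both helpers are
hypothetical, and the loosened expression is an assumption made for
illustration, not a copy of the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Condition quoted in the call chain: a zero return never retries. */
static bool should_retry_old(size_t remaining, long ret)
{
	return remaining > 0 && (ret == -EFAULT || ret > 0);
}

/* Assumed loosened form: a zero sized short write retries as well. */
static bool should_retry_new(size_t remaining, long ret)
{
	return remaining > 0 && (ret == -EFAULT || ret >= 0);
}
```

With 4096 bytes still pending and a return value of 0, the first predicate
is false and the second is true; other errors such as -EIO still do not
trigger a retry in either form.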

Unlike the previous two bug fixes, this one does not fix any real bug;
the old behavior only reduces the chance of doing zero-copy direct IO.
Thus it doesn't really require stable-CC nor fixes-tag.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: fix incorrect buffered IO fallback for append direct writes
Qu Wenruo [Thu, 4 Jun 2026 00:29:47 +0000 (09:59 +0930)] 
btrfs: fix incorrect buffered IO fallback for append direct writes

[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with nodatasum
mount option:

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if it's beyond the current isize.

But if the direct io finished short, we do not revert the isize to the
previous value nor to the short write end.

Then if we need to fall back to buffered writes, and the write has
IOCB_APPEND flag, then the buffered write will be positioned at the
incorrect isize.

The call chain looks like this:

 btrfs_direct_write(pos=0, length=4K)
 |- __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     |- btrfs_get_blocks_direct_write()
 |  |        |- i_size_write()
 |  |           Which updates the isize to the write end (4K).
 |  |
 |  |- iomap_dio_iter()
 |  |  Failed with -EFAULT on the first page.
 |  |
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |     Detects a short write, return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0;}
 |     Which resets the return value.
 |
 |- ret = iomap_dio_complete()
 |  Which returns 0.
 |
 |- btrfs_buffered_write(iocb, from);
    |- generic_write_checks()
       |- iocb->ki_pos = i_size_read()
          Which is still the new size (4K), rather than the original
          isize 0.

[FIX]
Introduce the following btrfs_dio_data members:

- old_isize

- updated_isize
  If the direct write has enlarged the isize.

Then if we got a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on old_isize and current file
position.
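
One way to picture the revert is the sketch below. It is only an assumption
about how "old_isize and current file position" could be combined: the
struct mirrors the two member names above, while isize_after_short_write()
and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two new members named above. */
struct dio_isize_state {
	uint64_t old_isize;  /* isize before the direct write started */
	bool updated_isize;  /* true if the direct write enlarged isize */
};

/*
 * Return the isize to keep after a short write that stopped at write_end.
 * cur_isize is the (possibly enlarged) size currently recorded.
 */
static uint64_t isize_after_short_write(const struct dio_isize_state *st,
					uint64_t cur_isize,
					uint64_t write_end)
{
	if (!st->updated_isize)
		return cur_isize; /* nothing was enlarged, nothing to revert */
	/* Keep bytes that were written, but never go below the old size. */
	return write_end > st->old_isize ? write_end : st->old_isize;
}
```

For the generic/362 case described above (old isize 0, enlarged to 4K,
nothing written), this yields 0, so a following append write starts at
offset 0 instead of 4K.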

And here we call i_size_write() without holding an extent lock, which is
a very special case that we're safe to do:

 - Only a single writer can be enlarging isize
   Enlarging isize will take the exclusive inode lock.

 - Buffered readers need to wait for the OE we're holding
   Buffered readers will lock extent and wait for OE of the folio range.
   Sometimes we can skip the OE wait, but since all page cache is
   invalidated, the OE wait can not be skipped.

But I do not think this is the most elegant solution, nor covers all
cases. E.g. if the bio is submitted but IO failed, we are unable to do
the revert.

I believe the more elegant one would be to extend the EXTENT_DIO_LOCKED
lifespan for direct writes, so that we can update the isize when a
write beyond EOF finished successfully.

However that change is too huge for a small bug fix.
So only implement the minimal partial fix for now.

[REASON FOR NO FIXES TAG]
The bug is again very old; even before commit f85781fb505e ("btrfs: switch
to iomap for direct IO") we were already increasing isize without a
proper rollback for short writes.

Thus only a CC to stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: fix false IO failure after falling back to buffered write
Qu Wenruo [Thu, 4 Jun 2026 00:29:46 +0000 (09:59 +0930)] 
btrfs: fix false IO failure after falling back to buffered write

[BUG]
The test case generic/362 will fail with "nodatasum" mount option (*):

 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +First write failed: Input/output error
     Silence is golden
    ...

*: If the test case has been executed before with default data checksum,
the failure will not reproduce. Need the following fix to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
Thus we never got an error pointer from __iomap_dio_rw().

The call chain looks like this:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  |- btrfs_finish_ordered_extent(uptodate = false)
 |  |  |  |  |- can_finish_ordered_extent()
 |  |  |  |     |- btrfs_mark_ordered_extent_error()
 |  |  |  |        |- mapping_set_error()
 |  |  |  |           Now the address space is marked error.
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |     Thus no error pointer will be returned.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is 0.
 |
 |- Fallback to buffered IO
 |  And the buffered write finished without error
 |
 |- filemap_fdatawait_range()
    |- filemap_check_errors()
       The previous error is recorded, thus an error is returned

However, the buffered write is properly submitted and finished; the error
comes from the btrfs_finish_ordered_extent() call with @uptodate = false.

[FIX]
When a short dio write happens, any range that is submitted will have
btrfs_extract_ordered_extent() called on it, thus the submitted range
will always have an OE just covering the submitted range.

The remaining OE range is never submitted, thus it should be treated
as truncated, not an error. So that we can properly reclaim and not
insert an unnecessary file extent item, without marking the mapping as
error.

Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.

[REASON FOR NO FIXES TAG]
The bug itself is pretty old: as of commit f85781fb505e ("btrfs: switch
to iomap for direct IO") we were already passing @uptodate=false when
finishing the OE.
But at that time an OE with IOERR did not call mapping_set_error(), so
the bug was not exposed.
Later commit d61bec08b904 ("btrfs: mark ordered extent and inode with
error if we fail to finish") finally exposed the bug, but that commit
does the right thing and is not the root cause.

Anyway the bug is very old, dating back to 5.1x days, thus only CC to
stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: use verbose assertions in backref.c
Filipe Manana [Tue, 2 Jun 2026 13:42:17 +0000 (14:42 +0100)] 
btrfs: use verbose assertions in backref.c

While debugging a relocation issue I hit an assertion in backref.c, but
it was not very useful since it did not report the unexpected value that
triggered it. The stack trace was this:

  [583246.338097] assertion failed: !cache->nr_nodes, in fs/btrfs/backref.c:3158
  [583246.339588] ------------[ cut here ]------------
  [583246.340573] kernel BUG at fs/btrfs/backref.c:3158!
  [583246.342075] Oops: invalid opcode: 0000 [#1] SMP PTI
  [583246.343294] CPU: 5 UID: 0 PID: 677957 Comm: btrfs Not tainted 7.1.0-rc4-btrfs-next-234+ #1 PREEMPT(full)
  [583246.345715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [583246.348694] RIP: 0010:btrfs_backref_release_cache.cold+0x61/0x84 [btrfs]
  [583246.350759] Code: 90 d5 7c (...)
  [583246.354923] RSP: 0018:ffffd4fc88c93ad8 EFLAGS: 00010246
  [583246.355982] RAX: 000000000000003e RBX: ffff8dec90d97020 RCX: 0000000000000000
  [583246.357459] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
  [583246.359517] RBP: ffff8dec8eeb78c0 R08: 0000000000000000 R09: 3fffffffffefffff
  [583246.361180] R10: ffffd4fc88c93970 R11: 0000000000000003 R12: ffff8decd21f3470
  [583246.363184] R13: 00000000fffffffe R14: ffff8decd21f3000 R15: ffff8decd21f3000
  [583246.364666] FS:  00007f9a51751400(0000) GS:ffff8df3f4255000(0000) knlGS:0000000000000000
  [583246.366287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [583246.367443] CR2: 00007f9a518ed8f5 CR3: 00000004467c8002 CR4: 0000000000370ef0
  [583246.368969] Call Trace:
  [583246.369541]  <TASK>
  [583246.370040]  relocate_block_group+0xf2/0x520 [btrfs]
  [583246.371243]  btrfs_relocate_block_group+0x9a9/0x22e0 [btrfs]
  [583246.372443]  ? preempt_count_add+0x47/0xa0
  [583247.532978]  ? btrfs_tree_read_lock_nested+0x19/0x90 [btrfs]
  [583247.534520]  ? mutex_lock+0x1a/0x40
  [583247.602233]  ? btrfs_scrub_pause+0x2e/0x120 [btrfs]
  [583247.603543]  btrfs_relocate_chunk+0x3b/0x1a0 [btrfs]
  [583247.604893]  btrfs_balance+0x9d5/0x1920 [btrfs]
  [583247.606189]  ? preempt_count_add+0x69/0xa0
  [583247.607030]  btrfs_ioctl+0x260c/0x2a20 [btrfs]
  [583247.608015]  ? __memcg_slab_free_hook+0x156/0x1a0
  [583247.636971]  __x64_sys_ioctl+0x92/0xe0
  [583247.679247]  do_syscall_64+0x60/0xf20
  [583247.753297]  ? clear_bhb_loop+0x60/0xb0
  [583247.756321]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [583247.787018] RIP: 0033:0x7f9a5186a8db
  [583247.787787] Code: 00 48 89 (...)
  [583247.791410] RSP: 002b:00007fff2ffa6ac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [583247.792897] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f9a5186a8db
  [583247.794319] RDX: 00007fff2ffa6bb0 RSI: 00000000c4009420 RDI: 0000000000000003
  [583247.795714] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [583247.797149] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff2ffa903f
  [583247.798685] R13: 00007fff2ffa6bb0 R14: 0000000000000002 R15: 0000000000000002
  [583247.800136]  </TASK>

So update all simple assertions in backref.c to print out the values when
they aren't testing simple boolean conditions.
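
As a rough illustration of what a verbose assertion buys, here is a
user-space sketch. VERBOSE_ASSERT() and format_assert_failure() are
hypothetical names invented for this example; they are not the btrfs
ASSERT() implementation, only a model of the idea of printing the
offending values next to the failed expression.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Build "assertion failed: <expr>, in <file>:<line> (<formatted values>)".
 * Returns 0 on success, -1 if the message does not fit in the buffer.
 */
static int format_assert_failure(char *buf, size_t size, const char *expr,
				 const char *file, int line,
				 const char *fmt, ...)
{
	va_list ap;
	int n, m;

	n = snprintf(buf, size, "assertion failed: %s, in %s:%d (",
		     expr, file, line);
	if (n < 0 || (size_t)n >= size)
		return -1;

	va_start(ap, fmt);
	m = vsnprintf(buf + n, size - (size_t)n, fmt, ap);
	va_end(ap);
	/* Need room for the values, the closing ')' and the terminator. */
	if (m < 0 || (size_t)(n + m) + 1 >= size)
		return -1;

	buf[n + m] = ')';
	buf[n + m + 1] = '\0';
	return 0;
}

/* Hypothetical macro: report the values, then abort like a failed assert. */
#define VERBOSE_ASSERT(cond, ...)					\
	do {								\
		if (!(cond)) {						\
			char msg_[256];					\
			if (format_assert_failure(msg_, sizeof(msg_),	\
					#cond, __FILE__, __LINE__,	\
					__VA_ARGS__) == 0)		\
				fprintf(stderr, "%s\n", msg_);		\
			abort();					\
		}							\
	} while (0)
```

A call such as VERBOSE_ASSERT(nr_nodes == 0, "nr_nodes=%u", nr_nodes)
would then report the actual counter value instead of only the failed
expression.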

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: print a message when a missing device re-appears
Qu Wenruo [Tue, 2 Jun 2026 05:26:49 +0000 (14:56 +0930)] 
btrfs: print a message when a missing device re-appears

There is a bug report that fstrim crashed. The crash was eventually
pinned down to a missing device which re-appeared and broke callers
that only check BTRFS_DEV_STATE_MISSING, but neither
BTRFS_DEV_STATE_WRITEABLE nor device->bdev.

A missing device re-appearing can be very tricky, as for now it will
result in a device without WRITEABLE or MISSING flag, and still no bdev
pointer.

As the first step to enhance handling of such re-appearing missing
devices, print a dmesg message when a missing device re-appears.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: do not trim a device which is not writeable
Qu Wenruo [Tue, 2 Jun 2026 04:04:46 +0000 (13:34 +0930)] 
btrfs: do not trim a device which is not writeable

[BUG]
There is a bug report that btrfs/242 can randomly fail with the
following NULL pointer dereference:

  run fstests btrfs/242 at 2026-06-01 10:25:08
  BTRFS: device fsid d4d7f234-487c-4787-88e4-47a8b68c9874 devid 1 transid 9 /dev/sdc (8:32) scanned by mount (122609)
  BTRFS info (device sdc): first mount of filesystem d4d7f234-487c-4787-88e4-47a8b68c9874
  BTRFS info (device sdc): using crc32c checksum algorithm
  BTRFS warning (device sdc): devid 2 uuid fbe72d72-3272-482d-80fb-ab88ed398192 is missing
  BTRFS warning (device sdc): devid 2 uuid fbe72d72-3272-482d-80fb-ab88ed398192 is missing
  BTRFS info (device sdc): allowing degraded mounts
  BTRFS info (device sdc): turning on async discard
  BTRFS info (device sdc): enabling free space tree
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
  user pgtable: 4k pages, 48-bit VAs, pgdp=000000013fd6b000
  CPU: 4 UID: 0 PID: 122625 Comm: fstrim Not tainted 7.0.10-2-default #1 PREEMPT(full) openSUSE Tumbleweed e9a5f6b24978fba3bf015a992f865837fdfff3dd
  Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250812-19.fc42 08/12/2025
  pstate: 01400005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : btrfs_trim_fs+0x34c/0xa00 [btrfs]
  lr : btrfs_trim_fs+0x1f0/0xa00 [btrfs]
  Call trace:
   btrfs_trim_fs+0x34c/0xa00 [btrfs f02c1d570ceea621c69d302ba75dd61868083840] (P)
   btrfs_ioctl_fitrim+0xe8/0x178 [btrfs f02c1d570ceea621c69d302ba75dd61868083840]
   btrfs_ioctl+0xdd4/0x2bd8 [btrfs f02c1d570ceea621c69d302ba75dd61868083840]
   __arm64_sys_ioctl+0xac/0x108
   invoke_syscall.constprop.0+0x5c/0xd0
   el0_svc_common.constprop.0+0x40/0xf0
   do_el0_svc+0x24/0x40
   el0_svc+0x40/0x1d0
   el0t_64_sync_handler+0xa0/0xe8
   el0t_64_sync+0x1b0/0x1b8
  Code: 17ffff83 f94017e0 f9002be0 f9402ea0 (f9400c00)
  ---[ end trace 0000000000000000  ]---

The reporter also kindly tested the following ASSERT() added to
btrfs_trim_free_extents_throttle():

ASSERT(device->bdev,
       "devid=%llu path=%s dev_state=0x%lx\n",
       device->devid, btrfs_dev_name(device), device->dev_state);

And it shows the following output:

  assertion failed: device->bdev, in extent-tree.c:6630 (devid=2 path=/dev/sdd dev_state=0x82)

Which means the device->bdev is NULL, and the dev_state is
BTRFS_DEV_STATE_IN_FS_METADATA | BTRFS_DEV_STATE_ITEM_FOUND, without
BTRFS_DEV_STATE_WRITEABLE flag set.

[CAUSE]
The pc points to the following call chain:

  btrfs_trim_fs()
  |- btrfs_trim_free_extents()
     |- btrfs_trim_free_extents_throttle()
        |- bdev_max_discard_sectors(device->bdev)

So the NULL pointer dereference is caused by device->bdev being NULL.

This looks impossible at a quick glance, as just before calling
btrfs_trim_free_extents_throttle() we skip any device that has the
BTRFS_DEV_STATE_MISSING flag set.

However in this particular case, there is a window where the missing
device is later re-scanned, causing btrfs to remove the
BTRFS_DEV_STATE_MISSING flag:

  btrfs_control_ioctl()
  |- btrfs_scan_one_device()
     |- device_list_add()
        |- rcu_assign_pointer(device->name, name);
        |  This updates the missing device's path to the new good path.
        |
        |- clear_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)
           This removes the BTRFS_DEV_STATE_MISSING flag.

This allows the missing device to re-appear and clear the
BTRFS_DEV_STATE_MISSING flag.  However the device still does not have
the BTRFS_DEV_STATE_WRITEABLE flag set, nor is its bdev pointer updated.

The bdev pointer remains NULL, triggering the crash later.

[FIX]
This is a big de-synchronization between BTRFS_DEV_STATE_MISSING and
the device->bdev pointer, and shows a gap in btrfs's handling of
re-appearing devices.

Properly handling re-appearing devices will need quite some extra work,
which is out of scope for this small fix.

Thankfully the regular bbio submission path has already handled it well
by checking if the device->bdev is NULL before submitting.

So here we just fix the crash by checking if the device is writeable and
has a bdev pointer before calling bdev_max_discard_sectors().
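
The kind of check described above can be sketched in user space as
follows. Everything here is an assumption made for illustration: the
struct names, the helper names and the bit numbers are hypothetical
(the bit numbers are picked only so that IN_FS_METADATA | ITEM_FOUND
equals the 0x82 seen in the report), and this is not the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit numbers, chosen to match dev_state=0x82 from the report. */
enum {
	DEV_STATE_WRITEABLE = 0,
	DEV_STATE_IN_FS_METADATA = 1,
	DEV_STATE_MISSING = 2,
	DEV_STATE_ITEM_FOUND = 7,
};

/* Hypothetical stand-ins for the block device and the btrfs device. */
struct bdev_model {
	unsigned int max_discard_sectors;
};

struct device_model {
	unsigned long dev_state;
	struct bdev_model *bdev;
};

static bool dev_test(const struct device_model *dev, int bit)
{
	return (dev->dev_state & (1UL << bit)) != 0;
}

/*
 * Only look at the bdev once the device is known to be present,
 * writeable and to actually have a bdev pointer.
 */
static bool device_can_trim(const struct device_model *dev)
{
	if (dev_test(dev, DEV_STATE_MISSING))
		return false;
	if (!dev_test(dev, DEV_STATE_WRITEABLE))
		return false;
	if (!dev->bdev)
		return false;
	return dev->bdev->max_discard_sectors != 0;
}
```

With this model, a re-appeared device (dev_state 0x82, bdev NULL) passes
the MISSING test but is still rejected before its bdev is dereferenced.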

Reported-by: Su Yue <glass.su@suse.com>
Link: https://lore.kernel.org/linux-btrfs/wlwir19t.fsf@damenly.org/
Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: return real error after lookup failure in btrfs_ioctl_default_subvol()
Filipe Manana [Mon, 1 Jun 2026 09:45:14 +0000 (10:45 +0100)] 
btrfs: return real error after lookup failure in btrfs_ioctl_default_subvol()

If we fail to look up the dir item, we always return -ENOENT, but that
may not be the reason for the failure, as btrfs_lookup_dir_item() can
return many different errors, such as -EIO or -ENOMEM.
Fix this by returning the real error, and also fix up the silly error
message so that it includes the id of the directory and the error.
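
A minimal user-space sketch of the error propagation pattern, assuming a
lookup that returns NULL for "not found" and a kernel-style error
pointer for a real failure. The ERR_PTR()/PTR_ERR()/IS_ERR() helpers are
re-implemented here only for the example, and lookup_result_to_errno()
is a hypothetical name, not a function from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model of kernel-style error pointers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long err)
{
	return (void *)err;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Map a lookup result to an errno: keep the real error when there is
 * one, and use -ENOENT only when the item is genuinely absent.
 */
static int lookup_result_to_errno(const void *di)
{
	if (IS_ERR(di))
		return (int)PTR_ERR(di);
	if (!di)
		return -ENOENT;
	return 0;
}
```

The point of the pattern is that a failure such as -EIO reaches the
caller unchanged instead of being reported as a missing item.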

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: use mapping shared locking for reading super block
Filipe Manana [Sun, 31 May 2026 10:36:06 +0000 (11:36 +0100)] 
btrfs: use mapping shared locking for reading super block

There's no need to lock the mapping exclusively; shared locking is enough
to protect against a concurrent set block size operation (BLKBSZSET ioctl).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: use lockless read in nr_cached_objects shrinker callback
Ben Maurer [Fri, 29 May 2026 21:23:46 +0000 (14:23 -0700)] 
btrfs: use lockless read in nr_cached_objects shrinker callback

Under heavy memcg-driven slab reclaim with many memcgs and CPUs,
shrink_slab_memcg() invokes the per-superblock count callback once per
(memcg, NUMA node) tuple. For btrfs that callback reaches
percpu_counter_sum_positive() on fs_info->evictable_extent_maps, which
takes the percpu_counter's raw spinlock with IRQs disabled and walks
every online CPU. With hundreds of memcgs driving reclaim on a host with
dozens of CPUs, this counter lock becomes a global serialization point:
profiles show CPU pinned in the spin_lock_irqsave acquire under
__percpu_counter_sum, with cross-CPU IPIs hitting csd_lock_wait_toolong
while waiting for spinning vCPUs.

The shrinker count is advisory -- super_cache_count() already notes
"counts can change between super_cache_count and super_cache_scan, so we
really don't need locks here." Use percpu_counter_read_positive(), which
is lockless. Worst-case skew is bounded by batch * num_online_cpus (a
few thousand), negligible compared to the millions of extent maps a busy
filesystem accumulates and well within the noise that the shrinker
already tolerates.
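
The trade-off between the exact sum and the lockless read can be
illustrated with a single-threaded model of a batched per-CPU counter.
The struct, the function names, the CPU count and the batch size are all
assumptions made for this sketch; the real percpu_counter also takes a
lock in the summing path, which is not modelled here.

```c
#include <stdint.h>

#define NR_CPUS_MODEL	4	/* assumed CPU count for the model */
#define BATCH_MODEL	32	/* assumed per-CPU batch size */

/* Hypothetical model: one shared count plus per-CPU deltas. */
struct pcpu_counter_model {
	int64_t count;
	int32_t local[NR_CPUS_MODEL];
};

/* Fold the per-CPU delta into the shared count once it reaches the batch. */
static void counter_add(struct pcpu_counter_model *c, int cpu, int32_t amount)
{
	int32_t v = c->local[cpu] + amount;

	if (v >= BATCH_MODEL || v <= -BATCH_MODEL) {
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Cheap read: only the shared count, possibly stale. */
static int64_t counter_read_positive(const struct pcpu_counter_model *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Exact read: walks every CPU (the real one does this under a lock). */
static int64_t counter_sum_positive(const struct pcpu_counter_model *c)
{
	int64_t sum = c->count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		sum += c->local[cpu];
	return sum > 0 ? sum : 0;
}
```

In the model, the cheap read can lag the exact sum by less than
BATCH_MODEL on each CPU, which is the bounded skew the commit message
refers to.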

Tested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Ben Maurer <bmaurer@meta.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: switch local indicator variables to bools
David Sterba [Tue, 26 May 2026 11:33:21 +0000 (13:33 +0200)] 
btrfs: switch local indicator variables to bools

Switch all local indicator variables to bool. This is a simple
conversion done across all files.

Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: send: pass bool for pending_move and refs_processed parameters
David Sterba [Tue, 26 May 2026 11:29:49 +0000 (13:29 +0200)] 
btrfs: send: pass bool for pending_move and refs_processed parameters

We're passing simple indicators as int, switch them to bool types.

Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: use shifts for sectorsize and nodesize
David Sterba [Wed, 27 May 2026 11:16:52 +0000 (13:16 +0200)] 
btrfs: use shifts for sectorsize and nodesize

Convert more multiplications by sectorsize or nodesize to use shifts.
The remaining cases are multiplications by constants, which the
compiler can optimize by itself, and code in tests.
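
As a small illustration of the equivalence being relied on, assuming the
size is a power of two and its log2 is kept alongside it: the function
and parameter names below are hypothetical and not taken from the patch.

```c
#include <stdint.h>

/* log2 of a power-of-two size, e.g. 4096 -> 12. */
static uint32_t size_to_bits(uint32_t size)
{
	uint32_t bits = 0;

	while ((UINT32_C(1) << bits) < size)
		bits++;
	return bits;
}

/* nr * sectorsize, written as a shift by the precomputed bit count. */
static uint64_t sectors_to_bytes(uint64_t nr, uint32_t sectorsize_bits)
{
	return nr << sectorsize_bits;
}
```

The shift form needs no multiplication at run time when the size is only
known as a variable, which is the case the compiler cannot fold itself.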

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: fix deadlock cloning inline extent when using flushoncommit
Filipe Manana [Tue, 26 May 2026 13:44:30 +0000 (14:44 +0100)] 
btrfs: fix deadlock cloning inline extent when using flushoncommit

In commit b48c980b6a7e ("btrfs: fix deadlock between reflink and
transaction commit when using flushoncommit") a deadlock was fixed
between reflinks and transaction commits when the fs is mounted with the
flushoncommit option. This happened when we had to copy an inline extent's
data to the destination file. However the issue was fixed only for the
case where the destination offset is 0, it missed the case when the offset
is greater than zero.

Fix this by ensuring i_size is updated whenever we copy an inline
extent's data into the destination file.

Syzbot reported this with the following trace:

   INFO: task kworker/u8:3:57 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:kworker/u8:3    state:D stack:21600 pid:57    tgid:57    ppid:2      task_flags:0x4208160 flags:0x00080000
   Workqueue: writeback wb_workfn (flush-btrfs-129)
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wait_extent_bit fs/btrfs/extent-io-tree.c:905 [inline]
    btrfs_lock_extent_bits+0x59c/0x700 fs/btrfs/extent-io-tree.c:2008
    btrfs_lock_extent fs/btrfs/extent-io-tree.h:152 [inline]
    btrfs_invalidate_folio+0x440/0xc00 fs/btrfs/inode.c:7718
    extent_writepage fs/btrfs/extent_io.c:1848 [inline]
    extent_write_cache_pages fs/btrfs/extent_io.c:2552 [inline]
    btrfs_writepages+0x12f3/0x2410 fs/btrfs/extent_io.c:2684
    do_writepages+0x32e/0x550 mm/page-writeback.c:2571
    __writeback_single_inode+0x133/0x10e0 fs/fs-writeback.c:1764
    writeback_sb_inodes+0x97f/0x1980 fs/fs-writeback.c:2056
    wb_writeback+0x445/0xb00 fs/fs-writeback.c:2241
    wb_do_writeback fs/fs-writeback.c:2388 [inline]
    wb_workfn+0x3fd/0xf20 fs/fs-writeback.c:2428
    process_one_work+0x98b/0x1630 kernel/workqueue.c:3318
    process_scheduled_works kernel/workqueue.c:3401 [inline]
    worker_thread+0xb49/0x1140 kernel/workqueue.c:3482
    kthread+0x388/0x470 kernel/kthread.c:436
    ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
    </TASK>
   INFO: task syz.0.145:8523 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:syz.0.145       state:D stack:22752 pid:8523  tgid:8522  ppid:5850   task_flags:0x400140 flags:0x00080002
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wb_wait_for_completion+0x3e8/0x790 fs/fs-writeback.c:227
    __writeback_inodes_sb_nr+0x24c/0x2d0 fs/fs-writeback.c:2847
    try_to_writeback_inodes_sb+0x9a/0xc0 fs/fs-writeback.c:2895
    btrfs_start_delalloc_flush fs/btrfs/transaction.c:2182 [inline]
    btrfs_commit_transaction+0x813/0x2fc0 fs/btrfs/transaction.c:2371
    btrfs_sync_file+0xdf4/0x1230 fs/btrfs/file.c:1822
    generic_write_sync include/linux/fs.h:2663 [inline]
    btrfs_do_write_iter+0x6a9/0x840 fs/btrfs/file.c:1473
    new_sync_write fs/read_write.c:595 [inline]
    vfs_write+0x629/0xba0 fs/read_write.c:688
    ksys_write+0x156/0x270 fs/read_write.c:740
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f5a0bdece59
   RSP: 002b:00007f5a0b446028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
   RAX: ffffffffffffffda RBX: 00007f5a0c065fa0 RCX: 00007f5a0bdece59
   RDX: 000000000000029f RSI: 0000200000000200 RDI: 0000000000000004
   RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007f5a0c066038 R14: 00007f5a0c065fa0 R15: 00007ffe149206b8
    </TASK>
   INFO: task syz.0.145:8539 blocked for more than 143 seconds.
         Not tainted syzkaller #0
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   task:syz.0.145       state:D stack:23704 pid:8539  tgid:8522  ppid:5850   task_flags:0x400140 flags:0x00080002
   Call Trace:
    <TASK>
    context_switch kernel/sched/core.c:5402 [inline]
    __schedule+0x16f9/0x5500 kernel/sched/core.c:7204
    __schedule_loop kernel/sched/core.c:7283 [inline]
    schedule+0x164/0x360 kernel/sched/core.c:7298
    wait_current_trans+0x39f/0x590 fs/btrfs/transaction.c:536
    start_transaction+0xbd8/0x1820 fs/btrfs/transaction.c:716
    clone_copy_inline_extent fs/btrfs/reflink.c:299 [inline]
    btrfs_clone+0x1316/0x2540 fs/btrfs/reflink.c:574
    btrfs_clone_files+0x271/0x3f0 fs/btrfs/reflink.c:795
    btrfs_remap_file_range+0x76b/0x1320 fs/btrfs/reflink.c:948
    vfs_clone_file_range+0x435/0x7b0 fs/remap_range.c:403
    ioctl_file_clone fs/ioctl.c:239 [inline]
    ioctl_file_clone_range fs/ioctl.c:257 [inline]
    do_vfs_ioctl+0xe15/0x1540 fs/ioctl.c:544
    __do_sys_ioctl fs/ioctl.c:595 [inline]
    __se_sys_ioctl+0x82/0x170 fs/ioctl.c:583
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0x15f/0x560 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f5a0bdece59
   RSP: 002b:00007f5a0b425028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f5a0c066090 RCX: 00007f5a0bdece59
   RDX: 00002000000000c0 RSI: 000000004020940d RDI: 0000000000000004
   RBP: 00007f5a0be82d6f R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007f5a0c066128 R14: 00007f5a0c066090 R15: 00007ffe149206b8
    </TASK>

Reported-by: syzbot+c7443384724bb0f9e913@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a150a09.820a0220.e7972.0006.GAE@google.com/
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: allocate eb-attached btree pages as movable
Rik van Riel [Tue, 26 May 2026 22:37:39 +0000 (18:37 -0400)] 
btrfs: allocate eb-attached btree pages as movable

Extent buffer pages allocated by alloc_extent_buffer() are attached to
btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
migratable in practice once their owning extent buffer hits refs == 1,
which happens naturally. The buddy allocator classifies them by GFP,
however, and bare GFP_NOFS lands them in MIGRATE_UNMOVABLE pageblocks.

The result: every btree_inode page we read in pins an unmovable pageblock
from the page-superblock allocator's perspective, even though the page
itself can be moved.

Have each caller of btrfs_alloc_page_array, btrfs_alloc_folio_array,
and alloc_eb_folio_array pass in the full GFP mask directly, instead
of having the functions calculate it from boolean flags.

The alloc_extent_buffer call site passes GFP_NOFS | __GFP_NOFAIL |
__GFP_MOVABLE. All other call sites pass plain GFP_NOFS.

Three categories of caller stay on bare GFP_NOFS, deliberately:

  - alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
    resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
    the folios never enter LRU, never get migrate_folio aops. Tagging
    them __GFP_MOVABLE would violate the page allocator's migrability
    contract and they would defeat compaction in MOVABLE pageblocks
    where isolate_migratepages_block skips non-LRU non-movable_ops
    pages outright.

  - btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
    pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (io_uring
    encoded reads), fs/btrfs/relocation.c (relocation buffers): same
    contract violation. raid56 stripe_pages additionally persist in
    the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
    so they are not transient enough to hand-wave the contract.

  - btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
    folios): same -- stripe->folios[] are private buffers freed via
    folio_put in release_scrub_stripe.

This change targets the dominant fragmentation source observed on the
page-superblock series: ~28 GB of btree_inode pages parked across
many tainted superpageblocks on a 250 GB test system with btrfs root,
preventing 1 GiB hugepage allocation from those regions. With the
movable hint, those pages now land in MOVABLE pageblocks where the
existing background defragger drains them through the standard
PB_has_movable gate, no LRU-sample fallback needed.
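
The interface change (callers pass a full mask instead of the allocator
deriving one from booleans) can be modelled as below. The flag values,
the enum and the function names are all hypothetical and exist only for
this sketch; they are not the kernel GFP definitions or the btrfs
function signatures.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the real GFP flags. */
typedef unsigned int gfp_model_t;
#define MODEL_GFP_NOFS		0x1u
#define MODEL_GFP_NOFAIL	0x2u
#define MODEL_GFP_MOVABLE	0x4u

/* Before (modelled): the allocator derives the mask from a boolean. */
static gfp_model_t gfp_from_flag(bool nofail)
{
	return MODEL_GFP_NOFS | (nofail ? MODEL_GFP_NOFAIL : 0u);
}

/* Hypothetical labels for the call sites named in the commit message. */
enum caller_model {
	CALLER_ALLOC_EXTENT_BUFFER,
	CALLER_DUMMY_OR_CLONED_EB,
	CALLER_RAID56_STRIPE,
	CALLER_SCRUB_STRIPE,
};

/* After (modelled): each caller states the complete mask it wants. */
static gfp_model_t gfp_for_caller(enum caller_model caller)
{
	switch (caller) {
	case CALLER_ALLOC_EXTENT_BUFFER:
		/* Mapped btree pages: on the LRU and migratable. */
		return MODEL_GFP_NOFS | MODEL_GFP_NOFAIL | MODEL_GFP_MOVABLE;
	default:
		/* Private buffers that never reach the LRU stay unmovable. */
		return MODEL_GFP_NOFS;
	}
}
```

A boolean-driven interface has no way to express the movable hint for
one caller only, which is why the mask is handed in directly.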

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: add 32-bit compat ioctl for BTRFS_IOC_GET_SUBVOL_INFO
Daan De Meyer [Thu, 21 May 2026 07:51:13 +0000 (07:51 +0000)] 
btrfs: add 32-bit compat ioctl for BTRFS_IOC_GET_SUBVOL_INFO

On 64-bit kernels with 32-bit userspace, struct btrfs_ioctl_timespec is
laid out as 16 bytes (8B sec + 4B nsec + 4B trailing padding) instead of
the 12 bytes a 32-bit userspace expects, because the surrounding struct
is not packed. As a result, struct btrfs_ioctl_get_subvol_info_args has
a different size and layout in 32-bit userspace than in the 64-bit
kernel, and BTRFS_IOC_GET_SUBVOL_INFO returns garbage to 32-bit callers.

Mirror what was done for BTRFS_IOC_SET_RECEIVED_SUBVOL: add a packed
btrfs_ioctl_get_subvol_info_args_32 with btrfs_ioctl_timespec_32 fields,
define BTRFS_IOC_GET_SUBVOL_INFO_32 with that struct as the size
argument, factor the existing handler into a shared _btrfs_ioctl_get_
subvol_info() helper, and add btrfs_ioctl_get_subvol_info_32() which
fills the kernel struct and translates field-by-field into the 32-bit
struct before copy_to_user().
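
The layout mismatch and the field-by-field translation can be sketched
in user space. The struct and function names are hypothetical and much
smaller than the real ioctl structures, and the packed attribute is the
GCC/Clang spelling; this is an illustration of the technique, not the
patch itself.

```c
#include <stdint.h>

/* Natural layout: 8B sec + 4B nsec, padded where u64 is 8-byte aligned. */
struct timespec_native {
	uint64_t sec;
	uint32_t nsec;
};

/* Layout a 32-bit caller expects: 12 bytes, no trailing padding. */
struct timespec_compat32 {
	uint64_t sec;
	uint32_t nsec;
} __attribute__((packed));

/* Hypothetical, heavily reduced versions of the info structs. */
struct info_native {
	uint64_t treeid;
	struct timespec_native ctime;
	struct timespec_native otime;
};

struct info_compat32 {
	uint64_t treeid;
	struct timespec_compat32 ctime;
	struct timespec_compat32 otime;
} __attribute__((packed));

/* Translate field by field; a plain memcpy would copy the padding too. */
static void info_to_compat32(const struct info_native *src,
			     struct info_compat32 *dst)
{
	dst->treeid = src->treeid;
	dst->ctime.sec = src->ctime.sec;
	dst->ctime.nsec = src->ctime.nsec;
	dst->otime.sec = src->otime.sec;
	dst->otime.nsec = src->otime.nsec;
}
```

Because the two layouts differ in size, every field after the first
timespec sits at a different offset, so the copy has to be done per
field.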

Signed-off-by: Daan De Meyer <daan@amutable.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: derive f_fsid from on-disk fsid and dev_t
Anand Jain [Mon, 27 Apr 2026 10:18:04 +0000 (18:18 +0800)] 
btrfs: derive f_fsid from on-disk fsid and dev_t

The f_fsid was originally derived from fs_devices->fsid and the
subvolume root ID. However, when temp_fsid is active, fs_devices->fsid
is randomized, making the standard derivation inconsistent.

Since metadata_uuid is optional, it is not a reliable alternative.  This
patch instead retrieves the on-disk UUID from fs_info->super_copy->fsid.

To prevent f_fsid collisions between original and cloned filesystems,
this implementation hashes the dev_t for single-device btrfs filesystems
to ensure uniqueness. This is limited to single-device filesystems as
cloned mounts are currently only supported for that configuration. Note
that f_fsid will change if the device is replaced.

Additionally, since the kernel cannot distinguish between the original
and the cloned filesystem, this new f_fsid derivation is applied to
both.

Link: https://lore.kernel.org/linux-btrfs/cover.1772095546.git.asj@kernel.org/
Link: https://lore.kernel.org/linux-btrfs/cover.1774092915.git.asj@kernel.org/
Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: use on-disk uuid for s_uuid in temp_fsid mounts
Anand Jain [Mon, 27 Apr 2026 10:18:03 +0000 (18:18 +0800)] 
btrfs: use on-disk uuid for s_uuid in temp_fsid mounts

When mounting a cloned filesystem with a temporary fsuuid (temp_fsid),
layered modules like overlayfs require a persistent identifier.

While internal in-memory fs_devices->fsid must remain unique to
the kernel module, let s_uuid carry the original on-disk UUID.

Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: avoid unnecessary dev stats updates
Qu Wenruo [Tue, 7 Apr 2026 09:34:01 +0000 (19:04 +0930)] 
btrfs: avoid unnecessary dev stats updates

[MINOR PROBLEM]
When mounting a filesystem with a valid DEV_STATS item, we will always
update the DEV_STATS again in the next transaction commit, even if there
is no change to the values.

[CAUSE]
During the mount, btrfs_device_init_dev_stats() will read out the
on-disk DEV_STATS item for each device.
Then it calls btrfs_dev_stat_set() to update the in-memory structure.

However btrfs_dev_stat_set() not only sets the dev stats value, but
also increases device->dev_stats_ccnt.

That member determines if we should update the device item at the next
transaction commit. Since we have called btrfs_dev_stat_set() for each
dev stats member, dev_stats_ccnt will be non-zero and we will update
the dev stats item even if it has not changed at all.

[FIX]
Instead of using btrfs_dev_stat_set() for valid on-disk DEV_STATS
values, directly call atomic_set() to set the in-memory values.

For other call sites, we still want to use btrfs_dev_stat_set() so that
we will force updating/creating the dev stats item.
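
The effect of the change counter can be shown with a small model. The
struct and function names are hypothetical, plain ints stand in for the
atomics, and five values are assumed to match the five fields shown in
the dump-tree output elsewhere in this series.

```c
#define DEV_STAT_VALUES_MODEL 5	/* assumed number of stats per device */

/* Hypothetical model of the in-memory dev stats of one device. */
struct dev_stats_model {
	int values[DEV_STAT_VALUES_MODEL];
	int ccnt;	/* non-zero means the on-disk item needs an update */
};

/* Models a setter that also marks the stats as changed. */
static void dev_stat_set(struct dev_stats_model *d, int index, int val)
{
	d->values[index] = val;
	d->ccnt++;
}

/* Models loading valid on-disk values without marking anything changed. */
static void dev_stats_load_from_disk(struct dev_stats_model *d,
				     const int on_disk[DEV_STAT_VALUES_MODEL])
{
	int i;

	for (i = 0; i < DEV_STAT_VALUES_MODEL; i++)
		d->values[i] = on_disk[i];
}

/* Models the commit-time decision to rewrite the on-disk item. */
static int dev_stats_need_update(const struct dev_stats_model *d)
{
	return d->ccnt != 0;
}
```

Loading through the setter leaves the counter non-zero and forces a
rewrite at the next commit, while the direct load leaves it at zero.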

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: always update/create the dev stats item when adding a new device
Qu Wenruo [Tue, 7 Apr 2026 09:34:00 +0000 (19:04 +0930)] 
btrfs: always update/create the dev stats item when adding a new device

[MINOR PROBLEM]
When adding a new btrfs device, the creation of the corresponding
DEV_STATS item can only be triggered by a mount cycle if no other error
occurs:

  # mkfs.btrfs -f $dev1
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30588928 items 6 free space 15853 generation 9 owner DEV_TREE
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40 <<<
          persistent item objectid DEV_STATS offset 1
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (1 DEV_EXTENT 13631488) itemoff 16195 itemsize 48

Only after a mount cycle and a new transaction does the DEV_STATS item
for devid 2 show up:

  # umount $mnt
  # mount $dev1 $mnt
  # touch $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30605312 items 7 free space 15788 generation 10 owner DEV_TREE
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
          persistent item objectid DEV_STATS offset 1
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
          persistent item objectid DEV_STATS offset 2
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

[CAUSE]
Btrfs only updates the DEV_STATS item when the device->dev_stats_ccnt
counter is not 0.

This is to reduce COW for the device tree. However that dev_stats_ccnt is
only increased at the following call sites:

- btrfs_dev_stat_inc()
  This happens when an IO error occurs.

- btrfs_dev_stat_read_and_reset()
  This happens for GET_DEV_STATS ioctl with BTRFS_DEV_STATS_RESET flag.

- btrfs_dev_stat_set()
  This happens inside btrfs_device_init_dev_stats().

So when a new device is added, its dev_stats_ccnt is just initialized to
0, and btrfs will neither create nor update the corresponding DEV_STATS
item at all.

[ENHANCEMENT]
When a new device is added, also increase the dev_stats_ccnt by one.
This includes both device add ioctl and dev-replace.

This will force btrfs to create a new DEV_STATS item or update the
existing one with the correct values.

This not only makes the DEV_STATS creation happen early, but also
prevents an old DEV_STATS item left behind by older kernels from causing
false alerts for the newly added device.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: remove the dev stats item when removing a device
Qu Wenruo [Tue, 7 Apr 2026 09:33:59 +0000 (19:03 +0930)] 
btrfs: remove the dev stats item when removing a device

[MINOR BUG]
The following script will cause DEV_STATS item to be left after the
corresponding device is removed:

  # mkfs.btrfs -f $dev1
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # umount $mnt

  ## Without real errors, only at mount time btrfs will update
  ## dev->dev_stats_ccnt, thus we need a mount cycle to create the
  ## DEV_STATS item for the new device.

  # mount $dev1 $mnt
  # touch $mnt/foobar
  # sync
  # btrfs dev remove $dev2 $mnt
  # umount $mnt

This leaves the DEV_STATS item for devid 2 behind in the device tree:

  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 31064064 items 7 free space 15788 generation 18 owner DEV_TREE
  leaf 31064064 flags 0x1(WRITTEN) backref revision 1
  fs uuid 4bd853ed-f6ef-45fd-bbf1-1c3a2d9987cb
  chunk uuid b496eab1-ec23-46b5-81c1-2f1b3503ca07
         item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
          persistent item objectid DEV_STATS offset 1
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
         item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
          persistent item objectid DEV_STATS offset 2
          device stats
          write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

This is not a huge problem, but if the existing DEV_STATS contains
errors, and a new device is added into the fs taking the old devid, then
after a mount cycle, the new device will suddenly inherit old errors
which can give false alerts.

[CAUSE]
Btrfs has never had the ability to delete DEV_STATS items.

It either creates a new one through update_dev_stat_item(), or reads an
existing one through btrfs_device_init_dev_stats().

However update_dev_stat_item() is only called lazily: if a new device is
added and there is no new update to its dev stats, the update of the
on-disk item is skipped.

So if an old DEV_STATS item exists, a new device is added, and no errors
occur during the remaining operations, the old DEV_STATS item will not
be updated.

Then at the next mount, btrfs_device_init_dev_stats() reads out the old
records, causing false alerts for the newly added device.

[FIX]
Manually remove the DEV_STATS item during btrfs_rm_device().
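
The stale-item problem can be modelled in a few lines. The "device tree"
here is just an array indexed by devid, and all names are hypothetical;
the sketch only shows why an item that survives device removal is
inherited by whichever device later gets the same devid.

```c
#include <stdbool.h>

#define MAX_DEVID_MODEL 8	/* assumed upper bound for the model */

/* Hypothetical model of one DEV_STATS item, reduced to a single stat. */
struct stats_item_model {
	bool present;
	unsigned int write_errs;
};

/* Hypothetical model of the device tree, indexed by devid. */
struct dev_tree_model {
	struct stats_item_model items[MAX_DEVID_MODEL];
};

/* Models device removal; delete_item == true models the fixed behaviour. */
static void rm_device(struct dev_tree_model *tree, unsigned int devid,
		      bool delete_item)
{
	if (delete_item) {
		tree->items[devid].present = false;
		tree->items[devid].write_errs = 0;
	}
}

/* Models the mount-time read of the stats for a given devid. */
static unsigned int init_dev_stats(const struct dev_tree_model *tree,
				   unsigned int devid)
{
	return tree->items[devid].present ? tree->items[devid].write_errs : 0;
}
```

Without the deletion, a new device that takes over the old devid starts
with the previous device's error count after the next mount.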

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days ago btrfs: remove the dev stats item for replace target device
Qu Wenruo [Tue, 7 Apr 2026 09:33:58 +0000 (19:03 +0930)] 
btrfs: remove the dev stats item for replace target device

[MINOR PROBLEM]
When a running dev-replace hits some error for the target device (devid
0), there will be a DEV_STATS with error records created at the next
transaction commit.

Unfortunately that item will never be deleted.

This means at the next dev-replace, if the replace is interrupted, then
at the next mount, the target device will suddenly inherit the old error
records from that DEV_STATS item, which can give some false alerts on
that device.

This shouldn't affect end users that much, as it requires all the
following conditions to be met, which is pretty rare:

- The initial dev-replace hits some error on the target device
  E.g. write errors, but such errors are themselves already a big
  problem for a running replace.

  This is required to create the DEV_STATS item in the first place.

- The next replace is interrupted
  This is required to allow btrfs to read from the old records.

[CAUSE]
Btrfs just never deletes the DEV_STATS after a replace is finished.

[FIX]
Remove the DEV_STATS item for devid 0 after the replace is finished.

This is not going to completely fix the problem, as there are still
other error paths, e.g. the fs somehow flips RO and can not start a new
transaction for the DEV_STATS item removal.

But those corner cases will be addressed by later patches which provide
a more generic fix to DEV_STATS related problems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: validate data reloc tree file extent item members
Teng Liu [Wed, 13 May 2026 11:35:44 +0000 (13:35 +0200)] 
btrfs: validate data reloc tree file extent item members

get_new_location() uses BUG_ON() to crash the kernel if the file extent
item it looks up has any of offset, compression, encryption, or
other_encoding set non-zero. The data reloc inode is only written by
relocation's own paths and the four fields are always 0 in what the
kernel writes:

  - insert_prealloc_file_extent() memsets the stack item to zero and
    only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so
    offset/compression/encryption/other_encoding stay 0.
  - insert_ordered_extent_file_extent() copies oe->compress_type into
    the file extent's compression field, but the data reloc inode is
    created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0;
    encryption and other_encoding are reserved-and-zero in btrfs.

A non-zero value here means the leaf decoded from disk does not match
what the kernel wrote, i.e. on-disk corruption. A malformed image
reaches this code via balance and panics the kernel.

A previous attempt to enforce all four constraints in tree-checker's
check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs:
validate data reloc tree file extent item members in tree-checker")
and then reverted by commit 1c034697fcaa after btrfs/061 produced
false positives on arm64 with 64K pages. The reason: relocation
writeback legitimately produces REG file_extent_items with offset != 0
in the data reloc tree. When an ordered extent covers only the back
portion of an underlying PREALLOC (num_bytes < ram_bytes on the input
file_extent), insert_ordered_extent_file_extent() inserts a REG with

  offset    = oe->offset
  num_bytes = oe->num_bytes
  ram_bytes preserved from the original PREALLOC,

and this item can reach disk if a transaction commit fires while it
is present in the leaf.

The four fields belong in different layers:

  - compression, encryption and other_encoding are universal
    invariants for every item in the data reloc tree, regardless of
    cluster geometry. Enforce them in tree-checker's
    check_extent_data_item() so a corrupt leaf is rejected at read
    time.

  - offset is only an invariant at the cluster-boundary keys that
    get_new_location() searches (the key is computed as
    src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC
    writebacks legitimately place REG items at non-boundary keys with
    offset != 0; tree-checker cannot reject these. The cluster-
    boundary item is always written by either
    insert_prealloc_file_extent() (offset=0 by memset) or by the
    front portion of a partial writeback (offset=0 by construction),
    so a non-zero offset there is corruption.

Enforce the universal invariants in check_extent_data_item() with a
file_extent_err() rejection. Convert the BUG_ON() in
get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf()
and btrfs_err() so the offending leaf is logged. The caller in
replace_file_extents() already handles non-zero returns from
get_new_location() by breaking out of the loop without aborting the
transaction.
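
The split between the two layers can be sketched as follows. This is a minimal userspace illustration, assuming a simplified item layout; the toy_* names and TOY_EUCLEAN are hypothetical stand-ins, and the real code additionally logs the leaf through btrfs_print_leaf() and btrfs_err().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy error code standing in for the kernel's -EUCLEAN. */
#define TOY_EUCLEAN 117

/* Simplified file extent item: only the four fields discussed above. */
struct toy_file_extent {
	uint64_t offset;
	uint8_t  compression;
	uint8_t  encryption;
	uint16_t other_encoding;
};

/* Read-time check: these three must be zero for every data reloc tree item. */
static int toy_check_reloc_extent(const struct toy_file_extent *fi)
{
	assert(fi != NULL);
	if (fi->compression || fi->encryption || fi->other_encoding)
		return -TOY_EUCLEAN;
	return 0;
}

/* Lookup-time check: only the item at a cluster-boundary key needs offset == 0. */
static int toy_get_new_location(const struct toy_file_extent *fi)
{
	assert(fi != NULL);
	if (fi->offset != 0)
		return -TOY_EUCLEAN;  /* report corruption instead of crashing */
	return 0;
}
```

An item with a non-zero offset passes the read-time check, since partial writebacks produce such items legitimately, but is rejected if it turns up at the key that the lookup searches.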

Suggested-by: Qu Wenruo <wqu@suse.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: annotate lockless read of defrag_bytes in should_nocow()
Cen Zhang [Wed, 1 Apr 2026 02:21:53 +0000 (10:21 +0800)] 
btrfs: annotate lockless read of defrag_bytes in should_nocow()

should_nocow() reads inode->defrag_bytes without holding inode->lock,
while btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent()
update it under that spinlock.

This is a data race.  The read is a quick check used to decide whether
to fall back to COW for a NOCOW inode: if defrag_bytes is non-zero and
the range is tagged EXTENT_DEFRAG, we force COW so that defragmentation
can rewrite the extent.  Reading a stale value is harmless because:

  - A missed increment may skip COW once, but the defrag pass will
    redo the extent later.
  - A stale non-zero may force an unnecessary COW, which is a minor
    efficiency loss, not a correctness issue.

On 64-bit platforms an aligned u64 load is naturally atomic so tearing
cannot happen.  On 32-bit platforms u64 may tear, but we only test for
zero vs non-zero, so the heuristic stays correct regardless.  Use
data_race() annotation.
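
The shape of the check can be sketched in userspace. Note the substitution: data_race() is a kernel annotation, not an atomic access, so this sketch swaps in a C11 relaxed atomic load as the nearest portable stand-in. The toy_* names are hypothetical and the spinlock taken by writers is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_inode {
	_Atomic uint64_t defrag_bytes;  /* writers update this under a lock (not modeled) */
};

/* Decide whether a NOCOW write should fall back to COW for defragmentation. */
static bool toy_should_force_cow(struct toy_inode *inode, bool range_tagged_defrag)
{
	assert(inode);
	/* Relaxed load: no ordering is required, only "zero or not" is consumed. */
	uint64_t snapshot = atomic_load_explicit(&inode->defrag_bytes,
						 memory_order_relaxed);
	return snapshot != 0 && range_tagged_defrag;
}
```

Only the zero versus non-zero outcome of the snapshot feeds the decision, which is why a stale value costs at most one skipped or one unnecessary COW.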

Fixes: 47059d930f0e ("Btrfs: make defragment work with nodatacow option")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
[ Use data_race() instead of READ_ONCE() ]
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: send: switch struct fs_path to auto freeing
David Sterba [Sun, 24 May 2026 10:56:49 +0000 (12:56 +0200)] 
btrfs: send: switch struct fs_path to auto freeing

The fs_path can use the auto freeing pattern and it's completely
contained in send. Define the freeing wrapper and add the cleanup
attributes.

Almost all conversions are straightforward, replacing goto with direct
return.
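
For readers unfamiliar with the auto freeing pattern, here is a minimal userspace sketch built on the GCC/Clang cleanup attribute. It does not reproduce the kernel's helpers or the real struct fs_path; toy_path, TOY_AUTO_PATH and the free counter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_path {
	char *buf;
};

static int toy_free_calls;  /* counts frees so the behaviour can be observed */

static struct toy_path *toy_path_alloc(const char *s)
{
	assert(s);
	struct toy_path *p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	size_t len = strlen(s) + 1;
	p->buf = malloc(len);
	if (!p->buf) {
		free(p);
		return NULL;
	}
	memcpy(p->buf, s, len);
	return p;
}

static void toy_path_free(struct toy_path *p)
{
	if (!p)
		return;
	free(p->buf);
	free(p);
	toy_free_calls++;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void toy_path_cleanup(struct toy_path **pp)
{
	toy_path_free(*pp);
}

#define TOY_AUTO_PATH __attribute__((cleanup(toy_path_cleanup)))

/* Every return path frees p automatically; no "goto out" label is needed. */
static int toy_path_len(const char *s)
{
	struct toy_path *p TOY_AUTO_PATH = toy_path_alloc(s);

	if (!p)
		return -1;
	if (p->buf[0] == '\0')
		return 0;  /* early return, still freed */
	return (int)strlen(p->buf);
}
```

Each call to toy_path_len() that allocates successfully bumps toy_free_calls exactly once, whichever return statement is taken.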

Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: add message format for qgroupid
David Sterba [Sat, 23 May 2026 16:33:41 +0000 (18:33 +0200)] 
btrfs: add message format for qgroupid

The qgroupid has a specific format, add common format specifier, similar
to what we have for checksums and keys.
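
As background, a qgroupid is conventionally shown as level/id, with the level held in the top 16 bits and the id in the low 48 bits. The sketch below only illustrates that split in userspace; the toy_* helpers are hypothetical and are not the format specifier this patch adds.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_QGROUP_LEVEL_SHIFT 48

/* Level lives in the top 16 bits of the 64-bit qgroupid. */
static uint16_t toy_qgroup_level(uint64_t qgroupid)
{
	return (uint16_t)(qgroupid >> TOY_QGROUP_LEVEL_SHIFT);
}

/* The id part is the low 48 bits. */
static uint64_t toy_qgroup_id(uint64_t qgroupid)
{
	return qgroupid & ((UINT64_C(1) << TOY_QGROUP_LEVEL_SHIFT) - 1);
}

/* Format as "level/id"; returns what snprintf() returns. */
static int toy_format_qgroupid(char *buf, size_t size, uint64_t qgroupid)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "%u/%" PRIu64,
			(unsigned)toy_qgroup_level(qgroupid),
			toy_qgroup_id(qgroupid));
}
```

Keeping the split in one place is the point of a shared format: every message prints the same level/id form.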

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: zoned: always set max_active_zones for zoned devices
Johannes Thumshirn [Fri, 22 May 2026 09:22:12 +0000 (11:22 +0200)] 
btrfs: zoned: always set max_active_zones for zoned devices

When a block device does not report a maximum number of open or active
zones, we currently assign BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) to
the internal limit, if the device has more than
BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones.

But if the device has fewer than BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones,
the internal max_active_zones limit will stay at 0, even if the device
has zone resource limits. Furthermore, if the device has a total number
of zones that is less than BTRFS_DEFAULT_MAX_ACTIVE_ZONES,
max_active_zones should be set to at most the number of zones.

Also move the max_active_zones calculation and setting into a dedicated
helper, to shrink btrfs_get_dev_zone_info().
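
One possible reading of the intended rule, sketched in userspace: fall back to the default when the device reports no limit, then never exceed the total number of zones. This is an assumption drawn from the description above, not the helper from the patch; toy_max_active_zones() and its parameters are hypothetical.

```c
#include <assert.h>

#define TOY_DEFAULT_MAX_ACTIVE_ZONES 128u

/*
 * reported: limit advertised by the device, 0 meaning "no limit reported".
 * nr_zones: total number of zones on the device.
 */
static unsigned toy_max_active_zones(unsigned reported, unsigned nr_zones)
{
	assert(nr_zones > 0);
	unsigned limit = reported ? reported : TOY_DEFAULT_MAX_ACTIVE_ZONES;

	/* Never allow more active zones than the device has zones. */
	return limit < nr_zones ? limit : nr_zones;
}
```

Under this reading the result is never 0 for a device with at least one zone, which is the property the commit title asks for.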

Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: use bvec_phys() in compressed_bio_last_folio()
Matthew Wilcox (Oracle) [Fri, 22 May 2026 18:14:09 +0000 (19:14 +0100)] 
btrfs: use bvec_phys() in compressed_bio_last_folio()

This is open-coded bvec_phys(), also remove direct use of bv_page.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: replace __free_page with folio_put() in attach_eb_folio_to_filemap()
Matthew Wilcox (Oracle) [Fri, 22 May 2026 18:14:08 +0000 (19:14 +0100)] 
btrfs: replace __free_page with folio_put() in attach_eb_folio_to_filemap()

Calling __free_page() on folio_page() happens to work today, but
won't always.  Besides, it's far simpler to call folio_put().

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agoRevert "btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()"
Matthew Wilcox (Oracle) [Fri, 22 May 2026 18:14:07 +0000 (19:14 +0100)] 
Revert "btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()"

It seems that af566bdaff54 was tested against a tree which did not
contain commit 12851bd921d4 ("fs: Turn page_offset() into a wrapper
around folio_pos()").  Unfortunately it has a bug of its own; on 32-bit
systems, shifting by PAGE_SHIFT will overflow on files larger than 4GiB.
Since page_offset() is now fixed, just revert af566bdaff54.
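
The 32-bit overflow is easy to reproduce in isolation. The sketch below assumes 4 KiB pages and uses uint32_t to stand in for a 32-bit unsigned long page index; the toy_* names are hypothetical and this is not the reverted code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* 4 KiB pages, assumed for illustration */

static_assert(TOY_PAGE_SHIFT < 32, "shift must be narrower than the 32-bit index");

/* Shift done in 32 bits, as when the page index is a 32-bit unsigned long. */
static uint64_t toy_offset_narrow(uint32_t index)
{
	uint32_t shifted = index << TOY_PAGE_SHIFT;  /* high bits are discarded here */

	return shifted;
}

/* Widen first, then shift: the full 64-bit byte offset survives. */
static uint64_t toy_offset_wide(uint32_t index)
{
	return (uint64_t)index << TOY_PAGE_SHIFT;
}
```

For page index 0x100000, the first page at the 4GiB mark, the narrow variant wraps to 0 while the widened variant yields 0x100000000.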

Fixes: af566bdaff54 ("btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: zoned: fix deadlock waiting for ticket during data relocation
Johannes Thumshirn [Fri, 22 May 2026 09:02:47 +0000 (11:02 +0200)] 
btrfs: zoned: fix deadlock waiting for ticket during data relocation

When performing data relocation on a zoned filesystem, BTRFS can deadlock
in handle_reserve_tickets(). The relocation process is waiting on a space
reservation ticket that can never be fulfilled, because the relocation
itself is the operation responsible for freeing up that space.

Fix this by introducing a new flush state,
BTRFS_RESERVE_FLUSH_ZONED_RELOCATION, specifically for data chunk
allocation during zoned relocation. Like
BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE, this state uses
priority_reclaim_data_space() instead of the normal flushing path, which
avoids re-entering the relocation code and breaks the deadlock cycle.

In btrfs_alloc_data_chunk_ondemand(), select this new flush state when the
inode belongs to a data relocation root on a zoned filesystem.
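
The selection logic can be pictured with a small userspace sketch. The enum values and struct fields are hypothetical stand-ins for the kernel's flush states and inode checks, and the order of the two tests is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_flush_state {
	TOY_FLUSH_DATA,              /* normal flushing path */
	TOY_FLUSH_FREE_SPACE_INODE,  /* priority reclaim */
	TOY_FLUSH_ZONED_RELOCATION,  /* priority reclaim, the new state */
};

struct toy_inode_info {
	bool is_free_space_inode;
	bool in_data_reloc_root;
	bool fs_is_zoned;
};

/* Pick the flush state for a data chunk allocation on behalf of this inode. */
static enum toy_flush_state toy_pick_flush_state(const struct toy_inode_info *ii)
{
	assert(ii);
	if (ii->is_free_space_inode)
		return TOY_FLUSH_FREE_SPACE_INODE;
	if (ii->fs_is_zoned && ii->in_data_reloc_root)
		return TOY_FLUSH_ZONED_RELOCATION;
	return TOY_FLUSH_DATA;
}

/* States that bypass the normal flushing path, and so cannot wait on relocation. */
static bool toy_uses_priority_reclaim(enum toy_flush_state s)
{
	return s == TOY_FLUSH_FREE_SPACE_INODE || s == TOY_FLUSH_ZONED_RELOCATION;
}
```

A data relocation inode on a non-zoned filesystem keeps the normal state in this sketch; only the zoned case takes the new path.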

Fixes: e2a7fd22378f ("btrfs: zoned: add zone reclaim flush state for DATA space_info")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7 days agobtrfs: zoned: don't account data relocation space-info in statfs free space
Johannes Thumshirn [Fri, 22 May 2026 09:02:46 +0000 (11:02 +0200)] 
btrfs: zoned: don't account data relocation space-info in statfs free space

Don't account the free space in a data relocation space-info sub-group as
usable free space in statfs.

This is misleading as no user allocations can be made in this space-info
sub-group. It is only a target for relocation.
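
The accounting change amounts to skipping one kind of space-info when summing free space. A minimal sketch, assuming a flat array of space-infos with a flag marking the relocation sub-group; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_space_info {
	uint64_t total_bytes;
	uint64_t used_bytes;
	bool     is_data_reloc_subgroup;  /* only relocation may allocate here */
};

/* Sum the free bytes that user allocations could actually use. */
static uint64_t toy_statfs_free_bytes(const struct toy_space_info *infos, size_t n)
{
	uint64_t free_bytes = 0;

	assert(infos != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (infos[i].is_data_reloc_subgroup)
			continue;  /* relocation target, not usable by users */
		free_bytes += infos[i].total_bytes - infos[i].used_bytes;
	}
	return free_bytes;
}
```

With one regular space-info holding 600 free bytes and an empty 256-byte relocation sub-group, the reported figure is 600 rather than 856.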

Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>