Johan Alvarado [Mon, 6 Apr 2026 07:44:25 +0000 (07:44 +0000)]
net: stmmac: dwmac-motorcomm: fix eFUSE MAC address read failure
This patch fixes an issue where reading the MAC address from the eFUSE
fails due to a race condition.
The root cause was identified by comparing the driver's behavior with a
custom U-Boot port. In U-Boot, the MAC address was read successfully
every time because the driver was loaded later in the boot process, giving
the hardware ample time to initialize. In Linux, reading the eFUSE
immediately returns all zeros, resulting in a fallback to a random MAC address.
Hardware cold-boot testing revealed that the eFUSE controller requires a
short settling time to load its internal data. Adding a 2000-5000us
delay after the reset ensures the hardware is fully ready, allowing the
native MAC address to be read consistently.
Fixes: 02ff155ea281 ("net: stmmac: Add glue driver for Motorcomm YT6801 ethernet controller") Reported-by: Georg Gottleuber <ggo@tuxedocomputers.com> Closes: https://lore.kernel.org/24cfefff-1233-4745-8c47-812b502d5d19@tuxedocomputers.com Signed-off-by: Johan Alvarado <contact@c127.dev> Reviewed-by: Yao Zi <me@ziyao.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/fc5992a4-9532-49c3-8ec1-c2f8c5b84ca1@smtp-relay.sendinblue.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Allow referenced dynptr to be overwritten when siblings exists
The patchset conditionally allow a referenced dynptr to be overwritten
when its siblings (original dynptr or dynptr clone) exist. Do it before
the verifier relation tracking refactor to mimimize verifier changes at
a time.
====================
Test overwriting referenced dynptr and clones to make sure it is only
allow when there is at least one other dynptr with the same ref_obj_id.
Also make sure slice is still invalidated after the dynptr's stack slot
is destroyed.
bpf: Allow overwriting referenced dynptr when refcnt > 1
The verifier currently does not allow overwriting a referenced dynptr's
stack slot to prevent resource leak. This is because referenced dynptr
holds additional resources that requires calling specific helpers to
release. This limitation can be relaxed when there are multiple copies
of the same dynptr. Whether it is the orignial dynptr or one of its
clones, as long as there exists at least one other dynptr with the same
ref_obj_id (to be used to release the reference), its stack slot should
be allowed to be overwritten.
Daniel Borkmann [Tue, 7 Apr 2026 19:24:19 +0000 (21:24 +0200)]
bpf: Clear delta when clearing reg id for non-{add,sub} ops
When a non-{add,sub} alu op such as xor is performed on a scalar
register that previously had a BPF_ADD_CONST delta, the else path
in adjust_reg_min_max_vals() only clears dst_reg->id but leaves
dst_reg->delta unchanged.
This stale delta can propagate via assign_scalar_id_before_mov()
when the register is later used in a mov. It gets a fresh id but
keeps the stale delta from the old (now-cleared) BPF_ADD_CONST.
This stale delta can later propagate leading to a verifier-vs-
runtime value mismatch.
The clear_id label already correctly clears both delta and id.
Make the else path consistent by also zeroing the delta when id
is cleared. More generally, this introduces a helper clear_scalar_id()
which internally takes care of zeroing. There are various other
locations in the verifier where only the id is cleared. By using
the helper we catch all current and future locations.
Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Tue, 7 Apr 2026 19:24:18 +0000 (21:24 +0200)]
bpf: Fix linked reg delta tracking when src_reg == dst_reg
Consider the case of rX += rX where src_reg and dst_reg are pointers to
the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first
modifies the dst_reg in-place, and later in the delta tracking, the
subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the
post-{add,sub} value instead of the original source.
This is problematic since it sets an incorrect delta, which sync_linked_regs()
then propagates to linked registers, thus creating a verifier-vs-runtime
mismatch. Fix it by just skipping this corner case.
Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
John Pavlick [Mon, 6 Apr 2026 13:23:33 +0000 (13:23 +0000)]
net: sfp: add quirks for Hisense and HSGQ GPON ONT SFP modules
Several GPON ONT SFP sticks based on Realtek RTL960x report
1000BASE-LX at 1300MBd in their EEPROM but can operate at 2500base-X.
On hosts capable of 2500base-X (e.g. Banana Pi R3 / MT7986), the
kernel negotiates only 1G because it trusts the incorrect EEPROM data.
Each quirk advertises 2500base-X and ignores TX_FAULT during the
module's ~40s Linux boot time.
Tested on Banana Pi R3 (MT7986) with OpenWrt 25.12.1, confirmed
2.5Gbps link and full throughput with flow offloading.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Suggested-by: Marcin Nita <marcin.nita@leolabs.pl> Signed-off-by: John Pavlick <jspavlick@posteo.net> Link: https://patch.msgid.link/20260406132321.72563-1-jspavlick@posteo.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, ath12k AHB (in IPQ5332) uses SCM calls to authenticate the
firmware image to bring up userpd. From IPQ5424 onwards, Q6 firmware can
directly communicate with the Trusted Management Engine - Lite (TME-L),
eliminating the need for SCM calls for userpd bring-up.
Hence, to enable IPQ5424 device support, use qcom_mdt_load_no_init() and
skip the SCM call as Q6 will directly authenticate the userpd firmware.
wifi: ath12k: add ath12k_hw_version_map entry for IPQ5424
Add a new ath12k_hw_version_map entry for the AHB based WiFi 7 device
IPQ5424.
Reuse most of the ath12k_hw_version_map fields such as hal_ops,
hal_desc_sz, tcl_to_wbm_rbm_map, and hal_params from IPQ5332. The
register addresses differ on IPQ5424, hence set hw_regs temporarily
to NULL and populated it in a subsequent patch.
Add ath12k_hw_params for the ath12k AHB-based WiFi 7 device IPQ5424.
The WiFi device IPQ5424 is similar to IPQ5332. Most of the hardware
parameters like hw_ops, wmi_init, ring_mask, etc., are the same between
IPQ5424 and IPQ5332, hence use these same parameters for IPQ5424.
Some parameters are specific to IPQ5424; initially set these to
0 or NULL, and populate them in subsequent patches.
Baochen Qiang [Wed, 25 Mar 2026 03:05:01 +0000 (11:05 +0800)]
wifi: ath10k: fix station lookup failure during disconnect
Recent commit [1] moved station statistics collection to an earlier stage
of the disconnect flow. With this change in place, ath10k fails to resolve
the station entry when handling a peer stats event triggered during
disconnect, resulting in log messages such as:
wlp58s0: deauthenticating from 74:1a:e0:e7:b4:c8 by local choice (Reason: 3=DEAUTH_LEAVING)
ath10k_pci 0000:3a:00.0: not found station for peer stats
ath10k_pci 0000:3a:00.0: failed to parse stats info tlv: -22
The failure occurs because ath10k relies on ieee80211_find_sta_by_ifaddr()
for station lookup. That function uses local->sta_hash, but by the time
the peer stats request is triggered during disconnect, mac80211 has
already removed the station from that hash table, leading to lookup
failure.
Before commit [1], this issue was not visible because the transition from
IEEE80211_STA_NONE to IEEE80211_STA_NOTEXIST prevented ath10k from sending
a peer stats request at all: ath10k_mac_sta_get_peer_stats_info() would
fail early to find the peer and skip requesting statistics.
Fix this by switching the lookup path to ath10k_peer_find(), which queries
ath10k's internal peer table. At the point where the firmware emits the
peer stats event, the peer entry is still present in the driver's list,
ensuring lookup succeeds.
wifi: ath12k: Create symlink for each radio in a wiphy
In single-wiphy design, when more than one radio is registered as a
single-wiphy in the mac80211 layer, the following warnings are seen:
1. debugfs: File 'ath12k' in directory 'phy0' already present!
2. debugfs: File 'simulate_fw_crash' in directory 'pci-0000:57:00.0' already present!
debugfs: File 'device_dp_stats' in directory 'pci-01777777777777777777777:57:00.0' already present!
When more than one radio is registered as a single-wiphy, symlinks for
all the radios are created in the same debugfs directory:
/sys/kernel/debug/ieee80211/phyX/ath12k, resulting in warning 1. When a
symlink is created for the first radio, since the 'ath12k' directory is
not present, it will be created and no warning will be thrown. But when
symlink is created for more than one radio, since the 'ath12k'
directory was already created for symlink for radio 1, a warning is
thrown complaining that 'ath12k' directory is already present. To resolve
warning 1, create symlink for each radio in separate debugfs directories.
For the first radio, the symlink will always be the 'ath12k' directory.
This ensures that the existing directory structure is retained for
single-wiphy and multi-wiphy architectures. In single-wiphy architecture
with multiple radios, create symlink in separate debugfs directories
introduced by mac80211.
Existing debugfs directory in single-wiphy architecture:
/sys/kernel/debug/ieee80211/phyX/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/macY
Proposed debugfs directory in single-wiphy architecture with one radio:
/sys/kernel/debug/ieee80211/phyX/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/mac0
Proposed debugfs directory in single-wiphy architecture with more than
one radio:
/sys/kernel/debug/ieee80211/phyX/radio0/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/mac0 and
/sys/kernel/debug/ieee80211/phyX/radioY/ath12k is a symlink to
/sys/kernel/debug/ath12k/pci-0001:01:00.0/macY
Where X is phy index and Y is radio index, seen in
'iw phyX info | grep Idx'. Two symlinks for the first radio are to ensure
compatibility with the existing design. Add radio_idx inside ar, to track
the radio index in probing order.
API ath12k_debugfs_pdev_create() that creates SoC entries is called more
than once when hardware group starts up, resulting in warning 2. To
resolve this warning, remove all other calls to this API and add one
inside the ath12k_core_pdev_create(). This API carries all pdev-specific
initializations and can conveniently hold a call to
ath12k_debugfs_pdev_create().
Avula Sri Charan [Mon, 30 Mar 2026 04:07:32 +0000 (09:37 +0530)]
wifi: ath12k: Skip adding inactive partner vdev info
Currently, a vdev that is created is considered active for partner link
population. In case of an MLD station, non-associated link vdevs can be
created but not started. Yet, they are added as partner links. This leads
to the creation of stale FW partner entries which accumulate and cause
assertions.
To resolve this issue, check if the vdev is started and operating on a
chosen frequency, i.e., arvif->is_started, instead of checking if the vdev
is created, i.e., arvif->is_created. This determines if the vdev is active
or not and skips adding it as a partner link if it's inactive.
Add support to request channel change stats from the firmware through
HTT stats type 76. These stats give channel switch details like the
channel that the radio changed to, its center frequency, time taken
for the switch, chainmask details, etc.
wifi: ath12k: Rename hw_link_id to radio_idx in ath12k_ah_to_ar()
ath12k_ah_to_ar() is returning radio from the given hardware based on the
radio index passed. But, the variable that radio index is received at is
wrongly named 'hw_link_id', which points to the hardware link index that
comes from the firmware. This affects readability.
Resolve this by renaming 'hw_link_id' to 'radio_idx'.
====================
tracing: Fix kprobe attachment when module shadows vmlinux symbol
When a kernel module exports a symbol with the same name as an existing
vmlinux symbol, kprobe attachment fails with -EADDRNOTAVAIL because
number_of_same_symbols() counts matches across both vmlinux and all
loaded modules, returning a count greater than 1.
This series takes a different approach from v1-v4, which implemented a
libbpf-side fallback parsing /proc/kallsyms and retrying with the
absolute address. That approach was rejected (Andrii Nakryiko, Ihor
Solodrai) because ambiguous symbol resolution does not belong in libbpf.
Following Ihor's suggestion, this series fixes the root cause in the
kernel: when an unqualified symbol name is given and the symbol is found
in vmlinux, prefer the vmlinux symbol and do not scan loaded modules.
This makes the skeleton auto-attach path work transparently with no
libbpf changes needed.
Patch 1: Kernel fix - return vmlinux-only count from
number_of_same_symbols() when the symbol is found in vmlinux,
preventing module shadows from causing -EADDRNOTAVAIL.
Patch 2: Selftests using bpf_fentry_shadow_test which exists in both
vmlinux and bpf_testmod - tests unqualified (vmlinux) and
MOD:SYM (module) attachment across all four attach modes, plus
kprobe_multi with the duplicate symbol.
Changes since v6 [1]:
- Fix comment style: use /* on its own line instead of networking-style
/* text on opener line (Alexei Starovoitov).
selftests/bpf: Add tests for kprobe attachment with duplicate symbols
bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
bpf_testmod (bpf_testmod.c), creating a duplicate symbol condition when
bpf_testmod is loaded. Add subtests that verify kprobe behavior with
this duplicate symbol:
In attach_probe:
- dup-sym-{default,legacy,perf,link}: unqualified attach succeeds
across all four modes, preferring vmlinux over module shadow.
- MOD:SYM qualification attaches to the module version.
In kprobe_multi_test:
- dup_sym: kprobe_multi attach with kprobe and kretprobe succeeds.
bpf_fentry_shadow_test is not invoked via test_run, so tests verify
attach and detach succeed without triggering the probe.
bpf: Prefer vmlinux symbols over module symbols for unqualified kprobes
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads") Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: add test for nullable PTR_TO_BUF access
Add iter_buf_null_fail with two tests and a test runner:
- iter_buf_null_deref: verifier must reject direct dereference of
ctx->key (PTR_TO_BUF | PTR_MAYBE_NULL) without a null check
- iter_buf_null_check_ok: verifier must accept dereference after
an explicit null check
Dave Airlie [Tue, 24 Feb 2026 02:06:23 +0000 (12:06 +1000)]
ttm/pool: track allocated_pages per numa node.
This gets the memory sizes from the nodes and stores the limit
as 50% of those. I think eventually we should drop the limits
once we have memcg aware shrinking, but this should be more NUMA
friendly, and I think seems like what people would prefer to
happen on NUMA aware systems.
Cc: Christian Koenig <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:22 +0000 (12:06 +1000)]
ttm/pool: make pool shrinker NUMA aware (v2)
This enable NUMA awareness for the shrinker on the
ttm pools.
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:20 +0000 (12:06 +1000)]
ttm/pool: port to list_lru. (v2)
This is an initial port of the TTM pools for
write combined and uncached pages to use the list_lru.
This makes the pool's more NUMA aware and avoids
needing separate NUMA pools (later commit enables this).
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:19 +0000 (12:06 +1000)]
drm/ttm: use gpu mm stats to track gpu memory allocations. (v4)
This uses the newly introduced per-node gpu tracking stats,
to track GPU memory allocated via TTM and reclaimable memory in
the TTM page pools.
These stats will be useful later for system information and
later when mem cgroups are integrated.
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 24 Feb 2026 02:06:18 +0000 (12:06 +1000)]
mm: add gpu active/reclaim per-node stat counters (v2)
While discussing memcg intergration with gpu memory allocations,
it was pointed out that there was no numa/system counters for
GPU memory allocations.
With more integrated memory GPU server systems turning up, and
more requirements for memory tracking it seems we should start
closing the gap.
Add two counters to track GPU per-node system memory allocations.
The first is currently allocated to GPU objects, and the second
is for memory that is stored in GPU page pools that can be reclaimed,
by the shrinker.
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
When a CREATE returns STATUS_STOPPED_ON_SYMLINK, smb2_check_message()
returns success without any length validation, leaving the symlink
parsers as the only defense against an untrusted server.
symlink_data() walks SMB 3.1.1 error contexts with the loop test "p <
end", but reads p->ErrorId at offset 4 and p->ErrorDataLength at offset
0. When the server-controlled ErrorDataLength advances p to within 1-7
bytes of end, the next iteration will read past it. When the matching
context is found, sym->SymLinkErrorTag is read at offset 4 from
p->ErrorContextData with no check that the symlink header itself fits.
smb2_parse_symlink_response() then bounds-checks the substitute name
using SMB2_SYMLINK_STRUCT_SIZE as the offset of PathBuffer from
iov_base. That value is computed as sizeof(smb2_err_rsp) +
sizeof(smb2_symlink_err_rsp), which is correct only when
ErrorContextCount == 0.
With at least one error context the symlink data sits 8 bytes deeper,
and each skipped non-matching context shifts it further by 8 +
ALIGN(ErrorDataLength, 8). The check is too short, allowing the
substitute name read to run past iov_len. The out-of-bound heap bytes
are UTF-16-decoded into the symlink target and returned to userspace via
readlink(2).
Fix this all up by making the loops test require the full context header
to fit, rejecting sym if its header runs past end, and bound the
substitute name against the actual position of sym->PathBuffer rather
than a fixed offset.
Because sub_offs and sub_len are 16bits, the pointer math will not
overflow here with the new greater-than.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Bharath SM <bharathsm@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: stable <stable@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steve French <stfrench@microsoft.com>
smb: client: fix off-by-8 bounds check in check_wsl_eas()
The bounds check uses (u8 *)ea + nlen + 1 + vlen as the end of the EA
name and value, but ea_data sits at offset sizeof(struct
smb2_file_full_ea_info) = 8 from ea, not at offset 0. The strncmp()
later reads ea->ea_data[0..nlen-1] and the value bytes follow at
ea_data[nlen+1..nlen+vlen], so the actual end is ea->ea_data + nlen + 1
+ vlen. Isn't pointer math fun?
The earlier check (u8 *)ea > end - sizeof(*ea) only guarantees the
8-byte header is in bounds, but since the last EA is placed within 8
bytes of the end of the response, the name and value bytes are read past
the end of iov.
Fix this mess all up by using ea->ea_data as the base for the bounds
check.
An "untrusted" server can use this to leak up to 8 bytes of kernel heap
into the EA name comparison and influence which WSL xattr the data is
interpreted as.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: Bharath SM <bharathsm@microsoft.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: stable <stable@kernel.org> Assisted-by: gregkh_clanker_t1000 Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steve French <stfrench@microsoft.com>
gfs2: prevent NULL pointer dereference during unmount
When flushing out outstanding glock work during an unmount, gfs2_log_flush()
can be called when sdp->sd_jdesc has already been deallocated and sdp->sd_jdesc
is NULL. Commit 35264909e9d1 ("gfs2: Fix NULL pointer dereference in
gfs2_log_flush") added a check for that to gfs2_log_flush() itself, but it
missed the sdp->sd_jdesc dereference in gfs2_log_release(). Fix that.
Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202604071139.HNJiCaAi-lkp@intel.com/ Fixes: 35264909e9d1 ("gfs2: Fix NULL pointer dereference in gfs2_log_flush") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
During an unmount, wait for potential withdraw to complete before calling
gfs2_make_fs_ro(). This will allow gfs2_make_fs_ro() to skip much of its work.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_dinode_in(), only allow directories to have the GFS2_DIF_EXHASH
flag set. This will prevent other parts of the code from treating
regular inodes as directories based on the presence of that flag.
In sweep_bh_for_rgrps() and __gfs2_free_blocks(), check if the
GFS2_DIF_EXHASH flag is set instead of checking if i_depth is non-zero.
This matches what the directory code does. (The i_depth checks were
introduced in commit 6d3117b412951 ("GFS2: Wipe directory hash table
metadata when deallocating a directory").)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a withdraw occurs in gfs2_log_flush() and we are left with an unsubmitted
bio, fail that bio. Otherwise, the bh's in that bio will remain locked and
gfs2_evict_inode() -> truncate_inode_pages() -> gfs2_invalidate_folio() ->
gfs2_discard() will hang trying to discard the locked bh's.
In addition, when gfs2_log_flush() fails to submit a new transaction, unpin the
buffers in the failing transaction like gfs2_remove_from_journal() does. If
any of the bd's are on the ail2 list, leave them there and do_withdraw() ->
gfs2_withdraw_glocks() -> inode_go_inval() -> truncate_inode_pages() ->
gfs2_invalidate_folio() -> gfs2_discard() will remove them. They will be freed
in gfs2_release_folio().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_logd() calls the log flushing functions gfs2_ail1_start(),
gfs2_ail1_wait(), and gfs2_ail1_empty() without holding sdp->sd_log_flush_lock,
but these functions require exclusion against concurrent transactions.
To fix that, add a non-locking __gfs2_log_flush() function. Then, in
gfs2_logd(), take sdp->sd_log_flush_lock before calling the above mentioned log
flushing functions and __gfs2_log_flush().
Fixes: 5e4c7632aae1c ("gfs2: Issue revokes more intelligently") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
gfs2: fix address space truncation during withdraw
When a withdrawn filesystem's inodes are being evicted, the address spaces of
those inodes still need to be truncated but we can no longer start new
transactions. We still don't want gfs2_invalidate_folio() to race with
gfs2_log_flush(), so take a read lock on sdp->sd_log_flush_lock in that case.
(It may not be obvious, but gfs2_invalidate_folio() is a jdata-only address
space operation.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a withdraw is carried out, call gfs2_ail_drain() under the
sdp->sd_log_flush_lock. This isn't strictly necessary but should be easier to
read, and more robust against possible future bugs.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
53373d135579 dtc: Remove unused dts_version in dtc-lexer.l caf7465c5d60 libfdt: fdt_check_full: Handle FDT_NOP when FDT_END is expected 5976c4a66098 libfdt: fdt_rw: Introduce fdt_downgrade_version() 5bb5bedd347d fdtdump: Return an error code on wrong tag value 68b960e299f7 fdtdump: Remove dtb version check adba02caf554 dtc: Use a consistent type for basenamelen 8d15a63e84ff libfdt: Verify alignment of sub-blocks in dtb
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
The dynamic EPP feature uses power_supply_reg_notifier() and
power_supply_unreg_notifier() but doesn't declare a dependency on
POWER_SUPPLY, causing linker errors when POWER_SUPPLY is not enabled.
Add POWER_SUPPLY to the selects.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Fixes: e30ca6dd5345 ("cpufreq/amd-pstate: Add dynamic energy performance preference") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604040742.ySEdkuAa-lkp@intel.com/ Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20260407194949.310114-1-mario.limonciello@amd.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Justin Stitt [Tue, 31 Mar 2026 00:09:08 +0000 (17:09 -0700)]
kbuild: expand inlining hints with -fdiagnostics-show-inlining-chain
Clang recently added -fdiagnostics-show-inlining-chain [1] to improve
the visibility of inlining chains in diagnostics. This is particularly
useful for CONFIG_FORTIFY_SOURCE where detections can happen deep in
inlined functions.
Add this flag to KBUILD_CFLAGS under a cc-option so it is enabled if the
compiler supports it. Note that GCC does not have an equivalent flag as
it supports a similar diagnostic structure unconditionally.
Dmitry Torokhov [Thu, 8 Aug 2024 17:27:29 +0000 (10:27 -0700)]
Input: pc110pad - remove driver
Palm Top PC 110 is a handheld personal computer with 80486SX CPU that was
released exclusively in Japan in September 1995.
While the kernel still supports 486 CPU it is highly unlikely that anyone
is using this device with the latest kernel.
Remove the driver.
[bhelgaas: since this was posted, "x86/cpu: Remove M486/M486SX/ELAN
support" has been queued for v7.1, so pc110pad is no longer relevant:
https://lore.kernel.org/all/20251214084710.3606385-2-mingo@kernel.org/]
RCU Tasks Trace grace period implies RCU grace period, and this
guarantee is expected to remain in the future. Only BPF is the user of
this predicate, hence retire the API and clean up all in-tree users.
RCU Tasks Trace is now implemented on SRCU-fast and its grace period
mechanism always has at least one call to synchronize_rcu() as it is
required for SRCU-fast's correctness (it replaces the smp_mb() that
SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP.
Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: Allow prog name matching for tests with __description
For tests that carry a __description tag, allow matching on both the
description string and program name for convenience. Before this commit,
the description string must be spelt out to filter the tests.
watchdog: ni903x_wdt: Convert to a platform driver
In all cases in which a struct acpi_driver is used for binding a driver
to an ACPI device object, a corresponding platform device is created by
the ACPI core and that device is regarded as a proper representation of
underlying hardware. Accordingly, a struct platform_driver should be
used by driver code to bind to that device. There are multiple reasons
why drivers should not bind directly to ACPI device objects [1].
In particular, registering a watchdog device under a struct acpi_device
is questionable because it causes the watchdog to be hidden in the ACPI
bus sysfs hierarchy and it goes against the general rule that a struct
acpi_device can only be a parent of another struct acpi_device.
Overall, it is better to bind drivers to platform devices than to their
ACPI companions, so convert the ni903x_wdt watchdog ACPI driver to a
platform one.
While this is not expected to alter functionality, it changes sysfs
layout and so it will be visible to user space.
Note that after this change it actually makes sense to look for
the "timeout-sec" property via device_property_read_u32() under the
device passed to watchdog_init_timeout() because it has an fwnode
handle (unlike a struct acpi_device which is an fwnode itself).
In all cases in which a struct acpi_driver is used for binding a driver
to an ACPI device object, a corresponding platform device is created by
the ACPI core and that device is regarded as a proper representation of
underlying hardware. Accordingly, a struct platform_driver should be
used by driver code to bind to that device. There are multiple reasons
why drivers should not bind directly to ACPI device objects [1].
Overall, it is better to bind drivers to platform devices than to their
ACPI companions, so convert the Xen ACPI processor aggregator device
(PAD) driver to a platform one.
While this is not expected to alter functionality, it changes sysfs
layout and so it will be visible to user space.
Using the stricter "./tools/docs/kernel-doc -Wall -v" to verify proper
formatting of documentation comments includes warnings related to return
markup on functions that are omitted during the default verification
checks. This stricter verification reports a couple of missing return
descriptions in resctrl:
Warning: .../fs/resctrl/rdtgroup.c:1536 No description found for return value of 'rdtgroup_cbm_to_size'
Warning: .../fs/resctrl/rdtgroup.c:3131 No description found for return value of 'mon_get_kn_priv'
Warning: .../fs/resctrl/rdtgroup.c:3523 No description found for return value of 'cbm_ensure_valid'
Warning: .../fs/resctrl/monitor.c:238 No description found for return value of 'resctrl_find_cleanest_closid'
The x86 maintainers handle the resctrl filesystem and x86 architectural
resctrl code. Even so, the x86 maintainers are not part of the resctrl
section and not returned when scripts/get_maintainer.pl is run on resctrl
filesystem code. With patches flowing via x86 maintainers resctrl should
also ensure it follows the tip rules.
Add the x86 maintainer alias, x86@kernel.org, to the resctrl section to
ensure x86 maintainers are included in associated resctrl submissions.
Add a reference to the tip tree handbook to make it clear which rules
resctrl follows.
workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
use NR_STD_WORKER_POOLS for irq_work_fns[] array definition.
NR_STD_WORKER_POOLS is also 2, but better to use MACRO.
Initialization loop for_each_bh_worker_pool() also uses same MACRO.
Herve Codina [Wed, 25 Mar 2026 14:35:44 +0000 (15:35 +0100)]
of: property: Allow fw_devlink device-tree on x86
PCI drivers can use a device-tree overlay to describe the hardware
available on the PCI board. This is the case, for instance, of the
LAN966x PCI device driver.
Adding some more nodes in the device-tree overlay adds some more
consumer/supplier relationship between devices instantiated from this
overlay.
Those fw_node consumer/supplier relationships are handled by fw_devlink
and are created based on the device-tree parsing done by the
of_fwnode_add_links() function.
Those consumer/supplier links are needed in order to ensure a correct PM
runtime management and a correct removal order between devices.
For instance, without those links a supplier can be removed before its
consumers is removed leading to all kind of issue if this consumer still
want the use the already removed supplier.
The support for the usage of an overlay from a PCI driver has been added
on x86 systems in commit 1f340724419ed ("PCI: of: Create device tree PCI
host bridge node").
In the past, support for fw_devlink on x86 had been tried but this
support has been removed in commit 4a48b66b3f52 ("of: property: Disable
fw_devlink DT support for X86"). Indeed, this support was breaking some
x86 systems such as OLPC system and the regression was reported in [0].
Instead of disabling this support for all x86 system, use a finer grain
and disable this support only for the possible problematic subset of x86
systems (at least OLPC and CE4100).
Those systems use a device-tree to describe their hardware. Identify
those systems using key properties in the device-tree.
Boris Burkov [Mon, 6 Apr 2026 16:15:15 +0000 (09:15 -0700)]
btrfs: btrfs_log_dev_io_error() on all bio errors
As far as I can tell, we never intentionally constrained ourselves to
these status codes, and it is misleading and surprising to lack the
bdev error logging when we get a different error code from the block
layer. This can lead to jumping to a wrong conclusion like "this
system didn't see any bio failures but aborted with EIO".
For example on nvme devices, I observe many failures coming back as
BLK_STS_MEDIUM. It is apparent that the nvme driver returns a variety of
BLK_STS_* status values in nvme_error_status().
So handle the known expected errors and make some noise on the rest
which we expect won't really happen.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: fix silent IO error loss in encoded writes and zoned split
can_finish_ordered_extent() and btrfs_finish_ordered_zoned() set
BTRFS_ORDERED_IOERR via bare set_bit(). Later,
btrfs_mark_ordered_extent_error() in btrfs_finish_one_ordered() uses
test_and_set_bit(), finds it already set, and skips
mapping_set_error(). The error is never recorded on the inode's
address_space, making it invisible to fsync. For encoded writes this
causes btrfs receive to silently produce files with zero-filled holes.
Fix: replace bare set_bit(BTRFS_ORDERED_IOERR) with
btrfs_mark_ordered_extent_error() which pairs test_and_set_bit() with
mapping_set_error(), guaranteeing the error is recorded exactly once.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: Michal Grzedzicki <mge@meta.com> Signed-off-by: David Sterba <dsterba@suse.com>
Dave Chen [Mon, 30 Mar 2026 03:31:48 +0000 (11:31 +0800)]
btrfs: skip clearing EXTENT_DEFRAG for NOCOW ordered extents
In btrfs_finish_one_ordered(), clear_bits is unconditionally initialized
with EXTENT_DEFRAG. For NOCOW ordered extents this is always a no-op
because should_nocow() already forces the COW path when EXTENT_DEFRAG is
set, so a NOCOW ordered extent can never have EXTENT_DEFRAG on its range.
Although harmless, the unconditional btrfs_clear_extent_bit() call still
performs a cold rbtree lookup under the io tree spinlock on every NOCOW
write completion. Avoid this by only adding EXTENT_DEFRAG to clear_bits
for non-NOCOW ordered extents, and skip the call entirely when there are
no bits to clear.
Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Dave Chen [Tue, 7 Apr 2026 03:36:24 +0000 (11:36 +0800)]
btrfs: use BTRFS_FS_UPDATE_UUID_TREE_GEN flag for UUID tree rescan check
The UUID tree rescan check in open_ctree() compares
fs_info->generation with the superblock's uuid_tree_generation.
This comparison is not reliable because fs_info->generation is
bumped at transaction start time in join_transaction(), while
uuid_tree_generation is only updated at commit time via
update_super_roots().
Between the early BTRFS_FS_UPDATE_UUID_TREE_GEN flag check and the
late rescan decision, mount operations such as file orphan cleanup
from an unclean shutdown start transactions without committing
them. This advances fs_info->generation past uuid_tree_generation
and produces a false-positive mismatch.
Use the BTRFS_FS_UPDATE_UUID_TREE_GEN flag directly instead. The
flag was already set earlier in open_ctree() when the generations
were known to match, and accurately represents "UUID tree is up to
date" without being affected by subsequent transaction starts.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 1 Apr 2026 18:46:59 +0000 (19:46 +0100)]
btrfs: remove duplicate journal_info reset on failure to commit transaction
If we get an error during the transaction commit path, we are resetting
current->journal_info to NULL twice - once in btrfs_commit_transaction()
right before calling cleanup_transaction() and then once again inside
cleanup_transaction(). Remove the instance in btrfs_commit_transaction().
Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 1 Apr 2026 17:51:35 +0000 (18:51 +0100)]
btrfs: tag as unlikely if statements that check for fs in error state
Having the filesystem in an error state, meaning we had a transaction
abort, is unexpected. Mark every check for the error state with the
unlikely annotation to convey that and to allow the compiler to generate
better code.
On x86_64, using gcc 14.2.0-19 from Debian, resulted in a slightly
reduced object size and better code.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 2008598 175912 15592 2200102 219226 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 2008450 175912 15592 2199954 219192 fs/btrfs/btrfs.ko
Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'ata-7.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Add a quirk for JMicron JMB582/JMB585 AHCI controllers such that
they only use 32-bit DMA addresses.
While these controllers do report that they support 64-bit DMA
addresses, a user reports that using 64-bit DMA addresses cause
silent corruption even on modern x86 systems (Arthur)
* tag 'ata-7.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci: force 32-bit DMA for JMicron JMB582/JMB585
Merge tag 'hyperv-fixes-signed-20260406' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull Hyper-V fixes from Wei Liu:
- Two fixes for Hyper-V PCI driver (Long Li, Sahil Chandna)
- Fix an infinite loop issue in MSHV driver (Stanislav Kinsburskii)
* tag 'hyperv-fixes-signed-20260406' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
mshv: Fix infinite fault loop on permission-denied GPA intercepts
PCI: hv: Fix double ida_free in hv_pci_probe error path
PCI: hv: Set default NUMA node to 0 for devices without affinity info
ASoC: amd: ps: fix the pcm device numbering for acp pdm dmic
Fixed PCM device numbering is required for acp pdm dmic pcm device
to have a common UCM changes.
Set the 'use_dai_pcm_id' flag true in acp pdm dma driver for acp 6.3
platform. This will fix the pcm device numbering based on dai_link->id.
alarmtimer: Access timerqueue node under lock in suspend
In alarmtimer_suspend(), timerqueue_getnext() is called under
base->lock, but next->expires is read after the lock is released.
This is safe because suspend freezes all relevant task contexts,
but reading the node while holding the lock makes the code easier
to reason about and not worry about a theoretical UAF.
dt-bindings: arm-smmu: qcom: Add compatible for Hawi SoC
Qualcomm Hawi SoC include apps smmu that implements arm,mmu-500, which
is used to translate device-visible virtual addresses to physical
addresses. Add compatible for these items.
Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Will Deacon <will@kernel.org>
Keep the direct kfree(space_info) for the earlier failure path, but
after btrfs_sysfs_add_space_info_type() has called kobject_put(), let
the kobject release callback handle the cleanup.
Fixes: a11224a016d6d ("btrfs: fix memory leaks in create_space_info() error paths") CC: stable@vger.kernel.org # 6.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
Keep parent->sub_group[index] = NULL for the failure path, but after
btrfs_sysfs_add_space_info_type() has called kobject_put(), let the
kobject release callback handle the cleanup.
Qu Wenruo [Tue, 31 Mar 2026 23:02:57 +0000 (09:32 +1030)]
btrfs: do not reject a valid running dev-replace
[BUG]
There is a bug report that a btrfs with running dev-replace got rejected
with the following messages:
BTRFS error (device sdk1): devid 0 path /dev/sdk1 is registered but not found in chunk tree
BTRFS error (device sdk1): remove the above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
BTRFS error (device sdk1): open_ctree failed: -117
[CAUSE]
The tree and super block dumps show the fs is completely sane, except
one thing, there is no dev item for devid 0 in chunk tree.
However this is not a bug, as we do not insert dev item for devid 0 in
the first place.
Since the devid 0 is only there temporarily we do not really need to
insert a dev item for it and then later remove it again.
It is the commit 34308187395f ("btrfs: add extra device item checks at
mount") adding a overly strict check that triggers a false alert and
rejected the valid filesystem.
[FIX]
Add a special handling for devid 0, and doesn't require devid 0 to
have a device item in chunk tree.
Qu Wenruo [Mon, 30 Mar 2026 22:21:56 +0000 (08:51 +1030)]
btrfs: only invalidate btree inode pages after all ebs are released
In close_ctree(), we call invalidate_inode_pages2() to invalidate all
pages from btree inode.
But the problem is, it never returns 0, but always -EBUSY.
The problem is that we are still holding all the essential tree root
nodes, thus pages holding those tree blocks can not be invalidated thus
invalidate_inode_pages2() always returns -EBUSY.
This is also against the error cleanup path of open_ctree(), which
properly frees all root pointers before calling invalidate_inode_pages().
So fix the order by delaying invalidate_inode_pages2() until we have
freed all root pointers.
Reviewed-by: Anand Jain <asj@kernel.org> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: prevent direct reclaim during compressed readahead
Under memory pressure, direct reclaim can kick in during compressed
readahead. This puts the associated task into D-state. Then shrink_lruvec()
disables interrupts when acquiring the LRU lock. Under heavy pressure,
we've observed reclaim can run long enough that the CPU becomes prone to
CSD lock stalls since it cannot service incoming IPIs. Although the CSD
lock stalls are the worst case scenario, we have found many more subtle
occurrences of this latency on the order of seconds, over a minute in some
cases.
Prevent direct reclaim during compressed readahead. This is achieved by
using different GFP flags at key points when the bio is marked for
readahead.
There are two functions that allocate during compressed readahead:
btrfs_alloc_compr_folio() and add_ra_bio_pages(). Both currently use
GFP_NOFS which includes __GFP_DIRECT_RECLAIM.
For the internal API call btrfs_alloc_compr_folio(), the signature changes
to accept an additional gfp_t parameter. At the readahead call site, it
gets flags similar to GFP_NOFS but stripped of __GFP_DIRECT_RECLAIM.
__GFP_NOWARN is added since these allocations are allowed to fail. Demand
reads still use full GFP_NOFS and will enter reclaim if needed. All other
existing call sites of btrfs_alloc_compr_folio() now explicitly pass
GFP_NOFS to retain their current behavior.
add_ra_bio_pages() gains a bool parameter which allows callers to specify
if they want to allow direct reclaim or not. In either case, the
__GFP_NOWARN flag was added unconditionally since the allocations are
speculative.
There has been some previous work done on calling add_ra_bio_pages() [0].
This patch is complementary: where that patch reduces call frequency, this
patch reduces the latency associated with those calls.
Teng Liu [Sat, 28 Mar 2026 06:40:59 +0000 (07:40 +0100)]
btrfs: replace BUG_ON() with error return in cache_save_setup()
In cache_save_setup(), if create_free_space_inode() succeeds but the
subsequent lookup_free_space_inode() still fails on retry, the
BUG_ON(retries) will crash the kernel. This can happen due to I/O
errors or transient failures, not just programming bugs.
Replace the BUG_ON with proper error handling that returns the original
error code through the existing cleanup path. The callers already handle
this gracefully: disk_cache_state defaults to BTRFS_DC_ERROR, so the
space cache simply won't be written for that block group.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Teng Liu <27rabbitlt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jan 2026 16:20:34 +0000 (17:20 +0100)]
btrfs: zstd: don't cache sectorsize in a local variable
The sectorsize is used once or at most twice in the callbacks, no need
to cache it on stack. Minor effect on zstd_compress_folios() where it
saves 8 bytes of stack.
David Sterba [Tue, 6 Jan 2026 16:20:30 +0000 (17:20 +0100)]
btrfs: zlib: drop redundant folio address variable
We're caching the current output folio address but it's not really
necessary as we store it in the variable and then pass it to the stream
context. We can read the folio address directly.
David Sterba [Tue, 6 Jan 2026 16:20:28 +0000 (17:20 +0100)]
btrfs: use common eb range validation in read_extent_buffer_to_user_nofault()
The extent buffer access is checked in other helpers by
check_eb_range(), which validates the requested start, length against
the extent buffer. While this almost never fails we should still handle
it as an error and not just warn.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jan 2026 16:20:27 +0000 (17:20 +0100)]
btrfs: read eb folio index right before loops
There are generic helpers to access extent buffer folio data of any
length, potentially iterating over a few of them. This is a slow path,
either we use the type based accessors or the eb folio allocation is
contiguous and we can use the memcpy/memcmp helpers.
The initialization of 'i' is done at the beginning though it may not be
needed. Move it right before the folio loop, this has minor effect on
generated code in __write_extent_buffer().
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jan 2026 16:20:25 +0000 (17:20 +0100)]
btrfs: unify types for binary search variables
The variables calculating where to jump next are using mixed in types
which requires some conversions on the instruction level. Using 'u32'
removes one call to 'movslq', making the main loop shorter.
This complements type conversion done in a724f313f84beb ("btrfs: do
unsigned integer division in the extent buffer binary search loop")
David Sterba [Tue, 6 Jan 2026 16:20:24 +0000 (17:20 +0100)]
btrfs: remove duplicate calculation of eb offset in btrfs_bin_search()
In the main search loop the variable 'oil' (offset in folio) is set
twice, one duplicated when the key fits completely to the contiguous
range. We can remove it and while it's just a simple calculation, the
binary search loop is executed many times so micro optimizations add up.
The code size is reduced by 64 bytes on release config, the loop is
reorganized a bit and a few instructions shorter.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 25 Mar 2026 12:53:43 +0000 (12:53 +0000)]
btrfs: tree-checker: add remap-tree checks to check_block_group_item()
Add some write-time checks for block group items relating to the remap
tree.
Here we're checking:
* That the REMAPPED or METADATA_REMAP flags aren't set unless the
REMAP_TREE incompat flag is also set
* That `remap_bytes` isn't more than the size of the block group
* That `identity_remap_count` isn't more than the number of sectors in
the block group
Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 24 Mar 2026 15:01:34 +0000 (15:01 +0000)]
btrfs: make btrfs_free_log() and btrfs_free_log_root_tree() return void
These functions never fail, always return success (0) and none of the
callers care about their return values. Change their return type from int
to void.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 23 Mar 2026 15:50:13 +0000 (15:50 +0000)]
btrfs: fix deadlock between reflink and transaction commit when using flushoncommit
When using the flushoncommit mount option, we can have a deadlock between
a transaction commit and a reflink operation that copied an inline extent
to an offset beyond the current i_size of the destination node.
The deadlock happens like this:
1) Task A clones an inline extent from inode X to an offset of inode Y
that is beyond Y's current i_size. This means we copied the inline
extent's data to a folio of inode Y that is beyond its EOF, using a
call to copy_inline_to_page();
2) Task B starts a transaction commit and calls
btrfs_start_delalloc_flush() to flush delalloc;
3) The delalloc flushing sees the new dirty folio of inode Y and when it
attempts to flush it, it ends up at extent_writepage() and sees that
the offset of the folio is beyond the i_size of inode Y, so it attempts
to invalidate the folio by calling folio_invalidate(), which ends up at
btrfs' folio invalidate callback - btrfs_invalidate_folio(). There it
tries to lock the folio's range in inode Y's extent io tree, but it
blocks since it's currently locked by task A - during a reflink we lock
the inodes and the source and destination ranges after flushing all
delalloc and waiting for ordered extent completion - after that we
don't expect to have dirty folios in the ranges, the exception is if
we have to copy an inline extent's data (because the destination offset
is not zero);
4) Task A then attempts to start a transaction to update the inode item,
and then it's blocked since the current transaction is in the
TRANS_STATE_COMMIT_START state. Therefore task A has to wait for the
current transaction to become unblocked (its state >=
TRANS_STATE_UNBLOCKED).
So task A is waiting for the transaction commit done by task B, and
the later waiting on the extent lock of inode Y that is currently
held by task A.
Syzbot recently reported this with the following stack traces:
Fix this by updating the i_size of the destination inode of a reflink
operation after we copy an inline extent's data to an offset beyond the
i_size and before attempting to start a transaction to update the inode's
item.
Reported-by: syzbot+63056bf627663701bbbf@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/69bba3fe.050a0220.227207.002f.GAE@google.com/ Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Tue, 3 Mar 2026 15:59:12 +0000 (15:59 +0000)]
btrfs: tree-checker: add checker for items in remap tree
Add write-time checking of items in the remap tree, to catch errors
before they are written to disk.
We're checking:
* That remap items, remap backrefs, and identity remaps aren't written
unless the REMAP_TREE incompat flag is set
* That identity remaps have a size of 0
* That remap items and remap backrefs have a size of sizeof(struct
btrfs_remap_item)
* That the objectid for these items is aligned to the sector size
* That the offset for these items (i.e. the size of the remapping) isn't
0 and is aligned to the sector size
* That objectid + offset doesn't overflow
Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Dave Chen [Mon, 23 Mar 2026 03:43:22 +0000 (11:43 +0800)]
btrfs: fix unnecessary flush on close when truncating zero-sized files
In btrfs_setsize(), when a file is truncated to size 0, the
BTRFS_INODE_FLUSH_ON_CLOSE flag is unconditionally set to ensure
pending writes get flushed on close. This flag was designed to protect
the "truncate-then-rewrite" pattern, where an application truncates a
file with existing data down to zero and writes new content, ensuring
the new data reach disk on close.
However, when a file already has a size of 0 (e.g. a newly created
file opened with O_CREAT | O_TRUNC), oldsize and newsize are both 0.
In this case, setting BTRFS_INODE_FLUSH_ON_CLOSE is unnecessary because
no "good data" was truncated away. The subsequent filemap_flush() in
btrfs_release_file() then triggers avoidable writeback that disrupts
the normal delayed writeback batching, adding I/O overhead.
This comes from a real workload. A backup service creates temporary
files via mkstemp(), closes them, and later reopens them with O_TRUNC
for writing. The O_TRUNC is defensive. The file creation and usage is
done by a different component, so removing the unneeded truncation is
not straightforward. This pattern repeats for a large number of files
each close() triggers an unnecessary filemap_flush().
Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 27 Feb 2026 03:33:44 +0000 (14:03 +1030)]
btrfs: move shutdown and remove_bdev callbacks out of experimental features
These two new callbacks have been introduced in v6.19, and it has been
two releases in v7.1.
During that time we have not yet exposed bugs related that two features,
thus it's time to expose them for end users.
It's especially important to expose remove_bdev callback to end users.
That new callback makes btrfs automatically shutdown or go degraded
when a device is missing (depending on if the fs can maintain RW), which
is affecting end users.
We want some feedback from early adopters.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Yochai Eisenrich [Sun, 22 Mar 2026 06:39:35 +0000 (08:39 +0200)]
btrfs: fix btrfs_ioctl_space_info() slot_count TOCTOU which can lead to info-leak
btrfs_ioctl_space_info() has a TOCTOU race between two passes over the
block group RAID type lists. The first pass counts entries to determine
the allocation size, then the second pass fills the buffer. The
groups_sem rwlock is released between passes, allowing concurrent block
group removal to reduce the entry count.
When the second pass fills fewer entries than the first pass counted,
copy_to_user() copies the full alloc_size bytes including trailing
uninitialized kmalloc bytes to userspace.
Fix by copying only total_spaces entries (the actually-filled count from
the second pass) instead of alloc_size bytes, and switch to kzalloc so
any future copy size mismatch cannot leak heap data.
Fixes: 7fde62bffb57 ("Btrfs: buffer results in the space_info ioctl") CC: stable@vger.kernel.org # 3.0 Signed-off-by: Yochai Eisenrich <echelonh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 18 Mar 2026 13:39:51 +0000 (13:39 +0000)]
btrfs: avoid taking the device_list_mutex in btrfs_run_dev_stats()
btrfs_run_dev_stats() is called during the critical section of a
transaction commit and it takes the device_list_mutex, which is also
acquired by fitrim, which does discard operations while holding that
mutex. Most of the time, if we are on a healthy filesystem, we don't have
new stat updates to persist in the device tree, so blocking on the
device_list_mutex is just wasting time and making any tasks that need to
start a new transaction wait longer that necessary.
Since the device list is RCU safe/protected, make btrfs_run_dev_stats()
do an initial check for device stat updates using RCU and quit without
taking the device_list_mutex in case there are no new device stats that
need to be persisted in the device tree.
Also note that adding/removing devices also requires starting a
transaction, and since btrfs_run_dev_stats() is called from the critical
section of a transaction commit, no one can be concurrently adding or
removing a device while btrfs_run_dev_stats() is called.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Leo Martins [Thu, 19 Mar 2026 23:49:08 +0000 (16:49 -0700)]
btrfs: avoid GFP_ATOMIC allocations in qgroup free paths
When qgroups are enabled, __btrfs_qgroup_release_data() and
qgroup_free_reserved_data() pass an extent_changeset to
btrfs_clear_record_extent_bits() to track how many bytes had their
EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock,
add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each
changed range. If this allocation fails, it hits a BUG_ON and panics the
kernel.
However, both of these callers only read changeset.bytes_changed
afterwards — the range_changed ulist is populated and immediately freed
without ever being iterated. The GFP_ATOMIC allocation is entirely
unnecessary for these paths.
Introduce extent_changeset_init_bytes_only() which uses a sentinel value
(EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal
that only bytes_changed should be tracked. add_extent_changeset() checks
for this sentinel and returns early after updating bytes_changed,
skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC
allocation and makes the BUG_ON unreachable for these paths.
Callers that need range tracking (qgroup_reserve_data,
qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to
use extent_changeset_init() and are unaffected.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: decrease indentation of find_free_extent_update_loop
Decrease the indentation of find_free_extent_update_loop(), by inverting
the check if the loop state is smaller than LOOP_NO_EMPTY_SIZE.
This also allows for an early return from find_free_extent_update_loop(),
in case LOOP_NO_EMPTY_SIZE is already set at this point.
While at it change a
if () {
}
else if
else
pattern to all using curly braces and be consistent with the rest of btrfs
code.
Also change 'int exists' to 'bool have_trans' giving it a more meaningful
name and type.
No functional changes intended.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 17 Mar 2026 19:35:58 +0000 (19:35 +0000)]
btrfs: unexport btrfs_qgroup_reserve_meta()
There's only one caller outside qgroup.c of btrfs_qgroup_reserve_meta()
and we have btrfs_qgroup_reserve_meta_prealloc() is a wrapper around
that function. Make that caller use btrfs_qgroup_reserve_meta_prealloc()
and unexport btrfs_qgroup_reserve_meta(), simplifying the external API.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 17 Mar 2026 19:19:43 +0000 (19:19 +0000)]
btrfs: collapse __btrfs_qgroup_reserve_meta() into btrfs_qgroup_reserve_meta_prealloc()
Since __btrfs_qgroup_reserve_meta() is only called by
btrfs_qgroup_reserve_meta_prealloc(), which is a simple inline wrapper,
get rid of the later and rename __btrfs_qgroup_reserve_meta() to the
later.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 17 Mar 2026 19:09:15 +0000 (19:09 +0000)]
btrfs: collapse __btrfs_qgroup_free_meta() into btrfs_qgroup_free_meta_prealloc()
Since __btrfs_qgroup_free_meta() is only called by
btrfs_qgroup_free_meta_prealloc(), which is a simple inline wrapper, get
rid of the later and rename __btrfs_qgroup_free_meta() to the later.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 18:36:36 +0000 (18:36 +0000)]
btrfs: optimize clearing all bits from first extent record in an io tree
When we are clearing all the bits from the first record that contains the
target range and that record ends at or before our target range but starts
before our target range, we are doing a lot of unnecessary work:
1) Allocating a prealloc state if we don't have one already;
2) Adjust that record's start offset to the start of our range and
make the prealloc state have a range going from the original start
offset of that first record to the start offset of our target range,
and with the same bits as that first record. Then we insert the
prealloc extent in the rbtree - this is done in split_state();
3) Remove our adjusted first state from the rbtree since all the bits
were cleared - this is done in clear_state_bit().
This is only wasting time when we can simply trim that first record, so
that it represents the range from its start offset to the start offset of
our target range. So optimize for that case and avoid the prealloc state
allocation, insertion and deletion from the rbtree.
This patch is the last patch of a patchset comprised of the following
patches (in descending order):
btrfs: optimize clearing all bits from first extent record in an io tree
btrfs: panic instead of warn when splitting extent state not in the tree
btrfs: free cached state outside critical section in wait_extent_bit()
btrfs: avoid unnecessary wake ups on io trees when there are no waiters
btrfs: remove wake parameter from clear_state_bit()
btrfs: change last argument of add_extent_changeset() to boolean
btrfs: use extent_io_tree_panic() instead of BUG_ON()
btrfs: make add_extent_changeset() only return errors or success
btrfs: tag as unlikely branches that call extent_io_tree_panic()
btrfs: turn extent_io_tree_panic() into a macro for better error reporting
btrfs: optimize clearing all bits from the last extent record in an io tree
The following fio script was used to measure performance before and after
applying all the patches:
Also, using bpftrace to collect the duration (in nanoseconds) of all the
btrfs_clear_extent_bit_changeset() calls done during that fio test and
then making an histogram from that data, held the following results:
Filipe Manana [Wed, 11 Mar 2026 16:15:59 +0000 (16:15 +0000)]
btrfs: panic instead of warn when splitting extent state not in the tree
We are not expected ever to split an extent state record that is not in
the rbtree, as every record we pass to split_state() was found by
iterating the rbtree, so if that ever happens it means we are not holding
the extent io tree's spinlock or we have some memory corruption.
Instead of simply warning in case the extent state record passed to
split_state() is not in the rbtree, panic as this is a serious problem.
Also tag as unlikely the case where the record is not in the rbtree.
This also makes a tiny reduction the btrfs module's text size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 2000080 174328 15592 2190000 216ab0 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 2000064 174328 15592 2189984 216aa0 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 16 Mar 2026 11:38:36 +0000 (11:38 +0000)]
btrfs: free cached state outside critical section in wait_extent_bit()
There's no need to free the cached extent state record while holding the
io tree's spinlock, it's just making the critical section longer than it
needs to be. So just do it after unlocking the io tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>