git.ipfire.org Git - thirdparty/kernel/stable.git/log

riscv: traps_misaligned: Avoid redundant unaligned access speed probe

When a CPU is taken offline and then brought back online, the unaligned
access speed probe always runs even though the unaligned access speed is
already known, wasting CPU cycles.

This is because when a CPU comes online, the following happens:

  1. check_unaligned_access_emulated() is called, which clears
     misaligned_access_speed if there is no emulation.

  2. check_unaligned_access() is called because misaligned_access_speed is
     cleared, wasting CPU cycles determining something that is already
     known.

Avoid the redundant access speed probe by no longer clearing
misaligned_access_speed in (1). If the access speed is already known, just
reuse it.
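
As a rough userspace illustration of the idea (not the kernel code), the
sketch below keeps a per-CPU speed value across offline/online cycles and
only runs the probe when that value is still unknown. The enum values,
run_speed_probe(), cpu_online() and probe_runs are hypothetical stand-ins.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's per-CPU speed values. */
enum access_speed { SPEED_UNKNOWN, SPEED_EMULATED, SPEED_SLOW, SPEED_FAST };

#define NR_CPUS 4

static enum access_speed misaligned_access_speed[NR_CPUS];
static int probe_runs; /* counts how often the expensive probe executes */

/* Stand-in for the timing-based probe; it always reports "fast" here. */
static enum access_speed run_speed_probe(int cpu)
{
    (void)cpu;
    probe_runs++;
    return SPEED_FAST;
}

/* Called each time a CPU comes online. */
static void cpu_online(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);

    /* Reuse the result of an earlier bring-up instead of clearing it. */
    if (misaligned_access_speed[cpu] != SPEED_UNKNOWN)
        return;

    misaligned_access_speed[cpu] = run_speed_probe(cpu);
}
```

Bringing the same CPU online twice runs the stand-in probe only once.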

On my VisionFive 2, this reduces CPU bring-up time from 26ms to 0.8ms.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/aa5755142537d462a9e3d2074d82ad4eef6774ba.1780002199.git.namcao@linutronix.de
Signed-off-by: Paul Walmsley <pjw@kernel.org>

riscv: misaligned: Fix fast_unaligned_access_speed_key init

When booting with unaligned_scalar_speed=fast,
fast_unaligned_access_speed_key is initialized incorrectly.

The key is currently derived from the fast_misaligned_access cpumask, but
that mask is only populated when the unaligned access speed probe runs.
Specifying unaligned_scalar_speed=fast skips the probe entirely, leaving
the mask uninitialized.

The information tracked by fast_misaligned_access is already available in
the misaligned_access_speed per-CPU variable. Use that to initialize
fast_unaligned_access_speed_key instead and remove the redundant cpumask.
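
A minimal sketch of deriving a single "everything is fast" decision
directly from per-CPU speed values, without a separate mask. It assumes
the key should be enabled only when every online CPU is fast; the enum,
the arrays and all_cpus_fast() are illustrative names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

enum access_speed { SPEED_UNKNOWN, SPEED_SLOW, SPEED_FAST };

/*
 * Return true only if every online CPU reports fast unaligned access.
 * The per-CPU values are valid whether they came from the probe or from
 * a command-line override, so no extra bookkeeping is needed.
 */
static bool all_cpus_fast(const enum access_speed *speed,
                          const bool *online, int ncpus)
{
    assert(ncpus > 0);

    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (online[cpu] && speed[cpu] != SPEED_FAST)
            return false;
    }
    return true;
}
```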

Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/2468816ceb433394099a00d7822f819745276b49.1780002199.git.namcao@linutronix.de
Signed-off-by: Paul Walmsley <pjw@kernel.org>

RDMA/srp: bound SRP_RSP sense copy by the received length

srp_process_rsp() copies sense data from rsp->data + resp_data_len,
where resp_data_len is the full 32-bit value supplied by the SRP target
and is never checked against the number of bytes actually received
(wc->byte_len). The copy length is bounded to SCSI_SENSE_BUFFERSIZE, so
at most 96 bytes are copied, but the source offset is not bounded.

A malicious or compromised SRP target on the InfiniBand/RoCE fabric that
the initiator has logged into can return an SRP_RSP with
SRP_RSP_FLAG_SNSVALID set and a large resp_data_len. The receive buffer
is allocated at the target-chosen max_ti_iu_len, so the source of the
sense copy lands past the bytes actually received; with resp_data_len
near 0xFFFFFFFF it is gigabytes past the buffer and the read faults.

Copy the sense data only if it has not been truncated, that is, only if
the response header, the response data, and the sense region fit within
the bytes actually received; otherwise drop the sense and log. The
in-tree iSER and NVMe-RDMA receive paths already bound their parse by
wc->byte_len; this brings ib_srp into line with them.
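
The check described above can be sketched as follows. This is a
standalone illustration, not the driver code: sense_fits() and its
parameters are hypothetical, and the header length is passed in rather
than taken from the real struct srp_rsp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) > sizeof(uint32_t),
              "the sum must be computed in a wider type");

/*
 * Return true if the header, the response data and the sense region all
 * lie within the bytes that were actually received.
 */
static bool sense_fits(uint32_t byte_len, uint32_t hdr_len,
                       uint32_t resp_data_len, uint32_t sense_len)
{
    /* Three 32-bit values cannot overflow a 64-bit sum. */
    uint64_t end = (uint64_t)hdr_len + resp_data_len + sense_len;

    return end <= byte_len;
}
```

Doing the addition in 64 bits matters: with resp_data_len near
0xFFFFFFFF, a 32-bit sum would wrap around and could pass the comparison.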

Fixes: aef9ec39c47f ("IB: Add SCSI RDMA Protocol (SRP) initiator")
Link: https://patch.msgid.link/r/20260602220457.2542840-1-michael.bommarito@gmail.com
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN

In drivers/infiniband/ulp/isert/ib_isert.c, isert_login_recv_done()
computes the login request payload length as wc->byte_len minus
ISER_HEADERS_LEN with no lower bound, and login_req_len is a signed int.
A remote iSER initiator can post a login Send work request carrying
fewer than ISER_HEADERS_LEN (76) bytes, so the subtraction underflows
and login_req_len becomes negative.

isert_rx_login_req() then reads that negative length back into a signed
int, takes size = min(rx_buflen, MAX_KEY_VALUE_PAIRS), and because the
min() is signed it keeps the negative value; the value is then passed as
the memcpy() length and sign-extended to a multi-gigabyte size_t. The
copy into the 8192-byte login->req_buf runs far out of bounds and
faults, crashing the target node. The login phase precedes iSCSI
authentication, so no credentials are required to reach this path.

Reject any login PDU shorter than ISER_HEADERS_LEN before the
subtraction, mirroring the existing early return on a failed work
completion, so login_req_len can never go negative. The upper bound was
already safe: a posted login buffer cannot deliver more than
ISER_RX_PAYLOAD_SIZE, so the difference stays at or below
MAX_KEY_VALUE_PAIRS and the existing min() clamps it; only the missing
lower bound needs to be added.
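
A small userspace sketch of the arithmetic involved, assuming the header
length of 76 quoted above. unchecked_copy_size() and login_payload_len()
are hypothetical helpers that only model the subtraction and the added
lower bound.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define ISER_HEADERS_LEN 76 /* value quoted in the changelog */

/* What the copy length becomes when a short PDU is not rejected. */
static size_t unchecked_copy_size(uint32_t byte_len)
{
    assert(byte_len <= INT_MAX);

    int len = (int)byte_len - ISER_HEADERS_LEN; /* negative if too short */

    return (size_t)len; /* a negative int converts to a huge size_t */
}

/* Payload length, or -1 if the PDU cannot even hold the headers. */
static int login_payload_len(uint32_t byte_len)
{
    if (byte_len < ISER_HEADERS_LEN)
        return -1;

    return (int)(byte_len - ISER_HEADERS_LEN);
}
```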

Fixes: b8d26b3be8b3 ("iser-target: Add iSCSI Extensions for RDMA (iSER) target driver")
Link: https://patch.msgid.link/r/20260602194642.2273217-1-michael.bommarito@gmail.com
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

cpufreq: Use policy->min/max init as QoS request

Modify cpufreq_policy_init_qos() introduced previously to use
policy->min/max set in the driver .init() callback as the initial
values for the policy min/max frequency QoS requests, respectively,
so long as they are different from 0 (which means that they have
been updated by the driver). Update the documentation in accordance
with that code change.

This only affects the following drivers:

- gx-suspmod (min)
- cppc-cpufreq (min)
- longrun (min/max)
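
The selection rule can be sketched roughly like this. The structures and
init_qos() are simplified, hypothetical stand-ins, and the two default
constants are assumed to mirror FREQ_QOS_MIN_DEFAULT_VALUE and
FREQ_QOS_MAX_DEFAULT_VALUE.

```c
#include <assert.h>
#include <limits.h>

#define QOS_MIN_DEFAULT 0       /* stand-in for FREQ_QOS_MIN_DEFAULT_VALUE */
#define QOS_MAX_DEFAULT INT_MAX /* stand-in for FREQ_QOS_MAX_DEFAULT_VALUE */

struct policy {
    unsigned int min; /* kHz, 0 if the driver left it untouched */
    unsigned int max; /* kHz, 0 if the driver left it untouched */
};

struct qos_requests {
    int min;
    int max;
};

/* Use driver-provided limits when set, otherwise the QoS defaults. */
static struct qos_requests init_qos(const struct policy *p)
{
    struct qos_requests r;

    assert(p);
    r.min = p->min ? (int)p->min : QOS_MIN_DEFAULT;
    r.max = p->max ? (int)p->max : QOS_MAX_DEFAULT;
    return r;
}
```

A driver that only sets a minimum therefore gets its minimum honored
while the maximum request stays at the default.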

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
[ rjw: Changelog rewrite ]
Link: https://patch.msgid.link/20260528090913.2759118-5-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Remove driver default policy->min/max init

Prior to commit 521223d8b3ec ("cpufreq: Fix initialization of min and
max frequency QoS requests"), drivers were setting policy->min/max and
these values were used as initial policy QoS constraints.

After the above commit, these values are only used temporarily, as
cpufreq_set_policy() ultimately overrides them through:

cpufreq_policy_online()
\-cpufreq_init_policy()
  \-cpufreq_set_policy()
    \-/* Set policy->min/max */

A subsequent change will restore the previous behavior allowing
drivers to request special min/max QoS frequencies instead of
FREQ_QOS_MIN_DEFAULT_VALUE and FREQ_QOS_MAX_DEFAULT_VALUE, respectively,
if desired. For instance, the CPPC driver wants to advertise the lowest
non-linear frequency that should be used as the initial minimum
frequency QoS request.

However, for this purpose, all drivers setting policy->min/max to
policy->cpuinfo.min/max_freq, respectively, need to be updated so
their initial policy->min/max settings don't limit the frequency
scaling unnecessarily going forward (which would defeat the purpose
of commit 521223d8b3ec), so do that.

This does not actually alter the observed behavior of all of
the drivers in question because setting policy->min/max to
policy->cpuinfo.min/max_freq, respectively, is not necessary or
even useful any more after a previous change ("cpufreq: Set default
policy->min/max values for all drivers").

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Jie Zhan <zhanjie9@hisilicon.com>
[ rjw: Changelog rewrite ]
Link: https://patch.msgid.link/20260528090913.2759118-4-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Set default policy->min/max values for all drivers

Some drivers set policy->min/max in their .init() callback, but
cpufreq_set_policy() will ultimately override them through:

cpufreq_policy_online()
\-cpufreq_init_policy()
  \-cpufreq_set_policy()
    \-/* Set policy->min/max */

Thus the policy min/max values set by the drivers are only temporary.

There is an exception if CPUFREQ_NEED_INITIAL_FREQ_CHECK is set and
cpufreq_policy_online() calls __cpufreq_driver_target() which invokes
cpufreq_driver->target().

To prepare for a subsequent change that will remove all initialization
of policy->min/max in driver .init() callbacks if the min/max value is
equal to the corresponding cpuinfo.min/max_freq, set default
policy->min/max values in the core for all drivers.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
[ rjw: Edits of the new comment and changelog ]
Link: https://patch.msgid.link/20260528090913.2759118-3-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Extract cpufreq_policy_init_qos() function

Extract the QoS-related logic from cpufreq_policy_online()
to make that function shorter/simpler.

The logic is placed in cpufreq_policy_init_qos() and is
now executed right after the following calls:

- cpufreq_driver->init()
- cpufreq_table_validate_and_sort()

This facilitates subsequent changes that will, in
cpufreq_policy_init_qos():

- Set a default policy->min/max value for all policies.
- Use the policy->min/max values set by drivers as initial request
  values for policy frequency QoS requests.

No functional change.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20260528090913.2759118-2-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

RDMA: During rereg_mr ensure that REREG_ACCESS is compatible

If IB_MR_REREG_ACCESS changes from RO to RW then the umem has to be
re-evaluated to ensure it is properly pinned as RW. Since the umem is
hidden inside each driver's mr struct add a ib_umem_check_rereg() function
that each driver has to call before processing IB_MR_REREG_ACCESS.

mlx4 has to retain its duplicate ib_access_writable check because it
implements IB_MR_REREG_ACCESS | IB_MR_REREG_TRANS by changing both items
in place sequentially while the MR is live, so it will continue to not
support this combination.

Cc: stable@vger.kernel.org
Fixes: b40656aa7d55 ("RDMA/umem: remove FOLL_FORCE usage")
Link: https://patch.msgid.link/r/0-v1-06fb1a2d6cf5+107-rereg_access_jgg@nvidia.com
Reported-by: Philip Tsukerman <philiptsukerman@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Documentation: KVM: Synchronize x86 VM types

KVM has reflected KVM_X86_SNP_VM to userspace since 1dfe571c12cf
("KVM: SEV: Add initial SEV-SNP support"), and KVM_X86_TDX_VM since
161d34609f9b ("KVM: TDX: Make TDX VM type supported"). Update the
documentation to reflect this fact.

Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Fixes: 161d34609f9b ("KVM: TDX: Make TDX VM type supported")
Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260603114504.814647-2-clopez@suse.de
[sean: use one tab instead of two]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Add regression test for mediated PMU fixed counter filter bug

Add a regression test where KVM would inadvertently ignore PMU event
filters on writes that change _some_ bits in FIXED_CTR_CTRL, but not the
enable bits for PMCs that are denied to the guest.

Link: https://patch.msgid.link/20260603231905.1738487-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Use hardware value when reprogramming for FIXED_CTR_CTRL changes

When (conditionally) reprogramming fixed counters, use the hardware value
of FIXED_CTR_CTRL to detect changes, not the guest's original value. For
guests with a mediated PMU, overwriting fixed_ctr_ctrl_hw at the start of
reprogramming without actually reacting to changes in fixed_ctr_ctrl_hw can
lead to KVM ignoring PMU event filters.

E.g. if the guest attempts to enable a fixed PMC that is disallowed, and
then toggles a different PMC in a subsequent WRMSR, KVM will update
pmu->fixed_ctr_ctrl_hw and reprogram the PMC that is changing, but not the
others that are now effectively enabled in pmu->fixed_ctr_ctrl_hw.

Note, the perf-based PMU is unaffected, as it doesn't use fixed_ctr_ctrl_hw
(which is also why keying off fixed_ctr_ctrl_hw works for both PMUs).

Note #2, fixed_ctr_ctrl_hw won't mess up pmc_in_use either, because the
latter isn't used by the mediated PMU. Its purpose is solely to release
perf events that are no longer being actively used, and the mediated PMU
obviously doesn't create perf events.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260528005419.0228F1F00A3A@smtp.kernel.org
Link: https://patch.msgid.link/20260603231905.1738487-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: hyper-v: Bound the bank index when querying sparse banks

When checking if a VP ID is included in a sparse bank set, explicitly check
that the ID can actually be contained in a sparse bank (the TLFS allows for
a maximum of 64 banks of 64 vCPUs each).  When handling a paravirtual TLB
flush for L2, the VP ID is copied verbatim from the enlightened VMCS,
without any bounds check, i.e. isn't guaranteed to be under the limit of
4096.

Failure to check the bounds of the VP ID leads to an out-of-bounds read
when testing the sparse bank, and super strictly speaking could lead to KVM
performing an unnecessary TLB flush for an L2 vCPU.

  ==================================================================
  BUG: KASAN: use-after-free in hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
  Read of size 8 at addr ffff88811ba5f598 by task hyperv_evmcs/2802

  CPU: 12 UID: 1000 PID: 2802 Comm: hyperv_evmcs Not tainted 7.1.0-rc2 #7 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x51/0x60
   print_report+0xcb/0x5d0
   kasan_report+0xb4/0xe0
   kasan_check_range+0x35/0x1b0
   hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
   kvm_hv_flush_tlb+0xe9e/0x16c0 [kvm]
   kvm_hv_hypercall+0xe6b/0x1e60 [kvm]
   vmx_handle_exit+0x485/0x1b60 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x22e3/0x5070 [kvm]
   kvm_vcpu_ioctl+0x5d0/0x10c0 [kvm]
   __x64_sys_ioctl+0x129/0x1a0
   do_syscall_64+0xb9/0xcf0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f0e62d1a9bf
   </TASK>

  The buggy address belongs to the physical page:
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x11ba5f
  flags: 0x4000000000000000(zone=1)
  raw: 4000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  raw: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff88811ba5f480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ffff88811ba5f500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  >ffff88811ba5f580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                              ^
   ffff88811ba5f600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ffff88811ba5f680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ==================================================================
  Disabling lock debugging due to kernel taint

Opportunistically add a compile time assertion to ensure the maximum number
of sparse banks exactly matches the number of possible bits in the passed
in mask.
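
For illustration, a standalone model of a bounded sparse-set lookup with
a compile-time assertion of the kind described. The constants,
count_bits() and vp_in_sparse_set() are stand-ins written for this
sketch; the kernel function's exact signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SPARSE_BANKS 64 /* limit described in the changelog */
#define VPS_PER_BANK     64

static_assert(MAX_SPARSE_BANKS == sizeof(uint64_t) * 8,
              "one bit in the valid-bank mask per possible bank");

static unsigned int count_bits(uint64_t v)
{
    unsigned int n = 0;

    for (; v; v &= v - 1)
        n++;
    return n;
}

/* banks[] is packed: it holds one entry per bit set in valid_bank_mask. */
static bool vp_in_sparse_set(uint32_t vp_id, uint64_t valid_bank_mask,
                             const uint64_t *banks)
{
    uint32_t bank = vp_id / VPS_PER_BANK;
    uint32_t bit = vp_id % VPS_PER_BANK;
    unsigned int idx;

    /* Without this bound, IDs of 4096 and above index past the mask. */
    if (bank >= MAX_SPARSE_BANKS)
        return false;

    if (!(valid_bank_mask & (1ULL << bank)))
        return false;

    idx = count_bits(valid_bank_mask & ((1ULL << bank) - 1));
    return (banks[idx] & (1ULL << bit)) != 0;
}
```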

Cc: stable@vger.kernel.org
Fixes: c58a318f6090 ("KVM: x86: hyper-v: L2 TLB flush")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/aiQyZIJtO-2Aj_xN@v4bel
[sean: add KASAN splat, drop comment, add assert, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: guest_memfd: fix NUMA interleave index double-counting

kvm_gmem_get_policy() sets the interleave index (the output param that's
typically named "ilx") to the full page offset (vm_pgoff + vma offset).
But get_vma_policy() adds the page offset on top of the interleave index,
and so the offset is counted twice. This causes NUMA interleaving to skip
nodes: for order-0 pages the effective index jumps by 2 for each
consecutive page.

The vm_op.get_policy() implementation should return only a per-file bias in
the interleave index (like shmem_get_policy does with inode->i_ino),
letting get_vma_policy() add the page-offset component.

Fix by setting the output interleave index to the inode number (a la shmem)
instead of the full page offset, as the index is intended to be a constant,
semi-random value for a given file, e.g. so that interleaving doesn't start
at the same node for every file, and so that allocations are round-robined
across nodes based on the page offset (the selected node would bounce/skip
around if the index isn't constant).
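
The double-counting can be seen with a deliberately simplified model in
which the node is (index + page offset) modulo the node count. The real
node selection in the mempolicy code is more involved; interleave_node(),
node_buggy() and node_fixed() are hypothetical helpers for this sketch.

```c
#include <assert.h>

/* Simplified model: the core adds the page offset to the index. */
static unsigned long interleave_node(unsigned long ilx, unsigned long pgoff,
                                     unsigned long nr_nodes)
{
    assert(nr_nodes > 0);
    return (ilx + pgoff) % nr_nodes;
}

/* Before the fix: the callback already put the page offset in the index. */
static unsigned long node_buggy(unsigned long pgoff, unsigned long nr_nodes)
{
    return interleave_node(pgoff, pgoff, nr_nodes);
}

/* After the fix: the index is a per-file constant such as the inode number. */
static unsigned long node_fixed(unsigned long ino, unsigned long pgoff,
                                unsigned long nr_nodes)
{
    return interleave_node(ino, pgoff, nr_nodes);
}
```

With four nodes, the buggy variant only ever visits nodes 0 and 2 for
consecutive pages, while the fixed variant walks all four in order,
starting from a node that depends on the file.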

Found by Sashiko (sashiko.dev) AI code review.

Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()")
Link: https://patch.msgid.link/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com
[sean: use reverse fir-tree, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

doc: security: Add documentation of exporting and deleting IMA measurements

Add the documentation of exporting and deleting IMA measurements in
Documentation/security/IMA-export-delete.rst.

Also add the missing Documentation/security/IMA-templates.rst file in
MAINTAINERS.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Support staging and deleting N measurements records

Add support for sending a value N between 1 and ULONG_MAX to the IMA
original measurement interface. This value represents the number of
measurements that should be deleted from the current measurements list. In
this case, measurements are staged in an internal non-user visible list,
and immediately deleted.

This staging method allows the remote attestation agents to easily separate
the measurements that were verified (staged and deleted) from those that
weren't due to the race between taking a TPM quote and reading the
measurements list.

In order to minimize the locking time of ima_extend_list_mutex, deleting
N records is realized by doing a lockless walk in the current measurements
list to determine the N-th entry to cut, cutting the current measurements
list under the lock, and deleting the excess records after releasing the
lock.
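
The three phases can be sketched on a plain singly linked list. This is
a single-threaded model with comments marking where the lock would be
taken; it assumes the list is append-only (which is what would make the
lockless walk safe) and that asking for more records than exist cuts the
whole list. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct record {
    struct record *next;
    int id;
};

struct list {
    struct record *head;
};

static void append(struct list *l, int id)
{
    struct record *r = calloc(1, sizeof(*r)), **pp = &l->head;

    assert(r);
    r->id = id;
    while (*pp)
        pp = &(*pp)->next;
    *pp = r;
}

/* Detach the first n records and return them as a separate chain. */
static struct record *cut_first_n(struct list *l, unsigned long n)
{
    struct record *first = l->head, *last = first;

    if (!first || n == 0)
        return NULL;

    /* Phase 1 (no lock held): walk to the N-th record. */
    for (unsigned long i = 1; i < n && last->next; i++)
        last = last->next;

    /* Phase 2 (the only part that would run under the lock): cut. */
    l->head = last->next;
    last->next = NULL;

    return first;
}

/* Phase 3 (after unlocking): free the detached records. */
static unsigned long free_chain(struct record *r)
{
    unsigned long freed = 0;

    while (r) {
        struct record *next = r->next;

        free(r);
        r = next;
        freed++;
    }
    return freed;
}
```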

Flushing the hash table is not supported for N records, since it would
require removing the N records one by one from the hash table under the
ima_extend_list_mutex lock, which would increase the locking time.

Link: https://github.com/linux-integrity/linux/issues/1
Co-developed-by: Steven Chen <chenste@linux.microsoft.com>
Signed-off-by: Steven Chen <chenste@linux.microsoft.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Add support for flushing the hash table when staging measurements

During staging and delete, measurements are not completely deallocated.
Their entry digest portion is kept and is still reachable with the hash
table to detect duplicate records. If the number of records is significant,
this reduces the memory saving benefit of staging.

Some users might be interested in achieving the best memory saving (the
measurements are completely deallocated) at the cost of having duplicate
records across the staged measurement lists. Duplicate records are still
avoided within the current measurement list.

Introduce the new kernel option ima_flush_htable to decide whether or not
the digests of staged measurement records are flushed from the hash table,
when they are deleted, to achieve the maximum memory saving.

When the option is enabled, replace the old hash table with a new one,
by calling ima_alloc_replace_htable(), and completely delete the
measurements records.

Note: This code derives from the Alt-IMA Huawei project, whose license is
GPL-2.0 OR MIT.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Add support for staging measurements with prompt

Introduce the ability to stage the IMA measurement list and delete the
staged records with a prompt.

Staging means moving the current measurement list records to a separate
location, and allowing users to read and delete it. This causes the current
measurement list to be emptied (since records were moved) and new
measurements to be added on the empty list. Staging can be done only once
at a time. In the event of kexec(), staging is aborted and staged records
will be carried over to the new kernel.

Introduce ascii_runtime_measurements_<algo>_staged and
binary_runtime_measurements_<algo>_staged interfaces to access and delete
the measurements.

Use 'echo A > <IMA _staged interface>' and
'echo D > <IMA _staged interface>' to respectively stage and delete the
entire measurements list. Locking of these interfaces is also mediated with
a call to _ima_measurements_open() and with ima_measurements_release().

Implement the staging functionality by introducing the new global
measurements list ima_measurements_staged, and ima_queue_stage() and
ima_queue_staged_delete_all() to respectively move measurements from the
current measurements list to the staged one, and to move staged
measurements to the ima_measurements_trim list for deletion. Introduce
ima_queue_delete() to delete the measurements.

Staging is forbidden after measurement is suspended, and between staging
and deleting, so that walking the staged and current measurements list can
be done locklessly in ima_dump_measurement_list(). Strict ordering of
suspending and dumping is enforced by two reboot notifiers with different
priority. Refusing to delete staged measurements also signals to user space
that those measurements are already carried over to the secondary kernel,
so that it does not save them twice.

Finally, introduce the BINARY_STAGED and BINARY_FULL binary measurements
list types, to maintain the counters and the binary size of staged
measurements and the full measurements list (including records that were
staged). BINARY still represents the current binary measurements list.

Use the binary size for the BINARY + BINARY_STAGED types in
ima_add_kexec_buffer(), since both measurements list types are copied to
the secondary kernel during kexec. Use BINARY_FULL in
ima_measure_kexec_event(), to generate a critical data record.

It should be noted that the BINARY_FULL counter is not passed through
kexec. Thus, the number of records included in the kexec critical data
records refers to the records since the critical data records generated
from the previous kexec event.

Note: This code derives from the Alt-IMA Huawei project, whose license is
GPL-2.0 OR MIT.

Link: https://github.com/linux-integrity/linux/issues/1
Suggested-by: Gregory Lumen <gregorylumen@linux.microsoft.com> (staging revert)
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Tested-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce ima_dump_measurement()

Introduce ima_dump_measurement() to simplify the code of
ima_dump_measurement_list() and to avoid repeating the
ima_dump_measurement() code block if iteration occurs on multiple lists.

No functional change: only code moved to a separate function.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Use snprintf() in create_securityfs_measurement_lists

Use the more secure snprintf() function (accepting the buffer size) in
create_securityfs_measurement_lists().

No functional change: sprintf() and snprintf() have the same behavior.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Mediate open/release method of the measurements list

Introduce the ima_measure_users counter, to implement a semaphore-like
locking scheme where the binary and ASCII measurements list interfaces can
be concurrently opened by multiple readers, or alternatively by a single
writer. In addition, allow the same writer to open the other interfaces for
write or read/write, so that it can see the same measurement state across
all the interfaces.

A semaphore cannot be used because the kernel cannot return to user space
with a lock held.

Introduce the ima_measure_lock() and ima_measure_unlock() primitives, to
respectively lock/unlock the interfaces (safely with the ima_measure_users
counter, without holding a lock).
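
One way such a counter-based scheme could look is sketched below. The
encoding (positive for readers, -1 for a writer) is an assumption made
for this example, the model is single-threaded, and it omits the case
where the same writer reopens another interface. measure_lock() and
measure_unlock() are hypothetical names, not the functions from the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* 0: free, > 0: number of readers, -1: one writer (assumed encoding). */
static long measure_users;

/* Try to take the interface; returns false if the caller must back off. */
static bool measure_lock(bool write)
{
    if (write) {
        if (measure_users != 0)
            return false;
        measure_users = -1;
        return true;
    }

    if (measure_users < 0)
        return false;
    measure_users++;
    return true;
}

static void measure_unlock(bool write)
{
    if (write) {
        assert(measure_users == -1);
        measure_users = 0;
    } else {
        assert(measure_users > 0);
        measure_users--;
    }
}
```

Because only the counter records ownership, nothing is held across the
return to user space. A real implementation would still need to make
each counter update atomic, for example with a short-lived lock.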

Finally, introduce _ima_measurements_open() to lock the interface before
seq_open(), and call it from ima_measurements_open() and
ima_ascii_measurements_open(). And, introduce ima_measurements_release(),
to unlock the interface.

Require CAP_SYS_ADMIN if the interface is opened for write (not possible
for the current measurements interfaces, since they only have read
permission).

No functional changes: multiple readers are allowed as before.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce _ima_measurements_start() and _ima_measurements_next()

Introduce _ima_measurements_start() and _ima_measurements_next(), renamed
from ima_measurements_start() and ima_measurements_next(), to include the
list head as an additional parameter, so that iteration on different lists
can be implemented by calling those functions.

No functional change: ima_measurements_start() and ima_measurements_next()
pass the ima_measurements list head, used before. They become wrappers for
the new functions.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce per binary measurements list type binary_runtime_size value

Make binary_runtime_size an array, to have separate counters per binary
measurements list type. Currently, define the BINARY type for the existing
binary measurements list.

Introduce ima_update_binary_runtime_size() to facilitate updating a
binary_runtime_size value with a given binary measurement list type.

Also add the binary measurements list type parameter to
ima_get_binary_runtime_size(), to retrieve the desired value. Retrieving
the value is now done under the ima_extend_list_mutex, since there can be
concurrent updates.

No functional change (except for the mutex usage, that fixes the
concurrency issue): the BINARY array element is equivalent to the old
binary_runtime_size.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce per binary measurements list type ima_num_records counter

Make ima_num_records an array, to have separate counters per binary
measurements list type. Currently, define the BINARY type for the existing
binary measurements list.

No functional change: the BINARY type is equivalent to the value without
the array.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Replace static htable queue with dynamically allocated array

The IMA hash table is a fixed-size array of hlist_head buckets:

struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE];

IMA_MEASURE_HTABLE_SIZE is (1 << IMA_HASH_BITS) = 1024 buckets, each a
struct hlist_head (one pointer, 8 bytes on 64-bit). That is 8 KiB allocated
in BSS for every kernel, regardless of whether IMA is ever used, and
regardless of how many measurements are actually made.

Replace the fixed-size array with an RCU-protected pointer to a dynamically
allocated array that is initialized in ima_init_htable(), which is called
from ima_init() during early boot. ima_init_htable() calls the static
function ima_alloc_replace_htable() which, other than initializing the hash
table the first time, can also hot-swap the existing hash table with a
blank one.

The allocation in ima_alloc_replace_htable() uses kcalloc() so the buckets
are zero-initialised (equivalent to HLIST_HEAD_INIT { .first = NULL }).
Callers of ima_alloc_replace_htable() must call synchronize_rcu() and free
the returned hash table.
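
A userspace analogue of the allocate-and-swap step, using calloc() in
place of kcalloc() and a plain pointer in place of the RCU-protected
one. struct bucket and alloc_replace_table() are hypothetical; the
kernel function's signature is not given in this changelog.

```c
#include <assert.h>
#include <stdlib.h>

#define HTABLE_SIZE 1024 /* bucket count quoted in the changelog */

static_assert((HTABLE_SIZE & (HTABLE_SIZE - 1)) == 0,
              "bucket count is a power of two");

struct bucket {
    void *first; /* NULL means an empty bucket */
};

static struct bucket *htable; /* an RCU-protected pointer in the kernel */

/*
 * Install a fresh, zeroed table and hand the previous one back through
 * *old (NULL on the first call). Returns 0 on success, -1 on failure.
 */
static int alloc_replace_table(struct bucket **old)
{
    struct bucket *fresh = calloc(HTABLE_SIZE, sizeof(*fresh));

    if (!fresh)
        return -1;

    *old = htable;
    htable = fresh;
    return 0;
}
```

In the kernel, the caller would wait for existing readers (the
synchronize_rcu() mentioned above) before freeing the old table; the
plain free() in a userspace test has no such requirement.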

Finally, access the hash table with rcu_dereference() in
ima_lookup_digest_entry() (reader side) and with
rcu_dereference_protected() in ima_add_digest_entry() (writer side).

No functional change: bucket count, hash function, and all locking remain
identical.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Remove ima_h_table structure

The ima_h_table structure is a collection of IMA measurement list
metadata - number of records in the IMA measurement list, number of
integrity violations, and a hash table containing the IMA template data
hash, needed to prevent measurement list record duplication.

Removing records from the measurement list needs to be reflected in the
hash table. As a pre-req to removing records from the measurement list,
separate those counters from the hash table, remove the ima_h_table
structure, and just replace the hash table pointer.

Finally, rename ima_show_htable_value(), ima_show_htable_violations()
and ima_htable_violations_ops respectively to ima_show_counter(),
ima_show_num_violations() and ima_num_violations_ops.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

i2c: riic: fix refcount leak in riic_i2c_resume_noirq()

When riic_i2c_resume_noirq() is called, it deasserts the reset
using reset_control_deassert(), which for shared resets increments
a reference count. If pm_runtime_force_resume() then fails, the
function returns without calling reset_control_assert() to
decrement the count. This leaves the reset deasserted and the
reference count unbalanced, which can prevent other users of the
shared reset from properly asserting it later.

Fix the leak by calling reset_control_assert() on the error
handling path for a failed pm_runtime_force_resume().
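
The shape of the fix can be modeled with a counter standing in for the
shared reset's reference count. Every function here is a stand-in
written for this sketch and is not the reset or runtime PM API.

```c
#include <assert.h>

static int deassert_count; /* models the shared reset's reference count */

static int reset_deassert(void)
{
    deassert_count++;
    return 0;
}

static void reset_assert(void)
{
    assert(deassert_count > 0);
    deassert_count--;
}

/* Two stand-ins for the runtime PM resume step. */
static int resume_ok(void)   { return 0; }
static int resume_fail(void) { return -5; }

static int resume_noirq(int (*force_resume)(void))
{
    int ret = reset_deassert();

    if (ret)
        return ret;

    ret = force_resume();
    if (ret) {
        reset_assert(); /* undo the deassert so the count stays balanced */
        return ret;
    }

    return 0;
}
```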

Fixes: e383f0961422 ("i2c: riic: Move suspend handling to NOIRQ phase")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Cc: <stable@vger.kernel.org> # v6.19+
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260608071123.128964-1-vulab@iscas.ac.cn

Merge tag 'v7.1-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:

- Fix random config build failure on s390.

* tag 'v7.1-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: s390 - add select CRYPTO_AEAD for aes

power: sequencing: pcie-m2: Add PCI ID 0x1103 for WCN6855 Bluetooth

WCN6855 is a Qualcomm Wi-Fi/BT combo chip that uses PCI device ID
0x1103. Add it to pwrseq_m2_pci_ids[] alongside the existing 0x1107
(WCN7850) entry, so that the pwrseq-pcie-m2 driver creates a Bluetooth
serdev device for WCN6855 cards inserted into PCIe M.2 Key E connectors.

Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Wei Deng <wei.deng@oss.qualcomm.com>
Link: https://patch.msgid.link/20260608091702.3797437-2-wei.deng@oss.qualcomm.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

gpio: mvebu: fix NULL pointer dereference in suspend/resume

mvebu_pwm_suspend() and mvebu_pwm_resume() are called for all GPIO
banks during suspend/resume, but not all banks have PWM functionality.
GPIO banks without PWM have mvchip->mvpwm set to NULL.

Calling mvebu_pwm_suspend() with mvpwm == NULL causes a NULL pointer
dereference when it tries to access mvpwm->blink_select.

  Unable to handle kernel NULL pointer dereference at virtual address 00000020 when write
  [00000020] *pgd=00000000
  Internal error: Oops: 815 [#1] PREEMPT ARM
  Modules linked in:
  CPU: 0 UID: 0 PID: 406 Comm: sh Not tainted 6.12.74-rt12-yocto-standard-g4e96f98fb7db-dirty #353
  Hardware name: Marvell Armada 370/XP (Device Tree)
  PC is at regmap_mmio_read+0x38/0x54
  LR is at regmap_mmio_read+0x38/0x54
  pc : [<c05fd2ac>]    lr : [<c05fd2ac>]    psr: 200f0013
  sp : f0c11d10  ip : 00000000  fp : c100d2f0
  r10: c14fb854  r9 : 00000000  r8 : 00000000
  r7 : c1799c00  r6 : 00000020  r5 : 00000020  r4 : c179c7c0
  r3 : f0a231a0  r2 : 00000020  r1 : 00000020  r0 : 00000000
  Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
  Control: 10c5387d  Table: 135ec059  DAC: 00000051
  Call trace:
   regmap_mmio_read from _regmap_bus_reg_read+0x78/0xac
   _regmap_bus_reg_read from _regmap_read+0x60/0x154
   _regmap_read from regmap_read+0x3c/0x60
   regmap_read from mvebu_gpio_suspend+0xa4/0x14c
   mvebu_gpio_suspend from dpm_run_callback+0x54/0x180
   dpm_run_callback from device_suspend+0x124/0x630
   device_suspend from dpm_suspend+0x124/0x270
   dpm_suspend from dpm_suspend_start+0x64/0x6c
   dpm_suspend_start from suspend_devices_and_enter+0x140/0x8e8
   suspend_devices_and_enter from pm_suspend+0x2fc/0x308
   pm_suspend from state_store+0x6c/0xc8
   state_store from kernfs_fop_write_iter+0x10c/0x1f8
   kernfs_fop_write_iter from vfs_write+0x270/0x468
   vfs_write from ksys_write+0x70/0xf0
   ksys_write from ret_fast_syscall+0x0/0x54

Add a NULL check for mvchip->mvpwm before calling the PWM
suspend/resume functions.
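
The shape of the guard can be sketched in user space as follows. This is
an illustration only: struct mock_pwm, struct mock_chip and both
functions are hypothetical and merely mirror the names used above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the PWM state and the GPIO bank. */
struct mock_pwm {
	unsigned int blink_select;
	unsigned int saved_blink_select;
};

struct mock_chip {
	struct mock_pwm *mvpwm;	/* NULL for banks without PWM */
};

static void mock_pwm_suspend(struct mock_pwm *mvpwm)
{
	/* Dereferences mvpwm, so it must never be called with NULL. */
	mvpwm->saved_blink_select = mvpwm->blink_select;
}

/* Returns 1 if PWM state was saved, 0 if the bank has no PWM. */
static int mock_gpio_suspend(struct mock_chip *chip)
{
	if (!chip->mvpwm)
		return 0;	/* the added NULL check */

	mock_pwm_suspend(chip->mvpwm);
	return 1;
}
```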

Fixes: 757642f9a584 ("gpio: mvebu: Add limited PWM support")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Link: https://patch.msgid.link/20260608084334.2960803-1-yun.zhou@windriver.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

io_uring/net: support registered buffer for plain send and recv

So far IORING_RECVSEND_FIXED_BUF is only honoured on the SEND_ZC path.
The import wiring is already present for plain send, but it is
completely absent for recv. Targets such as ublk's NBD backend want to
push/pull I/O data directly to/from an io_uring registered buffer over a
plain send/recv on a TCP socket.

Wire IORING_RECVSEND_FIXED_BUF into the plain IORING_OP_SEND and
IORING_OP_RECV paths:

- Accept the flag in SENDMSG_FLAGS / RECVMSG_FLAGS and, at prep time,
   restrict it to the non-vectorized IORING_OP_SEND / IORING_OP_RECV
   opcodes. It is mutually exclusive with buffer select, bundles and
   (for recv) multishot, and records sqe->buf_index.

- For recv, set REQ_F_IMPORT_BUFFER in setup so the registered buffer
   is imported lazily at issue time, mirroring the send path.

- In io_send()/io_recv(), import the registered buffer via
   io_import_reg_buf() (ITER_SOURCE for send, ITER_DEST for recv) and
   clear REQ_F_IMPORT_BUFFER. The resulting bvec iter persists in
   async_data, so MSG_WAITALL partial send/recv retries resume at the
   right offset.
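
The prep-time restrictions listed above can be sketched as a small
standalone check. Everything here is hypothetical: the SKETCH_* names
and sketch_prep_check() are placeholders, not the io_uring identifiers,
and the -1 return value stands in for whatever error the kernel returns.

```c
#include <stdbool.h>

/* Hypothetical opcode and flag names, for illustration only. */
enum sketch_op {
	SKETCH_OP_SEND,
	SKETCH_OP_RECV,
	SKETCH_OP_SENDMSG,
	SKETCH_OP_RECVMSG,
};

#define SKETCH_FIXED_BUF	(1u << 0)
#define SKETCH_BUNDLE		(1u << 1)
#define SKETCH_MULTISHOT	(1u << 2)

/* Returns 0 if the flag combination is acceptable, -1 otherwise. */
static int sketch_prep_check(enum sketch_op op, unsigned int flags,
			     bool buffer_select)
{
	if (!(flags & SKETCH_FIXED_BUF))
		return 0;	/* nothing to restrict */

	/* Only the non-vectorized opcodes may use a registered buffer. */
	if (op != SKETCH_OP_SEND && op != SKETCH_OP_RECV)
		return -1;

	/* Mutually exclusive with buffer select and bundles. */
	if (buffer_select || (flags & SKETCH_BUNDLE))
		return -1;

	/* For recv, also mutually exclusive with multishot. */
	if (op == SKETCH_OP_RECV && (flags & SKETCH_MULTISHOT))
		return -1;

	return 0;
}
```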

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260608142511.659240-2-ming.lei@redhat.com
[axboe: combine flags checks]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'hyperv-fixes-signed-20260607' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

- MSHV driver fixes from various people (Anirudh Rayabharam, Can Peng,
   Dexuan Cui, Michael Kelley, Jork Loeser, Wei Liu)

- Hyper-V user space tools fixes (Thorsten Blum)

- Allow VMBus to be unloaded after frame buffer is flushed (Michael
   Kelley)

* tag 'hyperv-fixes-signed-20260607' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  mshv: support 1G hugepages by passing them as 2M-aligned chunks
  Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
  mshv: use kmalloc_array in mshv_root_scheduler_init
  mshv: Add conditional VMBus dependency
  hyperv: Clean up and fix the guest ID comment in hvgdk.h
  drm/hyperv: During panic do VMBus unload after frame buffer is flushed
  Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
  mshv: unmap debugfs stats pages on kexec
  mshv: clean up SynIC state on kexec for L1VH
  mshv: limit SynIC management to MSHV-owned resources
  hv: utils: replace deprecated strcpy with strscpy in kvp_register
  hv: utils: handle and propagate errors in kvp_register
  mshv: add a missing padding field

ntfs: fix u16 truncation of restart-area length check

ntfs_check_restart_area() validates that the $LogFile restart area and
its trailing log client record array fit within the system page size:

        u16 ra_ofs, ra_len, ca_ofs;
        ...
        ra_len = ca_ofs + le16_to_cpu(ra->log_clients) *
                        sizeof(struct log_client_record);
        if (ra_ofs + ra_len > le32_to_cpu(rp->system_page_size) || ...)
                return false;

ra_len is u16, but the right-hand side is computed in size_t
(sizeof(struct log_client_record) == 160). Both ca_ofs and log_clients
come straight from the on-disk restart area. With an on-disk
log_clients of 410 the product 410 * 160 = 65600; adding ca_ofs and
storing into the u16 ra_len truncates modulo 65536 (e.g. ca_ofs 64
gives ra_len 128), so the "fits in the page" check passes even though
the client array described by log_clients extends far beyond the page.

ntfs_check_log_client_array() then walks the array bounded only by the
on-disk log_clients count:

        cr = ca + idx;
        if (cr->prev_client != LOGFILE_NO_CLIENT) ...

For log_clients 410 it dereferences records up to ca + 409 * 160,
~64 KiB past the kvzalloc(system_page_size) restart-page buffer -- an
out-of-bounds read of attacker-controlled extent, reachable when a
crafted NTFS image is mounted (load_and_check_logfile() at mount time).
This is the in-kernel analogue of CVE-2022-30789, fixed in the ntfs-3g
userspace driver but never in this revived classic driver.

Compute the restart-area length in a u32 so the existing bounds check
rejects an over-large client array instead of being defeated by the
truncation. Widen ra_ofs and ca_ofs to u32 as well: both are loaded
from __le16 on-disk fields and every comparison already promotes to
int/size_t, so this changes no result and keeps the declaration uniform.
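
The truncation is easy to reproduce in user space. The following is a
minimal sketch with hypothetical helper names; the record size of 160
is taken from the text above, and the ra_ofs and page size used in the
usage note are arbitrary example values.

```c
#include <stdbool.h>
#include <stdint.h>

/* sizeof(struct log_client_record), per the description above. */
#define SKETCH_CLIENT_RECORD_SIZE 160u

/* Old behaviour: the sum is stored into a 16-bit variable and wraps. */
static uint16_t sketch_ra_len_u16(uint16_t ca_ofs, uint16_t log_clients)
{
	return (uint16_t)(ca_ofs + log_clients * SKETCH_CLIENT_RECORD_SIZE);
}

/* Fixed behaviour: the sum is kept in 32 bits, so nothing is lost. */
static uint32_t sketch_ra_len_u32(uint16_t ca_ofs, uint16_t log_clients)
{
	return (uint32_t)ca_ofs +
	       (uint32_t)log_clients * SKETCH_CLIENT_RECORD_SIZE;
}

/* The "fits in the page" check applied to either length. */
static bool sketch_fits_in_page(uint32_t ra_ofs, uint32_t ra_len,
				uint32_t page_size)
{
	return ra_ofs + ra_len <= page_size;
}
```

For ca_ofs 64 and log_clients 410 the 16-bit variant yields 128 and the
32-bit variant 65664, so with an example ra_ofs of 48 and a 4096-byte
page only the 32-bit length fails the bounds check.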

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

ntfs: bound the attribute-list entry in ntfs_read_inode_mount()

The $MFT attribute-list walk in ntfs_read_inode_mount() validates each
entry only with "(u8 *)al_entry + 6 > al_end" and
"(u8 *)al_entry + le16_to_cpu(al_entry->length) > al_end", but then reads
al_entry->lowest_vcn (an __le64 at offset 8) and al_entry->mft_reference
(offset 16) -- fields beyond the 6 bytes proven in range. al_entry->length
is attacker-controlled and only required non-zero, so a short entry (e.g.
length 8) placed at the tail passes both checks while the lowest_vcn /
mft_reference reads fall past al_end.

al_end is ni->attr_list + attr_list_size (the on-disk size); the buffer is
kvzalloc(round_up(attr_list_size, SECTOR_SIZE)), so the sector rounding
usually absorbs the over-read -- but when attr_list_size is a multiple of
SECTOR_SIZE there is no slack and a crafted $MFT attribute list produces an
out-of-bounds read at mount time.
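
The slack argument is plain arithmetic. A minimal sketch, assuming a
SECTOR_SIZE of 512 as in mainline kernels; the helper names are
hypothetical.

```c
#include <stddef.h>

#define SKETCH_SECTOR_SIZE 512u	/* assumed value of SECTOR_SIZE */

/* Round n up to the next multiple of the sector size. */
static size_t sketch_round_up_sector(size_t n)
{
	return (n + SKETCH_SECTOR_SIZE - 1) / SKETCH_SECTOR_SIZE *
	       SKETCH_SECTOR_SIZE;
}

/* Bytes between the on-disk end (al_end) and the end of the buffer. */
static size_t sketch_slack(size_t attr_list_size)
{
	return sketch_round_up_sector(attr_list_size) - attr_list_size;
}
```

An attr_list_size of 1000 leaves 24 bytes of slack that can absorb a
short over-read, while 1024 leaves none.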

Validate the entry with ntfs_attr_list_entry_is_valid() (added in patch
1/3) before dereferencing it, matching the bound the other attribute-list
walks now use. The validator already requires the length to cover the fixed
header, which makes the separate "!al_entry->length" check redundant, so
drop it too.

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

ntfs: bound the look-ahead attribute-list entry in ntfs_external_attr_find()

When resolving an attribute lookup with a non-zero @lowest_vcn,
ntfs_external_attr_find() peeks at the next $ATTRIBUTE_LIST entry to
decide whether to keep searching, but bounds that not-yet-validated
entry only with "(u8 *)next_al_entry + 6 < al_end" (which proves just
bytes 0..6 are in range) and "(u8 *)next_al_entry + length <= al_end"
with an attacker-controlled, non-8-aligned length. It then reads
next_al_entry->lowest_vcn (an __le64 at offset 8) and the name at
next_al_entry->name_offset, both of which can lie past al_end -- the
exact end of the kvmalloc'd attribute-list buffer (allocated at the
on-disk attr_list_size, no rounding). A crafted on-disk $ATTRIBUTE_LIST
whose last entry sits a few bytes before al_end therefore yields a slab
out-of-bounds read when the inode is read.

Validate the look-ahead entry with ntfs_attr_list_entry_is_valid() (added
in patch 1/3) before dereferencing lowest_vcn and the name, so the same
fixed-header, length and name bounds the main attribute-list walk uses now
guard this read too.

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

ntfs: validate resident attribute lists and harden the validator

A base inode's $ATTRIBUTE_LIST is sanity-checked by load_attribute_list()
only on the non-resident path; ntfs_read_locked_inode() copies a *resident*
attribute list into ni->attr_list with a plain memcpy() and no validation
at all. Every subsequent walk of ni->attr_list --
ntfs_external_attr_find(), ntfs_inode_attach_all_extents() and
ntfs_attrlist_need() -- then trusts the entries are well-formed and reads
attr_list_entry fixed-header fields
(lowest_vcn at offset 8, mft_reference at offset 16, and the name) with
bounds that assume validation already happened. A crafted resident
attribute list therefore reaches those walks unvalidated and can drive
out-of-bounds reads of the attribute-list buffer.

load_attribute_list() itself reads ale->name_offset (offset 7),
ale->mft_reference (offset 16) and the name length under only an
"al < al_start + size" bound, so its own validation loop can over-read the
fixed header of a truncated trailing entry by a few bytes.

Factor the per-entry validation into ntfs_attr_list_entry_is_valid(),
which requires each entry's fixed header (offsetof(struct
attr_list_entry, name)) to be in range before any field is dereferenced,
that ale->length is a multiple of 8 covering the fixed header plus the
name, and that the entry is in use and carries a live MFT reference.
ntfs_attr_list_is_valid() walks the buffer with it and checks the entries
tile it exactly. Use the list validator in load_attribute_list()
(replacing the open-coded loop, closing its own over-read) and on the
resident path in ntfs_read_locked_inode() (which previously skipped
validation entirely); patches 2/3 reuse the per-entry helper at the other
two attribute-list walks.
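
The ordering of the checks is the important part: the fixed header must
be proven in range before any field is read. The following user-space
sketch illustrates that ordering only. The field offsets and the header
size of 26 are assumptions about struct attr_list_entry, the name is
assumed to be stored as 2-byte characters, and the "in use" and live MFT
reference checks are left out. sketch_entry_is_valid() is a hypothetical
name, not the kernel helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALE_HDR_SIZE	26u	/* assumed offsetof(..., name) */
#define SKETCH_ALE_OFF_LENGTH	4u	/* assumed offset of le16 length */
#define SKETCH_ALE_OFF_NAMELEN	6u	/* assumed offset of name_length */
#define SKETCH_ALE_OFF_NAMEOFF	7u	/* offset of name_offset */

static bool sketch_entry_is_valid(const uint8_t *entry, const uint8_t *end)
{
	size_t avail, length, name_bytes, name_off;

	if (entry > end)
		return false;
	avail = (size_t)(end - entry);

	/* The whole fixed header must be in range before reading fields. */
	if (avail < SKETCH_ALE_HDR_SIZE)
		return false;

	length = (size_t)entry[SKETCH_ALE_OFF_LENGTH] |
		 ((size_t)entry[SKETCH_ALE_OFF_LENGTH + 1] << 8);
	name_bytes = (size_t)entry[SKETCH_ALE_OFF_NAMELEN] * 2u;
	name_off = entry[SKETCH_ALE_OFF_NAMEOFF];

	/* Length: multiple of 8, covers the header, stays inside the buffer. */
	if (length % 8 || length < SKETCH_ALE_HDR_SIZE || length > avail)
		return false;

	/* A name, if present, must lie after the header and inside the entry. */
	if (name_bytes &&
	    (name_off < SKETCH_ALE_HDR_SIZE || name_off + name_bytes > length))
		return false;

	return true;
}
```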

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

thermal: sysfs: Replace sscanf() with kstrtoul()

Replace sscanf() with kstrtoul() in cur_state_store(), as kstrto<type>
is preferred over single-variable sscanf().
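
One reason for the preference is that a single-variable sscanf() accepts
trailing garbage. The following user-space sketch shows the difference;
kstrtoul() itself is not available outside the kernel, so
sketch_parse_strict() is only a rough analogue built on strtoul(), with
base 10 assumed, and both function names are hypothetical.

```c
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Loose parse: sscanf() stops at the first non-digit and reports success. */
static int sketch_parse_loose(const char *s, unsigned long *out)
{
	return sscanf(s, "%lu", out) == 1 ? 0 : -EINVAL;
}

/*
 * Strict parse: the whole string must be a number, optionally followed
 * by a single newline, which is what a sysfs write usually delivers.
 */
static int sketch_parse_strict(const char *s, unsigned long *out)
{
	char *end;
	unsigned long val;

	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*out = val;
	return 0;
}
```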

Signed-off-by: Ovidiu Panait <ovidiu.panait.oss@gmail.com>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20260606210420.2311145-3-ovidiu.panait.oss@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: testing: Replace sscanf() with kstrtoint()

Generally, kstrtoint() is preferred to sscanf() in kernel code, so
replace the latter with the former in tt_del_tz() and tt_get_tt_zone().

Signed-off-by: Ovidiu Panait <ovidiu.panait.oss@gmail.com>
[ rjw: Changelog rewrite ]
Link: https://patch.msgid.link/20260606210420.2311145-2-ovidiu.panait.oss@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

btrfs: tracepoints: add trace event for log_new_dir_dentries()

log_new_dir_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for log_all_new_ancestors()

log_all_new_ancestors() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for btrfs_log_all_parents()

btrfs_log_all_parents() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for btrfs_log_inode()

btrfs_log_inode() is one of the most important steps called during a fsync,
as well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use a named enum for the log mode in inode log functions

We use an unnamed enum for the log mode and then pass it around log
functions as an int type with the odd name "inode_only", which suggests a
boolean. So add a name to the enum, change the type everywhere to that
enum and rename the parameters to something clearer - "log_mode".
Also move the enum into tree-log.h - it will be used later by new trace
events.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for btrfs_log_inode_parent()

btrfs_log_inode_parent() is one of the most important steps called during
a fsync operation as well as during rename and link operations on inodes
that were previously logged. Add trace events for when entering and
exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for when fsync finishes

Currently we only have a trace event for when a fsync operation starts,
but this alone is not very helpful. Add a trace event for when fsync
finishes, which reports its return value, so that using tracing we can
see which other trace events happened in between (several will be added
soon for inode logging steps) and even measure execution time.

So rename the existing trace event btrfs_sync_file to
btrfs_sync_file_enter and add the trace event btrfs_sync_file_exit.
The naming is similar to what ext4 does (ext4_sync_file_enter and
ext4_sync_file_exit) and with similar information reported.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant writeback error check during fsync

If we can skip logging the inode during fsync, we check for writeback
errors in the inode's mapping by calling filemap_check_wb_err() and then
jump to the 'out_release_extents' label, which in turn jumps to the 'out'
label under which we check again for a writeback error by calling
file_check_and_advance_wb_err(). So the filemap_check_wb_err() ends up
being redundant. This happens since commit 333427a505be ("btrfs: minimal
conversion to errseq_t writeback error reporting on fsync").

Remove the filemap_check_wb_err() call.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: stop checking for greater than zero return values in btrfs_sync_file()

The value of 'ret' can never be greater than zero when we reach the end of
btrfs_sync_file() but we have this ternary operator converting any such
value into -EIO. This logic exists since the first fsync implementation,
added in 2007 by commit 8fd17795b226 ("Btrfs: early fsync support"), when
all that fsync did was simply to commit a transaction, but even a call to
btrfs_commit_transaction() could never return a value greater than zero.

So stop checking for a greater than zero value and assert that 'ret' is
never greater than zero, to catch any eventual regression during future
development.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: trace transaction states during commit phase

Currently the trace event is fired only when a transaction is fully
complete (its state is TRANS_STATE_COMPLETED). However during a
transaction commit we go through several states and as soon as the
state reaches TRANS_STATE_UNBLOCKED, another transaction can start.
Therefore it's useful to track every transaction state change during
the commit of a transaction, so that we can see if a new transaction
is started before the current one is completed. Add the transaction
state to the transaction commit event and call the event every time
we change the transaction state during commit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for the start of a new transaction

While tracing it's useful to know not just when a transaction is committed
or aborted, but also when a new one is started. So add a trace event for
transaction starts.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for transaction aborts

While tracing it's useful to know not just when a transaction is committed
but also when one is aborted. So add a trace event for transaction aborts.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add in_fsync field to transaction commit event

Include the in_fsync value from the transaction handle so that we can know
if a transaction commit was triggered by a fsync call.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: pass a transaction handle to transaction commit event

The transaction commit tracepoint prints fs_info->generation as if it
were the ID of the committed transaction but this does not always match
that ID. This is because the trace point is called in the transaction
commit path after the transaction is in the TRANS_STATE_COMPLETED state,
which means another transaction may have already started (which can happen
as soon as the transaction state was set to TRANS_STATE_UNBLOCKED), in
which case fs_info->generation was incremented and does not correspond
to the committed transaction anymore.

So fix this by passing a transaction handle to the trace event instead of
fs_info. This will also later allow the trace event to dump other
useful information about the transaction.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove call to transaction commit trace in btrfs_cleanup_transaction()

We are not committing a transaction there, plus in subsequent patches we
want to change the argument for the trace event to be a transaction handle
instead of fs_info and in this context we don't have a transaction handle
(struct btrfs_trans_handle, only a struct btrfs_transaction). So remove the
call to the trace point.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove call to transaction commit trace in warn_about_uncommitted_trans()

We are not committing a transaction there, plus in subsequent patches we
want to change the argument for the trace event to be a transaction handle
instead of fs_info and in this context we don't have a transaction handle
(struct btrfs_trans_handle, only a struct btrfs_transaction). So remove the
call to the trace point.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: remove pointless root field from transaction commit event

A transaction commit is global, not per root, yet we currently always
emit a root id field matching the root tree for no good reason, which
only causes confusion. So remove the root field.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: remove double negation in finish ordered extent event

There is no need to add a double negation (!!) to the update field because
the field has a boolean type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: add more cross checks for free space tree

This introduces extra checks using the previous key.

If there is a previous key, we can do extra validations:

- The previous key is FREE_SPACE_INFO
  This means the current extent/bitmap should be inside the
  free space info key range, and its type should match the type
  recorded in the free space info.

- The previous key is FREE_SPACE_EXTENT or FREE_SPACE_BITMAP
  In that case both the current and previous key should belong to the same
  block group.

  Thus the key types must match, and the two keys must not overlap.

These extra checks are inspired by the recently added type checks during
free space tree loading.

The new tree-checker checks will allow earlier detection, but the
loading time checks are still needed, as the tree-checker checks are
confined to a single leaf and cannot replace per-block-group checks.
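
The key-only part of these cross checks can be sketched as follows. This
is an illustration under assumptions: the key is taken to be (start,
type, length), the sketch_* names are hypothetical, ranges are assumed
not to overflow, and the check against the type recorded in the free
space info item is left out because it needs the item data.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_type { SKETCH_INFO, SKETCH_EXTENT, SKETCH_BITMAP };

/* Assumed key layout: objectid is the start, offset is the length. */
struct sketch_key {
	uint64_t objectid;
	enum sketch_type type;
	uint64_t offset;
};

static bool sketch_cross_check(const struct sketch_key *prev,
			       const struct sketch_key *cur)
{
	/* A new free space info item starts a new block group. */
	if (cur->type == SKETCH_INFO)
		return true;

	/* After an info item, the entry must lie inside its range. */
	if (prev->type == SKETCH_INFO)
		return cur->objectid >= prev->objectid &&
		       cur->objectid + cur->offset <=
		       prev->objectid + prev->offset;

	/* After an extent/bitmap: same type and no overlap. */
	return cur->type == prev->type &&
	       prev->objectid + prev->offset <= cur->objectid;
}
```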

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: ensure free space tree entries won't overflow

Add an extra check to ensure the free space extent/bitmap and space info
keys won't overflow.
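
A minimal sketch of such a check, assuming the key stores a start and a
length as unsigned 64-bit values; sketch_range_overflows() is a
hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if start + len wraps around the 64-bit address space. */
static bool sketch_range_overflows(uint64_t start, uint64_t len)
{
	return start + len < start;
}
```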

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: extract the shared key check for free space entries

Currently both check_free_space_extent() and check_free_space_bitmap()
share a very common validation on the keys.

Extract them into a helper, check_free_space_common_key(), and
change the output string ("extent" or "bitmap") depending on the key type.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove folio ordered flag and subpage bitmap

Btrfs has an internal flag/subpage bitmap called ordered, which
indicates that a block has a corresponding ordered extent covering it.

However this requires extra synchronization between the inode ordered
tree, and the folio flag/subpage bitmap, not to mention we need to
maintain the extra folio flag with subpage bitmap.

As a step to align btrfs_folio_state more closely to iomap_folio_state,
remove the btrfs specific ordered flag/bitmap.

This will also save us 64 bytes for the bitmap of a huge folio.

While here, also update the ASCII graph of the bitmap: as there are
only 3 sub-bitmaps now, show all of them directly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove folio_test_ordered() usage

This involves:

- The ASSERT() inside end_bbio_data_write()
  It's only an ASSERT() and it has never been triggered as far as I
  know.

- btrfs_migrate_folio()
  Since all folio_test_ordered() usage will be removed, there is no need to
  copy the folio ordered flag.

- The ASSERT() inside btrfs_invalidate_folio()
  This one has its usefulness as it indeed caught some bugs during
  development.
  But that's the last user, and it is not worth keeping the folio flag
  or the subpage bitmap for it.

This will allow btrfs to finally remove the ordered flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use dirty flag to check if an ordered extent needs to be truncated

Currently there are only two folio ordered flag users:

- extent_writepage_io()
  To ensure the folio range has an ordered extent covering it.
  This is from the legacy COW fixup mechanism, which is already removed
  and only a simple check is left.

- btrfs_invalidate_folio()
  This is to avoid race with end_bbio_data_write(), where
  btrfs_finish_ordered_extent() will be called to handle the OE
  finishing.

But for btrfs_invalidate_folio() we have already waited for the folio
writeback to finish, and locked the folio.
This means we can use the dirty flag to check if a range is already
submitted or not.

If the OE range is not dirty, it means the range has been submitted and
its dirty flag was cleared. And since we have already waited for
writeback, the endio function will handle the OE finishing.
Thus if the range is not dirty, we must skip the range.

If the OE range is dirty, it means we have allocated an ordered extent but
have not yet submitted the range. And that's exactly the case where we need
to truncate the ordered extent.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: unify folio dirty flag clearing

Currently during folio writeback, we call folio_clear_dirty_for_io()
before extent_writepage(), which causes the folio dirty flag to be
cleared, but without touching the subpage bitmaps.

This works fine for the bio submission path, as we always call
btrfs_folio_clear_dirty() to clear the subpage bitmap.

But this is far from consistent, so unify the behavior to always use
the btrfs_folio_clear_dirty() helper to clear both the folio flag and
the subpage bitmap.

This involves:

- Replace folio_clear_dirty_for_io() with folio_test_dirty()
  There is only one call site calling folio_clear_dirty_for_io() outside
  of subpage.c, that's inside extent_write_cache_pages() just before
  extent_writepage().

- Make btrfs_invalidate_folio() clear dirty range for the whole folio
  The function btrfs_invalidate_folio() is also called during
  extent_writepage().

  If we had a folio completely beyond isize, we call
  folio_invalidate() -> btrfs_invalidate_folio() to free the folio.

  Since we no longer have folio_clear_dirty_for_io() to clear the folio
  dirty flag, we must manually clear the folio dirty flag for the
  to-be-invalidated folio, and also clear the PAGECACHE_TAG_DIRTY tag.

  The tag clearing is done using a new helper,
  btrfs_clear_folio_dirty_tag(), which is almost the same as the old
  btree_clear_folio_dirty_tag(), but with minor improvements including:

  * Remove the folio_test_dirty() check
    We have already done an ASSERT().

  * Add an ASSERT() to make sure folio is mapped

- Add extra ASSERT()s before clearing folio private
  During development I hit dirty folios without the private flag set,
  and that caused a lot of ASSERT()s.
  The reason is that btrfs_invalidate_folio() is relying on the dirty
  flag being cleared when it's called from extent_writepage().

  Add extra ASSERT()s inside clear_folio_extent_mapped() to catch
  wild dirty/writeback flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: detect dirty blocks without an ordered extent more reliably

Currently btrfs detects a dirty folio which doesn't have an ordered extent
at extent_writepage_io(), but that is not ideal:

- The check is not handling all dirty blocks
  We can have multiple blocks inside a large folio, but the whole folio
  is marked ordered as long as there is one ordered extent in the range.

  We can still hit cases where some dirty blocks do not have
  corresponding ordered extents.

Instead of checking the folio ordered flags, do the check at
alloc_new_bio(), where we're already searching for ordered extents for
writebacks.

If we didn't find an ordered extent, we should already give an error
message and notify the caller there is something wrong.

This allows us to check every block that goes through
submit_extent_folio().

With this new and more reliable check, we can remove the old check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove locked subpage bitmap

Currently there are two members inside btrfs_folio_state that are related
to locked bitmap:

- locked sub-bitmap inside btrfs_folio_state::bitmaps[]
  The enum btrfs_bitmap_nr_locked determines the sub-bitmap.

- btrfs_folio_state::nr_locked
  Which records how many blocks are locked inside the folio.

The locked sub-bitmap is a btrfs specific per-block tracking mechanism,
which is mostly for async-submission, utilized by compressed writes.

The sub-bitmap itself is a superset of nr_locked, as it can provide
more reliable tracking.

But the sub-bitmap itself can be pretty large for the incoming huge
folios: a 2M sized folio with 4K page size means 512 bits for one
sub-bitmap.

Furthermore, in the long run compression will be reworked to get rid of
async-submission completely, so there is not much need for a full
sub-bitmap to track the locked status.

This patch removes the locked sub-bitmap and only relies on @nr_locked
atomic to do the tracking.
This can also save 64 bytes from btrfs_folio_state::bitmaps[] for a huge
folio.
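
The 64-byte figure follows from the sizes quoted above. A small sketch
of the arithmetic, with hypothetical helper names:

```c
#include <stddef.h>

/* One bit per block in the folio. */
static size_t sketch_bitmap_bits(size_t folio_size, size_t block_size)
{
	return folio_size / block_size;
}

/* Bytes needed to hold one sub-bitmap, rounded up to whole bytes. */
static size_t sketch_bitmap_bytes(size_t folio_size, size_t block_size)
{
	return (sketch_bitmap_bits(folio_size, block_size) + 7) / 8;
}
```

For a 2M folio and 4K blocks this gives 512 bits, that is 64 bytes per
sub-bitmap.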

This will reduce some safety checks, as previously if a block was not
locked, btrfs_folio_end_lock()/btrfs_folio_end_lock_bitmap() would
detect that, skip decrementing @nr_locked for that block, and so avoid
an underflow.

But this safety net shouldn't be necessary in the first place.
If we're unlocking a block that is not locked, it's a bug in the logic,
and we should catch it, not silently ignore it.
Thus I believe the removal of the extra safety net should not be a
problem.
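
Counter-only tracking that reports an unbalanced unlock, rather than
hiding it, could look like the following user-space sketch. It uses C11
atomics, and the struct and function names are hypothetical, not the
btrfs ones.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-folio state with only a counter, no bitmap. */
struct sketch_folio_state {
	atomic_int nr_locked;
};

static void sketch_lock_blocks(struct sketch_folio_state *s, int nr)
{
	atomic_fetch_add(&s->nr_locked, nr);
}

/*
 * Drop nr locked blocks. Returns true when the last locked block was
 * released. *underflow is set when more blocks were unlocked than were
 * locked, which indicates a logic bug in the caller.
 */
static bool sketch_end_lock(struct sketch_folio_state *s, int nr,
			    bool *underflow)
{
	int old = atomic_fetch_sub(&s->nr_locked, nr);

	*underflow = old < nr;
	return old == nr;
}
```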

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: validate names in ROOT_REF and ROOT_BACKREF

ROOT_REF and ROOT_BACKREF items contain a struct btrfs_root_ref followed
by the subvolume name. Several readers assume that this layout is already
valid and then use the on-disk name length directly. A corrupted item can
therefore make those readers address bytes outside the item, and
BTRFS_IOC_GET_SUBVOL_INFO can copy too many bytes into its fixed-size UAPI
name buffer.

Validate ROOT_REF and ROOT_BACKREF items in tree-checker before any reader
uses them. Reject records that do not contain a non-empty name, whose
name_len does not exactly describe the remaining item payload, or whose
name exceeds BTRFS_NAME_LEN.
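
The three rejection rules can be sketched as a standalone predicate. This
is an illustration only: sketch_root_ref_valid() and its parameters are
hypothetical, the header size is passed in rather than taken from struct
btrfs_root_ref, and the real tree-checker code may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NAME_LEN 255	/* mirrors BTRFS_NAME_LEN */

/*
 * @item_size: total size of the on-disk item
 * @hdr_size:  size of the fixed header preceding the name (assumption:
 *             supplied by the caller)
 * @name_len:  name length recorded in the header
 */
static bool sketch_root_ref_valid(size_t item_size, size_t hdr_size,
				  uint16_t name_len)
{
	assert(hdr_size > 0);

	/* The item must have room for a non-empty name after the header. */
	if (item_size <= hdr_size)
		return false;
	if (name_len == 0 || name_len > SKETCH_NAME_LEN)
		return false;
	/* name_len must describe the remaining payload exactly. */
	return item_size - hdr_size == name_len;
}
```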

For BTRFS_IOC_GET_SUBVOL_INFO, copy only the validated on-disk name_len
instead of deriving the copy length from the item size. The ioctl result is
zeroed when allocated. That leaves the existing trailing zero byte
untouched.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: free-space-tree: reject mismatched extent and bitmap items

btrfs_load_free_space_tree() reads FREE_SPACE_INFO once and then chooses
the bitmap or extent loader for all following free-space records until the
next FREE_SPACE_INFO item. Those loaders currently enforce the selected
record type only with ASSERT().

On production builds without CONFIG_BTRFS_ASSERT, a malformed free-space
tree can therefore be decoded in the wrong mode. An EXTENT item can reach
btrfs_free_space_test_bit() as bitmap data, while a BITMAP item can be
added as a full free extent. The latter corrupts the in-memory free-space
cache and the former can read beyond the item payload.

Sanitizer validation reported:
general protection fault
Call trace:
  assert_eb_folio_uptodate() (fs/btrfs/extent_io.c:4134)
  extent_buffer_test_bit() (?:?)
  btrfs_free_space_test_bit() (fs/btrfs/free-space-tree.c:518)
  srso_alias_return_thunk() (arch/x86/include/asm/nospec-branch.h:375)
  __entry_text_end() (?:?)
  __asan_memcpy() (mm/kasan/shadow.c:103)
  read_extent_buffer() (?:?)
  load_free_space_bitmaps() (fs/btrfs/free-space-tree.c:1548)
  btrfs_get_32() (fs/btrfs/free-space-tree.c:?)
  btrfs_set_16() (fs/btrfs/free-space-tree.c:?)
  kmem_cache_alloc_noprof() (?:?)
  btrfs_load_free_space_tree() (fs/btrfs/free-space-tree.c:1685)
  load_free_space_tree_for_test() (?:?)
  rcu_disable_urgency_upon_qs() (kernel/rcu/tree.c:721)
  vprintk_emit() (?:?)
  __up_write() (kernel/locking/rwsem.c:1401)
  clone_commit_root_for_test() (?:?)
  test_extent_as_bitmap_mode_mismatch() (?:?)
  kmem_cache_free() (?:?)
  btrfs_free_path() (fs/btrfs/free-space-tree.c:1449)
  __add_block_group_free_space() (fs/btrfs/free-space-tree.c:20)
  run_test() (?:?)
  do_raw_spin_unlock() (?:?)
  btrfs_test_free_space_tree() (fs/btrfs/tests/free-space-tree-tests.c:547)
  btrfs_test_qgroups() (fs/btrfs/tests/qgroup-tests.c:462)
  btrfs_run_sanity_tests() (fs/btrfs/free-space-tree.c:?)
  init_btrfs_fs() (fs/btrfs/super.c:2690)
  do_one_initcall() (init/main.c:1382)
  __kasan_kmalloc() (?:?)
  rcu_is_watching() (?:?)
  do_initcalls() (init/main.c:1457)
  kernel_init_freeable() (init/main.c:1674)
  kernel_init() (init/main.c:1584)
  ret_from_fork() (?:?)
  __switch_to() (?:?)
  ret_from_fork_asm() (?:?)

Validate every post-info key before decoding it. Reject keys whose type
does not match the mode selected by FREE_SPACE_INFO, and reject keys
whose range extends past the block group, returning -EUCLEAN instead of
feeding the wrong record type to the bitmap or extent decoder.
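
A rough sketch of such a per-key check, assuming hypothetical types
(struct sketch_key, enum sketch_key_type) and a hypothetical helper name.
It is not the loader code, only the two rejection rules in isolation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, fallback for hosts that lack it */
#endif

/* Hypothetical record types, not the on-disk key type values. */
enum sketch_key_type { SKETCH_EXTENT, SKETCH_BITMAP };

struct sketch_key {
	uint64_t start;			/* first byte covered by the record */
	enum sketch_key_type type;
	uint64_t length;		/* number of bytes covered */
};

/* Return 0 if @key may be decoded in the selected mode, -EUCLEAN if not. */
static int sketch_check_key(const struct sketch_key *key, bool bitmap_mode,
			    uint64_t bg_start, uint64_t bg_length)
{
	enum sketch_key_type expected =
		bitmap_mode ? SKETCH_BITMAP : SKETCH_EXTENT;

	assert(key);
	if (key->type != expected)
		return -EUCLEAN;
	if (key->start < bg_start)
		return -EUCLEAN;
	/* Overflow-safe form of: start + length > bg_start + bg_length */
	if (key->length > bg_length ||
	    key->start - bg_start > bg_length - key->length)
		return -EUCLEAN;
	return 0;
}
```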

Also reject zero-length FREE_SPACE_EXTENT items in tree-checker, matching
the existing FREE_SPACE_BITMAP zero-length check. This keeps the loader
range check simple and prevents a zero-length extent item from being a
valid on-disk free-space record.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use on stack backref iterator in build_backref_tree()

The iterator is used only once and within build_backref_tree() so we can
avoid one allocation and place it on stack.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove fs_info from struct btrfs_backref_iter

The fs_info is available everywhere and we don't need to store it inside
a structure that is used within one function only, which is
build_backref_tree(). The size of btrfs_backref_iter is now 48 bytes.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify the btree folio wait during invalidation

The btree inode is very different from regular data inodes, as the btree
inode is never exposed to user space operations.

All operations are either initiated by btrfs metadata operations, or MM
layer like memory pressure to release folios.

This means we never need to handle partial folio invalidation inside
btree_invalidate_folio().

With that said, we can slightly simplify the btree folio invalidation
by:

- Add ASSERT()s to make sure the range covers the whole folio

- Remove "if (start > end)" check
As the range always covers the full folio, that check is always
false and can be removed.

- Open code extent_invalidate_folio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: unexport and move extent_invalidate_folio()

The function extent_invalidate_folio() has only a single caller inside
btree_invalidate_folio().

There is no need to export such a function just for a single caller inside
another file.

Unexport extent_invalidate_folio() and move it to disk-io.c.

And since we're moving the code, update the comments to match the current
style, and remove the seemingly stale comment on the extent state
removal, as it's better explained by the comment just before
btrfs_unlock_extent().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: optimize fill_holes() to merge a new hole with both adjacent items

fill_holes() currently merges a punched hole with either the previous
or the next file extent item, but never both in the same call.  When
holes are punched in a non-sequential order this leaves consecutive
hole items in the inode's subvolume tree that should have been collapsed
into a single one.

This is a minor metadata optimization that reduces the number of file
extent items when holes are punched in non-sequential order. While
having extra file extent items is harmless and has no functional
impact, reducing metadata overhead can benefit workloads with heavily
fragmented hole patterns.

For example:

  fallocate -p -o 4K  -l 4K ${FILE}
  fallocate -p -o 12K -l 4K ${FILE}
  fallocate -p -o 8K  -l 4K ${FILE}

After the third punch the [4K, 8K) and [12K, 16K) holes become
adjacent to the new [8K, 12K) hole, but fill_holes() merges only one
side and leaves two separate hole items ([4K, 12K) and [12K, 16K))
instead of the expected single [4K, 16K) hole item.

Fix this by checking both path->slots[0] - 1 and path->slots[0] in one
pass:

  - If only the previous slot is mergeable, extend it forward as
    before.
  - If only the next slot is mergeable, extend it backward and update
    its key offset as before.
  - If both are mergeable, extend the previous item to cover the new
    hole plus the next item, and remove the redundant next item with
    btrfs_del_items().

Because the merge path may now delete an item, switch the initial
btrfs_search_slot() call from a plain lookup (ins_len = 0) to a
search-for-deletion (ins_len = -1), so the leaf is prepared for a
possible item removal.
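
The three merge cases can be modelled on a plain sorted array. This is a
simplified sketch: struct sketch_hole and sketch_fill_hole() are
hypothetical, the array holds hole items only (a real leaf also holds data
extents), and the caller is assumed to provide room for one extra item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_hole {
	uint64_t start;
	uint64_t len;
};

/*
 * Add the hole [start, start + len) to @items, where @slot is the index
 * of the first item starting after the new hole (so @slot - 1 is the
 * previous item). Returns the new number of items.
 */
static size_t sketch_fill_hole(struct sketch_hole *items, size_t nr,
			       size_t slot, uint64_t start, uint64_t len)
{
	int merge_prev, merge_next;

	assert(slot <= nr);
	merge_prev = slot > 0 &&
		     items[slot - 1].start + items[slot - 1].len == start;
	merge_next = slot < nr && items[slot].start == start + len;

	if (merge_prev && merge_next) {
		/* Previous item absorbs the new hole and the next item. */
		items[slot - 1].len += len + items[slot].len;
		memmove(&items[slot], &items[slot + 1],
			(nr - slot - 1) * sizeof(*items));
		return nr - 1;
	}
	if (merge_prev) {
		items[slot - 1].len += len;	/* extend forward */
		return nr;
	}
	if (merge_next) {
		items[slot].start = start;	/* extend backward */
		items[slot].len += len;
		return nr;
	}
	/* Nothing to merge with: insert a new item at @slot. */
	memmove(&items[slot + 1], &items[slot], (nr - slot) * sizeof(*items));
	items[slot].start = start;
	items[slot].len = len;
	return nr + 1;
}
```

With the items [4K, 8K) and [12K, 16K) and a new hole at [8K, 12K), the
sketch leaves a single [4K, 16K) item, matching the fallocate example.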

Note: This optimization only applies to filesystems without the
NO_HOLES feature enabled. Since NO_HOLES is now the default, this
primarily benefits older filesystems or those explicitly created with
NO_HOLES disabled.

Signed-off-by: Dave Chen <davechen@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: warn about extent buffer that can not be released

When we unmount the fs or during mount failures, btrfs will call
invalidate_inode_pages2() to release all btree inode folios.

However that function can return -EBUSY if any folios can not be
invalidated.
This can be caused by:

- Some extent buffers are still held by btrfs
  This is a logic error, as we should release all tree root nodes
  during unmount and mount failure handling.

- Some extent buffers are under readahead and haven't yet finished
  These are much rarer but valid cases.
  In that case we should wait for those extent buffers.

Introduce a new helper invalidate_and_check_btree_folios() which will:

- Call invalidate_inode_pages2() and catch its return value
  If it returned 0 as expected, that's great and we can call it a day.

- Otherwise go through each extent buffer in buffer_tree
  Increase the ref by one first for the eb we're checking.
  This is to ensure the eb won't be freed after the readahead is
  finished.

  For ebs that still have EXTENT_BUFFER_READING flag, wait for them to
  finish first.

  After waiting for the readahead, check the refs of the eb and if it's
  still dirty.

  If the eb ref count is greater than 2 (one for the buffer tree, one
  held by us), it means we are still holding the extent buffer somewhere
  else, which is a code bug.

  If the eb is still dirty, it means a bug in transaction handling, e.g.
  the bug fixed by patch "btrfs: only release the dirty pages io tree
  after successful writes".

  For either case, show a warning message about the eb, including its
  bytenr, owner, refs and flags.
  And if it's a debug build, also trigger WARN_ON_ONCE() so that fstests
  can properly catch such a situation.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
Reported-by: AHN SEOK-YOUNG <iamsyahn@gmail.com>
CC: Teng Liu <27rabbitlt@gmail.com>
Tested-by: Teng Liu <27rabbitlt@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make sure report_eb_range() is not inlined

If report_eb_range() is inlined into its single caller (check_eb_range()),
we end up with a larger module size, which is undesirable and does not
provide any advantage since this code is for a cold path which we don't
expect to ever hit.

Add the noinline attribute to report_eb_range() and while at it also make
it return void as it always returns true.
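
The pattern looks roughly like the following user-space sketch. The
function names are hypothetical, the attribute and builtin are the
GCC/Clang spellings behind the kernel's noinline and unlikely(), and the
message text is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Cold reporting path, kept out of line so the caller stays small. */
__attribute__((noinline))
static void sketch_report_range(unsigned long start, unsigned long len,
				unsigned long size)
{
	fprintf(stderr, "access out of range: start %lu len %lu size %lu\n",
		start, len, size);
}

/* Return true if [start, start + len) does not fit into @size bytes. */
static bool sketch_check_range(unsigned long start, unsigned long len,
			       unsigned long size)
{
	/* Written to avoid overflow in start + len. */
	if (sketch_unlikely(start > size || len > size - start)) {
		sketch_report_range(start, len, size);
		return true;
	}
	return false;
}
```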

Before this change (with gcc 14.2.0-19 from Debian):

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2018267 176232   15592 2210091 21b92b fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2017835 176048   15592 2209475 21b6c3 fs/btrfs/btrfs.ko

Also, replacing noinline with __cold yields slightly worse results:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2017889 176048   15592 2209529 21b6f9 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move transaction abort message to __btrfs_abort_transaction()

The btrfs_abort_transaction() is called at the location where we want to
report the abort. It must be a macro so we get the correct line and
stack trace. This inlines the necessary code and the rest is pushed to
__btrfs_abort_transaction().

There's a possibility to reduce the inlined code if we move the message
to the helper function as well, without loss of information. The
difference is only that the WARN will not print it inside the stack
report but after:

  --[ cut here ]--
  WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
  ...
  --[ end trace ] --
  BTRFS error (device dm-0 state A): Transaction aborted (error -28)

While previously there would be one more line like:

  --[ cut here ]--
  BTRFS: Transaction aborted (error -28)
  WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
  ...
  --[ end trace ] --

This removes about 20KiB of btrfs.ko on a release config.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't force DIO writes to be serialized

Before btrfs switched to the new mount API in 2023, we were setting
SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
filesystem may have files which don't have security xattrs, enabling it
to do some optimizations.

Unfortunately this was missed in the transition, meaning that IS_NOSEC
will always return false for a btrfs inode. This means that
btrfs_direct_write() calls will always get the inode lock exclusively,
meaning that DIO writes to the same file will be serialized.

On my machine, this one-line change results in a ~59% improvement in DIO
throughput:

Before patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
    write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
     bw (  KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
     iops        : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
    cpu          : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
       issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec

After patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
    write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
     bw (  MiB/s): min=  619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
     iops        : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
    cpu          : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
       issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec

The script to reproduce that:

  #!/bin/bash
  mkfs.btrfs -f /dev/nvme0n1
  mount /dev/nvme0n1 /mnt/test
  mkdir /mnt/test/nocow
  chattr +C /mnt/test/nocow
  fio /root/test.fio

  # cat /root/test.fio
  [global]
  rw=randwrite
  ioengine=io_uring
  iodepth=64
  size=1g
  direct=1
  startdelay=20
  force_async=4
  ramp_time=5
  runtime=60
  group_reporting=1
  numjobs=32
  time_based
  disk_util=0
  clat_percentiles=0
  disable_lat=1
  disable_clat=1
  disable_slat=1
  filename=/mnt/test/nocow/fiofile
  [test]
  name=test
  bs=4k
  stonewall

This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
through PCI passthrough. The figures for XFS and ext4 in comparison are
both about 3GB/s.

Fixes: ad21f15b0f79 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move large data folios out of experimental features

This feature was introduced in v6.17 under experimental, and we had
several small bugs related to or exposed by that:

e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size")

Otherwise, the feature has been frequently tested by btrfs developers.

The latest fix only arrived in v6.19. After three releases, I think it's
time to move this feature out of experimental.

And since we're here, also remove the comment about the bitmap size
limit, which is no longer relevant in the context. It will soon be
outdated for the incoming huge folio support.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: refresh add_ra_bio_pages() to indicate it's using folios

The function add_ra_bio_pages() has been utilizing folio interfaces
since c808c1dcb1b2 ("btrfs: convert add_ra_bio_pages() to use only
folios"), but we are still referring to "pages" inside the function name
and all comments.

Furthermore, such folio/page mixing can even be confusing, e.g. the
variable @page_end is very confusing as we're not really referring to
the end of the page, but the end of the folio, especially when we
already have large folio support.

Enhance that function by:

- Rename "page" to "folio" to avoid confusion

- Skip to the folio end if there is already a folio in the page cache
  The existing skip is:

   cur += folio_size(folio);

  This is incorrect if @cur is not folio size aligned, and can be
  common with large folio support.

  Thankfully this is not going to cause any real bugs, but at most will
  skip some blocks that can be added to readahead.
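
The difference between the two skips can be shown with plain arithmetic.
This is a sketch under one assumption: folios are naturally aligned to
their power-of-two size, so the folio end can be derived from @cur. The
helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Existing skip: overshoots when @cur is in the middle of the folio. */
static uint64_t sketch_skip_by_size(uint64_t cur, uint64_t folio_size)
{
	return cur + folio_size;
}

/* Skip to the end of the folio that contains @cur. */
static uint64_t sketch_skip_to_folio_end(uint64_t cur, uint64_t folio_size)
{
	/* Assumption: power-of-two, naturally aligned folios. */
	assert(folio_size != 0 && (folio_size & (folio_size - 1)) == 0);
	return (cur & ~(folio_size - 1)) + folio_size;
}
```

For a 16K folio at offset 0 and @cur at 4K, the first helper lands on 20K
and skips the blocks in [16K, 20K), while the second lands on 16K.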

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: enable cross-folio readahead for bs < ps and large folio cases

[BACKGROUND]
When bs < ps support was initially introduced, the compressed data
readahead was disabled as at that time the target page size was 64K.
This means a compressed data extent can span at most 3 64K pages (the
head and tail parts are not aligned to 64K), meaning the benefit is
pretty minimal.

[UNEXPECTED WORKING SITUATION]
But with the already merged large folio support, we're already enabling
readahead with subpage routine unintentionally, e.g.:

   0      4K      8K      12K      16K
   |   Folio 0    |    Folio 8K    |
   |<----- Compressed data ------->|

We have two 8K sized folios, both backed by a single compressed data extent.

In that case add_ra_bio_pages() will continue to add folio 8K into the
read bio, as the condition to skip is only (bs < ps), not taking the
newer large folio support into consideration at all.

So for folio 8K, it is added to the read bio, but without subpage lock
bitmap populated.

Then at end_bbio_data_read(), folio 0 has proper locked bitmap set, but
folio 8K does not.
This inconsistency is handled by the extra safety net at
btrfs_subpage_end_and_test_lock() where if a folio has no @nr_locked, it
will just be unlocked without touching the locked bitmap.

[ENHANCEMENT]
Make add_ra_bio_pages() support bs < ps and large folio cases, by
removing the check and calling btrfs_folio_set_lock() unconditionally.

This won't make any difference on 4K page sized systems with large
folios, as the readahead is already working, although unexpectedly.

But this will enable true compressed data readahead for bs < ps cases
properly.

Please note that such readahead will only work if the compressed extent is
crossing folio boundaries, which is also the existing limitation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove 32bit compat code for VFS inode number

Commit 0b2600f81cefcd ("treewide: change inode->i_ino from unsigned long
to u64") sets the inode number type to u64 unconditionally, so we can
use it directly as there's no difference on 32bit and 64bit platform. We
used to have a copy of the number in our btrfs_inode.

The size of btrfs_inode on 32bit platform is about 688 bytes (after the
change).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: limit size of bios submitted from writeback

Currently btrfs_writepages() just accumulates as large a bio as possible
(within writeback_control constraints) and then submits it. This can
however lead to significant latency in writeback IO submission (I have
observed tens of milliseconds) because the submitted bio can easily
exceed a hundred megabytes. Consequently this leads to IO pipeline stalls
and reduced throughput.

At the same time, beyond a certain size submitting such a large bio
provides diminishing returns because the bio is split by the block layer
immediately anyway. So compute an estimate of the bio size beyond which
we are unlikely to improve performance and just submit the bio for
writeback once we accumulate that much to keep the IO pipeline busy.
This improves writeback throughput for sequential writes by about 15% on
the test machine I was using.
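
The idea of submitting early once a size cap is reached can be sketched
as follows. Everything here is hypothetical (the struct, the helpers and
the cap value chosen by the caller); how the patch actually estimates the
limit is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_writeback {
	uint64_t pending;	/* bytes accumulated in the current bio */
	uint64_t limit;		/* assumed size cap, picked by the caller */
	size_t submitted;	/* number of bios submitted so far */
};

/* Submit whatever has been accumulated, if anything. */
static void sketch_submit(struct sketch_writeback *wb)
{
	if (wb->pending) {
		wb->submitted++;
		wb->pending = 0;
	}
}

/* Add @bytes to the current bio and submit it once the cap is reached. */
static void sketch_add(struct sketch_writeback *wb, uint64_t bytes)
{
	assert(wb->limit > 0);
	wb->pending += bytes;
	if (wb->pending >= wb->limit)
		sketch_submit(wb);
}
```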

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
[ Fix the handling of missing device to avoid NULL pointer dereference. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove 2K block size support

Originally 2K block size support was introduced to test subpage (block
size < page size) on x86_64 where the page size is exactly the original
minimal block size.

However that 2K block size support has some problems:

- No 2K nodesize support
  This is critical, as there is still no way to exercise the subpage
  metadata routine.

- Very easy to test subpage data path now
  With the currently experimental large folio support, it's very easy to
  test the subpage data folio path already, as when a folio larger than
  4K is encountered on x86_64, we will need all the subpage folio states
  and bitmaps.

  So there is no need to use 2K block size just to verify subpage data
  path even on x86_64.

And with the incoming huge folio (2M on x86_64) support, the 2K block
size will easily double the bitmap size. Considering the burden to
maintain it and the limited extra coverage, I believe it's time to remove
it for the incoming huge folio support.
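
The doubling follows from one bit per block in each sub-bitmap. A small
sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed by one sub-bitmap: one bit per block in the folio. */
static uint64_t sketch_bitmap_bits(uint64_t folio_size, uint64_t block_size)
{
	assert(block_size != 0 && folio_size % block_size == 0);
	return folio_size / block_size;
}

/* Same value rounded up to whole bytes. */
static uint64_t sketch_bitmap_bytes(uint64_t folio_size, uint64_t block_size)
{
	return (sketch_bitmap_bits(folio_size, block_size) + 7) / 8;
}
```

For a 2M folio this gives 512 bits (64 bytes) per sub-bitmap with 4K
blocks and 1024 bits (128 bytes) with 2K blocks.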

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: change return type from int to bool in check_eb_range()

The function always returns true or false but its return type is
defined as int, which makes no sense. Change it to bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add missing unlikely to if branches leading to a DEBUG_WARN()

If statement branches that lead to a DEBUG_WARN() are not expected to be
taken and in most places we surround their expressions with the unlikely
tag, however a few places are missing it. Add the unlikely tag to those
places to make it explicit to a reader that it's not expected and to hint
the compiler to generate better code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use QSTR() in __btrfs_ioctl_snap_create()

Drop the length argument and use the simpler QSTR().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use the enums instead of int type in struct btrfs_block_group fields

The 'disk_cache_state' and 'cached' fields are defined with an int type
but all the values we assigned to them come from the enums
btrfs_disk_cache_state and btrfs_caching_type. So change the type in the
btrfs_block_group structure from int to these enums - in practice an enum
is an int, so this is more for readability and clarity.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use min_size variable to setup block rsv in btrfs_replace_file_extents()

There's no need to calculate again the size for the temporary block
reserve in btrfs_replace_file_extents() - we have already calculated it
and stored it in the 'min_size' variable.

So use the variable to make it more clear and also make the variable const
since it's not supposed to change during the whole function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: balance: fix potential bg lookup failure in btrfs_may_alloc_data_chunk()

[BUG]
Running btrfs balance can trigger a null-ptr-deref before relocating a
data chunk when metadata corruption leaves a chunk in the chunk tree
without a corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
  RIP: 0010:btrfs_may_alloc_data_chunk+0x40/0x1c0 fs/btrfs/volumes.c:3601
  Call Trace:
    __btrfs_balance fs/btrfs/volumes.c:4217 [inline]
    btrfs_balance+0x2516/0x42b0 fs/btrfs/volumes.c:4604
    btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
    btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
    ...

[CAUSE]
__btrfs_balance() iterates the on-disk chunk tree and passes the chunk
logical bytenr to btrfs_may_alloc_data_chunk() before relocating a data
chunk. That helper then queries the in-memory block group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_type = cache->flags;   /* cache may be NULL */

A corrupt image can contain a chunk item whose matching block group
item is missing, so no block group is ever inserted into the cache. In
that case btrfs_lookup_block_group() returns NULL.

The code only guards this with ASSERT(cache), which becomes a no-op when
CONFIG_BTRFS_ASSERT is disabled. The subsequent dereference of
cache->flags therefore crashes the kernel.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in
btrfs_may_alloc_data_chunk() and print an error message for clarity.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: balance: fix potential bg lookup failure in chunk_usage_range_filter()

[BUG]
Running btrfs balance with a usage range filter (-dusage=min..max) can
trigger a null-ptr-deref when metadata corruption causes a chunk to have
no corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
  RIP: 0010:chunk_usage_range_filter fs/btrfs/volumes.c:3845 [inline]
  RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4031 [inline]
  RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4182 [inline]
  RIP: 0010:btrfs_balance+0x249e/0x4320 fs/btrfs/volumes.c:4618
  ...
  Call Trace:
    btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
    btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
    vfs_ioctl fs/ioctl.c:51 [inline]
    ...

The bug is reproducible on a recent development branch.

[CAUSE]
Two separate data structures are involved:

1. The on-disk chunk tree, which records every chunk (logical address
   space region) and is iterated by __btrfs_balance().

2. The in-memory block group cache (fs_info->block_group_cache_tree),
   which is built at mount time by btrfs_read_block_groups() and holds
   a struct btrfs_block_group for each chunk. This cache is what the
   usage range filter queries.

On a well-formed filesystem, these two are kept in 1:1 correspondence.
However, btrfs_read_block_groups() builds the cache from block group
items in the extent tree, not directly from the chunk tree. A corrupted
image can therefore contain a chunk item in the chunk tree whose
corresponding block group item is absent from the extent tree; that
chunk's block group is then never inserted into the in-memory cache.

When balance iterates the chunk tree and reaches such an orphaned chunk,
should_balance_chunk() calls chunk_usage_range_filter(), which queries
the block group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_used = cache->used;   /* cache may be NULL */

btrfs_lookup_block_group() returns NULL silently when no cached entry
covers chunk_offset. chunk_usage_range_filter() does not check the return
value, so the immediately following dereference of cache->used triggers
the crash.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in
chunk_usage_range_filter(). When the lookup fails, emit a btrfs_err()
message identifying the affected bytenr and return -EUCLEAN to indicate
filesystem corruption.

Since chunk_usage_range_filter() now has an error path, change its
return type from bool to int: return a negative errno on error, 0 if the
chunk matches the usage range, and 1 if it should be filtered out.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: balance: fix potential bg lookup failure in chunk_usage_filter()

[BUG]
Running btrfs balance with a usage filter (-dusage=N) can trigger a
null-ptr-deref when metadata corruption causes a chunk to have no
corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
  RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
  RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
  RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
  RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604
  ...
  Call Trace:
    btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
    btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
    vfs_ioctl fs/ioctl.c:51 [inline]
    ...

The bug is reproducible on the current development branch.

[CAUSE]
Two separate data structures are involved:

1. The on-disk chunk tree, which records every chunk (logical address
   space region) and is iterated by __btrfs_balance().

2. The in-memory block group cache (fs_info->block_group_cache_tree),
   which is built at mount time by btrfs_read_block_groups() and holds
   a struct btrfs_block_group for each chunk. This cache is what the
   usage filter queries.

On a well-formed filesystem, these two are kept in 1:1 correspondence.
However, btrfs_read_block_groups() builds the cache from block group
items in the extent tree, not directly from the chunk tree. A corrupted
image can therefore contain a chunk item in the chunk tree whose
corresponding block group item is absent from the extent tree; that
chunk's block group is then never inserted into the in-memory cache.

When balance iterates the chunk tree and reaches such an orphaned chunk,
should_balance_chunk() calls chunk_usage_filter(), which queries the block
group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_used = cache->used;   /* cache may be NULL */

btrfs_lookup_block_group() returns NULL silently when no cached entry
covers chunk_offset. chunk_usage_filter() does not check the return value,
so the immediately following dereference of cache->used triggers the crash.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in chunk_usage_filter().
When the lookup fails, emit a btrfs_err() message identifying the
affected bytenr and return -EUCLEAN to indicate filesystem corruption.

Since chunk_usage_filter() now has an error path, change its return type
from bool to int: a negative errno on error, 0 if the chunk passes the
usage filter, and 1 if it should be skipped.

Update should_balance_chunk() accordingly to propagate negative errors
from the usage filter.
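
As an illustration only, the three-way return convention described above
can be modelled in plain userspace C. Everything below (struct bg_sketch,
lookup_bg(), usage_filter_sketch(), the percentage threshold) is a
hypothetical stand-in, not the kernel code; the fallback value of EUCLEAN
is the Linux one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value; fallback for platforms lacking it */
#endif

/* Hypothetical stand-in for struct btrfs_block_group. */
struct bg_sketch {
	uint64_t start;
	uint64_t length;
	uint64_t used;
};

/* Returns the group covering bytenr, or NULL if none does. */
static const struct bg_sketch *lookup_bg(const struct bg_sketch *groups,
					 size_t count, uint64_t bytenr)
{
	for (size_t i = 0; i < count; i++) {
		if (bytenr >= groups[i].start &&
		    bytenr - groups[i].start < groups[i].length)
			return &groups[i];
	}
	return NULL;
}

/*
 * Returns a negative errno on error, 0 if the chunk passes the usage
 * filter, and 1 if it should be skipped.
 */
static int usage_filter_sketch(const struct bg_sketch *groups, size_t count,
			       uint64_t chunk_offset, unsigned int max_percent)
{
	const struct bg_sketch *bg;

	assert(max_percent <= 100);

	bg = lookup_bg(groups, count, chunk_offset);
	if (!bg)
		return -EUCLEAN; /* chunk without a block group: corruption */

	if (bg->used < bg->length / 100 * max_percent)
		return 0;
	return 1;
}
```

A caller in the role of should_balance_chunk() would check for a negative
value first and propagate it, and only then interpret 0 or 1.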

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add ioctl GET_CSUMS to read raw checksums from file range

Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to
query the on-disk csums for a file range.

The ioctl is deliberately per-file rather than exposing raw csum tree
lookups, to avoid leaking information to users about files they may not
have access to.

This is done by userspace passing a struct btrfs_ioctl_get_csums_args to
the kernel, which details the offset and length we're interested in, and
a buffer for the kernel to write its results into. The kernel writes a
struct btrfs_ioctl_get_csums_entry into the buffer, followed by the
csums if available. The maximum size of the user buffer is capped to
16MiB.

If the extent is an uncompressed, non-NODATASUM extent, the kernel sets
the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the
csums. If it is sparse, preallocated, or beyond the EOF, it sets the
type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use
the precomputed hash of the zero sector. Otherwise, it sets the type to
BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED,
BTRFS_GET_CSUM_ENCRYPTED, or BTRFS_GET_CSUM_INLINE.

For example, a file with a [0, 4K) hole and [4K, 12K) data extent would
produce the following output buffer:

  | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data |
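
A rough userspace sketch of how such a buffer might be walked follows. The
entry layout, the type values and the helper name are all assumptions made
for illustration; they are NOT the real struct
btrfs_ioctl_get_csums_entry, whose definition is not given here. The
sketch also assumes one csum of csum_size bytes per sector covered by a
HAS_CSUMS entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry types; the real values live in the btrfs UAPI header. */
enum sketch_type {
	SKETCH_HAS_CSUMS,
	SKETCH_ZEROED,
	SKETCH_OTHER,
};

/* Hypothetical layout, NOT the real struct btrfs_ioctl_get_csums_entry. */
struct sketch_entry {
	uint64_t offset;
	uint64_t length;
	uint32_t type;
	uint32_t reserved;
};

/*
 * Walk the buffer. Returns the number of entries, or -1 if the buffer is
 * truncated. *csum_bytes receives the total csum payload seen.
 */
static int walk_csum_buffer(const uint8_t *buf, size_t len,
			    uint32_t sector_size, uint32_t csum_size,
			    size_t *csum_bytes)
{
	size_t pos = 0;
	int entries = 0;

	assert(sector_size != 0 && csum_size != 0);
	*csum_bytes = 0;

	while (pos < len) {
		struct sketch_entry e;

		if (len - pos < sizeof(e))
			return -1;
		memcpy(&e, buf + pos, sizeof(e)); /* avoids unaligned reads */
		pos += sizeof(e);

		if (e.type == SKETCH_HAS_CSUMS) {
			/* One csum per sector covered by the entry. */
			size_t n = (size_t)(e.length / sector_size) * csum_size;

			if (len - pos < n)
				return -1;
			pos += n;
			*csum_bytes += n;
		}
		entries++;
	}
	return entries;
}
```

For the [0, 4K) hole plus [4K, 12K) data example with 4K sectors and a
4-byte csum, this walk would see two entries and 8 bytes of csum data.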

We do store the csums of compressed extents, but we deliberately don't
return them here: they're calculated over the compressed data, not the
uncompressed data that's returned to userspace. The same will apply to
encrypted data once encryption is supported, as the csums will be
calculated over the ciphertext.

The main use case for this is for speeding up mkfs.btrfs --rootdir. For
the case when the source FS is btrfs and using the same csum algorithm,
we can avoid having to recalculate the csums - in my synthetic
benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11%
speed-up (218s to 196s).

When using the --reflink option added in btrfs-progs v6.16.1, we can forgo
reading the data entirely, resulting in a ~2200% speed-up on the same test
(128s to 6s).

    # mkdir rootdir
    # dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304

    (without ioctl)
    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir testimg
    ...
    real    3m37.965s
    user    0m5.496s
    sys     0m6.125s

    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir --reflink testimg
    ...
    real    2m8.342s
    user    0m5.472s
    sys     0m1.667s

    (with ioctl)
    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir testimg
    ...
    real    3m15.865s
    user    0m4.258s
    sys     0m6.261s

    # echo 3 > /proc/sys/vm/drop_caches
    # time mkfs.btrfs --rootdir rootdir --reflink testimg
    ...
    real    0m5.847s
    user    0m2.899s
    sys     0m0.097s

Another notable use case is for deduplication, where reading the
checksums may serve as a hint instead of reading the whole file data.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: check and set EXTENT_DELALLOC_NEW before clearing EXTENT_DELALLOC

[WARNING]
When running test cases with injected errors or shutdown, e.g.
generic/388 or generic/475, there is a chance that the following kernel
warning is triggered:

  BTRFS info (device dm-2): first mount of filesystem d8a19a28-3232-4809-b0df-38df83e71bff
  BTRFS info (device dm-2): using crc32c checksum algorithm
  BTRFS info (device dm-2): checking UUID tree
  BTRFS info (device dm-2): turning on async discard
  BTRFS info (device dm-2): enabling free space tree
  BTRFS critical (device dm-2 state E): emergency shutdown
  ------------[ cut here ]------------
  WARNING: extent_io.c:1742 at extent_writepage_io+0x437/0x520 [btrfs], CPU#2: kworker/u43:2/651591
  CPU: 2 UID: 0 PID: 651591 Comm: kworker/u43:2 Tainted: G        W  OE       7.0.0-rc6-custom+ #365 PREEMPT(full)  5804053f02137e627472d94b5128cc9fcb110e88
  RIP: 0010:extent_writepage_io+0x437/0x520 [btrfs]
  Call Trace:
   <TASK>
   extent_write_cache_pages+0x2a5/0x820 [btrfs 70299925d0856939e93b17d480651713b3cbba58]
   btrfs_writepages+0x74/0x130 [btrfs 70299925d0856939e93b17d480651713b3cbba58]
   do_writepages+0xd0/0x160
   __writeback_single_inode+0x42/0x340
   writeback_sb_inodes+0x22d/0x580
   wb_writeback+0xc6/0x360
   wb_workfn+0xbd/0x470
   process_one_work+0x198/0x3b0
   worker_thread+0x1c8/0x330
   kthread+0xee/0x120
   ret_from_fork+0x2a6/0x330
   ret_from_fork_asm+0x11/0x20
   </TASK>
  ---[ end trace 0000000000000000 ]---
  BTRFS error (device dm-2 state E): root 5 ino 259 folio 1323008 is marked dirty without notifying the fs
  BTRFS error (device dm-2 state E): failed to submit blocks, root=5 inode=259 folio=1323008 submit_bitmap=0: -117
  BTRFS info (device dm-2 state E): last unmount of filesystem d8a19a28-3232-4809-b0df-38df83e71bff

[CAUSE]
Inside btrfs we have the following pattern in several locations, for
example inside btrfs_dirty_folio():

  btrfs_clear_extent_bit(&inode->io_tree, start_pos, end_of_last_block,
                         EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
                         cached);

  ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
                                  extra_bits, cached);
  if (ret)
          return ret;

However, btrfs_set_extent_delalloc() can fail with errors other than
-ENOMEM, such as IO errors, through the following call chain:

  btrfs_set_extent_delalloc()
  \- btrfs_find_new_delalloc_bytes()
     \- btrfs_get_extent()
        \- btrfs_lookup_file_extent()
           \- btrfs_search_slot()

When such an IO error happens, the preceding btrfs_clear_extent_bit() has
already cleared EXTENT_DELALLOC for the range, and we expect
btrfs_set_extent_delalloc() to re-set it.

But since btrfs_set_extent_delalloc() failed before reaching
btrfs_set_extent_bit(), the EXTENT_DELALLOC flag is no longer present.

And if the folio range was dirty before entering
btrfs_set_extent_delalloc(), we now have a dirty folio with no
EXTENT_DELALLOC flag.

Then we hit the folio writeback:

  extent_writepage()
  |- writepage_delalloc()
  |  No ordered extent is created, as there is no EXTENT_DELALLOC set
  |  for the folio range.
  |  This also means the folio has no ordered flag set.
  |
  |- extent_writepage_io()
     \- if (unlikely(!folio_test_ordered(folio)))
        Now we hit the warning.

[FIX]
Introduce a new helper, btrfs_reset_extent_delalloc(), to replace the
currently open-coded btrfs_clear_extent_bit() +
btrfs_set_extent_delalloc() combination.

Instead of calling btrfs_clear_extent_bit() first, update
EXTENT_DELALLOC_NEW first, as that part can fail due to metadata IO.
btrfs_clear_extent_bit() and btrfs_set_extent_bit(), by contrast, never
return an error; they retry memory allocation until it succeeds.

This allows us to fail early without clearing the EXTENT_DELALLOC bit. If
the new btrfs_reset_extent_delalloc() fails, it does so before touching
EXTENT_DELALLOC, so the existing dirty range still has its old
EXTENT_DELALLOC flag, thus avoiding the warning.
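
The effect of the reordering can be shown with a small userspace model.
This is only a sketch: the flag names, struct range_state and the three
functions are hypothetical stand-ins, with find_new_delalloc() playing the
role of the metadata lookup that may fail with -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define F_DELALLOC     (1u << 0)
#define F_DELALLOC_NEW (1u << 1)

/* Hypothetical stand-in for the extent state of one file range. */
struct range_state {
	unsigned int bits;
};

/* Fallible step: models the metadata lookup, which can hit an IO error. */
static int find_new_delalloc(struct range_state *rs, bool io_fails)
{
	assert(rs);
	if (io_fails)
		return -EIO;
	rs->bits |= F_DELALLOC_NEW;
	return 0;
}

/* Old ordering: clear first, then run the fallible step. */
static int old_order(struct range_state *rs, bool io_fails)
{
	int ret;

	rs->bits &= ~F_DELALLOC;
	ret = find_new_delalloc(rs, io_fails);
	if (ret)
		return ret; /* F_DELALLOC has already been lost */
	rs->bits |= F_DELALLOC;
	return 0;
}

/* New ordering: run the fallible step first, then clear and set. */
static int new_order(struct range_state *rs, bool io_fails)
{
	int ret = find_new_delalloc(rs, io_fails);

	if (ret)
		return ret; /* bits untouched, F_DELALLOC still present */
	rs->bits &= ~F_DELALLOC;
	rs->bits |= F_DELALLOC;
	return 0;
}
```

With a range that starts out with F_DELALLOC set and a failing lookup,
old_order() returns -EIO with the flag gone, while new_order() returns
-EIO with the flag intact.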

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unnecessary ctl argument from write_cache_extent_entries()

There is no need to pass the free space control structure as an argument
because we can grab it from the given block group.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unnecessary ctl argument from __btrfs_write_out_cache()

We can get the free space control structure from the given block group,
so there is no need to pass it as an argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove block group argument from copy_free_space_cache()

It's not necessary since we can get the block group from the given
free space control structure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove op field from struct btrfs_free_space_ctl

The op field always points to the same use_bitmap function; the only
exception is during self tests, where we make it temporarily point to a
different function. So this op pointer field alone increases the
structure size by 8 bytes.

Instead of storing a pointer to a use_bitmap function in struct
btrfs_free_space_ctl, move the pointer to struct btrfs_info, make
insert_into_bitmap() use that pointer if we are running the self tests
and initialize that pointer to the current, default use_bitmap function
(now exported for the tests as btrfs_use_bitmap). This way we reduce
the size of struct btrfs_free_space_ctl from 136 to 128 bytes and can
now fit 32 structures in a 4K page instead of 30. This also avoids the
cost of the indirection of a function pointer call when we are not
running the self tests.
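
The general pattern (a per-filesystem hook that is only consulted while
self tests run, with a direct call otherwise) might look roughly like the
userspace sketch below. All names, the struct layout and the 4096-byte
policy are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*use_bitmap_fn)(unsigned long bytes);

/* Stand-in default policy; the threshold is arbitrary. */
static bool default_use_bitmap(unsigned long bytes)
{
	return bytes < 4096;
}

/* Example override a self test might install. */
static bool selftest_always_bitmap(unsigned long bytes)
{
	(void)bytes;
	return true;
}

/* Hypothetical per-filesystem structure holding the single hook. */
struct fs_sketch {
	bool running_selftests;
	use_bitmap_fn use_bitmap; /* initialized to the default */
};

static void fs_sketch_init(struct fs_sketch *fs)
{
	assert(fs);
	fs->running_selftests = false;
	fs->use_bitmap = default_use_bitmap;
}

static bool should_use_bitmap(const struct fs_sketch *fs, unsigned long bytes)
{
	/* Indirect call only in self tests; direct call otherwise. */
	if (fs->running_selftests)
		return fs->use_bitmap(bytes);
	return default_use_bitmap(bytes);
}
```

Keeping the pointer in one shared structure instead of in every control
structure is what saves the 8 bytes per instance.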

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: reduce size of struct btrfs_free_space_ctl

We have a 4-byte hole in the structure; reorder some fields so that we
eliminate the hole and reduce the structure size from 144 bytes down to
136 bytes. This way on a 4K page system, we can fit 30 structures per
page instead of 28.
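
To see why closing a 4-byte hole can save 8 bytes, consider the toy
structures below. They are hypothetical and unrelated to the real field
list of struct btrfs_free_space_ctl; the stated sizes assume an ABI that
aligns uint64_t to 8 bytes, such as x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 24 bytes on x86-64: 4-byte hole after 'a', 4 bytes of tail padding. */
struct with_hole {
	uint32_t a;
	uint64_t b;
	uint32_t c;
};

/* 16 bytes: the two 32-bit fields share one 8-byte slot. */
struct reordered {
	uint64_t b;
	uint32_t a;
	uint32_t c;
};

static_assert(sizeof(struct reordered) <= sizeof(struct with_hole),
	      "reordering must never grow the structure");

/* Bytes saved by the reordering on the current ABI. */
static size_t bytes_saved(void)
{
	return sizeof(struct with_hole) - sizeof(struct reordered);
}
```

On x86-64, bytes_saved() yields 8: the hole and the tail padding both
disappear once the 32-bit fields are adjacent. Tools such as pahole report
these holes for real kernel structures.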

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unit field from struct btrfs_free_space_ctl

The unit field always has a value matching the sector size, and since we
have a block group pointer in the structure, we can access the block group
and then its fs_info field to get to the sector size. So remove the field,
which will allow us later to shrink the structure size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove start field from struct btrfs_free_space_ctl

There's no need for the start field; we can take it from the block group.
This reduces the structure size from 152 bytes down to 144 bytes, so on
a 4K page system we can now fit 28 structures instead of 26.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use a kmem_cache for free space control structures

We are currently allocating the free space control structures for block
groups using the generic slabs, and given that the size of the
btrfs_free_space_ctl structure is 152 bytes (on a release kernel), we end
up using the kmalloc-192 slab and therefore waste quite some memory since
on a 4K page system we can only fit 21 free space control structures per
page. These structures are allocated and deallocated every time we create
and remove block groups.

So use a kmem_cache for free space control structures; this way, on a 4K
page system, we can fit 26 structures instead of 21.
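
The per-page numbers follow from simple integer division, as the sketch
below shows. It is a simplified model: the helper names are made up, the
bucket list mirrors the usual kmalloc size classes, and slab metadata and
alignment overhead are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the generic kmalloc size class that serves it. */
static size_t kmalloc_bucket(size_t size)
{
	static const size_t buckets[] = {
		8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096,
	};

	for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++) {
		if (size <= buckets[i])
			return buckets[i];
	}
	return 0; /* larger than this sketch models */
}

/* Whole objects of obj_size bytes that fit in one page. */
static size_t objs_per_page(size_t page_size, size_t obj_size)
{
	assert(obj_size != 0);
	return page_size / obj_size;
}
```

A 152-byte structure lands in the 192-byte class, giving 4096 / 192 = 21
objects per page, whereas a dedicated cache sized at 152 bytes gives
4096 / 152 = 26.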

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>