Tejun Heo [Mon, 11 May 2026 23:18:23 +0000 (13:18 -1000)]
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock,
but that does not pin parent->sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.
Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.
Tejun Heo [Mon, 11 May 2026 23:18:19 +0000 (13:18 -1000)]
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
On scx_link_sched() error paths (parent disabled, hash insert failure),
&sch->all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a
list_head that was never initialized triggering a corruption warning.
Initialize &sch->all.
Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()") Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 12 May 2026 20:30:00 +0000 (10:30 -1000)]
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p->scx.sched dangles - print_scx_info() from sched_show_task() reads
sch->ops.name from freed memory.
Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only
work left - and the one thing the path actually needs.
v2: Extend the SUB_INIT block comment to note that the flag is only
set on the sub-enable path, so it's always clear on the NONE
re-entry (Andrea).
Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
KVM: x86: Swap the dst and src operand for MOVNTDQA
Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.
Per the SDM:
MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
hint.
MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
if WC memory type.
Reported-by: Josh Eads <josheads@google.com> Fixes: c57d9bafbd0b ("KVM: x86: Add support for emulating MOVNTDQA") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sun, 3 May 2026 21:09:17 +0000 (23:09 +0200)]
KVM: x86: use again the flush argument of __link_shadow_page()
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty. In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.
Based-on: <20260503201029.106481-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Val Packett [Thu, 12 Mar 2026 00:53:37 +0000 (21:53 -0300)]
arm64: dts: qcom: x1-dell-thena: remove i2c20 (battery SMBus) and reserve its pins
i2c20 is used by the battmgr service on the ADSP to communicate with the
SBS interface of the battery. Initializing it from Linux would break the
battmgr functionality when booted in EL2. Mark those pins as reserved.
Fixes: e7733b42111c ("arm64: dts: qcom: Add support for Dell Inspiron 7441 / Latitude 7455") Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Abel Vesa <abel.vesa@oss.qualcomm.com> Signed-off-by: Val Packett <val@packett.cool> Link: https://lore.kernel.org/r/20260312005731.12488-2-val@packett.cool Signed-off-by: Bjorn Andersson <andersson@kernel.org>
KVM: selftests: Ensure gmem file sizes are multiple of host page size
When creating a guest_memfd file and associated memslot to validate shared
guest memory, size the file+memslot to the maximum of the host or guest
page size. Attempting to allocate a single guest page will fail if the
host page size is greater than the guest page size, as KVM requires that
the size of memslots and guest_memfd files are a multiple of the host page
size.
For simplicity, verify the entire file can be shared between guest and host,
e.g. instead of trying to validate "partial" mappings.
Fixes: 42188667be38 ("KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory") Reported-by: Zenghui Yu <zenghui.yu@linux.dev> Closes: https://lore.kernel.org/all/0064952b-048c-455d-ad89-e27e5cb82591@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260512155634.772602-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 12 May 2026 20:19:20 +0000 (22:19 +0200)]
Merge tag 'kvmarm-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.1, take #2
- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
that the EL3 firmware does its part of the job. KVM will refuse to
initialise otherwise.
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0.
- Correctly deal with permission faults in guest_memfd memslots.
- Fix the steal-time selftest after the infrastructure was reworked.
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure.
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390.
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events.
- Handle the current vcpu being NULL on EL2 panic.
- Fix the selftest_vcpu memcache being empty at the point of donation or
sharing.
- Check that the memcache has enough capacity before engaging on the
share/donate path.
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context.
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running,
and instead always configure vmcb02 according to L1's exact capabilities
and desires.
The purpose of intercepting PAUSE after N attempts is to detect when the
vCPU may be stuck waiting on a lock, so that KVM can schedule in a
different vCPU that may be holding said lock. Barring a very interesting
setup, L1 and L2 do not share locks, and it's extremely unlikely that an
L1 vCPU would hold a spinlock while running L2. I.e. having a vCPU
executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU
to make forward progress, and vice versa.
While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is
doable, in all likelihood it would do more harm than good for most setups.
KVM has limited visibility into which L2 "vCPUs" belong to the same VM,
and thus share a locking domain. And even if L2 vCPUs are in the same
VM, KVM has no visilibity into L2 vCPU's that are scheduled out by the
L1 hypervisor.
Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is
intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as
nested_svm_intercept() gives priority to the vmcb12 intercept. As such,
overriding the count/threshold fields in vmcb02 with vmcb01's values is
nonsensical, as doing so clobbers all the training/learning that has been
done in L1.
Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE
exits, then KVM will adjust the PLE knobs based on L2 behavior, which could
very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE
training with bad data.
And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even
less sense, because again, the purpose of PLE is to detect spinning vCPUs.
Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit
has no relevance as to the behavior of the vCPU when it executes in L1.
The only scenarios where any of this actually works is if at least one
of KVM or L1 is NOT intercepting PAUSE for the guest. Per the original
changelog, those were the only scenarios considered to be supported.
Disabling KVM's use of PLE makes it so the VM is always in a "supported"
mode.
Last, but certainly not least, using KVM's count/threshold instead of the
values provided by L1 is a blatant violation of the SVM architecture.
Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Aaron Sacks [Tue, 12 May 2026 06:07:42 +0000 (02:07 -0400)]
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
kvm_reset_dirty_gfn() guards the gfn range with
if (!memslot || (offset + __fls(mask)) >= memslot->npages)
return;
but offset is u64 and the addition is unchecked. The check can be
silently bypassed by a u64 wrap.
The dirty ring backing those entries is MAP_SHARED at
KVM_DIRTY_LOG_PAGE_OFFSET of the vcpu fd, so the VMM can rewrite the
slot and offset fields of any entry between when the kernel pushes
them and when KVM_RESET_DIRTY_RINGS consumes them. On reset,
kvm_dirty_ring_reset() re-reads the values via READ_ONCE() and feeds
them straight back into this check; only the flags handshake is
treated as the handover, the slot/offset payload is taken on trust.
makes the coalescing loop in kvm_dirty_ring_reset() compute
delta = (s64)(0 - 0xffffffffffffffc1) = 63
which falls in [0, BITS_PER_LONG), so it folds entry[i+1] into the
existing mask by setting bit 63. The trailing kvm_reset_dirty_gfn()
call then sees offset = 0xffffffffffffffc1 and __fls(mask) = 63;
the sum is 0 in u64 and the bounds check passes.
That offset propagates into kvm_arch_mmu_enable_log_dirty_pt_masked()
unchanged. On the legacy MMU path -- kvm_memslots_have_rmaps() ==
true, i.e. shadow paging, any VM that has allocated shadow roots, or
a write-tracked slot -- it reaches gfn_to_rmap(), which indexes
slot->arch.rmap[0][] with a near-U64_MAX gfn. That is an
out-of-bounds load of a kvm_rmap_head, followed by a conditional
clear of PT_WRITABLE_MASK in whatever the loaded pointer points at.
The path is reachable from any process holding /dev/kvm.
Range-check offset on its own first, so the addition cannot wrap.
memslot->npages is bounded well below U64_MAX, so once offset <
npages holds, offset + __fls(mask) (with __fls(mask) < BITS_PER_LONG)
stays in range.
Sergio Correia [Tue, 12 May 2026 13:28:59 +0000 (14:28 +0100)]
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED
and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This
allows a process with CAP_AUDIT_CONTROL to modify directory tree
watches and equivalence mappings even when the audit configuration
has been locked, undermining the purpose of the lock.
Add AUDIT_LOCKED checks to both commands.
Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Sergio Correia [Tue, 12 May 2026 13:28:33 +0000 (14:28 +0100)]
audit: fix incorrect inheritable capability in CAPSET records
__audit_log_capset() records the effective capability set into the
inheritable field due to a copy-paste error. Every CAPSET audit
record therefore reports cap_pi (process inheritable) with the value
of cap_effective instead of cap_inheritable.
This silently corrupts audit data used for compliance and forensic
analysis: an attacker who modifies inheritable capabilities to
prepare for a privilege-escalating exec would have the change masked
in the audit trail.
The bug has been present since the original introduction of CAPSET
audit records in 2008.
Cc: stable@vger.kernel.org Fixes: e68b75a027bb ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Raphael Zimmer [Tue, 12 May 2026 16:16:40 +0000 (18:16 +0200)]
libceph: Fix potential null-ptr-deref in decode_choose_args()
A message of type CEPH_MSG_OSD_MAP contains an OSD map that itself
contains a CRUSH map. When decoding this CRUSH map in crush_decode(), an
array of max_buckets CRUSH buckets is decoded, where some indices may
not refer to actual buckets and are therefore set to NULL. The received
CRUSH map may optionally contain choose_args that get decoded in
decode_choose_args(). When decoding a crush_choose_arg_map, a series of
choose_args for different buckets is decoded, with the bucket_index
being read from the incoming message. It is only checked that the bucket
index does not exceed max_buckets, but not that it doesn't point to an
index with a NULL bucket. If a (potentially corrupted) message contains
a crush_choose_arg_map including such a bucket_index, a null pointer
dereference may occur in the subsequent processing when attempting to
access the bucket with the given index.
This patch fixes the issue by extending the affected check. Now, it is
only attempted to access the bucket if it is not NULL.
Raphael Zimmer [Tue, 12 May 2026 07:29:30 +0000 (09:29 +0200)]
libceph: handle rbtree insertion error in decode_choose_args()
A message of type CEPH_MSG_OSD_MAP contains an OSD map that itself
contains a CRUSH map. The received CRUSH map may optionally contain
choose_args that get decoded in decode_choose_args(). In this function,
num_choose_arg_maps is read from the message, and a corresponding number
of crush_choose_arg_maps gets decoded afterwards. Each
crush_choose_arg_map has a choose_args_index, which serves as the key
when inserting it into the choose_args rbtree of the decoded crush_map.
If a (potentially corrupted) message contains two crush_choose_arg_maps
with the same index, the assertion in insert_choose_arg_map() triggers a
kernel BUG when trying to insert the second crush_choose_arg_map.
This patch fixes the issue by switching to the non-asserting rbtree
insertion function and rejecting the message if the insertion fails.
hwmon: (asus_atk0110) Check ACPI_COMPANION() against NULL
Every platform driver can be forced to match a device that doesn't match
its list of device IDs because of device_match_driver_override(), so
platform drivers that rely on the existence of a device's ACPI companion
object need to verify its presence.
Accordingly, add a requisite ACPI_HANDLE() check against NULL to the
asus_atk0110 hwmon driver.
Fixes: ee1752590733 ("hwmon: (asus_atk0110) Convert ACPI driver to a platform one") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/2261594.irdbgypaU6@rafael.j.wysocki Signed-off-by: Guenter Roeck <linux@roeck-us.net>
hwmon: (acpi_power_meter) Check ACPI_COMPANION() against NULL
Every platform driver can be forced to match a device that doesn't match
its list of device IDs because of device_match_driver_override(), so
platform drivers that rely on the existence of a device's ACPI companion
object need to verify its presence.
Accordingly, add a requisite ACPI_COMPANION() check against NULL to the
acpi_power_meter hwmon driver.
Fixes: afc6c4aedea5 ("hwmon: (acpi_power_meter) Convert ACPI driver to a platform one") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/5068745.GXAFRqVoOG@rafael.j.wysocki Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Input: atlas - check ACPI_COMPANION() against NULL
Every platform driver can be forced to match a device that doesn't match
its list of device IDs because of device_match_driver_override(), so
platform drivers that rely on the existence of a device's ACPI companion
object need to verify its presence.
Accordingly, add a requisite ACPI_COMPANION() check against NULL to the
atlas_btns driver.
Fixes: b8303880b641 ("Input: atlas - convert ACPI driver to a platform one") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/8696590.T7Z3S40VBb@rafael.j.wysocki Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Linus Torvalds [Tue, 12 May 2026 17:18:02 +0000 (10:18 -0700)]
Merge tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
Since the ftrace adds its NOPs at .kprobes.text section (which stores
an array), a wrong entry is added when loading a module which uses
"__kprobes" attribute.
To solve this, add "notrace" to __kprobes functions
- test_kprobes: clear kprobes between test runs
Clear all kprobes in the test program after running a test set,
because Kunit test can run several times
- fprobe: Fix unregister_fprobe() to wait for RCU grace period
Since the fprobe data structure is removed with hlist_del_rcu(), it
should wait for the RCU grace period. If the caller waits for RCU, we
can use the async variant (e.g. eBPF)
* tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fprobe: Fix unregister_fprobe() to wait for RCU grace period
test_kprobes: clear kprobes between test runs
kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
Willy Tarreau [Sat, 9 May 2026 09:47:55 +0000 (11:47 +0200)]
Documentation: security-bugs: clarify requirements for AI-assisted reports
AI tools are increasingly used to assist in bug discovery. While these
tools can identify valid issues, reports that are submitted without
manual verification often lack context, contain speculative impact
assessments, or include unnecessary formatting. Such reports increase
triage effort, waste maintainers' time and may be ignored.
Reports where the reporter has verified the issue and the proposed fix
typically meet quality standards. This documentation outlines specific
requirements for length, formatting, and impact evaluation to reduce
the effort needed to deal with these reports.
Willy Tarreau [Sat, 9 May 2026 09:47:54 +0000 (11:47 +0200)]
Documentation: security-bugs: explain what is and is not a security bug
The use of automated tools to find bugs in random locations of the kernel
induces a raise of security reports even if most of them should just be
reported as regular bugs. This patch is an attempt at drawing a line
between what qualifies as a security bug and what does not, hoping to
improve the situation and ease decision on the reporter's side.
It defers the enumeration to a new file, threat-model.rst, that tries
to enumerate various classes of issues that are and are not security
bugs. This should permit to more easily update this file for various
subsystem-specific rules without having to revisit the security bug
reporting guide.
Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Leon Romanovsky <leon@kernel.org> Suggested-by: Leon Romanovsky <leon@kernel.org> Suggested-by: Greg KH <gregkh@linuxfoundation.org> Reviewed-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260509094755.2838-3-w@1wt.eu>
Willy Tarreau [Sat, 9 May 2026 09:47:53 +0000 (11:47 +0200)]
Documentation: security-bugs: do not systematically Cc the security team
With the increase of automated reports, the security team is dealing
with way more messages than really needed. The reporting process works
well with most teams so there is no need to systematically involve the
security team in reports.
Let's suggest to keep it for small lists of recipients and new reporters
only. This should continue to cover the risk of lost messages while
reducing the volume from prolific reporters.
Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Leon Romanovsky <leon@kernel.org> Reviewed-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260509094755.2838-2-w@1wt.eu>
ACPI: PAD: xen: Check ACPI_COMPANION() against NULL
Every platform driver can be forced to match a device that doesn't match
its list of device IDs because of device_match_driver_override(), so
platform drivers that rely on the existence of a device's ACPI companion
object need to verify its presence.
Accordingly, add a requisite ACPI_COMPANION() check against NULL to the
Xen variant of the ACPI processor aggregator device (PAD) driver.
accel/qaic: Add overflow check to remap_pfn_range during mmap
The call to remap_pfn_range in qaic_gem_object_mmap is susceptible to
(re)mapping beyond the VMA if the BO is too large. This can cause use
after free issues when munmap() unmaps only the VMA region and not the
additional mappings. To prevent this, check the remaining size of the
VMA before remapping and truncate the remapped length if sg->length is
too large.
Reported-by: Lukas Maar <lukas.maar@tugraz.at> Fixes: ff13be830333 ("accel/qaic: Add datapath") Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Zack McKevitt <zachary.mckevitt@oss.qualcomm.com> Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
[jhugo: fix braces from checkpatch --strict] Signed-off-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com> Link: https://patch.msgid.link/20260430193858.1178641-1-zachary.mckevitt@oss.qualcomm.com
Tomasz Pakuła [Sun, 10 May 2026 12:23:52 +0000 (14:23 +0200)]
HID: pidff: Fix integer overflow in pidff_rescale
Rescaling values close to the max (U16_MAX) temporarily creates values
that exceed the s32 range. This caused value overflow in case when, for
example, a periodic effect phase was higer than 180 degrees. In turn,
rescale function could return values outised of the logical range of the
HID field.
Fix by using 64 bit signed integer to store the value during calculation
but still return only 32 bit integer.
Closes: https://github.com/JacKeTUs/universal-pidff/issues/116 Fixes: 224ee88fe395 ("Input: add force feedback driver for PID devices") Cc: stable@vger.kernel.org Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Xu Rao [Sat, 9 May 2026 08:21:32 +0000 (16:21 +0800)]
HID: i2c-hid: add reset quirk for BLTP7853 touchpad
The BLTP7853 I2C HID touchpad may fail to probe after reboot or
reprobe because reset completion is not signalled to the host. The
driver then waits for the reset-complete interrupt until it times out
and the device probe fails:
i2c_hid i2c-BLTP7853:00: failed to reset device.
i2c_hid i2c-BLTP7853:00: can't add hid device: -61
i2c_hid: probe of i2c-BLTP7853:00 failed with error -61
Add I2C_HID_QUIRK_NO_IRQ_AFTER_RESET for the device so i2c-hid does
not wait for a reset interrupt that may never arrive.
hid_input_report() is used in too many places to have a commit that
doesn't cross subsystem borders. Instead of changing the API, introduce
a new one when things matters in the transport layers:
- usbhid
- i2chid
This effectively revert to the old behavior for those two transport
layers.
commit 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing
bogus memset()") enforced the provided data to be at least the size of
the declared buffer in the report descriptor to prevent a buffer
overflow. However, we can try to be smarter by providing both the buffer
size and the data size, meaning that hid_report_raw_event() can make
better decision whether we should plaining reject the buffer (buffer
overflow attempt) or if we can safely memset it to 0 and pass it to the
rest of the stack.
Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Acked-by: Johan Hovold <johan@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Myeonghun Pak [Fri, 24 Apr 2026 12:50:41 +0000 (21:50 +0900)]
HID: google: hammer: stop hardware on devres action failure
hammer_probe() starts the HID hardware before registering the devres
action that stops it. If devm_add_action() fails, probe returns an
error with the hardware still started because the cleanup action was
never registered and the driver's remove callback is not called after a
failed probe.
Use devm_add_action_or_reset() so the stop action runs immediately on
registration failure while preserving the existing devres-managed cleanup
path for later probe failures and remove.
Signed-off-by: Myeonghun Pak <mhun512@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Sangyun Kim [Mon, 20 Apr 2026 05:13:18 +0000 (14:13 +0900)]
HID: appletb-kbd: run inactivity autodim from workqueues
The autodim code in hid-appletb-kbd takes backlight_device->ops_lock
via backlight_device_set_brightness() -> mutex_lock() from two
different atomic contexts:
* appletb_inactivity_timer() is a struct timer_list callback, so it
runs in softirq context. Every expiry triggers
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
Call Trace:
<IRQ>
__might_resched
__mutex_lock
backlight_device_set_brightness
appletb_inactivity_timer
call_timer_fn
run_timer_softirq
* reset_inactivity_timer() is called from appletb_kbd_hid_event() and
appletb_kbd_inp_event(). On real USB hardware these run in
softirq/IRQ context (URB completion and input-event dispatch).
When the Touch Bar has already been dimmed or turned off, the
reset path calls backlight_device_set_brightness() directly to
restore brightness, producing the same warning.
Both call sites hit the same mutex_lock()-from-atomic bug. Fix them
together by moving the blocking work onto the system workqueue:
* Convert the inactivity timer from struct timer_list to
struct delayed_work; the callback (appletb_inactivity_work) now
runs in process context where mutex_lock() is legal.
* Add a dedicated struct work_struct restore_brightness_work and have
reset_inactivity_timer() schedule it instead of calling
backlight_device_set_brightness() directly.
Cancel both works synchronously during driver tear-down alongside the
existing backlight reference drop.
The semantics are unchanged (same delays, same state transitions on
dim, turn-off and user activity); only the execution context of the
sleeping call changes. The timer field and callback are renamed to
match their new type; reset_inactivity_timer() keeps its name because
it is invoked from input event paths that read naturally as "reset
the inactivity timer".
Fixes: 93a0fc489481 ("HID: hid-appletb-kbd: add support for automatic brightness control while using the touchbar") Cc: stable@vger.kernel.org Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Sangyun Kim [Mon, 20 Apr 2026 05:13:17 +0000 (14:13 +0900)]
HID: appletb-kbd: fix UAF in inactivity-timer cleanup path
Commit 38224c472a03 ("HID: appletb-kbd: fix slab use-after-free bug in
appletb_kbd_probe") added timer_delete_sync(&kbd->inactivity_timer) to
both the probe close_hw error path and appletb_kbd_remove(), but the
way it was wired in left the inactivity timer reachable during driver
tear-down via two distinct windows.
Window A -- put_device() before timer_delete_sync():
The inactivity_timer softirq reads kbd->backlight_dev and calls
backlight_device_set_brightness() -> mutex_lock(&ops_lock). If a
concurrent hid_appletb_bl unbind drops the last devm reference
between these two calls, the backlight_device is freed and the
mutex_lock() touches freed memory.
Window B -- backlight cleanup before hid_hw_stop():
if (kbd->backlight_dev) {
timer_delete_sync(...);
put_device(...);
}
hid_hw_close(hdev);
hid_hw_stop(hdev);
Even after Window A is closed, hid_hw_close()/hid_hw_stop() still run
afterwards, so a late ".event" callback from the HID core (USB URB
completion on real Apple hardware) can arrive after
timer_delete_sync() drained the softirq but before put_device() drops
the reference. That callback reaches reset_inactivity_timer(), which
calls mod_timer() and re-arms the timer. The freshly re-armed timer
can then fire on the about-to-be-freed backlight_device.
Both windows produce the same KASAN slab-use-after-free:
BUG: KASAN: slab-use-after-free in __mutex_lock+0x1aab/0x21c0
Read of size 8 at addr ffff88803ee9a108 by task swapper/0/0
Call Trace:
<IRQ>
__mutex_lock
backlight_device_set_brightness
appletb_inactivity_timer
call_timer_fn
run_timer_softirq
handle_softirqs
Allocated by task N:
devm_backlight_device_register
appletb_bl_probe
Freed by task M:
(concurrent hid_appletb_bl unbind path)
Close both windows at once by reworking the tear-down in
appletb_kbd_remove() and in the probe close_hw error path so that
1) hid_hw_close()/hid_hw_stop() run before the backlight cleanup,
guaranteeing no further .event callback can fire and re-arm the
timer, and
2) inside the "if (kbd->backlight_dev)" block, timer_delete_sync()
runs before put_device(), so the softirq is drained before the
final reference is dropped.
Fixes: 38224c472a03 ("HID: appletb-kbd: fix slab use-after-free bug in appletb_kbd_probe") Cc: stable@vger.kernel.org Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Jiri Kosina <jkosina@suse.com>
T.J. Mercier [Fri, 17 Apr 2026 15:47:02 +0000 (08:47 -0700)]
HID: playstation: Clamp num_touch_reports
A device would never lie about the number of touch reports would it?
If it does the loop in dualshock4_parse_report will read off the end of
the touch_reports array, up to about 2 KiB for the maximum number of 256
loop iteraions. The data that is read is emitted via evdev if the
DS4_TOUCH_POINT_INACTIVE bit happens to be set. Protect against this by
clamping the num_touch_reports value provided by the device to the
maximum size of the touch_reports array.
Lee Jones [Thu, 16 Apr 2026 13:16:54 +0000 (14:16 +0100)]
HID: magicmouse: Prevent out-of-bounds (OOB) read during DOUBLE_REPORT_ID
It is currently possible for a malicious or misconfigured USB device to
cause an out-of-bounds (OOB) read when submitting reports using
DOUBLE_REPORT_ID by specifying a large report length and providing a
smaller one.
Let's prevent that by comparing the specified report length with the
actual size of the data read in from userspace. If the actual data
length ends up being smaller than specified, we'll politely warn the
user and prevent any further processing.
Signed-off-by: Lee Jones <lee@kernel.org> Reviewed-by: Günther Noack <gnoack@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
HID: mcp2221: fix OOB write in mcp2221_raw_event()
mcp2221_raw_event() copies device-supplied data into mcp->rxbuf at
offset rxbuf_idx without checking that the copy fits within the
destination buffer. A device responding with up to 60 bytes to a
small I2C/SMBus read can overflow the buffer.
Add a rxbuf_size field to struct mcp2221, set it alongside rxbuf in
mcp_i2c_smbus_read(), and check rxbuf_idx + data[3] <= rxbuf_size
before the memcpy.
Thomas Huth [Tue, 21 Apr 2026 14:27:01 +0000 (16:27 +0200)]
xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h
While the GCC and Clang compilers already define __ASSEMBLER__
automatically when compiling assembly code, __ASSEMBLY__ is a
macro that only gets defined by the Makefiles in the kernel.
This can be very confusing when switching between userspace
and kernelspace coding, or when dealing with uapi headers that
rather should use __ASSEMBLER__ instead. So let's standardize now
on the __ASSEMBLER__ macro that is provided by the compilers.
Sungwoo Kim [Tue, 12 May 2026 05:09:29 +0000 (01:09 -0400)]
block: bio-integrity: Fix null-ptr-deref in bio_integrity_map_user()
pin_user_pages_fast() can partially succeed and return the number of
pages that were actually pinned. However, the bio_integrity_map_user()
does not handle this partial pinning. This leads to a general protection
fault since bvec_from_pages() dereferences an unpinned page address,
which is 0.
To fix this, add a check to verify that all requested memory is pinned.
If partial pinning occurs, unpin the memory and return -EFAULT.
Kernel Oops:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 1061 Comm: nvme-passthroug Not tainted 7.0.0-11783-g90957f9314e8-dirty #16 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:bio_integrity_map_user.cold+0x1b0/0x9d6
Fixes: 492c5d455969 ("block: bio-integrity: directly map user buffers") Acked-by: Chao Shi <cshi008@fiu.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Acked-by: Dave Tian <daveti@purdue.edu> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://github.com/linux-blktests/blktests/pull/244 Link: https://patch.msgid.link/20260512050929.541397-2-iam@sung-woo.kim Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lukas Bulwahn [Thu, 5 Feb 2026 08:11:31 +0000 (09:11 +0100)]
HID: quirks: really enable the intended work around for appledisplay
Commit c7fabe4ad921 ("HID: quirks: work around VID/PID conflict for
appledisplay") intends to add a quirk for kernels built with Apple Cinema
Display support, but it refers to the non-existing config option
CONFIG_APPLEDISPLAY, whereas the config option for Apple Cinema Display
support is named CONFIG_USB_APPLEDISPLAY.
Refer to the intended config option CONFIG_USB_APPLEDISPLAY in the ifdef
directive.
Fixes: c7fabe4ad921 ("HID: quirks: work around VID/PID conflict for appledisplay") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Casey Chen [Mon, 11 May 2026 21:22:30 +0000 (15:22 -0600)]
block: recompute nr_integrity_segments in blk_insert_cloned_request
blk_insert_cloned_request() already recomputes nr_phys_segments
against the bottom queue, because "the queue settings related to
segment counting may differ from the original queue." The exact same
reasoning applies to integrity segments: a stacked driver's underlying
queue can have tighter virt_boundary_mask, seg_boundary_mask, or
max_segment_size than the top queue, in which case
blk_rq_count_integrity_sg() against the bottom queue produces a
different count than the cached rq->nr_integrity_segments inherited
from the source request by blk_rq_prep_clone().
When the cached count is lower than the bottom queue's actual count,
blk_rq_map_integrity_sg() trips
BUG_ON(segments > rq->nr_integrity_segments);
on dispatch. The same families of stacked setups that motivated the
existing nr_phys_segments recompute -- dm-multipath fanning out to
nvme-rdma in particular -- can produce this.
Mirror the nr_phys_segments handling: when the request carries
integrity, recompute nr_integrity_segments against the bottom queue
and reject the request if it exceeds the bottom queue's
max_integrity_segments. blk_rq_count_integrity_sg() and
queue_max_integrity_segments() are both already available via
<linux/blk-integrity.h>, which blk-mq.c includes.
This closes a latent gap in the stacking contract and brings the
integrity-segment accounting in line with the existing
phys-segment accounting.
David Carlier [Mon, 11 May 2026 21:51:51 +0000 (22:51 +0100)]
block: don't overwrite bip_vcnt in bio_integrity_copy_user()
bio_integrity_add_page() already sets bip_vcnt to 1 for the bounce
segment. Overwriting it with nr_vecs breaks bip_vcnt <= bip_max_vcnt
on WRITE (bip_max_vcnt is 1), so the gap-merge checks in block/blk.h
read past the bip_vec[] flex array. On READ the read is in bounds
but lands on a saved user bvec instead of the bounce.
The line was added for split propagation, but bio_integrity_clone()
doesn't copy bip_vcnt and BIP_CLONE_FLAGS excludes BIP_COPY_USER.
Oliver Neukum [Tue, 3 Mar 2026 09:48:54 +0000 (10:48 +0100)]
HID: hid-sjoy: race between init and usage
The driver uses an initial IO to set the device to a default
state. That initialization is currently being done after the device
node has been created. That means that the single buffer used
for output can be altered while IO is in progress.
Move the intialization before announcement to user space.
Fixes: fac733f029251 ("HID: force feedback support for SmartJoy PLUS PS2/USB adapter") Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Steve French [Tue, 12 May 2026 02:55:23 +0000 (21:55 -0500)]
SMB3.1.1: add missing QUERY_DIR info levels
New Infolevels for QUERY_DIR (and QUERY_INFO) levels 78 through 81 are
now being used by Windows clients and were added to the documentation.
Add defines for them (and correct some typos in documentation). See
MS-SMB2 2.2.33 and MS-FSCC 2.4
Signed-off-by: Steve French <stfrench@microsoft.com>
Paolo Abeni [Tue, 12 May 2026 14:15:03 +0000 (16:15 +0200)]
Merge branch 'net-shaper-fix-various-minor-bugs'
Jakub Kicinski says:
====================
net: shaper: fix various minor bugs
Fix various minor bugs in the net shaper API.
First 2 patches deal with ordering issues around inserting
and publishing new shapers. Shapers are inserted "tentatively"
and marked valid only after HW op succeeded, this used to
be slightly racy.
Only other patch of note is patch 8. We want to add a Netlink
policy check on the handle ID. This necessitates patch 7.
Jakub Kicinski [Sun, 10 May 2026 19:29:04 +0000 (12:29 -0700)]
net: shaper: reject QUEUE scope handle with missing id
net_shaper_parse_handle() does not enforce that the user provides
the handle ID. For NODE the ID defaults to UNSPEC for all other
cases it defaults to 0.
For NETDEV 0 is the only option. For QUEUE defaulting to 0 makes
less intuitive sense. Specifically because the behavior should
(IMHO) be the same for all cases where there may be more than
one ID (QUEUE and NODE).
We should either document this as intentional or reject.
I picked the latter with no strong conviction.
Jakub Kicinski [Sun, 10 May 2026 19:29:03 +0000 (12:29 -0700)]
net: shaper: enforce singleton NETDEV scope with id 0
The NETDEV scope represents a singleton root shaper in the per-device
hierarchy. All code assumes NETDEV shapers have id 0:
net_shaper_default_parent() hardcodes parent->id = 0 when returning
the NETDEV parent for QUEUE/NODE children, and the UAPI documentation
describes NETDEV scope as "the main shaper" (singular, not plural).
net_shaper_parse_handle() reads the user-supplied handle ID via
nla_get_u32(), accepting the full u32 range. However, the xarray key
is built by net_shaper_handle_to_index() using
FIELD_PREP(NET_SHAPER_ID_MASK, handle->id), where NET_SHAPER_ID_MASK
is GENMASK(25, 0) - only 26 bits wide. FIELD_PREP silently masks off
the upper bits at runtime. A user-supplied NODE id like 0x04000123
becomes id 0x123.
Additionally, a user-supplied id equal to NET_SHAPER_ID_UNSPEC
(0x03FFFFFF, which is NET_SHAPER_ID_MASK itself) would collide with
the sentinel used internally by the group operation to signal
"allocate a new NODE id".
Reject user-supplied IDs >= NET_SHAPER_ID_MASK (i.e., >= 0x03FFFFFF)
in the policy.
Jakub Kicinski [Sun, 10 May 2026 19:29:01 +0000 (12:29 -0700)]
tools: ynl: add scope qualifier for definitions
Using definitions in kernel policies is awkward right now.
On one hand we want defines for max values and such.
On the other we don't have a way of adding kernel-only defines.
Adding unnecessary defines to uAPI is a bad idea, we won't
be able to delete them. And when it comes to policy user
space should just query it via the policy dump, not use
hard coded defines.
Add a "scope" property to definitions, which will let us tell
the codegen that a definition is for kernel use only. Support
following values:
- uapi: render into the uAPI header (default, today's behavior)
- kernel: render to kernel header only
- user: same as kernel but for the user-side generated header
Definitions may have a header property (definition is "external",
provided by existing header). Extend the scope to headers, too.
If definition has both scope and header properties we will only
generate the includes in the right scope.
Jakub Kicinski [Sun, 10 May 2026 19:29:00 +0000 (12:29 -0700)]
net: shaper: fix undersized reply skb allocation in GROUP command
net_shaper_group_send_reply() writes both the NET_SHAPER_A_IFINDEX
attribute (via net_shaper_fill_binding()) and the nested
NET_SHAPER_A_HANDLE attribute (via net_shaper_fill_handle()), but
the reply skb at the call site in net_shaper_nl_group_doit() is
allocated using net_shaper_handle_size(), which only accounts for
the nested handle.
The allocation is therefore short by nla_total_size(sizeof(u32))
(8 bytes) for the IFINDEX attribute. In practice the slab allocator
rounds up the small allocation so the bug is latent, but the size
accounting is wrong and could bite if the reply grew further.
Introduce net_shaper_group_reply_size() that accounts for the full
reply payload and use it both at the genlmsg_new() call site and in
the defensive WARN_ONCE message.
Jakub Kicinski [Sun, 10 May 2026 19:28:57 +0000 (12:28 -0700)]
net: shaper: reject duplicate leaves in GROUP request
net_shaper_nl_group_doit() does not deduplicate NET_SHAPER_A_LEAVES
entries. When userspace supplies the same leaf handle twice, the same
old-parent pointer lands twice in old_nodes[]. The cleanup loop double
frees the parent. Of course the same parent may still be in old_nodes[]
twice if we are moving multiple of its leaves.
Note that this patch also implicitly fixes the fact that the
i >= leaves_count path forgets to set ret.
Jakub Kicinski [Sun, 10 May 2026 19:28:55 +0000 (12:28 -0700)]
net: shaper: flip the polarity of the valid flag
The usual way of inserting entries which are not yet fully ready
into XArray is to have a VALID flag. The shaper code has a NOT_VALID
flag. Since XArray code does not let us create entries with marks
already set - the creation of entries is currently not atomic.
Flip the polarity of the VALID flag. This closes the tiny race
in net_shaper_pre_insert() of entries being created without
the NOT_VALID flag.
Kang Yang [Tue, 28 Apr 2026 06:17:37 +0000 (14:17 +0800)]
wifi: ath10k: skip WMI and beacon transmission when device is wedged
In ath10k_wmi_cmd_send(), the current code detects ATH10K_STATE_WEDGED
and sets ret to -ESHUTDOWN, but still proceeds to transmit pending
beacons and calls ath10k_wmi_cmd_send_nowait().
This can lead to incorrect behavior, as WMI commands and beacons are
still sent after the device has been marked as wedged, and the original
-ESHUTDOWN return value may be overwritten by the result of the send
path.
The wedged state indicates the hardware is already unreliable, and no
further interaction with firmware is expected or meaningful in this
state.
Fix this by skipping beacon transmission and the WMI send path entirely
once ATH10K_STATE_WEDGED is detected, ensuring consistent return values
and avoiding unnecessary firmware interaction.
Fixes: c256a94d1b1b ("wifi: ath10k: shutdown driver when hardware is unreliable") Signed-off-by: Kang Yang <kang.yang@oss.qualcomm.com> Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com> Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Link: https://patch.msgid.link/20260428061737.37-1-kang.yang@oss.qualcomm.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Nicolas Escande [Wed, 6 May 2026 13:42:40 +0000 (15:42 +0200)]
wifi: ath11k: fix error path leak in ath11k_tm_cmd_wmi_ftm()
This is similar to what was fixed by previous patches. We have a call
to ath11k_wmi_cmd_send() which does check the return value, but forgot
to free the related skb on error.
Fixes: b43310e44edc ("wifi: ath11k: factory test mode support") Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com> Link: https://patch.msgid.link/20260506134240.2284016-4-nico.escande@gmail.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Nicolas Escande [Wed, 6 May 2026 13:42:39 +0000 (15:42 +0200)]
wifi: ath11k: fix error path leaks in some WMI calls
This is the same pattern that was previously identified as problematic:
direct 'return ath11k_wmi_cmd_send(...)' will leak the skb in the error
path if it is not explicitly handled.
Fixes: c417b247ba04 ("ath11k: implement hardware data filter") Fixes: 9cbd7fc9be82 ("ath11k: support MAC address randomization in scan") Fixes: ba9177fcef21 ("ath11k: Add basic WoW functionalities") Fixes: fec4b898f369 ("ath11k: Add WoW net-detect functionality") Fixes: c3c36bfe998b ("ath11k: support ARP and NS offload") Fixes: a16d9b50cfba ("ath11k: support GTK rekey offload") Fixes: 652f69ed9c1b ("ath11k: Add support for SAR") Fixes: 0f84a156aa3b ("ath11k: Handle keepalive during WoWLAN suspend and resume") Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com> Link: https://patch.msgid.link/20260506134240.2284016-3-nico.escande@gmail.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Nicolas Escande [Wed, 6 May 2026 13:42:38 +0000 (15:42 +0200)]
wifi: ath11k: fix error path leaks in some WMI WOW calls
Fix two instances where we used to directly return the result of
ath11k_wmi_cmd_send(...). Because we did not check the return value, we
also did not free the skb in the error path.
Fixes: 79802b13a492 ("ath11k: implement WoW enable and wakeup commands") Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com> Link: https://patch.msgid.link/20260506134240.2284016-2-nico.escande@gmail.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
net: ethernet: cs89x0: remove stale CONFIG_MACH_MX31ADS reference
The legacy ARM board file for MACH_MX31ADS was removed in commit c93197b0041d ("ARM: imx: Remove i.MX31 board files"), but a reference
to it remained in the cs89x0 driver. Drop this unused code.
Linus Walleij [Fri, 8 May 2026 22:13:38 +0000 (00:13 +0200)]
net: ethernet: cortina: Carry over frag counter
The gmac_rx() NAPI poll function assembles packets in an
SKB from a ring buffer.
If the ring buffer gets completely emptied during a poll cycle,
we exit gmac_rx(), but the packet is not yet completely
assembled in the SKB, yet the fragment counter frag_nr is
reset to zero on the next invocation.
Solve this by making the RX fragment counter a part of the
port struct, and carry it over between invocations.
Reset the fragment counter only right after calling
napi_gro_frags(), on error (after calling napi_free_frags())
or if stopping the port.
Reset it in some place where not strictly necessary just to
emphasize what is going on.
This was found by Sashiko during normal patch review.
Linus Walleij [Fri, 8 May 2026 22:13:37 +0000 (00:13 +0200)]
net: ethernet: cortina: Make RX SKB per-port
The SKB used to assemble packets from fragments in gmac_rx()
is static local, but the Gemini has two ethernet ports, meaning
there can be races between the ports on a bad day if a device
is using both.
Make the RX SKB a per-port variable and carry it over between
invocations in the port struct instead.
Zero the pointer once we call napi_gro_frags(), on error (after
calling napi_free_frags()) or if the port is stopped.
Zero it in some place where not strictly necessary just to
emphasize what is going on.
This was found by Sashiko during normal patch review.
Here are the outstanding miscellaneous fixes for netfslib gathered together
and with some fixes-to-fixes folded down and one rearrangement. Various
Sashiko review comments[1][2][3][4][5] are addressed:
(1) Fix subrequest cancellation cleanup in DIO read and single-read.
(2) Fix missing locking around retry adding new subrequests.
(3) Fix read and write result collection to use barriering correctly to
access a request's subrequest lists without taking a lock.
This adds list_add_tail_release() and
list_first_entry_or_null_acquire() to appropriate incorporate
barriering into some list functions.
(4) Fix netfs_read_to_pagecache() to pause on subrequest I/O failure.
(5) Fix the potential for 64-bit tearing on a 32-bit machine when reading
netfs_inode->remote_i_size and ->zero_point by using much the same
mechanism as is used for ->i_size.
(6) Fix the calculation of zero_point in netfs_release_folio() to limit it
to ->remote_i_size, not ->i_size.
(7) Fix triggering of a VM_BUG_ON_FOLIO() in netfs_write_begin().
(8) Fix a potentially uninitialised error value in
netfs_extract_user_iter().
(9) Fix error handling in netfs_extract_user_iter().
(10) Fix overrun checking in netfs_extract_user_iter().
(11) Fix netfs_invalidate_folio() to clear the folio dirty bit if all dirty
data removed.
(12) Defer the emission of trace_netfs_folio() in netfs_perform_write().
This allows the next patch to emit the correct traces.
(13) Fix the handling of a partially failed copy (ie. EFAULT) into a
streaming write folio. Also remove the netfs_folio if a streaming
write folio is entirely overwritten.
(14) Fix a potential deadlock in writethrough writing.
(15) Fix netfs_read_gaps() to remove the netfs_folio from a filled folio.
(16) Fix netfs_perform_write() to not disable streaming writes when writing
to an fd that's open O_RDWR.
(17) Fix an early put of the sink page used in netfs_read_gaps(), before
the request has completed.
(18) Fix request leak in netfs_write_begin() error handling.
(19) Fix a potential UAF in netfs_unlock_abandoned_read_pages() due to
trying to check index of each folio we're abandoning to see if that
folio is actually owned by the caller (in which case, we're not
actually allowed to dereference it).
(20) Fix incorrect adjustment of dirty region when partially invalidating a
streaming write folio.
(21) Fix the handling of folio->private in netfs_perform_write() and the
attached netfs_folio and/or group when a streaming write folio is
modified.
(22) Fix netfs_read_folio() to wait on writeback first (it holds the folio
lock) otherwise we aren't allowed to look at the netfs_folio struct as
that could be modified at any time by the writeback collector.
(23) Fix write skipping in dir/symlink writepages.
* patches from https://patch.msgid.link/20260512123404.719402-1-dhowells@redhat.com: (24 commits)
afs: Fix the locking used by afs_get_link()
netfs, afs: Fix write skipping in dir/link writepages
netfs: Fix netfs_read_folio() to wait on writeback
netfs: Fix folio->private handling in netfs_perform_write()
netfs: Fix partial invalidation of streaming-write folio
netfs: Fix potential UAF in netfs_unlock_abandoned_read_pages()
netfs: Fix leak of request in netfs_write_begin() error handling
netfs: Fix early put of sink folio in netfs_read_gaps()
netfs: Fix write streaming disablement if fd open O_RDWR
netfs: Fix read-gaps to remove netfs_folio from filled folio
netfs: Fix potential deadlock in write-through mode
netfs: Fix streaming write being overwritten
netfs: Defer the emission of trace_netfs_folio()
netfs: Fix netfs_invalidate_folio() to clear dirty bit if all changes gone
netfs: Fix overrun check in netfs_extract_user_iter()
netfs: fix error handling in netfs_extract_user_iter()
netfs: Fix potential uninitialised var in netfs_extract_user_iter()
netfs: fix VM_BUG_ON_FOLIO() issue in netfs_write_begin() call
netfs: Fix zeropoint update where i_size > remote_i_size
netfs: Fix potential for tearing in ->remote_i_size and ->zero_point
...
David Howells [Tue, 12 May 2026 12:34:01 +0000 (13:34 +0100)]
afs: Fix the locking used by afs_get_link()
The afs filesystem in the kernel doesn't do locking correctly for symbolic
links. There are a number of problems:
(1) It doesn't do any locking around afs_read_single() to prevent races
between multiple ->get_link() calls, thereby allowing the possibility
of leaks.
(2) It doesn't use RCU barriering when accessing the buffer pointers
during RCU pathwalk.
(3) It can race with another thread updating the contents of the symlink
if a third party updated it on the server.
Fix this by the following means:
(0) Move symlink handling into its own file as this makes it more
complicated.
(1) Take the validate_lock around afs_read_single() to prevent races
between multiple ->get_link() calls.
(2) Keep a separate copy of the symlink contents with an rcu_head. This
is always going to be a lot smaller than a page, so it can be
kmalloc'd and save quite a bit of memory. It also needs a refcount
for non-RCU pathwalk.
(3) Split the symlink read and write-to-cache routines in afs from those
for directories.
(4) Discard the I/O buffer as soon as the write-to-cache completes as this
is a full page (plus a folio_queue).
(5) If there's no cache, discard the I/O buffer immediately after reading
and copying if there is no cache.
Fixes: eae9e78951bb ("afs: Use netfslib for symlinks, allowing them to be cached") Fixes: 6698c02d64b2 ("afs: Locally initialise the contents of a new symlink on creation") Closes: https://sashiko.dev/#/patchset/20260326104544.509518-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-25-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:34:00 +0000 (13:34 +0100)]
netfs, afs: Fix write skipping in dir/link writepages
Fix netfs_write_single() and afs_single_writepages() to better handle a
write that would be skipped due to lock contention and WB_SYNC_NONE by
returning 1 from netfs_write_single() if it skipped and making
afs_single_writepages() skip also. If a skip occurs, the inode must be
re-marked as the VFS may have cleared the mark.
This is really only theoretical for directories in netfs_write_single() as
the only path to that is through afs_single_writepages() that takes the
->validate_lock around it, thereby serialising it.
Fixes: 6dd80936618c ("afs: Use netfslib for directories") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-24-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:59 +0000 (13:33 +0100)]
netfs: Fix netfs_read_folio() to wait on writeback
Fix netfs_read_folio() to wait for an ongoing writeback to complete so that
it can trust the dirty flag and whatever is attached to folio->private
(folio->private may get cleaned up by the collector before it clears the
writeback flag).
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-23-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:58 +0000 (13:33 +0100)]
netfs: Fix folio->private handling in netfs_perform_write()
Under some circumstances, netfs_perform_write() doesn't correctly
manipulate folio->private between NULL, NETFS_FOLIO_COPY_TO_CACHE, pointing
to a group and pointing to a netfs_folio struct, leading to potential
multiple attachments of private data with associated folio ref leaks and
also leaks of netfs_folio structs or netfs_group refs.
Fix this by consolidating the place at which a folio is marked uptodate in
one place and having that look at what's attached to folio->private and
decide how to clean it up and then set the new group. Also, the content
shouldn't be flushed if group is NULL, even if a group is specified in the
netfs_group parameter, as that would be the case for a new folio. A
filesystem should always specify netfs_group or never specify netfs_group.
The Sashiko auto-review tool noted that it was theoretically possible that
the fpos >= ctx->zero_point section might leak if it modified a streaming
write folio. This is unlikely, but with a network filesystem, third party
changes can happen. It also pointed out that __netfs_set_group() would
leak if called multiple times on the same folio from the "whole folio
modify section".
Fixes: 8f52de0077ba ("netfs: Reduce number of conditional branches in netfs_perform_write()") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-22-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:57 +0000 (13:33 +0100)]
netfs: Fix partial invalidation of streaming-write folio
In netfs_invalidate_folio(), if the region of a partial invalidation
overlaps the front (but not all) of a dirty write cached in a streaming
write page (dirty, but not uptodate, with the dirty region tracked by a
netfs_folio struct), the function modifies the dirty region - but
incorrectly as it moves the region forward by setting the start to the
start, not the end, of the invalidation region.
Fix this by setting finfo->dirty_offset to the end of the invalidation
region (iend).
Fixes: cce6bfa6ca0e ("netfs: Fix trimming of streaming-write folios in netfs_inval_folio()") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-21-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:56 +0000 (13:33 +0100)]
netfs: Fix potential UAF in netfs_unlock_abandoned_read_pages()
netfs_unlock_abandoned_read_pages(rreq) accesses the index of the folios it
is wanting to unlock and compares that to rreq->no_unlock_folio so that it
doesn't unlock a folio being read for netfs_perform_write() or
netfs_write_begin().
However, given that netfs_unlock_abandoned_read_pages() is called _after_
NETFS_RREQ_IN_PROGRESS is cleared, the one folio that it's not allowed to
dereference is the one specified by ->no_unlock_folio as ownership
immediately reverts to the caller.
Fix this by storing the folio pointer instead and using that rather than
the index. Also fix netfs_unlock_read_folio() where the same applies.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-20-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:55 +0000 (13:33 +0100)]
netfs: Fix leak of request in netfs_write_begin() error handling
Fix netfs_write_begin() to not leak our ref on the request in the event
that we get an error from netfs_wait_for_read().
Fixes: 4090b31422a6 ("netfs: Add a function to consolidate beginning a read") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-19-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:54 +0000 (13:33 +0100)]
netfs: Fix early put of sink folio in netfs_read_gaps()
Fix netfs_read_gaps() to release the sink page it uses after waiting for
the request to complete. The way the sink page is used is that an
ITER_BVEC-class iterator is created that has the gaps from the target folio
at either end, but has the sink page tiled over the middle so that a single
read op can fill in both gaps.
The bug was found by KASAN detecting a UAF on the generic/075 xfstest in
the cifsd kernel thread that handles reception of data from the TCP socket:
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Reported-by: Steve French <sfrench@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-18-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:53 +0000 (13:33 +0100)]
netfs: Fix write streaming disablement if fd open O_RDWR
In netfs_perform_write(), "write streaming" (the caching of dirty data in
dirty but !uptodate folios) is performed to avoid the need to read data
that is just going to get immediately overwritten. However, this is/will
be disabled in three circumstances: if the fd is open O_RDWR, if fscache is
in use (as we need to round out the blocks for DIO) or if content
encryption is enabled (again for rounding out purposes).
The idea behind disabling it if the fd is open O_RDWR is that we'd need to
flush the write-streaming page before we could read the data, particularly
through mmap. But netfs now fills in the gaps if ->read_folio() is called
on the page, so that is unnecessary. Further, this doesn't actually work
if a separate fd is open for reading.
Fix this by removing the check for O_RDWR, thereby allowing streaming
writes even when we might read.
This caused a number of problems with the generic/522 xfstest, but those
are now fixed.
Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-17-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:52 +0000 (13:33 +0100)]
netfs: Fix read-gaps to remove netfs_folio from filled folio
Fix netfs_read_gaps() to remove the netfs_folio record from the folio
record before marking the folio uptodate if it successfully fills the gaps
around the dirty data in a streaming write folio (dirty, but not uptodate).
David Howells [Tue, 12 May 2026 12:33:51 +0000 (13:33 +0100)]
netfs: Fix potential deadlock in write-through mode
Fix netfs_advance_writethrough() to always unlock the supplied folio and to
mark it dirty if it isn't yet written to the end. Unfortunately, it can't
be marked for writeback until the folio is done with as that may cause a
deadlock against mmapped reads and writes.
Even though it has been marked dirty, premature writeback can't occur as
the caller is holding both inode->i_rwsem (which will prevent concurrent
truncation, fallocation, DIO and other writes) and ictx->wb_lock (which
will cause flushing to wait and writeback to skip or wait).
Note that this may be easier to deal with once the queuing of folios is
split from the generation of subrequests.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Closes: https://sashiko.dev/#/patchset/20260427154639.180684-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-15-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:50 +0000 (13:33 +0100)]
netfs: Fix streaming write being overwritten
In order to avoid reading whilst writing, netfslib will allow "streaming
writes" in which dirty data is stored directly into folios without reading
them first. Such folios are marked dirty but may not be marked uptodate.
If a folio is entirely written by a streaming write, uptodate will be set,
otherwise it will have a netfs_folio struct attached to ->private recording
the dirty region.
In the event that a partially written streaming write page is to be
overwritten entirely by a single write(), netfs_perform_write() will try to
copy over it, but doesn't discard the netfs_folio if it succeeds; further,
it doesn't correctly handle a partial copy that overwrites some of the
dirty data.
Fix this by the following:
(1) If the folio is successfully overwritten, free the netfs_folio struct
before marking the page uptodate.
(2) If the copy to the folio partially fails, but short of the dirty data,
just ignore the copy.
(3) If the copy partially fails and overwrites some of the dirty data,
accept the copy, update the netfs_folio struct to record the new data.
If the folio is now filled, free the netfs_folio and set uptodate,
otherwise return a partial write.
It shows folio 0x24 misbehaving if the FMODE_READ check is commented out in
netfs_perform_write():
if (//(file->f_mode & FMODE_READ) ||
netfs_is_cache_enabled(ctx)) {
and no fscache. This was initially found with the generic/522 xfstest.
Fixes: 8f52de0077ba ("netfs: Reduce number of conditional branches in netfs_perform_write()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-14-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:49 +0000 (13:33 +0100)]
netfs: Defer the emission of trace_netfs_folio()
Change netfs_perform_write() to keep the netfs_folio trace value in a
variable and emit it later to make it easier to choose the value displayed.
This is a prerequisite for a subsequent patch.
Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-13-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:48 +0000 (13:33 +0100)]
netfs: Fix netfs_invalidate_folio() to clear dirty bit if all changes gone
If a streaming write is made, this will leave the relevant modified folio
in a not-uptodate, but dirty state with a netfs_folio struct hung off of
folio->private indicating the dirty range. Subsequently truncating the
file such that the dirty data in the folio is removed, but the first part
of the folio theoretically remains will cause the netfs_folio struct to be
discarded... but will leave the dirty flag set.
If the folio is then read via mmap(), netfs_read_folio() will see that the
page is dirty and jump to netfs_read_gaps() to fill in the missing bits.
netfs_read_gaps(), however, expects there to be a netfs_folio struct
present and can oops because truncate removed it.
Fix this by calling folio_cancel_dirty() in netfs_invalidate_folio() in the
event that all the dirty data in the folio is erased (as nfs does).
Also add some tracepoints to log modifications to a dirty page.
with fscaching disabled (otherwise streaming writes are suppressed) and a
change to netfs_perform_write() to disallow streaming writes if the fd is
open O_RDWR:
if (//(file->f_mode & FMODE_READ) || <--- comment this out
netfs_is_cache_enabled(ctx)) {
It should be reproducible even without this change, but if prevents the
above trivial xfs_io command from reproducing it.
Note that the initial dd is important: the file must start out sufficiently
large that the zero-point logic doesn't just clear the gaps because it
knows there's nothing in the file to read yet. Unmounting and mounting is
needed to clear the pagecache (there are other ways to do that that may
also work).
This was initially reproduced with the generic/522 xfstest on some patches
that remove the FMODE_READ restriction.
Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-12-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:47 +0000 (13:33 +0100)]
netfs: Fix overrun check in netfs_extract_user_iter()
Fix netfs_extract_user_iter() so that if iov_iter_extract_pages() overfills
pages[], then those pages don't get included in the iterator constructed at
the end of the function. If there was an overfill, memory corruption has
already happened.
Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Closes: https://sashiko.dev/#/patchset/20260427154639.180684-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-11-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
Paulo Alcantara [Tue, 12 May 2026 12:33:46 +0000 (13:33 +0100)]
netfs: fix error handling in netfs_extract_user_iter()
In netfs_extract_user_iter(), if iov_iter_extract_pages() failed to
extract user pages, bail out on -ENOMEM, otherwise return the error
code only if @npages == 0, allowing short DIO reads and writes to be
issued.
This fixes mmapstress02 from LTP tests against CIFS.
Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-10-dhowells@redhat.com Cc: netfs@lists.linux.dev Cc: stable@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:45 +0000 (13:33 +0100)]
netfs: Fix potential uninitialised var in netfs_extract_user_iter()
In netfs_extract_user_iter(), if it's given a zero-length iterator, it will
fall through the loop without setting ret, and so the error handling
behaviour will be undefined, depending on whether ret happens to be
negative. The value of ret then propagates back up the callstack.
Fix this by presetting ret to 0.
Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-9-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
The race sequence:
1. Read completes -> netfs_read_collection() runs
2. netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, ...)
3. netfs_wait_for_read() returns -EFAULT to netfs_write_begin()
4. The netfs_unlock_abandoned_read_pages() unlocks the folio
5. netfs_write_begin() calls folio_unlock(folio) -> VM_BUG_ON_FOLIO()
The key reason of the issue that netfs_unlock_abandoned_read_pages()
doesn't check the flag NETFS_RREQ_NO_UNLOCK_FOLIO and executes
folio_unlock() unconditionally. This patch implements in
netfs_unlock_abandoned_read_pages() logic similar to
netfs_unlock_read_folio().
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-8-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: Ceph Development <ceph-devel@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:43 +0000 (13:33 +0100)]
netfs: Fix zeropoint update where i_size > remote_i_size
Fix the update of the zero point[*] by netfs_release_folio() when there is
uncommitted data in the pagecache beyond the folio being released but the
on-server EOF is in this folio (ie. i_size > remote_i_size). The update
needs to limit zero_point to remote_i_size, not i_size as i_size is a local
phenomenon reflecting updates made locally to the pagecache, not stuff
written to the server. remote_i_size tracks the server's i_size.
[*] The zero point is the file position from which we can assume that the
server will just return zeros, so we can avoid generating reads.
Note that netfs_invalidate_folio() probably doesn't need fixing as
zero_point should be updated by setattr after truncation or fallocate.
David Howells [Tue, 12 May 2026 12:33:42 +0000 (13:33 +0100)]
netfs: Fix potential for tearing in ->remote_i_size and ->zero_point
Fix potential tearing in using ->remote_i_size and ->zero_point by copying
i_size_read() and i_size_write() and using the same seqcount as for i_size.
We need to make sure that netfslib and the filesystems that use it always
hold i_lock whilst updating any of the sizes to prevent i_size_seqcount
from getting corrupted.
Fixes: 4058f742105e ("netfs: Keep track of the actual remote file size") Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-6-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:40 +0000 (13:33 +0100)]
netfs: Fix missing barriers when accessing stream->subrequests locklessly
The list of subrequests attached to stream->subrequests is accessed without
locks by netfs_collect_read_results() and netfs_collect_write_results(),
and then they access subreq->flags without taking a barrier after getting
the subreq pointer from the list. Relatedly, the functions that build the
list don't use any sort of write barrier when constructing the list to make
sure that the NETFS_SREQ_IN_PROGRESS flag is perceived to be set first if
no lock is taken.
Fix this by:
(1) Add a new list_add_tail_release() function that uses a release barrier
to set the pointer to the new member of the list.
(2) Add a new list_first_entry_or_null_acquire() function that uses an
acquire barrier to read the pointer to the first member in a list (or
return NULL).
(3) Use list_add_tail_release() when adding a subreq to ->subrequests.
(4) Use list_first_entry_or_null_acquire() when initially accessing the
front of the list (when an item is removed, the pointer to the new
front iterm is obtained under the same lock).
David Howells [Tue, 12 May 2026 12:33:39 +0000 (13:33 +0100)]
netfs: Fix missing locking around retry adding new subreqs
Fix netfs_retry_read_subrequests() and netfs_retry_write_stream() to take
the appropriate lock when adding extra subrequests into
stream->subrequests.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Closes: https://sashiko.dev/#/patchset/20260425125426.3855807-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-3-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Tue, 12 May 2026 12:33:38 +0000 (13:33 +0100)]
netfs: Fix cancellation of a DIO and single read subrequests
When the preparation of a new subrequest for a read fails, if the
subrequest has already been added to the stream->subrequests list, it can't
simply be put and abandoned as the collector may see it. Also, if it
hasn't been queued yet, it has two outstanding refs that both need to be
put. Both DIO read and single-read dispatch fail at this; further, both
differ in the order they do things to the way buffered read works.
Fix cancellation of both DIO-read and single-read subrequests that failed
preparation by the following steps:
(1) Harmonise all three reads (buffered, dio, single) to queue the subreq
before prepping it.
(2) Make all three call netfs_queue_read() to do the queuing.
(3) Set NETFS_RREQ_ALL_QUEUED independently of the queuing as we don't
know the length of the subreq at this point.
(4) In all cases, set the error and NETFS_SREQ_FAILED flag on the subreq
and then call netfs_read_subreq_terminated() to deal with it. This
will pass responsibility off to the collector for dealing with it.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Closes: https://sashiko.dev/#/patchset/20260425125426.3855807-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-2-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
before calling poll_select_set_timeout() -> timespec64_valid(). Both
operands of the seconds sum are unbounded user-controlled signed long.
A crafted pair where tv_usec is a negative multiple of USEC_PER_SEC
drives the sum across the wrap boundary - e.g.
yields sec = LONG_MAX, nsec = 0, which passes timespec64_valid() and
then flows through timespec64_add_safe(), which saturates the absolute
deadline to TIME64_MAX (clamped further to KTIME_MAX downstream).
select(2) therefore blocks effectively forever instead of returning
-EINVAL as POSIX requires for a negative timeout.
Only the legacy __NR_select syscall takes this path. pselect6, ppoll,
poll and epoll_pwait2 all hand the user's two fields directly to
poll_select_set_timeout(), which validates *before* doing any
arithmetic:
/* fs/select.c:271 -- the validator */
int poll_select_set_timeout(struct timespec64 *to, time64_t sec, long nsec)
{
struct timespec64 ts = {.tv_sec = sec, .tv_nsec = nsec};
if (!timespec64_valid(&ts))
return -EINVAL;
...
}
/* include/linux/time64.h:97 -- timespec64_valid */
if (ts->tv_sec < 0) return false;
if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC) return false;
/* fs/select.c:744 do_pselect() (pselect6, pselect6_time32) */
if (get_timespec64(&ts, tsp)) return -EFAULT;
if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) return -EINVAL;
/* fs/select.c:1097 ppoll */
if (get_timespec64(&ts, tsp)) return -EFAULT;
if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) return -EINVAL;
/* fs/select.c:1065 poll -- timeout_msecs is int; >= 0 gates the math */
if (timeout_msecs >= 0)
poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
/* fs/eventpoll.c:2512 epoll_pwait2 */
if (get_timespec64(&ts, timeout)) return -EFAULT;
if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) return -EINVAL;
In every one of these the wrap-prone arithmetic from kern_select()
simply does not exist; the user fields reach timespec64_valid()
unmodified. glibc routes the C-library select() through pselect6,
so the bug is reachable only via a direct syscall(__NR_select, ...).
The pre-validation negative check that used to live here was lost
when the syscall was switched to the poll_select_set_timeout() helper.
Restore it: reject tv_sec < 0 || tv_usec < 0 up front, mirroring what
glibc does in userspace. do_compat_select() has the same arithmetic
pattern but is only reachable on 32-bit compat and from a different
syscall entry; left for a follow-up so this change stays minimal.
Reproducer (returns -1/EINVAL on a fixed kernel; blocks indefinitely
on an unfixed one):
Fixes: 4d36a9e65d49 ("select: deal with math overflow from borderline valid userland data") Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260429-timeval-v1-1-4448e2588bbf@debian.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
====================
vsock/virtio: fix vsockmon tap skb construction
While reviewing the patch posted by Yiqi Sun [1] to fix an issue in
virtio_transport_build_skb(), I discovered another issue related to
the offset and length of the payload to be copied in the new skb.
This was introduced when we did the skb conversion, and fixed by
patch 1.
Patch 2 fixes the issue found by Yiqi Sun in a different way: using
iov_iter_kvec() to properly initialize all the iov_iter fields and
removing the linear vs non-linear split like we alredy do in
vhost-vsock.
It could have been a single patch, but since there were two affected
commits, I decided to keep the fixes separate.
vsock/virtio: fix empty payload in tap skb for non-linear buffers
For non-linear skbs, virtio_transport_build_skb() goes through
virtio_transport_copy_nonlinear_skb() to copy the original payload
in the new skb to be delivered to the vsockmon tap device.
This manually initializes an iov_iter but does not set iov_iter.count.
Since the iov_iter is zero-initialized, the copy length is zero and no
payload is actually copied to the monitor interface, leaving data
un-initialized.
Fix this by removing the linear vs non-linear split and using
skb_copy_datagram_iter() with iov_iter_kvec() for all cases, as
vhost-vsock already does. This handles both linear and non-linear skbs,
properly initializes the iov_iter, and removes the now unused
virtio_transport_copy_nonlinear_skb().
While touching this code, let's also check the return value of
skb_copy_datagram_iter(), even though it's unlikely to fail.
Fixes: 4b0bf10eb077 ("vsock/virtio: non-linear skb handling for tap") Reported-by: Yiqi Sun <sunyiqixm@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-3-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vsock/virtio: fix length and offset in tap skb for split packets
virtio_transport_build_skb() builds a new skb to be delivered to the
vsockmon tap device. To build the new skb, it uses the original skb
data length as payload length, but as the comment notes, the original
packet stored in the skb may have been split in multiple packets, so we
need to use the length in the header, which is correctly updated before
the packet is delivered to the tap, and the offset for the data.
This was also similar to what we did before commit 71dc9ec9ac7d
("virtio/vsock: replace virtio_vsock_pkt with sk_buff") where we probably
missed something during the skb conversion.
Also update the comment above, which was left stale by the skb
conversion and still mentioned a buffer pointer that no longer exists.
Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-2-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Quan Sun [Fri, 8 May 2026 12:46:36 +0000 (20:46 +0800)]
net: hsr: fix NULL pointer dereference in hsr_get_node_data()
In the HSR (High-availability Seamless Redundancy) protocol, node
information is maintained in the node_db. When a supervision frame is
received, node->addr_B_port is updated to track the receiving port type
(e.g., HSR_PT_SLAVE_B).
If the underlying physical interface associated with this slave port is
removed (e.g., via `ip link del`), hsr_del_port() frees the hsr_port
object. However, the stale node->addr_B_port reference is kept in the
node_db until the node ages out.
Subsequently, if userspace queries the node status via the Netlink
command HSR_C_GET_NODE_STATUS, the kernel calls hsr_get_node_data().
This function unconditionally dereferences the pointer returned by
hsr_port_get_hsr():
if (node->addr_B_port != HSR_PT_NONE) {
port = hsr_port_get_hsr(hsr, node->addr_B_port);
*addr_b_ifindex = port->dev->ifindex; // <-- NULL deref
}
If the slave port has been deleted, hsr_port_get_hsr() returns NULL,
resulting in a kernel panic.
Oops: general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:hsr_get_node_data+0x7b6/0x9e0
Call Trace:
<TASK>
hsr_get_node_status+0x445/0xa40
Fix this by adding a proper NULL pointer check. If the port lookup fails
due to a stale port type, gracefully treat it as if no valid port exists
and assign -1 to the interface index.
Steps to reproduce:
1. Create an HSR interface with two slave devices.
2. Receive a supervision frame to populate node_db with
addr_B_port assigned to SLAVE_B.
3. Delete the underlying slave device B.
4. Send an HSR_C_GET_NODE_STATUS Netlink message.
drm/i915/dp: Fix VSC dynamic range signaling for RGB formats
For RGB, set dynamic_range to CTA or VESA based on
crtc_state->limited_color_range so sinks apply correct
quantization. YCbCr remains limited (CTA) range.
(DP v1.4, Table 5-1)
drm/i915: skip __i915_request_skip() for already signaled requests
After a GPU reset the HWSP is zeroed, so previously completed
requests appear incomplete. If such a request is picked up during
reset_rewind() and marked guilty, i915_request_set_error_once()
returns early (fence already signaled), leaving fence.error without
a fatal error code. The subsequent __i915_request_skip() then hits:
```
GEM_BUG_ON(!fatal_error(rq->fence.error))
```
Fixes a kernel BUG observed on Sandy Bridge (Gen6) during
heartbeat-triggered engine resets.
```
kernel BUG at drivers/gpu/drm/i915/i915_request.c:556!
RIP: __i915_request_skip+0x15e/0x1d0 [i915]
...
__i915_request_reset+0x212/0xa70 [i915]
reset_rewind+0xe4/0x280 [i915]
intel_gt_reset+0x30d/0x5b0 [i915]
heartbeat+0x516/0x530 [i915]
```
Guard __i915_request_skip() with i915_request_signaled(), if the
fence is already signaled, the ring content is committed and there
is nothing left to skip.
Fixes: 36e191f0644b ("drm/i915: Apply i915_request_skip() on submission") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/13729 Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Cc: stable@vger.kernel.org # v5.7+ Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/fe76921d35b6ae85aa651822726d0d9815aa5362.1776339012.git.sebastian.brzezinka@intel.com
(cherry picked from commit 5ba54393dcd7adf75a9f39f5a933b1538349cad5) Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Sven Eckelmann [Sat, 2 May 2026 19:25:19 +0000 (21:25 +0200)]
batman-adv: tt: prevent TVLV entry number overflow
The helpers to prepare the buffers for the local and global TT based
replies are trying to sum up all TT entries which can be found for each
VLAN. In theory, this sum can be too big for an u16 and therefore overflow.
A too small buffer would then be allocated for the TVLV.
The too small buffer will be handled gracefully by
batadv_tt_tvlv_generate() and is not causing a buffer overflow - just a
truncated reply. But this overflow shouldn't have happened in the first and
the too small buffer should never have been allocated when an overflow was
detected.
Cc: stable@kernel.org Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 2 May 2026 18:47:34 +0000 (20:47 +0200)]
batman-adv: tt: avoid empty VLAN responses
The commit 16116dac2339 ("batman-adv: prevent TT request storms by not
sending inconsistent TT TLVLs") added checks to the local (direct) TT
response code. But the response can also be done indirectly by another node
using the global TT state. To avoid such inconsistency states reported in
the original fix, also avoid sending empty VLANs for replies from the
global TT state.
Cc: stable@kernel.org Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 2 May 2026 17:47:11 +0000 (19:47 +0200)]
batman-adv: tt: fix TOCTOU race for reported vlans
The local TT based TVLV is generated by first checking the number of VLANs
which have at least one TT entry. A new buffer with the correct size for
the VLANs is then allocated. Only then, the list of VLANs s used to fill
the VLAN entries in the buffer. During this time, the meshif_vlan_list_lock
is held. But the actual number of TT entries of each VLAN can still
increase during this time - just not the number of VLANs in the list.
But the prefilter used in the buffer size calculation might still cause an
increase of the number of VLANs which need to be stored. Simply because a
VLAN might now suddenly have at least one entry when it had none in the
pre-alloc check - and then needs to occupy space which was not allocated.
It is better to overestimate the buffer size at the beginning and then fill
the buffer only with the VLANs which are not empty.
Cc: stable@kernel.org Fixes: 16116dac2339 ("batman-adv: prevent TT request storms by not sending inconsistent TT TLVLs") Signed-off-by: Sven Eckelmann <sven@narfation.org>