When the kobject of the kset for /sys/fs/nilfs2 is initialized, its ktype
is set to kset_ktype, which has a ->sysfs_ops of kobj_sysfs_ops. When
nilfs_feature_attr_group is added to that kobject via
sysfs_create_group(), the kernfs_ops of each files is sysfs_file_kfops_rw,
which will call sysfs_kf_seq_show() when ->seq_show() is called.
sysfs_kf_seq_show() in turn calls kobj_attr_show() through
->sysfs_ops->show(). kobj_attr_show() casts the provided attribute out to
a 'struct kobj_attribute' via container_of() and calls ->show(), resulting
in the CFI violation since neither nilfs_feature_revision_show() nor
nilfs_feature_README_show() match the prototype of ->show() in 'struct
kobj_attribute'.
Resolve the CFI violation by adjusting the second parameter in
nilfs_feature_{revision,README}_show() from 'struct attribute' to 'struct
kobj_attribute' to match the expected prototype.
The original code uses cancel_delayed_work() in cnic_cm_stop_bnx2x_hw(),
which does not guarantee that the delayed work item 'delete_task' has
fully completed if it was already running. Additionally, the delayed work
item is cyclic, the flush_workqueue() in cnic_cm_stop_bnx2x_hw() only
blocks and waits for work items that were already queued to the
workqueue prior to its invocation. Any work items submitted after
flush_workqueue() is called are not included in the set of tasks that the
flush operation awaits. This means that after the cyclic work items have
finished executing, a delayed work item may still exist in the workqueue.
This leads to use-after-free scenarios where the cnic_dev is deallocated
by cnic_free_dev(), while delete_task remains active and attempt to
dereference cnic_dev in cnic_delete_task().
A typical race condition is illustrated below:
CPU 0 (cleanup) | CPU 1 (delayed work callback)
cnic_netdev_event() |
cnic_stop_hw() | cnic_delete_task()
cnic_cm_stop_bnx2x_hw() | ...
cancel_delayed_work() | /* the queue_delayed_work()
flush_workqueue() | executes after flush_workqueue()*/
| queue_delayed_work()
cnic_free_dev(dev)//free | cnic_delete_task() //new instance
| dev = cp->dev; //use
Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure
that the cyclic delayed work item is properly canceled and that any
ongoing execution of the work item completes before the cnic_dev is
deallocated. Furthermore, since cancel_delayed_work_sync() uses
__flush_work(work, true) to synchronously wait for any currently
executing instance of the work item to finish, the flush_workqueue()
becomes redundant and should be removed.
This bug was identified through static analysis. To reproduce the issue
and validate the fix, I simulated the cnic PCI device in QEMU and
introduced intentional delays — such as inserting calls to ssleep()
within the cnic_delete_task() function — to increase the likelihood
of triggering the bug.
The expression `(conf->instr_type == 64) << iq_no` can overflow because
`iq_no` may be as high as 64 (`CN23XX_MAX_RINGS_PER_PF`). Casting the
operand to `u64` ensures correct 64-bit arithmetic.
Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts commit d24341740fe48add8a227a753e68b6eedf4b385a.
It causes errors when trying to configure QoS, as well as
loss of L2 connectivity (on multi-host devices).
Reported-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/20250910170011.70528106@kernel.org Fixes: d24341740fe4 ("net/mlx5e: Update and set Xon/Xoff upon port speed set") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
syzbot reported the splat below where a socket had tcp_sk(sk)->fastopen_rsk
in the TCP_ESTABLISHED state. [0]
syzbot reused the server-side TCP Fast Open socket as a new client before
the TFO socket completes 3WHS:
1. accept()
2. connect(AF_UNSPEC)
3. connect() to another destination
As of accept(), sk->sk_state is TCP_SYN_RECV, and tcp_disconnect() changes
it to TCP_CLOSE and makes connect() possible, which restarts timers.
Since tcp_disconnect() forgot to clear tcp_sk(sk)->fastopen_rsk, the
retransmit timer triggered the warning and the intended packet was not
retransmitted.
Let's call reqsk_fastopen_remove() in tcp_disconnect().
i40e has a feature which writes to memory location last descriptor
successfully sent. Memory barrier in i40e_clean_tx_irq() was used to
avoid forward-reading descriptor fields in case DD bit was not set.
Having mentioned feature in place implies that such situation will not
happen as we know in advance how many descriptors HW has dealt with.
Besides, this barrier placement was wrong. Idea is to have this
protection *after* reading DD bit from HW descriptor, not before.
Digging through git history showed me that indeed barrier was before DD
bit check, anyways the commit introducing i40e_get_head() should have
wiped it out altogether.
Also, there was one commit doing s/read_barrier_depends/smp_rmb when get
head feature was already in place, but it was only theoretical based on
ixgbe experiences, which is different in these terms as that driver has
to read DD bit from HW descriptor.
Fixes: 1943d8ba9507 ("i40e/i40evf: enable hardware feature head write back") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
`netif_rx()` already increments `rx_dropped` core stat when it fails.
The driver was also updating `ndev->stats.rx_dropped` in the same path.
Since both are reported together via `ip -s -s` command, this resulted
in drops being counted twice in user-visible stats.
Keep the driver update on `if (unlikely(!skb))`, but skip it after
`netif_rx()` errors.
Fixes: caf586e5f23c ("net: add a core netdev->rx_dropped counter") Signed-off-by: Yeounsu Moon <yyyynoom@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250913060135.35282-3-yyyynoom@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In the protection override dump path, the firmware can return far too
many GRC elements, resulting in attempting to write past the end of the
previously-kmalloc'ed dump buffer.
This will result in a kernel panic with reason:
BUG: unable to handle kernel paging request at ADDRESS
where "ADDRESS" is just past the end of the protection override dump
buffer. The start address of the buffer is:
p_hwfn->cdev->dbg_features[DBG_FEATURE_PROTECTION_OVERRIDE].dump_buf
and the size of the buffer is buf_size in the same data structure.
The panic can be arrived at from either the qede Ethernet driver path:
When register_virtio_device() fails in virtio_uml_probe(),
the code sets vu_dev->registered = 1 even though
the device was not successfully registered.
This can lead to use-after-free or other issues.
Fixes: 04e5b1fb0183 ("um: virtio: Remove device on disconnect") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
CPU0 CPU1
mount perf_event umount net_prio
cgroup1_get_tree cgroup_kill_sb
rebind_subsystems // root destruction enqueues
// cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
css_free_rwork_fn
cgroup_destroy_root
cgroup_lock_and_drain_offline
// some perf descendants are dying
// cgroup_destroy_wq max_active = 1
// waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
As described in the added code comment, a reference to .exit.text is ok
for drivers registered via platform_driver_probe(). Make this explicit
to prevent the following section mismatch warning
The variable ret is declared as a u32 type, but it is assigned a value
of -EOPNOTSUPP. Since unsigned types cannot correctly represent negative
values, the type of ret should be changed to int.
The ALSA HwDep character device of the firewire-motu driver incorrectly
returns EPOLLOUT in poll(2), even though the driver implements no operation
for write(2). This misleads userspace applications to believe write() is
allowed, potentially resulting in unnecessarily wakeups.
This issue dates back to the driver's initial code added by a commit 71c3797779d3 ("ALSA: firewire-motu: add hwdep interface"), and persisted
when POLLOUT was updated to EPOLLOUT by a commit a9a08845e9ac ('vfs: do
bulk POLL* -> EPOLL* replacement("").').
The root cause is that unpoison_memory() tries to check the PG_HWPoison
flags of an uninitialized page. So VM_BUG_ON_PAGE(PagePoisoned(page)) is
triggered. This can be reproduced by below steps:
This scenario can be identified by pfn_to_online_page() returning NULL.
And ZONE_DEVICE pages are never expected, so we can simply fail if
pfn_to_online_page() == NULL to fix the bug.
Link: https://lkml.kernel.org/r/20250828024618.1744895-1-linmiaohe@huawei.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Adjust context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
for_each_set_bit() expects size to be in bits, not bytes. The abox mask
iteration uses bytes, but it works by coincidence, because the local
variable holding the mask is unsigned long, and the mask only ever has
bit 2 as the highest bit. Using a smaller type could lead to subtle and
very hard to track bugs.
Fixes: 62afef2811e4 ("drm/i915/rkl: RKL uses ABOX0 for pixel transfers") Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: stable@vger.kernel.org # v5.9+ Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250905104149.1144751-1-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit 7ea3baa6efe4bb93d11e1c0e6528b1468d7debf6) Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
[ adapted struct intel_display *display parameters to struct drm_i915_private *dev_priv ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Firmware that doesn't provide section headers leave both e_shentsize and
e_shnum 0, which obvious isn't compatible with the newly introduced
stricter checks.
Make the section-related checks conditional on either of these values
being non-zero.
Fixes: 9f9967fed9d0 ("soc: qcom: mdt_loader: Ensure we don't read past the ELF header") Reported-by: Val Packett <val@packett.cool> Closes: https://lore.kernel.org/all/ece307c3-7d65-440f-babd-88cf9705b908@packett.cool/ Reported-by: Neil Armstrong <neil.armstrong@linaro.org> Closes: https://lore.kernel.org/all/aec9cd03-6fc2-4dc8-b937-8b7cf7bf4128@linaro.org/ Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com> Fixes: 9f35ab0e53cc ("soc: qcom: mdt_loader: Fix error return values in mdt_header_valid()") Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250730-mdt-loader-shentsize-zero-v1-1-04f43186229c@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org> Cc: Yongqin Liu <yongqin.liu@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we don't have a clock specified in the device tree, we have no way to
ensure the BAM is on. This is often the case for remotely-controlled or
remotely-powered BAM instances. In this case, we need to read num-channels
from the DT to have all the necessary information to complete probing.
However, at the moment invalid device trees without clock and without
num-channels still continue probing, because the error handling is missing
return statements. The driver will then later try to read the number of
channels from the registers. This is unsafe, because it relies on boot
firmware and lucky timing to succeed. Unfortunately, the lack of proper
error handling here has been abused for several Qualcomm SoCs upstream,
causing early boot crashes in several situations [1, 2].
Avoid these early crashes by erroring out when any of the required DT
properties are missing. Note that this will break some of the existing DTs
upstream (mainly BAM instances related to the crypto engine). However,
clearly these DTs have never been tested properly, since the error in the
kernel log was just ignored. It's safer to disable the crypto engine for
these broken DTBs.
Fix a critical memory allocation bug in edma_setup_from_hw() where
queue_priority_map was allocated with insufficient memory. The code
declared queue_priority_map as s8 (*)[2] (pointer to array of 2 s8),
but allocated memory using sizeof(s8) instead of the correct size.
This caused out-of-bounds memory writes when accessing:
queue_priority_map[i][0] = i;
queue_priority_map[i][1] = i;
The bug manifested as kernel crashes with "Oops - undefined instruction"
on ARM platforms (BeagleBoard-X15) during EDMA driver probe, as the
memory corruption triggered kernel hardening features on Clang.
Change the allocation to use sizeof(*queue_priority_map) which
automatically gets the correct size for the 2D array structure.
Fixes: 2b6b3b742019 ("ARM/dmaengine: edma: Merge the two drivers under drivers/dma/") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Link: https://lore.kernel.org/r/20250830094953.3038012-1-anders.roxell@linaro.org Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since j1939_sk_bind() and j1939_sk_release() call j1939_local_ecu_put()
when J1939_SOCK_BOUND was already set, but the error handling path for
j1939_sk_bind() will not set J1939_SOCK_BOUND when j1939_local_ecu_get()
fails, j1939_local_ecu_get() needs to undo priv->ents[sa].nusers++ when
j1939_local_ecu_get() returns an error.
Commit 25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct
callback") expects that a call to j1939_priv_put() can be unconditionally
delayed until j1939_sk_sock_destruct() is called. But a refcount leak will
happen when j1939_sk_bind() is called again after j1939_local_ecu_get()
from previous j1939_sk_bind() call returned an error. We need to call
j1939_priv_put() before j1939_sk_bind() returns an error.
If request_irq() in i40e_vsi_request_irq_msix() fails in an iteration
later than the first, the error path wants to free the IRQs requested
so far. However, it uses the wrong dev_id argument for free_irq(), so
it does not free the IRQs correctly and instead triggers the warning:
Use the same dev_id for free_irq() as for request_irq().
I tested this with inserting code to fail intentionally.
Fixes: 493fb30011b3 ("i40e: Move q_vectors from pointer to array to array of pointers") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The driver uses irq_set_affinity_hint() for two purposes:
- To set the affinity_hint which is consumed by the userspace for
distributing the interrupts
- To apply an affinity that it provides for the i40e interrupts
The latter is done to ensure that all the interrupts are evenly spread
across all available CPUs. However, since commit a0c9259dc4e1 ("irq/matrix:
Spread interrupts on allocation") the spreading of interrupts is
dynamically performed at the time of allocation. Hence, there is no need
for the drivers to enforce their own affinity for the spreading of
interrupts.
Also, irq_set_affinity_hint() applying the provided cpumask as an affinity
for the interrupt is an undocumented side effect. To remove this side
effect irq_set_affinity_hint() has been marked as deprecated and new
interfaces have been introduced. Hence, replace the irq_set_affinity_hint()
with the new interface irq_update_affinity_hint() that only sets the
pointer for the affinity_hint.
The discussion about removing the side effect of irq_set_affinity_hint() of
actually applying the cpumask (if not NULL) as affinity to the interrupt,
unearthed a few unpleasantries:
1) The modular perf drivers rely on the current behaviour for the very
wrong reasons.
2) While none of the other drivers prevents user space from changing
the affinity, a cursorily inspection shows that there are at least
expectations in some drivers.
#1 needs to be cleaned up anyway, so that's not a problem
#2 might result in subtle regressions especially when irqbalanced (which
nowadays ignores the affinity hint) is disabled.
Provide new interfaces:
irq_update_affinity_hint() - Only sets the affinity hint pointer
irq_set_affinity_and_hint() - Set the pointer and apply the affinity to
the interrupt
Make irq_set_affinity_hint() a wrapper around irq_apply_affinity_hint() and
document it to be phased out.
Add a function to allow the affinity of an interrupt be switched to
managed, such that interrupts allocated for platform devices may be
managed.
This new interface has certain limitations, and attempts to use it in the
following circumstances will fail:
- For when the kernel is configured for generic IRQ reservation mode (in
config GENERIC_IRQ_RESERVATION_MODE). The reason being that it could
conflict with managed vs. non-managed interrupt accounting.
- The interrupt is already started, which should not be the case during
init
- The interrupt is already configured as managed, which means double init
The igb driver incorrectly skips the link test when the network
interface is admin down (if_running == false), causing the test to
always report PASS regardless of the actual physical link state.
This behavior is inconsistent with other drivers (e.g. i40e, ice, ixgbe,
etc.) which correctly test the physical link state regardless of admin
state.
Remove the if_running check to ensure link test always reflects the
physical link state.
Fixes: 8d420a1b3ea6 ("igb: correct link test not being run when link is down") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If a GSO skb is sent through a Geneve tunnel and if Geneve options are
added, the split GSO skb might not fit in the MTU anymore and an ICMP
frag needed packet can be generated. In such case the ICMP packet might
go through the segmentation logic (and dropped) later if it reaches a
path were the GSO status is checked and segmentation is required.
This is especially true when an OvS bridge is used with a Geneve tunnel
attached to it. The following set of actions could lead to the ICMP
packet being wrongfully segmented:
1. An skb is constructed by the TCP layer (e.g. gso_type SKB_GSO_TCPV4,
segs >= 2).
2. The skb hits the OvS bridge where Geneve options are added by an OvS
action before being sent through the tunnel.
3. When the skb is xmited in the tunnel, the split skb does not fit
anymore in the MTU and iptunnel_pmtud_build_icmp is called to
generate an ICMP fragmentation needed packet. This is done by reusing
the original (GSO!) skb. The GSO metadata is not cleared.
4. The ICMP packet being sent back hits the OvS bridge again and because
skb_is_gso returns true, it goes through queue_gso_packets...
5. ...where __skb_gso_segment is called. The skb is then dropped.
6. Note that in the above example on re-transmission the skb won't be a
GSO one as it would be segmented (len > MSS) and the ICMP packet
should go through.
Fix this by resetting the GSO information before reusing an skb in
iptunnel_pmtud_build_icmp and iptunnel_pmtud_build_icmpv6.
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Reported-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Link: https://patch.msgid.link/20250904125351.159740-1-atenart@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Having setup time 0 violates tAR, tCLR of some chips, for instance
TOSHIBA TC58NVG2S3ETAI0 cannot be detected successfully (first ID byte
being read duplicated, i.e. 98 98 dc 90 15 76 14 03 instead of
98 dc 90 15 76 ...).
Atmel Application Notes postulated 1 cycle NRD_SETUP without explanation
[1], but it looks more appropriate to just calculate setup time properly.
Looks like a copy'n'paste mistake introduced when initially adding the
dynamic timings feature with commit f9ce2eddf176 ("mtd: nand: atmel: Add
->setup_data_interface() hooks"). The context around this and
especially the code itself suggests 'read' is meant instead of write.
Commit 8ee53820edfd ("thp: mmu_notifier_test_young") introduced
mmu_notifier_test_young(), but we are passing the wrong address.
In xxx_scan_pmd(), the actual iteration address is "_address" not
"address". We seem to misuse the variable on the very beginning.
Change it to the right one.
[akpm@linux-foundation.org fix whitespace, per everyone] Link: https://lkml.kernel.org/r/20250822063318.11644-1-richard.weiyang@gmail.com Fixes: 8ee53820edfd ("thp: mmu_notifier_test_young") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The FUSE protocol uses struct fuse_write_out to convey the return value of
copy_file_range, which is restricted to uint32_t. But the COPY_FILE_RANGE
interface supports a 64-bit size copies.
Currently the number of bytes copied is silently truncated to 32-bit, which
may result in poor performance or even failure to copy in case of
truncation to zero.
In case OOB write is requested during a data write, ECC is currently
lost. Avoid this issue by only writing in the free spare area.
This issue has been seen with a YAFFS2 file system.
ocfs2_fiemap() takes a read lock of the ip_alloc_sem semaphore (since v2.6.22-527-g7307de80510a) and calls fiemap_fill_next_extent() to read the
extent list of this running mmap executable. The user supplied buffer to
hold the fiemap information page faults calling ocfs2_page_mkwrite() which
will take a write lock (since v2.6.27-38-g00dc417fa3e7) of the same
semaphore. This recursive semaphore will hold filesystem locks and causes
a hang of the fileystem.
The ip_alloc_sem protects the inode extent list and size. Release the
read semphore before calling fiemap_fill_next_extent() in ocfs2_fiemap()
and ocfs2_fiemap_inline(). This does an unnecessary semaphore lock/unlock
on the last extent but simplifies the error path.
Link: https://lkml.kernel.org/r/61d1a62b-2631-4f12-81e2-cd689914360b@oracle.com Fixes: 00dc417fa3e7 ("ocfs2: fiemap support") Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Reported-by: syzbot+541dcc6ee768f77103e7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=541dcc6ee768f77103e7 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clang 22 recently added support for defining __SANITIZE__ macros similar
to GCC [1], which causes warnings (or errors with CONFIG_WERROR=y or W=e)
with the existing defines that the kernel creates to emulate this behavior
with existing clang versions.
In file included from <built-in>:3:
In file included from include/linux/compiler_types.h:171:
include/linux/compiler-clang.h:37:9: error: '__SANITIZE_THREAD__' macro redefined [-Werror,-Wmacro-redefined]
37 | #define __SANITIZE_THREAD__
| ^
<built-in>:352:9: note: previous definition is here
352 | #define __SANITIZE_THREAD__ 1
| ^
Refactor compiler-clang.h to only define the sanitizer macros when they
are undefined and adjust the rest of the code to use these macros for
checking if the sanitizers are enabled, clearing up the warnings and
allowing the kernel to easily drop these defines when the minimum
supported version of LLVM for building the kernel becomes 22.0.0 or newer.
dma_free_coherent() must only be called if the corresponding
dma_alloc_coherent() call has succeeded. Calling it when the allocation fails
leads to undefined behavior.
When the check_[op]_overflow() helpers were introduced, all arguments
were required to be the same type to make the fallback macros simpler.
However, now that the fallback macros have been removed[1], it is fine
to allow mixed types, which makes using the helpers much more useful,
as they can be used to test for type-based overflows (e.g. adding two
large ints but storing into a u8), as would be handy in the drm core[2].
Remove the restriction, and add additional self-tests that exercise
some of the mixed-type overflow cases, and double-check for accidental
macro side-effects.
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: linux-hardening@vger.kernel.org Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Reviewed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Tested-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org>
[ dropped the test portion of the commit as that doesn't apply to
5.15.y - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. Load a sk_msg prog that calls bpf_msg_cork_bytes(msg, cork_bytes)
2. Attach the prog to a SOCKMAP
3. Add a socket to the SOCKMAP
4. Activate fault injection
5. Send data less than cork_bytes
At 5., the data is carried over to the next sendmsg() as it is
smaller than the cork_bytes specified by bpf_msg_cork_bytes().
Then, tcp_bpf_send_verdict() tries to allocate psock->cork to hold
the data, but this fails silently due to fault injection + __GFP_NOWARN.
If the allocation fails, we need to revert the sk->sk_forward_alloc
change done by sk_msg_alloc().
Let's call sk_msg_free() when tcp_bpf_send_verdict fails to allocate
psock->cork.
The "*copied" also needs to be updated such that a proper error can
be returned to the caller, sendmsg. It fails to allocate psock->cork.
Nothing has been corked so far, so this patch simply sets "*copied"
to 0.
Typo in ff_lseg_match_mirrors makes the diff ineffective. This results
in merge happening all the time. Merge happening all the time is
problematic because it marks lsegs invalid. Marking lsegs invalid
causes all outstanding IO to get restarted with EAGAIN and connections
to get closed.
Closing connections constantly triggers race conditions in the RDMA
implementation...
Fixes: 660d1eb22301c ("pNFS/flexfile: Don't merge layout segments if the mirrors don't match") Signed-off-by: Jonathan Curley <jcurley@purestorage.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Both tracing_mark_write and tracing_mark_raw_write call
__copy_from_user_inatomic during preempt_disable. But in some case,
__copy_from_user_inatomic may trigger page fault, and will call schedule()
subtly. And if a task is migrated to other cpu, the following warning will
be trigger:
if (RB_WARN_ON(cpu_buffer,
!local_read(&cpu_buffer->committing)))
An example can illustrate this issue:
process flow CPU
---------------------------------------------------------------------
Don't clear the capabilities that are not going to get reset by the call
to _nfs4_server_capabilities().
Reported-by: Scott Haiden <scott.b.haiden@gmail.com> Fixes: b01f21cacde9 ("NFS: Fix the setting of capabilities when automounting a new filesystem") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Recent commit f06bedfa62d5 ("pNFS/flexfiles: don't attempt pnfs on fatal DS
errors") has changed the error return type of ff_layout_choose_ds_for_read() from
NULL to an error pointer. However, not all code paths have been updated
to match the change. Thus, some non-NULL checks will accept error pointers
as a valid return value.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: f06bedfa62d5 ("pNFS/flexfiles: don't attempt pnfs on fatal DS errors") Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When I ran the repro [0] and waited a few seconds, I observed two
LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1]
Reproduction Steps:
1) Mount CIFS
2) Add an iptables rule to drop incoming FIN packets for CIFS
3) Unmount CIFS
4) Unload the CIFS module
5) Remove the iptables rule
At step 3), the CIFS module calls sock_release() for the underlying
TCP socket, and it returns quickly. However, the socket remains in
FIN_WAIT_1 because incoming FIN packets are dropped.
At this point, the module's refcnt is 0 while the socket is still
alive, so the following rmmod command succeeds.
# ss -tan
State Recv-Q Send-Q Local Address:Port Peer Address:Port
FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445
This highlights a discrepancy between the lifetime of the CIFS module
and the underlying TCP socket. Even after CIFS calls sock_release()
and it returns, the TCP socket does not die immediately in order to
close the connection gracefully.
While this is generally fine, it causes an issue with LOCKDEP because
CIFS assigns a different lock class to the TCP socket's sk->sk_lock
using sock_lock_init_class_and_name().
Once an incoming packet is processed for the socket or a timer fires,
sk->sk_lock is acquired.
Then, LOCKDEP checks the lock context in check_wait_context(), where
hlock_class() is called to retrieve the lock class. However, since
the module has already been unloaded, hlock_class() logs a warning
and returns NULL, triggering the null-ptr-deref.
If LOCKDEP is enabled, we must ensure that a module calling
sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded
while such a socket is still alive to prevent this issue.
Let's hold the module reference in sock_lock_init_class_and_name()
and release it when the socket is freed in sk_prot_free().
Note that sock_lock_init() clears sk->sk_owner for svc_create_socket()
that calls sock_lock_init_class_and_name() for a listening socket,
which clones a socket by sk_clone_lock() without GFP_ZERO.
The driver defines IMX214_DEFAULT_LINK_FREQ 480000000, and then
IMX214_DEFAULT_PIXEL_RATE ((IMX214_DEFAULT_LINK_FREQ * 8LL) / 10),
which works out as 384MPix/s. (The 8 is 4 lanes and DDR.)
Parsing the PLL registers with the defined 24MHz input. We're in single
PLL mode, so MIPI frequency is directly linked to pixel rate. VTCK ends
up being 1200MHz, and VTPXCK and OPPXCK both are 120MHz. Section 5.3
"Frame rate calculation formula" says "Pixel rate
[pixels/s] = VTPXCK [MHz] * 4", so 120 * 4 = 480MPix/s, which basically
agrees with my number above.
3.1.4. MIPI global timing setting says "Output bitrate = OPPXCK * reg
0x113[7:0]", so 120MHz * 10, or 1200Mbit/s. That would be a link
frequency of 600MHz due to DDR.
That also matches to 480MPix/s * 10bpp / 4 lanes / 2 for DDR.
Keep the previous link frequency for backward compatibility.
Acked-by: Ricardo Ribalda <ribalda@chromium.org> Signed-off-by: André Apitzsch <git@apitzsch.eu> Fixes: 436190596241 ("media: imx214: Add imx214 camera sensor driver") Cc: stable@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
[ changed dev_err() to dev_err_probe() for the final error case ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A flush of the MPTCP endpoints should not affect the MPTCP limits. In
other words, 'ip mptcp endpoint flush' should not change 'ip mptcp
limits'.
But it was the case: the MPTCP_PM_ATTR_RCV_ADD_ADDRS (add_addr_accepted)
limit was reset by accident. Removing the reset of this counter during a
flush fixes this issue.
Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Cc: stable@vger.kernel.org Reported-by: Thomas Dreibholz <dreibh@simula.no> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/579 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250815-net-mptcp-misc-fixes-6-17-rc2-v1-2-521fe9957892@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[adjusted patch by removing WRITE_ONCE to take into account the missing
commit 72603d207d59 ("mptcp: use WRITE_ONCE for the pernet *_max")] Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cross-thread attacks are generally harder as they require the victim to be
co-located on a core. However, with VMSCAPE the adversary targets belong to
the same guest execution, that are more likely to get co-located. In
particular, a thread that is currently executing userspace hypervisor
(after the IBPB) may still be targeted by a guest execution from a sibling
thread.
Issue a warning about the potential risk, except when:
- SMT is disabled
- STIBP is enabled system-wide
- Intel eIBRS is enabled (which implies STIBP protection)
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Amit Shah <amit.shah@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cpu_bugs_smt_update() uses global variables from different mitigations. For
SMT updates it can't currently use vmscape_mitigation that is defined after
it.
Since cpu_bugs_smt_update() depends on many other mitigations, move it
after all mitigations are defined. With that, it can use vmscape_mitigation
in a moment.
No functional change.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Amit Shah <amit.shah@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
VMSCAPE is a vulnerability that exploits insufficient branch predictor
isolation between a guest and a userspace hypervisor (like QEMU). Existing
mitigations already protect kernel/KVM from a malicious guest. Userspace
can additionally be protected by flushing the branch predictors after a
VMexit.
Since it is the userspace that consumes the poisoned branch predictors,
conditionally issue an IBPB after a VMexit and before returning to
userspace. Workloads that frequently switch between hypervisor and
userspace will incur the most overhead from the new IBPB.
This new IBPB is not integrated with the existing IBPB sites. For
instance, a task can use the existing speculation control prctl() to
get an IBPB at context switch time. With this implementation, the
IBPB is doubled up: one at context switch and another before running
userspace.
The intent is to integrate and optimize these cases post-embargo.
[ dhansen: elaborate on suboptimal IBPB solution ]
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com>
[Amit: Use CPUID word 22 instead of 21 for the 5.10 backport] Signed-off-by: Amit Shah <amit.shah@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The VMSCAPE vulnerability may allow a guest to cause Branch Target
Injection (BTI) in userspace hypervisors.
Kernels (both host and guest) have existing defenses against direct BTI
attacks from guests. There are also inter-process BTI mitigations which
prevent processes from attacking each other. However, the threat in this
case is to a userspace hypervisor within the same process as the attacker.
Userspace hypervisors have access to their own sensitive data like disk
encryption keys and also typically have access to all guest data. This
means guest userspace may use the hypervisor as a confused deputy to attack
sensitive guest kernel data. There are no existing mitigations for these
attacks.
Introduce X86_BUG_VMSCAPE for this vulnerability and set it on affected
Intel and AMD CPUs.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
[Amit:
* Drop unsupported Intel families: ARROWLAKE, LUNARLAKE, METEORLAKE,
GRANITERAPIDS_X, EMERALDRAPIDS_X, ATOM_CRESTMONT_X; and unlisted ATOM
types for RAPTORLAKE and ALDERLAKE
* s/ATOM_GRACEMONT/ALDERLAKE_N/
* Drop unsupported AMD family: 0x1a] Signed-off-by: Amit Shah <amit.shah@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The echo_interval is not limited in any way during mounting,
which makes it possible to write a large number to it. This can
cause an overflow when multiplying ctx->echo_interval by HZ in
match_server().
Add constraints for echo_interval to smb3_fs_context_parse_param().
Found by Linux Verification Center (linuxtesting.org) with Svace.
Fixes: adfeb3e00e8e1 ("cifs: Make echo interval tunable") Cc: stable@vger.kernel.org Signed-off-by: Roman Smirnov <r.smirnov@omp.ru> Signed-off-by: Steve French <stfrench@microsoft.com>
[ Adapted to older CIFS filesystem structure and mount option parsing ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Enable the retain_ff_enable bit of GDSCR only if the GDSC is already ON.
Once the GDSCR moves to HW control, SW no longer can determine the state
of the GDSCR and setting the retain_ff bit could destroy all the register
contents we intended to save.
Therefore, move the retain_ff configuration before switching the GDSC to
HW trigger mode.
Cc: stable@vger.kernel.org Fixes: 173722995cdb ("clk: qcom: gdsc: Add support to enable retention of GSDCR") Signed-off-by: Taniya Das <quic_tdas@quicinc.com> Reviewed-by: Imran Shaik <quic_imrashai@quicinc.com> Tested-by: Imran Shaik <quic_imrashai@quicinc.com> # on QCS8300 Link: https://lore.kernel.org/r/20250214-gdsc_fixes-v1-1-73e56d68a80f@quicinc.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
[ Changed error path ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In DMA mode fsl_lpspi_reset() is always called at the end, even when the
transfer is aborted. In PIO mode aborts skip the reset leaving the FIFO
filled and the module enabled.
Fix it by always calling fsl_lpspi_reset().
Fixes: a15dc3d657fa ("spi: lpspi: Fix CLK pin becomes low before one transfer") Signed-off-by: Larisa Grigore <larisa.grigore@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: James Clark <james.clark@linaro.org> Link: https://patch.msgid.link/20250828-james-nxp-lpspi-v2-3-6262b9aa9be4@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The driver currently supports multiple chip-selects, but only sets the
polarity for the first one (CS 0). Fix it by setting the PCSPOL bit for
the desired chip-select.
Fixes: 5314987de5e5 ("spi: imx: add lpspi bus driver") Signed-off-by: Larisa Grigore <larisa.grigore@nxp.com> Signed-off-by: James Clark <james.clark@linaro.org> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20250828-james-nxp-lpspi-v2-2-6262b9aa9be4@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 6a130448498c ("spi: lpspi: Fix wrong transmission when don't use
CONT") breaks transmissions when CONT is used. The TDIE interrupt should
not be disabled in all cases. If CONT is used and the TX transfer is not
yet completed yet, but the interrupt handler is called because there are
characters to be received, TDIE is replaced with FCIE. When the transfer
is finally completed, SR_TDF is set but the interrupt handler isn't
called again.
Fixes: 6a130448498c ("spi: lpspi: Fix wrong transmission when don't use CONT") Signed-off-by: Larisa Grigore <larisa.grigore@nxp.com> Signed-off-by: James Clark <james.clark@linaro.org> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20250828-james-nxp-lpspi-v2-1-6262b9aa9be4@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In the do_validate_mem(), the call to add_interval() does not
handle errors. If kmalloc() fails in add_interval(), it could
result in a null pointer being inserted into the linked list,
leading to illegal memory access when sub_interval() is called
next.
This patch adds an error handling for the add_interval(). If
add_interval() returns an error, the function will return early
with the error code.
It was reported that HP EliteDesk 800 G4 DM 65W (SSID 103c:845a) needs
the similar quirk for enabling HDMI outputs, too. This patch adds the
corresponding quirk entry.
With lan88xx based devices the lan78xx driver can get stuck in an
interrupt loop while bringing the device up, flooding the kernel log
with messages like the following:
lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
Removing interrupt support from the lan88xx PHY driver forces the
driver to use polling instead, which avoids the problem.
The issue has been observed with Raspberry Pi devices at least since
4.14 (see [1], bug report for their downstream kernel), as well as
with Nvidia devices [2] in 2020, where disabling interrupts was the
vendor-suggested workaround (together with the claim that phylib
changes in 4.9 made the interrupt handling in lan78xx incompatible).
Iperf reports well over 900Mbits/sec per direction with client in
--dualtest mode, so there does not seem to be a significant impact on
throughput (lan88xx device connected via switch to the peer).
In preparation of removing the .ack_interrupt() callback, we must replace
its occurrences (aka phy_clear_interrupt), from the 2 places where it is
called from (phy_enable_interrupts and phy_disable_interrupts), with
equivalent functionality.
This means that clearing interrupts now becomes something that the PHY
driver is responsible of doing, before enabling interrupts and after
clearing them. Make this driver follow the new contract.
In an attempt to actually support shared IRQs in phylib, we now move the
responsibility of triggering the phylib state machine or just returning
IRQ_NONE, based on the IRQ status register, to the PHY driver. Having
3 different IRQ handling callbacks (.handle_interrupt(),
.did_interrupt() and .ack_interrupt() ) is confusing so let the PHY
driver implement directly an IRQ handler like any other device driver.
Make this driver follow the new convention.
Based on changes in the 2021 public version of the randstruct
out-of-tree GCC plugin[1], more carefully update the attributes on
resulting decls, to avoid tripping checks in GCC 15's
comptypes_check_enum_int() when it has been configured with
"--enable-checking=misc":
arch/arm64/kernel/kexec_image.c:132:14: internal compiler error: in comptypes_check_enum_int, at c/c-typeck.cc:1519
132 | const struct kexec_file_ops kexec_image_ops = {
| ^~~~~~~~~~~~~~
internal_error(char const*, ...), at gcc/gcc/diagnostic-global-context.cc:517
fancy_abort(char const*, int, char const*), at gcc/gcc/diagnostic.cc:1803
comptypes_check_enum_int(tree_node*, tree_node*, bool*), at gcc/gcc/c/c-typeck.cc:1519
...
When building the randomized replacement tree of struct members, the
randstruct GCC plugin would insert, as the first member, a 0-sized void
member. This appears as though it was done to catch non-designated
("unnamed") static initializers, which wouldn't be stable since they
depend on the original struct layout order.
This was accomplished by having the side-effect of the "void member"
tripping an assert in GCC internals (count_type_elements) if the member
list ever needed to be counted (e.g. for figuring out the order of members
during a non-designated initialization), which would catch impossible type
(void) in the struct:
security/landlock/fs.c: In function ‘hook_file_ioctl_common’:
security/landlock/fs.c:1745:61: internal compiler error: in count_type_elements, at expr.cc:7075
1745 | .u.op = &(struct lsm_ioctlop_audit) {
| ^
However this is a redundant safety measure since randstruct uses the
__designated_initializer attribute both internally and within the
__randomized_layout attribute macro so that this would be enforced
by the compiler directly even when randstruct was not enabled (via
-Wdesignated-init).
A recent change in Landlock ended up tripping the same member counting
routine when using a full-struct copy initializer as part of an anonymous
initializer. This, however, is a false positive as the initializer is
copying between identical structs (and hence identical layouts). The
"path" member is "struct path", a randomized struct, and is being copied
to from another "struct path", the "f_path" member:
As can be seen with the coming randstruct KUnit test, there appears to
be no behavioral problems with this kind of initialization when the void
member is removed from the randstruct GCC plugin, so remove it.
Reported-by: "Dr. David Alan Gilbert" <linux@treblig.org> Closes: https://lore.kernel.org/lkml/Z_PRaKx7q70MKgCA@gallifrey/ Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/lkml/20250407-kbuild-disable-gcc-plugins-v1-1-5d46ae583f5e@kernel.org/ Reported-by: WangYuli <wangyuli@uniontech.com> Closes: https://lore.kernel.org/lkml/337D5D4887277B27+3c677db3-a8b9-47f0-93a4-7809355f1381@uniontech.com/ Fixes: 313dd1b62921 ("gcc-plugins: Add the randstruct plugin") Signed-off-by: Kees Cook <kees@kernel.org>
Stable-dep-of: f39f18f3c353 ("randstruct: gcc-plugin: Fix attribute addition") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The two alarm LEDs of on the uDPU board are stopped working since
commit 78efa53e715e ("leds: Init leds class earlier").
The LEDs are driven by the GPIO{15,16} pins of the North Bridge
GPIO controller. These pins are part of the 'spi_quad' pin group
for which the 'spi' function is selected via the default pinctrl
state of the 'spi' node. This is wrong however, since in order to
allow controlling the LEDs, the pins should use the 'gpio' function.
Before the commit mentined above, the 'spi' function is selected
first by the pinctrl core before probing the spi driver, but then
it gets overridden to 'gpio' implicitly via the
devm_gpiod_get_index_optional() call from the 'leds-gpio' driver.
After the commit, the LED subsystem gets initialized before the
SPI subsystem, so the function of the pin group remains 'spi'
which in turn prevents controlling of the LEDs.
Despite the change of the initialization order, the root cause is
that the pinctrl state definition is wrong since its initial commit 0d45062cfc89 ("arm64: dts: marvell: Add device tree for uDPU board"),
To fix the problem, override the function in the 'spi_quad_pins'
node to 'gpio' and move the pinctrl state definition from the
'spi' node into the 'leds' node.
Cc: stable@vger.kernel.org # needs adjustment for < 6.1 Fixes: 0d45062cfc89 ("arm64: dts: marvell: Add device tree for uDPU board") Signed-off-by: Gabor Juhos <j4g8y7@gmail.com> Signed-off-by: Imre Kaloz <kaloz@openwrt.org> Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
[ Applied to .dts instead of .dtsi ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, when device mtu is updated, vmxnet3 updates netdev mtu, quiesces
the device and then reactivates it for the ESXi to know about the new mtu.
So, technically the OS stack can start using the new mtu before ESXi knows
about the new mtu.
This can lead to issues for TSO packets which use mss as per the new mtu
configured. This patch fixes this issue by moving the mtu write after
device quiesce.
Cc: stable@vger.kernel.org Fixes: d1a890fa37f2 ("net: VMware virtual Ethernet NIC driver: vmxnet3") Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com>
Changes v1-> v2:
Moved MTU write after destroy of rx rings Link: https://patch.msgid.link/20250515190457.8597-1-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ no WRITE_ONCE() in older trees ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The pointer arithmentic for accessing the tail tag only works
for linear skbs.
For nonlinear skbs, it reads uninitialized memory inside the
skb headroom, essentially randomizing the tag. I have observed
it gets set to 6 most of the time.
Example where ksz9477_rcv thinks that the packet from port 1 comes from port 6
(which does not exist for the ksz9896 that's in use), dropping the packet.
Debug prints added by me (not included in this patch):
Call skb_linearize before trying to access the tag.
This patch fixes ksz9477_rcv which is used by the ksz9896 I have at
hand, and also applies the same fix to ksz8795_rcv which seems to have
the same problem.
Signed-off-by: Jakob Unterwurzacher <jakob.unterwurzacher@cherry.de> CC: stable@vger.kernel.org Fixes: 016e43a26bab ("net: dsa: ksz: Add KSZ8795 tag code") Fixes: 8b8010fb7876 ("dsa: add support for Microchip KSZ tail tagging") Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/20250515072920.2313014-1-jakob.unterwurzacher@cherry.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix a potential deadlock bug. Observe that in the mtk-cqdma.c
file, functions like mtk_cqdma_issue_pending() and
mtk_cqdma_free_active_desc() properly acquire the pc lock before the vc
lock when handling pc and vc fields. However, mtk_cqdma_tx_status()
violates this order by first acquiring the vc lock before invoking
mtk_cqdma_find_active_desc(), which subsequently takes the pc lock. This
reversed locking sequence (vc → pc) contradicts the established
pc → vc order and creates deadlock risks.
Fix the issue by moving the vc lock acquisition code from
mtk_cqdma_find_active_desc() to mtk_cqdma_tx_status(). Ensure the pc lock
is acquired before the vc lock in the calling function to maintain correct
locking hierarchy. Note that since mtk_cqdma_find_active_desc() is a
static function with only one caller (mtk_cqdma_tx_status()), this
modification safely eliminates the deadlock possibility without affecting
other components.
This possible bug is found by an experimental static analysis tool
developed by our team. This tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency bugs
including deadlocks, data races and atomicity violations.
If an input changes state during wake-up and is used as an interrupt
source, the IRQ handler reads the volatile input register to clear the
interrupt mask and deassert the IRQ line. However, the IRQ handler is
triggered before access to the register is granted, causing the read
operation to fail.
As a result, the IRQ handler enters a loop, repeatedly printing the
"failed reading register" message, until `pca953x_resume()` is eventually
called, which restores the driver context and enables access to
registers.
Fix by disabling the IRQ line before entering suspend mode, and
re-enabling it after the driver context is restored in `pca953x_resume()`.
An IRQ can be disabled with disable_irq() and still wake the system as
long as the IRQ has wake enabled, so the wake-up functionality is
preserved.
The threaded IRQ function in this driver is reading the flag twice: once to
lock a mutex and once to unlock it. Even though the code setting the flag
is designed to prevent it, there are subtle cases where the flag could be
true at the mutex_lock stage and false at the mutex_unlock stage. This
results in the mutex not being unlocked, resulting in a deadlock.
Fix it by making the opt3001_irq() code generally more robust, reading the
flag into a variable and using the variable value at both stages.
Follow the pattern of other drivers and use aligned_s64 for the
timestamp. This will ensure that the timestamp is correctly aligned on
all architectures.
Also move the unaligned.h header while touching this since it was the
only one not in alphabetical order.
Fixes: 13e945631c2f ("iio:chemical:pms7003: Fix timestamp alignment and prevent data leak.") Signed-off-by: David Lechner <dlechner@baylibre.com> Reviewed-by: Nuno Sá <nuno.sa@analog.com> Link: https://patch.msgid.link/20250417-iio-more-timestamp-alignment-v1-4-eafac1e22318@baylibre.com Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
[ linux/unaligned.h => asm/unaligned.h ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is running.
The only lock held when a producer is added/removed is irqbypass's mutex.
The handling of the limits_changed flag in struct sugov_policy needs to
be explicitly synchronized to ensure that cpufreq policy limits updates
will not be missed in some cases.
Without that synchronization it is theoretically possible that
the limits_changed update in sugov_should_update_freq() will be
reordered with respect to the reads of the policy limits in
cpufreq_driver_resolve_freq() and in that case, if the limits_changed
update in sugov_limits() clobbers the one in sugov_should_update_freq(),
the new policy limits may not take effect for a long time.
Likewise, the limits_changed update in sugov_limits() may theoretically
get reordered with respect to the updates of the policy limits in
cpufreq_set_policy() and if sugov_should_update_freq() runs between
them, the policy limits change may be missed.
To ensure that the above situations will not take place, add memory
barriers preventing the reordering in question from taking place and
add READ_ONCE() and WRITE_ONCE() annotations around all of the
limits_changed flag updates to prevent the compiler from messing up
with that code.
Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Cc: 5.3+ <stable@vger.kernel.org> # 5.3+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net
[ Adjust context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
object_err() reports details of an object for further debugging, such as
the freelist pointer, redzone, etc. However, if the pointer is invalid,
attempting to access object metadata can lead to a crash since it does
not point to a valid object.
One known path to the crash is when alloc_consistency_checks()
determines the pointer to the allocated object is invalid because of a
freelist corruption, and calls object_err() to report it. The debug code
should report and handle the corruption gracefully and not crash in the
process.
In case the pointer is NULL or check_valid_pointer() returns false for
the pointer, only print the pointer value and skip accessing metadata.
If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
it to be locked.
Page table traversal is allowed under any one of the mmap lock, the
anon_vma lock (if the VMA is associated with an anon_vma), and the
mapping lock (if the VMA is associated with a mapping); and so to be
able to remove page tables, we must hold all three of them.
retract_page_tables() bails out if an ->anon_vma is attached, but does
this check before holding the mmap lock (as the comment above the check
explains).
If we racily merged an existing ->anon_vma (shared with a child
process) from a neighboring VMA, subsequent rmap traversals on pages
belonging to the child will be able to see the page tables that we are
concurrently removing while assuming that nothing else can access them.
Repeat the ->anon_vma check once we hold the mmap lock to ensure that
there really is no concurrent page table access.
Hitting this bug causes a lockdep warning in collapse_and_free_pmd(),
in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)".
It can also lead to use-after-free access.
Link: https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230111133351.807024-1-jannh@google.com Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <jannh@google.com> Reported-by: Zach O'Keefe <zokeefe@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[doebel@amazon.de: Kernel 5.10 uses different control flow pattern,
context adjustments] Signed-off-by: Bjoern Doebel <doebel@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix a possible heap overflow in e1000_set_eeprom function by adding
input validation for the requested length of the change in the EEPROM.
In addition, change the variable type from int to size_t for better
code practices and rearrange declarations to RCT.
Cc: stable@vger.kernel.org Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver (currently for ICH9 devices only)") Co-developed-by: Mikael Wessel <post@mikaelkw.online> Signed-off-by: Mikael Wessel <post@mikaelkw.online> Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
batadv_nc_skb_decode_packet() trusts coded_len and checks only against
skb->len. XOR starts at sizeof(struct batadv_unicast_packet), reducing
payload headroom, and the source skb length is not verified, allowing an
out-of-bounds read and a small out-of-bounds write.
Validate that coded_len fits within the payload area of both destination
and source sk_buffs before XORing.
Fixes: 2df5278b0267 ("batman-adv: network coding - receive coded packets and decode them") Cc: stable@vger.kernel.org Reported-by: Stanislav Fort <disclosure@aisle.com> Signed-off-by: Stanislav Fort <stanislav.fort@aisle.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix a use-after-free window by correcting the buffer release sequence in
the deferred receive path. The code freed the RQ buffer first and only
then cleared the context pointer under the lock. Concurrent paths (e.g.,
ABTS and the repost path) also inspect and release the same pointer under
the lock, so the old order could lead to double-free/UAF.
Note that the repost path already uses the correct pattern: detach the
pointer under the lock, then free it after dropping the lock. The
deferred path should do the same.
Fixes: 472e146d1cf3 ("scsi: lpfc: Correct upcalling nvmet_fc transport during io done downcall") Cc: stable@vger.kernel.org Signed-off-by: John Evans <evans1210144@gmail.com> Link: https://lore.kernel.org/r/20250828044008.743-1-evans1210144@gmail.com Reviewed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The adapter->chan_stats[] array is initialized in
mwifiex_init_channel_scan_gap() with vmalloc(), which doesn't zero out
memory. The array is filled in mwifiex_update_chan_statistics()
and then the user can query the data in mwifiex_cfg80211_dump_survey().
There are two potential issues here. What if the user calls
mwifiex_cfg80211_dump_survey() before the data has been filled in.
Also the mwifiex_update_chan_statistics() function doesn't necessarily
initialize the whole array. Since the array was not initialized at
the start that could result in an information leak.
Also this array is pretty small. It's a maximum of 900 bytes so it's
more appropriate to use kcalloc() instead vmalloc().
Cc: stable@vger.kernel.org Fixes: bf35443314ac ("mwifiex: channel statistics support for mwifiex") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/20250815023055.477719-1-rongqianfeng@vivo.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In __iodyn_find_io_region(), pcmcia_make_resource() is assigned to
res and used in pci_bus_alloc_resource(). There is a dereference of res
in pci_bus_alloc_resource(), which could lead to a NULL pointer
dereference on failure of pcmcia_make_resource().
Fix this bug by adding a check of res.
Cc: stable@vger.kernel.org Fixes: 49b1153adfe1 ("pcmcia: move all pcmcia_resource_ops providers into one module") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>