When a Function Block declares itself as a legacy MIDI 1.0 device, it
must be bound to only a single UMP Group. Correct the attribute when a
device declares it wrongly.
Fixes: 37e0e14128e0 ("ALSA: ump: Support UMP Endpoint and Function Block parsing") Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20240722140610.10845-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a device tries to update the FB name string even though its Endpoint
is declared as static, we should skip it, just as is already done for the
FB info update reply.
Fixes: 37e0e14128e0 ("ALSA: ump: Support UMP Endpoint and Function Block parsing") Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20240722135929.8612-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 1d717123bb1a ("ALSA: firewire-lib: Avoid
-Wflex-array-member-not-at-end warning") used the DEFINE_FLEX() macro to
handle the variable-length header array in struct fw_iso_packet. The
macro has a side effect: its designated initializer assigns the array
count to the given counter field. Therefore CIP_HEADER_QUADLETS (= 2) is
assigned to struct fw_iso_packet.header_length, while the original
designated initializer assigned zero to all fields.
With the CIP_NO_HEADER flag, the change causes an invalid header length
in isochronous packets for the 1394 OHCI IT context. This bug affects all
devices supported by the ALSA fireface driver: RME Fireface 400, 800,
UCX, UFX, and 802.
This commit fixes the bug by replacing the macro with the alternative
version that performs no such initialization.
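As a rough illustration of the difference between the two macros from
<linux/overflow.h> (a sketch only; CIP_HEADER_QUADLETS is a driver-local
constant, redefined here for context):

#include <linux/overflow.h>
#include <linux/firewire.h>

#define CIP_HEADER_QUADLETS 2 /* driver-local constant, shown for context */

static void flex_macro_example(void)
{
	/* Buggy for CIP_NO_HEADER: the initializer also stores the array
	 * count into the counter field (header_length = 2). */
	DEFINE_FLEX(struct fw_iso_packet, p1, header, header_length,
		    CIP_HEADER_QUADLETS);

	/* Fixed: reserves the same storage but leaves every field zeroed. */
	DEFINE_RAW_FLEX(struct fw_iso_packet, p2, header,
			CIP_HEADER_QUADLETS);
}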
Adjust the quirks for 0x3a20, 0x3a30, and 0x3a50 to match the 0x3a60 one.
This set has now been confirmed to work with this patch.
Signed-off-by: Luke D. Jones <luke@ljones.dev> Fixes: 811dd426a9b1 ("ALSA: hda/realtek: Add quirks for Asus ROG 2024 laptops using CS35L41") Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20240723011224.115579-1-luke@ljones.dev Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is
pending delivery, in vmx_has_nested_events() and drop the one-off
kvm_x86_ops.guest_apic_has_interrupt() hook.
In addition to dropping a superfluous hook, this fixes a bug where KVM
would incorrectly treat virtual interrupts _for L2_ as always enabled due
to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(),
treating IRQs as enabled if L2 is active and vmcs12 is configured to exit
on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake
event based on L1's IRQ blocking status.
Check for pending (and notified!) posted interrupts when checking if L2
has a pending wake event, as a fully posted/notified virtual interrupt is
a valid wake event for HLT.
Note that KVM must check vmx->nested.pi_pending to avoid prematurely
waking L2, e.g. even if KVM sees a non-zero PID.PIR and PID.ON=1, the
virtual interrupt won't actually be recognized until a notification IRQ is
received by the vCPU or the vCPU does (nested) VM-Enter.
Fixes: 26844fee6ade ("KVM: x86: never write to memory from kvm_vcpu_check_block()") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/all/20231207010302.2240506-1-jmattson@google.com Link: https://lore.kernel.org/r/20240607172609.3205077-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When requesting an immediate exit from L2 in order to inject a pending
event, do so only if the pending event actually requires manual injection,
i.e. if and only if KVM actually needs to regain control in order to
deliver the event.
Avoiding the "immediate exit" isn't simply an optimization, it's necessary
to make forward progress, as the "already expired" VMX preemption timer
trick that KVM uses to force a VM-Exit has higher priority than events
that aren't directly injected.
At present, this is a glorified nop as all events processed by
vmx_has_nested_events() require injection, but that will not hold true in
the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI.
I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired
VMX preemption timer will trigger VM-Exit before the virtual interrupt is
delivered, and KVM will effectively hang the vCPU in an endless loop of
forced immediate VM-Exits (because the pending virtual interrupt never
goes away).
Add a helper to retrieve the highest pending vector given a Posted
Interrupt descriptor. While the actual operation is straightforward, it's
surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's
find_highest_vector(), which doesn't work with PID.PIR due to the APIC's
IRR and ISR component registers being physically discontiguous (they're
4-byte registers aligned at 16-byte intervals).
To make PIR handling more consistent with respect to IRR and ISR handling,
return -1 to indicate "no interrupt pending".
Move the non-VMX chunk of the "interrupt blocked" checks to a separate
helper so that KVM can reuse the code to detect if interrupts are blocked
for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake
event. If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2
HLTs, then L2 virtual interrupts are valid wake events, but if and only if
interrupts are unblocked for L2.
Add support for DPDES in the helper library for guest state buffers.
DPDES support is needed to enable the use of doorbells in an L2 KVM on
PAPR guest.
Fixes: 6ccbbc33f06a ("KVM: PPC: Add helper library for Guest State Buffers") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240605113913.83715-2-gautam@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Doorbell emulation is broken for KVM on PAPR guests because support for
DPDES was not added in the initial patch series. Due to this, a KVM on
PAPR guest with SMT > 1 cannot be booted with the XICS interrupt
controller, as doorbells are set up in the initial probe path when using
XICS (pSeries_smp_probe()).
The current DRAM size is insufficient for the HEVC feature, which
requires more memory for proper functionality. This change ensures the
feature has the necessary resources.
Signed-off-by: Jason Chen <Jason-ch.Chen@mediatek.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240703034409.698-1-Jason-ch.Chen@mediatek.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If privacy is already on prior to the ongoing command, the firmware
returns -1 to indicate the current privacy status; the ongoing command
is still executed properly by the firmware, so this is not an error.
This patch changes the behavior to report privacy-on directly via the
V4L2 privacy control instead of reporting an error.
Fixes: 29006e196a56 ("media: pci: intel: ivsc: Add CSI submodule") Cc: stable@vger.kernel.org # for 6.6 and later Reported-by: Hao Yao <hao.yao@intel.com> Signed-off-by: Wentong Wu <wentong.wu@intel.com> Tested-by: Jason Chen <jason.z.chen@intel.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The function uvc_video_clock_update() supports a single SOF overflow; in
other words, the maximum difference between the first and the last
timestamp can be 4096 ticks, i.e. 4.096 seconds.
This results in a maximum value for y2 of 0x12FBECA00, which overflows
32 bits.
y2 = (u32)ktime_to_ns(ktime_sub(last->host_time, first->host_time)) + y1;
Extend the size of y2 to u64 to support all its values.
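A minimal sketch of the fix, assuming the surrounding variable names from
uvc_video_clock_update():

/* y2 widened from u32 to u64: the host-time delta can reach ~4.096 s
 * in nanoseconds (0x12FBECA00 with y1 added), which overflows 32 bits. */
u64 y2 = ktime_to_ns(ktime_sub(last->host_time, first->host_time)) + y1;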
Commit 9f356e5a4f12 ("jbd2: Account descriptor blocks into
t_outstanding_credits") started to account descriptor blocks into the
transaction's outstanding credits. However, it didn't appropriately
decrease the maximum amount of credits available to userspace. Thus if
the filesystem requests a transaction smaller than
j_max_transaction_buffers but large enough that, once descriptor blocks
are added, the size exceeds j_max_transaction_buffers, we confuse
add_transaction_credits() into thinking previous handles have grown the
transaction too much, and enter an infinite journal commit loop in
start_this_handle() -> add_transaction_credits() trying to create a
transaction with enough credits available.
Fix the problem by properly accounting for transaction space reserved
for descriptor blocks when verifying requested transaction handle size.
Instead of computing the number of descriptor blocks a transaction can
have each time we need it (which is currently when starting each
transaction, but will become more frequent later), precompute the number
once during journal initialization, together with the maximum transaction
size. We perform the precomputation whenever the journal feature set is
updated, similarly to the computation of
journal->j_revoke_records_per_block.
CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240624170127.3253-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's no reason for jbd2_journal_get_max_txn_bufs() to be a public
function. Currently all users are internal and can use
journal->j_max_transaction_buffers instead. As a bonus, this saves some
unnecessary recomputations of the limit, which becomes important as this
function gets more complex in the following patch.
CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240624170127.3253-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a possibility that a privacy status change notification arrives
in the middle of an ongoing mei command which already holds the command
lock; v4l2_ctrl_s_ctrl() would also need the same lock prior to this
patch, so this could result in a circular locking problem. This patch
adds a dedicated lock for the v4l2 control handler to avoid the described
issue.
Fixes: 29006e196a56 ("media: pci: intel: ivsc: Add CSI submodule") Cc: stable@vger.kernel.org # for 6.6 and later Reported-by: Hao Yao <hao.yao@intel.com> Signed-off-by: Wentong Wu <wentong.wu@intel.com> Tested-by: Jason Chen <jason.z.chen@intel.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The race fixed in timer_trig_activate() between a blocking
set_brightness() call and trigger->activate() can affect any trigger.
So move the call to flush_work() into led_trigger_set() where it can
avoid the race for all triggers.
Fixes: 0db37915d912 ("leds: avoid races with workqueue") Fixes: 8c0f693c6eff ("leds: avoid flush_work in atomic context") Cc: stable@vger.kernel.org Tested-by: Dustin L. Howett <dustin@howett.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20240613-led-trigger-flush-v2-1-f4f970799d77@weissschuh.net Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__write_sb_page() rounds the io size up to the optimal io size if it
doesn't exceed the data offset, but it doesn't check whether the final
size exceeds the bitmap length.
For example:
- page count: 1
- page size: 4K
- data offset: 1M
- optimal io size: 256K
The final io size would be 256K (64 pages), but md_bitmap_storage_alloc()
allocated only 1 page. The IO would write 1 valid page and 63 pages that
happen to be allocated afterwards. This leaks memory to the raid device
superblock.
This issue caused a data transfer failure in nvme-tcp. The network driver
checks the first page of an IO with sendpage_ok(), which returns true if
the page isn't a slab page and its refcount >= 1. If the page is
!sendpage_ok(), the network driver disables MSG_SPLICE_PAGES.
As of now the network layer assumes all pages of the IO are
sendpage_ok() when MSG_SPLICE_PAGES is on.
The bitmap pages aren't slab pages, so the first page of the IO is
sendpage_ok(), but the additional pages that happen to be allocated after
the bitmap pages might be !sendpage_ok(). That causes
skb_splice_from_iter() to stop the data transfer; in the case below it
hangs 'mdadm --create'.
The bug is reproducible; in order to reproduce it we need nvme-over-tcp
controllers with an optimal IO size bigger than PAGE_SIZE. Creating a
raid with a bitmap over those devices reproduces the bug.
In order to simulate large optimal IO size you can use dm-stripe with a
single device.
Script to reproduce the issue on top of brd devices using dm-stripe is
attached below (will be added to blktest).
I have added some logs to test the theory:
...
md: created bitmap (1 pages) for device md127
__write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee
=== __write_sb_page before md_super_write. logging pages ===
pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap
pfn: 0x53ef, slab: 1
pfn: 0x53f0, slab: 0
pfn: 0x53f1, slab: 0
pfn: 0x53f2, slab: 0
pfn: 0x53f3, slab: 1
...
nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0
skbuff: before sendpage_ok() - pfn: 0x53ee
skbuff: before sendpage_ok() - pfn: 0x53ef
WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1
...
Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240607072748.3182199-1-ofir.gal@volumez.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ich7_lpc_probe() uses pci_read_config_dword(), which returns PCIBIOS_*
codes. The error handling code incorrectly assumes it's a normal errno
and checks for < 0. The return code is then returned from the probe
function as-is, but probe functions should return normal errnos.
Remove the < 0 check and convert the PCIBIOS_* return code into a normal
errno using pcibios_err_to_errno() before returning it.
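A minimal sketch of the pattern (register and variable names
illustrative):

/* Sketch: pci_read_config_dword() returns positive PCIBIOS_* codes, so
 * a "< 0" test never fires; translate to a normal errno instead. */
status = pci_read_config_dword(dev, reg, &value);
if (status)
	return pcibios_err_to_errno(status);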
Fixes: a328e95b82c1 ("leds: LED driver for Intel NAS SS4200 series (v5)") Cc: <stable@vger.kernel.org> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20240527132700.14260-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the return status of get_api_version is not checked, and
because of that we get the following build warning and smatch error on
the x86 arch:
CC drivers/soc/xilinx/zynqmp_power.o
drivers/soc/xilinx/zynqmp_power.c: In function 'zynqmp_pm_probe':
drivers/soc/xilinx/zynqmp_power.c:295:12: warning: 'pm_api_version' is
used uninitialized [-Wuninitialized]
295 | if (pm_api_version < ZYNQMP_PM_VERSION)
| ^
CHECK drivers/soc/xilinx/zynqmp_power.c
drivers/soc/xilinx/zynqmp_power.c:295 zynqmp_pm_probe() error:
uninitialized symbol 'pm_api_version'.
So, check the return status of pm_get_api_version and return an error in
case of failure, to avoid checking the uninitialized pm_api_version
variable.
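A minimal sketch of the fix in zynqmp_pm_probe() (hedged; the surrounding
probe code is simplified):

/* Sketch: bail out if the API version query fails so that
 * pm_api_version is never read uninitialized. */
ret = zynqmp_pm_get_api_version(&pm_api_version);
if (ret)
	return ret;

if (pm_api_version < ZYNQMP_PM_VERSION)
	return -ENODEV;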
Fixes: b9b3a8be28b3 ("firmware: xilinx: Remove eemi ops for get_api_version") Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240515112345.24673-1-jay.buddhabhatti@amd.com Signed-off-by: Michal Simek <michal.simek@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The code refactoring added new error paths between the np device node
allocation and the call to of_node_put(), which leads to memory leaks if
any of those errors occur.
Add the missing of_node_put() in the error paths that require it.
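A generic sketch of the pattern being fixed (the property and do_setup()
helper are illustrative, not the driver's actual names):

/* Sketch: every new error path between taking the node reference and
 * the final put needs its own of_node_put(). */
struct device_node *np = of_parse_phandle(dev->of_node, "example-prop", 0);
int ret;

if (!np)
	return -ENOENT;

ret = do_setup(np);		/* error path added by the refactoring */
if (ret) {
	of_node_put(np);	/* the previously missing reference drop */
	return ret;
}

of_node_put(np);
return 0;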
Cc: stable@vger.kernel.org Fixes: 57f2f8b4aa0c ("cpufreq: qcom: Refactor the driver to make it easier to extend") Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently the number of frames sent to the chip in a single USB Request
Block is limited only by the size of the TX buffer, which is 20 KiB.
Testing reveals that as many as 13 frames get aggregated. This is more
than what any of the chips would like to receive. RTL8822CU, RTL8822BU,
and RTL8821CU want at most 3 frames, and RTL8723DU wants only 1 frame
per URB.
RTL8723DU in particular reliably malfunctions during a speed test if it
receives more than 1 frame per URB. All traffic seems to stop. Pinging
the AP no longer works.
Fix this problem by limiting the number of frames sent to the chip in a
single URB according to what each chip likes.
Also configure RTL8822CU, RTL8822BU, and RTL8821CU to expect 3 frames
per URB.
RTL8703B may or may not be found in USB devices. Declare that it wants
only 1 frame per URB, just in case.
When there is beacon loss, for example due to unrelated Bluetooth devices
transmitting music nearby, the wifi connection dies soon after the first
beacon loss message.
When the beacon loss happens, mac80211 makes rtw88 transmit a QOS
NULL frame and asks to confirm the ACK status. Even though rtw88
confirms to mac80211 that the QOS NULL was transmitted successfully,
the connection still dies. This is because rtw88 is handing the QOS
NULL back to mac80211 with skb->data pointing to the headroom (the
TX descriptor) instead of ieee80211_hdr.
Fix the disconnection by moving skb->data to the correct position
before ieee80211_tx_status_irqsafe().
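A minimal sketch of the fix in the TX-report path (tx_desc_size is an
assumed variable name for the TX descriptor length):

/* Sketch: skb->data points at the headroom (the TX descriptor) when
 * the skb comes back from the TX path; advance it to the ieee80211_hdr
 * before reporting status. */
skb_pull(skb, tx_desc_size);
ieee80211_tx_status_irqsafe(hw, skb);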
The problem was observed with RTL8811AU (TP-Link Archer T2U Nano)
and the potential future rtw88_8821au driver. Also tested with
RTL8811CU (Tenda U9).
There is a length limit on the commands we send to the firmware, so
splitting into two commands is sometimes required when scanning.
When aborting a scan, we should not send the second scan command to the
firmware after the first one has finished. This could cause unexpected
errors when we cannot receive firmware events (e.g. in suspend).
Another case is a scan happening before suspending: ieee80211_do_stop()
is called to abort the scan, and the driver indicates scan completion via
ieee80211_scan_completed(), which queues an event to the scan work. But
the scan work might execute too late, after ieee80211_do_stop(). To
correct this, the driver now indicates ieee80211_scan_completed() before
returning, so that ieee80211_do_stop() can flush the scan work properly.
When changing the interface type we also need to update bss_num: the
driver private data is looked up based on a unique (bss_type, bss_num)
tuple, therefore every time bss_type changes, bss_num must change too.
This fixes, for example, an issue in which, after the mode changed, a
wireless scan on the changed interface would never finish, leading to
repeated -EBUSY messages to userspace when other scan requests were
sent.
Fixes: c606008b7062 ("mwifiex: Properly initialize private structure on interface type changes") Cc: stable@vger.kernel.org Signed-off-by: Rafael Beims <rafael.beims@toradex.com> Signed-off-by: Francesco Dolcini <francesco.dolcini@toradex.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240510110458.15475-1-francesco@dolcini.it Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The callers of request_irq() don't care about IRQF_ONESHOT because they
don't provide threaded handlers, but if they happen to share the IRQ with
the ACPI SCI, which has a threaded handler and sets IRQF_ONESHOT,
request_irq() will fail for them due to a flags mismatch.
Address this by making request_irq() add IRQF_COND_ONESHOT to the flags
passed to request_threaded_irq() for all of its callers.
Fixes: 7a36b901a6eb ("ACPI: OSL: Use a threaded interrupt handler for SCI") Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Stefan Seyfried <stefan.seyfried@googlemail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/5800834.DvuYhMxLoT@rjwysocki.net Closes: https://lore.kernel.org/lkml/205bd84a-fe8e-4963-968e-0763285f35ba@message-id.googlemail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
IORING_SETUP_IOPOLL rings don't have any netpoll handling, so let's fail
attempts to register netpolling in this case; there might be people who
mix up IOPOLL and netpoll.
io_uring_cancel_generic() should retry if any state changes, like a
request being completed; however, in case of a task exit it only goes for
another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT)
request got completed.
Let's assume we have a non-tracked request executing in io-wq and a
tracked request linked to it. Let's also assume
io_uring_cancel_generic() fails to find and cancel the request, i.e.
via io_run_local_work(), which may happen as io-wq has gaps.
Next, the request logically completes; io-wq still holds a ref but queues
it for completion via tw, which happens in
io_uring_try_cancel_requests(). After that, right before
prepare_to_wait(), io-wq puts the request, grabs the linked one and tries
to execute it, e.g. arms polling. Finally the cancellation loop calls
prepare_to_wait(); there is no tw to run and no tracked request was
completed, so the tctx_inflight() check passes and the task is put to
indefinite sleep.
There is a report that a getsockopt executed in io-wq never completes.
The reason is that io_uring_cmd_sock() can return a positive result, and
io_uring_cmd() propagates it back to core io_uring instead of IOU_OK. In
the case of io_wq_submit_work(), the request will then be dropped without
being completed.
The offending code was introduced by a hack in a9c3eda7eada9 ("io_uring:
fix submission-failure handling for uring-cmd"); however, it was fine
until getsockopt was introduced and started returning positive results.
The right solution is to always return IOU_OK: since e0b23d9953b0c
("io_uring: optimise ltimeout for inline execution") we should be able to
do that without problems. However, for the sake of backporting and
minimising side effects, keep returning negative return codes as before
and otherwise do IOU_OK.
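A minimal sketch of the resulting logic in io_uring_cmd() (hedged; the
handling of -EAGAIN/-EIOCBQUEUED is elided):

ret = file->f_op->uring_cmd(ioucmd, issue_flags);
/* Sketch: keep propagating nothing back to core io_uring; everything,
 * including positive getsockopt results, completes the request. */
if (ret < 0)
	req_set_fail(req);
io_req_set_res(req, ret, 0);
return IOU_OK;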
Syzbot constructed a directory that has no dirblock but is non-inline,
i.e. the first directory block is a hole, and no errors are reported when
creating files in this directory in the following flow.
ext4_mknod
...
ext4_add_entry
// Read block 0
ext4_read_dirblock(dir, block, DIRENT)
bh = ext4_bread(NULL, inode, block, 0)
if (!bh && (type == INDEX || type == DIRENT_HTREE))
// The first directory block is a hole
// But type == DIRENT, so no error is reported.
After that, we get a directory block without '.' and '..' but with a
valid dentry. This may cause code that relies on dot or dotdot (such as
make_indexed_dir()) to crash. Therefore, when ext4_read_dirblock() finds
that the first directory block is a hole, report that the filesystem is
corrupted and return an error, to avoid loading corrupted data from disk
and causing something bad.
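A rough sketch of the check in __ext4_read_dirblock() (hedged; the
upstream error reporting and hole handling may differ):

bh = ext4_bread(NULL, inode, block, 0);
if (!bh) {
	/* Sketch: a hole at block 0 means '.' and '..' are missing,
	 * so treat it as corruption instead of silently continuing. */
	if (type == INDEX || type == DIRENT_HTREE || block == 0) {
		ext4_error_inode(inode, func, line, block,
				 "Directory hole found");
		return ERR_PTR(-EFSCORRUPTED);
	}
	return NULL;	/* holes elsewhere keep the previous behaviour */
}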
Reported-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ae688d469e36fb5138d0 Fixes: 4e19d6b65fb4 ("ext4: allow directory holes") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240702132349.2600605-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The immediate cause of this problem is that there is only one valid
dentry in the block to be split during do_split(), so split == 0 results
in out-of-bounds accesses to the map, triggering the issue.
The maximum length of a filename is 255 and the minimum block size is 1024,
so it is always guaranteed that the number of entries is greater than or
equal to 2 when do_split() is called.
But syzbot's crafted image has no dot and dotdot in the dir, and the
dentry distribution in the dirblock is such that, when renaming dentry1
increases its name_len by 1, neither hole nor free is sufficient to hold
the new dentry, and make_indexed_dir() is called.
In make_indexed_dir() it is assumed that the first two entries of the
dirblock must be dot and dotdot, so bus and dentry1 are left in dx_root
because they are treated as dot and dotdot, and only dentry2 is moved
to the new leaf block. That's why count is equal to 1.
Therefore, add the ext4_check_dx_root() helper function to apply more
sanity checks to dot and dotdot before starting the conversion, to avoid
the above issue.
Reported-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ae688d469e36fb5138d0 Fixes: ac27a0ec112a ("[PATCH] ext4: initial copy of files from ext3") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240702132349.2600605-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The IO logical block size is a fundamental queue limit, and every IO has
to be aligned to the logical block size because our bio split code can't
deal with unaligned bios.
The check has to be done with the queue usage counter grabbed because
device reconfiguration may change the logical block size; we can prevent
the reconfiguration from happening by holding the queue usage counter.
logical_block_size stays in the first cache line of queue_limits, and
this cache line is always fetched in the fast path via
bio_may_exceed_limits(), so IO performance won't be affected by this
check.
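A minimal sketch of such an alignment check (hedged; the helper name and
call site are assumptions):

/* Sketch: reject bios whose size or start sector is not aligned to
 * the queue's logical block size. */
static inline bool bio_unaligned(struct bio *bio, struct request_queue *q)
{
	unsigned int bs = queue_logical_block_size(q);

	return (bio->bi_iter.bi_size & (bs - 1)) ||
	       ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1));
}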
Cc: Yi Zhang <yi.zhang@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240620030631.3114026-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On an Amiga 1200 equipped with a Warp1260 accelerator, an interrupt
storm coming from the accelerator board causes the machine to crash in
local_irq_enable() or auto_irq_enable(). Disabling interrupts for the
Warp1260 in amiga_parse_bootinfo() fixes the problem.
When the filesystem block bitmap is corrupted, we detect the corruption
while loading the bitmap and fail the allocation with an error. However,
the next allocation from the same bitmap will notice the bitmap buffer is
already loaded and try to allocate from the bitmap, with mixed results
(depending on the exact nature of the bitmap corruption). Fix the problem
by using the BH_verified bit to indicate whether the bitmap is valid or
not.
Reported-by: syzbot+5f682cd029581f9edfd1@syzkaller.appspotmail.com CC: stable@vger.kernel.org Link: https://patch.msgid.link/20240617154201.29512-2-jack@suse.cz Fixes: 1e0d4adf17e7 ("udf: Check consistency of Space Bitmap Descriptor") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Re-introduce task_work_cancel(), this time to cancel an actual callback
and not *any* callback pointing to a given function. This is going to be
needed for perf events event freeing.
A proper task_work_cancel() API that actually cancels a callback and not
*any* callback pointing to a given function is going to be needed for
perf events event freeing. Do the appropriate rename to prepare for
that.
By default we negotiate CIFS Unix Extensions for SMB1 mounts to Samba
(and they work if the user does not specify "unix" or "posix" or "linux"
on mount), and we properly handle the case where a user turns them off
with the "nounix" mount parm. But with the changes to the mount API we
broke cases where the user explicitly specifies the "unix" option (or
equivalently "linux" or "posix") on mount with vers=1.0 to Samba or other
servers which support the CIFS Unix Extensions. Such mounts failed with:
"mount error(95): Operation not supported"
and logged:
"CIFS: VFS: Check vers= mount option. SMB3.11 disabled but required for POSIX extensions"
even though CIFS Unix Extensions are supported for vers=1.0. This patch
fixes the case where the user specifies both "unix" (or equivalently
"posix" or "linux") and "vers=1.0" on mount to a server which supports
the CIFS Unix Extensions.
Cc: stable@vger.kernel.org Reviewed-by: David Howells <dhowell@redhat.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When mounting with the SMB1 Unix Extensions (e.g. mounts
to Samba with vers=1.0), reconnects no longer reset the
Unix Extensions (SetFSInfo SET_FILE_UNIX_BASIC) after tcon so most
operations (e.g. stat, ls, open, statfs) will fail continuously
with:
"Operation not supported"
if the connection ever resets (e.g. due to brief network disconnect)
Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter reported a Smatch static checker warning:
fs/smb/client/cifsfs.c:1981 init_cifs()
error: we previously assumed 'serverclose_wq' could be null (see line 1895)
The patch which introduced the serverclose workqueue used the wrong
ordering in the error paths in init_cifs() when freeing it on errors.
The BIOS version was populated in the FDMI response. Systems with EFI
would show the option ROM version as 0. Populate the EFI version here
instead; the BIOS version is already displayed under
FDMI_HBA_BOOT_BIOS_NAME.
In psb_intel_lvds_get_modes(), the return value of drm_mode_duplicate()
is assigned to mode, which can lead to a NULL pointer dereference on
failure of drm_mode_duplicate(). Add a check to avoid the NULL pointer
dereference.
In cdv_intel_lvds_get_modes(), the return value of drm_mode_duplicate()
is assigned to mode, which can lead to a NULL pointer dereference on
failure of drm_mode_duplicate(). Add a check to avoid the NULL pointer
dereference.
If io-wq worker creation fails, we retry it by queueing up a task_work.
A task_work is needed because the creation should be done from the user
process context. The problem is that retries are not limited, and if
queueing a task_work is the reason for the failure, we might get into an
infinite loop.
It doesn't seem to happen now, but it would with a following patch
executing task_work in the freezer's loop. For now, arbitrarily limit the
number of attempts to create a worker.
Cc: stable@vger.kernel.org Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Reported-by: Julian Orth <ju.orth@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8280436925db88448c7c85c6656edee1a43029ea.1720634146.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current security_inode_setxattr() and security_inode_removexattr()
hooks rely on individual LSMs to either call into the associated
capability hooks (cap_inode_setxattr() or cap_inode_removexattr()), or
return a magic value of 1 to indicate that the LSM layer itself should
perform the capability checks. Unfortunately, with the default return
value for these LSM hooks being 0, an individual LSM hook returning a
1 will cause the LSM hook processing to exit early, potentially
skipping a LSM. Thankfully, with the exception of the BPF LSM, none
of the LSMs which currently register inode xattr hooks should end up
returning a value of 1, and in the BPF LSM case, with the BPF LSM hooks
executing last there should be no real harm in stopping processing of
the LSM hooks. However, the reliance on the individual LSMs to either
call the capability hooks themselves, or signal the LSM with a return
value of 1, is fragile and relies on a specific set of LSMs being
enabled. This patch is an effort to resolve, or minimize, these
issues.
Before we discuss the solution, there are a few observations and
considerations that we need to take into account:
* BPF LSM registers an implementation for every LSM hook, and that
implementation simply returns the hook's default return value, a
0 in this case. We want to ensure that the default BPF LSM behavior
results in the capability checks being called.
* SELinux and Smack do not expect the traditional capability checks
to be applied to the xattrs that they "own".
* SELinux and Smack are currently written in such a way that the
xattr capability checks happen before any additional LSM specific
access control checks. SELinux does apply SELinux specific access
controls to all xattrs, even those not "owned" by SELinux.
* IMA and EVM also register xattr hooks but assume that the LSM layer
and specific LSMs have already authorized the basic xattr operation.
In order to ensure we perform the capability based access controls
before the individual LSM access controls, perform only one capability
access control check for each operation, and clarify the logic around
applying the capability controls, we need a mechanism to determine if
any of the enabled LSMs "own" a particular xattr and want to take
responsibility for controlling access to that xattr. The solution in
this patch is to create a new LSM hook, 'inode_xattr_skipcap', that is
not exported to the rest of the kernel via a security_XXX() function,
but is used by the LSM layer to determine if a LSM wants to control
access to a given xattr and avoid the traditional capability controls.
Registering an inode_xattr_skipcap hook is optional, if a LSM declines
to register an implementation, or uses an implementation that simply
returns the default value (0), there is no effect as the LSM continues
to enforce the capability based controls (unless another LSM takes
ownership of the xattr). If none of the LSMs signal that the
capability checks should be skipped, the capability check is performed
and if access is granted the individual LSM xattr access control hooks
are executed, keeping with the DAC-before-LSM convention.
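As a rough sketch (hedged: based on the description above, actual
upstream implementations may differ), an LSM-side hook might look like:

/* Sketch: claim ownership of this LSM's own xattr so the LSM layer
 * skips the traditional capability checks for it. Return 0 to leave
 * the capability checks in place, nonzero to skip them. */
static int selinux_inode_xattr_skipcap(const char *name)
{
	return !strcmp(name, XATTR_NAME_SELINUX);
}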
Cc: stable@vger.kernel.org Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The FW_LOADER config option only selects the firmware loader API, but we
also need the sysfs_upload symbols for firmware_upload_unregister() and
firmware_upload_register() to function properly.
Always initialize i_uid/i_gid inside the sysfs core so set_ownership()
can safely skip setting them.
Commit 5ec27ec735ba ("fs/proc/proc_sysctl.c: fix the default values of
i_uid/i_gid on /proc/sys inodes.") added defaults for i_uid/i_gid when
set_ownership() was not implemented. It also missed adjusting
net_ctl_set_ownership() to use the same default values in case the
computation of a better value failed.
Fixes: 5ec27ec735ba ("fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes.") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
SAT-5 revision 8 specification removed the text about the ANSI INCITS
431-2007 compliance which was requiring SCSI/ATA Translation (SAT) to
return descriptor format sense data for the ATA PASS-THROUGH commands
regardless of the setting of the D_SENSE bit.
Let's honor the D_SENSE bit for ATA PASS-THROUGH commands while
generating the "ATA PASS-THROUGH INFORMATION AVAILABLE" sense data.
SAT-5 revision 7
================
12.2.2.8 Fixed format sense data
Table 212 shows the fields returned in the fixed format sense data
(see SPC-5) for ATA PASS-THROUGH commands. SATLs compliant with ANSI
INCITS 431-2007, SCSI/ATA Translation (SAT) return descriptor format
sense data for the ATA PASS-THROUGH commands regardless of the setting
of the D_SENSE bit.
SAT-5 revision 8
================
12.2.2.8 Fixed format sense data
Table 211 shows the fields returned in the fixed format sense data
(see SPC-5) for ATA PASS-THROUGH commands.
Current ata_gen_passthru_sense() code performs two actions:
1. Generates sense data based on the ATA 'status' and ATA 'error' fields.
2. Populates "ATA Status Return sense data descriptor" / "Fixed format
sense data" with ATA taskfile fields.
The problem is that #1 generates sense data even when a valid sense data
is already present (ATA_QCFLAG_SENSE_VALID is set). Factoring out #2 into
a separate function allows us to generate sense data only when there is
no valid sense data (ATA_QCFLAG_SENSE_VALID is not set).
As a bonus, we can now delete a FIXME comment in atapi_qc_complete()
which states that we don't want to translate taskfile registers into
sense descriptors for ATAPI.
Additionally, always set SAM_STAT_CHECK_CONDITION when CK_COND=1, because
the SAT specification mandates that the SATL shall return CHECK CONDITION
if the CK_COND bit is set.
The ATA PASS-THROUGH handling logic in ata_scsi_qc_complete() is hard
to read/understand. Improve the readability of the code by moving checks
into self-explanatory boolean variables.
Cc: stable@vger.kernel.org # 4.19+ Co-developed-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Igor Pylypiv <ipylypiv@google.com> Link: https://lore.kernel.org/r/20240702024735.1152293-3-ipylypiv@google.com Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There appears to be a possible use-after-free with vdec_close(). The
firmware will add buffer release work to the work queue through HFI
callbacks as a normal part of decoding. Randomly closing the decoder
device from userspace during normal decoding can then incur a
read-after-free of inst.
In tpm_bios_measurements_open(), get_device() is called on the device
embedded in struct tpm_chip. In the error path, however, put_device() is
not called. This results in a reference count leak, which prevents the
device from being properly released. Make sure to call put_device() when
the seq_open() call fails.
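A minimal sketch of the fix (hedged; the surrounding open() code is
simplified):

/* Sketch: balance the earlier get_device() when seq_open() fails. */
get_device(&chip->dev);
err = seq_open(file, seqops);
if (err) {
	put_device(&chip->dev);	/* was missing in the error path */
	return err;
}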
Cc: stable@vger.kernel.org # +v4.18 Fixes: 9b01b5356629 ("tpm: Move shared eventlog functions to common.c") Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Correct the ATA PASS-THROUGH fixed format sense data offsets to conform
to SPC-6 and SAT-5 specifications. Additionally, set the VALID bit to
indicate that the INFORMATION field contains valid information.
INFORMATION
===========
SAT-5 Table 212 — "Fixed format sense data INFORMATION field for the ATA
PASS-THROUGH commands" defines the following format:
SPC-6 Table 48 - "Fixed format sense data" specifies that the INFORMATION
field starts at byte 3 in sense buffer resulting in the following offsets
for the ATA PASS-THROUGH commands:
+------------+-------------------------+
| Field | Offset in sense buffer |
+------------+-------------------------+
| ERROR | 3 |
| STATUS | 4 |
| DEVICE | 5 |
| COUNT(7:0) | 6 |
+------------+-------------------------+
COMMAND-SPECIFIC INFORMATION
============================
SAT-5 Table 213 - "Fixed format sense data COMMAND-SPECIFIC INFORMATION
field for ATA PASS-THROUGH" defines the following format:
SPC-6 Table 48 - "Fixed format sense data" specifies that
the COMMAND-SPECIFIC-INFORMATION field starts at byte 8
in sense buffer resulting in the following offsets for
the ATA PASS-THROUGH commands:
Offsets of these fields in the fixed sense format are as follows:
+-------------------+-------------------------+
| Field | Offset in sense buffer |
+-------------------+-------------------------+
| FLAGS | LOG INDEX | 8 |
| LBA (7:0) | 9 |
| LBA (15:8) | 10 |
| LBA (23:16) | 11 |
+-------------------+-------------------------+
The description of the fua module parameter is defined using
MODULE_PARM_DESC() with the first argument passed being "zoned". That is
obviously the wrong name. Fix it by using the correct "fua" parameter
name so that "modinfo null_blk" displays correct information.
As commit ad8606702f26 ("md/raid5: remove rcu protection to access rdev
from conf") explains, rcu protection can be removed; however, there are
three places left. They won't cause any real problems.
As was done in 0200679fc795 ("tmpfs: verify {g,u}id mount options
correctly"), we need to validate that the requested uid and/or gid is
representable in the filesystem's idmapping.
Cribbing from the above commit log,
The contract for {g,u}id mount options and {g,u}id values in general set
from userspace has always been that they are translated according to the
caller's idmapping. In so far, fuse has been doing the correct thing.
But since fuse is mountable in unprivileged contexts it is also
necessary to verify that the resulting {k,g}uid is representable in the
namespace of the superblock.
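A minimal sketch of the validation (hedged; the ctx/result names mirror
fs_parameter parsing and are assumptions here):

/* Sketch: translate the requested uid through the caller's idmapping
 * and reject ids that the superblock's namespace cannot represent. */
kuid_t kuid = make_kuid(current_user_ns(), result.uint_32);

if (!uid_valid(kuid))
	return invalfc(fc, "Invalid user_id");
ctx->user_id = kuid;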
Fixes: c30da2e981a7 ("fuse: convert to use the new mount API") Cc: stable@vger.kernel.org # 5.4+ Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/8f07d45d-c806-484d-a2e3-7a2199df1cd2@redhat.com Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a task's weight is being changed, set_load_weight() is called with
@update_load set. As weight changes aren't trivial for the fair class,
set_load_weight() calls fair.c::reweight_task() for fair class tasks.
However, set_load_weight() first tests task_has_idle_policy() on entry and
skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as
SCHED_IDLE tasks are just fair tasks with a very low weight and they would
incorrectly skip load, vlag and position updates.
Fix it by updating reweight_task() to take struct load_weight, as the
idle weight can't be expressed with a prio, and by making
set_load_weight() call reweight_task() for SCHED_IDLE tasks too when
@update_load is set.
Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v4.15+ Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Chanctx emulation didn't report IEEE80211_CONF_CHANGE_CHANNEL to drivers
during ieee80211_restart_hw (ieee80211_emulate_add_chanctx). This caused
non-chanctx drivers not to be on the correct channel after recovery. RX
then behaved abnormally and, finally, disconnection/reconnection
occurred. So, set IEEE80211_CONF_CHANGE_CHANNEL when in_reconfig.
I noticed LAYOUTGET(LAYOUTIOMODE4_RW) returning NFS4ERR_ACCESS
unexpectedly. The NFS client had created a file with mode 0444, and
the server had returned a write delegation on the OPEN(CREATE). The
client was requesting a RW layout using the write delegation stateid
so that it could flush file modifications.
Creating a read-only file does not seem to be problematic for
NFSv4.1 without pNFS, so I began looking at NFSD's implementation of
LAYOUTGET.
The failure was because fh_verify() was doing a permission check as
part of verifying the FH presented during the LAYOUTGET. It uses the
loga_iomode value to specify the @accmode argument to fh_verify().
fh_verify(MAY_WRITE) on a file whose mode is 0444 fails with -EACCES.
To permit LAYOUT* operations in this case, add OWNER_OVERRIDE when
checking the access permission of the incoming file handle for
LAYOUTGET and LAYOUTCOMMIT.
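A minimal sketch of the change in the LAYOUTGET path (hedged; the
LAYOUTCOMMIT side is analogous and details may differ upstream):

/* Sketch: add OWNER_OVERRIDE so a delegation holder can get a RW
 * layout on a file whose mode bits would otherwise deny MAY_WRITE. */
accmode = NFSD_MAY_OWNER_OVERRIDE;
if (lgp->lg_seg.iomode == IOMODE_RW)
	accmode |= NFSD_MAY_READ | NFSD_MAY_WRITE;
else
	accmode |= NFSD_MAY_READ;
nfserr = fh_verify(rqstp, &cstate->current_fh, 0, accmode);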
Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # v6.6+
Message-Id: 4E9C0D74-A06D-4DC3-A48A-73034DC40395@oracle.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The caching mode for buffer objects with VRAM as a possible
placement was forced to write-combined, regardless of placement.
However, write-combined system memory is expensive to allocate and
even though it is pooled, the pool is expensive to shrink, since
it involves global CPU TLB flushes.
Moreover write-combined system memory from TTM is only reliably
available on x86 and DGFX doesn't have an x86 restriction.
So regardless of the cpu caching mode selected for a bo,
internally use write-back caching mode for system memory on DGFX.
Coherency is maintained, but user-space clients may perceive a
difference in cpu access speeds.
v2:
- Update RB- and Ack tags.
- Rephrase wording in xe_drm.h (Matt Roper)
v3:
- Really rephrase wording.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: 622f709ca629 ("drm/xe/uapi: Add support for CPU caching mode") Cc: Pallavi Mishra <pallavi.mishra@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: dri-devel@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Effie Yu <effie.yu@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Jose Souza <jose.souza@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: Matthew Auld <matthew.auld@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Fixes: 622f709ca629 ("drm/xe/uapi: Add support for CPU caching mode") Acked-by: Michal Mrozek <michal.mrozek@intel.com> Acked-by: Effie Yu <effie.yu@intel.com> #On chat Link: https://patchwork.freedesktop.org/patch/msgid/20240705132828.27714-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 01e0cfc994be484ddcb9e121e353e51d8bb837c0) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the source address is selected, the scope must be checked. For
example, if a loopback address is assigned to the vrf device, it must not
be chosen for packets sent outside.
CC: stable@vger.kernel.org Fixes: afbac6010aec ("net: ipv6: Address selection needs to consider L3 domains") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240710081521.3809742-4-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By default, an address assigned to the output interface is selected when
the source address is not specified. This is problematic when a route,
configured in a vrf, uses an interface from another vrf (aka route leak).
The original vrf does not own the selected source address.
Let's add a check against the output interface and call the appropriate
function to select the source address.
CC: stable@vger.kernel.org Fixes: 8cbb512c923d ("net: Add source address lookup op for VRF") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240710081521.3809742-2-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By default, an address assigned to the output interface is selected when
the source address is not specified. This is problematic when a route,
configured in a vrf, uses an interface from another vrf (aka route leak).
The original vrf does not own the selected source address.
Let's add a check against the output interface and call the appropriate
function to select the source address.
CC: stable@vger.kernel.org Fixes: 0d240e7811c4 ("net: vrf: Implement get_saddr for IPv6") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20240710081521.3809742-3-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
io_uring can asynchronously add a task_work while the task is being
frozen. TIF_NOTIFY_SIGNAL will prevent the task from sleeping in
do_freezer_trap(), and since get_signal()'s relock loop doesn't retry
task_work, the task will spin there, unable to sleep, until the freezing
is cancelled / the task is killed / etc.
Run task_works in the freezer path. Keep the patch small and simple so it
can be easily backported, but we might need to do some cleaning up
afterwards and look at whether there are other places with similar
problems.
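A minimal sketch of the idea in the freezer path (hedged; the exact
placement within get_signal()/do_freezer_trap() is simplified):

/* Sketch: flush pending task_work before the freezer puts the task to
 * sleep, so TIF_NOTIFY_SIGNAL doesn't keep it spinning. */
if (unlikely(task_work_pending(current)))
	task_work_run();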
At add_ra_bio_pages() we are accessing the extent map to calculate
'add_size' after we dropped our reference on the extent map, resulting
in a use-after-free. Fix this by computing 'add_size' before dropping our
extent map reference.
Reported-by: syzbot+853d80cba98ce1157ae6@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000038144061c6d18f2@google.com/ Fixes: 6a4049102055 ("btrfs: subpage: make add_ra_bio_pages() compatible") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To ensure non-reentrancy, __queue_work() attempts to enqueue a work item
to the pool of the currently executing worker. This is not only
unnecessary for an ordered workqueue, where order inherently implies
non-reentrancy, but it could also disrupt the sequence if the item is not
enqueued on the newest PWQ.
Just queue it to the newest PWQ and let order management guarantee
non-reentrancy.
The issue initially stems from libpcap. The ethertype will be overwritten
as the VLAN TPID if the network interface lacks hardware VLAN offloading.
In the outbound packet path, if hardware VLAN offloading is unavailable,
the VLAN tag is inserted into the payload but then cleared from the sk_buff
struct. Consequently, this can lead to a false negative when checking for
the presence of a VLAN tag, causing the packet sniffing outcome to lack
VLAN tag information (i.e., TCI-TPID). As a result, the packet capturing
tool may be unable to parse packets as expected.
The TCI-TPID is missing because the prb_fill_vlan_info() function does not
modify the tp_vlan_tci/tp_vlan_tpid values, as the information is in the
payload and not in the sk_buff struct. The skb_vlan_tag_present() function
only checks vlan_all in the sk_buff struct. In cooked mode, the L2 header
is stripped, preventing the packet capturing tool from determining the
correct TCI-TPID value. Additionally, the protocol in SLL is incorrect,
which means the packet capturing tool cannot parse the L3 header correctly.
Currently, netconsole cleans up the netpoll structure before disabling
the target. This approach can lead to race conditions, as message
senders (write_ext_msg() and write_msg()) check if the target is
enabled before using netpoll. The sender can validate that the target is
enabled, but the netpoll structure might already be de-allocated, causing
undesired behaviour.
This patch reverses the order of operations:
1. Disable the target
2. Clean up the netpoll structure
This change eliminates the potential race condition, ensuring that
no messages are sent through a partially cleaned-up netpoll structure.
Fixes: 2382b15bcc39 ("netconsole: take care of NETDEV_UNREGISTER event") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240712143415.1141039-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Running the LTP hotplug stress test on an aarch64 machine results in
rcu_sched stall warnings when the broadcast hrtimer is owned by the
unplugged CPU. The issue is the following:
CPU1 (owns the broadcast hrtimer) CPU2
tick_broadcast_enter()
// shutdown local timer device
broadcast_shutdown_local()
...
tick_broadcast_exit()
clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)
// timer device is not programmed
cpumask_set_cpu(cpu, tick_broadcast_force_mask)
initiates offlining of CPU1
take_cpu_down()
/*
* CPU1 shuts down and does not
* send broadcast IPI anymore
*/
takedown_cpu()
hotplug_cpu__broadcast_tick_pull()
// move broadcast hrtimer to this CPU
clockevents_program_event()
bc_set_next()
hrtimer_start()
/*
* timer device is not programmed
* because only the first expiring
* timer will trigger clockevent
* device reprogramming
*/
What happens is that CPU2 exits broadcast mode with the force bit set,
then the local timer device is not reprogrammed and CPU2 expects to
receive the expired event via the broadcast IPI. But this does not happen
because CPU1 is offlined by CPU2. CPU2 switches the clockevent device to
ONESHOT state, but does not reprogram the device.
The subsequent reprogramming of the hrtimer broadcast device does not
program the clockevent device of CPU2 either because the pending expiry
time is already in the past and the CPU expects the event to be delivered.
As a consequence all CPUs which wait for a broadcast event to be delivered
are stuck forever.
Fix this issue by reprogramming the local timer device if the broadcast
force bit of the CPU is set so that the broadcast hrtimer is delivered.
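A rough sketch of the idea when pulling the broadcast tick to the new
owner (hedged; helper names are from tick-internal.h, and the exact
upstream placement may differ):

/* Sketch: a CPU that still has its bit set in tick_broadcast_force_mask
 * never reprogrammed its local clockevent device; kick it so the
 * pending expiry is actually delivered. */
if (cpumask_test_cpu(smp_processor_id(), tick_broadcast_force_mask))
	tick_program_event(dev->next_event, 1);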
[ tglx: Massage comment and change log. Add Fixes tag ]
The Linux kernel uses the thermal zone node name when registering thermal
zones and has a hard-coded limit of 20 characters, including the
terminating NUL byte. The bindings expect node names to end with
'-thermal', which is eight bytes long, thus we have only 11 characters
for the rest of the node name (thus 10 for the pattern after the leading
fixed character).
Reported-by: Rob Herring <robh@kernel.org> Closes: https://lore.kernel.org/all/CAL_JsqKogbT_4DPd1n94xqeHaU_J8ve5K09WOyVsRX3jxxUW3w@mail.gmail.com/ Fixes: 1202a442a31f ("dt-bindings: thermal: Add yaml bindings for thermal zones") Cc: stable@vger.kernel.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240702145248.47184-1-krzysztof.kozlowski@linaro.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During the probe, driver enables clocks necessary to access registers
(in get_temp()) and then registers thermal zone with managed-resources
(devm) interface. Removal of device is not done in reversed order,
because:
1. Clock will be disabled in driver remove() callback - thermal zone is
still registered and accessible to users,
2. devm interface will unregister thermal zone.
This leaves a short window between (1) and (2) for accessing the
get_temp() callback with a disabled clock.
Fix this by enabling clock also via devm-interface, so entire cleanup
path will be in proper, reversed order.
Fixes: 8454c8c09c77 ("thermal/drivers/bcm2835: Remove buggy call to thermal_of_zone_unregister") Cc: stable@vger.kernel.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240709-thermal-probe-v1-1-241644e2b6e0@linaro.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When accessing a file with more entries than ES_MAX_ENTRY_NUM, the
bh-array is allocated in __exfat_get_entry_set. The problem is that the
bh-array is allocated with GFP_KERNEL, which does not make sense here. In
the following cases, a deadlock on sbi->s_lock between two processes may
occur.
The header_length field is in byte units, thus it cannot express the
number of elements in the header field. It seems that the argument of the
counted_by attribute cannot be an arithmetic expression, therefore this
commit just reverts the commit in question.
8117961d98f ("x86/efi: Disregard setup header of loaded image")
which triggers boot issues on older Dell laptops. As it turns out,
switching back to a heap allocation for the struct boot_params
constructed by the EFI stub works around this, even though it is unclear
why.
The fail label is only used in a situation where the previous EFI API
call succeeded, and so status will be set to EFI_SUCCESS. Fix this by
dropping the goto entirely and calling efi_exit() with the correct error
code.
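The shape of the fix, as a sketch with hypothetical surrounding code
(do_alloc() stands in for whichever non-EFI step fails):

	/* status still holds EFI_SUCCESS from the last successful EFI
	 * call, so 'goto fail' would report success on failure. */
	void *p = do_alloc();	/* hypothetical failing step */
	if (!p)
		efi_exit(handle, EFI_OUT_OF_RESOURCES);	/* was: goto fail */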
mem_cgroup_calculate_protection() is not stateless and should only be used
as part of a top-down tree traversal. shrink_one() traverses the per-node
memcg LRU instead of the root_mem_cgroup tree, and therefore it should not
call mem_cgroup_calculate_protection().
The existing misuse in shrink_one() can cause ineffective protection of
sub-trees that are grandchildren of root_mem_cgroup. Fix it by reusing
lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to
calculate the protection.
Previously, lru_gen_age_node() opportunistically skipped the first pass,
i.e., when scan_control->priority is DEF_PRIORITY. On the second pass,
lruvec_is_sizable() uses the appropriate scan_control->priority, set by
set_initial_priority() from lru_gen_shrink_node(), to decide whether a
memcg is too small to reclaim from.
Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup
tree. So it should call set_initial_priority() upfront, to make sure
lruvec_is_sizable() uses the appropriate scan_control->priority on the
first pass. Otherwise, lruvec_is_reclaimable() can return false negatives
and result in premature OOM kills when min_ttl_ms is used.
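A heavily simplified sketch of the resulting ordering (the loop body is
elided to a comment):

	static void lru_gen_age_node(struct pglist_data *pgdat,
				     struct scan_control *sc)
	{
		struct mem_cgroup *memcg;

		/* upfront, so lruvec_is_sizable() sees a real priority
		 * on the first pass as well */
		set_initial_priority(pgdat, sc);

		memcg = mem_cgroup_iter(NULL, NULL, NULL);
		do {
			/* age one memcg: enforce min_ttl_ms and compute
			 * the protection of its lruvec */
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
	}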
set_initial_priority() tries to jump-start global reclaim by estimating
the priority based on cold/hot LRU pages. The estimation does not account
for shrinker objects, and it cannot do so because their sizes can be in
units other than pages.
If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0
where the ZFS ARC can use almost all system memory,
set_initial_priority() can vastly underestimate how much memory the ARC
shrinker can evict and assign extremely low values to
scan_control->priority, resulting in overshoots of shrinker objects.
To reproduce the problem, use TrueNAS SCALE 24.04.0 with 32GB DRAM, a
test ZFS pool and the following commands:
for ((i = 0; i < 20; i++))
do
sleep 120
fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
--filename=/dev/zero --size=1024m --fadvise_hint=0 \
--rw=randrw --random_distribution=random \
--time_based --runtime=1m
done
To fix the problem:
1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
the jump-start from being overly aggressive.
2. Account for the progress from mm_account_reclaimed_pages(), to
prevent kswapd_shrink_node() from raising the priority
unnecessarily.
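Fix 1 plausibly amounts to a one-line clamp in set_initial_priority()
(a sketch, assuming 'priority' holds the estimate):

	/* never jump-start more aggressively than DEF_PRIORITY/2 */
	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);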
Commit 2b5067a8143e ("mm: mmap_lock: add tracepoints around lock
acquisition") introduced TRACE_MMAP_LOCK_EVENT() macro using
preempt_disable() in order to let get_mm_memcg_path() return a percpu
buffer exclusively used by normal, softirq, irq and NMI contexts
respectively.
Commit 832b50725373 ("mm: mmap_lock: use local locks instead of disabling
preemption") replaced preempt_disable() with local_lock(&memcg_paths.lock)
based on an argument that preempt_disable() has to be avoided because
get_mm_memcg_path() might sleep if PREEMPT_RT=y.
We could replace local_lock() with local_lock_irqsave() in order to
suppress these messages. But this patch instead replaces the percpu
buffers with an on-stack buffer, since each buffer returned by
get_memcg_path_buf() is only 256 bytes, which is tolerable to allocate
on the current thread's kernel stack.
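A sketch of the reworked macro (simplified; the buffer size and the
helper's new signature are assumed from the description above):

	#define MEMCG_PATH_BUF_SIZE 256

	#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)			\
		do {							\
			char buf[MEMCG_PATH_BUF_SIZE];			\
									\
			/* no percpu buffer, no local_lock needed */	\
			get_mm_memcg_path(mm, buf, sizeof(buf));	\
			trace_mmap_lock_##type(mm, buf, ##__VA_ARGS__);	\
		} while (0)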
Link: https://lkml.kernel.org/r/ef22d289-eadb-4ed9-863b-fbc922b33d8d@I-love.SAKURA.ne.jp Reported-by: syzbot <syzbot+40905bca570ae6784745@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=40905bca570ae6784745 Fixes: 832b50725373 ("mm: mmap_lock: use local locks instead of disabling preemption") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
evict_folios() uses a second pass to reclaim folios that have gone through
page writeback and become clean before it finishes the first pass, since
folio_rotate_reclaimable() cannot handle those folios due to the
isolation.
The second pass tries to avoid potential double counting by deducting
scan_control->nr_scanned. However, this can result in underflow of
nr_scanned, under a condition where shrink_folio_list() does not increment
nr_scanned, i.e., when folio_trylock() fails.
The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
vmpressure_calc_level(), to become zero, resulting in the following crash:
[exception RIP: vmpressure_work_fn+101]
process_one_work at ffffffffa3313f2b
Since scan_control->nr_scanned has no established semantics, the potential
double counting has minimal risks. Therefore, fix the problem by not
deducting scan_control->nr_scanned in evict_folios().
Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated") Reported-by: Wei Xu <weixugc@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alexander Motin <mav@ixsystems.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Lockdep considers this an AA deadlock because the different resize_lock
mutexes reside in the same lockdep class, but this is a false positive.
Place them in distinct classes to avoid these warnings.
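One way to place each hstate's resize_lock in its own lockdep class
(illustrative, not necessarily the exact patch) is a per-hstate
lock_class_key:

	/* one key per hstate makes lockdep treat each mutex as its own
	 * class, so demoting from one hstate to another is no longer
	 * seen as recursive locking */
	static struct lock_class_key resize_keys[HUGE_MAX_HSTATE];

	__mutex_init(&h->resize_lock, "resize_lock",
		     &resize_keys[hstate_index(h)]);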
Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When trying to allocate a hugepage with no reserved ones free, the
allocation may be allowed if a number of overcommit hugepages was
configured (using /proc/sys/vm/nr_overcommit_hugepages) and that number
wasn't reached. This allows extra hugepages to be allocated dynamically,
if there are resources for it. Some sysadmins even prefer not reserving
any hugepages and setting a big number of overcommit hugepages.
But while attempting to allocate overcommit hugepages in a multi-node
system (either NUMA or mempolicy/cpuset), said allocations might randomly
fail even when there are resources available for the allocation.
This happens because allowed_mems_nr() only accounts for the number of
free hugepages in the nodes the current process belongs to, while the
surplus hugepage allocation may be satisfied from any node. In case one
or more of the requested surplus hugepages are allocated in a different
node, the whole allocation fails because allowed_mems_nr() returns a
lower value.
So allocate surplus hugepages in one of the nodes the current process
belongs to.
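A sketch of the idea (illustrative; the exact call site differs):
derive a nodemask from the task's mempolicy, falling back to the
cpuset-allowed nodes, and pass it to the surplus allocation:

	gfp_t gfp = htlb_alloc_mask(h) | __GFP_NOWARN;
	nodemask_t *nmask = policy_mbind_nodemask(gfp);

	folio = alloc_surplus_hugetlb_folio(h, gfp, NUMA_NO_NODE,
					    nmask ? nmask :
					    &cpuset_current_mems_allowed);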
An easy way to reproduce this issue is to use a system with two or more
NUMA nodes.
xarray can't support arbitrary page cache sizes. The largest supported
page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71
("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However,
it's possible to have a 512MB page cache in the huge memory collapse
path on an ARM64 system whose base page size is 64KB. A 512MB page cache
breaks that limit, and a warning is raised when the xarray entry is
split, as shown in the following example.
[root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
KernelPageSize: 64 kB
[root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
:
int main(int argc, char **argv)
{
const char *filename = TEST_XFS_FILENAME;
int fd = 0;
void *buf = (void *)-1, *p;
int pgsize = getpagesize();
int ret = 0;
if (pgsize != 0x10000) {
fprintf(stdout, "System with 64KB base page size is required!\n");
return -EPERM;
}
Fix it by correcting the supported page cache orders, with different sets
for DAX and other files. With the limit corrected, a 512MB page cache
becomes disallowed for all non-DAX files on ARM64 systems where the base
page size is 64KB. After this patch is applied, the test program's
madvise() system call to collapse the page cache fails with -EINVAL,
returned from __thp_vma_allowable_orders().
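The corrected sets plausibly look like this (a sketch following the THP
order-mask convention; DAX keeps its fixed PMD order while everything
else is capped by the xarray limit):

	#define THP_ORDERS_ALL_FILE_DAX	BIT(PMD_ORDER)

	/* order-0 is not a huge page, hence masked out */
	#define THP_ORDERS_ALL_FILE_DEFAULT \
		((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))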
Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't
force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is
because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT.
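Presumably the robust check keys off CONFIG_64BIT, which every 64-bit
architecture defines (a sketch, not the verbatim patch):

	/* CONFIG_32BIT is not set by x86_32; test for 64-bit instead */
	if (!IS_ENABLED(CONFIG_64BIT))
		return 0;	/* skip huge-page alignment on 32 bit */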
When a process' cred struct is replaced, this _almost_ always invokes
the cred_prepare LSM hook; but in one special case (when
KEYCTL_SESSION_TO_PARENT updates the parent's credentials), the
cred_transfer LSM hook is used instead. Landlock only implements the
cred_prepare hook, not cred_transfer, so KEYCTL_SESSION_TO_PARENT causes
all information on Landlock restrictions to be lost.
This basically means that a process with the ability to use the fork()
and keyctl() syscalls can get rid of all Landlock restrictions on
itself.
Fix it by adding a cred_transfer hook that does the same thing as the
existing cred_prepare hook. (Implemented by having hook_cred_prepare()
call hook_cred_transfer() so that the two functions are less likely to
accidentally diverge in the future.)
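Following that description, the fix's shape is roughly (domain-copying
body elided to a comment):

	static void hook_cred_transfer(struct cred *const new,
				       const struct cred *const old)
	{
		/* copy the Landlock domain from old to new */
	}

	static int hook_cred_prepare(struct cred *const new,
				     const struct cred *const old,
				     const gfp_t gfp)
	{
		hook_cred_transfer(new, old);
		return 0;
	}

	static struct security_hook_list landlock_hooks[] __ro_after_init = {
		LSM_HOOK_INIT(cred_prepare, hook_cred_prepare),
		LSM_HOOK_INIT(cred_transfer, hook_cred_transfer),
	};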
Cc: stable@kernel.org Fixes: 385975dca53e ("landlock: Set up the security framework and manage credentials") Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20240724-landlock-houdini-fix-v1-1-df89a4560ca3@google.com Signed-off-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When mtk-cmdq unbinds, a WARN_ON message with the condition
pm_runtime_get_sync() < 0 occurs.
According to the call trace below:
cmdq_mbox_shutdown
mbox_free_channel
mbox_controller_unregister
__devm_mbox_controller_unregister
...
The root cause can be deduced to be calling pm_runtime_get_sync() after
calling pm_runtime_disable() as observed below:
1. The CMDQ driver uses devm_mbox_controller_register() in cmdq_probe()
to bind the cmdq device to the mbox_controller, so
devm_mbox_controller_unregister() will automatically unregister
the device bound to the mailbox controller when the device-managed
resource is removed. That means devm_mbox_controller_unregister()
and cmdq_mbox_shutdown() will be called after cmdq_remove().
2. CMDQ driver also uses devm_pm_runtime_enable() in cmdq_probe() after
devm_mbox_controller_register(), so that devm_pm_runtime_disable()
will be called after cmdq_remove(), but before
devm_mbox_controller_unregister().
To fix this problem, cmdq_probe() needs to call
devm_mbox_controller_register() after devm_pm_runtime_enable(), so that
devm_pm_runtime_disable() is called after
devm_mbox_controller_unregister().
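A sketch of the reordered probe (error handling trimmed; the 'cmdq'
fields are assumed):

	/* enable runtime PM first ... */
	err = devm_pm_runtime_enable(dev);
	if (err)
		return err;

	/* ... then register the controller, so devm teardown
	 * unregisters the mailbox before runtime PM is disabled */
	err = devm_mbox_controller_register(dev, &cmdq->mbox);
	if (err)
		return err;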
Two TXDB_V2 channels are used between Linux and the System Manager (SM):
Channel 0 for normal TX, Channel 1 for notification completion.
The TXDB_V2 trigger logic uses imx_mu_xcr_rmw(), which performs a
read/modify/write sequence.
Note: when the MUB GSR bits are cleared, the MUA side GCR bits also get
cleared, per the hardware design.
Channel0 Linux
read GCR->modify GCR->write GCR->M33 SM->read GSR----->clear GSR
|-(1)-|
Channel1 Linux start in time slot(1)
read GCR->modify GCR->write GCR->M33 SM->read GSR->clear GSR
So the Channel 1 GCR read will read back the value that Channel 0 wrote,
because the M33 has not finished clearing GSR; this means the Channel 1
GCR write will trigger both the Channel 1 and Channel 0 interrupts, which
is wrong.
Channel 0 will be freed (SCMI channel status set to FREE) in the M33 SM
when processing the 1st Channel 0 interrupt. So when the 2nd interrupt
triggers (Channel 0 and Channel 1 trigger together), the SM will see a
freed Channel 0 and report a protocol error.
To address the issue, do not use read/modify/write logic; just use a
plain write, because writing 0 to GCR bits is ignored. And after writing
the MUA GCR, wait for the SM to clear the MUB GSR by polling the MUA GCR
value.
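A sketch with a simplified register layout (IMX_MU_GCR_OFFSET and
chan_idx are illustrative names):

	u32 val;
	int ret;

	/* plain write: zero bits are ignored by the MU, so there is
	 * no read/modify/write race with the other channel */
	writel(BIT(chan_idx), priv->base + IMX_MU_GCR_OFFSET);

	/* wait for the SM to clear MUB GSR, observed on the MUA side
	 * as our GCR bit reading back as zero */
	ret = readl_poll_timeout(priv->base + IMX_MU_GCR_OFFSET, val,
				 !(val & BIT(chan_idx)), 0, 10000);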
Fixes: 5bfe4067d350 ("mailbox: imx: support channel type tx doorbell v2") Reviewed-by: Ranjani Vaidyanathan <ranjani.vaidyanathan@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>