In nilfs_rename(), calls to nilfs_put_page() to release pages obtained
with nilfs_find_entry() or nilfs_dotdot() are alternated in the normal
path.
When replacing the kernel memory mapping method from kmap to
kmap_local_{page,folio}, this violates the constraint on the calling order
of kunmap_local().
Swap the order of nilfs_put_page calls where the kmap sections of multiple
pages overlap so that they are nested, allowing direct replacement of
nilfs_put_page() -> unmap_and_put_page().
Without this reordering, that replacement will cause a kernel WARNING in
kunmap_local_indexed() on architectures with high memory mapping.
Link: https://lkml.kernel.org/r/20231127143036.2425-3-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: ee70999a988b ("nilfs2: handle errors that nilfs_prepare_chunk() may return") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch series "nilfs2: Folio conversions for directory paths".
This series applies page->folio conversions to nilfs2 directory
operations. This reduces hidden compound_head() calls and also converts
deprecated kmap calls to kmap_local in the directory code.
Although nilfs2 does not yet support large folios, Matthew has done his
best here to include support for large folios, which will be needed for
devices with large block sizes.
This series corresponds to the second half of the original post [1], but
with two complementary patches inserted at the beginning and some
adjustments, to prevent a kmap_local constraint violation found during
testing with highmem mapping.
I have reviewed all changes and tested this for regular and small block
sizes, both on machines with and without highmem mapping. No issues
found.
This patch (of 17):
In a few directory operations, the call to nilfs_put_page() for a page
obtained using nilfs_find_entry() or nilfs_dotdot() is hidden in
nilfs_set_link() and nilfs_delete_entry(), making it difficult to track
page release and preventing change of its call position.
By moving nilfs_put_page() out of these functions, this makes the page
get/put correspondence clearer and makes it easier to swap
nilfs_put_page() calls (and kunmap calls within them) when modifying
multiple directory entries simultaneously in nilfs_rename().
Also, update comments for nilfs_set_link() and nilfs_delete_entry() to
reflect changes in their behavior.
To make nilfs_put_page() visible from namei.c, this moves its definition
to nilfs.h and replaces existing equivalents to use it, but the exposure
of that definition is temporary and will be removed on a later kmap ->
kmap_local conversion.
The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves.
To tell the availability of the sub-leaf 1 (enumerate the counter mask),
perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1).
The error is not user-visible on bare metal. Because the sub-leaf 0 and
the sub-leaf 1 are always available. However, it may bring issues in a
virtualization environment when a VMM only enumerates the sub-leaf 0.
Introduce the cpuid35_e?x to replace the macros, which makes the
implementation style consistent.
Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250129154820.3755948-3-kan.liang@linux.intel.com
[ The patch is not exactly the same as the upstream patch. Because in the 6.6
stable kernel, the umask2/eq enumeration is not supported. The number of
counters is used rather than the counter mask. But the change is
straightforward, which utilizes the structured union to replace the macros
when parsing the CPUID enumeration. It also fixed a wrong macros. ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In general the delay should be added by the PHY instead of the MAC,
and this improves network stability on some boards which seem to
need different delay.
Commit cf1b6d4441ff ("md: simplify md_seq_ops") introduce following
regressions:
1) If list all_mddevs is emptly, personalities and unused devices won't
be showed to user anymore.
2) If seq_file buffer overflowed from md_seq_show(), then md_seq_start()
will be called again, hence personalities will be showed to user
again.
3) If seq_file buffer overflowed from md_seq_stop(), seq_read_iter()
doesn't handle this, hence unused devices won't be showed to user.
Fix above problems by printing personalities and unused devices in
md_seq_show().
Commit ac619781967b ("md: use separate work_struct for md_start_sync()")
use a new sync_work to replace del_work, however, stop_sync_thread() and
__md_stop_writes() was trying to wait for sync_thread to be done, hence
they should switch to use sync_work as well.
Noted that md_start_sync() from sync_work will grab 'reconfig_mutex',
hence other contex can't held the same lock to flush work, and this will
be fixed in later patches.
Fixes: ac619781967b ("md: use separate work_struct for md_start_sync()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231205094215.1824240-2-yukuai1@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When profile rollback fails in mlx5e_netdev_change_profile, the netdev
profile var is left set to NULL. Avoid a crash when unloading the driver
by not calling profile->cleanup in such a case.
This was encountered while testing, with the original trigger that
the wq rescuer thread creation got interrupted (presumably due to
Ctrl+C-ing modprobe), which gets converted to ENOMEM (-12) by
mlx5e_priv_init, the profile rollback also fails for the same reason
(signal still active) so the profile is left as NULL, leading to a crash
later in _mlx5e_remove.
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.
The function tracer should record the preemption level at the point when
the function is invoked. If the tracing subsystem decrement the
preemption counter it needs to correct this before feeding the data into
the trace buffer. This was broken in the commit cited below while
shifting the preempt-disabled section.
Use tracing_gen_ctx_dec() which properly subtracts one from the
preemption counter on a preemptible kernel.
Cc: stable@vger.kernel.org Cc: Wander Lairson Costa <wander@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de Fixes: ce5e48036c9e7 ("ftrace: disable preemption when recursion locked") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The previous implementation incorrectly configured the cmn_interrupt_2_enable
register for interrupt handling. Using cmn_interrupt_2_enable to configure
Tag, Data RAM ECC interrupts would lead to issues like double handling of the
interrupts (EL1 and EL3) as cmn_interrupt_2_enable is meant to be configured
for interrupts which needs to be handled by EL3.
EL1 LLCC EDAC driver needs to use cmn_interrupt_0_enable register to configure
Tag, Data RAM ECC interrupts instead of cmn_interrupt_2_enable.
dma_map_single is using physical/bus device (DMA) but dma_unmap_single
is using framework device(NAND controller), which is incorrect.
Fixed dma_unmap_single to use correct physical/bus device.
Fixes: ec4ba01e894d ("mtd: rawnand: Add new Cadence NAND driver to MTD subsystem") Cc: stable@vger.kernel.org Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@intel.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a sanity check to madvise_dontneed_free() to address a corner case in
madvise where a race condition causes the current vma being processed to
be backed by a different page size.
During a madvise(MADV_DONTNEED) call on a memory region registered with a
userfaultfd, there's a period of time where the process mm lock is
temporarily released in order to send a UFFD_EVENT_REMOVE and let
userspace handle the event. During this time, the vma covering the
current address range may change due to an explicit mmap done concurrently
by another thread.
If, after that change, the memory region, which was originally backed by
4KB pages, is now backed by hugepages, the end address is rounded down to
a hugepage boundary to avoid data loss (see "Fixes" below). This rounding
may cause the end address to be truncated to the same address as the
start.
Make this corner case follow the same semantics as in other similar cases
where the requested region has zero length (ie. return 0).
This will make madvise_walk_vmas() continue to the next vma in the range
(this time holding the process mm lock) which, due to the prev pointer
becoming stale because of the vma change, will be the same hugepage-backed
vma that was just checked before. The next time madvise_dontneed_free()
runs for this vma, if the start address isn't aligned to a hugepage
boundary, it'll return -EINVAL, which is also in line with the madvise
api.
From userspace perspective, madvise() will return EINVAL because the start
address isn't aligned according to the new vma alignment requirements
(hugepage), even though it was correctly page-aligned when the call was
issued.
Link: https://lkml.kernel.org/r/20250203075206.1452208-1-rcn@igalia.com Fixes: 8ebe0a5eaaeb ("mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Florent Revest <revest@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In [1] it was reported that the acct(2) system call can be used to
trigger NULL deref in cases where it is set to write to a file that
triggers an internal lookup. This can e.g., happen when pointing acc(2)
to /sys/power/resume. At the point the where the write to this file
happens the calling task has already exited and called exit_fs(). A
lookup will thus trigger a NULL-deref when accessing current->fs.
Reorganize the code so that the the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.
The spcm->stream[substream->stream].substream is set during open and was
left untouched. After the first PCM stream it will never be NULL and we
have code which checks for substream NULLity as indication if the stream is
active or not.
For the compressed cstream pointer the same has been done, this change will
correct the handling of PCM streams.
Fixes: 090349a9feba ("ASoC: SOF: Add support for compress API for stream data/offset") Cc: stable@vger.kernel.org Reported-by: Curtis Malainey <cujomalainey@chromium.org> Closes: https://github.com/thesofproject/linux/pull/5214 Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Curtis Malainey <cujomalainey@chromium.org> Link: https://patch.msgid.link/20250205135232.19762-3-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Check the return value of snd_ctl_rename_id() in
snd_hda_create_dig_out_ctls(). Ensure that failures
are properly handled.
[ Note: the error cannot happen practically because the only error
condition in snd_ctl_rename_id() is the missing ID, but this is a
rename, hence it must be present. But for the code consistency,
it's safer to have always the proper return check -- tiwai ]
If 'micfil->quality' received from micfil_quality_set() somehow ends
up with an unpredictable value, switch() operator will fail to
initialize local variable qsel before regmap_update_bits() tries
to utilize it.
While it is unlikely, play it safe and enable a default case that
returns -EINVAL error.
Found by Linux Verification Center (linuxtesting.org) with static
analysis tool SVACE.
The nullity of sps->cstream should be checked similarly as it is done in
sof_set_stream_data_offset() function.
Assuming that it is not NULL if sps->stream is NULL is incorrect and can
lead to NULL pointer dereference.
Fixes: 090349a9feba ("ASoC: SOF: Add support for compress API for stream data/offset") Cc: stable@vger.kernel.org Reported-by: Curtis Malainey <cujomalainey@chromium.org> Closes: https://github.com/thesofproject/linux/pull/5214 Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com> Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Curtis Malainey <cujomalainey@chromium.org> Link: https://patch.msgid.link/20250205135232.19762-2-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
import_iovec() says that it should always be fine to kfree the iovec
returned in @iovp regardless of the error code. __import_iovec_ubuf()
never reallocates it and thus should clear the pointer even in cases when
copy_iovec_*() fail.
Link: https://lkml.kernel.org/r/378ae26923ffc20fd5e41b4360d673bf47b1775b.1738332461.git.asml.silence@gmail.com Fixes: 3b2deb0e46da ("iov_iter: import single vector iovecs as ITER_UBUF") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If drop_monitor is built as a kernel module, syzkaller may have time
to send a netlink NET_DM_CMD_START message during the module loading.
This will call the net_dm_monitor_start() function that uses
a spinlock that has not yet been initialized.
To fix this, let's place resource initialization above the registration
of a generic netlink family.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with Syzkaller.
Fixes: 9a8afc8d3962 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250213152054.2785669-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
OP-TEE supplicant is a user-space daemon and it's possible for it
be hung or crashed or killed in the middle of processing an OP-TEE
RPC call. It becomes more complicated when there is incorrect shutdown
ordering of the supplicant process vs the OP-TEE client application which
can eventually lead to system hang-up waiting for the closure of the
client application.
Allow the client process waiting in kernel for supplicant response to
be killed rather than indefinitely waiting in an unkillable state. Also,
a normal uninterruptible wait should not have resulted in the hung-task
watchdog getting triggered, but the endless loop would.
This fixes issues observed during system reboot/shutdown when supplicant
got hung for some reason or gets crashed/killed which lead to client
getting hung in an unkillable state. It in turn lead to system being in
hung up state requiring hard power off/on to recover.
At the end of a 128b/132b link training sequence, the HW expects the
transcoder training pattern to be set to TPS2 and from that to normal
mode (disabling the training pattern). Transitioning from TPS1 directly
to normal mode leaves the transcoder in a stuck state, resulting in
page-flip timeouts later in the modeset sequence.
Atm, in case of a failure during link training, the transcoder may be
still set to output the TPS1 pattern. Later the transcoder is then set
from TPS1 directly to normal mode in intel_dp_stop_link_train(), leading
to modeset failures later as described above. Fix this by setting the
training patter to TPS2, if the link training failed at any point.
The clue in the specification about the above HW behavior is the
explicit mention that TPS2 must be set after the link training sequence
(and there isn't a similar requirement specified for the 8b/10b link
training), see the Bspec links below.
v2: Add bspec aspect/link to the commit log. (Jani)
Bspec: 54128, 65448, 68849 Cc: stable@vger.kernel.org # v5.18+ Cc: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250217223828.1166093-2-imre.deak@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 8b4bbaf8ddc1f68f3ee96a706f65fdb1bcd9d355) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Any active plane needs to have its crtc included in the atomic
state. For planes enabled via uapi that is all handler in the core.
But when we use a plane for joiner the uapi code things the plane
is disabled and therefore doesn't have a crtc. So we need to pull
those in by hand. We do it first thing in
intel_joiner_add_affected_crtcs() so that any newly added crtc will
subsequently pull in all of its joined crtcs as well.
The symptoms from failing to do this are:
- duct tape in the form of commit 1d5b09f8daf8 ("drm/i915: Fix NULL
ptr deref by checking new_crtc_state")
- the plane's hw state will get overwritten by the disabled
uapi state if it can't find the uapi counterpart plane in
the atomic state from where it should copy the correct state
Disable pingpong dither in dpu_encoder_helper_phys_cleanup().
This avoids the issue where an encoder unknowingly uses dither after
reserving a pingpong block that was previously bound to an encoder that
had enabled dither.
Most SoC dtsi files have the display output interfaces disabled by
default, and only enabled on boards that utilize them. The MT8183
has it backwards: the display outputs are left enabled by default,
and only disabled at the board level.
Reverse the situation for the DSI output so that it follows the
normal scheme. For ease of backporting the DPI output is handled
in a separate patch.
Most kernel configs enable multiple Tegra SoC generations, causing this
typo to go unnoticed. But in the case where a kernel config is strictly
for Tegra186, this is a problem.
The generic_map_lookup_batch currently returns EINTR if it fails with
ENOENT and retries several times on bpf_map_copy_value. The next batch
would start from the same location, presuming it's a transient issue.
This is incorrect if a map can actually have "holes", i.e.
"get_next_key" can return a key that does not point to a valid value. At
least the array of maps type may contain such holes legitly. Right now
these holes show up, generic batch lookup cannot proceed any more. It
will always fail with EINTR errors.
Rather, do not retry in generic_map_lookup_batch. If it finds a non
existing element, skip to the next key. This simple solution comes with
a price that transient errors may not be recovered, and the iteration
might cycle back to the first key under parallel deletion. For example,
Hou Tao <houtao@huaweicloud.com> pointed out a following scenario:
For LPM trie map:
(1) ->map_get_next_key(map, prev_key, key) returns a valid key
(2) bpf_map_copy_value() return -ENOMENT
It means the key must be deleted concurrently.
(3) goto next_key
It swaps the prev_key and key
(4) ->map_get_next_key(map, prev_key, key) again
prev_key points to a non-existing key, for LPM trie it will treat just
like prev_key=NULL case, the returned key will be duplicated.
With the retry logic, the iteration can continue to the key next to the
deleted one. But if we directly skip to the next key, the iteration loop
would restart from the first key for the lpm_trie type.
However, not all races may be recovered. For example, if current key is
deleted after instead of before bpf_map_copy_value, or if the prev_key
also gets deleted, then the loop will still restart from the first key
for lpm_tire anyway. For generic lookup it might be better to stay
simple, i.e. just skip to the next key. To guarantee that the output
keys are not duplicated, it is better to implement map type specific
batch operations, which can properly lock the trie and synchronize with
concurrent mutators.
Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op") Closes: https://lore.kernel.org/bpf/Z6JXtA1M5jAZx8xD@debian.debian/ Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/85618439eea75930630685c467ccefeac0942e2b.1739171594.git.yan@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
nvme_validate_passthru_nsid() logs an err message whose format string is
split over 2 lines. There is a missing space between the two pieces,
resulting in log lines like "... does not match nsid (1)of namespace".
Add the missing space between ")" and "of". Also combine the format
string pieces onto a single line to make the err message easier to grep.
Fixes: e7d4b5493a2d ("nvme: factor out a nvme_validate_passthru_nsid helper") Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
If userspace is trying to achieve a timeout of zero, let 'em have it.
Only round up if the timeout is greater than zero.
Fixes: 4969bccd5f4e ("drm/msm: Avoid rounding down to zero jiffies") Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
Patchwork: https://patchwork.freedesktop.org/patch/632264/ Signed-off-by: Sasha Levin <sashal@kernel.org>
In case we have to retry the loop, we are missing to unlock+put the
folio. In that case, we will keep failing make_device_exclusive_range()
because we cannot grab the folio lock, and even return from the function
with the folio locked and referenced, effectively never succeeding the
make_device_exclusive_range().
While at it, convert the other unlock+put to use a folio as well.
Size of variable sd_gain equals four bytes - DA9150_QIF_SD_GAIN_SIZE.
Size of variable shunt_val equals two bytes - DA9150_QIF_SHUNT_VAL_SIZE.
The expression sd_gain * shunt_val is currently being evaluated using
32-bit arithmetic. So during the multiplication an overflow may occur.
As the value of type 'u64' is used as storage for the eventual result, put
ULL variable at the first position of each expression in order to give the
compiler complete information about the proper arithmetic to use. According
to C99 the guaranteed width for a variable of type 'unsigned long long' >=
64 bits.
Remove the explicit cast to u64 as it is meaningless.
Just for the sake of consistency, perform the similar trick with another
expression concerning 'iavg'.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: a419b4fd9138 ("power: Add support for DA9150 Fuel-Gauge") Signed-off-by: Andrey Vatoropin <a.vatoropin@crpt.ru> Link: https://lore.kernel.org/r/20250130090030.53422-1-a.vatoropin@crpt.ru Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The following commit bc235cdb423a ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
first introduced deadlock prevention for fentry/fexit programs attaching
on bpf_task_storage helpers. That commit also employed the logic in map
free path in its v6 version.
Later bpf_cgrp_storage was first introduced in c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
which faces the same issue as bpf_task_storage, instead of its busy
counter, NULL was passed to bpf_local_storage_map_free() which opened
a window to cause deadlock:
Progs:
- A: SEC("fentry/cgroup_storage_ptr")
- cgid (BPF_MAP_TYPE_HASH)
Record the id of the cgroup the current task belonging
to in this hash map, using the address of the cgroup
as the map key.
- cgrpa (BPF_MAP_TYPE_CGRP_STORAGE)
If current task is a kworker, lookup the above hash
map using function parameter @owner as the key to get
its corresponding cgroup id which is then used to get
a trusted pointer to the cgroup through
bpf_cgroup_from_id(). This trusted pointer can then
be passed to bpf_cgrp_storage_get() to finally trigger
the deadlock issue.
- B: SEC("tp_btf/sys_enter")
- cgrpb (BPF_MAP_TYPE_CGRP_STORAGE)
The only purpose of this prog is to fill Prog A's
hash map by calling bpf_cgrp_storage_get() for as
many userspace tasks as possible.
Steps to reproduce:
- Run A;
- while (true) { Run B; Destroy B; }
Fix this issue by passing its busy counter to the free procedure so
it can be properly incremented before storage/smap locking.
Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241221061018.37717-1-wuyun.abel@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, only TCP supports strparser, but sockmap doesn't intercept
non-TCP connections to attach strparser. For example, with UDP, although
the read/write handlers are replaced, strparser is not executed due to
the lack of a read_sock operation.
Furthermore, in udp_bpf_recvmsg(), it checks whether the psock has data,
and if not, it falls back to the native UDP read interface, making
UDP + strparser appear to read correctly. According to its commit history,
this behavior is unexpected.
Moreover, since UDP lacks the concept of streams, we intercept it directly.
Fixes: 1fa1fe8ff161 ("bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0") Signed-off-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://patch.msgid.link/20250122100917.49845-4-mrpre@163.com Signed-off-by: Sasha Levin <sashal@kernel.org>
'sk->copied_seq' was updated in the tcp_eat_skb() function when the action
of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the
update logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser()
to ensure the accuracy of the 'fionread' feature.
It works for a single stream_verdict scenario, as it also modified
sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb
to remove updating 'sk->copied_seq'.
However, for programs where both stream_parser and stream_verdict are
active (strparser purpose), tcp_read_sock() was used instead of
tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock).
tcp_read_sock() now still updates 'sk->copied_seq', leading to duplicate
updates.
In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
The issue causes incorrect copied_seq calculations, which prevent
correct data reads from the recv() interface in user-land.
We do not want to add new proto_ops to implement a new version of
tcp_read_sock, as this would introduce code complexity [1].
We could have added noack and copied_seq to desc, and then called
ops->read_sock. However, unfortunately, other modules didn’t fully
initialize desc to zero. So, for now, we are directly calling
tcp_read_sock_noack() in tcp_bpf.c.
We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].
So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.
For all BPF maps we ensure that VM_MAYWRITE is cleared when
memory-mapping BPF map contents as initially read-only VMA. This is
because in some cases BPF verifier relies on the underlying data to not
be modified afterwards by user space, so once something is mapped
read-only, it shouldn't be re-mmap'ed as read-write.
As such, it's not necessary to check VM_MAYWRITE in bpf_map_mmap() and
map->ops->map_mmap() callbacks: VM_WRITE should be consistently set for
read-write mappings, and if VM_WRITE is not set, there is no way for
user space to upgrade read-only mapping to read-write one.
This patch cleans up this VM_WRITE vs VM_MAYWRITE handling within
bpf_map_mmap(), which is an entry point for any BPF map mmap()-ing
logic. We also drop unnecessary sanitization of VM_MAYWRITE in BPF
ringbuf's map_mmap() callback implementation, as it is already performed
by common code in bpf_map_mmap().
Note, though, that in bpf_map_mmap_{open,close}() callbacks we can't
drop VM_MAYWRITE use, because it's possible (and is outside of
subsystem's control) to have initially read-write memory mapping, which
is subsequently dropped to read-only by user space through mprotect().
In such case, from BPF verifier POV it's read-write data throughout the
lifetime of BPF map, and is counted as "active writer".
But its VMAs will start out as VM_WRITE|VM_MAYWRITE, then mprotect() can
change it to just VM_MAYWRITE (and no VM_WRITE), so when its finally
munmap()'ed and bpf_map_mmap_close() is called, vm_flags will be just
VM_MAYWRITE, but we still need to decrement active writer count with
bpf_map_write_active_dec() as it's still considered to be a read-write
mapping by the rest of BPF subsystem.
Similar reasoning applies to bpf_map_mmap_open(), which is called
whenever mmap(), munmap(), and/or mprotect() forces mm subsystem to
split original VMA into multiple discontiguous VMAs.
KMSAN reported a use-after-free issue in eth_skb_pkt_type()[1]. The
cause of the issue was that eth_skb_pkt_type() accessed skb's data
that didn't contain an Ethernet header. This occurs when
bpf_prog_test_run_xdp() passes an invalid value as the user_data
argument to bpf_test_init().
Fix this by returning an error when user_data is less than ETH_HLEN in
bpf_test_init(). Additionally, remove the check for "if (user_size >
size)" as it is unnecessary.
CPU: 1 UID: 0 PID: 17276 Comm: syz.1.16450 Not tainted 6.12.0-05490-g9bb88c659673 #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Fixes: be3d72a2896c ("bpf: move user_size out of bpf_test_init") Reported-by: syzkaller <syzkaller@googlegroups.com> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250121150643.671650-1-syoshida@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The "submit->cmd[i].size" and "submit->cmd[i].offset" variables are u32
values that come from the user via the submit_lookup_cmds() function.
This addition could lead to an integer wrapping bug so use size_add()
to prevent that.
Fixes: 198725337ef1 ("drm/msm: fix cmdstream size check") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Patchwork: https://patchwork.freedesktop.org/patch/624696/ Signed-off-by: Rob Clark <robdclark@chromium.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Error messages resulting from incorrect usage of the kernel uabi should
not spam dmesg by default. But it is useful to enable them to debug
userspace. So demote to DRM_UT_DRIVER.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Patchwork: https://patchwork.freedesktop.org/patch/564189/
Stable-dep-of: 3a47f4b439be ("drm/msm/gem: prevent integer overflow in msm_ioctl_gem_submit()") Signed-off-by: Sasha Levin <sashal@kernel.org>
The driver has a spinlock for protecting the irq_masks field and irq
enable registers. However, the driver misses protecting the irq status
registers which can lead to races.
Take the spinlock when accessing irqstatus too.
Fixes: 32a1795f57ee ("drm/tidss: New driver for TI Keystone platform Display SubSystem") Cc: stable@vger.kernel.org Signed-off-by: Devarsh Thakkar <devarsht@ti.com>
[Tomi: updated the desc] Reviewed-by: Jonathan Cormier <jcormier@criticallink.com> Tested-by: Jonathan Cormier <jcormier@criticallink.com> Reviewed-by: Aradhya Bhatia <aradhya.bhatia@linux.dev> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241021-tidss-irq-fix-v1-6-82ddaec94e4a@ideasonboard.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
- create a pair of netns
- run a basic TCP test over ipcomp6
- delete the pair of netns
The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.
The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.
We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The external PHY will undergo a soft reset twice during the resume process
when it wake up from suspend. The first reset occurs when the axienet
driver calls phylink_of_phy_connect(), and the second occurs when
mdio_bus_phy_resume() invokes phy_init_hw(). The second soft reset of the
external PHY does not reinitialize the internal PHY, which causes issues
with the internal PHY, resulting in the PHY link being down. To prevent
this, setting the mac_managed_pm flag skips the mdio_bus_phy_resume()
function.
Fixes: a129b41fe0a8 ("Revert "net: phy: dp83867: perform soft reset and retain established link"") Signed-off-by: Nick Hu <nick.hu@sifive.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250217055843.19799-1-nick.hu@sifive.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The arp_req_set_public() function is called with the rtnl lock held,
which provides enough synchronization protection. This makes the RCU
variant of dev_getbyhwaddr() unnecessary. Switch to using the simpler
dev_getbyhwaddr() function since we already have the required rtnl
locking.
This change helps maintain consistency in the networking code by using
the appropriate helper function for the existing locking context.
Since we're not holding the RCU read lock in arp_req_set_public()
existing code could trigger false positive locking warnings.
Add dedicated helper for finding devices by hardware address when
holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents
PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not.
Extract common address comparison logic into dev_addr_cmp().
The context about this change could be found in the following
discussion:
Fix how port range keys are handled in __skb_flow_bpf_to_target() by:
- Separating PORTS and PORTS_RANGE key handling
- Using correct key_ports_range structure for range keys
- Properly initializing both key types independently
This ensures port range information is correctly stored in its dedicated
structure rather than incorrectly using the regular ports key structure.
Fixes: 59fb9b62fb6c ("flow_dissector: Fix to use new variables for port ranges in bpf hook") Reported-by: Qiang Zhang <dtzq01@gmail.com> Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/ Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://patch.msgid.link/20250218043210.732959-4-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch fixes a bug in TC flower filter where rules combining a
specific destination port with a source port range weren't working
correctly.
The specific case was when users tried to configure rules like:
tc filter add dev ens38 ingress protocol ip flower ip_proto udp \
dst_port 5000 src_port 2000-3000 action drop
The root cause was in the flow dissector code. While both
FLOW_DISSECTOR_KEY_PORTS and FLOW_DISSECTOR_KEY_PORTS_RANGE flags
were being set correctly in the classifier, the __skb_flow_dissect_ports()
function was only populating one of them: whichever came first in
the enum check. This meant that when the code needed both a specific
port and a port range, one of them would be left as 0, causing the
filter to not match packets as expected.
Fix it by removing the either/or logic and instead checking and
populating both key types independently when they're in use.
Fixes: 8ffb055beae5 ("cls_flower: Fix the behavior using port ranges with hw-offload") Reported-by: Qiang Zhang <dtzq01@gmail.com> Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/ Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250218043210.732959-2-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
As explained in the previous patch, iterating for_each_netdev() and
gn->geneve_list during ->exit_batch_rtnl() could trigger ->dellink()
twice for the same device.
If CONFIG_DEBUG_LIST is enabled, we will see a list_del() corruption
splat in the 2nd call of geneve_dellink().
Let's remove for_each_netdev() in geneve_destroy_tunnels() and delegate
that part to default_device_exit_batch().
Brad Spengler reported the list_del() corruption splat in
gtp_net_exit_batch_rtnl(). [0]
Commit eb28fd76c0a0 ("gtp: Destroy device along with udp socket's netns
dismantle.") added the for_each_netdev() loop in gtp_net_exit_batch_rtnl()
to destroy devices in each netns as done in geneve and ip tunnels.
However, this could trigger ->dellink() twice for the same device during
->exit_batch_rtnl().
Say we have two netns A & B and gtp device B that resides in netns B but
whose UDP socket is in netns A.
1. cleanup_net() processes netns A and then B.
2. gtp_net_exit_batch_rtnl() finds the device B while iterating
netns A's gn->gtp_dev_list and calls ->dellink().
[ device B is not yet unlinked from netns B
as unregister_netdevice_many() has not been called. ]
3. gtp_net_exit_batch_rtnl() finds the device B while iterating
netns B's for_each_netdev() and calls ->dellink().
gtp_dellink() cleans up the device's hash table, unlinks the dev from
gn->gtp_dev_list, and calls unregister_netdevice_queue().
Basically, calling gtp_dellink() multiple times is fine unless
CONFIG_DEBUG_LIST is enabled.
Let's remove for_each_netdev() in gtp_net_exit_batch_rtnl() and
delegate the destruction to default_device_exit_batch() as done
in bareudp.
Since commit under Fixes we set the window clamp in accordance
to newly measured rcvbuf scaling_ratio. If the scaling_ratio
decreased significantly we may put ourselves in a situation
where windows become smaller than rcvq_space, preventing
tcp_rcv_space_adjust() from increasing rcvbuf.
The significant decrease of scaling_ratio is far more likely
since commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"),
which increased the "default" scaling ratio from ~30% to 50%.
Hitting the bad condition depends a lot on TCP tuning, and
drivers at play. One of Meta's workloads hits it reliably
under following conditions:
- default rcvbuf of 125k
- sender MTU 1500, receiver MTU 5000
- driver settles on scaling_ratio of 78 for the config above.
Initial rcvq_space gets calculated as TCP_INIT_CWND * tp->advmss
(10 * 5k = 50k). Once we find out the true scaling ratio and
MSS we clamp the windows to 38k. Triggering the condition also
depends on the message sequence of this workload. I can't repro
the problem with simple iperf or TCP_RR-style tests.
sockmap expects all vsocks to have a transport assigned, which is expressed
in vsock_proto::psock_update_sk_prot(). However, there is an edge case
where an unconnected (connectible) socket may lose its previously assigned
transport. This is handled with a NULL check in the vsock/BPF recv path.
Another design detail is that listening vsocks are not supposed to have any
transport assigned at all. Which implies they are not supported by the
sockmap. But this is complicated by the fact that a socket, before
switching to TCP_LISTEN, may have had some transport assigned during a
failed connect() attempt. Hence, we may end up with a listening vsock in a
sockmap, which blows up quickly:
For connectible sockets, instead of relying solely on the state of
vsk->transport, tell sockmap to only allow those representing established
connections. This aligns with the behaviour for AF_INET and AF_UNIX.
Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <mhal@rbox.co> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Previously, after successfully flushing the xmit buffer to VIOS,
the tx_bytes stat was incremented by the length of the skb.
It is invalid to access the skb memory after sending the buffer to
the VIOS because, at any point after sending, the VIOS can trigger
an interrupt to free this memory. A race between reading skb->len
and freeing the skb is possible (especially during LPM) and will
result in use-after-free:
==================================================================
BUG: KASAN: slab-use-after-free in ibmvnic_xmit+0x75c/0x1808 [ibmvnic]
Read of size 4 at addr c00000024eb48a70 by task hxecom/14495
<...>
Call Trace:
[c000000118f66cf0] [c0000000018cba6c] dump_stack_lvl+0x84/0xe8 (unreliable)
[c000000118f66d20] [c0000000006f0080] print_report+0x1a8/0x7f0
[c000000118f66df0] [c0000000006f08f0] kasan_report+0x128/0x1f8
[c000000118f66f00] [c0000000006f2868] __asan_load4+0xac/0xe0
[c000000118f66f20] [c0080000046eac84] ibmvnic_xmit+0x75c/0x1808 [ibmvnic]
[c000000118f67340] [c0000000014be168] dev_hard_start_xmit+0x150/0x358
<...>
Freed by task 0:
kasan_save_stack+0x34/0x68
kasan_save_track+0x2c/0x50
kasan_save_free_info+0x64/0x108
__kasan_mempool_poison_object+0x148/0x2d4
napi_skb_cache_put+0x5c/0x194
net_tx_action+0x154/0x5b8
handle_softirqs+0x20c/0x60c
do_softirq_own_stack+0x6c/0x88
<...>
The buggy address belongs to the object at c00000024eb48a00 which
belongs to the cache skbuff_head_cache of size 224
==================================================================
Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250214155233.235559-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Allow tracking of packets sent with send_subcrq direct vs
indirect. `ethtool -S <dev>` will now provide a counter
of the number of uses of each xmit method. This metric will
be useful in performance debugging.
Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241001163531.1803152-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: bdf5d13aa05e ("ibmvnic: Don't reference skb after sending to VIOS") Signed-off-by: Sasha Levin <sashal@kernel.org>
Firmware supports two hcalls to send a sub-crq request:
H_SEND_SUB_CRQ_INDIRECT and H_SEND_SUB_CRQ. The indirect hcall allows
for submission of batched messages while the other hcall is limited to
only one message. This protocol is defined in PAPR section 17.2.3.3.
Previously, the ibmvnic xmit function only used the indirect hcall. This
allowed the driver to batch it's skbs. A single skb can occupy a few
entries per hcall depending on if FW requires skb header information or
not. The FW only needs header information if the packet is segmented.
By this logic, if an skb is not GSO then it can fit in one sub-crq
message and therefore is a candidate for H_SEND_SUB_CRQ.
Batching skb transmission is only useful when there are more packets
coming down the line (ie netdev_xmit_more is true).
As it turns out, H_SEND_SUB_CRQ induces less latency than
H_SEND_SUB_CRQ_INDIRECT. Therefore, use H_SEND_SUB_CRQ where
appropriate.
Small latency gains seen when doing TCP_RR_150 (request/response
workload). Ftrace results (graph-time=1):
Previous:
ibmvnic_xmit = 29618270.83 us / 8860058.0 hits = AVG 3.34
ibmvnic_tx_scrq_flush = 21972231.02 us / 6553972.0 hits = AVG 3.35
Now:
ibmvnic_xmit = 22153350.96 us / 8438942.0 hits = AVG 2.63
ibmvnic_tx_scrq_flush = 15858922.4 us / 6244076.0 hits = AVG 2.54
In ibmvnic_xmit() if ibmvnic_tx_scrq_flush() returns H_CLOSED then
it will inform upper level networking functions to disable tx
queues. H_CLOSED signals that the connection with the vnic server is
down and a transport event is expected to recover the device.
Previously, ibmvnic_tx_scrq_flush() was hard-coded to return success.
Therefore, the queues would remain active until ibmvnic_cleanup() is
called within do_reset().
The problem is that do_reset() depends on the RTNL lock. If several
ibmvnic devices are resetting then there can be a long wait time until
the last device can grab the lock. During this time the tx/rx queues
still appear active to upper level functions.
FYI, we do make a call to netif_carrier_off() outside the RTNL lock but
its calls to dev_deactivate() are also dependent on the RTNL lock.
As a result, large amounts of retransmissions were observed in a short
period of time, eventually leading to ETIMEOUT. This was specifically
seen with HNV devices, likely because of even more RTNL dependencies.
Therefore, ensure the return code of ibmvnic_tx_scrq_flush() is
propagated to the xmit function to allow for an earlier (and lock-less)
response to a transport event.
According to device_release() in /drivers/base/core.c,
a device without a release function is a broken device
and must be fixed.
The current code directly frees the device after calling device_add()
without waiting for other kernel parts to release their references.
Thus, a reference could still be held to a struct device,
e.g., by sysfs, leading to potential use-after-free
issues if a proper release function is not set.
Fixes: 8c81ba20349d ("net/smc: De-tangle ism and smc device initialization") Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: Julian Ruess <julianr@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250214120137.563409-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When a destination client is a user client in the legacy MIDI mode and
it sets the no-UMP-conversion flag, currently the all UMP events are
still passed as-is. But this may confuse the user-space, because the
event packet size is different from the legacy mode.
Since we cannot handle UMP events in user clients unless it's running
in the UMP client mode, we should filter out those events instead of
accepting blindly. This patch addresses it by slightly adjusting the
conditions for UMP event handling at the event delivery time.
tcf_exts_miss_cookie_base_alloc() calls xa_alloc_cyclic() which can
return 1 if the allocation succeeded after wrapping. This was treated as
an error, with value 1 returned to caller tcf_exts_init_ex() which sets
exts->actions to NULL and returns 1 to caller fl_change().
fl_change() treats err == 1 as success, calling tcf_exts_validate_ex()
which calls tcf_action_init() with exts->actions as argument, where it
is dereferenced.
Fixes: 80cd22c35c90 ("net/sched: cls_api: Support hardware miss to tc action") Signed-off-by: Pierre Riteau <pierre@stackhpc.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250213223610.320278-1-pierre@stackhpc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch corrects the full-scale volume setting logic. On certain
platforms, the full-scale volume bit is required. The current logic
mistakenly sets this bit and incorrectly clears reserved bit 0, causing
the headphone output to be muted.
Fixes: 342b6b610ae2 ("ALSA: hda/cs8409: Fix Full Scale Volume setting for all variants") Signed-off-by: Vitaly Rodionov <vitalyr@opensource.cirrus.com> Link: https://patch.msgid.link/20250214210736.30814-1-vitalyr@opensource.cirrus.com Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
syzkaller reported a use-after-free in geneve_find_dev() [0]
without repro.
geneve_configure() links struct geneve_dev.next to
net_generic(net, geneve_net_id)->geneve_list.
The net here could differ from dev_net(dev) if IFLA_NET_NS_PID,
IFLA_NET_NS_FD, or IFLA_TARGET_NETNSID is set.
When dev_net(dev) is dismantled, geneve_exit_batch_rtnl() finally
calls unregister_netdevice_queue() for each dev in the netns,
and later the dev is freed.
However, its geneve_dev.next is still linked to the backend UDP
socket netns.
Then, use-after-free will occur when another geneve dev is created
in the netns.
Let's call geneve_dellink() instead in geneve_destroy_tunnels().
[0]:
BUG: KASAN: slab-use-after-free in geneve_find_dev drivers/net/geneve.c:1295 [inline]
BUG: KASAN: slab-use-after-free in geneve_configure+0x234/0x858 drivers/net/geneve.c:1343
Read of size 2 at addr ffff000054d6ee24 by task syz.1.4029/13441
The buggy address belongs to the object at ffff000054d6e000
which belongs to the cache kmalloc-cg-4k of size 4096
The buggy address is located 3620 bytes inside of
freed 4096-byte region [ffff000054d6e000, ffff000054d6f000)
f8 corresponds to KASAN_VMALLOC_INVALID which means the area is not
initialised hence not supposed to be used yet.
Powerpc text patching infrastructure allocates a virtual memory area
using get_vm_area() and flags it as VM_ALLOC. But that flag is meant
to be used for vmalloc() and vmalloc() allocated memory is not
supposed to be used before a call to __vmalloc_node_range() which is
never called for that area.
That went undetected until commit e4137f08816b ("mm, kasan, kmsan:
instrument copy_from/to_kernel_nofault")
The area allocated by text_area_cpu_up() is not vmalloc memory, it is
mapped directly on demand when needed by map_kernel_page(). There is
no VM flag corresponding to such usage, so just pass no flag. That way
the area will be unpoisonned and usable immediately.
Rewrite __real_pte() and __rpte_to_hidx() as static inline in order to
avoid following warnings/errors when building with 4k page size:
CC arch/powerpc/mm/book3s64/hash_tlb.o
arch/powerpc/mm/book3s64/hash_tlb.c: In function 'hpte_need_flush':
arch/powerpc/mm/book3s64/hash_tlb.c:49:16: error: variable 'offset' set but not used [-Werror=unused-but-set-variable]
49 | int i, offset;
| ^~~~~~
CC arch/powerpc/mm/book3s64/hash_native.o
arch/powerpc/mm/book3s64/hash_native.c: In function 'native_flush_hash_range':
arch/powerpc/mm/book3s64/hash_native.c:782:29: error: variable 'index' set but not used [-Werror=unused-but-set-variable]
782 | unsigned long hash, index, hidx, shift, slot;
| ^~~~~
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501081741.AYFwybsq-lkp@intel.com/ Fixes: ff31e105464d ("powerpc/mm/hash64: Store the slot information at the right offset for hugetlb") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/e0d340a5b7bd478ecbf245d826e6ab2778b74e06.1736706263.git.christophe.leroy@csgroup.eu Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 2f45a4e289779 ("ASoC: rockchip: i2s_tdm: Fixup config for
SND_SOC_DAIFMT_DSP_A/B") applied a partial change to fix the
configuration for DSP A and DSP B formats.
The shift control also needs updating to set the correct offset for
frame data compared to LRCK. Set the correct values.
Fixes: 081068fd64140 ("ASoC: rockchip: add support for i2s-tdm controller") Signed-off-by: John Keeping <jkeeping@inmusicbrands.com> Link: https://patch.msgid.link/20250204161311.2117240-1-jkeeping@inmusicbrands.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
device_del() can lead to new work being scheduled in gadget->work
workqueue. This is observed, for example, with the dwc3 driver with the
following call stack:
device_del()
gadget_unbind_driver()
usb_gadget_disconnect_locked()
dwc3_gadget_pullup()
dwc3_gadget_soft_disconnect()
usb_gadget_set_state()
schedule_work(&gadget->work)
Move flush_work() after device_del() to ensure the workqueue is cleaned
up.
Fixes: 5702f75375aa9 ("usb: gadget: udc-core: move sysfs_notify() to a workqueue") Cc: stable <stable@kernel.org> Signed-off-by: Roy Luo <royluo@google.com> Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Reviewed-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com> Link: https://lore.kernel.org/r/20250204233642.666991-1-royluo@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
udc device and gadget device are tightly coupled, yet there's no good
way to corelate the two. Add a sysfs link in udc that points to the
corresponding gadget device.
An example use case: userspace configures a f_midi configfs driver and
bind the udc device, then it tries to locate the corresponding midi
device, which is a child device of the gadget device. The gadget device
that's associated to the udc device has to be identified in order to
index the midi device. Having a sysfs link would make things much
easier.
Signed-off-by: Roy Luo <royluo@google.com> Link: https://lore.kernel.org/r/20240307030922.3573161-1-royluo@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Stable-dep-of: 399a45e5237c ("usb: gadget: core: flush gadget workqueue after device removal") Signed-off-by: Sasha Levin <sashal@kernel.org>
This means the MAC addresses are stored in reverse byte order. We have
to swap the bytes before passing them to the upper layers. The storage
format is consistent to the one used on i.MX6 using imx-ocotp driver
which does the same byte swapping as introduced here.
With this patch the MAC address on my i.MX93 TQ board correctly reads as
00:d0:93:6b:27:b8 instead of b8:27:6b:93:d0:00.
This hook is meant to be used by any provider and instantiating a layout
just for this is useless. Let's instead move this hook to the nvmem
device and add it to the config structure to be easily shared by the
providers.
While at moving this hook, rename it ->fixup_dt_cell_info() to clarify
its main intended purpose.
The layout entry is not used and will anyway be made useless by the new
layout bus infrastructure coming next, so drop it. While at it, clarify
the kdoc entry.
Before adding all the NVMEM layout bus infrastructure to the core, let's
move the main nvmem_device structure in an internal header, only
available to the core. This way all the additional code can be added in
a dedicated file in order to keep the current core file tidy.
When an async control is written, we copy a pointer to the file handle
that started the operation. That pointer will be used when the device is
done. Which could be anytime in the future.
If the user closes that file descriptor, its structure will be freed,
and there will be one dangling pointer per pending async control, that
the driver will try to use.
Clean all the dangling pointers during release().
To avoid adding a performance penalty in the most common case (no async
operation), a counter has been introduced with some logic to make sure
that it is properly handled.
Cc: stable@vger.kernel.org Fixes: e5225c820c05 ("media: uvcvideo: Send a control event when a Control Change interrupt arrives") Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Link: https://lore.kernel.org/r/20241203-uvc-fix-async-v6-3-26c867231118@chromium.org Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Now we keep a reference to the active fh for any call to uvc_ctrl_set,
regardless if it is an actual set or if it is a just a try or if the
device refused the operation.
We should only keep the file handle if the device actually accepted
applying the operation.
Cc: stable@vger.kernel.org Fixes: e5225c820c05 ("media: uvcvideo: Send a control event when a Control Change interrupt arrives") Suggested-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Link: https://lore.kernel.org/r/20241203-uvc-fix-async-v6-1-26c867231118@chromium.org Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Avoid using the iterators after the list_for_each() constructs.
This patch should be a NOP, but makes cocci, happier:
drivers/media/usb/uvc/uvc_ctrl.c:1861:44-50: ERROR: invalid reference to the index variable of the iterator on line 1850
drivers/media/usb/uvc/uvc_ctrl.c:2195:17-23: ERROR: invalid reference to the index variable of the iterator on line 2179
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Stable-dep-of: d9fecd096f67 ("media: uvcvideo: Only save async fh if success") Signed-off-by: Sasha Levin <sashal@kernel.org>
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new() which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20230925095532.1984344-15-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Stable-dep-of: c9c0036c1990 ("soc: mediatek: mtk-devapc: Fix leaking IO map on driver remove") Signed-off-by: Sasha Levin <sashal@kernel.org>
The address space in ADSP PAS (Peripheral Authentication Service)
remoteproc node should point to the QDSP PUB address space
(QDSP6...SS_PUB): 0x0680_0000 with length of 0x10000.
0x3000_0000, value used so far, is the main region of CDSP. Downstream
DTS uses 0x0300_0000, which is oddly similar to 0x3000_0000, yet quite
different and points to unused area.
Correct the base address and length, which also moves the node to
different place to keep things sorted by unit address. The diff looks
big, but only the unit address and "reg" property were changed. This
should have no functional impact on Linux users, because PAS loader does
not use this address space at all.
Add dma-coherent property to fastRPC context bank nodes to pass dma
sequence test in fastrpc sanity test, ensure that data integrity is
maintained during DMA operations.
The address space in ADSP PAS (Peripheral Authentication Service)
remoteproc node should point to the QDSP PUB address space
(QDSP6...SS_PUB): 0x0300_0000 with length of 0x10000, which also matches
downstream DTS. 0x3000_0000, value used so far, was in datasheet is the
region of CDSP.
Correct the base address and length, which also moves the node to
different place to keep things sorted by unit address. The diff looks
big, but only the unit address and "reg" property were changed. This
should have no functional impact on Linux users, because PAS loader does
not use this address space at all.
Fail I/Os instead of retry to prevent user space processes from being
blocked on the I/O completion for several minutes.
Retrying I/Os during "depopulation in progress" or "depopulation restore in
progress" results in a continuous retry loop until the depopulation
completes or until the I/O retry loop is aborted due to a timeout by the
scsi_cmd_runtime_exceeced().
Depopulation is slow and can take 24+ hours to complete on 20+ TB HDDs.
Most I/Os in the depopulation retry loop end up taking several minutes
before returning the failure to user space.
Cc: stable@vger.kernel.org # 4.18.x: 2bbeb8d scsi: core: Handle depopulation and restoration in progress Cc: stable@vger.kernel.org # 4.18.x Fixes: e37c7d9a0341 ("scsi: core: sanitize++ in progress") Signed-off-by: Igor Pylypiv <ipylypiv@google.com> Link: https://lore.kernel.org/r/20250131184408.859579-1-ipylypiv@google.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The default handling of the NOT READY sense key is to wait for the device
to become ready. The "wait" is assumed to be relatively short. However
there is a sub-class of NOT READY that have the "... in progress" phrase in
their additional sense code and these can take much longer. Following on
from commit 505aa4b6a883 ("scsi: sd: Defer spinning up drive while SANITIZE
is in progress") we now have element depopulation and restoration that can
take a long time. For example, over 24 hours for a 20 TB, 7200 rpm hard
disk to depopulate 1 of its 20 elements.
Add handling of ASC/ASCQ: 0x4,0x24 (depopulation in progress)
and ASC/ASCQ: 0x4,0x25 (depopulation restoration in progress)
to sd.c . The scsi_lib.c has incomplete handling of these
two messages, so complete it.
Commit 2e4955167ec5 ("firmware: qcom: scm: Fix __scm and waitq
completion variable initialization") introduced a write barrier in probe
function to store global '__scm' variable. It also claimed that it
added a read barrier, because as we all known barriers are paired (see
memory-barriers.txt: "Note that write barriers should normally be paired
with read or address-dependency barriers"), however it did not really
add it.
The offending commit used READ_ONCE() to access '__scm' global which is
not a barrier.
The barrier is needed so the store to '__scm' will be properly visible.
This is most likely not fatal in current driver design, because missing
read barrier would mean qcom_scm_is_available() callers will access old
value, NULL. Driver does not support unbinding and does not correctly
handle probe failures, thus there is no risk of stale or old pointer in
'__scm' variable.
However for code correctness, readability and to be sure that we did not
mess up something in this tricky topic of SMP barriers, add a read
barrier for accessing '__scm'. Change also comment from useless/obvious
what does barrier do, to what is expected: which other parts of the code
are involved here.
My static checker rule complains about this code. The concern is that
if "sample_space" is negative then the "sample_space >= runtime->channels"
condition will not work as intended because it will be type promoted to a
high unsigned int value.
strm->fifo_sample_size is SSI_FIFO_DEPTH (32). The SSIFSR_TDC_MASK is
0x3f. Without any further context it does seem like a reasonable warning
and it can't hurt to add a check for negatives.
Cc: stable@vger.kernel.org Fixes: 03e786bd4341 ("ASoC: sh: Add RZ/G2L SSIF-2 driver") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://patch.msgid.link/e07c3dc5-d885-4b04-a742-71f42243f4fd@stanley.mountain Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When enabling a pass-through port an interrupt might come before psmouse
driver binds to the pass-through port. However synaptics sub-driver
tries to access psmouse instance presumably associated with the
pass-through port to figure out if only 1 byte of response or entire
protocol packet needs to be forwarded to the pass-through port and may
crash if psmouse instance has not been attached to the port yet.
Fix the crash by introducing open() and close() methods for the port and
check if the port is open before trying to access psmouse instance.
Because psmouse calls serio_open() only after attaching psmouse instance
to serio port instance this prevents the potential crash.
serio_pause_rx() and serio_continue_rx() are usually used together to
temporarily stop receiving interrupts/data for a given serio port.
Define "serio_pause_rx" guard for this so that the port is always
resumed once critical section is over.
Different connectivity boards may be attached to the same platform. For
example, QCA6698-based boards can support either a two-antenna or
three-antenna solution, both of which work on the sa8775p-ride platform.
Due to differences in connectivity boards and variations in RF
performance from different foundries, different NVM configurations are
used based on the board ID.
Therefore, in the firmware-name property, if the NVM file has an
extension, the NVM file will be used. Otherwise, the system will first
try the .bNN (board ID) file, and if that fails, it will fall back to
the .bin file.
Possible configurations:
firmware-name = "QCA6698/hpnv21";
firmware-name = "QCA6698/hpnv21.bin";
Signed-off-by: Cheng Jiang <quic_chejiang@quicinc.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Stable-dep-of: a2fad248947d ("Bluetooth: qca: Fix poor RF performance for WCN6855") Signed-off-by: Sasha Levin <sashal@kernel.org>