Current timeout handler of mad agent acquires/releases mad_agent_priv
lock for every timed out WRs. This causes heavy locking contention
when higher no. of WRs are to be handled inside timeout handler.
This leads to softlockup with below trace in some use cases where
rdma-cm path is used to establish connection between peer nodes
Simplified timeout handler by creating local list of timed out WRs
and invoke send handler post creating the list. The new method acquires/
releases lock once to fetch the list and hence helps to reduce locking
contetiong when processing higher no. of WRs
Some distros have grub2 config files with the lines
if [ x"${feature_menuentry_id}" = xy ]; then
menuentry_id_option="--id"
else
menuentry_id_option=""
fi
which match the skip regex defined for grub2 in get_grub_index():
$skip = '^\s*menuentry';
These false positives cause the grub number to be higher than it
should be, and the wrong kernel can end up booting.
Grub documents the menuentry command with whitespace between it and the
title, so make the skip regex reflect this.
Link: https://lore.kernel.org/20240904175530.84175-1-daniel.m.jordan@oracle.com Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: John 'Warthog9' Hawley (Tenstorrent) <warthog9@eaglescrag.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When the filesystem is mounted with errors=remount-ro, we were setting
SB_RDONLY flag to stop all filesystem modifications. We knew this misses
proper locking (sb->s_umount) and does not go through proper filesystem
remount procedure but it has been the way this worked since early ext2
days and it was good enough for catastrophic situation damage
mitigation. Recently, syzbot has found a way (see link) to trigger
warnings in filesystem freezing because the code got confused by
SB_RDONLY changing under its hands. Since these days we set
EXT4_FLAGS_SHUTDOWN on the superblock which is enough to stop all
filesystem modifications, modifying SB_RDONLY shouldn't be needed. So
stop doing that.
Note that insn at 0x211 is 6-byte cond jump insn now since its offset
becomes 0x80 based on previous round (0x293 - 0x213 = 0x80). At the same
time, insn at 0x292 is a 2-byte insn since its offset is -124.
pass6 will repeat the same code as in pass4. pass7 will repeat the same
code as in pass5, and so on. This will prevent eventual convergence.
Passes 1-14 are with padding = 0. At pass15, padding is 1 and related
insn looks like:
The similar code in pass14:
211: 74 7d je 0x290
213: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
...
249: 48 85 d2 test rdx,rdx
24c: 74 21 je 0x26f
24e: 48 01 f7 add rdi,rsi
...
Before generating the following insn,
250: 74 21 je 0x273
"padding = 1" enables some checking to ensure nops is either 0 or 4
where
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
nops = INSN_SZ_DIFF - 2
In this specific case,
addrs[i] = 0x24e // from pass14
addrs[i-1] = 0x24d // from pass15
prog - temp = 3 // from 'test rdx,rdx' in pass15
so
nops = -4
and this triggers the failure.
To fix the issue, we need to break cycles of je <-> jmp. For example,
in the above case, we have
211: 74 7d je 0x290
the offset is 0x7d. If 2-byte je insn is generated only if
the offset is less than 0x7d (<= 0x7c), the cycle can be
break and we can achieve the convergence.
I did some study on other cases like je <-> je, jmp <-> je and
jmp <-> jmp which may cause cycles. Those cases are not from actual
reproducible cases since it is pretty hard to construct a test case
for them. the results show that the offset <= 0x7b (0x7b = 123) should
be enough to cover all cases. This patch added a new helper to generate 8-bit
cond/uncond jmp insns only if the offset range is [-128, 123].
Adding/removing large amount of pages at once to/from the CMM balloon
can result in rcu_sched stalls or workqueue lockups, because of busy
looping w/o cond_resched().
Prevent this by adding a cond_resched(). cmm_free_pages() holds a
spin_lock while looping, so it cannot be added directly to the existing
loop. Instead, introduce a wrapper function that operates on maximum 256
pages at once, and add it there.
Disable compile time optimizations of test_facility() for the
decompressor. The decompressor should not contain any optimized code
depending on the architecture level set the kernel image is compiled
for to avoid unexpected operation exceptions.
Add a __DECOMPRESSOR check to test_facility() to enforce that
facilities are always checked during runtime for the decompressor.
Percpu map is often used, but the map value size limit often ignored,
like issue: https://github.com/iovisor/bcc/issues/2519. Actually,
percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we
can check the value size whether it exceeds PCPU_MIN_UNIT_SIZE first,
like percpu map of local_storage. Maybe the error message seems clearer
compared with "cannot allocate memory".
Signed-off-by: Jinke Han <jinkehan@didiglobal.com> Signed-off-by: Tao Chen <chen.dylane@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Calling irq_domain_remove() will lead to freeing the IRQ domain
prematurely. The domain is still referenced and will be attempted to get
used via rmi_free_function_list() -> rmi_unregister_function() ->
irq_dispose_mapping() -> irq_get_irq_data()'s ->domain pointer.
With PaX's MEMORY_SANITIZE this will lead to an access fault when
attempting to dereference embedded pointers, as in Torsten's report that
was faulting on the 'domain->ops->unmap' test.
Fix this by releasing the IRQ domain only after all related IRQs have
been deactivated.
This fixes the following issue discovered by code review:
after vqs have been created, a buggy device can send an interrupt.
A control vq callback will then try to schedule control_work which has
not been initialized yet. Similarly for config interrupt. Further, in
and out vq callbacks invoke find_port_by_vq which attempts to take
ports_lock which also has not been initialized.
To fix, init all locks and work before creating vqs.
Message-ID: <ad982e975a6160ad110c623c016041311ca15b4f.1726511547.git.mst@redhat.com> Fixes: 17634ba25544 ("virtio: console: Add a new MULTIPORT feature, support for generic ports") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
setup_loopback.sh and net_helper.sh are meant to be sourced from other
scripts, not executed directly. Therefore, remove the executable bits from
those files' permissions.
This change is similar to commit 49078c1b80b6 ("selftests: forwarding:
Remove executable bits from lib.sh")
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test") Fixes: 3bdd9fd29cb0 ("selftests/net: synchronize udpgro tests' tx and rx connection") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Link: https://lore.kernel.org/r/20240131140848.360618-4-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The sockets used by udpgso_bench_tx aren't always ready when
udpgso_bench_tx transmits packets. This issue is more prevalent in -rt
kernels, but can occur in both. Replace the hacky sleep calls with a
function that checks whether the ports in the namespace are ready for
use.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Lucas Karpinski <lkarpins@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 9d851dd4dab6 ("selftests: net: Remove executable bits from library scripts") Signed-off-by: Sasha Levin <sashal@kernel.org>
In some conditions, background processes in udpgro don't have enough
time to set up the sockets. When foreground processes start, this
results in the test failing with "./udpgso_bench_tx: sendmsg: Connection
refused". For instance, this happens from time to time on a Qualcomm
SA8540P SoC running CentOS Stream 9.
To fix this, increase the time given to background processes to
complete the startup before foreground processes start.
Signed-off-by: Adrien Thierry <athierry@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 9d851dd4dab6 ("selftests: net: Remove executable bits from library scripts") Signed-off-by: Sasha Levin <sashal@kernel.org>
The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index
into the other arrays. The map_cmdline_to_pid[] is a mapping back to the
full pid as it can be larger than PID_MAX_DEFAULT. And the
saved_cmdlines[] just holds the COMMs associated to the pids.
Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated
together (in reality the saved_cmdlines is just in the memory of the
rounding of the allocation of the structure as it is always allocated in
powers of two). The map_cmdline_to_pid[] array is allocated separately.
Since the rounding to a power of two is rather large (it allows for 8000
elements in saved_cmdlines), also include the map_cmdline_to_pid[] array.
(This drops it to 6000 by default, which is still plenty for most use
cases). This saves even more memory as the map_cmdline_to_pid[] array
doesn't need to be allocated.
Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts 60be76eeabb3d ("tracing: Add size check when printing
trace_marker output"). The only reason the precision check was added
was because of a bug that miscalculated the write size of the string into
the ring buffer and it truncated it removing the terminating nul byte. On
reading the trace it crashed the kernel. But this was due to the bug in
the code that happened during development and should never happen in
practice. If anything, the precision can hide bugs where the string in the
ring buffer isn't nul terminated and it will not be checked.
The fields in the hist_entry are filled on-demand which means they only
have meaningful values when relevant sort keys are used.
So if neither of 'dso' nor 'sym' sort keys are used, the map/symbols in
the hist entry can be garbage. So it shouldn't access it
unconditionally.
I got a segfault, when I wanted to see cgroup profiles.
$ sudo perf record -a --all-cgroups --synth=cgroup true
$ sudo perf report -s cgroup
Program received signal SIGSEGV, Segmentation fault.
0x00005555557a8d90 in map__dso (map=0x0) at util/map.h:48
48 return RC_CHK_ACCESS(map)->dso;
(gdb) bt
#0 0x00005555557a8d90 in map__dso (map=0x0) at util/map.h:48
#1 0x00005555557aa39b in map__load (map=0x0) at util/map.c:344
#2 0x00005555557aa592 in map__find_symbol (map=0x0, addr=140736115941088) at util/map.c:385
#3 0x00005555557ef000 in hists__findnew_entry (hists=0x555556039d60, entry=0x7fffffffa4c0, al=0x7fffffffa8c0, sample_self=true)
at util/hist.c:644
#4 0x00005555557ef61c in __hists__add_entry (hists=0x555556039d60, al=0x7fffffffa8c0, sym_parent=0x0, bi=0x0, mi=0x0, ki=0x0,
block_info=0x0, sample=0x7fffffffaa90, sample_self=true, ops=0x0) at util/hist.c:761
#5 0x00005555557ef71f in hists__add_entry (hists=0x555556039d60, al=0x7fffffffa8c0, sym_parent=0x0, bi=0x0, mi=0x0, ki=0x0,
sample=0x7fffffffaa90, sample_self=true) at util/hist.c:779
#6 0x00005555557f00fb in iter_add_single_normal_entry (iter=0x7fffffffa900, al=0x7fffffffa8c0) at util/hist.c:1015
#7 0x00005555557f09a7 in hist_entry_iter__add (iter=0x7fffffffa900, al=0x7fffffffa8c0, max_stack_depth=127, arg=0x7fffffffbce0)
at util/hist.c:1260
#8 0x00005555555ba7ce in process_sample_event (tool=0x7fffffffbce0, event=0x7ffff7c14128, sample=0x7fffffffaa90, evsel=0x555556039ad0,
machine=0x5555560388e8) at builtin-report.c:334
#9 0x00005555557b30c8 in evlist__deliver_sample (evlist=0x555556039010, tool=0x7fffffffbce0, event=0x7ffff7c14128,
sample=0x7fffffffaa90, evsel=0x555556039ad0, machine=0x5555560388e8) at util/session.c:1232
#10 0x00005555557b32bc in machines__deliver_event (machines=0x5555560388e8, evlist=0x555556039010, event=0x7ffff7c14128,
sample=0x7fffffffaa90, tool=0x7fffffffbce0, file_offset=110888, file_path=0x555556038ff0 "perf.data") at util/session.c:1271
#11 0x00005555557b3848 in perf_session__deliver_event (session=0x5555560386d0, event=0x7ffff7c14128, tool=0x7fffffffbce0,
file_offset=110888, file_path=0x555556038ff0 "perf.data") at util/session.c:1354
#12 0x00005555557affaf in ordered_events__deliver_event (oe=0x555556038e60, event=0x555556135aa0) at util/session.c:132
#13 0x00005555557bb605 in do_flush (oe=0x555556038e60, show_progress=false) at util/ordered-events.c:245
#14 0x00005555557bb95c in __ordered_events__flush (oe=0x555556038e60, how=OE_FLUSH__ROUND, timestamp=0) at util/ordered-events.c:324
#15 0x00005555557bba46 in ordered_events__flush (oe=0x555556038e60, how=OE_FLUSH__ROUND) at util/ordered-events.c:342
#16 0x00005555557b1b3b in perf_event__process_finished_round (tool=0x7fffffffbce0, event=0x7ffff7c15bb8, oe=0x555556038e60)
at util/session.c:780
#17 0x00005555557b3b27 in perf_session__process_user_event (session=0x5555560386d0, event=0x7ffff7c15bb8, file_offset=117688,
file_path=0x555556038ff0 "perf.data") at util/session.c:1406
As you can see the entry->ms.map was NULL even if he->ms.map has a
value. This is because 'sym' sort key is not given, so it cannot assume
whether he->ms.sym and entry->ms.sym is the same. I only checked the
'sym' sort key here as it implies 'dso' behavior (so maps are the same).
Fixes: ac01c8c4246546fd ("perf hist: Update hist symbol when updating maps") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Matt Fleming <matt@readmodwrite.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20240826221045.1202305-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 3f8ca2e115e5 ("vhost/scsi: Extract common handling code
from control queue handler") a null pointer dereference bug can be
triggered when guest sends an SCSI AN request.
In vhost_scsi_ctl_handle_vq(), `vc.target` is assigned with
`&v_req.tmf.lun[1]` within a switch-case block and is then passed to
vhost_scsi_get_req() which extracts `vc->req` and `tpg`. However, for
a `VIRTIO_SCSI_T_AN_*` request, tpg is not required, so `vc.target` is
set to NULL in this branch. Later, in vhost_scsi_get_req(),
`vc->target` is dereferenced without being checked, leading to a null
pointer dereference bug. This bug can be triggered from guest.
When this bug occurs, the vhost_worker process is killed while holding
`vq->mutex` and the corresponding tpg will remain occupied
indefinitely.
The problem is that d_fsdata was not accessed under d_lock, because
d_release() normally is only called once the dentry is otherwise no
longer accessible but since we also call it explicitly in v9fs_remove
that lock is required:
move the hlist out of the dentry under lock then unref its fids once
they are no longer accessible.
If ENOMEM fails when the extent is splitting, we need to restore the length
of the split extent.
In the ext4_split_extent_at function, only in ext4_ext_create_new_leaf will
it alloc memory and change the shape of the extent tree,even if an ENOMEM
is returned at this time, the extent tree is still self-consistent, Just
restore the split extent lens in the function ext4_split_extent_at.
When a battery hook returns an error when adding a new battery, then
the battery hook is automatically unregistered.
However the battery hook provider cannot know that, so it will later
call battery_hook_unregister() on the already unregistered battery
hook, resulting in a crash.
Fix this by using the list head to mark already unregistered battery
hooks as already being unregistered so that they can be ignored by
battery_hook_unregister().
Fixes: fa93854f7a7e ("battery: Add the battery hooking API") Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://patch.msgid.link/20241001212835.341788-3-W_Armin@gmx.de Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Move the conditional locking from __battery_hook_unregister()
into battery_hook_unregister() and rename the low-level function
to simplify the locking during battery hook removal.
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Pali Rohár <pali@kernel.org> Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://patch.msgid.link/20241001212835.341788-2-W_Armin@gmx.de Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Stable-dep-of: 76959aff14a0 ("ACPI: battery: Fix possible crash when unregistering a battery hook") Signed-off-by: Sasha Levin <sashal@kernel.org>
RTL8125 added fields to the tally counter, what may result in the chip
dma'ing these new fields to unallocated memory. Therefore make sure
that the allocated memory area is big enough to hold all of the
tally counter values, even if we use only parts of it.
Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125") Cc: stable@vger.kernel.org Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/741d26a9-2b2b-485d-91d9-ecb302e345b5@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Rectify the logical value of reset-gpio so that it is set to
0 (disabled) during power-on and to 1 (enabled) during power-off.
Set the reset-gpio to GPIO_OUT_HIGH at initialization time to make
sure it starts off in reset. Also drop the "Set XCLR" comment which
is not-so-informative.
The existing usage of imx335 had reset-gpios polarity inverted
(GPIO_ACTIVE_HIGH) in their device-tree sources. With this patch
included, those DTS will not be able to stream imx335 anymore. The
reset-gpio polarity will need to be rectified in the device-tree
sources as shown in [1] example, in order to get imx335 functional
again (as it remains in reset prior to this fix).
The RK3066 VOP sets a dma_stop bit when it's done scanning out a frame
and needs the driver to acknowledge that by clearing the bit.
Unless we clear it "between" frames, the RGB output only shows noise
instead of the picture. atomic_flush is the place for it that least
affects other code (doing it on vblank would require converting all
other usages of the reg_lock to spin_(un)lock_irq, which would affect
performance for everyone).
This seems to be a redundant synchronization mechanism that was removed
in later iterations of the VOP hardware block.
The RK3399 has a 1024-entry gamma LUT with 10 bits per component on its
"big" VOP and a 256-entry, 8 bit per component LUT on the "little" VOP.
Compared to the RK3288, it no longer requires disabling gamma while
updating the LUT. On the RK3399, the LUT can be updated at any time as
the hardware has two LUT buffers, one can be written while the other is
in use. A swap of the buffers is triggered by writing 1 to the
update_gamma_lut register.
The VOP on RK3399 has a different approach from previous versions for
setting a gamma lookup table, using an update_gamma_lut register. As
this differs from RK3288, give RK3399 its own set of "common" register
definitions.
Harden build ID parsing logic, adding explicit READ_ONCE() where it's
important to have a consistent value read and validated just once.
Also, as pointed out by Andi Kleen, we need to make sure that entire ELF
note is within a page bounds, so move the overflow check up and add an
extra note_size boundaries validation.
Fixes tag below points to the code that moved this code into
lib/buildid.c, and then subsequently was used in perf subsystem, making
this code exposed to perf_event_open() users in v5.12+.
xol_add_vma() maps the uninitialized page allocated by __create_xol_area()
into userspace. On some architectures (x86) this memory is readable even
without VM_READ, VM_EXEC results in the same pgprot_t as VM_EXEC|VM_READ,
although this doesn't really matter, debugger can read this memory anyway.
A number of Arm Ltd CPUs suffer from errata whereby an MSR to the SSBS
special-purpose register does not affect subsequent speculative
instructions, permitting speculative store bypassing for a window of
time.
We worked around this for a number of CPUs in commits:
Since then, a (hopefully final) batch of updates have been published,
with two more affected CPUs. For the affected CPUs the existing
mitigation is sufficient, as described in their respective Software
Developer Errata Notice (SDEN) documents:
The dax_iomap_rw() does two things in each iteration: map written blocks
and copy user data to blocks. If the process is killed by user(See signal
handling in dax_iomap_iter()), the copied data will be returned and added
on inode size, which means that the length of written extents may exceed
the inode size, then fsck will fail. An example is given as:
Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
sync file size update and thus if we crash at unfortunate moment, the
file can have smaller size although O_SYNC IO has reported successful
completion. The problem happens because update of on-disk inode size is
handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
dio_complete() in particular) has returned and generic_file_sync() gets
called by dio_complete(). Fix the problem by handling on-disk inode size
update directly in our ->end_io completion handler.
References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> CC: stable@vger.kernel.org Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20231013121350.26872-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: dda898d7ffe8 ("ext4: dax: fix overflowing extents beyond inode size when partially writing") Signed-off-by: Sasha Levin <sashal@kernel.org>
The pm_runtime_disable() is missing in the remove function, fix it
by using devm_pm_runtime_enable(), so the pm_runtime_disable() in
the probe error path can also be removed.
Fixes: 2d13f2ff6073 ("spi: bcm63xx-spi: fix pm_runtime") Cc: stable@vger.kernel.org # v5.13+ Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Jonas Gorski <jonas.gorski@gmail.com> Link: https://patch.msgid.link/20240819123349.4020472-3-ruanjinjie@huawei.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
It is not valid to call pm_runtime_set_suspended() for devices
with runtime PM enabled because it returns -EAGAIN if it is enabled
already and working. So, call pm_runtime_disable() before to fix it.
Krzysztof reported an issue [0] which is caused by parallel attempts to
instantiate the same I2C client device. This can happen if driver
supports auto-detection, but certain devices are also instantiated
explicitly.
The original change isn't actually wrong, it just revealed that I2C core
isn't prepared yet to handle this scenario.
Calls to i2c_new_client_device() can be nested, therefore we can't use a
simple mutex here. Parallel instantiation of devices at different addresses
is ok, so we just have to prevent parallel instantiation at the same address.
We can use a bitmap with one bit per 7-bit I2C client address, and atomic
bit operations to set/check/clear bits.
Now a parallel attempt to instantiate a device at the same address will
result in -EBUSY being returned, avoiding the "sysfs: cannot create duplicate
filename" splash.
Note: This patch version includes small cosmetic changes to the Tested-by
version, only functional change is that address locking is supported
for slave addresses too.
Fixes: caba40ec3531 ("eeprom: at24: Probe for DDR3 thermal sensor in the SPD case") Cc: stable@vger.kernel.org Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Two drivers already implement custom debugfs handling for their
i2c_adapter and more will come. So, let the core create a debugfs
directory per adapter and pass that to drivers for their debugfs files.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>
Stable-dep-of: 8d3cefaf6592 ("i2c: core: Lock address during client device instantiation") Signed-off-by: Sasha Levin <sashal@kernel.org>
Change of_*() functions to device_*() for firmware agnostic usage.
This allows to have the smbus_alert interrupt without any changes
in the controller drivers using the ACPI table.
Signed-off-by: Akhil R <akhilrajeev@nvidia.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>
Stable-dep-of: 8d3cefaf6592 ("i2c: core: Lock address during client device instantiation") Signed-off-by: Sasha Levin <sashal@kernel.org>
Add fwnode_irq_get_byname() to get an interrupt by name from either
ACPI table or Device Tree, whichever is used for enumeration.
In the ACPI case, this allow us to use 'interrupt-names' in
_DSD which can be mapped to Interrupt() resource by index.
The implementation is similar to 'interrupt-names' in the
Device Tree.
Signed-off-by: Akhil R <akhilrajeev@nvidia.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>
Stable-dep-of: 8d3cefaf6592 ("i2c: core: Lock address during client device instantiation") Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch introduces a new helper routine - fwnode_iomap(), which
allows to map the memory mapped IO for a given device node.
This implementation does not cover the ACPI case and may be expanded
in the future. The main purpose here is to be able to develop resource
provider agnostic drivers.
Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Anand Ashok Dumbre <anand.ashok.dumbre@xilinx.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20211203212358.31444-2-anand.ashok.dumbre@xilinx.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Stable-dep-of: 8d3cefaf6592 ("i2c: core: Lock address during client device instantiation") Signed-off-by: Sasha Levin <sashal@kernel.org>
[Why]
Connected with a Thunderbolt monitor and do the suspend and the system
may hang while resume.
The TBT monitor HPD will be triggered during the resume procedure
and call the drm_client_modeset_probe() while
struct drm_connector connector->dev->master is NULL.
It will mess up the pipe topology after resume.
[How]
Skip the TBT monitor HPD during the resume procedure because we
currently will probe the connectors after resume by default.
Reviewed-by: Wayne Lin <wayne.lin@amd.com> Signed-off-by: Tom Chung <chiahsuan.chung@amd.com> Signed-off-by: Fangzhi Zuo <jerry.zuo@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 453f86a26945207a16b8f66aaed5962dc2b95b85) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Without the locking amdgpu currently can race between
amdgpu_ctx_set_entity_priority() (via drm_sched_entity_modify_sched()) and
drm_sched_job_arm(), leading to the latter accesing potentially
inconsitent entity->sched_list and entity->num_sched_list pair.
v2:
* Improve commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Luben Tuikov <ltuikov89@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: dri-devel@lists.freedesktop.org Cc: Philipp Stanner <pstanner@redhat.com> Cc: <stable@vger.kernel.org> # v5.7+ Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240913160559.49054-2-tursulin@igalia.com Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cloning a descriptor table picks the size that would cover all currently
opened files. That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.
Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().
The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.
That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.
It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size(). There we are under ->files_lock,
so the race is trivially avoided.
So we do the following:
* switch close_files() from calling unshare_fd() to calling
dup_fd().
* undo the calling convention change done to unshare_fd() in 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting. NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.
After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:
On a few platforms such as TI's AM69 device, disable_irq() fails to keep
track of the interrupts that happen between disable_irq() and
enable_irq() and those interrupts are missed. Use the ->irq_unmask() and
->irq_mask() methods instead of ->irq_enable() and ->irq_disable() to
correctly keep track of edges when disable_irq is called.
This solves the issue of disable_irq() not working as expected on such
platforms.
During unmount, at close_ctree(), we have the following steps in this order:
1) Park the cleaner kthread - this doesn't destroy the kthread, it basically
halts its execution (wake ups against it work but do nothing);
2) We stop the cleaner kthread - this results in freeing the respective
struct task_struct;
3) We call btrfs_stop_all_workers() which waits for any jobs running in all
the work queues and then free the work queues.
Syzbot reported a case where a fixup worker resulted in a crash when doing
a delayed iput on its inode while attempting to wake up the cleaner at
btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread
was already freed. This can happen during unmount because we don't wait
for any fixup workers still running before we call kthread_stop() against
the cleaner kthread, which stops and free all its resources.
Fix this by waiting for any fixup workers at close_ctree() before we call
kthread_stop() against the cleaner and run pending delayed iputs.
The stack traces reported by syzbot were the following:
BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52
Last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
__kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541
__call_rcu_common kernel/rcu/tree.c:3086 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
context_switch kernel/sched/core.c:5318 [inline]
__schedule+0x184b/0x4ae0 kernel/sched/core.c:6675
schedule_idle+0x56/0x90 kernel/sched/core.c:6793
do_idle+0x56a/0x5d0 kernel/sched/idle.c:354
cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424
start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314
common_startup_64+0x13e/0x147
The buggy address belongs to the object at ffff8880272a8000
which belongs to the cache task_struct of size 7424
The buggy address is located 2584 bytes inside of
freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00)
[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.
Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.
However during the small window between set_reloc_contrl() and
unset_reloc_control(), we can have a subvolume tree update and created a
reloc_root for that subvolume.
Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().
That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer
dereference.
[FIX]
Just add extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.
That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during merge stage, but since there is no reloc_ctl at all,
we do not need to bother that.
Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Like other Asus ExpertBook models the B2502CVA has its keybopard IRQ (1)
described as ActiveLow in the DSDT, which the kernel overrides to EdgeHigh
which breaks the keyboard.
Add the B2502CVA to the irq1_level_low_skip_override[] quirk table to fix
this.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217760 Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://patch.msgid.link/20240927141606.66826-4-hdegoede@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Like other Asus Vivobook models the X1704VAP has its keybopard IRQ (1)
described as ActiveLow in the DSDT, which the kernel overrides to EdgeHigh
which breaks the keyboard.
Add the X1704VAP to the irq1_level_low_skip_override[] quirk table to fix
this.
Reported-by: Lamome Julien <julien.lamome@wanadoo.fr> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078696 Closes: https://lore.kernel.org/all/1226760b-4699-4529-bf57-6423938157a3@wanadoo.fr/ Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://patch.msgid.link/20240927141606.66826-3-hdegoede@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The adp5589 seems to have the same behavior as similar devices as
explained in commit 910a9f5636f5 ("Input: adp5588-keys - get value from
data out when dir is out").
Basically, when the gpio is set as output we need to get the value from
ADP5589_GPO_DATA_OUT_A register instead of ADP5589_GPI_STATUS_A.
Fixes: 9d2e173644bb ("Input: ADP5589 - new driver for I2C Keypad Decoder and I/O Expander") Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://lore.kernel.org/r/20241001-b4-dev-adp5589-fw-conversion-v1-2-fca0149dfc47@analog.com Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We register a devm action to call adp5589_clear_config() and then pass
the i2c client as argument so that we can call i2c_get_clientdata() in
order to get our device object. However, i2c_set_clientdata() is only
being set at the end of the probe function which means that we'll get a
NULL pointer dereference in case the probe function fails early.
Fixes: 30df385e35a4 ("Input: adp5589-keys - use devm_add_action_or_reset() for register clear") Signed-off-by: Nuno Sa <nuno.sa@analog.com> Link: https://lore.kernel.org/r/20241001-b4-dev-adp5589-fw-conversion-v1-1-fca0149dfc47@analog.com Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The commit b8c43360f6e4 ("net: stmmac: No need to calculate speed divider
when offload is disabled") allows the "port_transmit_rate_kbps" to be
set to a value of 0, which is then passed to the "div_s64" function when
tc-cbs is disabled. This leads to a zero-division error.
When tc-cbs is disabled, the idleslope, sendslope, and credit values the
credit values are not required to be configured. Therefore, adding a return
statement after setting the txQ mode to DCB when tc-cbs is disabled would
prevent a zero-division error.
Fixes: b8c43360f6e4 ("net: stmmac: No need to calculate speed divider when offload is disabled") Cc: <stable@vger.kernel.org> Co-developed-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Signed-off-by: Choong Yong Liang <yong.liang.choong@linux.intel.com> Signed-off-by: KhaiWenTan <khai.wen.tan@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240918061422.1589662-1-khai.wen.tan@linux.intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alfred Agrell found that TOMOYO cannot handle execveat(AT_EMPTY_PATH)
inside chroot environment where /dev and /proc are not mounted, for
commit 51f39a1f0cea ("syscalls: implement execveat() system call") missed
that TOMOYO tries to canonicalize argv[0] when the filename fed to the
executed program as argv[0] is supplied using potentially nonexistent
pathname.
Since "/dev/fd/<fd>" already lost symlink information used for obtaining
that <fd>, it is too late to reconstruct symlink's pathname. Although
<filename> part of "/dev/fd/<fd>/<filename>" might not be canonicalized,
TOMOYO cannot use tomoyo_realpath_nofollow() when /dev or /proc is not
mounted. Therefore, fallback to tomoyo_realpath_from_path() when
tomoyo_realpath_nofollow() failed.
Move ST2 reading with overflow handling after measurement data
reading.
ST2 register read have to be read after read measurment data,
because it means end of the reading and realease the lock on the data.
Remove ST2 read skip on interrupt based waiting because ST2 required to
be read out at and of the axis read.
With PWRSTS_OFF_ON, PCIe GDSCs are turned off during gdsc_disable(). This
can happen during scenarios such as system suspend and breaks the resume
of PCIe controllers from suspend.
So use PWRSTS_RET_ON to indicate the GDSC driver to not turn off the GDSCs
during gdsc_disable() and allow the hardware to transition the GDSCs to
retention when the parent domain enters low power state during system
suspend.
in venus_probe, core->work is bound with venus_sys_error_handler, which is
used to handle error. The code use core->sys_err_done to make sync work.
The core->work is started in venus_event_notify.
If we call venus_remove, there might be an unfished work. The possible
sequence is as follows:
The branch clocks of gcc_cpuss_ahb_clk_src are marked critical
and hence these clocks vote on XO blocking the suspend.
De-register these clocks and its source as there is no rate
setting happening on them.
Valid frequencies may result in BCM votes that exceed the max HW value.
Set vote ceiling to BCM_TCS_CMD_VOTE_MASK to ensure the votes aren't
truncated, which can result in lower frequencies than desired.
The cec_msg_set_reply_to() helper function never zeroed the
struct cec_msg flags field, this can cause unexpected behavior
if flags was uninitialized to begin with.
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Fixes: 0dbacebede1e ("[media] cec: move the CEC framework out of staging and to media") Cc: <stable@vger.kernel.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The sun4i_csi driver doesn't implement link validation for the subdev it
registers, leaving the link between the subdev and its source
unvalidated. Fix it, using the v4l2_subdev_link_validate() helper.
There is a clk == NULL check after the switch to check for
unsupported clk types. Since clk is re-assigned in a loop,
this check is useless right now for anything but the first
round. Let's fix this up by assigning clk = NULL in the
loop before the switch statement.
For fixing CVE-2023-6270, f98364e92662 ("aoe: fix the potential
use-after-free problem in aoecmd_cfg_pkts") makes tx() calling dev_put()
instead of doing in aoecmd_cfg_pkts(). It avoids that the tx() runs
into use-after-free.
Then Nicolai Stange found more places in aoe have potential use-after-free
problem with tx(). e.g. revalidate(), aoecmd_ata_rw(), resend(), probe()
and aoecmd_cfg_rsp(). Those functions also use aoenet_xmit() to push
packet to tx queue. So they should also use dev_hold() to increase the
refcnt of skb->dev.
On the other hand, moving dev_put() to tx() causes that the refcnt of
skb->dev be reduced to a negative value, because corresponding
dev_hold() are not called in revalidate(), aoecmd_ata_rw(), resend(),
probe(), and aoecmd_cfg_rsp(). This patch fixed this issue.
According to RFC 8881, all minor versions of NFSv4 support PUTPUBFH.
Replace the XDR decoder for PUTPUBFH with a "noop" since we no
longer want the minorversion check, and PUTPUBFH has no arguments to
decode. (Ideally nfsd4_decode_noop should really be called
nfsd4_decode_void).
PUTPUBFH should now behave just like PUTROOTFH.
Reported-by: Cedric Blancher <cedric.blancher@gmail.com> Fixes: e1a90ebd8b23 ("NFSD: Combine decode operations for v4 and v4.1") Cc: Dan Shelton <dan.f.shelton@gmail.com> Cc: Roland Mainz <roland.mainz@nrubsig.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The pair of bloom filtered used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds. A
new filehandle would be recorded in the "new" bit set. That would then
be switch to the "old" bit set between 0 and 30 seconds later, and it
would remain as the "old" bit set for 30 seconds.
Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set. This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.
This patch updates bd->new before clearing the set with that index,
instead of afterwards.
Reported-by: Olga Kornievskaia <okorniev@redhat.com> Cc: stable@vger.kernel.org Fixes: 6282cd565553 ("NFSD: Don't hand out delegations for 30 seconds after recalling them.") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
AddressSanitizer found a use-after-free bug in the symbol code which
manifested as 'perf top' segfaulting.
==1238389==ERROR: AddressSanitizer: heap-use-after-free on address 0x60b00c48844b at pc 0x5650d8035961 bp 0x7f751aaecc90 sp 0x7f751aaecc80
READ of size 1 at 0x60b00c48844b thread T193
#0 0x5650d8035960 in _sort__sym_cmp util/sort.c:310
#1 0x5650d8043744 in hist_entry__cmp util/hist.c:1286
#2 0x5650d8043951 in hists__findnew_entry util/hist.c:614
#3 0x5650d804568f in __hists__add_entry util/hist.c:754
#4 0x5650d8045bf9 in hists__add_entry util/hist.c:772
#5 0x5650d8045df1 in iter_add_single_normal_entry util/hist.c:997
#6 0x5650d8043326 in hist_entry_iter__add util/hist.c:1242
#7 0x5650d7ceeefe in perf_event__process_sample /home/matt/src/linux/tools/perf/builtin-top.c:845
#8 0x5650d7ceeefe in deliver_event /home/matt/src/linux/tools/perf/builtin-top.c:1208
#9 0x5650d7fdb51b in do_flush util/ordered-events.c:245
#10 0x5650d7fdb51b in __ordered_events__flush util/ordered-events.c:324
#11 0x5650d7ced743 in process_thread /home/matt/src/linux/tools/perf/builtin-top.c:1120
#12 0x7f757ef1f133 in start_thread nptl/pthread_create.c:442
#13 0x7f757ef9f7db in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
When updating hist maps it's also necessary to update the hist symbol
reference because the old one gets freed in map__put().
While this bug was probably introduced with 5c24b67aae72f54c ("perf
tools: Replace map->referenced & maps->removed_maps with map->refcnt"),
the symbol objects were leaked until c087e9480cf33672 ("perf machine:
Fix refcount usage when processing PERF_RECORD_KSYMBOL") was merged so
the bug was masked.
Fixes: c087e9480cf33672 ("perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL") Reported-by: Yunzhao Li <yunzhao@cloudflare.com> Signed-off-by: Matt Fleming (Cloudflare) <matt@readmodwrite.com> Cc: Ian Rogers <irogers@google.com> Cc: kernel-team@cloudflare.com Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Cc: stable@vger.kernel.org # v5.13+ Link: https://lore.kernel.org/r/20240815142212.3834625-1-matt@readmodwrite.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the first directory entry in the root directory is not a bitmap
directory entry, 'bh' will not be released and reassigned, which
will cause a memory leak.
This is used in poison.h for poison pointer offset. Based on current
SV39, SV48 and SV57 vm layout, 0xdead000000000000 is a proper value
that is not mappable, this can avoid potentially turning an oops to
an expolit.
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch moves the call to this function so that an handle
can be used. If a transaction fails to start, then there's not point in
trying to mark the filesystem as ineligible, and an error will eventually be
returned to user-space.
Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240923104909.18342-3-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch fixes the calls to this function in
__track_dentry_update() by adding an extra parameter to the callback used in
ext4_fc_track_template().
Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240923104909.18342-2-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again. However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING. And this is not being done in function
ext4_fc_track_template().
This patch fixes the issue by re-enqueuing an inode into the STAGING queue
during the fast commit clean-up callback when doing a full commit. However,
to prevent a race with a fast-commit, the clean-up callback has to be called
with the journal locked.
This bug was found using fstest generic/047. This test creates several 32k
bytes files, sync'ing each of them after it's creation, and then shutting
down the filesystem. Some data may be loss in this operation; for example a
file may have it's size truncated to zero.
Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240717172220.14201-1-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Function jbd2_journal_shrink_checkpoint_list() assumes that '0' is not a
valid value for transaction IDs, which is incorrect. Don't assume that and
use two extra boolean variables to control the loop iterations and keep
track of the first and last tid.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240724161119.13448-4-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Function ext4_wait_for_tail_page_commit() assumes that '0' is not a valid
value for transaction IDs, which is incorrect. Don't assume that and invoke
jbd2_log_wait_commit() if the journal had a committing transaction instead.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240724161119.13448-2-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In ext4_find_extent(), if the path is not big enough, we free it and set
*orig_path to NULL. But after reallocating and successfully initializing
the path, we don't update *orig_path, in which case the caller gets a
valid path but a NULL ppath, and this may cause a NULL pointer dereference
or a path memory leak. For example:
In ext4_ext_try_to_merge_up(), set path[1].p_bh to NULL after it has been
released, otherwise it may be released twice. An example of what triggers
this is as follows:
split2 map split1
|--------|-------|--------|
ext4_ext_map_blocks
ext4_ext_handle_unwritten_extents
ext4_split_convert_extents
// path->p_depth == 0
ext4_split_extent
// 1. do split1
ext4_split_extent_at
|ext4_ext_insert_extent
| ext4_ext_create_new_leaf
| ext4_ext_grow_indepth
| le16_add_cpu(&neh->eh_depth, 1)
| ext4_find_extent
| // return -ENOMEM
|// get error and try zeroout
|path = ext4_find_extent
| path->p_depth = 1
|ext4_ext_try_to_merge
| ext4_ext_try_to_merge_up
| path->p_depth = 0
| brelse(path[1].p_bh) ---> not set to NULL here
|// zeroout success
// 2. update path
ext4_find_extent
// 3. do split2
ext4_split_extent_at
ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
le16_add_cpu(&neh->eh_depth, 1)
ext4_find_extent
path[0].p_bh = NULL;
path->p_depth = 1
read_extent_tree_block ---> return err
// path[1].p_bh is still the old value
ext4_free_ext_path
ext4_ext_drop_refs
// path->p_depth == 1
brelse(path[1].p_bh) ---> brelse a buffer twice
Finally got the following WARRNING when removing the buffer from lru:
Fixes: ecb94f5fdf4b ("ext4: collapse a single extent tree block into the inode if possible") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-9-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As Ojaswin mentioned in Link, in ext4_ext_insert_extent(), if the path is
reallocated in ext4_ext_create_new_leaf(), we'll use the stale path and
cause UAF. Below is a sample trace with dummy values:
When calling ext4_force_split_extent_at() in ext4_ext_replay_update_ex(),
the 'ppath' is updated but it is the 'path' that is freed, thus potentially
triggering a double-free in the following process:
So drop the unnecessary ppath and use path directly to avoid this problem.
And use ext4_find_extent() directly to update path, avoiding unnecessary
memory allocation and freeing. Also, propagate the error returned by
ext4_find_extent() instead of using strange error codes.
Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-8-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Function __jbd2_log_wait_for_space() assumes that '0' is not a valid value
for transaction IDs, which is incorrect. Don't assume that and invoke
jbd2_log_wait_commit() if the journal had a committing transaction instead.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240724161119.13448-3-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Even though ext4_find_extent() returns an error, ext4_insert_range() still
returns 0. This may confuse the user as to why fallocate returns success,
but the contents of the file are not as expected. So propagate the error
returned by ext4_find_extent() to avoid inconsistencies.
Fixes: 331573febb6a ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-11-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>