Commit d678cbd2f867 ("xsk: Fix handling of invalid descriptors in XSK TX
batching API") fixed batch API usage against set of descriptors with
invalid ones but introduced a problem when AF_XDP SW rings are smaller
than HW ones. Mismatch of reported Tx'ed frames between HW generator and
user space app was observed. It turned out that backpressure mechanism
became a bottleneck when the amount of produced descriptors to CQ is
lower than what we grabbed from XSK Tx ring.
Say that 512 entries had been taken from XSK Tx ring but we had only 490
free entries in CQ. Then callsite (ZC driver) will produce only 490
entries onto HW Tx ring but 512 entries will be released from Tx ring
and this is what will be seen by the user space.
In order to fix this case, mix XSK Tx/CQ ring interractions by moving
around internal functions and changing call order:
* pull out xskq_prod_nb_free() from xskq_prod_reserve_addr_batch()
up to xsk_tx_peek_release_desc_batch();
** move xskq_cons_release_n() into xskq_cons_read_desc_batch()
After doing so, algorithm can be described as follows:
1. lookup Tx entries
2. use value from 1. to reserve space in CQ (*)
3. Read from Tx ring as much descriptors as value from 2
3a. release descriptors from XSK Tx ring (**)
4. Finally produce addresses to CQ
Fixes: d678cbd2f867 ("xsk: Fix handling of invalid descriptors in XSK TX batching API") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830121705.8618-1-maciej.fijalkowski@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The current pseudo_lock.c code overwrites the value of the
MSR_MISC_FEATURE_CONTROL to 0 even if the original value is not 0.
Therefore, modify it to save and restore the original values.
'status' is known to be 0 at this point. The expected error code is
PTR_ERR(clk).
Switch to dev_err_probe() in order to display the expected error code (in a
human readable way).
This also filters -EPROBE_DEFER cases, should it happen.
Commit 23c7f8d7989e ("net: Fix esp GSO on inter address family
tunnels.") is incomplete. It passes to skb_eth_gso_segment the
protocol for the outer IP version, instead of the inner IP version, so
we end up calling inet_gso_segment on an inner IPv6 packet and
ipv6_gso_segment on an inner IPv4 packet and the packets are dropped.
This patch completes the fix by selecting the correct protocol based
on the inner mode's family.
Since audit_proctitle is generated at syscall exit time, its value is
used immediately and cached for the next syscall. Since this is the
case, then only clear it at task exit time. Otherwise, there is no
point in caching the value OR bearing the overhead of regenerating it.
Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Be explicit in checking the struct audit_context "context" member enum
value rather than assuming the order of context enum values.
Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When `data` points to a boolean value, casting it to `int *` is problematic
and could lead to a wrong value being passed to `jsonw_bool`. Change the
cast to `bool *` instead.
From 'IEEE Std 802.11-2020 section 11.8.8.4.1':
The mesh channel switch may be triggered by the need to avoid
interference to a detected radar signal, or to reassign mesh STA
channels to ensure the MBSS connectivity.
A 20/40 MHz MBSS may be changed to a 20 MHz MBSS and a 20 MHz
MBSS may be changed to a 20/40 MHz MBSS.
Since the standard allows the change of bandwidth during
the channel switch in mesh, remove the bandwidth check present in
ieee80211_set_csa_beacon.
Fixes: c6da674aff94 ("{nl,cfg,mac}80211: enable the triggering of CSA frame in mesh") Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com> Link: https://lore.kernel.org/r/1658903549-21218-1-git-send-email-quic_haric@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When checking for channel regulatory validity, use the
AP link chandef (and not mesh's chandef).
Fixes: 7b0a0e3c3a88 ("wifi: cfg80211: do some rework towards MLO link APIs") Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, verifier verifies callback functions (sync and async) as if
they will be executed once, (i.e. it explores execution state as if the
function was being called once). The next insn to explore is set to
start of subprog and the exit from nested frame is handled using
curframe > 0 and prepare_func_exit. In case of async callback it uses a
customized variant of push_stack simulating a kind of branch to set up
custom state and execution context for the async callback.
While this approach is simple and works when callback really will be
executed only once, it is unsafe for all of our current helpers which
are for_each style, i.e. they execute the callback multiple times.
A callback releasing acquired references of the caller may do so
multiple times, but currently verifier sees it as one call inside the
frame, which then returns to caller. Hence, it thinks it released some
reference that the cb e.g. got access through callback_ctx (register
filled inside cb from spilled typed register on stack).
Similarly, it may see that an acquire call is unpaired inside the
callback, so the caller will copy the reference state of callback and
then will have to release the register with new ref_obj_ids. But again,
the callback may execute multiple times, but the verifier will only
account for acquired references for a single symbolic execution of the
callback, which will cause leaks.
Note that for async callback case, things are different. While currently
we have bpf_timer_set_callback which only executes it once, even for
multiple executions it would be safe, as reference state is NULL and
check_reference_leak would force program to release state before
BPF_EXIT. The state is also unaffected by analysis for the caller frame.
Hence async callback is safe.
Since we want the reference state to be accessible, e.g. for pointers
loaded from stack through callback_ctx's PTR_TO_STACK, we still have to
copy caller's reference_state to callback's bpf_func_state, but we
enforce that whatever references it adds to that reference_state has
been released before it hits BPF_EXIT. This requires introducing a new
callback_ref member in the reference state to distinguish between caller
vs callee references. Hence, check_reference_leak now errors out if it
sees we are in callback_fn and we have not released callback_ref refs.
Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2
etc. we need to also distinguish between whether this particular ref
belongs to this callback frame or parent, and only error for our own, so
we store state->frameno (which is always non-zero for callbacks).
In short, callbacks can read parent reference_state, but cannot mutate
it, to be able to use pointers acquired by the caller. They must only
undo their changes (by releasing their own acquired_refs before
BPF_EXIT) on top of caller reference_state before returning (at which
point the caller and callback state will match anyway, so no need to
copy it back to caller).
Fixed register define is not used, but register definition shall be kept
in sync.
Fixes: 403f69bbdbad ("tsnep: Add TSN endpoint Ethernet MAC driver") Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The mutex might still be in use until the devm cleanup callback
devm_led_classdev_flash_release() is called. This only happens some time
after lm3601x_remove() completed.
Fixes: e63a744871a3 ("leds: lm3601x: Convert class registration to device managed") Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Discussion around a recently-submitted patch provided historical
context for check_refcount_ok [0]. Specifically, the function and its
helpers - may_be_acquire_function and arg_type_may_be_refcounted -
predate the OBJ_RELEASE type flag and the addition of many more helpers
with acquire/release semantics.
The purpose of check_refcount_ok is to ensure:
1) Helper doesn't have multiple uses of return reg's ref_obj_id
2) Helper with release semantics only has one arg needing to be
released, since that's tracked using meta->ref_obj_id
With current verifier, it's safe to remove check_refcount_ok and its
helpers. Since addition of OBJ_RELEASE type flag, case 2) has been
handled by the arg_type_is_release check in check_func_arg. To ensure
case 1) won't result in verifier silently prioritizing one use of
ref_obj_id, this patch adds a helper_multiple_ref_obj_use check which
fails loudly if a helper passes > 1 test for use of ref_obj_id.
/* size: 504, cachelines: 8, members: 7 */
/* sum members: 503, holes: 1, sum holes: 1 */
/* last cacheline: 56 bytes */
};
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stable-dep-of: 883743422ced ("bpf: Fix ref_obj_id for dynptr data slices in verifier") Signed-off-by: Sasha Levin <sashal@kernel.org>
When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.
And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared
peer map event log:
[ 306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[ 306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[ 306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[ 306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[ 306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166
peer unmap event log:
[ 435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[ 435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[ 435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[ 435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[ 435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166
use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010] ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041] drv_sta_state+0x115/0x677 [mac80211]
[21713.800059] __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076] __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093] ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110] ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137] cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153] nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161] genl_rcv_msg+0x38e/0x3be
[21713.800166] netlink_rcv_skb+0x89/0xf7
[21713.800171] genl_rcv+0x28/0x36
[21713.800176] netlink_unicast+0x179/0x24b
[21713.800181] netlink_sendmsg+0x3a0/0x40e
[21713.800187] sock_sendmsg+0x72/0x76
[21713.800192] ____sys_sendmsg+0x16d/0x1e3
[21713.800196] ___sys_sendmsg+0x95/0xd1
[21713.800200] __sys_sendmsg+0x85/0xbf
[21713.800205] do_syscall_64+0x43/0x55
[21713.800210] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241] ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254] ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265] ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277] ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283] ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294] ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305] ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310] ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316] net_rx_action+0xdc/0x361
[21713.800320] __do_softirq+0x163/0x29a
[21713.800325] asm_call_irq_on_stack+0x12/0x20
[21713.800331] do_softirq_own_stack+0x3c/0x48
[21713.800337] __irq_exit_rcu+0x9b/0x9d
[21713.800342] common_interrupt+0xc9/0x14d
[21713.800346] asm_common_interrupt+0x1e/0x40
[21713.800351] ksoftirqd_should_run+0x5/0x16
[21713.800357] smpboot_thread_fn+0x148/0x211
[21713.800362] kthread+0x150/0x15f
[21713.800367] ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498] ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515] drv_sta_state+0x115/0x677 [mac80211]
[21713.800532] __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548] __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565] ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581] ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598] cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614] nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619] genl_rcv_msg+0x38e/0x3be
[21713.800623] netlink_rcv_skb+0x89/0xf7
[21713.800628] genl_rcv+0x28/0x36
[21713.800632] netlink_unicast+0x179/0x24b
[21713.800637] netlink_sendmsg+0x3a0/0x40e
[21713.800642] sock_sendmsg+0x72/0x76
[21713.800646] ____sys_sendmsg+0x16d/0x1e3
[21713.800651] ___sys_sendmsg+0x95/0xd1
[21713.800655] __sys_sendmsg+0x85/0xbf
[21713.800659] do_syscall_64+0x43/0x55
[21713.800663] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: d0eeafad1189 ("ath10k: Clean up peer when sta goes away.") Signed-off-by: Wen Gong <quic_wgong@quicinc.com> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Link: https://lore.kernel.org/r/20220801141930.16794-1-quic_wgong@quicinc.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Since IQK could spend time, we make a cache of IQK result matrix that looks
like iqk_matrix[channel_idx].val[x][y], and we can reload the matrix if we
have made a cache. To determine a cache is made, we check
iqk_matrix[channel_idx].val[0][0].
The initial commit 7274a8c22980 ("rtlwifi: rtl8192de: Merge phy routines")
make a mistake that checks incorrect iqk_matrix[channel_idx].val[0] that
is always true, and this mistake is found by commit ee3db469dd31
("wifi: rtlwifi: remove always-true condition pointed out by GCC 12"), so
I recall the vendor driver to find fix and apply the correctness.
The "testb" here is using %dil, but the %rdi register was cleared before
returning from "callq *pv_ops+536(%rip)". Adding the proper constraints
results in the use of a different register:
If an NFS server returns NFS4ERR_RESOURCE on the first operation in
an NFSv4 COMPOUND, there's no way for a client to know where the
problem is and then simplify the compound to make forward progress.
So instead, make NFSD process as many operations in an oversized
COMPOUND as it can and then return NFS4ERR_RESOURCE on the first
operation it did not process.
pynfs NFSv4.0 COMP6 exercises this case, but checks only for the
COMPOUND status code, not whether the server has processed any
of the operations.
pynfs NFSv4.1 SEQ6 and SEQ7 exercise the NFSv4.1 case, which detects
too many operations per COMPOUND by checking against the limits
negotiated when the session was created.
Suggested-by: Bruce Fields <bfields@fieldses.org> Fixes: 0078117c6d91 ("nfsd: return RESOURCE not GARBAGE_ARGS on too many ops") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 2825a7f90753 ("nfsd4: allow encoding across page boundaries")
added an explicit computation of the remaining length in the rq_res
XDR buffer.
The computation appears to suffer from an "off-by-one" bug. Because
buflen is too large by one page, XDR encoding can run off the end of
the send buffer by eventually trying to use the struct page address
in rq_page_end, which always contains NULL.
Fixes: bddfdbcddbe2 ("NFSD: Extract the svcxdr_init_encode() helper") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Ensure that stream-based argument decoding can't go past the actual
end of the receive buffer. xdr_init_decode's calculation of the
value of xdr->end over-estimates the end of the buffer because the
Linux kernel RPC server code does not remove the size of the RPC
header from rqstp->rq_arg before calling the upper layer's
dispatcher.
The server-side still uses the svc_getnl() macros to decode the
RPC call header. These macros reduce the length of the head iov
but do not update the total length of the message in the buffer
(buf->len).
A proper fix for this would be to replace the use of svc_getnl() and
friends in the RPC header decoder, but that would be a large and
invasive change that would be difficult to backport.
Fixes: 5191955d6fc6 ("SUNRPC: Prepare for xdr_stream-style decoding on the server-side") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If this memdup_user() call fails, the memory allocated in a previous call
a few lines above should be freed. Otherwise it leaks.
Fixes: 6ee95d1c8991 ("nfsd: add support for upcall version 2") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
elf_update_symbol fails to preserve the special st_shndx values
between [SHN_LORESERVE, SHN_HIRESERVE], which results in it
converting SHN_ABS entries into SHN_UNDEF, for example. Explicitly
check for the special indexes and ensure these symbols are not
marked undefined.
Fixes: ead165fa1042 ("objtool: Fix symbol creation") Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-17-samitolvanen@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, mbox_client_txdone() is called from the PCC address space
handler and that expects the user the Tx state machine to be controlled
by the client which is not the case and the below warning is thrown:
| PCCT: Client can't run the TX ticker
Let the controller run the state machine and the end of Tx can be
acknowledge by calling mbox_chan_txdone() instead.
Fixes: 77e2a04745ff ("ACPI: PCC: Implement OperationRegion handler for the PCC Type 3 subtype") Signed-off-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, the function waiting for completion of mailbox operation is
'wait_for_completion()'. The PCC method will be permanently blocked if
this mailbox message fails to execute. So this patch replaces it with
'wait_for_completion_timeout()'. And set the timeout interval to an
arbitrary retries on top of nominal to prevent the remote processor is
slow to respond to PCC commands.
Fixes: 77e2a04745ff ("ACPI: PCC: Implement OperationRegion handler for the PCC Type 3 subtype") Signed-off-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The allocated memory for the pcc_data struct doesn't get freed under an
error path in pcc_mbox_request_channel() or acpi_os_ioremap(). Also, the
PCC mailbox channel doesn't get freed under an error path in
acpi_os_ioremap().
Fixes: 77e2a04745ff8 ("ACPI: PCC: Implement OperationRegion handler for the PCC Type 3 subtype") Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
MT_MEMORY_RO is introduced by commit 598f0a99fa8a ("ARM: 9210/1:
Mark the FDT_FIXED sections as shareable"), which is a readonly
memory type for FDT area, but there are some different between
ARM_LPAE and non-ARM_LPAE, we need to setup PMD_SECT_AP2 and
L_PMD_SECT_RDONLY for MT_MEMORY_RO when ARM_LAPE enabled.
Fixes: 598f0a99fa8a ("ARM: 9210/1: Mark the FDT_FIXED sections as shareable") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
After ARM supports p4d page tables, the pg_level for note_page()
in walk_pmd() should be 4, not 3, fix it.
Fixes: 84e6ffb2c49c ("arm: add support for folded p4d page tables") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
In file included from ./include/linux/io.h:13,
from ./arch/arm/mach-rpc/include/mach/uncompress.h:9,
from arch/arm/boot/compressed/misc.c:31:
./arch/arm/include/asm/io.h:85:22: error: conflicting types for ‘__raw_writeb’
85 | #define __raw_writeb __raw_writeb
| ^~~~~~~~~~~~
./arch/arm/include/asm/io.h:86:20: note: in expansion of macro ‘__raw_writeb’
86 | static inline void __raw_writeb(u8 val, volatile void __iomem *addr)
| ^~~~~~~~~~~~
In file included from arch/arm/boot/compressed/misc.c:26:
arch/arm/boot/compressed/misc-ep93xx.h:13:20: note: previous definition of ‘__raw_writeb’ was here
13 | static inline void __raw_writeb(unsigned char value, unsigned int ptr)
| ^~~~~~~~~~~~
To: Russell King <linux@armlinux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-arm-kernel@lists.infradead.org Fixes: 0361c7e504b1 ("ARM: ep93xx: multiplatform support") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Use kill_anon_super() instead of generic_shutdown_super() since the
mount() in erofs fscache mode uses get_tree_nodev() and associated
anon bdev needs to be freed.
As syzbot reported [1], the root cause is that i_size field is a
signed type, and negative i_size is also less than EROFS_BLKSIZ.
As a consequence, it's handled as fast symlink unexpectedly.
Let's fall back to the generic path to deal with such unusual i_size.
In error case in bridge_platform_create after calling
platform_device_add()/platform_device_add_data()/
platform_device_add_resources(), release the failed
'pdev' or it will be leak, call platform_device_put()
to fix this problem.
Besides, 'pdev' is divided into 'pdev_wd' and 'pdev_bd',
use platform_device_unregister() to release sgi_w1
resources when xtalk-bridge registration fails.
Fixes: 5dc76a96e95a ("MIPS: PCI: use information from 1-wire PROM for IOC3 detection") Signed-off-by: Lin Yujun <linyujun809@huawei.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
In error case in bridge_platform_create after calling
platform_device_add()/platform_device_add_data()/
platform_device_add_resources(), release the failed
'pdev' or it will be leak, call platform_device_put()
to fix this problem.
Besides, 'pdev' is divided into 'pdev_wd' and 'pdev_bd',
use platform_device_unregister() to release sgi_w1
resources when xtalk-bridge registration fails.
Fixes: fd27234f24ae ("MIPS: add support for SGI Octane (IP30)") Signed-off-by: Lin Yujun <linyujun809@huawei.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
As done for other sections, define the extern as a character array,
which relaxes many of the compiler-time object size checks, which would
otherwise assume it's a single long. Solves the following build error:
arch/sh/kernel/machvec.c: error: array subscript 'struct sh_machine_vector[0]' is partly outside array bounds of 'long int[1]' [-Werror=array-bounds]: => 105:33
To avoid some new AMD processors use wrong highest perf when amd pstate
driver loaded, this fix will query the highest perf from MSR register
MSR_AMD_CPPC_CAP1 and cppc_acpi interface firstly, then compare with the
highest perf value got by calling amd_get_highest_perf() function.
The lower value will be the correct highest perf we need to use.
Otherwise the CPU max MHz will be incorrect if the
amd_get_highest_perf() did not cover the new process family and model ID.
Like this lscpu info, the max frequency is incorrect.
Vendor ID: AuthenticAMD
Socket(s): 1
Stepping: 2
CPU max MHz: 5410.0000
CPU min MHz: 400.0000
BogoMIPS: 5600.54
Fixes: 3743d55b289c2 (x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations) Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Perry Yuan <Perry.Yuan@amd.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The xattr code in ntfs3 is currently a bit confused. For example, it
defines a POSIX ACL i_op->set_acl() method but instead of relying on the
generic POSIX ACL VFS helpers it defines its own set of xattr helpers
with the consequence that i_op->set_acl() is currently dead code.
Switch ntfs3 to rely on the VFS POSIX ACL xattr handlers. Also remove
i_op->{g,s}et_acl() methods from symlink inode operations. Symlinks
don't support xattrs.
This is a preliminary change for the following patches which move
handling idmapped mounts directly in posix_acl_xattr_set().
Since userfaultfd doesn't implement a write operation, it is more
appropriate to open it read-only.
When userfaultfds are opened read-write like it is now, and such fd is
passed from one process to another, SELinux will check both read and
write permissions for the target process, even though it can't actually
do any write operation on the fd later.
Inspired by the following bug report, which has hit the SELinux scenario
described above:
https://bugzilla.redhat.com/show_bug.cgi?id=1974559
Reported-by: Robert O'Callahan <roc@ocallahan.org> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Limit validating the hash algorithm to just security.ima xattr, not
the security.evm xattr or any of the protected EVM security xattrs,
nor posix acls.
Fixes: 50f742dd9147 ("IMA: block writes of the security.ima xattr with unsupported algorithms") Reported-by: Christian Brauner <brauner@kernel.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The latest version of grep claims that egrep is now obsolete so the build
now contains warnings that look like:
egrep: warning: egrep is obsolescent; using grep -E
fix this by using "grep -E" instead.
Cc: Paul Moore <paul@paul-moore.com> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Eric Paris <eparis@parisplace.org> Cc: selinux@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[PM: tweak to remove vdso reference, cleanup subj line] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
manage_dm_interrupts disable/enable vblank using drm_crtc_vblank_off/on
which causes drm_crtc_vblank_get in vrr_transition to fail, and later
when drm_crtc_vblank_put is called the refcount on vblank will be messed
up. Therefore move the call to after manage_dm_interrupts.
Take the DG2 CCS+CC modifier into account when calculating the
watermarks. Othwerwise we'll calculate the watermarks thinking this
tile-4 modifier is linear.
The rc_surface part is actually a nop since that is not used
for any glk+ platform.
Take the DG2 CCS modifiers into account when calculating the
watermarks. Othwerwise we'll calculate the watermarks thinking these
tile-4 modifiers are linear.
The rc_surface part is actually a nop since that is not used
for any glk+ platform.
Take the gen12+ CCS+CC modifier into account when calculating the
watermarks. Othwerwise we'll calculate the watermarks thinking this
Y-tiled modifier is linear.
The rc_surface part is actually a nop since that is not used
for any glk+ platform.
Take the gen12+ MC CCS modifier into account when calculating the
watermarks. Othwerwise we'll calculate the watermarks thinking this
Y-tiled modifier is linear.
The rc_surface part is actually a nop since that is not used
for any glk+ platform.
v2: Split RC CCS vs. MC CCS to separate patches
Cc: stable@vger.kernel.org Fixes: 2dfbf9d2873a ("drm/i915/tgl: Gen-12 display can decompress surfaces compressed by the media engine") Reviewed-by: Juha-Pekka Heikkila <juhapekka.heikkila@gmail.com> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221003111544.8007-3-ville.syrjala@linux.intel.com
(cherry picked from commit 91c9651425fe955b1387f3637607dda005f3f710) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Take the gen12+ RC CCS modifier into account when calculating the
watermarks. Othwerwise we'll calculate the watermarks thinking this
Y-tiled modifier is linear.
The rc_surface part is actually a nop since that is not used
for any glk+ platform.
Now that the scratch page and page directories have a reference back to
the i915_address_space, we cannot do an immediate free of the ppgtt upon
error as those buffer objects will perform a later i915_vm_put in their
deferred frees.
The downside is that by replacing the onion unwind along the error
paths, the ppgtt cleanup must handle a partially constructed vm. This
includes ensuring that the vm->cleanup is set prior to the error path.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6900 Signed-off-by: Chris Wilson <chris.p.wilson@intel.com> Fixes: 4d8151ae5329 ("drm/i915: Don't free shared locks while shared") Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v5.14+ Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220926153333.102195-1-matthew.auld@intel.com
(cherry picked from commit c286558f58535cf97b717b946d6c96d774a09d17) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nouveau_bo_init() is backed by ttm_bo_init() and ferries its return code
back to the caller. On failures, ttm will call nouveau_bo_del_ttm() and
free the memory.Thus, when nouveau_bo_init() returns an error, the gem
object has already been released. Then the call to nouveau_bo_ref() will
use the freed "nvbo->bo" and lead to a use-after-free bug.
We should delete the call to nouveau_bo_ref() to avoid the use-after-free.
Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Fixes: 019cbd4a4feb ("drm/nouveau: Initialize GEM object before TTM object") Cc: Thierry Reding <treding@nvidia.com> Cc: <stable@vger.kernel.org> # v5.4+ Link: https://patchwork.freedesktop.org/patch/msgid/20220705132546.2247677-1-niejianglei2021@163.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As it turns out: while Nvidia does actually have interlacing knobs on their
GPU still pretty much no current GPUs since Volta actually support it.
Trying interlacing on these GPUs will result in NVDisplay being quite
unhappy like so:
nouveau 0000:1f:00.0: disp: chid 0 stat 00004802 reason 4 [INVALID_ARG] mthd 2008 data 00000001 code 00080000
nouveau 0000:1f:00.0: disp: chid 0 stat 10005080 reason 5 [INVALID_STATE] mthd 0200 data 00000001 code 00000001
So let's fix this by following the same behavior Nvidia's driver does and
disable interlacing entirely.
In the greybus audio_helper code, the debugfs file for the dapm has the
potential to be removed and memory will be leaked. There is also the
very real potential for this code to remove ALL debugfs entries from the
system, and it seems like this is what will really happen if this code
ever runs. This all is very wrong as the greybus audio driver did not
create this debugfs file, the sound core did and controls the lifespan
of it.
So remove all of the debugfs logic from the audio_helper code as there's
no way it could be correct. If this really is needed, it can come back
with a fixup for the incorrect usage of the debugfs_lookup() call which
is what caused this to be noticed at all.
Deliberately truncate the exception error code when shoving it into the
VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
Intel CPUs are incapable of handling 32-bit error codes and will never
generate an error code with bits 31:16, but userspace can provide an
arbitrary error code via KVM_SET_VCPU_EVENTS. Failure to drop the bits
on exception injection results in failed VM-Entry, as VMX disallows
setting bits 31:16. Setting the bits on VM-Exit would at best confuse
L1, and at worse induce a nested VM-Entry failure, e.g. if L1 decided to
reinject the exception back into L2.
Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-3-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Don't propagate vmcs12's VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL to vmcs02.
KVM doesn't disallow L1 from using VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL
even when KVM itself doesn't use the control, e.g. due to the various
CPU errata that where the MSR can be corrupted on VM-Exit.
Preserve KVM's (vmcs01) setting to hopefully avoid having to toggle the
bit in vmcs02 at a later point. E.g. if KVM is loading PERF_GLOBAL_CTRL
when running L1, then odds are good KVM will also load the MSR when
running L2.
Fixes: 8bf00a529967 ("KVM: VMX: add support for switching of PERF_GLOBAL_CTRL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20220830133737.1539624-18-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Drop pending exceptions and events queued for re-injection when leaving
nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced
by host userspace. Failure to purge events could result in an event
belonging to L2 being injected into L1.
This _should_ never happen for VM-Fail as all events should be blocked by
nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is
the source of VM-Fail when running vmcs02.
SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry
to SMM is blocked by pending exceptions and re-injected events.
Forced exit is definitely buggy, but has likely gone unnoticed because
userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or
some other ioctl() that purges the queue).
Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-2-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The emulator checks the wrong variable while setting the CPU
interruptibility state, the target segment is embedded in the instruction
opcode, not the ModR/M register. Fix the condition.
Signed-off-by: Michal Luczaj <mhal@rbox.co> Fixes: a5457e7bcf9a ("KVM: emulate: POP SS triggers a MOV SS shadow too") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20220821215900.1419215-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Our test found a problem that wbt inflight counter is negative, which
will cause io hang(noted that this problem doesn't exist in mainline):
t1: device create t2: issue io
add_disk
blk_register_queue
wbt_enable_default
wbt_init
rq_qos_add
// wb_normal is still 0
/*
* in mainline, disk can't be opened before
* bdev_add(), however, in old kernels, disk
* can be opened before blk_register_queue().
*/
blkdev_issue_flush
// disk size is 0, however, it's not checked
submit_bio_wait
submit_bio
blk_mq_submit_bio
rq_qos_throttle
wbt_wait
bio_to_wbt_flags
rwb_enabled
// wb_normal is 0, inflight is not increased
wbt_queue_depth_changed(&rwb->rqos);
wbt_update_limits
// wb_normal is initialized
rq_qos_track
wbt_track
rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
// wb_normal is not 0,wbt_flags will be set
t3: io completion
blk_mq_free_request
rq_qos_done
wbt_done
wbt_is_tracked
// return true
__wbt_done
wbt_rqw_done
atomic_dec_return(&rqw->inflight);
// inflight is decreased
commit 8235b5c1e8c1 ("block: call bdev_add later in device_add_disk") can
avoid this problem, however it's better to fix this problem in wbt:
1) Lower kernel can't backport this patch due to lots of refactor.
2) Root cause is that wbt call rq_qos_add() before wb_normal is
initialized.
The problem is that the second bio is finished after 10s instead of 20s.
Root cause:
1) second bio will be flagged:
__blk_throtl_bio
while (true) {
...
if (sq->nr_queued[rw]) -> some bio is throttled already
break
};
bio_set_flag(bio, BIO_THROTTLED); -> flag the bio
2) flagged bio will be dispatched without waiting:
throtl_dispatch_tg
tg_may_dispatch
tg_with_in_bps_limit
if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED))
*wait = 0; -> wait time is zero
return true;
commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
support to count split bios for iops limit, thus it adds flagged bio
checking in tg_with_in_bps_limit() so that split bios will only count
once for bps limit, however, it introduce a new problem that io throttle
won't work if multiple bios are throttled.
In order to fix the problem, handle iops/bps limit in different ways:
1) for iops limit, there is no flag to record if the bio is throttled,
and iops is always applied.
2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED,
and io throttle will ignore bio with the flag.
Noted this patch also remove the code to set flag in __bio_clone(), it's
introduced in commit 111be8839817 ("block-throttle: avoid double
charge"), and author thinks split bio can be resubmited and throttled
again, which is wrong because split bio will continue to dispatch from
caller.
Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit") Cc: <stable@vger.kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The busy status bit may never de-assert if number of programmed skip
bits is incorrect, resulting in a kernel hang because the bit is polled
endlessly in the code. Fix it by adding timeout for the bit-polling.
This problem is reproducible by setting the data_bit_offset field of
the HEVC slice params to a wrong value by userspace.
Cc: stable@vger.kernel.org Fixes: 7678c5462680 (media: cedrus: Fix decoding for some HEVC videos) Reported-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The cedrus_hw_resume() crashes with NULL deference on driver probe if
runtime PM is disabled because it uses platform data that hasn't been
set up yet. Fix this by setting the platform data earlier during probe.
Cc: stable@vger.kernel.org Fixes: 50e761516f2b (media: platform: Add Cedrus VPU decoder driver) Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> Reviewed-by: Samuel Holland <samuel@sholland.org> Acked-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Software that has run before the USB4 CM in Linux runs may have disabled
hotplug events for a given lane adapter.
Other CMs such as that one distributed with Windows 11 will enable hotplug
events. Do the same thing in the Linux CM which fixes hotplug events on
"AMD Pink Sardine".
Cc: stable@vger.kernel.org Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The rpmsg_dev_remove() in rpmsg_core is the place for releasing
this default endpoint.
So need to avoid destroying the default endpoint in
rpmsg_chrdev_eptdev_destroy(), this should be the same as
rpmsg_eptdev_release(). Otherwise there will be double destroy
issue that ept->refcount report warning:
The problem is that the synthetic event field "char file[]" will read
the value given to it as a string without any memory checks to make sure
the address is valid. The above example will pass in the user space
address and the sythetic event code will happily call strlen() on it
and then strscpy() where either one will cause an oops when accessing
user space addresses.
Use the helper functions from trace_kprobe and trace_eprobe that can
read strings safely (and actually succeed when the address is from user
space and the memory is mapped in).
Have the specific functions for kernel probes that read strings to inject
the "(fault)" name directly. trace_probes.c does this too (for uprobes)
but as the code to read strings are going to be used by synthetic events
(and perhaps other utilities), it simplifies the code by making sure those
other uses do not need to implement the "(fault)" name injection as well.
Link: https://lkml.kernel.org/r/20221012104534.644803645@goodmis.org Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
are identical in both trace_kprobe.c and trace_eprobe.c. Move them into
a new header file trace_probe_kernel.h to share it. This code will later
be used by the synthetic events as well.
Marked for stable as a fix for a crash in synthetic events requires it.
Link: https://lkml.kernel.org/r/20221012104534.467668078@goodmis.org Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the
snapshot buffer at boot up for use later. The ftrace_boot_snapshot in
particular requires the snapshot to be allocated because it will take a
snapshot at the end of boot up allowing to see the traces that happened
during boot so that it's not lost when user space takes over.
When a tracer is registered (started) there's a path that checks if it
requires the snapshot buffer or not, and if it does not and it was
allocated it will do a synchronization and free the snapshot buffer.
This is only required if the previous tracer was using it for "max
latency" snapshots, as it needs to make sure all max snapshots are
complete before freeing. But this is only needed if the previous tracer
was using the snapshot buffer for latency (like irqoff tracer and
friends). But it does not make sense to free it, if the previous tracer
was not using it, and the snapshot was allocated by the cmdline
parameters. This basically takes away the point of allocating it in the
first place!
Note, the allocated snapshot worked fine for just trace events, but fails
when a tracer is enabled on the cmdline.
Further investigation, this goes back even further and it does not require
a tracer on the cmdline to fail. Simply enable snapshots and then enable a
tracer, and it will remove the snapshot.
Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: 45ad21ca5530 ("tracing: Have trace_array keep track if snapshot buffer is allocated") Reported-by: Ross Zwisler <zwisler@kernel.org> Tested-by: Ross Zwisler <zwisler@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a process is waiting on the ring buffer for data, there currently isn't
a clean way to force it to wake up. Add an ioctl call that will force any
tasks that are waiting on the trace_pipe_raw file to wake up.
Link: https://lkml.kernel.org/r/20220929095029.117f913f@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When tracing is disabled, there's no reason that waiters should stay
waiting, wake them up, otherwise tasks get stuck when they should be
flushing the buffers.
Cc: stable@vger.kernel.org Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the file that represents the ring buffer is closed, there may be
waiters waiting on more input from the ring buffer. Call
ring_buffer_wake_waiters() to wake up any waiters when the file is
closed.
Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It was found that some tracing functions in kernel/trace/trace.c acquire
an arch_spinlock_t with preemption and irqs enabled. An example is the
tracing_saved_cmdlines_size_read() function which intermittently causes
a "BUG: using smp_processor_id() in preemptible" warning when the LTP
read_all_proc test is run.
That can be problematic in case preemption happens after acquiring the
lock. Add the necessary preemption or interrupt disabling code in the
appropriate places before acquiring an arch_spinlock_t.
The convention here is to disable preemption for trace_cmdline_lock and
interupt for max_lock.
Link: https://lkml.kernel.org/r/20220922145622.1744826-1-longman@redhat.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: stable@vger.kernel.org Fixes: a35873a0993b ("tracing: Add conditional snapshot") Fixes: 939c7a4f04fc ("tracing: Introduce saved_cmdlines_size file") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ring buffer is broken up into sub buffers (currently of page size).
Each sub buffer has a pointer to its "tail" (the last event written to the
sub buffer). When a new event is requested, the tail is locally
incremented to cover the size of the new event. This is done in a way that
there is no need for locking.
If the tail goes past the end of the sub buffer, the process of moving to
the next sub buffer takes place. After setting the current sub buffer to
the next one, the previous one that had the tail go passed the end of the
sub buffer needs to be reset back to the original tail location (before
the new event was requested) and the rest of the sub buffer needs to be
"padded".
The race happens when a reader takes control of the sub buffer. As readers
do a "swap" of sub buffers from the ring buffer to get exclusive access to
the sub buffer, it replaces the "head" sub buffer with an empty sub buffer
that goes back into the writable portion of the ring buffer. This swap can
happen as soon as the writer moves to the next sub buffer and before it
updates the last sub buffer with padding.
Because the sub buffer can be released to the reader while the writer is
still updating the padding, it is possible for the reader to see the event
that goes past the end of the sub buffer. This can cause obvious issues.
To fix this, add a few memory barriers so that the reader definitely sees
the updates to the sub buffer, and also waits until the writer has put
back the "tail" of the sub buffer back to the last event that was written
on it.
To be paranoid, it will only spin for 1 second, otherwise it will
warn and shutdown the ring buffer code. 1 second should be enough as
the writer does have preemption disabled. If the writer doesn't move
within 1 second (with preemption disabled) something is horribly
wrong. No interrupt should last 1 second!
On closing of a file that represents a ring buffer or flushing the file,
there may be waiters on the ring buffer that needs to be woken up and exit
the ring_buffer_wait() function.
Add ring_buffer_wake_waiters() to wake up the waiters on the ring buffer
and allow them to exit the wait loop.
Link: https://lkml.kernel.org/r/20220928133938.28dc2c27@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The wake up waiters only checks the "wakeup_full" variable and not the
"full_waiters_pending". The full_waiters_pending is set when a waiter is
added to the wait queue. The wakeup_full is only set when an event is
triggered, and it clears the full_waiters_pending to avoid multiple calls
to irq_work_queue().
The irq_work callback really needs to check both wakeup_full as well as
full_waiters_pending such that this code can be used to wake up waiters
when a file is closed that represents the ring buffer and the waiters need
to be woken up.
Link: https://lkml.kernel.org/r/20220927231824.209460321@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 15693458c4bc0 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The logic to know when the shortest waiters on the ring buffer should be
woken up or not has uses a less than instead of a greater than compare,
which causes the shortest_full to actually be the longest.
Link: https://lkml.kernel.org/r/20220927231823.718039222@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a page is partially read, and then the splice system call is run
against the ring buffer, it will always fail to read, no matter how much
is in the ring buffer. That's because the code path for a partial read of
the page does will fail if the "full" flag is set.
The splice system call wants full pages, so if the read of the ring buffer
is not yet full, it should return zero, and the splice will block. But if
a previous read was done, where the beginning has been consumed, it should
still be given to the splice caller if the rest of the page has been
written to.
This caused the splice command to never consume data in this scenario, and
let the ring buffer just fill up and lose events.
Weak functions started causing havoc as they showed up in the
"available_filter_functions" and this confused people as to why some
functions marked as "notrace" were listed, but when enabled they did
nothing. This was because weak functions can still have fentry calls, and
these addresses get added to the "available_filter_functions" file.
kallsyms is what converts those addresses to names, and since the weak
functions are not listed in kallsyms, it would just pick the function
before that.
To solve this, there was a trick to detect weak functions listed, and
these records would be marked as DISABLED so that they do not get enabled
and are mostly ignored. As the processing of the list of all functions to
figure out what is weak or not can take a long time, this process is put
off into a kernel thread and run in parallel with the rest of start up.
Now the issue happens whet function tracing is enabled via the kernel
command line. As it starts very early in boot up, it can be enabled before
the records that are weak are marked to be disabled. This causes an issue
in the accounting, as the weak records are enabled by the command line
function tracing, but after boot up, they are not disabled.
The ftrace records have several accounting flags and a ref count. The
DISABLED flag is just one. If the record is enabled before it is marked
DISABLED it will get an ENABLED flag and also have its ref counter
incremented. After it is marked for DISABLED, neither the ENABLED flag nor
the ref counter is cleared. There's sanity checks on the records that are
performed after an ftrace function is registered or unregistered, and this
detected that there were records marked as ENABLED with ref counter that
should not have been.
Note, the module loading code uses the DISABLED flag as well to keep its
functions from being modified while its being loaded and some of these
flags may get set in this process. So changing the verification code to
ignore DISABLED records is a no go, as it still needs to verify that the
module records are working too.
Also, the weak functions still are calling a trampoline. Even though they
should never be called, it is dangerous to leave these weak functions
calling a trampoline that is freed, so they should still be set back to
nops.
There's two places that need to not skip records that have the ENABLED
and the DISABLED flags set. That is where the ftrace_ops is processed and
sets the records ref counts, and then later when the function itself is to
be updated, and the ENABLED flag gets removed. Add a helper function
"skip_record()" that returns true if the record has the DISABLED flag set
but not the ENABLED flag.
When executing following commands like what document said, but the log
"#### all functions enabled ####" was not shown as expect:
1. Set a 'mod' filter:
$ echo 'write*:mod:ext3' > /sys/kernel/tracing/set_ftrace_filter
2. Invert above filter:
$ echo '!write*:mod:ext3' >> /sys/kernel/tracing/set_ftrace_filter
3. Read the file:
$ cat /sys/kernel/tracing/set_ftrace_filter
By some debugging, I found that flag FTRACE_HASH_FL_MOD was not unset
after inversion like above step 2 and then result of ftrace_hash_empty()
is incorrect.
Link: https://lkml.kernel.org/r/20220926152008.2239274-1-zhengyejian1@huawei.com Cc: <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 8c08f0d5c6fb ("ftrace: Have cached module filters be an active filter") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The KLP transition code depends on the TIF_PATCH_PENDING and
the task->patch_state to stay in sync. On a normal (forward)
transition, TIF_PATCH_PENDING will be set on every task in
the system, while on a reverse transition (after a failed
forward one) first TIF_PATCH_PENDING will be cleared from
every task, followed by it being set on tasks that need to
be transitioned back to the original code.
However, the fork code copies over the TIF_PATCH_PENDING flag
from the parent to the child early on, in dup_task_struct and
setup_thread_stack. Much later, klp_copy_process will set
child->patch_state to match that of the parent.
However, the parent's patch_state may have been changed by KLP loading
or unloading since it was initially copied over into the child.
This results in the KLP code occasionally hitting this warning in
klp_complete_transition:
Set, or clear, the TIF_PATCH_PENDING flag in the child task
depending on whether or not it is needed at the time
klp_copy_process is called, at a point in copy_process where the
tasklist_lock is held exclusively, preventing races with the KLP
code.
The KLP code does have a few places where the state is changed
without the tasklist_lock held, but those should not cause
problems because klp_update_patch_state(current) cannot be
called while the current task is in the middle of fork,
klp_check_and_switch_task() which is called under the pi_lock,
which prevents rescheduling, and manipulation of the patch
state of idle tasks, which do not fork.
This should prevent this warning from triggering again in the
future, and close the race for both normal and reverse transitions.
Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Breno Leitao <leitao@debian.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Fixes: d83a7cb375ee ("livepatch: change to a per-task consistency model") Cc: stable@kernel.org Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20220808150019.03d6a67b@imladris.surriel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To avoid to 'state->fc_regions_size' mismatch with 'state->fc_regions'
when fail to reallocate 'fc_reqions',only update 'state->fc_regions_size'
after 'state->fc_regions' is allocated successfully.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-4-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As krealloc may return NULL, in this case 'state->fc_regions' may not be
freed by krealloc, but 'state->fc_regions' already set NULL. Then will
lead to 'state->fc_regions' memory leak.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As krealloc may return NULL, in this case 'state->fc_modified_inodes'
may not be freed by krealloc, but 'state->fc_modified_inodes' already
set NULL. Then will lead to 'state->fc_modified_inodes' memory leak.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In 'ext4_fc_write_inode' function first call 'ext4_get_inode_loc' get 'iloc',
after use it miss release 'iloc.bh'.
So just release 'iloc.bh' before 'ext4_fc_write_inode' return.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100859.1415196-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Following process may lead to fs corruption:
1. ext4_create(dir/foo)
ext4_add_nondir
ext4_add_entry
ext4_dx_add_entry
a. add_dirent_to_buf
ext4_mark_inode_dirty
ext4_handle_dirty_metadata // dir inode bh is recorded into journal
b. ext4_append // dx_get_count(entries) == dx_get_limit(entries)
ext4_bread(EXT4_GET_BLOCKS_CREATE)
ext4_getblk
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
dquot_alloc_block
dquot_alloc_space_nodirty
inode_add_bytes // update dir's i_blocks
ext4_ext_insert_extent
ext4_ext_dirty // record extent bh into journal
ext4_handle_dirty_metadata(bh)
// record new block into journal
inode->i_size += inode->i_sb->s_blocksize // new size(in mem)
c. ext4_handle_dirty_dx_node(bh2)
// record dir's new block(dx_node) into journal
d. ext4_handle_dirty_dx_node((frame - 1)->bh)
e. ext4_handle_dirty_dx_node(frame->bh)
f. do_split // ret err!
g. add_dirent_to_buf
ext4_mark_inode_dirty(dir) // update raw_inode on disk(skipped)
2. fsck -a /dev/sdb
drop last block(dx_node) which beyonds dir's i_size.
/dev/sdb: recovering journal
/dev/sdb contains a file system with errors, check forced.
/dev/sdb: Inode 12, end of extent exceeds allowed value
(logical block 128, physical block 3938, len 1)
3. fsck -fn /dev/sdb
dx_node->entry[i].blk > dir->i_size
Pass 2: Checking directory structure
Problem in HTREE directory inode 12 (/dir): bad block number 128.
Clear HTree index? no
Problem in HTREE directory inode 12: block #3 has invalid depth (2)
Problem in HTREE directory inode 12: block #3 has bad max hash
Problem in HTREE directory inode 12: block #3 not referenced
Fix it by marking inode dirty directly inside ext4_append().
Fetch a reproducer in [Link].
ext4 currently updates the i_version counter when the atime is updated
during a read. This is less than ideal as it can cause unnecessary cache
invalidations with NFSv4 and unnecessary remeasurements for IMA.
The increment in ext4_mark_iloc_dirty is also problematic since it can
corrupt the i_version counter for ea_inodes. We aren't bumping the file
times in ext4_mark_iloc_dirty, so changing the i_version there seems
wrong, and is the cause of both problems.
Remove that callsite and add increments to the setattr, setxattr and
ioctl codepaths, at the same times that we update the ctime. The
i_version bump that already happens during timestamp updates should take
care of the rest.
In ext4_move_extents, increment the i_version on both inodes, and also
add in missing ctime updates.
[ Some minor updates since we've already enabled the i_version counter
unconditionally already via another patch series. -- TYT ]
Cc: stable@kernel.org Cc: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In our product environment, we encounter some jbd hung waiting handles to
stop while several writters were doing memory reclaim for buffer head
allocation in delay alloc write path. Ext4 do buffer head allocation with
holding transaction handle which may be blocked too long if the reclaim
works not so smooth. According to our bcc trace, the reclaim time in
buffer head allocation can reach 258s and the jbd transaction commit also
take almost the same time meanwhile. Except for these extreme cases,
we often see several seconds delays for cgroup memory reclaim on our
servers. This is more likely to happen considering docker environment.
One thing to note, the allocation of buffer heads is as often as page
allocation or more often when blocksize less than page size. Just like
page cache allocation, we should also place the buffer head allocation
before startting the handle.
Recently we notice that ext4 filesystem would occasionally fail to read
metadata from disk and report error message, but the disk and block
layer looks fine. After analyse, we lockon commit 88dbcbb3a484
("blkdev: avoid migration stalls for blkdev pages"). It provide a
migration method for the bdev, we could move page that has buffers
without extra users now, but it lock the buffers on the page, which
breaks the fragile metadata read operation on ext4 filesystem,
ext4_read_bh_lock() was copied from ll_rw_block(), it depends on the
assumption of that locked buffer means it is under IO. So it just
trylock the buffer and skip submit IO if it lock failed, after
wait_on_buffer() we conclude IO error because the buffer is not
uptodate.
This issue could be easily reproduced by add some delay just after
buffer_migrate_lock_buffers() in __buffer_migrate_folio() and do
fsstress on ext4 filesystem.
Fix it by removing the trylock logic in ext4_read_bh_lock(), just lock
the buffer and submit IO if it's not uptodate, and also leave over
readahead helper.
Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220831074629.3755110-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).
Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.
Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally. While we're in here, mark the i_version mount
option Opt_removed.
[ Removed leftover bits of i_version from ext4_apply_options() since it
now can't ever be set in ctx->mask_s_flags -- lczerner ]
Cc: stable@kernel.org Cc: Dave Chinner <david@fromorbit.com> Cc: Benjamin Coddington <bcodding@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220824160349.39664-3-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ea_inodes are using i_version for storing part of the reference count so
we really need to leave it alone.
The problem can be reproduced by xfstest ext4/026 when iversion is
enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
inodes in ext4_mark_iloc_dirty().
Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220824160349.39664-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The check in __ext4_read_dirblock() for block being outside of directory
size was wrong because it compared block number against directory size
in bytes. Fix it.
Fixes: 65f8ea4cd57d ("ext4: check if directory block is within i_size")
CVE: CVE-2022-1184 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ext4_lazyinit_thread is not set freezable. Hence when the thread calls
try_to_freeze it doesn't freeze during suspend and continues to send
requests to the storage during suspend, resulting in suspend failures.
When inode is created and written to using direct IO, there is nothing
to clear the EXT4_STATE_MAY_INLINE_DATA flag. Thus when inode gets
truncated later to say 1 byte and written using normal write, we will
try to store the data as inline data. This confuses the code later
because the inode now has both normal block and inline data allocated
and the confusion manifests for example as:
Add sanity check that filesystem size does not exceed the underlying
device size and that group size is big enough so that metadata can fit
into it. This avoid trying to mount some crafted filesystems with
extremely large group counts.
Reported-by: syzbot+0f2f7e65a3007d39539f@syzkaller.appspotmail.com Reported-by: kernel test robot <oliver.sang@intel.com> # Test fixup CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In fc_do_one_pass() miss release buffer head after use which will lead
to reference count leak.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220917093805.1782845-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In 'jbd2_fc_wait_bufs' use 'bh' after put buffer head reference count
which may lead to use-after-free.
So judge buffer if uptodate before put buffer head reference count.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100812.1414768-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As in 'jbd2_fc_wait_bufs' if buffer isn't uptodate, will return -EIO without
update 'journal->j_fc_off'. But 'jbd2_fc_release_bufs' will release buffer head
from ‘j_fc_off - 1’ if 'bh' is NULL will terminal release which will lead to
buffer head buffer head reference count leak.
To solve above issue, update 'journal->j_fc_off' before return -EIO.
Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100812.1414768-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
BUG: KASAN: use-after-free in recover_data+0x63ae/0x6ae0 [f2fs]
Read of size 4 at addr ffff8881464dcd80 by task mount/1013
CPU: 3 PID: 1013 Comm: mount Tainted: G W 6.0.0-rc4 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Call Trace:
dump_stack_lvl+0x45/0x5e
print_report.cold+0xf3/0x68d
kasan_report+0xa8/0x130
recover_data+0x63ae/0x6ae0 [f2fs]
f2fs_recover_fsync_data+0x120d/0x1fc0 [f2fs]
f2fs_fill_super+0x4665/0x61e0 [f2fs]
mount_bdev+0x2cf/0x3b0
legacy_get_tree+0xed/0x1d0
vfs_get_tree+0x81/0x2b0
path_mount+0x47e/0x19d0
do_mount+0xce/0xf0
__x64_sys_mount+0x12c/0x1a0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root cause is: in fuzzed image, SSA table is corrupted: ofs_in_node
is larger than ADDRS_PER_PAGE(), result in out-of-range access on 4k-size
page.