Memory allocations should be done in sw_init. hw_init should
just be hardware programming needed to initialize the IP block.
This is how most other IP blocks work. Move the GPU memory
allocations from psp hw_init to psp sw_init and move the memory
free to sw_fini. This also fixes a potential GPU memory leak
if psp hw_init fails.
When trapping packets for on-CPU processing, Spectrum machines
differentiate between control and non-control traps. Traffic trapped
through non-control traps is treated as data and kept in shared buffer in
pools 0-4. Traffic trapped through control traps is kept in the dedicated
control buffer 9. The advantage of marking traps as control is that
pressure in the data plane does not prevent the control traffic to be
processed.
When the LLDP trap was introduced, it was marked as a control trap. But
then in commit aed4b5721143 ("mlxsw: spectrum: PTP: Hook into packet
receive path"), PTP traps were introduced. Because Ethernet-encapsulated
PTP packets look to the Spectrum-1 ASIC as LLDP traffic and are trapped
under the LLDP trap, this trap was reconfigured as non-control, in sync
with the PTP traps.
There is however no requirement that PTP traffic be handled as data.
Besides, the usual encapsulation for PTP traffic is UDP, not bare Ethernet,
and that is in deployments that even need PTP, which is far less common
than LLDP. This is reflected by the default policer, which was not bumped
up to the 19Kpps / 24Kpps that is the expected load of a PTP-enabled
Spectrum-1 switch.
Marking of LLDP trap as non-control was therefore probably misguided. In
this patch, change it back to control.
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
The idea behind the warnings is that the user would get warned in case when
more than one priority is configured for a given DSCP value on a netdevice.
The warning is currently wrong, because dcb_ieee_getapp_mask() returns
the first matching entry, not all of them, and the warning will then claim
that some priority is "current", when in fact it is not.
But more importantly, the warning is misleading in general. Consider the
following commands:
# dcb app flush dev swp19 dscp-prio
# dcb app add dev swp19 dscp-prio 24:3
# dcb app replace dev swp19 dscp-prio 24:2
The last command will issue the following warning:
mlxsw_spectrum3 0000:07:00.0 swp19: Ignoring new priority 2 for DSCP 24 in favor of current value of 3
The reason is that the "replace" command works by first adding the new
value, and then removing all old values. This is the only way to make the
replacement without causing the traffic to be prioritized to whatever the
chip defaults to. The warning is issued in response to adding the new
priority, and then no warning is shown when the old priority is removed.
The upshot is that the canonical way to change traffic prioritization
always produces a warning about ignoring the new priority, but what gets
configured is in fact what the user intended.
An option to just emit warning every time that the prioritization changes
just to make it clear that it happened is obviously unsatisfactory.
Therefore, in this patch, remove the warnings.
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
It is possible to craft a topology where sof_get_control_data() would do
out of bounds access because it expects that it is only called when the
payload is bytes type.
Confusingly it also handles other types of controls, but the payload
parsing implementation is only valid for bytes.
Fix the code to count the non bytes controls and instead of storing a
pointer to sof_abi_hdr in sof_widget_data (which is only valid for bytes),
store the pointer to the data itself and add a new member to save the size
of the data.
In case of non bytes controls we store the pointer to the chanv itself,
which is just an array of values at the end.
In case of bytes control, drop the wrong cdata->data (wdata[i].pdata) check
against NULL since it is incorrect and invalid in this context.
The data is pointing to the end of cdata struct, so it should never be
null.
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20220427185221.28928-1-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
DAPM tracks and reports the value presented to the user from DAPM controls
separately to the register value, these may diverge during initialisation
or when an autodisable control is in use.
When writing DAPM controls we currently report that a change has occurred
if either the DAPM value or the value stored in the register has changed,
meaning that if the two are out of sync we may appear to report a spurious
event to userspace. Since we use this folded in value for nothing other
than the value reported to userspace simply drop the folding in of the
register change.
When an FTE has no children is means all the rules where removed
and the FTE can be deleted regardless of the dests_size value.
While dests_size should be 0 when there are no children
be extra careful not to leak memory or get firmware syndrome
if the proper bookkeeping of dests_size wasn't done.
Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When tcmu_vma_fault() gets a page successfully, before the current context
completes page fault procedure, find_free_blocks() may run and call
unmap_mapping_range() to unmap the page. Assume that when
find_free_blocks() initially completes and the previous page fault
procedure starts to run again and completes, then one truncated page has
been mapped to userspace. But note that tcmu_vma_fault() has gotten a
refcount for the page so any other subsystem won't be able to use the page
unless the userspace address is unmapped later.
If another command subsequently runs and needs to extend dbi_thresh it may
reuse the corresponding slot for the previous page in data_bitmap. Then
though we'll allocate new page for this slot in data_area, no page fault
will happen because we have a valid map and the real request's data will be
lost.
Filesystem implementations will also run into this issue but they usually
lock the page when vm_operations_struct->fault gets a page and unlock the
page after finish_fault() completes. For truncate filesystems lock pages in
truncate_inode_pages() to protect against racing wrt. page faults.
To fix this possible data corruption scenario we can apply a method similar
to the filesystems. For pages that are to be freed, tcmu_blocks_release()
locks and unlocks. Make tcmu_vma_fault() also lock found page under
cmdr_lock. At the same time, since tcmu_vma_fault() gets an extra page
refcount, tcmu_blocks_release() won't free pages if pages are in page fault
procedure, which means it is safe to call tcmu_blocks_release() before
unmap_mapping_range().
With these changes tcmu_blocks_release() will wait for all page faults to
be completed before calling unmap_mapping_range(). And later, if
unmap_mapping_range() is called, it will ensure stale mappings are removed.
When test connect/disconnect to an AP frequently with WCN6855, sometimes
it show below log.
[ 277.040121] wls1: deauthenticating from 8c:21:0a:b3:5a:64 by local choice (Reason: 3=DEAUTH_LEAVING)
[ 277.050906] ath11k_pci 0000:05:00.0: wmi stats vdev id 0 mac 00:03:7f:29:61:11
[ 277.050944] ath11k_pci 0000:05:00.0: wmi stats bssid 8c:21:0a:b3:5a:64 vif pK-error
[ 277.050954] ath11k_pci 0000:05:00.0: not found station for bssid 8c:21:0a:b3:5a:64
[ 277.050961] ath11k_pci 0000:05:00.0: failed to parse rssi chain -71
[ 277.050967] ath11k_pci 0000:05:00.0: failed to pull fw stats: -71
[ 277.050976] ath11k_pci 0000:05:00.0: wmi stats vdev id 0 mac 00:03:7f:29:61:11
[ 277.050983] ath11k_pci 0000:05:00.0: wmi stats bssid 8c:21:0a:b3:5a:64 vif pK-error
[ 277.050989] ath11k_pci 0000:05:00.0: not found station for bssid 8c:21:0a:b3:5a:64
[ 277.050995] ath11k_pci 0000:05:00.0: failed to parse rssi chain -71
[ 277.051000] ath11k_pci 0000:05:00.0: failed to pull fw stats: -71
[ 278.064050] ath11k_pci 0000:05:00.0: failed to request fw stats: -110
Reason is:
When running disconnect operation, sta_info removed from local->sta_hash
by __sta_info_destroy_part1() from __sta_info_flush(), after this,
ieee80211_find_sta_by_ifaddr() which called by
ath11k_wmi_tlv_fw_stats_data_parse() and ath11k_wmi_tlv_rssi_chain_parse()
cannot find this station, then failed log printed.
steps are like this:
1. when disconnect from AP, __sta_info_destroy() called __sta_info_destroy_part1()
and __sta_info_destroy_part2().
2. in __sta_info_destroy_part1(), it has "sta_info_hash_del(local, sta)"
and "list_del_rcu(&sta->list)", it will remove the ieee80211_sta from the
list of ieee80211_hw.
3. in __sta_info_destroy_part2(), it called drv_sta_state()->ath11k_mac_op_sta_state(),
then peer->sta is clear at this moment.
4. in __sta_info_destroy_part2(), it then called sta_set_sinfo()->drv_sta_statistics()
->ath11k_mac_op_sta_statistics(), then WMI_REQUEST_STATS_CMDID sent to firmware.
5. WMI_UPDATE_STATS_EVENTID reported from firmware, at this moment, the
ieee80211_sta can not be found again because it has remove from list in
step2 and also peer->sta is clear in step3.
6. in __sta_info_destroy_part2(), it then called cleanup_single_sta()->
sta_info_free()->kfree(sta), at this moment, the ieee80211_sta is freed
in memory, then the failed log will not happen because function
ath11k_mac_op_sta_state() will not be called.
Actually this print log is not a real error, it is only to skip parse the
info, so change to skip print by default debug setting.
Context tracking call must be done after hardirq tracking call,
otherwise lockdep_assert_irqs_disabled called from rcu_eqs_exit gives
a warning. To avoid context tracking logic duplication for IRQ/exception
entry paths move trace_hardirqs_off call back to common entry code.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, dpu_hw_lm_collect_misr returns EINVAL if CRC is disabled.
This causes a lot of spam in the DRM debug logs as it's called for every
vblank.
Instead of returning EINVAL when CRC is disabled in
dpu_hw_lm_collect_misr, let's return ENODATA and add an extra ENODATA check
before the debug log in dpu_crtc_get_crc.
Changes since V1:
- Added reported-by and suggested-by tags
If a compat process tries to execute an unknown system call above the
__ARM_NR_COMPAT_END number, the kernel sends a SIGILL signal to the
offending process. Information about the error is printed to dmesg in
compat_arm_syscall() -> arm64_notify_die() -> arm64_force_sig_fault() ->
arm64_show_signal().
arm64_show_signal() interprets a non-zero value for
current->thread.fault_code as an exception syndrome and displays the
message associated with the ESR_ELx.EC field (bits 31:26).
current->thread.fault_code is set in compat_arm_syscall() ->
arm64_notify_die() with the bad syscall number instead of a valid ESR_ELx
value. This means that the ESR_ELx.EC field has the value that the user set
for the syscall number and the kernel can end up printing bogus exception
messages*. For example, for the syscall number 0x68000000, which evaluates
to ESR_ELx.EC value of 0x1A (ESR_ELx_EC_FPAC) the kernel prints this error:
which is misleading, as the bad compat syscall has nothing to do with
pointer authentication.
Stop arm64_show_signal() from printing exception syndrome information by
having compat_arm_syscall() set the ESR_ELx value to 0, as it has no
meaning for an invalid system call number. The example above now becomes:
which although shows less information because the syscall number,
wrongfully advertised as the ESR value, is missing, it is better than
showing plainly wrong information. The syscall number can be easily
obtained with strace.
*A 32-bit value above or equal to 0x8000_0000 is interpreted as a negative
integer in compat_arm_syscal() and the condition scno < __ARM_NR_COMPAT_END
evaluates to true; the syscall will exit to userspace in this case with the
ENOSYS error code instead of arm64_notify_die() being called.
Double free crash is observed when FW recovery(caused by wmi
timeout/crash) is followed by immediate suspend event. The FW recovery
is triggered by ath10k_core_restart() which calls driver clean up via
ath10k_halt(). When the suspend event occurs between the FW recovery,
the restart worker thread is put into frozen state until suspend completes.
The suspend event triggers ath10k_stop() which again triggers ath10k_halt()
The double invocation of ath10k_halt() causes ath10k_htt_rx_free() to be
called twice(Note: ath10k_htt_rx_alloc was not called by restart worker
thread because of its frozen state), causing the crash.
To fix this, during the suspend flow, skip call to ath10k_halt() in
ath10k_stop() when the current driver state is ATH10K_STATE_RESTARTING.
Also, for driver state ATH10K_STATE_RESTARTING, call
ath10k_wait_for_suspend() in ath10k_stop(). This is because call to
ath10k_wait_for_suspend() is skipped later in
[ath10k_halt() > ath10k_core_stop()] for the driver state
ATH10K_STATE_RESTARTING.
The frozen restart worker thread will be cancelled during resume when the
device comes out of suspend.
Reported-by: kernel test robot <lkp@intel.com> CC: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This commit fixes the deadlock warning by ensuring pm.mutex is not
held while holding the topology lock. For this, kfd_local_mem_info
is moved into the KFD dev struct and filled during device init.
This cached value can then be used instead of querying the value
again and again.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
While the check for format_count > 64 in __drm_universal_plane_init()
shouldn't be hit (it's a WARN_ON), in its current position it will then
leak the plane->format_types array and fail to call
drm_mode_object_unregister() leaking the modeset identifier. Move it to
the start of the function to avoid allocating those resources in the
first place.
Add a quirk for the HP Pro Tablet 408, this BYTCR tablet has no CHAN
package in its ACPI tables and uses SSP0-AIF1 rather then SSP0-AIF2 which
is the default for BYTCR devices.
It also uses DMIC1 for the internal mic rather then the default IN3
and it uses JD2 rather then the default JD1 for jack-detect.
If no handler is found in lpfc_complete_unsol_iocb() to match the rctl of a
received frame, the frame is dropped and resources are leaked.
Fix by returning resources when discarding an unhandled frame type. Update
lpfc_fc_frame_check() handling of NOP basic link service.
Link: https://lore.kernel.org/r/20220426181419.9154-1-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Using pm_runtime_resume_and_get() to replace pm_runtime_get_sync() and
pm_runtime_put_noidle(). This change is just to simplify the code, no
actual functional changes.
This patch fixes the issue where the driver miscomputes the 64-bit
values of the wptr of the SDMA doorbell when initializing the
hardware. SDMA engines v4 and later on have full 64-bit registers for
wptr thus they should be set properly.
Older generation hardwares like CIK / SI have only 16 / 20 / 24bits
for the WPTR, where the calls of lower_32_bits() will be removed in a
following patch.
Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Haohui Mai <ricetons@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If major equals 0, register_chrdev() returns an error code when it fails.
This function dynamically allocates a major and returns its number on
success, so we should use "< 0" to check it instead of "!".
The ARASAN MMC controller on Keystone 3 class of devices need the SDCD
line to be connected for proper functioning. Similar to the issue pointed
out in sdhci-of-arasan.c driver, commit 3794c542641f ("mmc:
sdhci-of-arasan: Set controller to test mode when no CD bit").
In cases where this can't be connected, add a quirk to force the
controller into test mode and set the TESTCD bit. Use the flag
"ti,fails-without-test-cd", to implement this above quirk when required.
Do what is done in other DMA-enabled MMC host drivers (cf. host/mmci.c) and
limit the maximum segment size based on the DMA engine's capabilities. This
is needed to avoid warnings like the following with CONFIG_DMA_API_DEBUG=y.
If bitmap area contains invalid data, kernel will crash then mdadm
triggers "Segmentation fault".
This is cluster-md speical bug. In non-clustered env, mdadm will
handle broken metadata case. In clustered array, only kernel space
handles bitmap slot info. But even this bug only happened in clustered
env, current sanity check is wrong, the code should be changed.
In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't
block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T()
trigger "divide error".
When the driver fails to call the dma_set_mask(), the driver will get
the following splat:
[ 55.853884] BUG: KASAN: use-after-free in __process_removed_driver+0x3c/0x240
[ 55.854486] Read of size 8 at addr ffff88810de60408 by task modprobe/590
[ 55.856822] Call Trace:
[ 55.860327] __process_removed_driver+0x3c/0x240
[ 55.861347] bus_for_each_dev+0x102/0x160
[ 55.861681] i2c_del_driver+0x2f/0x50
This is because the driver has initialized the i2c related resources
in cx23885_dev_setup() but not released them in error handling, fix this
bug by modifying the error path that jumps after failing to call the
dma_set_mask().
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
During reconfig (DRC) event from firmware, it is not guaranteed that
all the DPB(internal) buffers would be released by the firmware. Some
buffers might be released gradually while processing frames from the
new sequence. These buffers now stay idle in the dpblist.
In subsequent call to queue the DPBs to firmware, these idle buffers
should not be queued. The fix identifies those buffers and free them.
If venus_probe fails at pm_runtime_put_sync the error handling first
calls hfi_destroy and afterwards hfi_core_deinit. As hfi_destroy sets
core->ops to NULL, hfi_core_deinit cannot call the core_deinit function
anymore.
Avoid this null pointer derefence by skipping the call when necessary.
This patch fixes an invalid TX PA DC bias level on QCA9561, which
results in a very low output power and very low throughput as devices
are further away from the AP (compared to other 2.4GHz APs).
This patch was suggested by Felix Fietkau, who noted[1]:
"The value written to that register is wrong, because while the mask
definition AR_CH0_TOP2_XPABIASLVL uses a different value for 9561, the
shift definition AR_CH0_TOP2_XPABIASLVL_S is hardcoded to 12, which is
wrong for 9561."
In real life testing, without this patch the 2.4GHz throughput on
Yuncore XD3200 is around 10Mbps sitting next to the AP, and closer to
practical maximum with the patch applied.
In the macb binding documentation "phys" is an optional property. Make
implementation in line with it. This change allows the traditional flow
in which first stage bootloader does PS-GT configuration to work along
with newer use cases in which PS-GT configuration is managed by the
phy-zynqmp driver.
It fixes below macb probe failure when macb DT node doesn't have SGMII
phys handle.
"macb ff0b0000.ethernet: error -ENODEV: failed to get PS-GTR PHY"
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Reviewed-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
commit cfb7b8bf1e2d66 ("ASoC: rsnd: tidyup
rsnd_ssiu_busif_err_status_clear()") merged duplicate code, but it didn't
care about default case, and causes smatch warnings.
smatch warnings:
sound/soc/sh/rcar/ssiu.c:112 rsnd_ssiu_busif_err_status_clear() \
error: uninitialized symbol 'offset'.
sound/soc/sh/rcar/ssiu.c:114 rsnd_ssiu_busif_err_status_clear() \
error: uninitialized symbol 'shift'.
This patch cares it.
Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Link: https://lore.kernel.org/r/87r15rgn6p.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In function si_parse_power_table(), array adev->pm.dpm.ps and its member
is allocated. If the allocation of each member fails, the array itself
is freed and returned with an error code. However, the array is later
freed again in si_dpm_fini() function which is called when the function
returns an error.
This leads to potential double free of the array adev->pm.dpm.ps, as
well as leak of its array members, since the members are not freed in
the allocation function and the array is not nulled when freed.
In addition adev->pm.dpm.num_ps, which keeps track of the allocated
array member, is not updated until the member allocation is
successfully finished, this could also lead to either use after free,
or uninitialized variable access in si_dpm_fini().
Fix this by postponing the free of the array until si_dpm_fini() and
increment adev->pm.dpm.num_ps everytime the array member is allocated.
It needs to check if the pp_funcs is initialized while release the
context, otherwise it will trigger null pointer panic while the software
smu is not enabled.
When PSCI OSI mode is supported the syscore flag is set for the CPU devices
that becomes attached to their PM domains (genpds). In the suspend-to-idle
case, we call dev_pm_genpd_suspend|resume() to allow genpd to properly
manage the power-off/on operations (pick an idlestate and manage the on/off
notifications).
For suspend-to-ram, dev_pm_genpd_suspend|resume() is currently not being
called, which causes a problem that the genpd on/off notifiers do not get
sent as expected. This prevents the platform-specific operations from being
executed, typically needed just before/after the boot CPU is being turned
off/on.
To deal with this problem, let's register a syscore ops for cpuidle-psci
when PSCI OSI mode is being used and call dev_pm_genpd_suspend|resume()
from them. In this way, genpd regains control of the PM domain topology and
then sends the on/off notifications when it's appropriate.
Reported-by: Maulik Shah <quic_mkshah@quicinc.com> Suggested-by: Maulik Shah <quic_mkshah@quicinc.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Maulik Shah <quic_mkshah@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
this_cpu_ptr() calls smp_processor_id() in a preemptible context.
Fix by using per_cpu_ptr() with raw_smp_processor_id() instead.
Link: https://lore.kernel.org/r/20220412222008.126521-16-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
There is a potential memory leak in lpfc_ignore_els_cmpl() and
lpfc_els_rsp_reject() that was allocated from NPIV PLOGI_RJT
(lpfc_rcv_plogi()'s login_mbox).
Check if cmdiocb->context_un.mbox was allocated in lpfc_ignore_els_cmpl(),
and then free it back to phba->mbox_mem_pool along with mbox->ctx_buf for
service parameters.
For lpfc_els_rsp_reject() failure, free both the ctx_buf for service
parameters and the login_mbox.
Link: https://lore.kernel.org/r/20220412222008.126521-10-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If lpfc_issue_els_flogi() fails and returns non-zero status, the node
reference count is decremented to trigger the release of the nodelist
structure. However, if there is a prior registration or dev-loss-evt work
pending, the node may be released prematurely. When dev-loss-evt
completes, the released node is referenced causing a use-after-free null
pointer dereference.
Similarly, when processing non-zero ELS PLOGI completion status in
lpfc_cmpl_els_plogi(), the ndlp flags are checked for a transport
registration before triggering node removal. If dev-loss-evt work is
pending, the node may be released prematurely and a subsequent call to
lpfc_dev_loss_tmo_handler() results in a use after free ndlp dereference.
Add test for pending dev-loss before decrementing the node reference count
for FLOGI, PLOGI, PRLI, and ADISC handling.
Link: https://lore.kernel.org/r/20220412222008.126521-9-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix by reordering the taking of the lpfc_cmd->buf_lock and phba->hbalock in
lpfc_abort_handler routine so that it tries to take the lpfc_cmd->buf_lock
first before phba->hbalock.
Link: https://lore.kernel.org/r/20220412222008.126521-7-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The same CPU tries to claim the phba->port_list_lock twice.
Move the cfg_log_verbose checks as part of the lpfc_printf_vlog() and
lpfc_printf_log() macros before calling lpfc_dmp_dbg(). There is no need
to take the phba->port_list_lock within lpfc_dmp_dbg().
Link: https://lore.kernel.org/r/20220412222008.126521-3-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Ensure that the lo_device which is stored in the gendisk private
data is valid until the gendisk is freed. Currently the loop driver
uses a lot of effort to make sure a device is not freed when it is
still in use, but to to fix a potential deadlock this will be relaxed
a bit soon.
The direction field in the DMA config is deprecated. The rspi driver
sets {src,dst}_{addr,addr_width} based on the DMA direction and
it results in dmaengine_slave_config() failure as RZ DMAC driver
validates {src,dst}_addr_width values independent of DMA direction.
This patch fixes the issue by passing both {src,dst}_{addr,addr_width}
values independent of DMA direction.
Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Suggested-by: Vinod Koul <vkoul@kernel.org> Reviewed-by: Vinod Koul <vkoul@kernel.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20220411173115.6619-1-biju.das.jz@bp.renesas.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
It is possible when using ASoC that input_dev is unregistered while
calling snd_jack_report, which causes NULL pointer dereference.
In order to prevent this serialize access to input_dev using mutex lock.
The TASKS_RUDE_RCU does not select IRQ_WORK, which can result in build
failures for kernels that do not otherwise select IRQ_WORK. This commit
therefore causes the TASKS_RUDE_RCU Kconfig option to select IRQ_WORK.
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
If the cpu_possible_mask is sparse (for example, if bits are set only for
CPUs 0, 4, 8, ...), then rcu_tasks_invoke_cbs() will access per-CPU data
for a CPU not in cpu_possible_mask. It makes these accesses while doing
a workqueue-based binary search for non-empty callback lists. Although
this search must pass through CPUs not represented in cpu_possible_mask,
it has no need to check the callback list for such CPUs.
This commit therefore changes the rcu_tasks_invoke_cbs() function's
binary search so as to only check callback lists for CPUs present in
cpu_possible_mask.
Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This commit therefore fixes this issue by applying a single-CPU
optimization to the RCU Tasks Rude grace-period process. The key point
here is that the purpose of this RCU flavor is to force a schedule on
each online CPU since some past event. But the rcu_tasks_rude_wait_gp()
function runs in the context of the RCU Tasks Rude's grace-period kthread,
so there must already have been a context switch on the current CPU since
the call to either synchronize_rcu_tasks_rude() or call_rcu_tasks_rude().
So if there is only a single CPU online, RCU Tasks Rude's grace-period
kthread does not need to anything at all.
It turns out that the rcu_tasks_rude_wait_gp() function's call to
schedule_on_each_cpu() causes problems during early boot. During that
time, there is only one online CPU, namely the boot CPU. Therefore,
applying this single-CPU optimization fixes early-boot instances of
this problem.
If drm_universal_plane_init() fails early we jump to the common cleanup code
that calls komeda_plane_destroy() which in turn could access the uninitalised
drm_plane and crash. Return early if an error is detected without going through
the common code.
Using an ath9k card the debugfs output of minstrel_ht looks like the following
(note the zero values for the first four rates sum-of success/attempts):
Total packet count:: ideal 16582 lookaround 885
Average # of aggregated frames per A-MPDU: 1.0
Debugging showed that the rate statistics for the first four rates where
stored in the MINSTREL_CCK_GROUP instead of the MINSTREL_OFDM_GROUP because
in minstrel_ht_get_stats() the supported check was not honoured as done in
various other places, e.g net/mac80211/rc80211_minstrel_ht_debugfs.c:
Background:
Libbpf automatically replaces calls to BPF bpf_probe_read_{kernel,user}
[_str]() helpers with bpf_probe_read[_str](), if libbpf detects that
kernel doesn't support new APIs. Specifically, libbpf invokes the
probe_kern_probe_read_kernel function to load a small eBPF program into
the kernel in which bpf_probe_read_kernel API is invoked and lets the
kernel checks whether the new API is valid. If the loading fails, libbpf
considers the new API invalid and replaces it with the old API.
Bug:
On older kernel versions [0], the kernel checks whether the version
number provided in the bpf syscall, matches the LINUX_VERSION_CODE.
If not matched, the bpf syscall fails. eBPF However, the
probe_kern_probe_read_kernel code does not set the kernel version
number provided to the bpf syscall, which causes the loading process
alwasys fails for old versions. It means that libbpf will replace the
new API with the old one even the kernel supports the new one.
Solution:
After a discussion in [1], the solution is using BPF_PROG_TYPE_TRACEPOINT
program type instead of BPF_PROG_TYPE_KPROBE because kernel does not
enfoce version check for tracepoint programs. I test the patch in old
kernels (4.18 and 4.19) and it works well.
While running inside virtual machine, the kernel can bypass cache
flushing. Changing sleep state in a virtual machine doesn't affect the
host system sleep state and cannot lead to data loss.
Before entering sleep states, the ACPI code flushes caches to prevent
data loss using the WBINVD instruction. This mechanism is required on
bare metal.
But, any use WBINVD inside of a guest is worthless. Changing sleep
state in a virtual machine doesn't affect the host system sleep state
and cannot lead to data loss, so most hypervisors simply ignore it.
Despite this, the ACPI code calls WBINVD unconditionally anyway.
It's useless, but also normally harmless.
In TDX guests, though, WBINVD stops being harmless; it triggers a
virtualization exception (#VE). If the ACPI cache-flushing WBINVD
were left in place, TDX guests would need handling to recover from
the exception.
Avoid using WBINVD whenever running under a hypervisor. This both
removes the useless WBINVDs and saves TDX from implementing WBINVD
handling.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-30-kirill.shutemov@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The UV5 platform synchronizes the TSCs among all chassis, and will not
proceed to OS boot without achieving synchronization. Previous UV
platforms provided a register indicating successful synchronization.
This is no longer available on UV5. On this platform TSC_ADJUST
should not be reset by the kernel.
Signed-off-by: Mike Travis <mike.travis@hpe.com> Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220406195149.228164-3-steve.wahl@hpe.com Signed-off-by: Sasha Levin <sashal@kernel.org>
This shouldn't be a problem in practice since until we've actually
taken over the console there's nothing we've registered with the
console/vt subsystem, so the exit/unbind path that check this can't
do the wrong thing. But it's confusing, so fix it by moving it a tad
later.
Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Du Cheng <ducheng2@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Claudio Suarez <cssk@net-c.es> Cc: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20220405210335.3434130-14-daniel.vetter@ffwll.ch Signed-off-by: Sasha Levin <sashal@kernel.org>
idev->addr_list needs to be protected by idev->lock. However, it is not
always possible to do so while iterating and performing actions on
inet6_ifaddr instances. For example, multiple functions (like
addrconf_{join,leave}_anycast) eventually call down to other functions
that acquire the idev->lock. The current code temporarily unlocked the
idev->lock during the loops, which can cause race conditions. Moving the
locks up is also not an appropriate solution as the ordering of lock
acquisition will be inconsistent with for example mc_lock.
This solution adds an additional field to inet6_ifaddr that is used
to temporarily add the instances to a temporary list while holding
idev->lock. The temporary list can then be traversed without holding
idev->lock. This change was done in two places. In addrconf_ifdown, the
list_for_each_entry_safe variant of the list loop is also no longer
necessary as there is no deletion within that specific loop.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220403231523.45843-1-dossche.niels@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
cfg80211_ch_switch_notify uses ASSERT_WDEV_LOCK to assert that
net_device->ieee80211_ptr->mtx (which is the same as priv->wdev.mtx)
is held during the function's execution.
mwifiex_dfs_chan_sw_work_queue is one of its callers, which does not
hold that lock, therefore violating the assertion.
Add a lock around the call.
Disclaimer:
I am currently working on a static analyser to detect missing locks.
This was a reported case. I manually verified the report by looking
at the code, so that I do not send wrong information or patches.
After concluding that this seems to be a true positive, I created
this patch.
However, as I do not in fact have this particular hardware,
I was unable to test it.
The CAM, meaning address CAM and bssid CAM here, will get leaks during
SER (system error recover) L2 reset process and ieee80211_restart_hw()
which is called by L2 reset process eventually.
The normal flow would be like
-> add interface (acquire 1)
-> enter ips (release 1)
-> leave ips (acquire 1)
-> connection (occupy 1) <(A) 1 leak after L2 reset if non-sec connection>
Originally, CAM is released before HW restart only if connection is under
security. Now, release CAM whatever connection it is to fix leak in (A).
OTOH, check if CAM is already valid to avoid acquiring multiple times to
fix (B).
Besides, if AP mode, release address CAM of all stations before HW restart.
Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20220314071250.40292-2-pkshih@realtek.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The maximum number of SSIDs in a for active probe requests is currently
reported as 16 (WLAN_SCAN_PARAMS_MAX_SSID) when registering the driver.
The scan_req_params structure only has the capacity to hold 10 SSIDs.
This leads to a buffer overflow which can be triggered from
wpa_supplicant in userspace. When copying the SSIDs into the
scan_req_params structure in the ath11k_mac_op_hw_scan route, it can
overwrite the extraie pointer.
Firmware supports 16 ssid * 4 bssid, for each ssid 4 bssid combo probe
request will be sent, so totally 64 probe requests supported. So
set both max ssid and bssid to 16 and 4 respectively. Remove the
redundant macros of ssid and bssid.
The script for checking that various lists of types in bpftool remain in
sync with the UAPI BPF header uses a regex to parse enum bpf_prog_type.
If this enum contains a set of values different from the list of program
types in bpftool, it complains.
This script should have reported the addition, some time ago, of the new
BPF_PROG_TYPE_SYSCALL, which was not reported to bpftool's program types
list. It failed to do so, because it failed to parse that new type from
the enum. This is because the new value, in the BPF header, has an
explicative comment on the same line, and the regex does not support
that.
Let's update the script to support parsing enum values when they have
comments on the same line.
Since commit 6521f8917082 ("namei: prepare for idmapped mounts")
vfs_link's prototype was changed, the kprobe definition in
profiler selftest in turn wasn't updated. The result is that all
argument after the first are now stored in different registers. This
means that self-test has been broken ever since. Fix it by updating the
kprobe definition accordingly.
When test device recovery with below command, it has warning in message
as below.
echo assert > /sys/kernel/debug/ath11k/wcn6855\ hw2.0/simulate_fw_crash
echo assert > /sys/kernel/debug/ath11k/qca6390\ hw2.0/simulate_fw_crash
Reason is mhi_deassert_dev_wake() from mhi_device_put() is called
but mhi_assert_dev_wake() from __mhi_device_get_sync() is not called
in progress of recovery. Commit 8e0559921f9a ("bus: mhi: core:
Skip device wake in error or shutdown state") add check for the
pm_state of mhi in __mhi_device_get_sync(), and the pm_state is not
the normal state untill recovery is completed, so it leads the
dev_wake is not 0 and above warning print in mhi_pm_disable_transition()
while checking mhi_cntrl->dev_wake.
Add check in ath11k_pci_write32()/ath11k_pci_read32() to skip call
mhi_device_put() if mhi_device_get_sync() does not really do wake,
then the warning gone.
The kms code wasn't validating the modifiers and was letting through
unsupported formats. rgb8 was never properly supported and has no
matching svga screen target format so remove it.
This fixes format/modifier failures in kms_addfb_basic from IGT.
The VT-d spec requires (10.4.4 Global Command Register, TE
field) that:
Hardware implementations supporting DMA draining must drain
any in-flight DMA read/write requests queued within the
Root-Complex before completing the translation enable
command and reflecting the status of the command through
the TES field in the Global Status register.
Unfortunately, some integrated graphic devices fail to do
so after some kind of power state transition. As the
result, the system might stuck in iommu_disable_translati
on(), waiting for the completion of TE transition.
This adds RPLS to a quirk list for those devices and skips
TE disabling if the qurik hits.
The block_group->alloc_offset is an offset from the start of the block
group. OTOH, the ->meta_write_pointer is an address in the logical
space. So, we should compare the alloc_offset shifted with the
block_group->start.
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, btrfs_zone_finish_endio() finishes a block group only when the
written region reaches the end of the block group. We can also finish the
block group when no more allocation is possible.
While the active zones within an active block group are reset, and their
active resource is released, the block group itself is kept in the active
block group list and marked as active. As a result, the list will contain
more than max_active_zones block groups. That itself is not fatal for the
device as the zones are properly reset.
However, that inflated list is, of course, strange. Also, a to-appear
patch series, which deactivates an active block group on demand, gets
confused with the wrong list.
So, fix the issue by finishing the unused block group once it gets
read-only, so that we can release the active resource in an early stage.
Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit be1a1d7a5d24 ("btrfs: zoned: finish fully written block group")
introduced zone finishing code both for data and metadata end_io path.
However, the metadata side is not working as it should. First, it
compares logical address (eb->start + eb->len) with offset within a
block group (cache->zone_capacity) in submit_eb_page(). That essentially
disabled zone finishing on metadata end_io path.
Furthermore, fixing the issue above revealed we cannot call
btrfs_zone_finish_endio() in end_extent_buffer_writeback(). We cannot
call btrfs_lookup_block_group() which require spin lock inside end_io
context.
Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer
writeback and do the zone finish IO in a workqueue.
Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used.
Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When reserving data space for a direct IO write we can end up deadlocking
if we have multiple tasks attempting a write to the same file range, there
are multiple extents covered by that file range, we are low on available
space for data and the writes don't expand the inode's i_size.
The deadlock can happen like this:
1) We have a file with an i_size of 1M, at offset 0 it has an extent with
a size of 128K and at offset 128K it has another extent also with a
size of 128K;
2) Task A does a direct IO write against file range [0, 256K), and because
the write is within the i_size boundary, it takes the inode's lock (VFS
level) in shared mode;
3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
then gets the extent map for the extent covering the range [0, 128K).
At btrfs_get_blocks_direct_write(), it creates an ordered extent for
that file range ([0, 128K));
4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
range [0, 256K);
5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
range [128K, 256K), and locks the file range [128K, 256K);
6) Task B starts a direct IO write against file range [0, 256K) as well.
It also locks the inode in shared mode, as it's within the i_size limit,
and then tries to lock file range [0, 256K). It is able to lock the
subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
as it is currently locked by task A;
7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
space. Because we are low on available free space, it triggers the
async data reclaim task, and waits for it to reserve data space;
8) The async reclaim task decides to wait for all existing ordered extents
to complete (through btrfs_wait_ordered_roots()).
It finds the ordered extent previously created by task A for the file
range [0, 128K) and waits for it to complete;
9) The ordered extent for the file range [0, 128K) can not complete
because it blocks at btrfs_finish_ordered_io() when trying to lock the
file range [0, 128K).
This results in a deadlock, because:
- task B is holding the file range [0, 128K) locked, waiting for the
range [128K, 256K) to be unlocked by task A;
- task A is holding the file range [128K, 256K) locked and it's waiting
for the async data reclaim task to satisfy its space reservation
request;
- the async data reclaim task is waiting for ordered extent [0, 128K)
to complete, but the ordered extent can not complete because the
file range [0, 128K) is currently locked by task B, which is waiting
on task A to unlock file range [128K, 256K) and task A waiting
on the async data reclaim task.
This results in a deadlock between 4 task: task A, task B, the async
data reclaim task and the task doing ordered extent completion (a work
queue task).
This type of deadlock can sporadically be triggered by the test case
generic/300 from fstests, and results in a stack trace like the following:
Fix this issue by always reserving data space before locking a file range
at btrfs_dio_iomap_begin(). If we can't reserve the space, then we don't
error out immediately - instead after locking the file range, check if we
can do a NOCOW write, and if we can we don't error out since we don't need
to allocate a data extent, however if we can't NOCOW then error out with
-ENOSPC. This also implies that we may end up reserving space when it's
not needed because the write will end up being done in NOCOW mode - in that
case we just release the space after we noticed we did a NOCOW write - this
is the same type of logic that is done in the path for buffered IO writes.
Fixes: f0bfa76a11e93d ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range") CC: stable@vger.kernel.org # 5.17+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[BUG]
Test case generic/475 have a very high chance (almost 100%) to hit a fs
hang, where a data page will never be unlocked and hang all later
operations.
[CAUSE]
In btrfs_do_readpage(), if we hit an error from submit_extent_page() we
will try to do the cleanup for our current io range, and exit.
This works fine for PAGE_SIZE == sectorsize cases, but not for subpage.
For subpage btrfs_do_readpage() will lock the full page first, which can
contain several different sectors and extents:
btrfs_do_readpage()
|- begin_page_read()
| |- btrfs_subpage_start_reader();
| Now the page will have PAGE_SIZE / sectorsize reader pending,
| and the page is locked.
|
|- end_page_read() for different branches
| This function will reduce subpage readers, and when readers
| reach 0, it will unlock the page.
But when submit_extent_page() failed, we only cleanup the current
io range, while the remaining io range will never be cleaned up, and the
page remains locked forever.
[FIX]
Update the error handling of submit_extent_page() to cleanup all the
remaining subpage range before exiting the loop.
Please note that, now submit_extent_page() can only fail due to
sanity check in alloc_new_bio().
Thus regular IO errors are impossible to trigger the error path.
[BUG]
There is a report that a btrfs has a bad super block num devices.
This makes btrfs to reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus it will
only be written back to disk in next transaction.
So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happen, then
we got the super num mismatch.
This has been fixed by commit bbac58698a55 ("btrfs: remove device item
and update super block in the same transaction").
[FIX]
Make the super_num_devices check less strict, converting it from a hard
error to a warning, and reset the value to a correct one for the current
or next transaction commit.
As the number of device items is the critical information where the
super block num_devices is only a cached value (and also useful for
cross checking), it's safe to automatically update it. Other device
related problems like missing device are handled after that and may
require other means to resolve, like degraded mount. With this fix,
potentially affected filesystems won't fail mount and require the manual
repair by btrfs check.
[BUG]
If we hit an error from submit_extent_page() inside
__extent_writepage_io(), we could still return 0 to the caller, and
even trigger the warning in btrfs_page_assert_not_dirty().
[CAUSE]
In __extent_writepage_io(), if we hit an error from
submit_extent_page(), we will just clean up the range and continue.
This is completely fine for regular PAGE_SIZE == sectorsize, as we can
only hit one sector in one page, thus after the error we're ensured to
exit and @ret will be saved.
But for subpage case, we may have other dirty subpage range in the page,
and in the next loop, we may succeeded submitting the next range.
In that case, @ret will be overwritten, and we return 0 to the caller,
while we have hit some error.
[FIX]
Introduce @has_error and @saved_ret to record the first error we hit, so
we will never forget what error we hit.
The following error message lack the "0x" obviously:
cannot mount because of unsupported optional features (4000)
Add the prefix to make it less confusing. This can happen on older
kernels that try to mount a filesystem with newer features so it makes
sense to backport to older trees.
CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current implementation of PTRACE_KILL is buggy and has been for
many years as it assumes it's target has stopped in ptrace_stop. At a
quick skim it looks like this assumption has existed since ptrace
support was added in linux v1.0.
While PTRACE_KILL has been deprecated we can not remove it as
a quick search with google code search reveals many existing
programs calling it.
When the ptracee is not stopped at ptrace_stop some fields would be
set that are ignored except in ptrace_stop. Making the userspace
visible behavior of PTRACE_KILL a noop in those case.
As the usual rules are not obeyed it is not clear what the
consequences are of calling PTRACE_KILL on a running process.
Presumably userspace does not do this as it achieves nothing.
Replace the implementation of PTRACE_KILL with a simple
send_sig_info(SIGKILL) followed by a return 0. This changes the
observable user space behavior only in that PTRACE_KILL on a process
not stopped in ptrace_stop will also kill it. As that has always
been the intent of the code this seems like a reasonable change.
Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xtensa is the last user of the PT_SINGLESTEP flag. Changing tsk->ptrace in
user_enable_single_step and user_disable_single_step without locking could
potentiallly cause problems.
So use a thread info flag instead of a flag in tsk->ptrace. Use TIF_SINGLESTEP
that xtensa already had defined but unused.
Remove the definitions of PT_SINGLESTEP and PT_BLOCKSTEP as they have no more users.
User mode linux is the last user of the PT_DTRACE flag. Using the flag to indicate
single stepping is a little confusing and worse changing tsk->ptrace without locking
could potentionally cause problems.
So use a thread info flag with a better name instead of flag in tsk->ptrace.
Remove the definition PT_DTRACE as uml is the last user.
If kthread_is_per_cpu runs concurrently with free_kthread_struct the
kthread_struct that was just freed may be read from.
This bug was introduced by commit 40966e316f86 ("kthread: Ensure
struct kthread is present for all kthreads"). When kthread_struct
started to be allocated for all tasks that have PF_KTHREAD set. This
in turn required the kthread_struct to be freed in kernel_execve and
violated the assumption that kthread_struct will have the same
lifetime as the task.
Looking a bit deeper this only applies to callers of kernel_execve
which is just the init process and the user mode helper processes.
These processes really don't want to be kernel threads but are for
historical reasons. Mostly that copy_thread does not know how to take
a kernel mode function to the process with for processes without
PF_KTHREAD or PF_IO_WORKER set.
Solve this by not allocating kthread_struct for the init process and
the user mode helper processes.
This is done by adding a kthread member to struct kernel_clone_args.
Setting kthread in fork_idle and kernel_thread. Adding
user_mode_thread that works like kernel_thread except it does not set
kthread. In fork only allocating the kthread_struct if .kthread is set.
I have looked at kernel/kthread.c and since commit 40966e316f86
("kthread: Ensure struct kthread is present for all kthreads") there
have been no assumptions added that to_kthread or __to_kthread will
not return NULL.
There are a few callers of to_kthread or __to_kthread that assume a
non-NULL struct kthread pointer will be returned. These functions are
kthread_data(), kthread_parmme(), kthread_exit(), kthread(),
kthread_park(), kthread_unpark(), kthread_stop(). All of those functions
can reasonably expected to be called when it is know that a task is a
kthread so that assumption seems reasonable.
Cc: stable@vger.kernel.org Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин <maximkabox13@gmail.com> Link: https://lkml.kernel.org/r/20220506141512.516114-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the system runs out of enclave memory, SGX can reclaim EPC pages
by swapping to normal RAM. These backing pages are allocated via a
per-enclave shared memory area. Since SGX allows unlimited over
commit on EPC memory, the reclaimer thread can allocate a large
number of backing RAM pages in response to EPC memory pressure.
When the shared memory backing RAM allocation occurs during
the reclaimer thread context, the shared memory is charged to
the root memory control group, and the shmem usage of the enclave
is not properly accounted for, making cgroups ineffective at
limiting the amount of RAM an enclave can consume.
For example, when using a cgroup to launch a set of test
enclaves, the kernel does not properly account for 50% - 75% of
shmem page allocations on average. In the worst case, when
nearly all allocations occur during the reclaimer thread, the
kernel accounts less than a percent of the amount of shmem used
by the enclave's cgroup to the correct cgroup.
SGX stores a list of mm_structs that are associated with
an enclave. Pick one of them during reclaim and charge that
mm's memcg with the shmem allocation. The one that gets picked
is arbitrary, but this list almost always only has one mm. The
cases where there is more than one mm with different memcg's
are not worth considering.
Create a new function - sgx_encl_alloc_backing(). This function
is used whenever a new backing storage page needs to be
allocated. Previously the same function was used for page
allocation as well as retrieving a previously allocated page.
Prior to backing page allocation, if there is a mm_struct associated
with the enclave that is requesting the allocation, it is set
as the active memory control group.
[ dhansen: - fix merge conflict with ELDU fixes
- check against actual ksgxd_tsk, not ->mm ]
Cc: stable@vger.kernel.org Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lkml.kernel.org/r/20220520174248.4918-1-kristen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In crash_prepare_elf64_headers(), a buffer is allocated via vmalloc() to
store elf headers. While it's not freed back to system correctly when
kdump kernel is reloaded or unloaded. Then memory leak is caused. Fix it
by introducing x86 specific function arch_kimage_file_post_load_cleanup(),
and freeing the buffer there.
And also remove the incorrect elf header buffer freeing code. Before
calling arch specific kexec_file loading function, the image instance has
been initialized. So 'image->elf_headers' must be NULL. It doesn't make
sense to free the elf header buffer in the place.
Three different people have reported three bugs about the memory leak on
x86_64 inside Redhat.
Link: https://lkml.kernel.org/r/20220223113225.63106-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
According to the latest event list, the event encoding 0x55
INST_DECODED.DECODERS and 0x56 UOPS_DECODED.DEC0 are only available on
the first 4 counters. Add them into the event constraints table.
In mce_threshold_create_device(), if threshold_create_bank() fails, the
previously allocated threshold banks array @bp will be leaked because
the call to mce_threshold_remove_device() will not free it.
This happens because mce_threshold_remove_device() fetches the pointer
through the threshold_banks per-CPU variable but bp is written there
only after the bank creation is successful, and not before, when
threshold_create_bank() fails.
Add a helper which unwinds all the bank creation work previously done
and pass into it the previously allocated threshold banks array for
freeing.
intel_hid_dsm_fn_mask is a bit mask containing one bit for each function
index. Fix the function index check in intel_hid_evaluate_method
accordingly, which was missed in commit 97ab4516205e ("platform/x86:
intel-hid: fix _DSM function index handling").
Fixes: 97ab4516205e ("platform/x86: intel-hid: fix _DSM function index handling") Cc: stable@vger.kernel.org Signed-off-by: Michael Niewöhner <linux@mniewoehner.de> Link: https://lore.kernel.org/r/66f813f5bcc724a0f6dd5adefe6a9728dbe509e3.camel@mniewoehner.de Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Set "HCD_FLAG_DEFER_RH_REGISTER" to hcd->flags in xhci_run() to defer
registering primary roothub in usb_add_hcd() if xhci has two roothubs.
This will make sure both primary roothub and secondary roothub will be
registered along with the second HCD.
This is required for cold plugged USB devices to be detected in certain
PCIe USB cards (like Inateck USB card connected to AM64 EVM or J7200 EVM).
This patch has been added and reverted earier as it triggered a race
in usb device enumeration.
That race is now fixed in 5.16-rc3, and in stable back to 5.4
commit 6cca13de26ee ("usb: hub: Fix locking issues with address0_mutex")
commit 6ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0
race")
[minor rebase change, and commit message update -Mathias]
CC: stable@vger.kernel.org # 5.4+ Suggested-by: Mathias Nyman <mathias.nyman@linux.intel.com> Tested-by: Chris Chiu <chris.chiu@canonical.com> Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Link: https://lore.kernel.org/r/20220510091630.16564-3-kishon@ti.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
as this is the only way to make sure the region is allocated.
Fix the conditional that was wrong and only tried to make already
non-sparse files non-sparse.
Cc: stable@vger.kernel.org Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>