Based on more testing, commit 8ca5ee624b4c ("ARM: OMAP2+: Restore MPU
power domain if cpu_cluster_pm_enter() fails") is a poor fix for handling
cpu_cluster_pm_enter() returned errors.
We should not override the cpuidle states with a hardcoded PWRDM_POWER_ON
value. Instead, we should use a configured idle state that does not cause
the context to be lost. Otherwise we end up configuring a potentially
improper state for the MPUSS. We also want to update the returned state
index for the selected state.
Let's just select the highest power idle state C1 to ensure no context
loss is allowed on cpu_cluster_pm_enter() errors. With these changes we
can now unconditionally call omap4_enter_lowpower() for WFI like we did
earlier before commit 55be2f50336f ("ARM: OMAP2+: Handle errors for
cpu_pm"). And we can return the selected state index.
Fixes: 8f04aea048d5 ("ARM: OMAP2+: Restore MPU power domain if cpu_cluster_pm_enter() fails") Fixes: 55be2f50336f ("ARM: OMAP2+: Handle errors for cpu_pm") Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Bail out early from sysc_wait_softreset() just like we do in sysc_reset()
if there's no sysstatus srst_shift to fix a bogus resetdone warning on
enable as suggested by Grygorii Strashko <grygorii.strashko@ti.com>.
We do not currently handle resets for modules that need writing to the
sysstatus register. If we at some point add that, we also need to add
SYSS_QUIRK_RESETDONE_INVERTED flag for cpsw as the sysstatus bit is low
when reset is done as described in the am335x TRM "Table 14-202
SOFT_RESET Register Field Descriptions"
Fixes: d46f9fbec719 ("bus: ti-sysc: Use optional clocks on for enable and wait for softreset bit") Suggested-by: Grygorii Strashko <grygorii.strashko@ti.com> Acked-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit d46f9fbec719 ("bus: ti-sysc: Use optional clocks on for enable and
wait for softreset bit") started showing a "OCP softreset timed out"
warning on enable if the interconnect target module is not out of reset.
This caused the warning to be often triggered for i2c and hdq while the
devices are working properly.
Turns out that some interconnect target modules seem to have an unusable
reset status bits unless the module specific reset quirks are activated.
Let's just skip the reset status check for those modules as we only want
to activate the reset quirks when doing a reset, and not on enable. This
way we don't see the bogus "OCP softreset timed out" warnings during boot.
Fixes: d46f9fbec719 ("bus: ti-sysc: Use optional clocks on for enable and wait for softreset bit") Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When the switch is hardware reset, it reads the contents of the
EEPROM. This can contain instructions for programming values into
registers and to perform waits between such programming. Reading the
EEPROM can take longer than the 100ms mv88e6xxx_hardware_reset() waits
after deasserting the reset GPIO. So poll the EEPROM done bit to
ensure it is complete.
sysrq-t ends up invoking show_opcodes() for each task which tries to access
the user space code of other processes, which is obviously bogus.
It either manages to dump where the foreign task's regs->ip points to in a
valid mapping of the current task or triggers a pagefault and prints "Code:
Bad RIP value.". Both is just wrong.
Add a safeguard in copy_code() and check whether the @regs pointer matches
currents pt_regs. If not, do not even try to access it.
While at it, add commentary why using copy_from_user_nmi() is safe in
copy_code() even if the function name suggests otherwise.
When adding __user annotations in commit 2adf5352a34a, the
strncpy_from_user() function declaration for the
CONFIG_GENERIC_STRNCPY_FROM_USER case was missed. Fix it.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Message-Id: <20200831210937.17938-1-laurent.pinchart@ideasonboard.com> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This change switches rapl to use PMU_FORMAT_ATTR, and fixes two other
macros to use device_attribute instead of kobj_attribute to avoid
callback type mismatches that trip indirect call checking with Clang's
Control-Flow Integrity (CFI).
The cause of the problem is we have call chain lockdep_unregister_key()
-> <irq disabled by raw_local_irq_save()> lockdep_unlock() ->
arch_spin_unlock() -> __pv_queued_spin_unlock_slowpath() -> pv_kick() ->
__send_ipi_one() -> trace_hyperv_send_ipi_one().
Although this particular warning is triggered because Hyper-V has a
trace point in ipi sending, but in general arch_spin_unlock() may call
another function having a trace point in it, so put the arch_spin_lock()
and arch_spin_unlock() after lock_recursion protection to fix this
problem and avoid similiar problems.
Maurizio found a race where the abort and cmd stop paths can race as
follows:
1. thread1 runs iscsit_release_commands_from_conn and sets
CMD_T_FABRIC_STOP.
2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It
then returns from the aborted_task callout and we finish
target_handle_abort and do:
3. thread1 now finishes iscsit_release_commands_from_conn and runs
iscsit_free_cmd while accessing a command we just released.
In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the
CMD_T_ABORTED if the driver is not cleaning up the cmd because of a session
shutdown. However, iscsit_release_commands_from_conn only sets the
CMD_T_FABRIC_STOP and does not check to see if the abort path has claimed
completion ownership of the command.
This adds a check in iscsit_release_commands_from_conn so only the abort or
fabric stop path cleanup the command.
Link: https://lore.kernel.org/r/1605318378-9269-1-git-send-email-michael.christie@oracle.com Reported-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
iSCSI NOPs are sometimes "lost", mistakenly sent to the user-land iscsid
daemon instead of handled in the kernel, as they should be, resulting in a
message from the daemon like:
iscsid: Got nop in, but kernel supports nop handling.
This can occur because of the new forward- and back-locks, and the fact
that an iSCSI NOP response can occur before processing of the NOP send is
complete. This can result in "conn->ping_task" being NULL in
iscsi_nop_out_rsp(), when the pointer is actually in the process of being
set.
To work around this, we add a new state to the "ping_task" pointer. In
addition to NULL (not assigned) and a pointer (assigned), we add the state
"being set", which is signaled with an INVALID pointer (using "-1").
Link: https://lore.kernel.org/r/20201106193317.16993-1-leeman.duncan@gmail.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
dmatest: dma0chan0-copy0: result #1: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #2: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #3: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #4: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #5: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #6: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #7: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #8: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
Annotate tegra_pm_set[clear]_cpu_in_lp2() with RCU_NONIDLE in order to
fix lockdep warning about suspicious RCU usage of a spinlock during late
idling phase.
WARNING: suspicious RCU usage
...
include/trace/events/lock.h:13 suspicious rcu_dereference_check() usage!
...
(dump_stack) from (lock_acquire)
(lock_acquire) from (_raw_spin_lock)
(_raw_spin_lock) from (tegra_pm_set_cpu_in_lp2)
(tegra_pm_set_cpu_in_lp2) from (tegra_cpuidle_enter)
(tegra_cpuidle_enter) from (cpuidle_enter_state)
(cpuidle_enter_state) from (cpuidle_enter_state_coupled)
(cpuidle_enter_state_coupled) from (cpuidle_enter)
(cpuidle_enter) from (do_idle)
...
Tested-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Peter Geis <pgwipeout@gmail.com> Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
We might not do the final se_cmd put from vhost_scsi_complete_cmd_work.
When the last put happens a little later then we could race where
vhost_scsi_complete_cmd_work does vhost_signal, the guest runs and sends
more IO, and vhost_scsi_handle_vq runs but does not find any free cmds.
This patch has us delay completing the cmd until the last lio core ref
is dropped. We then know that once we signal to the guest that the cmd
is completed that if it queues a new command it will find a free cmd.
Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Maurizio Lombardi <mlombard@redhat.com> Link: https://lore.kernel.org/r/1604986403-4931-4-git-send-email-michael.christie@oracle.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
We currently are limited to 256 cmds per session. This leads to problems
where if the user has increased virtqueue_size to more than 2 or
cmd_per_lun to more than 256 vhost_scsi_get_tag can fail and the guest
will get IO errors.
This patch moves the cmd allocation to per vq so we can easily match
whatever the user has specified for num_queues and
virtqueue_size/cmd_per_lun. It also makes it easier to control how much
memory we preallocate. For cases, where perf is not as important and
we can use the current defaults (1 vq and 128 cmds per vq) memory use
from preallocate cmds is cut in half. For cases, where we are willing
to use more memory for higher perf, cmd mem use will now increase as
the num queues and queue depth increases.
Signed-off-by: Mike Christie <michael.christie@oracle.com> Link: https://lore.kernel.org/r/1604986403-4931-3-git-send-email-michael.christie@oracle.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Maurizio Lombardi <mlombard@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
This adds a helper check if a vq has been setup. The next patches
will use this when we move the vhost scsi cmd preallocation from per
session to per vq. In the per vq case, we only want to allocate cmds
for vqs that have actually been setup and not for all the possible
vqs.
Any attempt to do path resolution on /proc/self from an async worker will
yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
and without blocking, so retry it from there.
Ideally io_uring would know this upfront and not have to go through the
worker thread to find out, but that doesn't currently seem feasible.
If Doorbell Buffer Config command fails even 'dev->dbbuf_dbs != NULL'
which means OACS indicates that NVME_CTRL_OACS_DBBUF_SUPP is set,
nvme_dbbuf_update_and_check_event() will check event even it's not been
successfully set.
This patch fixes mismatch among dbbuf for sq/cqs in case that dbbuf
command fails.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
currently support that. Once we can get task_pid_ptr() doing the right
thing, then this can go away again.
The battery status is also being reported by the logitech-hidpp driver,
so ignore the standard HID battery status to avoid reporting the same
info twice.
Note the logitech-hidpp battery driver provides more info, such as properly
differentiating between charging and discharging. Also the standard HID
battery info seems to be wrong, reporting a capacity of just 26% after
fully charging the device.
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Like the MX5000 and MX5500 quad/bluetooth keyboards the Dinovo Edge also
needs the HIDPP_CONSUMER_VENDOR_KEYS quirk for some special keys to work.
Specifically without this the "Phone" and the 'A' - 'D' Smart Keys do not
send any events.
In addition to fixing these keys not sending any events, adding the
Bluetooth match, so that hid-logitech-hidpp is used instead of the
generic HID driver, also adds battery monitoring support when the
keyboard is connected over Bluetooth.
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently the following expectation
KUNIT_EXPECT_STREQ(test, "hi", "bye");
will produce:
Expected "hi" == "bye", but
"hi" == 1625079497
"bye" == 1625079500
After this patch:
Expected "hi" == "bye", but
"hi" == hi
"bye" == bye
KUNIT_INIT_BINARY_STR_ASSERT_STRUCT() was written but just mistakenly
not actually used by KUNIT_EXPECT_STREQ() and friends.
The secondary CPUs are not activated with the nosmt mitigations and only
the primary thread on each CPU core is used. In this situation,
xen_hvm_smp_prepare_cpus(), and more importantly xen_init_lock_cpu(), is
not called, so the lock_kicker_irq is not initialized for the secondary
CPUs. Let's fix this by exiting early in xen_uninit_lock_cpu() if the
irq is not set to avoid the warning from above for each secondary CPU.
Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20201107011119.631442-1-bmasney@redhat.com Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The xilinx_dma_poll_timeout macro is sometimes called while holding a
spinlock (see xilinx_dma_issue_pending() for an example) this means we
shouldn't sleep when polling the dma channel registers. To address it
in xilinx poll timeout macro use readl_poll_timeout_atomic instead of
readl_poll_timeout variant.
When DMA_RALINK is enabled and DMADEVICES is disabled, it results in the
following Kbuild warnings:
WARNING: unmet direct dependencies detected for DMA_ENGINE
Depends on [n]: DMADEVICES [=n]
Selected by [y]:
- DMA_RALINK [=y] && STAGING [=y] && RALINK [=y] && !SOC_RT288X [=n]
WARNING: unmet direct dependencies detected for DMA_VIRTUAL_CHANNELS
Depends on [n]: DMADEVICES [=n]
Selected by [y]:
- DMA_RALINK [=y] && STAGING [=y] && RALINK [=y] && !SOC_RT288X [=n]
The reason is that DMA_RALINK selects DMA_ENGINE and DMA_VIRTUAL_CHANNELS
without depending on or selecting DMADEVICES while DMA_ENGINE and
DMA_VIRTUAL_CHANNELS are subordinate to DMADEVICES. This can also fail
building the kernel as demonstrated in a bug report.
Honor the kconfig dependency to remove unmet direct dependency warnings
and avoid any potential build failures.
Some HID devices don't use a report ID because they only have a single
report. In those cases, the report ID in struct hid_report will be zero
and the data for the report will start at the first byte, so don't skip
over the first byte.
The i8042 module exports several symbols which may be used by other
modules.
Before this commit it would refuse to load (when built as a module itself)
on systems without an i8042 controller.
This is a problem specifically for the asus-nb-wmi module. Many Asus
laptops support the Asus WMI interface. Some of them have an i8042
controller and need to use i8042_install_filter() to filter some kbd
events. Other models do not have an i8042 controller (e.g. they use an
USB attached kbd).
Before this commit the asus-nb-wmi driver could not be loaded on Asus
models without an i8042 controller, when the i8042 code was built as
a module (as Arch Linux does) because the module_init function of the
i8042 module would fail with -ENODEV and thus the i8042_install_filter
symbol could not be loaded.
This commit fixes this by exiting from module_init with a return code
of 0 if no controller is found. It also adds a i8042_present bool to
make the module_exit function a no-op in this case and also adds a
check for i8042_present to the exported i8042_command function.
The latter i8042_present check should not really be necessary because
when builtin that function can already be used on systems without
an i8042 controller, but better safe then sorry.
Reported-and-tested-by: Marius Iacob <themariusus@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20201008112628.3979-2-hdegoede@redhat.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The Varmilo VA104M Keyboard (04b4:07b1, reported as Varmilo Z104M)
exposes media control hotkeys as a USB HID consumer control device, but
these keys do not work in the current (5.8-rc1) kernel due to the
incorrect HID report descriptor. Fix the problem by modifying the
internal HID report descriptor.
More specifically, the keyboard report descriptor specifies the
logical boundary as 572~10754 (0x023c ~ 0x2a02) while the usage
boundary is specified as 0~10754 (0x00 ~ 0x2a02). This results in an
incorrect interpretation of input reports, causing inputs to be ignored.
By setting the Logical Minimum to zero, we align the logical boundary
with the Usage ID boundary.
Some notes:
* There seem to be multiple variants of the VA104M keyboard. This
patch specifically targets 04b4:07b1 variant.
* The device works out-of-the-box on Windows platform with the generic
consumer control device driver (hidserv.inf). This suggests that
Windows either ignores the Logical Minimum/Logical Maximum or
interprets the Usage ID assignment differently from the linux
implementation; Maybe there are other devices out there that only
works on Windows due to this problem?
Signed-off-by: Frank Yang <puilp0502@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
The usb-hid keyboard-dock for the Acer Switch 10 SW5-012 model declares
an application and hid-usage page of 0x0088 for the INPUT(4) report which
it sends. This reports contains 2 8-bit fields which are declared as
HID_MAIN_ITEM_VARIABLE.
The keyboard-touchpad combo never actually generates this report, except
when the touchpad is toggled on/off with the Fn + F7 hotkey combo. The
toggle on/off is handled inside the keyboard-dock, when the touchpad is
toggled off it simply stops sending events.
When the touchpad is toggled on/off an INPUT(4) report is generated with
the first content byte set to 120/121, before this commit the kernel
would report this as ABS_MISC 120/121 events.
Patch the descriptor to replace the HID_MAIN_ITEM_VARIABLE with
HID_MAIN_ITEM_RELATIVE (because no key-presss release events are send)
and add mappings for the 0x00880078 and 0x00880079 usages to generate
touchpad on/off key events when the touchpad is toggled on/off.
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
The Trust Flex Design Tablet has an UGTizer USB ID and requires the same
initialization as the UGTizer GP0610 to be detected as a graphics tablet
instead of a mouse.
Signed-off-by: Martijn van de Streek <martijn@zeewinde.xyz> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
The HDCP feature requires at least one connector attached to the device;
however, some GPUs do not have a physical output, making the HDCP
initialization irrelevant. This patch disables HDCP initialization when
the graphic card does not have output.
Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The UVD firmware is copied to cpu addr in uvd_resume, so it
should be used after that. This is to fix a bug introduced by
patch drm/amdgpu: fix SI UVD firmware validate resume fail.
Signed-off-by: Sonny Jiang <sonny.jiang@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> CC: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With hardware dirty bit management, calling pte_wrprotect() on a writable,
dirty PTE will lose the dirty state and return a read-only, clean entry.
Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that
the dirty bit is preserved for writable entries, as this is required for
soft-dirty bit management if we enable it in the future.
Cc: <stable@vger.kernel.org> Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits") Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20201120143557.6715-3-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
invalidation is necessary when unmapping pages for reclaim. Although our
implementation is correct according to the architecture, returning true
only for valid, young ptes in the absence of racing page-table
modifications, this is in fact flawed due to lazy invalidation of old
ptes in ptep_clear_flush_young() where we elide the expensive DSB
instruction for completing the TLB invalidation.
Rather than penalise the aging path, adjust pte_accessible() to return
true for any valid pte, even if the access flag is cleared.
USB host mode is broken on the OTG port of Jetson TX1 platform because
the USB_VBUS_EN0 regulator (regulator@11) is being overwritten by the
vdd-cam-1v2 regulator. This commit rearranges USB_VBUS_EN0 to be
regulator@14.
Fixes: 257c8047be44 ("arm64: tegra: jetson-tx1: Add camera supplies") Cc: stable@vger.kernel.org Signed-off-by: JC Kuo <jckuo@nvidia.com> Reviewed-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Jetson Xavier NX board routes UARTA to the 40-pin header and UARTC
to a 12-pin debug header. The UARTs can be used by either the Tegra
Combined UART (TCU) driver or the Tegra 8250 driver. By default, the
TCU will use UARTC on Jetson Xavier NX. Currently, device-tree for
Xavier NX enables the TCU and the Tegra 8250 node for UARTC. Fix this
by disabling the Tegra 8250 node for UARTC and enabling the Tegra 8250
node for UARTA.
Fixes: 3f9efbbe57bc ("arm64: tegra: Add support for Jetson Xavier NX") Cc: stable@vger.kernel.org Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The SI UVD firmware validate key is stored at the end of firmware,
which is changed during resume while playing video. So get the key
at sw_init and store it for fw validate using.
Signed-off-by: Sonny Jiang <sonny.jiang@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently array of fix length PM_API_MAX is used to cache
the pm_api version (valid or invalid). However ATF based
PM APIs values are much higher then PM_API_MAX.
So to include ATF based PM APIs also, use hash-table to
store the pm_api version status.
Signed-off-by: Amit Sunil Dhamne <amit.sunil.dhamne@xilinx.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ravi Patel <ravi.patel@xilinx.com> Signed-off-by: Rajan Vaja <rajan.vaja@xilinx.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Michal Simek <michal.simek@xilinx.com> Fixes: f3217d6f2f7a ("firmware: xilinx: fix out-of-bounds access") Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1606197161-25976-1-git-send-email-rajan.vaja@xilinx.com Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
My virtual IOMMU implementation is whining that the guest is reading a
register that doesn't exist. Only read the VCCAP_REG if the corresponding
capability is set in ECAP_REG to indicate that it actually exists.
Fixes: 3375303e8287 ("iommu/vt-d: Add custom allocator for IOASID") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Liu Yi L <yi.l.liu@intel.com> Cc: stable@vger.kernel.org # v5.7+ Acked-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/de32b150ffaa752e0cff8571b17dfb1213fbe71c.camel@infradead.org Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
a hodge-podge of conditions, hacked together to get something that
more or less works. But what is actually needed is much simpler;
in both cases the fundamental question is, do we have a place to stash
an interrupt if userspace does KVM_INTERRUPT?
In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
unnecessarily restrictive.
In split irqchip mode it's a bit more complicated, we need to check
kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
cycle and thus requires ExtINTs not to be masked) as well as
!pending_userspace_extint(vcpu). However, there is no need to
check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
pending ExtINT state separate from event injection state, and checking
kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
priority than APIC interrupts. In fact the latter fixes a bug:
when userspace requests an IRQ window vmexit, an interrupt in the
local APIC can cause kvm_cpu_has_interrupt() to be true and thus
kvm_vcpu_ready_for_interrupt_injection() to return false. When this
happens, vcpu_run does not exit to userspace but the interrupt window
vmexits keep occurring. The VM loops without any hope of making progress.
we realize two things. First, thanks to the previous patch the complex
conditional can reuse !kvm_cpu_has_extint(vcpu). Second, the interrupt
window request in vcpu_enter_guest()
should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
it is unnecessary to ask the processor for an interrupt window
if we would not be able to return to userspace. Therefore,
kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
ANDed with the existing check for masked ExtINT. It all makes sense:
- we can accept an interrupt from userspace if there is a place
to stash it (and, for irqchip split, ExtINTs are not masked).
Interrupts from userspace _can_ be accepted even if right now
EFLAGS.IF=0.
- in order to tell userspace we will inject its interrupt ("IRQ
window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
KVM and the vCPU need to be ready to accept the interrupt.
... and this is what the patch implements.
Reported-by: David Woodhouse <dwmw@amazon.co.uk> Analyzed-by: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Centralize handling of interrupts from the userspace APIC
in kvm_cpu_has_extint and kvm_cpu_get_extint, since
userspace APIC interrupts are handled more or less the
same as ExtINTs are with split irqchip. This removes
duplicated code from kvm_cpu_has_injectable_intr and
kvm_cpu_has_interrupt, and makes the code more similar
between kvm_cpu_has_{extint,interrupt} on one side
and kvm_cpu_get_{extint,interrupt} on the other.
Cc: stable@vger.kernel.org Reviewed-by: Filippo Sironi <sironi@amazon.de> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's "expected" that users will access registers in the redistributor if
the RD has been properly configured (e.g., the RD base address is set). But
it hasn't yet been covered by the existing documentation.
Per discussion on the list [1], the reporting of the GICR_TYPER.Last bit
for userspace never actually worked. And it's difficult for us to emulate
it correctly given that userspace has the flexibility to access it any
time. Let's just drop the reporting of the Last bit for userspace for now
(userspace should have full knowledge about it anyway) and it at least
prevents kernel from panic ;-)
When accessing the ESB page of a source interrupt, the fault handler
will retrieve the page address from the XIVE interrupt 'xive_irq_data'
structure. If the associated KVM XIVE interrupt is not valid, that is
not allocated at the HW level for some reason, the fault handler will
dereference a NULL pointer leading to the oops below :
Commit 2284ffea8f0c ("powerpc/64s/exception: Only test KVM in SRR
interrupts when PR KVM is supported") removed KVM guest tests from
interrupts that do not set HV=1, when PR-KVM is not configured.
This is wrong for HV-KVM HPT guest MMIO emulation case which attempts
to load the faulting instruction word with MSR[DR]=1 and MSR[HV]=1 with
the guest MMU context loaded. This can cause host DSI, DSLB interrupts
which must test for KVM guest. Restore this and add a comment.
Fixes: 2284ffea8f0c ("powerpc/64s/exception: Only test KVM in SRR interrupts when PR KVM is supported") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201117135617.3521127-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pseries guest kernels have a FWNMI handler for SRESET and MCE NMIs,
which is basically the same as the regular handlers for those
interrupts.
The system reset FWNMI handler did not have a KVM guest test in it,
although it probably should have because the guest can itself run
guests.
Commit 4f50541f6703b ("powerpc/64s/exception: Move all interrupt
handlers to new style code gen macros") convert the handler faithfully
to avoid a KVM test with a "clever" trick to modify the IKVM_REAL
setting to 0 when the fwnmi handler is to be generated (PPC_PSERIES=y).
This worked when the KVM test was generated in the interrupt entry
handlers, but a later patch moved the KVM test to the common handler,
and the common handler macro is expanded below the fwnmi entry. This
prevents the KVM test from being generated even for the 0x100 entry
point as well.
The result is NMI IPIs in the host kernel when a guest is running will
use gest registers. This goes particularly badly when an HPT guest is
running and the MMU is set to guest mode.
Remove this trickery and just generate the test always.
Fixes: 9600f261acaa ("powerpc/64s/exception: Move KVM test to common code") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201114114743.3306283-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mid callback needs to be called only when valid data is
read into pages.
These patches address a problem found during decryption offload:
CIFS: VFS: trying to dequeue a deleted mid
that could cause a refcount use after free:
Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> #5.4+ Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When reconnect happens Mid queue can be corrupted when both
demultiplex and offload thread try to dequeue the MID from the
pending list.
These patches address a problem found during decryption offload:
CIFS: VFS: trying to dequeue a deleted mid
that could cause a refcount use after free:
Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> #5.4+ Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cifs_reconnect needs to be called only from demultiplex thread.
skip cifs_reconnect in offload thread. So, cifs_reconnect will be
called by demultiplex thread in subsequent request.
These patches address a problem found during decryption offload:
CIFS: VFS: trying to dequeue a deleted mid
that can cause a refcount use after free:
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.
The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.
https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
Then on crashing a second time, realized there's a stronger reason against
that approach. If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()? And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon? What
would that look like?
It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).
Matthew Wilcox explaining this to himself:
"page is allocated, added to page cache, dirtied, writeback starts,
--- thread A ---
filesystem calls end_page_writeback()
test_clear_page_writeback()
--- context switch to thread B ---
truncate_inode_pages_range() finds the page, it doesn't have writeback set,
we delete it from the page cache. Page gets reallocated, dirtied, writeback
starts again. Then we call write_cache_pages(), see
PageWriteback() set, call wait_on_page_writeback()
--- context switch back to thread A ---
wake_up_page(page, PG_writeback);
... thread B is woken, but because the wakeup was for the old use of
the page, PageWriteback is still set.
Devious"
And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.
I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().
Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared? I think not,
because each waiter does hold a reference on the page. This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it). The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).
We need to disable interrupts in load_fpu_regs(). Otherwise an
interrupt might come in after the registers are loaded, but before
CIF_FPU is cleared in load_fpu_regs(). When the interrupt returns,
CIF_FPU will be cleared and the registers will never be restored.
The entry.S code usually saves the interrupt state in __SF_EMPTY on the
stack when disabling/restoring interrupts. sie64a however saves the pointer
to the sie control block in __SF_SIE_CONTROL, which references the same
location. This is non-obvious to the reader. To avoid thrashing the sie
control block pointer in load_fpu_regs(), move the __SIE_* offsets eight
bytes after __SF_EMPTY on the stack.
Cc: <stable@vger.kernel.org> # 5.8 Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S") Reported-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix a bug when not specify interrupts property in dts
as follows,
rtc-pcf2127-i2c 1-0051: failed to request alarm irq
rtc-pcf2127-i2c: probe of 1-0051 failed with error -22
This happens because at btrfs_read_qgroup_config() we can call
qgroup_rescan_init() while holding a read lock on a quota btree leaf,
acquired by the previous call to btrfs_search_slot_for_read(), and
qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.
A qgroup rescan worker does the opposite: it acquires the mutex
qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
update the qgroup status item in the quota btree through the call to
update_qgroup_status_item(). This inversion of locking order
between the qgroup_rescan_lock mutex and quota btree locks causes the
splat.
Fix this simply by releasing and freeing the path before calling
qgroup_rescan_init() at btrfs_read_qgroup_config().
CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().
At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.
==================================================================
BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.
Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.
There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are sectorsize alignment checks that are reported but then
check_extent_data_ref continues. This was not intended, wrong alignment
is not a minor problem and we should return with error.
When doing a buffered write, through one of the write family syscalls, we
look for ranges which currently don't have allocated extents and set the
'delalloc new' bit on them, so that we can report a correct number of used
blocks to the stat(2) syscall until delalloc is flushed and ordered extents
complete.
However there are a few other places where we can do a buffered write
against a range that is mapped to a hole (no extent allocated) and where
we do not set the 'new delalloc' bit. Those places are:
- Doing a memory mapped write against a hole;
- Cloning an inline extent into a hole starting at file offset 0;
- Calling btrfs_cont_expand() when the i_size of the file is not aligned
to the sector size and is located in a hole. For example when cloning
to a destination offset beyond EOF.
So after such cases, until the corresponding delalloc range is flushed and
the respective ordered extents complete, we can report an incorrect number
of blocks used through the stat(2) syscall.
In some cases we can end up reporting 0 used blocks to stat(2), which is a
particular bad value to report as it may mislead tools to think a file is
completely sparse when its i_size is not zero, making them skip reading
any data, an undesired consequence for tools such as archivers and other
backup tools, as reported a long time ago in the following thread (and
other past threads):
if [ $blocks_used -eq 0 ]; then
echo "ERROR: blocks used is 0"
fi
umount $DEV
$ ./reproducer.sh
blocks used: 0
ERROR: blocks used is 0
So move the logic that decides to set the 'delalloc bit' bit into the
function btrfs_set_extent_delalloc(), since that is what we use for all
those missing cases as well as for the cases that currently work well.
This change is also preparatory work for an upcoming patch that fixes
other problems related to tracking and reporting the number of bytes used
by an inode.
CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
i40iw_mmap manipulates the vma->vm_pgoff to differentiate a push page mmap
vs a doorbell mmap, and uses it to compute the pfn in remap_pfn_range
without any validation. This is vulnerable to an mmap exploit as described
in: https://lore.kernel.org/r/20201119093523.7588-1-zhudi21@huawei.com
The push feature is disabled in the driver currently and therefore no push
mmaps are issued from user-space. The feature does not work as expected in
the x722 product.
Remove the push module parameter and all VMA attribute manipulations for
this feature in i40iw_mmap. Update i40iw_mmap to only allow DB user
mmapings at offset = 0. Check vm_pgoff for zero and if the mmaps are bound
to a single page.
Cc: <stable@kernel.org> Fixes: d37498417947 ("i40iw: add files for iwarp interface") Link: https://lore.kernel.org/r/20201125005616.1800-2-shiraz.saleem@intel.com Reported-by: Di Zhu <zhudi21@huawei.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Two earlier bug fixes have created a security problem in the hfi1
driver. One fix aimed to solve an issue where current->mm was not valid
when closing the hfi1 cdev. It attempted to do this by saving a cached
value of the current->mm pointer at file open time. This is a problem if
another process with access to the FD calls in via write() or ioctl() to
pin pages via the hfi driver. The other fix tried to solve a use after
free by taking a reference on the mm.
To fix this correctly we use the existing cached value of the mm in the
mmu notifier. Now we can check in the insert, evict, etc. routines that
current->mm matched what the notifier was registered for. If not, then
don't allow access. The register of the mmu notifier will save the mm
pointer.
Since in do_exit() the exit_mm() is called before exit_files(), which
would call our close routine a reference is needed on the mm. We rely on
the mmgrab done by the registration of the notifier, whereas before it was
explicit. The mmu notifier deregistration happens when the user context is
torn down, the creation of which triggered the registration.
Also of note is we do not do any explicit work to protect the interval
tree notifier. It doesn't seem that this is going to be needed since we
aren't actually doing anything with current->mm. The interval tree
notifier stuff still has a FIXME noted from a previous commit that will be
addressed in a follow on patch.
Cc: <stable@vger.kernel.org> Fixes: e0cf75deab81 ("IB/hfi1: Fix mm_struct use after free") Fixes: 3faa3d9a308e ("IB/hfi1: Make use of mm consistent") Link: https://lore.kernel.org/r/20201125210112.104301.51331.stgit@awfm-01.aw.intel.com Suggested-by: Jann Horn <jannh@google.com> Reported-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use IS_ENABLED instead, which is true for both built-ins and modules.
Otherwise, a
> ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1
fails with the message "Error: IPv6 support not enabled in kernel." if
CONFIG_IPV6 is `m`.
bcm2835_spi_remove() accesses the driver's private data after calling
spi_unregister_controller() even though that function releases the last
reference on the spi_controller and thereby frees the private data.
Fix by switching over to the new devm_spi_alloc_master() helper which
keeps the private data accessible until the driver has unbound.
bcm_qspi_remove() calls spi_unregister_master() even though
bcm_qspi_probe() calls devm_spi_register_master(). The spi_master is
therefore unregistered and freed twice on unbind.
Moreover, since commit 0392727c261b ("spi: bcm-qspi: Handle clock probe
deferral"), bcm_qspi_probe() leaks the spi_master allocation if the call
to devm_clk_get_optional() fails.
Fix by switching over to the new devm_spi_alloc_master() helper which
keeps the private data accessible until the driver has unbound and also
avoids the spi_master leak on probe.
While at it, fix an ordering issue in bcm_qspi_remove() wherein
spi_unregister_master() is called after uninitializing the hardware,
disabling the clock and freeing an IRQ data structure. The correct
order is to call spi_unregister_master() *before* those teardown steps
because bus accesses may still be ongoing until that function returns.
Alexander reported a syzkaller / KASAN finding on s390, see below for
complete output.
In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
freed in some cases. In the case of userfaultfd_missing(), this will
happen after calling handle_userfault(), which might have released the
mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will
access an unstable vma->vm_mm, which could have been freed or re-used
already.
For all architectures other than s390 this will go w/o any negative
impact, because pte_free() simply frees the page and ignores the
passed-in mm. The implementation for SPARC32 would also access
mm->page_table_lock for pte_free(), but there is no THP support in
SPARC32, so the buggy code path will not be used there.
For s390, the mm->context.pgtable_list is being used to maintain the 2K
pagetable fragments, and operating on an already freed or even re-used
mm could result in various more or less subtle bugs due to list /
pagetable corruption.
Fix this by calling pte_free() before handle_userfault(), similar to how
it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
non-huge_zero_page case.
Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for
userfaultfd_missing() faults") actually introduced both, the
do_huge_pmd_anonymous_page() and also __do_huge_pmd_anonymous_page()
changes wrt to calling handle_userfault(), but only in the latter case
it put the pte_free() before calling handle_userfault().
BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334
Memory state around the buggy address: 00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
>00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ 00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
If we reparent the slab objects to the root memcg, when we free the slab
object, we need to update the per-memcg vmstats to keep it correct for
the root memcg. Now this at least affects the vmstat of
NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
smaller than the PAGE_SIZE.
David said:
"I assume that without this fix that the root memcg's vmstat would
always be inflated if we reparented"
Fixes: ec9f02384f60 ("mm: workingset: fix vmstat counters for shadow nodes") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Cc: <stable@vger.kernel.org> [5.3+] Link: https://lkml.kernel.org/r/20201110031015.15715-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Both btrfs and fuse have reported faults caused by seeing a retry entry
instead of the page they were looking for. This was caused by a missing
check in the iterator.
As can be seen in the below panic log, the accessing 0x402 causes a
panic. In the xarray.h, 0x402 means RETRY_ENTRY.
We catch the case where we enter generic_file_buffered_read() with data
already transferred, but we also need to be careful not to allow an async
page lock if we're looping transferring data. If not, we could be
returning -EIOCBQUEUED instead of the transferred amount, and it could
result in double waitqueue additions as well.
Currently, scan_microcode() leverages microcode_matches() to check
if the microcode matches the CPU by comparing the family and model.
However, the processor stepping and flags of the microcode signature
should also be considered when saving a microcode patch for early
update.
Use find_matching_signature() in scan_microcode() and get rid of the
now-unused microcode_matches() which is a good cleanup in itself.
Complete the verification of the patch being saved for early loading in
save_microcode_patch() directly. This needs to be done there too because
save_mc_for_early() will call save_microcode_patch() too.
The second reason why this needs to be done is because the loader still
tries to support, at least hypothetically, mixed-steppings systems and
thus adds all patches to the cache that belong to the same CPU model
albeit with different steppings.
For example:
microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6
microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, date = 2020-04-23
microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, date = 2020-04-23
microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27
microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, date = 2020-04-23
The patch which is being saved for early loading, however, can only be
the one which fits the CPU this runs on so do the signature verification
before saving.
[ bp: Do signature verification in save_microcode_patch()
and rewrite commit message. ]
The victim inode's parent and name info is required when an event
needs to be delivered to a group interested in filename info OR
when the inode's parent is interested in an event on its children.
Let us call the first condition 'parent_needed' and the second
condition 'parent_interested'.
In fsnotify_parent(), the condition where the inode's parent is
interested in some events on its children, but not necessarily
interested the specific event is called 'parent_watched'.
fsnotify_parent() tests the condition (!parent_watched && !parent_needed)
for sending the event without parent and name info, which is correct.
It then wrongly assumes that parent_watched implies !parent_needed
and tests the condition (parent_watched && !parent_interested)
for sending the event without parent and name info, which is wrong,
because parent may still be needed by some group.
For example, after initializing a group with FAN_REPORT_DFID_NAME and
adding a FAN_MARK_MOUNT with FAN_OPEN mask, open events on non-directory
children of "testdir" are delivered with file name info.
After adding another mark to the same group on the parent "testdir"
with FAN_CLOSE|FAN_EVENT_ON_CHILD mask, open events on non-directory
children of "testdir" are no longer delivered with file name info.
Fix the logic and use auxiliary variables to clarify the conditions.
Fixes: 9b93f33105f5 ("fsnotify: send event with parent/name info to sb/mount/non-dir marks") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201108105906.8493-1-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 69f594a38967 ("ptrace: do not audit capability check when outputing
/proc/pid/stat") replaced the use of ns_capable() with
has_ns_capability{,_noaudit}() which doesn't set PF_SUPERPRIV.
Commit 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials in
ptrace_has_cap()") replaced has_ns_capability{,_noaudit}() with
security_capable(), which doesn't set PF_SUPERPRIV neither.
Since commit 98f368e9e263 ("kernel: Add noaudit variant of ns_capable()"), a
new ns_capable_noaudit() helper is available. Let's use it!
As a result, the signature of ptrace_has_cap() is restored to its original one.
Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Eric Paris <eparis@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: stable@vger.kernel.org Fixes: 6b3ad6649a4c ("ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()") Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201030123849.770769-2-mic@digikod.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
disk_get_part needs to be paired with a disk_put_part.
Cc: stable@vger.kernel.org Fixes: ef45fe470e1 ("blk-cgroup: show global disk stats in root cgroup io.stat") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the current implementation DLL reset will be issued for
each ITAP and OTAP setting inside ATF, this is creating issues
in some scenarios and this sequence is not inline with the TRM.
To fix the issue, DLL reset should be removed from the ATF and
host driver will request it explicitly.
This patch update host driver to explicitly request for DLL reset
before ITAP (assert DLL) and after OTAP (release DLL) settings.
Fixes: a5c8b2ae2e51 ("mmc: sdhci-of-arasan: Add support for ZynqMP Platform Tap Delays Setup") Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com> Signed-off-by: Manish Narani <manish.narani@xilinx.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1605515565-117562-4-git-send-email-manish.narani@xilinx.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allow configuring the Output and Input tap values with zero to avoid
failures in some cases (one of them is SD boot mode) where the output
and input tap values may be already set to non-zero.
Fixes: a5c8b2ae2e51 ("mmc: sdhci-of-arasan: Add support for ZynqMP Platform Tap Delays Setup") Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com> Signed-off-by: Manish Narani <manish.narani@xilinx.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1605515565-117562-2-git-send-email-manish.narani@xilinx.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A UHS setting of SDR25 can give better results for High Speed mode.
This is because there is no setting corresponding to high speed. Currently
SDHCI sets no value, which means zero which is also the setting for SDR12.
There was an attempt to change this in sdhci.c but it caused problems for
some drivers, so it was reverted and the change was made to sdhci-brcmstb
in commit 2fefc7c5f7d16e ("mmc: sdhci-brcmstb: Fix incorrect switch to HS
mode"). Several other drivers also do this.
Zorro reports that an xfstest test case is failing, and it turns out that
for the reissue path we can potentially issue a double completion on the
request for the failure path. There's an issue around the retry as well,
but for now, at least just make sure that we handle the error path
correctly.
Cc: stable@vger.kernel.org Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some media power gates are disabled by default. commit 5d86923060fc
("drm/i915/tgl: Enable VD HCP/MFX sub-pipe power gating")
tried to enable it, but it duplicated an existent register.
So, the main PG setup sequences ended up overwriting it.
So, let's now merge this to the main PG setup sequence.
v2: (Chris): s/BIT/REG_BIT, remove useless comment,
remove useless =0, use the right gt,
remove rc6 sequence doubt from commit message.
Fixes: 5d86923060fc ("drm/i915/tgl: Enable VD HCP/MFX sub-pipe power gating") Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: stable@vger.kernel.org#v5.5+ Cc: Dale B Stimson <dale.b.stimson@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20201111072859.1186070-1-rodrigo.vivi@intel.com
(cherry picked from commit 695dc55b573985569259e18f8e6261a77924342b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
EDID can declare the maximum supported bpc up to 16,
and apparently there are displays that do so. Currently
we assume 12 bpc is tha max. Fix the assumption and
toss in a MISSING_CASE() for any other value we don't
expect to see.
This fixes modesets with a display with EDID max bpc > 12.
Previously any modeset would just silently fail on platforms
that didn't otherwise limit this via the max_bpc property.
In particular we don't add the max_bpc property to HDMI
ports on gmch platforms, and thus we would see the raw
max_bpc coming from the EDID.
I suppose we could already adjust this to also allow 16bpc,
but seeing as no current platform supports that there is
little point.
This was due to hv_synic_cleanup() callback returning -EBUSY to
cpuhp_issue_call() when tearing down the VMBUS_CONNECT_CPU, even
if the vmbus_connection.conn_state = DISCONNECTED. hv_synic_cleanup()
should succeed in the case where vmbus_connection.conn_state
is DISCONNECTED.
Fix is to add an extra condition to test for
vmbus_connection.conn_state == CONNECTED on the VMBUS_CONNECT_CPU and
only return early if true. This way the kexec() path can still shut
everything down while preserving the initial behavior of preventing
CPU offlining on the VMBUS_CONNECT_CPU while the VM is running.
Fixes: 8a857c55420f29 ("Drivers: hv: vmbus: Always handle the VMBus messages on CPU0") Signed-off-by: Chris Co <chrco@microsoft.com> Reviewed-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201110190118.15596-1-chrco@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When requeueing all requests on the device request queue to the blocklayer
we might get to an ERP (error recovery) request that is a copy of an
original CQR.
Those requests do not have blocklayer request information or a pointer to
the dasd_queue set. When trying to access those data it will lead to a
null pointer dereference in dasd_requeue_all_requests().
Fix by checking if the request is an ERP request that can simply be
ignored. The blocklayer request will be requeued by the original CQR that
is on the device queue right behind the ERP request.
Fixes: 9487cfd3430d ("s390/dasd: fix handling of internal requests") Cc: <stable@vger.kernel.org> #4.16 Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This file is installed by the s390 CPU Measurement sampling
facility device driver to export supported minimum and
maximum sample buffer sizes.
This file is read by lscpumf tool to display the details
of the device driver capabilities. The lscpumf tool might
be invoked by a non-root user. In this case it does not
print anything because the file contents can not be read.
Fix this by allowing read access for all users. Reading
the file contents is ok, changing the file contents is
left to the root user only.
For further reference and details see:
[1] https://github.com/ibm-s390-tools/s390-tools/issues/97
The system call exit path is running with interrupts enabled while
checking for TIF/PIF/CIF bits which require special handling. If all
bits have been checked interrupts are disabled and the kernel exits to
user space.
The problem is that after checking all bits and before interrupts are
disabled bits can be set already again, due to interrupt handling.
This means that the kernel can exit to user space with some
TIF/PIF/CIF bits set, which should never happen. E.g. TIF_NEED_RESCHED
might be set, which might lead to additional latencies, since that bit
will only be recognized with next exit to user space.
Fix this by checking the corresponding bits only when interrupts are
disabled.
If sta_info_insert_finish() fails, we currently keep the station
around and free it only in the caller, but there's only one such
caller and it always frees it immediately.
As syzbot found, another consequence of this split is that we can
put things that sleep only into __cleanup_single_sta() and not in
sta_info_free(), but this is the only place that requires such of
sta_info_free() now.
Change this to free the station in sta_info_insert_finish(), in
which case we can still sleep. This will also let us unify the
cleanup code later.
Some drivers fill the status rate list without setting the rate index after
the final rate to -1. minstrel_ht already deals with this, but minstrel
doesn't, which causes it to get stuck at the lowest rate on these drivers.
Fix this by checking the count as well.
Cc: stable@vger.kernel.org Fixes: cccf129f820e ("mac80211: add the 'minstrel' rate control algorithm") Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201111183359.43528-3-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Deferring sampling attempts to the second stage has some bad interactions
with drivers that process the rate table in hardware and use the probe flag
to indicate probing packets (e.g. most mt76 drivers). On affected drivers
it can lead to probing not working at all.
If the link conditions turn worse, it might not be such a good idea to
do a lot of sampling for lower rates in this case.
Fix this by simply skipping the sample attempt instead of deferring it,
but keep the checks that would allow it to be sampled if it was skipped
too often, but only if it has less than 95% success probability.
Also ensure that IEEE80211_TX_CTL_RATE_CTRL_PROBE is set for all probing
packets.
Cc: stable@vger.kernel.org Fixes: cccf129f820e ("mac80211: add the 'minstrel' rate control algorithm") Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20201111183359.43528-2-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although cache alias management calls set up and tear down TLB entries
and fast_second_level_miss is able to restore TLB entry should it be
evicted they absolutely cannot preempt each other because they use the
same TLBTEMP area for different purposes.
Disable preemption around all cache alias management calls to enforce
that.
Cc: stable@vger.kernel.org Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fast_second_level_miss handler for the TLBTEMP area has an assumption
that page table directory entry for the TLBTEMP address range is 0. For
it to be true the TLBTEMP area must be aligned to 4MB boundary and not
share its 4MB region with anything that may use a page table. This is
not true currently: TLBTEMP shares space with vmalloc space which
results in the following kinds of runtime errors when
fast_second_level_miss loads page table directory entry for the vmalloc
space instead of fixing up the TLBTEMP area:
Patch 541656d3a513 ("gfs2: freeze should work on read-only mounts") changed
the check for glock state in function freeze_go_sync() from "gl->gl_state
== LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE". That's wrong and it
regressed gfs2's freeze/thaw mechanism because it caused only the freezing
node (which requests the glock in EX) to queue freeze work.
All nodes go through this go_sync code path during the freeze to drop their
SHared hold on the freeze glock, allowing the freezing node to acquire it
in EXclusive mode. But all the nodes must freeze access to the file system
locally, so they ALL must queue freeze work. The freeze_work calls
freeze_func, which makes a request to reacquire the freeze glock in SH,
effectively blocking until the thaw from the EX holder. Once thawed, the
freezing node drops its EX hold on the freeze glock, then the (blocked)
freeze_func reacquires the freeze glock in SH again (on all nodes, including
the freezer) so all nodes go back to a thawed state.
This patch changes the check back to gl_state == LM_ST_SHARED like it was
prior to 541656d3a513.
Fixes: 541656d3a513 ("gfs2: freeze should work on read-only mounts") Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>