The first message to firmware may fail if the device is undergoing FLR.
The driver has some recovery logic for this failure scenario but we must
wait 100 msec for FLR to complete before proceeding. Otherwise the
recovery will always fail.
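A minimal sketch of the intended flow (the helper names here are assumptions, not necessarily the actual bnxt_en functions):

    /* Hypothetical sketch: the first firmware message may fail while the
     * device is still in FLR; give FLR 100 msec to complete before
     * running the recovery path, otherwise recovery always fails.
     */
    rc = bnxt_hwrm_send_first_msg(bp);      /* assumed helper */
    if (rc) {
            msleep(100);                    /* wait for FLR to complete */
            rc = bnxt_try_recover_fw(bp);   /* assumed recovery helper */
    }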
Fixes: ba02629ff6cb ("bnxt_en: log firmware status on firmware init failure") Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The issue triggering process is analyzed as follows:
Thread A                                  Thread B
tcp_v4_rcv  //receive ack TCP packet      inet_shutdown
  tcp_check_req                             tcp_disconnect //disconnect sock
  ...                                         tcp_set_state(sk, TCP_CLOSE)
  inet_csk_complete_hashdance              ...
    inet_csk_reqsk_queue_add               inet_listen  //start listen
      spin_lock(&queue->rskq_lock)           inet_csk_listen_start
      ...                                      reqsk_queue_alloc
      ...                                        spin_lock_init
      spin_unlock(&queue->rskq_lock)  //warning
When the socket receives the ACK packet during the three-way handshake,
it holds the spinlock. If the user then actively shuts down the socket and
immediately listens on it again, the spinlock is re-initialized. When the
first thread goes on to release the spinlock, a warning is generated.
The same issue affects fastopenq.lock.
Move the spinlock initialization to inet_create and inet_accept to make
sure the accept_queue's spinlocks are initialized only once.
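A minimal sketch of the intended shape (the helper name is an assumption): initialize both accept_queue locks exactly once, from inet_create() and inet_accept(), so that a later listen() no longer re-initializes live spinlocks.

    /* Hypothetical helper called from inet_create() and inet_accept(). */
    static inline void inet_init_csk_locks(struct sock *sk)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);

            spin_lock_init(&icsk->icsk_accept_queue.rskq_lock);
            spin_lock_init(&icsk->icsk_accept_queue.fastopenq.lock);
    }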
Fixes: fff1f3001cc5 ("tcp: add a spinlock to protect struct request_sock_queue") Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Reported-by: Ming Shu <sming56@aliyun.com> Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240118012019.1751966-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When tests are run by runner.sh, bond_options.sh gets killed before
it can complete:
make -C tools/testing/selftests run_tests TARGETS="drivers/net/bonding"
[...]
# timeout set to 120
# selftests: drivers/net/bonding: bond_options.sh
# TEST: prio (active-backup miimon primary_reselect 0) [ OK ]
# TEST: prio (active-backup miimon primary_reselect 1) [ OK ]
# TEST: prio (active-backup miimon primary_reselect 2) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 0) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 1) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 2) [ OK ]
#
not ok 7 selftests: drivers/net/bonding: bond_options.sh # TIMEOUT 120 seconds
This test includes many sleep statements, at least some of which are
related to timers in the operation of the bonding driver itself. Increase
the test timeout to allow the test to complete.
I ran the test in slightly different VMs (including one without HW
virtualization support) and got runtimes of 13m39.760s, 13m31.238s, and
13m2.956s. Use a ~1.5x "safety factor" and set the timeout to 1200s.
Fixes: 42a8d4aaea84 ("selftests: bonding: add bonding prio option test") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20240116104402.1203850a@kernel.org/#t Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240118001233.304759-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
It is possible that the connection is in the process of being established
when we dump it. Assume that the connection has been registered in a
link group by smc_conn_create() but the rmb_desc has not yet been
initialized by smc_buf_create(), thus causing an illegal access to
conn->rmb_desc. So fix it by checking rmb_desc before dumping.
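A minimal sketch of the check (simplified from the diag dump path):

    /* Skip the DMB info while the connection is still being set up and
     * conn->rmb_desc has not been initialized by smc_buf_create() yet.
     */
    if (conn->rmb_desc) {
            /* ... fill the diag entry from conn->rmb_desc ... */
    }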
Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
This means btrfs_submit_bio() would split the bio and trigger the endio
function for each of the two halves.
However, scrub_submit_initial_read() only expects the endio function
to be called once, not more.
This means the first endio call would already free the bbio::bio,
leaving the bvec freed, thus the 2nd endio call would lead to a
use-after-free.
[FIX]
- Make sure scrub_read_endio() only updates bits in its range
Since we may read less than 64K at the end of the chunk, we should not
touch the bits beyond the chunk boundary.
- Make sure scrub_submit_initial_read() only reads the chunk range
This is done by calculating the real number of sectors we need to
read, and adding them sector-by-sector to the bio.
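A sketch of that clamping (names follow fs/btrfs/scrub.c but are simplified, not the verbatim fix):

    /* Read only up to the chunk boundary instead of a full 64K stripe. */
    u64 chunk_end = bg->start + bg->length;
    unsigned int nr_sectors = min_t(u64, BTRFS_STRIPE_LEN,
                                    chunk_end - stripe->logical) >>
                              fs_info->sectorsize_bits;
    int i;

    for (i = 0; i < nr_sectors; i++)
            __bio_add_page(&bbio->bio, scrub_stripe_get_page(stripe, i),
                           fs_info->sectorsize,
                           scrub_stripe_get_page_offset(stripe, i));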
Thankfully the scrub read repair path won't need extra fixes:
- scrub_stripe_submit_repair_read()
With the above fixes, we won't update the error bit for ranges beyond the
chunk, thus scrub_stripe_submit_repair_read() should never submit any read
beyond the chunk.
When a station is allocated, links are added but not yet marked
valid (e.g. during connection to an AP MLD). We might then remove
the station without ever marking its links valid, and leak them.
Fix that.
Fixes: cb71f1d136a6 ("wifi: mac80211: add sta link addition/removal") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240111181514.6573998beaf8.I09ac2e1d41c80f82a5a616b8bd1d9d8dd709a6a6@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Use the proper size when setting up the bio_vec, as otherwise only
zero-length UDP packets will be sent.
Fixes: baabf59c2414 ("SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array") Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The kernel thread function ksmbd_conn_handler_loop() invokes
try_to_freeze() in its loop. But all kernel threads are
non-freezable by default. So if we want to make a kernel thread
freezable, we have to invoke set_freezable() explicitly.
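A minimal sketch of the pattern (loop condition and body simplified):

    static int ksmbd_conn_handler_loop(void *p)
    {
            /* Kernel threads are non-freezable by default; opt in explicitly. */
            set_freezable();

            while (!kthread_should_stop()) {
                    if (try_to_freeze())
                            continue;

                    /* ... receive and dispatch SMB requests ... */
            }
            return 0;
    }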
Signed-off-by: Kevin Hao <haokexin@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mac80211 started to delete debugfs entries in certain cases, causing
ath11k to crash when it tried to delete the entries later. Fix this by
relying on mac80211 to delete the entries when appropriate and adding
them from the vif_add_debugfs handler.
Fixes: 0a3d898ee9a8 ("wifi: mac80211: add/remove driver debugfs entries as appropriate") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218364 Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240115101805.1277949-1-benjamin@sipsolutions.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a file opened with a v2 lease is upgraded with a v1 lease, the smb
server should respond with a v2 lease create context to the client.
This patch fixes the smb2.lease.v2_epoch2 test failure.
This test case assumes the following scenario:
1. smb2 create with v2 lease(R, LEASE1 key)
2. smb server return smb2 create response with v2 lease context(R,
LEASE1 key, epoch + 1)
3. smb2 create with v1 lease(RH, LEASE1 key)
4. smb server return smb2 create response with v2 lease context(RH,
LEASE1 key, epoch + 2)
i.e. if the same client (same lease key) tries to open with a v1 lease
a file that is already open with a v2 lease, the smb server should return
a v2 lease.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__alloc_pages_direct_reclaim() is called from slowpath allocation where
high atomic reserves can be unreserved after there is a progress in
reclaim and yet no suitable page is found. Later should_reclaim_retry()
gets called from slow path allocation to decide if the reclaim needs to be
retried before OOM kill path is taken.
should_reclaim_retry() checks the available (reclaimable + free pages)
memory against the min wmark levels of a zone and returns:
a) true, if it is above the min wmark so that slow path allocation will
do the reclaim retries.
b) false, thus slowpath allocation takes oom kill path.
should_reclaim_retry() can also unreserve the high atomic reserves **but
only after all the reclaim retries are exhausted.**
In a case where there is almost no reclaimable memory and the free pages
consist mostly of high atomic reserves that the allocation context can't
use, the available memory stays below the min wmark levels, hence false
is returned from should_reclaim_retry(), leading the allocation request
to take the OOM kill path. This can turn into an early oom kill if the
high atomic reserves hold a lot of free memory and unreserving them is
never attempted.
(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per the above log, ~7MB of free memory exists in the high atomic reserves
but is not freed up before falling back to the oom kill path.
Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can fall
back to the oom kill path.
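A simplified sketch of the change at the end of should_reclaim_retry():

    /* Before giving up and letting the caller take the OOM kill path,
     * try to release the high atomic reserves so that their free pages
     * become usable by this allocation.
     */
    if (!ret)
            ret = unreserve_highatomic_pageblock(ac, true);

    return ret;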
Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Chris Goldsworthy <quic_cgoldswo@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joakim Tjernlund <Joakim.Tjernlund@infinera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Simplify and improve readability by replacing the while(1) loop with a
do {} while loop, and by using the keep_polling variable as the exit
condition, making it more explicit.
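The resulting shape is roughly (a sketch; the body of the loop is elided):

    static irqreturn_t sc16is7xx_irq(int irq, void *dev_id)
    {
            bool keep_polling;

            do {
                    keep_polling = false;

                    /* ... service each port; set keep_polling = true
                     *     if any port still had work pending ... */
            } while (keep_polling);

            return IRQ_HANDLED;
    }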
Fixes: 834449872105 ("sc16is7xx: Fix for multi-channel stall") Cc: <stable@vger.kernel.org> Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com> Link: https://lore.kernel.org/r/20231221231823.2327894-6-hugo@hugovil.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 834449872105 ("sc16is7xx: Fix for multi-channel stall") changed
sc16is7xx_port_irq() from looping multiple times when there were still
interrupts to serve. It simply changed the do {} while(1) loop to a
do {} while(0) loop, which makes the loop itself now obsolete.
Clean the code by removing this obsolete do {} while(0) loop.
Fixes: 834449872105 ("sc16is7xx: Fix for multi-channel stall") Cc: <stable@vger.kernel.org> Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com> Link: https://lore.kernel.org/r/20231221231823.2327894-5-hugo@hugovil.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If an error occurs during probing, the sc16is7xx_lines bitfield may be left
in a state that doesn't represent the correct state of lines allocation.
For example, in a system with two SC16 devices, if an error occurs only
during probing of channel (port) B of the second device, sc16is7xx_lines
final state will be 00001011b instead of the expected 00000011b.
This is caused in part because of the "i--" in the for/loop located in
the out_ports: error path.
Fix this by checking the return value of uart_add_one_port() and setting
the line allocation bit only if it was successful. This allows refactoring
the obfuscated for(i--...) loop in the error path, calling
uart_remove_one_port() only when needed, and properly unsetting the line
allocation bits.
Also use the same mechanism in remove() when calling uart_remove_one_port().
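A sketch of the probe-side change (error labels and the bitfield access are simplified):

    ret = uart_add_one_port(&sc16is7xx_uart, &s->p[i].port);
    if (ret)
            goto out_ports;

    /* Mark the line as allocated only after the port was successfully
     * added, so the error path and remove() can rely on this bit.
     */
    set_bit(s->p[i].port.line, &sc16is7xx_lines);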
Commit cc4c1d05eb10 ("sc16is7xx: Properly resume TX after stop") changed
the behavior to unconditionally set the THRI interrupt in sc16is7xx_tx_proc().
For example when sending a 65 bytes message, and assuming the Tx FIFO is
initially empty, sc16is7xx_handle_tx() will write the first 64 bytes of the
message to the FIFO and sc16is7xx_tx_proc() will then activate THRI. When
the THRI IRQ is fired, the driver will write the remaining byte of the
message to the FIFO, and disable THRI by calling sc16is7xx_stop_tx().
When sending a 2 bytes message, sc16is7xx_handle_tx() will write the 2
bytes of the message to the FIFO and call sc16is7xx_stop_tx(), disabling
THRI. After sc16is7xx_handle_tx() exits, control returns to
sc16is7xx_tx_proc() which will unconditionally set THRI. When the THRI IRQ
is fired, the driver simply acknowledges the interrupt and does nothing
more, since all the data has already been written to the FIFO. This results
in 2 register writes and 4 register reads all for nothing and taking
precious cycles from the I2C/SPI bus.
Fix this by enabling the THRI interrupt only when we fill the Tx FIFO to
its maximum capacity and there are remaining bytes to send in the message.
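A sketch of the condition (the helper and constant names here are assumptions):

    /* Arm THRI only when the TX FIFO was filled to capacity and there is
     * still data left to send; otherwise the IRQ would fire for nothing.
     */
    if (to_send == SC16IS7XX_FIFO_SIZE && !uart_circ_empty(xmit))
            sc16is7xx_ier_set(port, SC16IS7XX_IER_THRI_BIT);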
The SC16IS7XX IC supports a burst mode to access the FIFOs where the
initial register address is sent ($00), followed by all the FIFO data
without having to resend the register address each time. In this mode, the
IC doesn't increment the register address for each R/W byte.
The regmap_raw_read() and regmap_raw_write() are functions which can
perform IO over multiple registers. They are currently used to read/write
from/to the FIFO, and although they operate correctly in this burst mode on
the SPI bus, they would corrupt the regmap cache if it was not disabled
manually. The reason is that when the R/W size is more than 1 byte, these
functions assume that the register address is incremented and handle the
cache accordingly.
Convert FIFO R/W functions to use the regmap _noinc_ versions in order to
remove the manual cache control which was a workaround when using the
_raw_ versions. FIFO registers are properly declared as volatile so
cache will not be used/updated for FIFO accesses.
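A sketch of the converted FIFO accessors (signatures simplified):

    static void sc16is7xx_fifo_read(struct regmap *regmap, u8 *rxbuf, size_t rxlen)
    {
            /* The RHR FIFO register does not auto-increment; the _noinc_
             * API keeps the address fixed, and the volatile declaration
             * keeps FIFO data out of the regmap cache.
             */
            regmap_noinc_read(regmap, SC16IS7XX_RHR_REG, rxbuf, rxlen);
    }

    static void sc16is7xx_fifo_write(struct regmap *regmap, u8 *txbuf, size_t txlen)
    {
            regmap_noinc_write(regmap, SC16IS7XX_THR_REG, txbuf, txlen);
    }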
Now that the driver has been converted to use one regmap per port, change
efr locking to operate on a channel basis instead of on the whole IC.
Fixes: 3837a0379533 ("serial: sc16is7xx: improve regmap debugfs by using one regmap per port") Cc: <stable@vger.kernel.org> # 6.1.x: 3837a03 serial: sc16is7xx: improve regmap debugfs by using one regmap per port Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com> Link: https://lore.kernel.org/r/20231211171353.2901416-5-hugo@hugovil.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that the driver has been converted to use one regmap per port, the line
structure member is no longer used, so remove it.
Fixes: 3837a0379533 ("serial: sc16is7xx: improve regmap debugfs by using one regmap per port") Cc: <stable@vger.kernel.org> Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com> Link: https://lore.kernel.org/r/20231211171353.2901416-4-hugo@hugovil.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove global struct regmap so that it is more obvious that this
regmap is to be used only in the probe function.
Also add a comment to that effect in probe function.
Fixes: 3837a0379533 ("serial: sc16is7xx: improve regmap debugfs by using one regmap per port") Cc: <stable@vger.kernel.org> Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com> Link: https://lore.kernel.org/r/20231211171353.2901416-3-hugo@hugovil.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Using a static buffer inside sc16is7xx_regmap_name() was a convenient and
simple way to set the regmap name without having to allocate and free a
buffer each time it is called. The drawback is that the static buffer
wastes memory for nothing once regmap is fully initialized.
Remove static buffer and use constant strings instead.
This also avoids a truncation warning when using "%d" or "%u" in snprintf
which was flagged by kernel test robot.
Fixes: 3837a0379533 ("serial: sc16is7xx: improve regmap debugfs by using one regmap per port") Cc: <stable@vger.kernel.org> # 6.1.x: 3837a03 serial: sc16is7xx: improve regmap debugfs by using one regmap per port Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com> Link: https://lore.kernel.org/r/20231211171353.2901416-2-hugo@hugovil.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With this current driver regmap implementation, it is hard to make sense
of the register addresses displayed using the regmap debugfs interface,
because they do not correspond to the actual register addresses documented
in the datasheet. For example, register 1 is displayed as registers 04 thru
07:
$ cat /sys/kernel/debug/regmap/spi0.0/registers
04: 10 -> Port 0, register offset 1
05: 10 -> Port 1, register offset 1
06: 00 -> Port 2, register offset 1 -> invalid
07: 00 -> port 3, register offset 1 -> invalid
...
The reason is that bits 0 and 1 of the register address correspond to the
channel (port) bits, so the register address itself starts at bit 2, and we
must 'mentally' shift each register address by 2 bits to get its real
address/offset.
Also, only channels 0 and 1 are supported by the chip, so channel mask
combinations of 10b and 11b are invalid, and the display of these
registers is useless.
This patch adds a separate regmap configuration for each port, similar to
what is done in the max310x driver, so that register addresses displayed
match the register addresses in the chip datasheet. Also, each port now has
its own debugfs entry.
We should never lock two subdirectories without having taken
->s_vfs_rename_mutex; inode pointer order or not, the "order" proposed
in 28eceeda130f "fs: Lock moved directories" is not transitive, with
the usual consequences.
The rationale for locking renamed subdirectory in all cases was
the possibility of race between rename modifying .. in a subdirectory to
reflect the new parent and another thread modifying the same subdirectory.
For a lot of filesystems that's not a problem, but for some it can lead
to trouble (e.g. the case when short directory contents is kept in the
inode, but creating a file in it might push it across the size limit
and copy its contents into separate data block(s)).
However, we need that only in case when the parent does change -
otherwise ->rename() doesn't need to do anything with .. entry in the
first place. Some instances are lazy and do a tautological update anyway,
but it's really not hard to avoid.
Amended locking rules for rename():
	find the parent(s) of source and target
	if source and target have the same parent
		lock the common parent
	else
		lock ->s_vfs_rename_mutex
		lock both parents, in ancestor-first order; if neither
		is an ancestor of another, lock the parent of source
		first.
	find the source and target.
	if source and target have the same parent
		if operation is an overwriting rename of a subdirectory
			lock the target subdirectory
	else
		if source is a subdirectory
			lock the source
		if target is a subdirectory
			lock the target
	lock non-directories involved, in inode pointer order if both
	source and target are such.
That way we are guaranteed that parents are locked (for obvious reasons),
that any renamed non-directory is locked (nfsd relies upon that),
that any victim is locked (emptiness check needs that, among other things)
and subdirectory that changes parent is locked (needed to protect the update
of .. entries). We are also guaranteed that any operation locking more
than one directory either takes ->s_vfs_rename_mutex or locks a parent
followed by its child.
Cc: stable@vger.kernel.org Fixes: 28eceeda130f "fs: Lock moved directories" Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The below race is observed on a PFN which falls into the device memory
region with the system memory configuration where PFN's are such that
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since the normal zone start and end
pfns contain the device memory PFN's as well, the compaction triggered
will try the device memory PFN's too, though they end up as a NOP (because
pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When,
from another core, the section mappings are being removed for the
ZONE_DEVICE region that the PFN in question belongs to, on which
compaction is currently operating, this results in a kernel crash with
CONFIG_SPARSEMEM_VMEMMAP enabled. The crash logs can be seen at [1].
compact_zone()                          memunmap_pages
-------------                           ---------------
__pageblock_pfn_to_page
  ......
  (a) pfn_valid():
        valid_section()  //return true
                                        (b) __remove_pages()->
                                              sparse_remove_section()->
                                                section_deactivate():
                                                [Free the array ms->usage
                                                 and set ms->usage = NULL]
  pfn_section_valid()
  [Access ms->usage which is NULL]
NOTE: From the above it can be said that the race is reduced to one
between pfn_valid()/pfn_section_valid() and the section deactivation with
SPARSEMEM_VMEMMAP enabled.
The commit b943f045a9af ("mm/sparse: fix kernel crash with
pfn_section_valid check") tried to address the same problem by clearing
SECTION_HAS_MEM_MAP with the expectation that valid_section() returns
false and thus ms->usage is not accessed.
Fix this issue by the below steps:
a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
b) RCU protected read side critical section will either return NULL
when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
c) Free the ->usage with kfree_rcu() and set ms->usage = NULL. No
attempt will be made to access ->usage after this, as
SECTION_HAS_MEM_MAP is cleared and thus valid_section() returns false.
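A sketch of the resulting teardown order in section_deactivate(), following steps a)-c) above (flag handling simplified; assumes an rcu_head was added to struct mem_section_usage):

    struct mem_section_usage *usage = ms->usage;

    /* a) racing readers either see the flag cleared ... */
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
    /* b)/c) ... or can still dereference ->usage until the grace period ends. */
    ms->usage = NULL;
    kfree_rcu(usage, rcu);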
Thanks to David/Pavan for their inputs on this patch.
On Snapdragon SoC, with the mentioned memory configuration of PFN's as
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see a bunch of
issues daily while testing on a device farm.
For this particular issue, the log is below. Though the log does not
directly point to pfn_section_valid(){ ms->usage; }, when we loaded this
dump in the T32 Lauterbach tool, it does point there.
[quic_charante@quicinc.com: use kfree_rcu() in place of synchronize_rcu(), per David] Link: https://lkml.kernel.org/r/1698403778-20938-1-git-send-email-quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1697202267-23600-1-git-send-email-quic_charante@quicinc.com Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot") Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After analyzing the vmcore, I found this issue is caused by page migration.
The scenario is that, one thread is doing page migration, and we will use the
target page's ->mapping field to save 'anon_vma' pointer between page unmap and
page move, and now the target page is locked and refcount is 1.
Currently, there is another stress-ng thread performing memory hotplug,
attempting to offline the target page that is being migrated. It discovers that
the refcount of this target page is 1, preventing the offline operation, thus
proceeding to dump the page. However, page_mapping() of the target page may
return an incorrect file mapping to crash the system in dump_mapping(), since
the target page->mapping only saves 'anon_vma' pointer without setting
PAGE_MAPPING_ANON flag.
There are several ways to fix this issue:
(1) Setting the PAGE_MAPPING_ANON flag for target page's ->mapping when saving
'anon_vma', but this can confuse PageAnon() for PFN walkers, since the target
page has not built mappings yet.
(2) Getting the page lock to call page_mapping() in __dump_page() to avoid crashing
the system, however, there are still some PFN walkers that call page_mapping()
without holding the page lock, such as compaction.
(3) Using target page->private field to save the 'anon_vma' pointer and 2 bits
page state, just as page->mapping records an anonymous page, which can remove
the page_mapping() impact for PFN walkers and also seems a simple way.
So I choose option 3 to fix this issue, and this can also fix other potential
issues for PFN walkers, such as compaction.
Link: https://lkml.kernel.org/r/e60b17a88afc38cb32f84c3e30837ec70b343d2b.1702641709.git.baolin.wang@linux.alibaba.com Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Xu Yu <xuyu@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I thought it was interesting that line 264 of rmap.h had a 100% incorrect
annotation, but the line directly below it was 100% correct. Looking at the
code:
	if (likely(!is_device_private_page(page) &&
		   unlikely(page_needs_cow_for_dma(vma, page))))
It didn't make sense. The "likely()" was around the entire if statement
(not just the "!is_device_private_page(page)"), which also included the
"unlikely()" portion of that if condition.
If the unlikely portion is unlikely to be true, that would make the entire
if condition unlikely to be true, so it made no sense at all to mark the
entire if condition as likely.
What is more likely to be likely is just the first part of the if statement
before the && operation. It's likely to be a misplaced parenthesis. And
after breaking the if condition into a likely() && unlikely(), both
now appear to be correct!
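With the parenthesis moved, the condition reads:

	if (likely(!is_device_private_page(page)) &&
	    unlikely(page_needs_cow_for_dma(vma, page)))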
Link: https://lkml.kernel.org/r/20231201145936.5ddfdb50@gandalf.local.home
Fixes: fb3d824d1a46c ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kernel selftest mm/hugepage-vmemmap fails on architectures which have
a page size other than 4K. In hugepage-vmemmap the page size used is 4K,
so the pfn calculation will go wrong on systems with a different page
size. The length of MAP_HUGETLB memory must be hugepage aligned, but in
hugepage-vmemmap the map length is 2M, so it will not be aligned if the
system has a different hugepage size.
Add psize() to get the page size and default_huge_page_size() to get the
default hugepage size at run time; the hugepage-vmemmap test then passes
on powerpc with 64K page size and on x86 with 4K page size.
Result on powerpc without patch (page size 64K)
*# ./hugepage-vmemmap
Returned address is 0x7effff000000 whose pfn is 0
Head page flags (100000000) is invalid
check_page_flags: Invalid argument
*#
Result on powerpc with patch (page size 64K)
*# ./hugepage-vmemmap
Returned address is 0x7effff000000 whose pfn is 600
*#
Result on x86 with patch (page size 4K)
*# ./hugepage-vmemmap
Returned address is 0x7fc7c2c00000 whose pfn is 1dac00
*#
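A sketch of the run-time sizing in the test (helper names as added by the selftests; surrounding code elided):

    unsigned long pagesize = psize();                       /* base page size */
    unsigned long maplength = default_huge_page_size();     /* hugepage-aligned length */
    char *addr;

    addr = mmap(NULL, maplength, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    /* the pagemap/pfn lookup then uses 'pagesize' instead of a hardcoded 4K */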
Link: https://lkml.kernel.org/r/3b3a3ae37ba21218481c482a872bbf7526031600.1704865754.git.donettom@linux.vnet.ibm.com Fixes: b147c89cd429 ("selftests: vm: add a hugetlb test case") Signed-off-by: Donet Tom <donettom@linux.vnet.ibm.com> Reported-by: Geetika Moolchandani <geetika@linux.ibm.com> Tested-by: Geetika Moolchandani <geetika@linux.ibm.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syscore_shutdown() runs driver and module callbacks to get the system into
a state where it can be correctly shut down. In commit 6f389a8f1dd2 ("PM
/ reboot: call syscore_shutdown() after disable_nonboot_cpus()")
syscore_shutdown() was removed from kernel_restart_prepare() and hence got
(incorrectly?) removed from the kexec flow. This was innocuous until
commit 6735150b6997 ("KVM: Use syscore_ops instead of reboot_notifier to
hook restart/shutdown") changed the way that KVM registered its shutdown
callbacks, switching from reboot notifiers to syscore_ops.shutdown. As
syscore_shutdown() is missing from kexec, KVM's shutdown hook is not run
and virtualisation is left enabled on the boot CPU which results in triple
faults when switching to the new kernel on Intel x86 VT-x with VMXE
enabled.
Fix this by adding syscore_shutdown() to the kexec sequence. In terms of
where to add it, it is being added after migrating the kexec task to the
boot CPU, but before APs are shut down. It is not totally clear if this
is the best place: in commit 6f389a8f1dd2 ("PM / reboot: call
syscore_shutdown() after disable_nonboot_cpus()") it is stated that
"syscore_ops operations should be carried with one CPU on-line and
interrupts disabled." APs are only offlined later in machine_shutdown(),
so this syscore_shutdown() is being run while APs are still online. This
seems to be the correct place as it matches where syscore_shutdown() is
run in the reboot and halt flows - they also run it before APs are shut
down. The assumption is that the commit message in commit 6f389a8f1dd2
("PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()") is
no longer valid.
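A simplified sketch of the resulting sequence in kernel_kexec() (surrounding code elided):

    migrate_to_reboot_cpu();
    /* Run syscore shutdown callbacks (e.g. KVM's) with the task pinned to
     * the boot CPU, before the APs are taken down in machine_shutdown().
     */
    syscore_shutdown();
    machine_shutdown();
    machine_kexec(kexec_image);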
KVM has been discussed here as it is what broke loudly by not having
syscore_shutdown() in kexec, but this change impacts more than just KVM;
all drivers/modules which register a syscore_ops.shutdown callback will
now be invoked in the kexec flow. Looking at some of them like x86 MCE it
is probably more correct to also shut these down during kexec.
Maintainers of all drivers which use syscore_ops.shutdown are added on CC
for visibility. They are:
Move the mmu notification mechanism inside the mm lock to prevent a race
condition in other components which depend on it. The notifier will
invalidate the memory range. Depending upon the number of iterations,
different memory ranges would be invalidated.
The following warning would be removed by this patch:
WARNING: CPU: 0 PID: 5067 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:734 kvm_mmu_notifier_change_pte+0x860/0x960 arch/x86/kvm/../../../virt/kvm/kvm_main.c:734
There is no behavioural or performance change with this patch when
no component is registered with the mmu notifier.
[akpm@linux-foundation.org: narrow the scope of `range', per Sean] Link: https://lkml.kernel.org/r/20240109112445.590736-1-usama.anjum@collabora.com Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reported-by: syzbot+81227d2bd69e9dedb802@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com/ Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: Andrei Vagin <avagin@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 0952177f2a1f ("thermal/core/power_allocator: Update once
cooling devices when temp is low") adds an update flag to avoid
triggering a thermal event when there is no need, and the thermal
cdev is updated once when the temperature is low.
But when the trips are writable, and switch_on_temp is set to be a
higher value, the cooling device state may not be reset to 0,
because last_temperature is smaller than switch_on_temp.
For example:
First:
switch_on_temp=70 control_temp=85;
Then userspace changes the trip_temp:
switch_on_temp=45 control_temp=55 cur_temp=54
Then userspace resets the trip_temp:
switch_on_temp=70 control_temp=85 cur_temp=57 last_temp=54
At this time, the cooling device state should be reset to 0.
However, because cur_temp(57) < switch_on_temp(70) and
last_temp(54) < switch_on_temp(70) ----> update = false,
the cooling device state cannot be reset.
Using the observation that tz->passive can also be regarded as the
temperature status, set the update flag to the tz->passive value.
When the temperature drops below switch_on for the first time, the
states of cooling devices can be reset once, and tz->passive is updated
to 0. In the next round, because tz->passive is 0, cdev->state will not
be updated.
By using the tz->passive value as the "update" flag, the issue above
can be solved, and the cooling devices can be updated only once when the
temperature is low.
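A simplified sketch of the low-temperature path (the helper name is assumed):

    if (cur_temp < switch_on_temp) {
            /* tz->passive doubles as the "did we throttle" status: reset
             * the cooling devices once, then later rounds skip the update.
             */
            update = tz->passive;
            tz->passive = 0;
            allow_maximum_power(tz, update);
            return 0;
    }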
Fixes: 0952177f2a1f ("thermal/core/power_allocator: Update once cooling devices when temp is low") Cc: 5.13+ <stable@vger.kernel.org> # 5.13+ Suggested-by: Wei Wang <wvw@google.com> Signed-off-by: Di Shen <di.shen@unisoc.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the error handling path of ubifs_symlink(), the inode will be marked as
bad first, then iput() is invoked. If inode->i_link is initialized by
fscrypt_encrypt_symlink() in the encryption scenario, inode->i_link won't
be freed by the callchain ubifs_free_inode -> fscrypt_free_inode in the
error handling path, because make_bad_inode() has changed 'inode->i_mode'
to 'S_IFREG'.
The following kmemleak is easily reproduced by injecting an error in
ubifs_jnl_update() when creating a symlink in the encryption scenario:
unreferenced object 0xffff888103da3d98 (size 8):
comm "ln", pid 1692, jiffies 4294914701 (age 12.045s)
backtrace:
kmemdup+0x32/0x70
__fscrypt_encrypt_symlink+0xed/0x1c0
ubifs_symlink+0x210/0x300 [ubifs]
vfs_symlink+0x216/0x360
do_symlinkat+0x11a/0x190
do_syscall_64+0x3b/0xe0
There are two ways of fixing it:
1. Remove make_bad_inode() in the error handling path. We can do that
because ubifs_evict_inode() will do the same processing for good
symlink inodes and bad symlink inodes, as the inode->i_nlink check
comes before is_bad_inode().
2. Free inode->i_link before marking the inode bad.
Method 2 is picked; it has less influence, personally, I think.
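A sketch of the chosen fix in the ubifs_symlink() error path (label simplified):

    out_inode:
            /* Free i_link before the inode is marked bad; make_bad_inode()
             * changes i_mode, after which fscrypt_free_inode() would no
             * longer free the cached symlink target.
             */
            fscrypt_free_inode(inode);
            make_bad_inode(inode);
            iput(inode);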
Cc: stable@vger.kernel.org Fixes: 2c58d548f570 ("fscrypt: cache decrypted symlink target in ->i_link") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In /proc/iomem, sub-regions should be inserted after their parent,
otherwise the insertion of the parent resource fails. But after the
generic crashkernel reservation was applied, in both RISC-V and ARM64
(LoongArch will also use the generic reservation later on), crashkernel
resources are inserted before their parent, which causes the parent to
disappear from /proc/iomem. So we defer the insertion of crashkernel
resources to an early_initcall().
If the system has no mirrored memory, or crashkernel.high is used while
kernelcore=mirror is enabled on the command line, then the crash kernel
will have limited mirrored memory and this usually leads to OOM.
To solve this problem, disable the mirror feature during crashkernel.
Link: https://lkml.kernel.org/r/20240109041536.3903042-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It appears that on TU106 GPUs (2070), some of the nvdec engines are in
the runlist but have no valid nonstall interrupt; nouveau didn't handle
that too well.
Specs don't say anything about UIP being cleared within 10ms. They
only say that UIP won't occur for another 244uS. If a long NMI occurs
while UIP is still updating it might not be possible to get valid
data in 10ms.
It has been observed in the wild that around s2idle some calls can
take up to 480ms before UIP is clear.
Adjust callers from outside an interrupt context to wait for up to
1s instead of 10ms.
The UIP timeout is hardcoded to 10ms for all RTC reads, but in some
contexts this might not be enough time. Add a timeout parameter to
mc146818_get_time() and mc146818_get_time_callback().
If UIP timeout is configured by caller to be >=100 ms and a call
takes this long, log a warning.
Make all callers use 10ms to ensure no functional changes.
mc146818_get_time() calls mc146818_avoid_UIP() to avoid fetching the
time while RTC update is in progress (UIP). When this fails, the return
code is -EIO, but actually there was no IO failure.
The reason for the return from mc146818_avoid_UIP() is that the UIP
wasn't cleared in the time period. Adjust the return code to -ETIMEDOUT
to match the behavior.
When mc146818_avoid_UIP() fails to return a valid value, this is because
UIP didn't clear in the timeout period. Adjust the return code in this
case to -ETIMEDOUT.
Intel systems > 2015 have been configured to use ACPI alarm instead
of HPET to avoid s2idle issues.
Having HPET programmed for wakeup causes problems on AMD systems with
s2idle as well.
One particular case is that the systemd "SuspendThenHibernate" feature
doesn't work properly on the Framework 13" AMD model. Switching to
using ACPI alarm fixes the issue.
Adjust the quirk to apply to AMD/Hygon systems from 2021 onwards.
This matches what has been tested and is specifically to avoid potential
risk to older systems.
Currently the ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD workaround isn't
quite right, as it is supposed to be applied after the last explicit
memory access, but is immediately followed by an LDR.
The ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD workaround is used to
handle Cortex-A520 erratum 2966298 and Cortex-A510 erratum 3117295,
which are described in:
| If pagetable isolation is disabled, the context switch logic in the
| kernel can be updated to execute the following sequence on affected
| cores before exiting to EL0, and after all explicit memory accesses:
|
| 1. A non-shareable TLBI to any context and/or address, including
| unused contexts or addresses, such as a `TLBI VALE1 Xzr`.
|
| 2. A DSB NSH to guarantee completion of the TLBI.
The important part being that the TLBI+DSB must be placed "after all
explicit memory accesses".
Unfortunately, as-implemented, the TLBI+DSB is immediately followed by
an LDR, as we have:
This patch fixes this by reworking the logic to place the TLBI+DSB
immediately before the ERET, after all explicit memory accesses.
The ERET is currently in a separate alternative block, and alternatives
cannot be nested. To account for this, the alternative block for
ARM64_UNMAP_KERNEL_AT_EL0 is replaced with a single alternative branch
to skip the KPTI logic, with the new shape of the logic being:
The new structure means that the workaround is only applied when KPTI is
not in use; this is fine as noted in the documented implications of the
erratum:
| Pagetable isolation between EL0 and higher level ELs prevents the
| issue from occurring.
... and as per the workaround description quoted above, the workaround
is only necessary "If pagetable isolation is disabled".
Fixes: 471470bc7052 ("arm64: errata: Add Cortex-A520 speculative unprivileged load workaround") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Rob Herring <robh@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240116110221.420467-2-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When sme_alloc() is called with existing storage and we are not flushing,
we will always allocate new storage, both leaking the existing storage and
corrupting the state. Fix this by separating the checks for flushing and
for existing storage, as we do for SVE.
Callers that reallocate (eg, due to changing the vector length) should
call sme_free() themselves.
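A simplified sketch of the separated checks, mirroring the SVE logic:

    void sme_alloc(struct task_struct *task, bool flush)
    {
            /* Keep existing storage; flushing must not leak it. */
            if (task->thread.sme_state) {
                    if (flush)
                            memset(task->thread.sme_state, 0,
                                   sme_state_size(task));
                    return;
            }

            /* ... allocate new storage only when there is none ... */
    }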
Fixes: 5d0a8d2fba50 ("arm64/ptrace: Ensure that SME is set up for target when writing SSVE state") Signed-off-by: Mark Brown <broonie@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20240115-arm64-sme-flush-v1-1-7472bd3459b7@kernel.org Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Implement the workaround for ARM Cortex-A510 erratum 3117295. On an
affected Cortex-A510 core, a speculatively executed unprivileged load
might leak data from a privileged load via a cache side channel. The
issue only exists for loads within a translation regime with the same
translation (e.g. same ASID and VMID). Therefore, the issue only affects
the return to EL0.
The erratum and workaround are the same as ARM Cortex-A520 erratum 2966298, so reuse the existing workaround.
The 'i' constraint expects a constant operand, which fn and its
constant derivative MK_CBO(fn) are, but passing fn through a function
as a parameter and using a local variable for MK_CBO(fn) allow the
compiler to lose sight of that when no optimization is done. Use
a macro instead of a function and skip the local variable to ensure
the compiler uses constants, matching the asm constraints.
Reported-by: Yunhui Cui <cuiyunhui@bytedance.com> Closes: https://lore.kernel.org/all/20240117082514.42967-1-cuiyunhui@bytedance.com Fixes: a29e2a48afe3 ("RISC-V: selftests: Add CBO tests") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20240117130933.57514-2-ajones@ventanamicro.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In COMPAT mode, the STACK_TOP is DEFAULT_MAP_WINDOW (0x80000000), but
the TASK_SIZE is 0x7fff000. When the user stack is above 0x7fff000, it
will cause a user segmentation fault. Sometimes, it would cause boot
failure when the whole rootfs is rv32.
Freeing unused kernel image (initmem) memory: 2236K
Run /sbin/init as init process
Starting init: /sbin/init exists but couldn't execute it (error -14)
Run /etc/init as init process
...
Increase the TASK_SIZE to cover STACK_TOP.
Cc: stable@vger.kernel.org Fixes: add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Reviewed-by: Leonardo Bras <leobras@redhat.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20231222115703.2404036-2-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the task is in COMPAT mode, the arch_get_mmap_end should be 2GB,
not TASK_SIZE_64. TASK_SIZE already contains the is_compat_mode()
detection, so change the definition of STACK_TOP_MAX to TASK_SIZE
directly.
Cc: stable@vger.kernel.org Fixes: add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Reviewed-by: Leonardo Bras <leobras@redhat.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20231222115703.2404036-3-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In mtk_jpeg_probe, &jpeg->job_timeout_work is bound with
mtk_jpeg_job_timeout_work.
In mtk_jpeg_dec_device_run, if an error happens in
mtk_jpeg_set_dec_dst, it will still start the worker while
marking the job as finished by invoking v4l2_m2m_job_finish.
There are two ways to trigger the bug. If we remove the
module, mtk_jpeg_remove will be called to do the cleanup.
The possible sequence is as follows, which will cause a
use-after-free bug.
If we close the file descriptor, mtk_jpeg_release will be called
and a similar sequence follows.
Fix this bug by starting the timeout worker only if the jpegdec worker
was started successfully. Then v4l2_m2m_job_finish will only be called
in either mtk_jpeg_job_timeout_work or mtk_jpeg_dec_device_run.
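A sketch of the reworked flow in mtk_jpeg_dec_device_run() (arguments and constants simplified):

    ret = mtk_jpeg_set_dec_dst(ctx, &jpeg_src_buf->dec_param,
                               &dst_buf->vb2_buf, &fb);
    if (ret) {
            v4l2_m2m_job_finish(jpeg->m2m_dev, ctx->fh.m2m_ctx);
            return;
    }

    /* Arm the timeout worker only once the decode will really run. */
    schedule_delayed_work(&jpeg->job_timeout_work,
                          msecs_to_jiffies(MTK_JPEG_HW_TIMEOUT_MSEC));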
In mtk_jpegdec_worker, if an error occurs in mtk_jpeg_set_dec_dst, it
will start the timeout worker and invoke v4l2_m2m_job_finish at the
same time. This breaks the design logic, which requires that only one
function call v4l2_m2m_job_finish; now the timeout handler and
mtk_jpegdec_worker will both invoke it.
Fix it by starting the worker only if mtk_jpeg_set_dec_dst finished
successfully.
Use a copy of the struct v4l2_subdev_format when propagating
format from the sink to source pad in order to avoid impacting the
sink format returned to the application.
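A minimal sketch of the approach:

    struct v4l2_subdev_format fmt = *format;    /* work on a copy */

    /* ... adjust fmt.pad and propagate fmt to the source pad; the
     * caller's 'format' (the sink result) is left untouched ... */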
Thanks to Jacopo Mondi for pointing out the issue.
Fixes: 6c01e6f3f27b ("media: st-mipid02: Propagate format from sink to source pad") Signed-off-by: Alain Volmat <alain.volmat@foss.st.com> Cc: stable@vger.kernel.org Reviewed-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Daniel Scally <dan.scally@ideasonboard.com> Reviewed-by: Benjamin Mugnier <benjamin.mugnier@foss.st.com> Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is no need to duplicate what SPI core or individual controller
drivers already do, i.e. mapping the buffers for DMA capable transfers.
Note, that the code, besides its redundancy, was buggy: strictly speaking
there is no guarantee, while it's true for those which can use this code
(see below), that the SPI host controller _is_ the device which does DMA.
Also see the Link tags below.
Additional notes. Currently only two SPI host controller drivers may use
premapped (by the user) DMA buffers:
- drivers/spi/spi-au1550.c
- drivers/spi/spi-fsl-spi.c
Both of them have DMA mapping support code. I don't expect that SPI host
controller code is worse than what has been done in mmc_spi. Hence I do
not expect any regressions here. Otherwise, I'm pretty much sure these
regressions have to be fixed in the respective drivers, and not here.
That said, remove all related pieces of DMA mapping code from mmc_spi.
Field Firmware Update (ffu) may use a close-ended or open-ended sequence.
Each such sequence is comprised of write commands enclosed between 2
switch commands - to and from ffu mode. So for the close-ended case, it
will be: cmd6->cmd23->cmd25->cmd6.
Some host controllers however, get confused when multi-block rw is sent
without sbc, and may generate auto-cmd12 which breaks the ffu sequence.
I encountered this issue while testing fwupd (github.com/fwupd/fwupd)
on HP Chromebook x2, a qualcomm based QC-7c, code name - strongbad.
Instead of a quirk, or hooking the request function of the msm ops,
it would be better to fix the ioctl handling and make it use mrq.sbc
instead of issuing SET_BLOCK_COUNT separately.
For dmabuf import users to be able to use the vaddr from another
videobuf2-dma-sg source, the exporter needs to set a proper vaddr on
vb2_dma_sg_dmabuf_ops_vmap callback. This patch adds vmap on map if
buf->vaddr was not set.
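A simplified sketch of the vmap callback with the on-demand mapping:

    static int vb2_dma_sg_dmabuf_ops_vmap(struct dma_buf *dbuf,
                                          struct iosys_map *map)
    {
            struct vb2_dma_sg_buf *buf = dbuf->priv;

            /* Map on demand so importers get a valid vaddr even if the
             * exporter never needed a kernel mapping itself.
             */
            if (!buf->vaddr)
                    buf->vaddr = vm_map_ram(buf->pages, buf->num_pages, -1);
            if (!buf->vaddr)
                    return -EIO;

            iosys_map_set_vaddr(map, buf->vaddr);
            return 0;
    }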
Cc: stable@kernel.org Fixes: 7938f4218168 ("dma-buf-map: Rename to iosys-map") Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Acked-by: Tomasz Figa <tfiga@chromium.org> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recent changes to kernel_connect() and kernel_bind() ensure that
callers are insulated from changes to the address parameter made by BPF
SOCK_ADDR hooks. This patch wraps direct calls to ops->connect() and
ops->bind() with kernel_connect() and kernel_bind() to protect callers
in such cases.
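A sketch of the substitution in dlm's lowcomms code (simplified):

    /* Before: a BPF SOCK_ADDR hook could rewrite the caller-owned address. */
    result = sock->ops->connect(sock, (struct sockaddr *)&addr, addr_len, 0);

    /* After: kernel_connect() insulates the caller from such changes. */
    result = kernel_connect(sock, (struct sockaddr *)&addr, addr_len, 0);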
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/ Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect") Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind") Cc: stable@vger.kernel.org Signed-off-by: Jordan Rife <jrife@google.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some ioctl commands do not require ioctl permission, but are routed to
other permissions such as FILE_GETATTR or FILE_SETATTR. This routing is
done by comparing the ioctl cmd to a set of 64-bit flags (FS_IOC_*).
However, if a 32-bit process is running on a 64-bit kernel, it emits
32-bit flags (FS_IOC32_*) for certain ioctl operations. These flags are
being checked erroneously, which leads to these ioctl operations being
routed to the ioctl permission, rather than the correct file
permissions.
This was also noted in a RED-PEN finding from a while back -
"/* RED-PEN how should LSM module know it's handling 32bit? */".
This patch introduces a new hook, security_file_ioctl_compat(), that is
called from the compat ioctl syscall. All current LSMs have been changed
to support this hook.
Reviewing the three places where we are currently using
security_file_ioctl(), it appears that only SELinux needs a dedicated
compat change; TOMOYO and SMACK appear to be functional without any
change.
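A sketch of the new hook's call site in the compat ioctl syscall (simplified):

    /* In the compat ioctl path, route through the compat-aware hook so
     * LSMs can recognize the 32-bit FS_IOC32_* commands.
     */
    error = security_file_ioctl_compat(f.file, cmd, arg);
    if (error)
            goto out;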
Cc: stable@vger.kernel.org Fixes: 0b24dcb7f2f7 ("Revert "selinux: simplify ioctl checking"") Signed-off-by: Alfred Piccioni <alpic@google.com> Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
[PM: subject tweak, line length fixes, and alignment corrections] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The USB DP/DM HS PHY interrupts need to be provided by the PDC interrupt
controller in order to be able to wake the system up from low-power
states and to be able to detect disconnect events, which requires
triggering on falling edges.
A recent commit updated the trigger type but failed to change the
interrupt provider as required. This leads to the current Linux driver
failing to probe instead of printing an error during suspend and USB
wakeup not working as intended.
Fixes: de3b3de30999 ("arm64: dts: qcom: sdm670: fix USB wakeup interrupt types") Fixes: 07c8ded6e373 ("arm64: dts: qcom: add sdm670 and pixel 3a device trees") Cc: stable@vger.kernel.org # 6.2 Cc: Richard Acayan <mailingradian@gmail.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Tested-by: Richard Acayan <mailingradian@gmail.com> Link: https://lore.kernel.org/r/20231214074319.11023-2-johan+linaro@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The USB DP/DM HS PHY interrupts need to be provided by the PDC interrupt
controller in order to be able to wake the system up from low-power
states and to be able to detect disconnect events, which requires
triggering on falling edges.
A recent commit updated the trigger type but failed to change the
interrupt provider as required. This leads to the current Linux driver
failing to probe instead of printing an error during suspend and USB
wakeup not working as intended.
Fixes: 0dc0f6da3d43 ("arm64: dts: qcom: sc8180x: fix USB wakeup interrupt types") Fixes: b080f53a8f44 ("arm64: dts: qcom: sc8180x: Add remoteprocs, wifi and usb nodes") Cc: stable@vger.kernel.org # 6.5 Cc: Vinod Koul <vkoul@kernel.org> Reported-by: Konrad Dybcio <konrad.dybcio@linaro.org> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Tested-by: Konrad Dybcio <konrad.dybcio@linaro.org> Link: https://lore.kernel.org/r/20231213173403.29544-2-johan+linaro@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The USB DP/DM HS PHY interrupts need to be provided by the PDC interrupt
controller in order to be able to wake the system up from low-power
states and to be able to detect disconnect events, which requires
triggering on falling edges.
A recent commit updated the trigger type but failed to change the
interrupt provider as required. This leads to the current Linux driver
failing to probe instead of printing an error during suspend and USB
wakeup not working as intended.
Fixes: 54524b6987d1 ("arm64: dts: qcom: sm8150: fix USB wakeup interrupt types") Fixes: 0c9dde0d2015 ("arm64: dts: qcom: sm8150: Add secondary USB and PHY nodes") Fixes: b33d2868e8d3 ("arm64: dts: qcom: sm8150: Add USB and PHY device nodes") Cc: stable@vger.kernel.org # 5.10 Cc: Jack Pham <quic_jackp@quicinc.com> Cc: Jonathan Marek <jonathan@marek.ca> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org> Link: https://lore.kernel.org/r/20231213173403.29544-5-johan+linaro@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The USB DP/DM HS PHY interrupts need to be provided by the PDC interrupt
controller in order to be able to wake the system up from low-power
states and to be able to detect disconnect events, which requires
triggering on falling edges.
A recent commit updated the trigger type but failed to change the
interrupt provider as required. This leads to the current Linux driver
failing to probe instead of printing an error during suspend and USB
wakeup not working as intended.
The USB DP/DM HS PHY interrupts need to be provided by the PDC interrupt
controller in order to be able to wake the system up from low-power
states and to be able to detect disconnect events, which requires
triggering on falling edges.
A recent commit updated the trigger type but failed to change the
interrupt provider as required. This leads to the current Linux driver
failing to probe instead of printing an error during suspend and USB
wakeup not working as intended.
Add the missing vio-supply to all usages of the AW2013 LED controller
to ensure that the regulator needed for pull-up of the interrupt and
I2C lines is really turned on. While this seems to have worked fine so
far some of these regulators are not guaranteed to be always-on. For
example, pm8916_l6 is typically turned off together with the display
if there aren't any other devices (e.g. sensors) keeping it always-on.
The DP/DM wakeup interrupts are edge triggered and which edge to trigger
on depends on use-case and whether a Low speed or Full/High speed device
is connected.
The DP/DM wakeup interrupts are edge triggered and which edge to trigger
on depends on use-case and whether a Low speed or Full/High speed device
is connected.
Fixes: 0c9dde0d2015 ("arm64: dts: qcom: sm8150: Add secondary USB and PHY nodes") Fixes: b33d2868e8d3 ("arm64: dts: qcom: sm8150: Add USB and PHY device nodes") Cc: stable@vger.kernel.org # 5.10 Cc: Jonathan Marek <jonathan@marek.ca> Cc: Jack Pham <quic_jackp@quicinc.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Reviewed-by: Jack Pham <quic_jackp@quicinc.com> Link: https://lore.kernel.org/r/20231120164331.8116-11-johan+linaro@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The DP/DM wakeup interrupts are edge triggered and which edge to trigger
on depends on use-case and whether a Low speed or Full/High speed device
is connected.
Fixes: 07c8ded6e373 ("arm64: dts: qcom: add sdm670 and pixel 3a device trees") Cc: stable@vger.kernel.org # 6.2 Cc: Richard Acayan <mailingradian@gmail.com> Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Acked-by: Richard Acayan <mailingradian@gmail.com> Link: https://lore.kernel.org/r/20231120164331.8116-8-johan+linaro@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The DP/DM wakeup interrupts are edge triggered and which edge to trigger
on depends on use-case and whether a Low speed or Full/High speed device
is connected.
The DP/DM wakeup interrupts are edge triggered and which edge to trigger
on depends on use-case and whether a Low speed or Full/High speed device
is connected.
Fixes: 0b766e7fe5a2 ("arm64: dts: qcom: sc7180: Add USB related nodes") Cc: stable@vger.kernel.org # 5.10 Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Link: https://lore.kernel.org/r/20231120164331.8116-4-johan+linaro@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The blsp_dma controller is shared between the different subsystems,
which is why it is already initialized by the firmware. We should not
reinitialize it from Linux so that other potential users of the DMA
engine do not misbehave.
In mainline this can be described using the "qcom,controlled-remotely"
property. In the downstream/vendor kernel from Qualcomm there is an
opposite "qcom,managed-locally" property. This property is *not* set
for the qcom,sps-dma@7884000 [1] so adding "qcom,controlled-remotely"
upstream matches the behavior of the downstream/vendor kernel.
Adding this seems to fix some weird UART issues where both input and
output become garbled with certain obscure firmware versions on some
devices.
The blsp_dma controller is shared between the different subsystems,
which is why it is already initialized by the firmware. We should not
reinitialize it from Linux so that other potential users of the DMA
engine do not misbehave.
In mainline this can be described using the "qcom,controlled-remotely"
property. In the downstream/vendor kernel from Qualcomm there is an
opposite "qcom,managed-locally" property. This property is *not* set
for the qcom,sps-dma@7884000 [1] so adding "qcom,controlled-remotely"
upstream matches the behavior of the downstream/vendor kernel.
Adding this seems to fix some weird UART issues where both input and
output become garbled with certain obscure firmware versions on some
devices.
The QoS blocks saved/restored when toggling the PD_USB power domain are
clocked by ACLK_USB. Attempting to access these memory regions without
that clock running will result in an indefinite CPU stall.
The PD_USB node wasn't specifying this clock dependency, resulting in
hangs when trying to toggle the power domain (either on or off), unless
we get "lucky" and have ACLK_USB running for another reason at the time.
This "luck" can result from the bootloader leaving USB powered/clocked,
and if no built-in driver wants USB, Linux will disable the unused
PD+CLK on boot when {pd,clk}_ignore_unused aren't given. This can also
be unlucky because the two cleanup tasks run in parallel and race: if
the CLK is disabled first, the PD deactivation stalls the boot. In any
case, the PD cannot then be reenabled (if e.g. the driver loads later)
once the clock has been stopped.
Fix this by specifying a dependency on ACLK_USB, instead of only
ACLK_USB_ROOT. The child-parent relationship means the former implies
the latter anyway.
Fixes: c9211fa2602b8 ("arm64: dts: rockchip: Add base DT for rk3588 SoC") Cc: stable@vger.kernel.org Signed-off-by: Sam Edwards <CFSworks@gmail.com> Link: https://lore.kernel.org/r/20231216021019.1543811-1-CFSworks@gmail.com
[changed to only include the missing clock, not dropping the root-clocks] Signed-off-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The DP/DM wakeup interrupts are edge triggered and which edge to trigger
on depends on use-case and whether a Low speed or Full/High speed device
is connected.
The sc8280xp Display Port PHYs can be used in either DP or eDP mode, and
this is configured using the devicetree compatible string, which
defaults to DP mode in the SoC dtsi.
Override the default compatible string for the CRD eDP PHY node so that
the eDP settings are used.
iface_last_update was an unused field when it was introduced.
Later, when periodic updates of the server interface list were added,
this field was used regularly to decide when to update next.
However, with the new logic of updating the interfaces, it
becomes crucial that this field be updated whenever
parse_server_interfaces runs successfully.
This change updates this field when the server does not support
querying interfaces, so that we do not query the interfaces
repeatedly. It also updates the field when the function reaches
the end.
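A rough sketch of where the timestamp updates land (illustrative only;
the iface_last_update and iface_list names are taken from the text above,
everything else is simplified and not the actual cifs code):

  /* Illustrative sketch -- not the real parse_server_interfaces(). */
  static int parse_server_interfaces_sketch(struct cifs_ses *ses, int rc)
  {
          if (rc == -EOPNOTSUPP) {
                  /* Server cannot enumerate interfaces; remember when we
                   * last asked so we do not repeat the query on every
                   * refresh cycle. */
                  ses->iface_last_update = jiffies;
                  return rc;
          }

          /* ... parse the response and refresh ses->iface_list here ... */

          /* Stamp the successful refresh as well, so the periodic-update
           * logic knows how fresh the list is. */
          ses->iface_last_update = jiffies;
          return 0;
  }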
Fixes: aa45dadd34e4 ("cifs: change iface_list from array to sorted linked list") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Some servers, like Azure SMB servers, always advertise multichannel
capability in the server capabilities list. Such servers return error
STATUS_NOT_IMPLEMENTED for ioctl calls to query server interfaces,
and expect clients to consider that as a sign that they do not support
multichannel.
We already handled this at mount time: soon after the tree connect, we
query server interfaces, and when the server returns STATUS_NOT_IMPLEMENTED,
we keep the interface list empty. When cifs_try_adding_channels gets
called, it does not find any interfaces, so it will not add channels.
For the case where an active multichannel mount exists, and multichannel
is disabled by such a server, this change will now allow the client
to disable secondary channels on the mount. It will check the return
status of the query server interfaces call soon after a tree reconnect.
If the return status is EOPNOTSUPP, then instead of checking whether to
add more channels, we disable the secondary channels.
For better code reuse, this change also moves the common code for
disabling multichannel to a helper function.
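A hypothetical sketch of the post-reconnect decision described above (the
helper name cifs_disable_secondary_channels() is assumed here and not
confirmed by the text; the surrounding flow is simplified):

  /* Sketch: after a tree reconnect, either grow or tear down channels
   * depending on whether the interface query is supported. */
  static void after_tree_reconnect_sketch(struct cifs_ses *ses, int query_rc)
  {
          if (query_rc == -EOPNOTSUPP) {
                  /* Server says multichannel is not really supported:
                   * drop any secondary channels instead of adding more. */
                  cifs_disable_secondary_channels(ses);
                  return;
          }

          /* ... otherwise continue to the usual cifs_try_adding_channels()
           * path based on the refreshed interface list ... */
  }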
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Stable-dep-of: 78e727e58e54 ("cifs: update iface_last_update on each query-and-update") Signed-off-by: Sasha Levin <sashal@kernel.org>
The data offset for the SMB3.1.1 POSIX create context will always be
8-byte aligned, so having the check 'noff + nlen >= doff' in
smb2_parse_contexts() is wrong, as it will lead to -EINVAL because
noff + nlen == doff.
Fix the sanity check to correctly handle aligned create context data.
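As a rough illustration (a simplified stand-in, not the actual
smb2_parse_contexts() code), the problem is in the comparison operator:

  #include <linux/errno.h>

  /* 'noff'/'nlen' locate the context name and 'doff' the context data.
   * Because the data is 8-byte aligned, noff + nlen == doff is a valid
   * layout, so only a strict '>' indicates bogus offsets. */
  static int check_create_context_offsets(unsigned int noff,
                                          unsigned int nlen,
                                          unsigned int doff)
  {
          if (noff + nlen > doff)  /* was '>=', which rejected the aligned case */
                  return -EINVAL;
          return 0;
  }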
Fixes: af1689a9b770 ("smb: client: fix potential OOBs in smb2_parse_contexts()") Signed-off-by: Paulo Alcantara <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
One instance of gpio_backlight_platform_data.fbdev was renamed, but the
second instance was forgotten, causing a build failure:
arch/sh/boards/mach-ecovec24/setup.c: In function ‘arch_setup’:
arch/sh/boards/mach-ecovec24/setup.c:1223:37: error: ‘struct gpio_backlight_platform_data’ has no member named ‘fbdev’; did you mean ‘dev’?
1223 | gpio_backlight_data.fbdev = NULL;
| ^~~~~
| dev
Fix this by updating the second instance.
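Based on the error message above, the update is presumably the one-line
rename at the reported site (sketch only, not verified against the tree):

  /* In arch_setup() in arch/sh/boards/mach-ecovec24/setup.c, the second
   * instance gets the same field rename applied
   * (it previously read: gpio_backlight_data.fbdev = NULL;). */
  gpio_backlight_data.dev = NULL;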
Fixes: ed369def91c1579a ("backlight/gpio_backlight: Rename field 'fbdev' to 'dev'") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309231601.Uu6qcRnU-lkp@intel.com/ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Link: https://lore.kernel.org/r/20230925111022.3626362-1-geert+renesas@glider.be Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
When libata calls ata_link_abort() to abort all ata queued commands, it
calls blk_abort_request() on the SCSI command representing each QC.
This causes scsi_timeout() to be called, which calls scsi_eh_scmd_add() for
each SCSI command.
scsi_eh_scmd_add() sets the SCSI host to state recovery, and then adds the
command to shost->eh_cmd_q.
This will wake up the SCSI EH, and eventually the libata EH strategy
handler will be called, which calls scsi_eh_flush_done_q() to either flush
retry or flush finish each failed command.
The commands that are flush retried by scsi_eh_flush_done_q() are
requeued using scsi_queue_insert().
Before commit 8b566edbdbfb ("scsi: core: Only kick the requeue list if
necessary"), __scsi_queue_insert() called blk_mq_requeue_request() with the
second argument set to true, indicating that it should always kick/run the
requeue list after inserting.
After commit 8b566edbdbfb ("scsi: core: Only kick the requeue list if
necessary"), __scsi_queue_insert() does not kick/run the requeue list after
inserting, if the current SCSI host state is recovery (which is the case in
the libata example above).
This optimization is probably fine in most cases, as I can only assume that
most often someone will eventually kick/run the queues.
However, that is not the case for scsi_eh_flush_done_q(), where we can see
that the request gets inserted into the requeue list, but the queue is never
started after the request has been inserted, leading to the block layer
waiting for the completion of a command that never gets to run.
Since scsi_eh_flush_done_q() is called by SCSI EH context, the SCSI host
state is most likely always in recovery when this function is called.
Thus, let scsi_eh_flush_done_q() explicitly kick the requeue list after
inserting a flush retry command, so that scsi_eh_flush_done_q() keeps the
same behavior as before commit 8b566edbdbfb ("scsi: core: Only kick the
requeue list if necessary").
Simple reproducer for the libata example above:
$ hdparm -Y /dev/sda
$ echo 1 > /sys/class/scsi_device/0\:0\:0\:0/device/delete
Fixes: 8b566edbdbfb ("scsi: core: Only kick the requeue list if necessary") Reported-by: Kevin Locke <kevin@kevinlocke.name> Closes: https://lore.kernel.org/linux-scsi/ZZw3Th70wUUvCiCY@kevinlocke.name/ Signed-off-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240111120533.3612509-1-cassel@kernel.org Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The terminating NUL is not taken into account by strncat(), so switch to
strlcat() to correctly compute the size of the available memory when
appending CONFIG_CMDLINE to 'early_cmdline'.
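A small userspace illustration of the difference (generic C, not the
kernel code in question): strncat()'s size argument bounds the number of
bytes copied from the source and accounts for neither the destination's
existing contents nor the terminating NUL, while strlcat() takes the
total size of the destination buffer:

  #include <stdio.h>
  #include <string.h>

  /* Minimal local strlcat() for the demo (semantics as in the BSD/kernel
   * version: 'size' is the full size of the destination buffer). */
  static size_t my_strlcat(char *dst, const char *src, size_t size)
  {
          size_t dlen = strlen(dst);

          if (dlen < size)
                  snprintf(dst + dlen, size - dlen, "%s", src);
          return dlen + strlen(src); /* length it tried to create */
  }

  int main(void)
  {
          char buf[8] = "cmd";

          /* Wrong sizing: sizeof(buf) is NOT the remaining space, so
           * strncat() could write past the end of 'buf':
           * strncat(buf, " extra-args", sizeof(buf)); */

          /* strlcat() takes the full destination size and truncates safely. */
          my_strlcat(buf, " extra-args", sizeof(buf));
          printf("%s\n", buf); /* prints "cmd ext" -- truncated, NUL-terminated */
          return 0;
  }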
When there is not enough allocatable memory for the relocation
hashtable, module loading should exit gracefully. Previously, this was
attempted by checking whether an unsigned number is less than zero,
which does not work. Instead, have the caller check whether the
hashtable was correctly allocated, and add a comment explaining that a
hashtable_bits of 0 is valid.
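A self-contained illustration of why the old check could never fire,
using hypothetical names (this is not the actual riscv module-loading
code):

  #include <stdio.h>

  int main(void)
  {
          unsigned int hashtable_bits = 0;
          void *hashtable = NULL; /* pretend the table allocation failed */

          /* An unsigned value is never negative, so this check is dead code
           * (compilers even warn that the comparison is always false). */
          if (hashtable_bits < 0)
                  printf("never reached\n");

          /* The meaningful failure signal is the allocation itself: test the
           * returned table pointer, since hashtable_bits == 0 (a one-bucket
           * table) is a perfectly valid configuration. */
          if (!hashtable)
                  printf("allocation failed, abort module loading gracefully\n");

          return 0;
  }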
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Fixes: d8792a5734b0 ("riscv: Safely remove entries from relocation list") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202312132019.iYGTwW0L-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/r/202312120044.wTI1Uyaa-lkp@intel.com/ Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240104-module_loading_fix-v3-2-a71f8de6ce0f@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Calling ufshcd_hba_exit() from a function that is called asynchronously
from ufshcd_init() is wrong because this triggers multiple race
conditions. Instead of calling ufshcd_hba_exit(), log an error message.
Reported-by: Daniel Mentz <danielmentz@google.com> Fixes: 1d337ec2f35e ("ufs: improve init sequence") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20231218225229.2542156-3-bvanassche@acm.org Reviewed-by: Can Guo <quic_cang@quicinc.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The Xilinx DMA engine is capable of keeping track of the number of elapsed
periods; this is an increasing 32-bit counter which is only reset when the
engine is turned off. There is no need to add this value to our local
counter.
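In other words, the cyclic-transfer bookkeeping should assign the
hardware's cumulative period count rather than accumulate it again in
software. A hypothetical sketch (names are illustrative, not the actual
xilinx_dma driver code):

  /* 'hw_periods' is the cumulative 32-bit elapsed-period counter read back
   * from the engine; it only resets when the engine is turned off. */
  static void update_period_count_sketch(unsigned int *sw_periods,
                                         unsigned int hw_periods)
  {
          /* Wrong: adding a cumulative counter to a running total double
           * counts every period reported so far:
           * *sw_periods += hw_periods; */

          /* Right: mirror the hardware value directly. */
          *sw_periods = hw_periods;
  }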
A task may be rescheduled within dma_free_coherent(), so dma_free_coherent()
must not be called between spin_lock() and spin_unlock(), as that results in
a call trace like the following:
Call Trace:
<TASK>
dump_stack_lvl+0x37/0x50
__might_resched+0x16a/0x1c0
vunmap+0x2c/0x70
__iommu_dma_free+0x96/0x100
idxd_device_evl_free+0xd5/0x100 [idxd]
device_release_driver_internal+0x197/0x200
unbind_store+0xa1/0xb0
kernfs_fop_write_iter+0x120/0x1c0
vfs_write+0x2d3/0x400
ksys_write+0x63/0xe0
do_syscall_64+0x44/0xa0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Move the dma_free_coherent() call out of the locked context.
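A generic sketch of the pattern (not the actual idxd code; the lock and
bookkeeping fields are placeholders): detach the buffer while holding the
lock, then call dma_free_coherent() only after dropping it:

  /* Sketch: free DMA memory outside the spinlock because
   * dma_free_coherent() may reschedule (e.g. via vunmap()). */
  static void evl_free_sketch(struct device *dev, spinlock_t *lock,
                              void **evl_log, dma_addr_t *evl_dma,
                              size_t size)
  {
          void *log;
          dma_addr_t dma;

          spin_lock(lock);
          log = *evl_log;          /* detach the buffer under the lock */
          dma = *evl_dma;
          *evl_log = NULL;
          spin_unlock(lock);

          if (log)
                  dma_free_coherent(dev, size, log, dma); /* safe: may sleep */
  }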
__dma_async_device_channel_register() can fail. In case of failure,
chan->local is freed (with free_percpu()), and chan->local is nullified.
When dma_async_device_unregister() is called (via the managed API or
intentionally by the DMA controller driver), channels are unconditionally
unregistered, leading to this NULL pointer dereference:
[ 1.318693] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000d0
[...]
[ 1.484499] Call trace:
[ 1.486930] device_del+0x40/0x394
[ 1.490314] device_unregister+0x20/0x7c
[ 1.494220] __dma_async_device_channel_unregister+0x68/0xc0
In the dma_async_device_register() error path, channel device
unregistration is done only if chan->local is not NULL.
Add the same condition at the beginning of
__dma_async_device_channel_unregister(), to avoid the NULL pointer
dereference whichever API is used to reach this function.
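A minimal sketch of the added guard (based on the description above; the
rest of the function body is elided):

  /* Sketch: bail out early for channels whose registration already failed
   * and whose per-cpu state was freed and nullified. */
  static void __dma_async_device_channel_unregister(struct dma_device *device,
                                                    struct dma_chan *chan)
  {
          if (chan->local == NULL)
                  return;

          /* ... existing unregistration steps (device_unregister(), etc.) ... */
  }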