When accessing a file with more entries than ES_MAX_ENTRY_NUM, the bh-array
is allocated in __exfat_get_entry_set. The problem is that the bh-array is
allocated with GFP_KERNEL, which is wrong here: a GFP_KERNEL allocation
may enter direct reclaim and re-enter the filesystem. In the following
cases, a deadlock on sbi->s_lock between the two processes may occur.
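A minimal sketch of the fix direction, assuming the exfat code's names
(the exact allocation site may differ): use GFP_NOFS so the allocation
cannot recurse into the filesystem while sbi->s_lock is held.

    /* Sketch: avoid fs-reclaim recursion while holding sbi->s_lock. */
    bh = kmalloc_array(num_bh, sizeof(*bh), GFP_NOFS);
    if (!bh)
            return -ENOMEM;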
The header_length field is in units of bytes, so it cannot express the
number of elements in the header field. It seems that the argument to the
counted_by attribute cannot be an arithmetic expression, therefore this
commit simply reverts the offending commit.
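For illustration only (hypothetical struct layout), __counted_by() must
name a member holding the element count directly, which a byte-length
field cannot provide:

    struct example {
            __le16 header_length;   /* in bytes, not in elements */
            /*
             * Not allowed: an expression such as
             * __counted_by(header_length / sizeof(__le32))
             */
            __le32 header[];
    };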
This works around a regression introduced by commit 8117961d98f
("x86/efi: Disregard setup header of loaded image"), which triggers boot
issues on older Dell laptops. As it turns out,
switching back to a heap allocation for the struct boot_params
constructed by the EFI stub works around this, even though it is unclear
why.
The fail label is only used in a situation where the previous EFI API
call succeeded, and so status will be set to EFI_SUCCESS. Fix this by
dropping the goto entirely and calling efi_exit() with the correct error
code.
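A sketch of the pattern, with a hypothetical status value standing in
for the real one: instead of jumping to a label that implicitly reuses
status (EFI_SUCCESS at that point), pass the actual error to efi_exit().

    /* Before (broken): goto fail;  -- status is still EFI_SUCCESS here. */

    /* After: exit with the actual error code. */
    efi_exit(handle, EFI_INVALID_PARAMETER);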
mem_cgroup_calculate_protection() is not stateless and should only be used
as part of a top-down tree traversal. shrink_one() traverses the per-node
memcg LRU instead of the root_mem_cgroup tree, and therefore it should not
call mem_cgroup_calculate_protection().
The existing misuse in shrink_one() can cause ineffective protection of
sub-trees that are grandchildren of root_mem_cgroup. Fix it by reusing
lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to
calculate the protection.
Previously, lru_gen_age_node() opportunistically skipped the first pass,
i.e., when scan_control->priority is DEF_PRIORITY. On the second pass,
lruvec_is_sizable() uses appropriate scan_control->priority, set by
set_initial_priority() from lru_gen_shrink_node(), to decide whether a
memcg is too small to reclaim from.
Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup tree.
So it should call set_initial_priority() upfront, to make sure
lruvec_is_sizable() uses appropriate scan_control->priority on the first
pass. Otherwise, lruvec_is_reclaimable() can return false negatives and
result in premature OOM kills when min_ttl_ms is used.
set_initial_priority() tries to jump-start global reclaim by estimating
the priority based on cold/hot LRU pages. The estimation does not account
for shrinker objects, and it cannot do so because their sizes can be in
units other than pages.
If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where
ZFS ARC can use almost all system memory, set_initial_priority() can
vastly underestimate how much memory the ARC shrinker can evict and assign
extremely low values to scan_control->priority, resulting in overshoots of
shrinker objects.
To reproduce the problem, use TrueNAS SCALE 24.04.0 with 32 GB DRAM, a
test ZFS pool, and the following commands:
    for ((i = 0; i < 20; i++))
    do
            sleep 120
            fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
                --filename=/dev/zero --size=1024m --fadvise_hint=0 \
                --rw=randrw --random_distribution=random \
                --time_based --runtime=1m
    done
To fix the problem:
1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
the jump-start from being overly aggressive.
2. Account for the progress from mm_account_reclaimed_pages(), to
prevent kswapd_shrink_node() from raising the priority
unnecessarily.
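A sketch of fix 1 above, assuming the shape of MGLRU's
set_initial_priority(); the clamp keeps the estimate at or above
DEF_PRIORITY/2 (a lower priority value means more aggressive reclaim):

    /* Sketch: never jump-start reclaim harder than DEF_PRIORITY/2. */
    sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);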
Commit 2b5067a8143e ("mm: mmap_lock: add tracepoints around lock
acquisition") introduced the TRACE_MMAP_LOCK_EVENT() macro, which uses
preempt_disable() so that get_mm_memcg_path() can return a percpu
buffer exclusively used by the normal, softirq, irq and NMI contexts
respectively.
Commit 832b50725373 ("mm: mmap_lock: use local locks instead of disabling
preemption") replaced preempt_disable() with local_lock(&memcg_paths.lock)
based on an argument that preempt_disable() has to be avoided because
get_mm_memcg_path() might sleep if PREEMPT_RT=y.
We could replace local_lock() with local_lock_irqsave() in order to
suppress these messages. But this patch instead replaces the percpu
buffers with an on-stack buffer, since the buffer returned by
get_memcg_path_buf() is only 256 bytes, which is small enough to allocate
on the current thread's kernel stack.
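A sketch of the idea, with hypothetical names (buf, MEMCG_PATH_BUF_SIZE):
the macro builds the path into a 256-byte stack buffer instead of taking
a per-CPU buffer under a local lock.

    #define MEMCG_PATH_BUF_SIZE 256 /* small enough for the kernel stack */

    #define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                    \
            do {                                                    \
                    char buf[MEMCG_PATH_BUF_SIZE];                  \
                                                                    \
                    get_mm_memcg_path(mm, buf, sizeof(buf));        \
                    trace_mmap_lock_##type(mm, buf, ##__VA_ARGS__); \
            } while (0)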
Link: https://lkml.kernel.org/r/ef22d289-eadb-4ed9-863b-fbc922b33d8d@I-love.SAKURA.ne.jp
Reported-by: syzbot <syzbot+40905bca570ae6784745@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=40905bca570ae6784745
Fixes: 832b50725373 ("mm: mmap_lock: use local locks instead of disabling preemption")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
evict_folios() uses a second pass to reclaim folios that have gone through
page writeback and become clean before it finishes the first pass, since
folio_rotate_reclaimable() cannot handle those folios due to the
isolation.
The second pass tries to avoid potential double counting by deducting
scan_control->nr_scanned. However, this can result in underflow of
nr_scanned, under a condition where shrink_folio_list() does not increment
nr_scanned, i.e., when folio_trylock() fails.
The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
vmpressure_calc_level(), to become zero, resulting in the following crash:
[exception RIP: vmpressure_work_fn+101]
process_one_work at ffffffffa3313f2b
Since scan_control->nr_scanned has no established semantics, the potential
double counting has minimal risks. Therefore, fix the problem by not
deducting scan_control->nr_scanned in evict_folios().
Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com
Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated")
Reported-by: Wei Xu <weixugc@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alexander Motin <mav@ixsystems.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Lockdep considers this an AA deadlock because the different resize_lock
mutexes reside in the same lockdep class, but this is a false positive.
Place them in distinct classes to avoid these warnings.
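One common way to realize this, shown as a hedged sketch (the actual
hugetlb fix may differ in detail): give each resize_lock its own lockdep
class key so lockdep can tell the instances apart.

    static struct lock_class_key hstate_resize_keys[HUGE_MAX_HSTATE];

    /* During hstate setup: one lockdep class per hstate's resize_lock. */
    mutex_init(&h->resize_lock);
    lockdep_set_class(&h->resize_lock,
                      &hstate_resize_keys[hstate_index(h)]);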
Link: https://lkml.kernel.org/r/20240712031314.2570452-1-linmiaohe@huawei.com
Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When trying to allocate a hugepage with none of the reserved ones free,
the allocation may still be allowed if a number of overcommit hugepages
was configured (using /proc/sys/vm/nr_overcommit_hugepages) and that
number hasn't been reached.
This allows extra hugepages to be allocated dynamically, if there are
resources for it. Some sysadmins even prefer not
reserving any hugepages and setting a big number of overcommit hugepages.
But while attempting to allocate overcommit hugepages on a multi-node
system (either NUMA or mempolicy/cpuset), said allocations might randomly
fail even when there are resources available for the allocation.
This happens because allowed_mems_nr() only accounts for the number of
free hugepages in the nodes the current process belongs to, while the
surplus hugepage allocation can be satisfied from any node. In case
one or more of the requested surplus hugepages are allocated on a
different node, the whole allocation will fail due to allowed_mems_nr()
returning a lower value.
So allocate surplus hugepages in one of the nodes the current process
belongs to.
Easy way to reproduce this issue is to use a 2+ NUMA nodes system:
xarray can't support an arbitrary page cache size. The largest supported
page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71
("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However,
it's possible to end up with a 512MB page cache in the huge memory
collapsing path on an ARM64 system whose base page size is 64KB. A 512MB
page cache breaks this limitation, and a warning is raised when the
xarray entry is split, as shown in the following example.
[root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
KernelPageSize: 64 kB
[root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
:
    int main(int argc, char **argv)
    {
            const char *filename = TEST_XFS_FILENAME;
            int fd = 0;
            void *buf = (void *)-1, *p;
            int pgsize = getpagesize();
            int ret = 0;

            if (pgsize != 0x10000) {
                    fprintf(stdout, "System with 64KB base page size is required!\n");
                    return -EPERM;
            }
Fix it by correcting the supported page cache orders, with different sets
for DAX and other files. With this corrected, a 512MB page cache is
disallowed for all non-DAX files on ARM64 systems where the base page
size is 64KB. After this patch is applied, the test program fails:
-EINVAL is returned from __thp_vma_allowable_orders() and from the
madvise() system call used to collapse the page cache.
Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com
Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: <stable@vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't
force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is
because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT.
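A sketch of the straightforward fix: test for 64-bit support instead of
a per-architecture 32-bit symbol (treat the exact call site as an
assumption).

    /* Sketch: covers x86_32 too, which sets CONFIG_X86_32 but not
     * CONFIG_32BIT.
     */
    if (!IS_ENABLED(CONFIG_64BIT))
            return 0;       /* don't force huge-page alignment on 32-bit */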
When a process' cred struct is replaced, this _almost_ always invokes
the cred_prepare LSM hook; but in one special case (when
KEYCTL_SESSION_TO_PARENT updates the parent's credentials), the
cred_transfer LSM hook is used instead. Landlock only implements the
cred_prepare hook, not cred_transfer, so KEYCTL_SESSION_TO_PARENT causes
all information on Landlock restrictions to be lost.
This basically means that a process with the ability to use the fork()
and keyctl() syscalls can get rid of all Landlock restrictions on
itself.
Fix it by adding a cred_transfer hook that does the same thing as the
existing cred_prepare hook. (Implemented by having hook_cred_prepare()
call hook_cred_transfer() so that the two functions are less likely to
accidentally diverge in the future.)
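A sketch of the hook wiring described above (signatures follow the LSM
cred_prepare/cred_transfer prototypes; the Landlock-specific body is
elided):

    static void hook_cred_transfer(struct cred *const new,
                                   const struct cred *const old)
    {
            /* Copy the Landlock domain reference from old to new. */
    }

    static int hook_cred_prepare(struct cred *const new,
                                 const struct cred *const old,
                                 const gfp_t gfp)
    {
            hook_cred_transfer(new, old);
            return 0;
    }

    static struct security_hook_list landlock_hooks[] __ro_after_init = {
            LSM_HOOK_INIT(cred_prepare, hook_cred_prepare),
            LSM_HOOK_INIT(cred_transfer, hook_cred_transfer),
    };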
Cc: stable@kernel.org
Fixes: 385975dca53e ("landlock: Set up the security framework and manage credentials")
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20240724-landlock-houdini-fix-v1-1-df89a4560ca3@google.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When mtk-cmdq unbinds, a WARN_ON message with the condition
pm_runtime_get_sync() < 0 occurs.
According to the call trace below:
cmdq_mbox_shutdown
mbox_free_channel
mbox_controller_unregister
__devm_mbox_controller_unregister
...
The root cause can be deduced to be calling pm_runtime_get_sync() after
calling pm_runtime_disable() as observed below:
1. CMDQ driver uses devm_mbox_controller_register() in cmdq_probe()
to bind the cmdq device to the mbox_controller, so
devm_mbox_controller_unregister() will automatically unregister
the device bound to the mailbox controller when the device-managed
resource is removed. That means devm_mbox_controller_unregister()
and cmdq_mbox_shutdown() will be called after cmdq_remove().
2. CMDQ driver also uses devm_pm_runtime_enable() in cmdq_probe() after
devm_mbox_controller_register(), so that devm_pm_runtime_disable()
will be called after cmdq_remove(), but before
devm_mbox_controller_unregister().
To fix this problem, cmdq_probe() needs to move
devm_mbox_controller_register() after devm_pm_runtime_enable(), so that
devm_pm_runtime_disable() is called after
devm_mbox_controller_unregister().
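A sketch of the reordering in cmdq_probe() (error handling abbreviated):
devm resources are released in reverse order of registration, so
registering the mailbox controller last makes it the first to be torn
down.

    /* devm teardown runs in reverse: unregister the mbox first,
     * disable runtime PM last.
     */
    err = devm_pm_runtime_enable(dev);
    if (err)
            return err;

    err = devm_mbox_controller_register(dev, &cmdq->mbox);
    if (err)
            return err;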
Two TXDB_V2 channels are used between Linux and the System Manager (SM):
Channel 0 for normal TX, Channel 1 for notification of completion.
The TXDB_V2 trigger logic uses imx_mu_xcr_rmw(), which performs a
read/modify/write sequence.
Note: when the MUB GSR bits are cleared, the MUA side GCR bits also get
cleared, per the hardware design.
Channel0 Linux:

    read GCR -> modify GCR -> write GCR -> M33 SM: read GSR----->clear GSR
                                                           |-(1)-|

Channel1 Linux, starting in time slot (1):

    read GCR -> modify GCR -> write GCR -> M33 SM: read GSR -> clear GSR

So Channel1's read of GCR will read back the GCR value that Channel0
wrote, because the M33 has not finished clearing GSR. This means the
Channel1 GCR write will trigger both the Channel1 and the Channel0
interrupts, which is wrong.
Channel0 will be freed (SCMI channel status set to FREE) in the M33 SM
when processing the first Channel0 interrupt. So when the second
interrupt triggers (channels 0 and 1 trigger together), the SM will see
a freed Channel0 and report a protocol error.
To address the issue, do not use the read/modify/write logic; use a
plain write instead, because writing 0 to GCR is ignored. And after
writing the MUA GCR, wait for the SM to clear MUB GSR by polling the
MUA GCR value.
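A sketch of the new trigger logic, using hypothetical register helpers
modeled on the driver's imx_mu_write() and register layout:

    u32 val;
    int ret;

    /* Writing 0 bits is ignored by GCR, so a plain write cannot
     * re-trigger another channel's doorbell.
     */
    imx_mu_write(priv, BIT(cp->idx), priv->dcfg->xCR[IMX_MU_GCR]);

    /* Wait for the SM to clear MUB GSR: our GCR bit reads back as 0. */
    ret = readl_poll_timeout(priv->base + priv->dcfg->xCR[IMX_MU_GCR],
                             val, !(val & BIT(cp->idx)), 10, 10000);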
Fixes: 5bfe4067d350 ("mailbox: imx: support channel type tx doorbell v2") Reviewed-by: Ranjani Vaidyanathan <ranjani.vaidyanathan@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The ret variable was used to test the reset status, obtained from the
reset_control_status() call. But this variable was overwritten by
ti_sci_proc_get_status() a few lines below.
And as ti_sci_proc_get_status() returns 0 or a negative value (in the
latter case, followed by a return), the expression !ret was always true.
Clearly, this was not what was intended:
In the comment above it's said that "requires both local and module
resets to be deasserted"; if reset_control_status() returns 0 it means
that the reset line is deasserted.
So, it's pretty clear that the return value of reset_control_status()
was intended to be used instead of ti_sci_proc_get_status() return
value.
This could lead to incorrect IPC-only mode detection if the reset line
is asserted (so reset_control_status() returns > 0) and c_state != 0 and
halted == 0.
In this case, the old code would have detected an IPC-only mode instead
of a mismatched mode.
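A sketch of the intended logic, using separate variables so the reset
status is not clobbered (names approximate the k3-r5 driver):

    /* Keep the two return values apart. */
    reset_status = reset_control_status(core->reset);
    if (reset_status < 0)
            return reset_status;

    ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl,
                                 &stat);
    if (ret)
            return ret;

    /* IPC-only mode requires both resets deasserted:
     * reset_status == 0 means the local reset line is deasserted.
     */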
Fixes: 1168af40b1ad ("remoteproc: k3-r5: Add support for IPC-only mode for all R5Fs")
Signed-off-by: Richard Genoud <richard.genoud@bootlin.com>
Reviewed-by: Hari Nagalla <hnagalla@ti.com>
Link: https://lore.kernel.org/r/20240621150058.319524-2-richard.genoud@bootlin.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The current code doesn't check whether platform_get_resource_byname()
succeeded in getting the l1tcm memory, which is optional, before
attempting to map it. This results in the following error message when
it is missing:
Add a check so that the remapping is only attempted if the memory region
exists. This also allows the logic handling remap failures to be
simplified, since a failure is then always a real failure.
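A sketch of the check (resource name taken from the commit context; the
surrounding mtk scp driver code is abbreviated):

    res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "l1tcm");
    if (res) {
            scp->l1tcm_base = devm_ioremap_resource(dev, res);
            if (IS_ERR(scp->l1tcm_base))
                    /* a failure is now always a real failure */
                    return PTR_ERR(scp->l1tcm_base);

            scp->l1tcm_size = resource_size(res);
            scp->l1tcm_phys = res->start;
    }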
Fixes: ca23ecfdbd44 ("remoteproc/mediatek: support L1TCM")
Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20240627-scp-invalid-resource-l1tcm-v1-1-7d221e6c495a@collabora.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
If iio_read_channel_processed() fails, 'val->intval' is not updated, but
it is still multiplied by 1000 just after. So, in case of error, the
*1000 scalings accumulate and 'val->intval' becomes erroneous.
So instead of rescaling the value after the fact, use the dedicated scaling
API. This way the result is updated only when needed. In case of error, the
previous value is kept, unmodified.
This should also reduce any inaccuracies resulting from the scaling.
Finally, this is also slightly more efficient as it saves a function call
and a multiplication.
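A sketch of the conversion, assuming the consumer wants the same *1000
scaling as before:

    /* Before: *1000 was applied even when the read failed and
     * val->intval was stale.
     */
    iio_read_channel_processed(chan, &val->intval);
    val->intval *= 1000;

    /* After: the helper applies the scaling only when the read
     * succeeds; on error the previous value is left untouched.
     */
    ret = iio_read_channel_processed_scale(chan, &val->intval, 1000);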
Currently, there are several places that set CSR.PRMD.PWE: the first one
is in hw_breakpoint_thread_switch(), to enable user-space singlestep via
checking TIF_SINGLESTEP; the second one is in hw_breakpoint_control(), to
enable user-space watchpoints. For the latter case, it should also check
TIF_LOAD_WATCH to make the logic correct and clear.
Configuration for sbq:
depth=64, wake_batch=6, shift=6, map_nr=1
1. There are 64 requests in progress:
map->word = 0xFFFFFFFFFFFFFFFF
2. After all the 64 requests complete, and no more requests come:
map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF
3. Now two tasks try to allocate requests:
    T1:                                     T2:
    __blk_mq_get_tag                        .
    __sbitmap_queue_get                     .
    sbitmap_get                             .
    sbitmap_find_bit                        .
    sbitmap_find_bit_in_word                .
    __sbitmap_get_word -> nr=-1             __blk_mq_get_tag
    sbitmap_deferred_clear                  __sbitmap_queue_get
    /* map->cleared=0xFFFFFFFFFFFFFFFF */   sbitmap_find_bit
    if (!READ_ONCE(map->cleared))           sbitmap_find_bit_in_word
        return false;                       __sbitmap_get_word -> nr=-1
    mask = xchg(&map->cleared, 0)           sbitmap_deferred_clear
    atomic_long_andnot()                    /* map->cleared=0 */
                                            if (!(map->cleared))
                                                return false;
                                            /*
                                             * map->cleared is cleared by T1
                                             * T2 fail to acquire the tag
                                             */
4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
up due to the wake_batch being set at 6. If no more requests come, T2
will wait here indefinitely.
This patch achieves two purposes:
1. The check of ->cleared and the updates of both ->cleared and ->word
need to be done atomically; using a spinlock is the simplest solution.
2. Add an extra check in sbitmap_deferred_clear() to identify whether
->word has free bits.
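A sketch combining both points above (the upstream fix may differ in
detail; swap_lock is the re-added per-word spinlock):

    static bool sbitmap_deferred_clear(struct sbitmap_word *map)
    {
            unsigned long mask;
            bool ret = false;

            spin_lock(&map->swap_lock);     /* 1: atomic check + update */
            if (!map->cleared) {
                    /* 2: still succeed if the word has free bits */
                    ret = !bitmap_full(&map->word, BITS_PER_LONG);
                    goto out;
            }
            mask = xchg(&map->cleared, 0);
            atomic_long_andnot(mask, (atomic_long_t *)&map->word);
            ret = true;
    out:
            spin_unlock(&map->swap_lock);
            return ret;
    }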
Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240716082644.659566-1-yang.yang@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
pgalloc_tag_sub() might call page_ext_put() using a page different from
the one used in page_ext_get() call. This does not pose an issue since
page_ext_put() ignores this parameter as long as it's non-NULL but
technically this is wrong. Fix it by storing the original page used in
page_ext_get() and passing it to page_ext_put().
Link: https://lkml.kernel.org/r/20240711220457.1751071-3-surenb@google.com
Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
dasd_add_busid() can return an error via ERR_PTR() if an allocation
fails. However, two callsites in dasd_copy_pair_store() do not check
the result, potentially resulting in a NULL pointer dereference. Fix
this by checking the result with IS_ERR() and returning the error up
the stack.
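A sketch of the check at the two callsites (variable names hypothetical):

    devmap = dasd_add_busid(bus_id, features);
    if (IS_ERR(devmap)) {
            rc = PTR_ERR(devmap);   /* propagate the allocation error */
            goto out;
    }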
The commit 1bbe254e4336 ("md-cluster: check for timeout while a
new disk adding") is correct in terms of code syntax, but it does not
suit real clustered code logic.
When a timeout occurs while adding a new disk, if recv_daemon()
bypasses the unlock for ack_lockres:CR, another node will be waiting
to grab EX lock. This will cause the cluster to hang indefinitely.
How to fix:
1. In dlm_lock_sync(), change the wait behaviour from forever to a
timeout. This avoids the hang when another node
fails to handle a cluster msg. Another result of this change is
that if another node receives an unknown msg (e.g. a new msg_type),
the old code will hang, whereas the new code will timeout and fail.
This could help cluster_md handle new msg_type from different
nodes with different kernel/module versions (e.g. The user only
updates one leg's kernel and monitors the stability of the new
kernel).
2. The old code for __sendmsg() always returns 0 (success) under the
design (must successfully unlock ->message_lockres). This commit
makes this function return an error number when an error occurs.
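A sketch of change 1, assuming a completion-based wait inside
dlm_lock_sync() (the timeout value and error code are illustrative):

    /* Wait with a timeout instead of forever, so a node that never
     * answers a cluster msg cannot hang the whole cluster.
     */
    if (!wait_for_completion_timeout(&res->completion,
                                     msecs_to_jiffies(5000)))
            return -EBUSY;  /* illustrative; callers must handle it */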
Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Acked-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Skip the run_truncate_head() call also for $MFT::$ATTR_BITMAP; otherwise
wnd_map()/run_lookup_entry() will not find the disk position of the
bitmap parts.
Fixes: 0e5b044cbf3a ("fs/ntfs3: Refactoring attr_set_size to restore after errors")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The 'nocase' option was mistakenly added as fsparam_flag_no
with the 'no' prefix, causing the case-insensitive mode to require
the 'nonocase' option to be enabled.
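For illustration, the difference in the parameter table (the option
constant name is hypothetical):

    /* Broken: the "no" prefix handling inverts the meaning, so
     * enabling case-insensitive mode ends up requiring "nonocase".
     */
    fsparam_flag_no("nocase", Opt_nocase),

    /* Fixed: "nocase" is a plain flag. */
    fsparam_flag("nocase", Opt_nocase),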
`rtc_add_offset()` is called by `__rtc_read_time()` and
`__rtc_read_alarm()` to add the RTC's offset to the raw read-outs from
the device drivers. However, in the latter case, a fix-up algorithm is
run if the RTC device does not report a full `struct rtc_time` alarm
value. In that case, the offset was never added.
According to the C standard 3.4.3p3, the result of signed integer overflow
is undefined. The macro nilfs_cnt32_ge(), which compares two sequence
numbers, uses signed integer subtraction that can overflow, and therefore
the result of the calculation may differ from what is expected due to
undefined behavior in different environments.
Similar to an earlier change to the jiffies-related comparison macros in
commit 5a581b367b5d ("jiffies: Avoid undefined behavior from signed
overflow"), avoid this potential issue by changing the definition of the
macro to perform the subtraction as unsigned integers, then cast the
result to a signed integer for comparison.
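A sketch of the change, mirroring the jiffies macros (the "before" form
is paraphrased; a and b are __u32 sequence numbers):

    /* Before: signed subtraction, which may overflow (UB). */
    /* #define nilfs_cnt32_ge(a, b)  ((__s32)(a) - (__s32)(b) >= 0) */

    /* After: subtract as unsigned, then cast the result for the
     * sign test, as the jiffies comparison macros do.
     */
    #define nilfs_cnt32_ge(a, b)  ((__s32)((a) - (b)) >= 0)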
Patch series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions".
This patch series fixes a minor issue in a program for DAMON selftests,
and implements new functionality selftests for DAMOS tried regions and
{min,max}_nr_regions. The test for max_nr_regions also tests the recovery
from online tuning-caused limit violation, which was fixed by a previous
patch [1] titled "mm/damon/core: merge regions aggressively when
max_nr_regions is unmet".
The first patch fixes a minor problem in the artificial memory access
pattern generator for tests. The following three patches (2-4) implement
the schemes
tried regions test. Then a couple of patches (5-6) implementing static
setup based {min,max}_nr_regions functionality test follows. Final two
patches (7-8) implement dynamic max_nr_regions update test.
'access_memory' is an artificial memory access pattern generator for
DAMON tests. It creates and accesses memory regions whose number and
size the user specifies via the command line. However, the real access
part of the program ignores the user-specified size of each region.
Instead, it uses a hard-coded value, 10 MiB. Fix it to use the
user-defined size.
Note that all existing 'access_memory' users are setting the region size
as 10 MiB. Hence no real problem has happened so far.
We added PM_MMAP_EXCLUSIVE in 2015 via commit 77bb499bb60f ("pagemap: add
mmap-exclusive bit for marking pages mapped only here"), when THPs could
not be partially mapped and page_mapcount() returned something that was
true for all pages of the THP.
In 2016, we added support for partially mapping THPs via commit
53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of
THPs"), but missed determining PM_MMAP_EXCLUSIVE per page as well.
Checking page_mapcount() on the head page does not tell the whole story.
We should check each individual page. In a future without per-page
mapcounts it will be different, but we'll change that to be consistent
with PTE-mapped THPs once we deal with that.
Link: https://lkml.kernel.org/r/20240607122357.115423-4-david@redhat.com
Fixes: 53f9263baba6 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Relying on the mapcount for non-present PTEs that reference pages doesn't
make any sense: they are not accounted in the mapcount, so page_mapcount()
== 1 won't return the result we actually want to know.
While we don't check the mapcount for migration entries already, we could
end up checking it for swap, hwpoison, device exclusive, ... entries,
which we really shouldn't.
There is one exception: device private entries, which we consider
fake-present (e.g., incremented the mapcount). But we won't care about
that for now for PM_MMAP_EXCLUSIVE, because indicating PM_SWAP for them
although they are fake-present already sounds suspiciously wrong.
Let's never indicate PM_MMAP_EXCLUSIVE without PM_PRESENT.
Link: https://lkml.kernel.org/r/20240607122357.115423-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Patch series "fs/proc: move page_mapcount() to fs/proc/internal.h".
With all other page_mapcount() users in the tree gone, move
page_mapcount() to fs/proc/internal.h, rename it and extend the
documentation to prevent future (ab)use.
... of course, I find some issues while working on that code that I sort
first ;)
We'll now only end up calling page_mapcount() [now
folio_precise_page_mapcount()] on pages mapped via present page table
entries. Except for /proc/kpagecount, that still does questionable
things, but we'll leave that legacy interface as is for now.
Did a quick sanity check. Likely we would want some better selftests for
/proc/$/pagemap + smaps. I'll see if I can find some time to write some
more.
This patch (of 6):
Looks like we never taught pagemap_pmd_range() about the existence of
PMD-mapped file THPs. Seems to date back to the times when we first added
support for non-anon THPs in the form of shmem THP.
Link: https://lkml.kernel.org/r/20240607122357.115423-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240607122357.115423-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The Pin Multiplex attachment in Rev.1.10 of the R-Car V4H Series
Hardware User's Manual still has two alternate pins named both TCLK3
and TCLK4. To differentiate, the pin control driver uses "TCLK[34]" and
"TCLK[34]_X". In addition, there are alternate pins without suffix, and
with an "_A" or "_B" suffix.
Increase uniformity by adopting R-Car V4M naming:
- Rename "TCLK2_B" to "TCLK2_C",
- Rename "TCLK[12]_A" to "TCLK[12]_B",
- Rename "TCLK[12]" to "TCLK[12]_A",
- Rename "TCLK[34]_A" to "TCLK[34]_C",
- Rename "TCLK[34]_X" to "TCLK[34]_A",
- Rename "TCLK[34]" to "TCLK[34]_B".
The suffixes of the IRQ identifiers for external interrupts 0-3
are inconsistent:
- "IRQ0" and "IRQ0_A",
- "IRQ1" and "IRQ1_A",
- "IRQ2" and "IRQ2_A",
- "IRQ3" and "IRQ3_B".
The suffixes for external interrupts 4 and 5 do follow conventional
naming:
- "IRQ4A" and IRQ4_B",
- "IRQ5".
Fix this by adopting R-Car V4M naming:
- Rename "IRQ[0-2]_A" to "IRQ[0-2]_B",
- Rename "IRQ[0-3]" to "IRQ[0-3]_A".
(H)SCIF instance 3 has two alternate pin groups: "hscif3" and
"hscif3_a", resp. "scif3" and "scif3_a", but the actual meanings of the
pins within the groups do not match.
Increase uniformity by adopting R-Car V4M naming:
- Rename "hscif3_a" to "hscif3_b",
- Rename "hscif3" to "hscif3_a",
- Rename "scif3" to "scif3_b".
The Pin Multiplex attachment in Rev.1.10 of the R-Car V4H Series
Hardware User's Manual still has two alternate pin groups (GP0_14-18
and GP1_6-10) each named both HSCIF1 and SCIF1. To differentiate, the
pin control driver uses "(h)scif1" and "(h)scif1_x", which were
considered temporary names until the conflict was sorted out.
Fix this by adopting R-Car V4M naming:
- Rename "(h)scif1" to "(h)scif1_a",
- Rename "(h)scif1_x" to "(h)scif1_b".
Adopt the R-Car V4M naming "(h)scif1_a" and "(h)scif1_b" to increase
uniformity.
The Pin Multiplex attachment in Rev.1.10 of the R-Car V4H Series
Hardware User's Manual still has two alternate pins named both
"FXR_TXEN[AB]". To differentiate, the pin control driver uses
"FXR_TXEN[AB]" and "FXR_TXEN[AB]_X", which were considered temporary
names until the conflict was sorted out.
Fix this by adopting R-Car V4M naming:
- Rename "FXR_TXEN[AB]" to "FXR_TXEN[AB]_A",
- Rename "FXR_TXEN[AB]_X" to "FXR_TXEN[AB]_B".
Fixes: d27e202b9ac4 ("fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This driver calls pinctrl_register_and_init(), which is not
devm_-managed; this leads to a memory leak if pinctrl_enable()
fails. Replace it with devm_pinctrl_register_and_init(),
and add the missing of_node_put() in the error path.
Fixes: 5038a66dad01 ("pinctrl: core: delete incorrect free in pinctrl_enable()") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240606023704.3931561-4-yangyingliang@huawei.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This driver calls pinctrl_register_and_init(), which is not
devm_-managed; this leads to a memory leak if pinctrl_enable()
fails. Replace it with devm_pinctrl_register_and_init(),
and call pcs_free_resources() if pinctrl_enable() fails.
Fixes: 5038a66dad01 ("pinctrl: core: delete incorrect free in pinctrl_enable()") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240606023704.3931561-3-yangyingliang@huawei.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In devm_pinctrl_register(), if pinctrl_enable() fails in
pinctrl_register(), the "pctldev" has not been added to the dev
resources, so devm_pinctrl_dev_release() cannot be called, which leads
to a memory leak.
Introduce pinctrl_uninit_controller() and call it in the error path to
free the memory.
Fixes: 5038a66dad01 ("pinctrl: core: delete incorrect free in pinctrl_enable()") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240606023704.3931561-2-yangyingliang@huawei.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
After switching from pages to folios [1], it became evident that
the initialization of .dirty_folio for the page cache operations was
missed for compressed files.
Broadcom switches supported by the b53 driver use a chip-wide jumbo frame
configuration. In the commit referenced with the Fixes tag, the setting
is applied just for the last port changing its MTU.
While configuring CPU ports accounts for the tagger overhead, user ports
do not. When setting the MTU for a user port, the chip-wide setting is
reduced to not include the tagger overhead, resulting in a potentially
insufficient chip-wide maximum frame size for the CPU port.
As, by design, the CPU port MTU is adjusted for any user port change,
apply the chip-wide setting only for CPU ports. This aligns the driver
to the behavior of other switch drivers.
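A simplified sketch of the approach in the MTU callback (helper names
from the b53 driver; the jumbo-enable condition is an assumption):

    static int b53_change_mtu(struct dsa_switch *ds, int port, int mtu)
    {
            struct b53_device *dev = ds->priv;

            /* Only CPU ports drive the chip-wide jumbo setting; it is
             * always adjusted when any user port's MTU changes.
             */
            if (!dsa_is_cpu_port(ds, port))
                    return 0;

            return b53_set_jumbo(dev, mtu > ETH_DATA_LEN, true);
    }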
Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support")
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Marvell chips not supporting per-port jumbo frame size configurations use
a chip-wide frame size configuration. In the commit referenced with the
Fixes tag, the setting is applied just for the last port changing its MTU.
While configuring CPU ports accounts for the tagger overhead, user ports
do not. When setting the MTU for a user port, the chip-wide setting is
reduced to not include the tagger overhead, resulting in a potentially
insufficient maximum frame size for the CPU port. Specifically, sending
full-size frames from the CPU port on a MV88E6097 having a user port MTU
of 1500 bytes results in dropped frames.
As, by design, the CPU port MTU is adjusted for any user port change,
apply the chip-wide setting only for CPU ports.
Fixes: 1baf0fac10fb ("net: dsa: mv88e6xxx: Use chip-wide max frame size for MTU")
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The TOS value that is returned to user space in the route get reply is
the one with which the lookup was performed ('fl4->flowi4_tos'). This is
fine when the matched route is configured with a TOS as it would not
match if its TOS value did not match the one with which the lookup was
performed.
However, matching on TOS is only performed when the route's TOS is not
zero. It is therefore possible to have the kernel incorrectly return a
non-zero TOS:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get fibmatch 192.0.2.2 tos 0xfc
192.0.2.0/24 tos 0x1c dev dummy1 proto kernel scope link src 192.0.2.1
Fix by instead returning the DSCP field from the FIB result structure
which was populated during the route lookup.
Output after the patch:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get fibmatch 192.0.2.2 tos 0xfc
192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1
Extend the existing selftests to not only verify that the correct route
is returned, but that it is also returned with correct "tos" value (or
without it).
Fixes: b61798130f1b ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The TOS value that is returned to user space in the route get reply is
the one with which the lookup was performed ('fl4->flowi4_tos'). This is
fine when the matched route is configured with a TOS as it would not
match if its TOS value did not match the one with which the lookup was
performed.
However, matching on TOS is only performed when the route's TOS is not
zero. It is therefore possible to have the kernel incorrectly return a
non-zero TOS:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get 192.0.2.2 tos 0xfc
192.0.2.2 tos 0x1c dev dummy1 src 192.0.2.1 uid 0
cache
Fix by adding a DSCP field to the FIB result structure (inside an
existing 4 bytes hole), populating it in the route lookup and using it
when filling the route get reply.
Output after the patch:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip route get 192.0.2.2 tos 0xfc
192.0.2.2 dev dummy1 src 192.0.2.1 uid 0
cache
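A sketch of the struct change and its use (field placement per the
commit text; the helper names come from include/net/inet_dscp.h):

    /* In struct fib_result (placed in an existing 4-byte hole): */
    dscp_t dscp;

    /* Populate it during the route lookup: */
    res->dscp = inet_dsfield_to_dscp(fl4->flowi4_tos);

    /* Use it when filling the reply, instead of the lookup key: */
    r->rtm_tos = inet_dscp_to_dsfield(res->dscp);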
Fixes: 1a00fee4ffb2 ("ipv4: Remove rt_key_{src,dst,tos} from struct rtable.")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The following splat is easy to reproduce upstream as well as in -stable
kernels. Florian Westphal provided commit
d1dab4f71d37 ("net: add and use __skb_get_hash_symmetric_net"),
but this complementary fix, also suggested by Willem de Bruijn,
can be easily backported to -stable kernels: it consists of
using DEBUG_NET_WARN_ON_ONCE instead to silence the splat,
given that __skb_get_hash() is used by the nftables tracing
infrastructure to identify packets in traces.
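A sketch of the one-line change (the exact warning condition is an
assumption):

    /* Before: WARN_ON_ONCE(!net);
     * After: only warn on debug builds, since nftables tracing
     * legitimately reaches this path.
     */
    DEBUG_NET_WARN_ON_ONCE(!net);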
Fixes: 9b52e3f267a6 ("flow_dissector: handle no-skb use case")
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240715141442.43775-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
In gve_clean_xdp_done, the driver processes the TX completions based on
a 32-bit NIC counter and a 32-bit completion counter stored in the tx
queue.
Fix the for loop so that the counter wraparound is handled correctly.
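A sketch of a wraparound-safe loop, computing the distance in 32-bit
space instead of comparing the counters directly (names follow the
commit text):

    /* to_do is correct even when nic_done has wrapped past tx->done,
     * because the unsigned subtraction is taken modulo 2^32.
     */
    u32 to_do = nic_done - tx->done;

    for (u32 i = 0; i < to_do; i++) {
            /* ... process one completion ... */
            tx->done++;
    }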
Fixes: 75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format")
Signed-off-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240716171041.1561142-1-pkaligineedi@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The initial buffer has to be initialized to all-ones, but this must be
restricted to the size of the first field, not the total field size.
After each round in the map search step, the result and the fill map
are swapped, so if we have a set where f->bsize of the first element
is smaller than m->bsize_max, those one-bits leak into future
rounds' result map.
This makes pipapo find incorrect matching results for sets where the
first field size is not the largest.
A follow-up patch adds a test case to the nft_concat_range.sh selftest
script.
Thanks to Stefano Brivio for pointing out that we need to zero out
the remainder explicitly; only correcting the memset() argument isn't
enough.
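A sketch of the initialization, assuming pipapo's result-map layout
(bsize counted in longs):

    /* All-ones only for the first field's bucket size... */
    memset(res_map, 0xff, f->bsize * sizeof(*res_map));

    /* ...and explicitly zero the remainder up to the largest field. */
    if (f->bsize < m->bsize_max)
            memset(res_map + f->bsize, 0,
                   (m->bsize_max - f->bsize) * sizeof(*res_map));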
Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Yi Chen <yiche@redhat.com>
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The delete expectation path is missing a call to the nf_expect_get_id()
helper function to calculate the expectation ID; otherwise the LSB of
the expectation object address is leaked to userspace.
Fixes: 3c79107631db ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id")
Reported-by: zdi-disclosures@trendmicro.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The fallback -march value for SB1 should be mips64 instead of mips64r1.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407111851.LwDasTcp-lkp@intel.com/
Fixes: bfc0a330c1b4 ("MIPS: Fallback CPU -march flag to ISA level if unsupported")
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Add a mana_get_primary_netdev_rcu() helper to get the primary
netdevice for a given port. When mana is used with
netvsc, the VF netdev is controlled by an upper netvsc
device. In the bare-metal case, the VF netdev itself is the
primary device.
Use the mana_get_primary_netdev_rcu() helper in the mana_ib
to get the correct device for querying network states.
Fixes: 8b184e4f1c32 ("RDMA/mana_ib: Enable RoCE on port 1")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1720705077-322-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
If a netdev has already been assigned, ib_device_set_netdev needs to
release the reference on the older netdev but it is mistakenly being
called for the new netdev. Fix it and in the process use netdev_put
to be symmetrical with the netdev_hold.
Fixes: 09f530f0c6d6 ("RDMA: Add netdevice_tracker to ib_device_set_netdev()")
Signed-off-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240710203310.19317-1-dsahern@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
We could leak stack memory through the payload field when running
AES with a key from one of the hardware's key slots. Fix this by
ensuring the payload field is set to 0 in such cases.
This does not affect the common use case when the key is supplied
from main memory via the descriptor payload.
Signed-off-by: David Gstir <david@sigma-star.at>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202405270146.Y9tPoil8-lkp@intel.com/
Fixes: 3d16af0b4cfa ("crypto: mxs-dcp: Add support for hardware-bound keys")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Intel IOMMU operates on inclusive bounds (both generally as well as in
iommu_domain_identity_map()). Meanwhile, for_each_mem_pfn_range() uses
exclusive bounds for end_pfn. This creates an off-by-one error when
switching between the two.
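For illustration, the conversion at the call site (helper names from
the intel-iommu driver; treat the exact context as an assumption):

    for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
            /* end_pfn is exclusive; identity_map expects inclusive
             * bounds, hence the - 1.
             */
            ret = iommu_domain_identity_map(si_domain,
                            mm_to_dma_pfn_start(start_pfn),
                            mm_to_dma_pfn_end(end_pfn - 1));
    }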
Fixes: c5395d5c4a82 ("intel-iommu: Clean up iommu_domain_identity_map()")
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Suggested-by: Gary Zibrat <gzibrat@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240709234913.2749386-1-pandoh@google.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
When a large number of tasks are issued, the speed at which the HW
processes mbx slows down. The standard for judging mbx timeout in the
current firmware is 30ms, and the current timeout standard for the
driver is also 30ms.
Considering that firmware scheduling in multi-function scenarios takes a
certain amount of time, this will cause the driver to time out too early
and report a failure before mbx execution times out.
This patch introduces a new mechanism that can set different timeouts for
different cmds and extends the timeout of mbx to 35ms.
Fixes: a04ff739f2a9 ("RDMA/hns: Add command queue support for hip08 RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240710133705.896445-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
VFs and their PF share the memory of the extended DB. Currently,
the number of extended DBs allocated by the driver is only enough for
the PF. This leads to a probability of DB loss and some other problems
in scenarios where both the PF and VFs use a large number of QPs.
CEQEs are currently handled in the interrupt handler. This may cause
the CPU core to stay in interrupt context too long and lead to soft
lockups under heavy load.
Handle CEQEs in BH workqueue and set an upper limit for the number of
CEQE handled by a single call of work handler.
8 bytes is the only supported length for atomic operations. Add this
check in set_rc_wqe(). Besides, stop processing WQEs and return from
set_rc_wqe() if there is any error.
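A sketch of the validation in set_rc_wqe() (the constant name is
hypothetical):

    /* HW only supports 8-byte atomics; reject anything else up front. */
    if (msg_len != ATOMIC_WR_LEN) {         /* ATOMIC_WR_LEN == 8 */
            ret = -EINVAL;
            goto out;       /* stop processing this WQE */
    }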
The of_device_unregister() call in therm_windtunnel's module_exit
procedure does not fully reverse the effects of
of_platform_device_create() in the module_init procedure. Once you
unload this module, it is impossible to load it ever again, since only
the first of_platform_device_create() call on the fan node succeeds.
This driver predates the first git commit, and it turns out that back
then of_platform_device_create() worked differently than it does today.
So this is actually an old regression.
The appropriate function to undo of_platform_device_create now appears
to be of_platform_device_destroy, and switching to use this makes it
possible to unload and load the module as expected.
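A sketch of the module_exit pairing (names approximate the driver; the
probe-side call is kept for context):

    static int __init therm_of_init(void)
    {
            /* ... */
            of_dev = of_platform_device_create(np, "temperature", NULL);
            /* ... */
    }

    static void __exit therm_of_exit(void)
    {
            i2c_del_driver(&g4fan_driver);
            /* Fully undo of_platform_device_create(), so the module
             * can be loaded again later.
             */
            of_platform_device_destroy(&of_dev->dev, NULL);
    }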
Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Fixes: c6e126de43e7 ("of: Keep track of populated platform devices")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240711035428.16696-1-nbowler@draconx.ca
Signed-off-by: Sasha Levin <sashal@kernel.org>
In the xmon disassembly code there are several CPU feature checks to
determine what dialects should be passed to the disassembler. The
dialect controls which instructions the disassembler will recognise.
Unfortunately the checks are incorrect, because instead of passing a
single CPU feature they are passing a mask of feature bits.
For example the code:
if (cpu_has_feature(CPU_FTRS_POWER5))
dialect |= PPC_OPCODE_POWER5;
Is trying to check if the system is running on a Power5 CPU. But
CPU_FTRS_POWER5 is a mask of *all* the feature bits that are enabled on
a Power5.
In practice the test will always return true for any 64-bit CPU, because
at least one bit in the mask will be present in the CPU_FTRS_ALWAYS
mask.
Similarly for all the other checks against CPU_FTRS_xx masks.
Rather than trying to match the disassembly behaviour exactly to the
current CPU, just differentiate between 32-bit and 64-bit, and Altivec,
VSX and HTM.
That will cause some instructions to be shown in disassembly even
on a CPU that doesn't support them, but that's OK, objdump -d output
has the same behaviour, and if anything it's less confusing than some
instructions not being disassembled.
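A sketch of the simplified dialect selection (CPU feature and opcode
flag names from the kernel's disassembler code; the exact set chosen is
per the text above):

    unsigned long dialect = PPC_OPCODE_PPC;

    if (IS_ENABLED(CONFIG_PPC64))
            dialect |= PPC_OPCODE_64;
    if (cpu_has_feature(CPU_FTR_ALTIVEC))
            dialect |= PPC_OPCODE_ALTIVEC;
    if (cpu_has_feature(CPU_FTR_VSX))
            dialect |= PPC_OPCODE_VSX;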
Fixes: 897f112bb42e ("[POWERPC] Import updated version of ppc disassembly code for xmon")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240509121248.270878-2-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
The helper calculate_psi_aligned_address() is used to convert an arbitrary
range into a size-aligned one.
The aligned_pages variable is calculated from input start and end, but is
not adjusted when the start pfn is not aligned and the mask is adjusted,
which results in an incorrect number of pages returned.
The number of pages is used by qi_flush_piotlb() to flush caches for the
first-stage translation. With the wrong number of pages, the cache is not
synchronized, leading to inconsistencies in some cases.
Fixes: c4d27ffaa8eb ("iommu/vt-d: Add cache tag invalidation helpers")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240709152643.28109-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Address mask specifies the number of low order bits of the address field
that must be masked for the invalidation operation.
Since address bits masked start from bit 12, the max address mask should
be MAX_AGAW_PFN_WIDTH, as defined in Table 19 ("Invalidate Descriptor
Address Mask Encodings") of the spec.
Limit the max address mask returned from calculate_psi_aligned_address()
to MAX_AGAW_PFN_WIDTH to prevent potential integer overflow in the
following code:
When a PERST# assert and deassert happens on platforms supporting
PERST#, both iATU0 and iATU6 will map the inbound window to BAR0. DMA
will access the area that was previously allocated (iATU0) for BAR0,
instead of the new area (iATU6) for BAR0.
Right now, this isn't an issue because both iATU0 and iATU6 should
translate inbound accesses to BAR0 to the same allocated memory area.
However, having two separate inbound mappings for the same BAR is a
disaster waiting to happen.
The mappings between PCI BAR and iATU inbound window are maintained in the
dw_pcie_ep::bar_to_atu[] array. While allocating a new inbound iATU map for
a BAR, dw_pcie_ep_inbound_atu() API checks for the availability of the
existing mapping in the array and if it is not found (i.e., value in the
array indexed by the BAR is found to be 0), it allocates a new map value
using find_first_zero_bit().
The issue is the existing logic failed to consider the fact that the map
value '0' is a valid value for BAR0, so find_first_zero_bit() will return
'0' as the map value for BAR0 (note that it returns the first zero bit
position).
Due to this, when PERST# assert + deassert happens on the PERST# supported
platforms, the inbound window allocation restarts from BAR0 and the
existing logic to find the BAR mapping will return '6' for BAR0 instead of
'0' due to the fact that it considers '0' as an invalid map value.
Fix this issue by always incrementing the map value before assigning it
to the bar_to_atu[] array, and decrementing it while fetching. This
makes sure that the map value '0' always represents an invalid mapping.
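A sketch of the off-by-one encoding (array and helper names from the
commit text):

    /* Store the window index biased by one: 0 means "no mapping". */
    ep->bar_to_atu[bar] = free_win + 1;

    /* When looking up, decrement to recover the real window index. */
    if (ep->bar_to_atu[bar])
            free_win = ep->bar_to_atu[bar] - 1;
    else
            free_win = find_first_zero_bit(ep->ib_window_map,
                                           pci->num_ib_windows);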
All EP specific resources are enabled during PERST# deassert. As a counter
operation, all resources should be disabled during PERST# assert. There is
no point in skipping that if the link was not enabled.
This will also result in enablement of the resources twice if PERST# got
deasserted again. So remove the check from qcom_pcie_perst_assert() and
disable all the resources unconditionally.
Introduce div_offset field in en_clk_desc struct in order to fix rate
divider estimation in en7523_get_div routine for slic and spi fixed
rate clocks.
Moreover, fix base_shift for crypto clock.
Fixes: 1e6273179190 ("clk: en7523: Add clock driver for Airoha EN7523 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/c491bdea05d847f1f1294b94f14725d292eb95d0.1718615934.git.lorenzo@kernel.org
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The first problem is that they incorrectly report the parent after
commit 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for
parked RCGs"). That's because the cached CFG register value needs to be
populated when the clk is registered. clk_rcg2_shared_enable() writes
the cached CFG register value 'parked_cfg'. This value is initially zero
due to static initializers. If a driver calls clk_enable() before
setting a rate or parent, it will set the parent to '0' which is
(almost?) always XO, and may not reflect the parent at registration. In
the worst case, this switches the RCG from sourcing a fast PLL to the
slow crystal speed.
The second problem is that the force enable bit isn't cleared. The force
enable bit is only used during parking and unparking of shared RCGs.
Otherwise it shouldn't be set because it keeps the RCG enabled even when
all the branches on the output of the RCG are disabled (the hardware has
a feedback mechanism so that any child branches keep the RCG enabled
when the branch enable bit is set). This problem wastes power if the clk
is unused, and is harmful in the case that the clk framework disables
the parent of the force enabled RCG. In the latter case, the GDSC the
shared RCG is associated with will get wedged if the RCG's source clk is
disabled and the GDSC tries to enable the RCG to do "housekeeping" while
powering on.
Both of these problems combined with incorrect runtime PM usage in the
display driver lead to a black screen on Qualcomm sc7180 Trogdor
chromebooks. What happens is that the bootloader leaves the
'disp_cc_mdss_rot_clk' enabled and the 'disp_cc_mdss_rot_clk_src' force
enabled and parented to 'disp_cc_pll0'. The mdss driver probes and
runtime suspends, disabling the mdss_gdsc which uses the
'disp_cc_mdss_rot_clk_src' for "housekeeping". The
'disp_cc_mdss_rot_clk' is disabled during late init because the clk is
unused, but the parent 'disp_cc_mdss_rot_clk_src' is still force enabled
because the force enable bit was never cleared. Then 'disp_cc_pll0' is
disabled because it is also unused. That's because the clk framework
believes the parent of the RCG is XO when it isn't. A child device of
the mdss device (e.g. DSI) runtime resumes mdss which powers on the
mdss_gdsc. This wedges the GDSC because 'disp_cc_mdss_rot_clk_src' is
parented to 'disp_cc_pll0' and that PLL is off. With the GDSC wedged,
mdss_runtime_resume() tries to enable 'disp_cc_mdss_mdp_clk' but it
can't because the GDSC has wedged all the clks associated with the GDSC
causing clks to stay stuck off.
This leads to the following warning seen at boot and a black screen
because the display driver fails to probe.
Fix these problems by parking shared RCGs at boot. This will properly
initialize the parked_cfg struct member so that the parent is reported
properly and ensure that the clk won't get stuck on or off because the
RCG is parented to the safe source (XO).
Fixes: 703db1f5da1e ("clk: qcom: rcg2: Cache CFG register updates for parked RCGs")
Reported-by: Stephen Boyd <sboyd@kernel.org>
Closes: https://lore.kernel.org/r/1290a5a0f7f584fcce722eeb2a1fd898.sboyd@kernel.org
Closes: https://issuetracker.google.com/319956935
Reported-by: Laura Nao <laura.nao@collabora.com>
Closes: https://lore.kernel.org/r/20231218091806.7155-1-laura.nao@collabora.com
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20240502224703.103150-1-swboyd@chromium.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Allow the USB3 second and third GCC PHY pipe clocks to propagate the
rate to the pipe clocks provided by the QMP combo PHYs. The first
instance is already doing that.
On big endian architectures, it is possible to run into a memory out of
bounds pointer dereference when FCP targets are zoned.
In lpfc_prep_embed_io, the memcpy(ptr, fcp_cmnd, sgl->sge_len) is
referencing a little endian formatted sgl->sge_len value. So, the memcpy
can cause big endian systems to crash.
Redefine the *sgl ptr as a struct sli4_sge_le to make it clear that we are
referring to a little endian formatted data structure. And, update the
routine with proper le32_to_cpu macro usages.
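A sketch of the endian-clean access (struct name from the commit text;
surrounding context abbreviated):

    struct sli4_sge_le *sgl = (struct sli4_sge_le *)bpl;
    u32 sge_len = le32_to_cpu(sgl->sge_len);

    /* sge_len is now host-endian regardless of CPU byte order. */
    memcpy(ptr, fcp_cmnd, sge_len);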
Fixes: af20bb73ac25 ("scsi: lpfc: Add support for 32 byte CDBs")
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240628172011.25921-8-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>