mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
Let's remove yet another follow_page() user. Note that we have to do the
split without holding the PTL, after folio_walk_end(). We don't care
about losing the secretmem check in follow_page().
[david@redhat.com: teach can_split_folio() that we are not holding an additional reference] Link: https://lkml.kernel.org/r/c75d1c6c-8ea6-424f-853c-1ccda6c77ba2@redhat.com Link: https://lkml.kernel.org/r/20240802155524.517137-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walk
Let's use folio_walk instead, for example avoiding taking temporary folio
references if the folio does obviously not even apply and getting rid of
one more follow_page() user. We cannot move all handling under the PTL,
so leave the rmap handling (which implies an allocation) out.
Note that zeropages obviously don't apply: old code could just have
specified FOLL_DUMP. Further, we don't care about losing the secretmem
check in follow_page(): these are never anon pages and
vma_ksm_compatible() would never consider secretmem vmas (VM_SHARED |
VM_MAYSHARE must be set for secretmem, see secretmem_mmap()).
Link: https://lkml.kernel.org/r/20240802155524.517137-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk
Let's use folio_walk instead, for example avoiding taking temporary folio
references if the folio does not even apply and getting rid of one more
follow_page() user.
Note that zeropages obviously don't apply: old code could just have
specified FOLL_DUMP. Anon folios are never secretmem, so we don't care
about losing the check in follow_page().
Link: https://lkml.kernel.org/r/20240802155524.517137-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate: convert add_page_for_migration() from follow_page() to folio_walk
Let's use folio_walk instead, so we can avoid taking a folio reference
when we won't even be trying to migrate the folio and to get rid of
another follow_page()/FOLL_DUMP user. Use FW_ZEROPAGE so we can return
"-EFAULT" for it as documented.
We now perform the folio_likely_mapped_shared() check under PTL, which is
what we want: relying on the mapcount and friends after dropping the PTL
does not make too much sense, as the page can get unmapped concurrently
from this process.
Further, we perform the folio isolation under PTL, similar to how we
handle it for MADV_PAGEOUT.
The possible return values for follow_page() were confusing, especially
with FOLL_DUMP set. We'll handle it like documented in the man page:
* -EFAULT: This is a zero page or the memory area is not mapped by the
process.
* -ENOENT: The page is not present.
We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to
do, but it likely doesn't really matter (just like for weird devmap,
whereby we fake "not present").
The other errros are left as is, and match the documentation in the man
page.
While at it, rename add_page_for_migration() to add_folio_for_migration().
We'll lose the "secretmem" check, but that shouldn't really matter because
these folios cannot ever be migrated. Should vma_migratable() refuse
these VMAs? Maybe.
Link: https://lkml.kernel.org/r/20240802155524.517137-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk
Let's use folio_walk instead, so we can avoid taking a folio reference
just to read the nid and get rid of another follow_page()/FOLL_DUMP user.
Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented.
The possible return values for follow_page() were confusing, especially
with FOLL_DUMP set. We'll handle it like documented in the man page:
* -EFAULT: This is a zero page or the memory area is not mapped by the
process.
* -ENOENT: The page is not present.
We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to
do, but it likely doesn't really matter (just like for weird devmap,
whereby we fake "not present").
Note that the other errors (-EACCESS, -EBUSY, -EIO, -EINVAL, -ENOMEM) so
far only applied when actually moving pages, not when only querying stats.
We'll effectively drop the "secretmem" check we had in follow_page(), but
that shouldn't really matter here, we're not accessing folio/page content
after all.
Link: https://lkml.kernel.org/r/20240802155524.517137-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We want to get rid of follow_page(), and have a more reasonable way to
just lookup a folio mapped at a certain address, perform some checks while
still under PTL, and then only conditionally grab a folio reference if
really required.
Further, we might want to get rid of some walk_page_range*() users that
really only want to temporarily lookup a single folio at a single address.
So let's add a new page table walker that does exactly that, similarly to
GUP also being able to walk hugetlb VMAs.
Add folio_walk_end() as a macro for now: the compiler is not easy to
please with the pte_unmap()->kunmap_local().
Note that one difference between follow_page() and get_user_pages(1) is
that follow_page() will not trigger faults to get something mapped. So
folio_walk is at least currently not a replacement for get_user_pages(1),
but could likely be extended/reused to achieve something similar in the
future.
Link: https://lkml.kernel.org/r/20240802155524.517137-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES
Patch series "mm: replace follow_page() by folio_walk".
Looking into a way of moving the last folio_likely_mapped_shared() call in
add_folio_for_migration() under the PTL, I found myself removing
follow_page(). This paves the way for cleaning up all the FOLL_, follow_*
terminology to just be called "GUP" nowadays.
The new page table walker will lookup a mapped folio and return to the
caller with the PTL held, such that the folio cannot get unmapped
concurrently. Callers can then conditionally decide whether they really
want to take a short-term folio reference or whether the can simply unlock
the PTL and be done with it.
folio_walk is similar to page_vma_mapped_walk(), except that we don't know
the folio we want to walk to and that we are only walking to exactly one
PTE/PMD/PUD.
folio_walk provides access to the pte/pmd/pud (and the referenced folio
page because things like KSM need that), however, as part of this series
no page table modifications are performed by users.
We might be able to convert some other walk_page_range() users that really
only walk to one address, such as DAMON with
damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk
in the future to optionally fault in a folio (if applicable), such that we
can replace some get_user_pages() users that really only want to lookup a
single page/folio under PTL without unconditionally grabbing a folio
reference.
I have plans to extend the approach to a range walker that will try
batching various page table entries (not just folio pages) to be a better
replace for walk_page_range() -- and users will be able to opt in which
type of page table entries they want to process -- but that will require
more work and more thoughts.
KSM seems to work just fine (ksm_functional_tests selftests) and
move_pages seems to work (migration selftest). I tested the leaf
implementation excessively using various hugetlb sizes (64K, 2M, 32M, 1G)
on arm64 using move_pages and did some more testing on x86-64. Cross
compiled on a bunch of architectures.
This patch (of 11):
We want to make use of vm_normal_page_pmd() in generic page table walking
code where we might walk hugetlb folios that are mapped by PMDs even
without CONFIG_TRANSPARENT_HUGEPAGE.
So let's expose vm_normal_page_pmd() + vm_normal_folio_pmd() with
CONFIG_PGTABLE_HAS_HUGE_LEAVES.
Link: https://lkml.kernel.org/r/20240802155524.517137-1-david@redhat.com Link: https://lkml.kernel.org/r/20240802155524.517137-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Thu, 1 Aug 2024 23:50:05 +0000 (16:50 -0700)]
include/linux/mmzone.h: clean up watermark accessors
- we have a helper wmark_pages(). Teach min_wmark_pages(),
low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use
it instead of open-coding its implementation.
- there's no reason to implement all these things as macros. Redo them
in C.
Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kaiyang Zhao [Thu, 1 Aug 2024 18:04:56 +0000 (18:04 +0000)]
mm: consider CMA pages in watermark check for NUMA balancing target node
Currently in migrate_balanced_pgdat(), ALLOC_CMA flag is not passed when
checking watermark on the migration target node. This does not match the
gfp in alloc_misplaced_dst_folio() which allows allocation from CMA.
This causes promotion failures when there are a lot of available CMA
memory in the system.
Therefore, we change the alloc_flags passed to zone_watermark_ok() in
migrate_balanced_pgdat().
Link: https://lkml.kernel.org/r/20240801180456.25927-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: zswap: fix global shrinker error handling logic
This patch fixes the zswap global shrinker, which did not shrink the zpool
as expected.
The issue addressed is that shrink_worker() did not distinguish between
unexpected errors and expected errors, such as failed writeback from an
empty memcg. The shrinker would stop shrinking after iterating through
the memcg tree 16 times, even if there was only one empty memcg.
With this patch, the shrinker no longer considers encountering an empty
memcg, encountering a memcg with writeback disabled, or reaching the end
of a memcg tree walk as a failure, as long as there are memcgs that are
candidates for writeback. Systems with one or more empty memcgs will now
observe significantly higher zswap writeback activity after the zswap pool
limit is hit.
To avoid an infinite loop when there are no writeback candidates, this
patch tracks writeback attempts during memcg tree walks and limits reties
if no writeback candidates are found.
To handle the empty memcg case, the helper function shrink_memcg() is
modified to check if the memcg is empty and then return -ENOENT.
Link: https://lkml.kernel.org/r/20240731004918.33182-3-flintglass@gmail.com Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") Signed-off-by: Takero Funaki <flintglass@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: zswap: fixes for global shrinker", v5.
This series addresses issues in the zswap global shrinker that could not
shrink stored pages. With this series, the shrinker continues to shrink
pages until it reaches the accept threshold more reliably, gives much
higher writeback when the zswap pool limit is hit.
This patch (of 2):
This patch fixes an issue where the zswap global shrinker stopped
iterating through the memcg tree.
The problem was that shrink_worker() would restart iterating memcg tree
from the tree root, considering an offline memcg as a failure, and abort
shrinking after encountering the same offline memcg 16 times even if there
is only one offline memcg. After this change, an offline memcg in the
tree is no longer considered a failure. This allows the shrinker to
continue shrinking the other online memcgs regardless of whether an
offline memcg exists, gives higher zswap writeback activity.
To avoid holding refcount of offline memcg encountered during the memcg
tree walking, shrink_worker() must continue iterating to release the
offline memcg to ensure the next memcg stored in the cursor is online.
The offline memcg cleaner has also been changed to avoid the same issue.
When the next memcg of the offlined memcg is also offline, the refcount
stored in the iteration cursor was held until the next shrink_worker()
run. The cleaner must release the offline memcg recursively.
Zhaoyu Liu [Wed, 31 Jul 2024 13:31:01 +0000 (21:31 +0800)]
mm: swap: allocate folio only first time in __read_swap_cache_async()
It should be checked by filemap_get_folio() if SWAP_HAS_CACHE was
marked while reading a share swap page. It would re-allocate a folio
if the swap cache was not ready now. We save the new folio to avoid
page allocating again.
Link: https://lkml.kernel.org/r/20240731133101.GA2096752@bytedance Signed-off-by: Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: clarify folio_likely_mapped_shared() documentation for KSM folios
For KSM folios, the function actually does what it is supposed to do: even
having multiple mappings inside the same MM is considered "sharing", as
there is no real relationship between these KSM page mappings -- in
contrast to mapping the same file range twice and having the same
pagecache page mapped twice.
mm/rmap: cleanup partially-mapped handling in __folio_remove_rmap()
Let's simplify and reduce code indentation. In the RMAP_LEVEL_PTE case,
we already check for nr when computing partially_mapped.
For RMAP_LEVEL_PMD, it's a bit more confusing. Likely, we don't need the
"nr" check, but we could have "nr < nr_pmdmapped" also if we stumbled into
the "/* Raced ahead of another remove and an add? */" case. So let's
simply move the nr check in there.
Note that partially_mapped is always false for small folios.
No functional change intended.
Link: https://lkml.kernel.org/r/20240710214350.147864-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We removed hugetlb_follow_page_mask() in commit 9cb28da54643 ("mm/gup:
handle hugetlb in the generic follow_page_mask code") but forgot to
cleanup some leftovers.
While at it, simplify the hugetlb comment, it's overly detailed and rather
confusing. Stating that we may end up in there during coredumping is
sufficient to explain the PF_DUMPCORE usage.
Link: https://lkml.kernel.org/r/20240731142000.625044-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Fri, 26 Jul 2024 01:01:57 +0000 (01:01 +0000)]
mm/memory_hotplug: get rid of __ref
After commit 73db3abdca58 ("init/modpost: conditionally check section
mismatch to __meminit*"), we can get rid of __ref annotations.
Link: https://lkml.kernel.org/r/20240726010157.6177-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Tue, 30 Jul 2024 07:13:39 +0000 (19:13 +1200)]
mm: swap: add nr argument in swapcache_prepare and swapcache_clear to support large folios
Right now, swapcache_prepare() and swapcache_clear() supports one entry
only, to support large folios, we need to handle multiple swap entries.
To optimize stack usage, we iterate twice in __swap_duplicate(): the first
time to verify that all entries are valid, and the second time to apply
the modifications to the entries.
Currently, we're using nr=1 for the existing users.
[v-songbaohua@oppo.com: clarify swap_count_continued and improve readability for __swap_duplicate] Link: https://lkml.kernel.org/r/20240802071817.47081-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Mon, 29 Jul 2024 09:17:17 +0000 (14:47 +0530)]
mm: improve code consistency with zonelist_* helper functions
Replace direct access to zoneref->zone, zoneref->zone_idx, or
zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper
functions for consistency.
No functional change.
Link: https://lkml.kernel.org/r/20240729091717.464-1-shivankg@amd.com Co-developed-by: Shivank Garg <shivankg@amd.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 29 Jul 2024 11:50:41 +0000 (12:50 +0100)]
tools: add skeleton code for userland testing of VMA logic
Establish a new userland VMA unit testing implementation under
tools/testing which utilises existing logic providing maple tree support
in userland utilising the now-shared code previously exclusive to radix
tree testing.
This provides fundamental VMA operations whose API is defined in mm/vma.h,
while stubbing out superfluous functionality.
This exists as a proof-of-concept, with the test implementation functional
and sufficient to allow userland compilation of vma.c, but containing only
cursory tests to demonstrate basic functionality.
Link: https://lkml.kernel.org/r/533ffa2eec771cbe6b387dd049a7f128a53eb616.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: SeongJae Park <sj@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 29 Jul 2024 11:50:40 +0000 (12:50 +0100)]
tools: separate out shared radix-tree components
The core components contained within the radix-tree tests which provide
shims for kernel headers and access to the maple tree are useful for
testing other things, so separate them out and make the radix tree tests
dependent on the shared components.
This lays the groundwork for us to add VMA tests of the newly introduced
vma.c file.
Link: https://lkml.kernel.org/r/1ee720c265808168e0d75608e687607d77c36719.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 29 Jul 2024 11:50:38 +0000 (12:50 +0100)]
mm: move internal core VMA manipulation functions to own file
This patch introduces vma.c and moves internal core VMA manipulation
functions to this file from mmap.c.
This allows us to isolate VMA functionality in a single place such that we
can create userspace testing code that invokes this functionality in an
environment where we can implement simple unit tests of core
functionality.
This patch ensures that core VMA functionality is explicitly marked as
such by its presence in mm/vma.h.
It also places the header includes required by vma.c in vma_internal.h,
which is simply imported by vma.c. This makes the VMA functionality
testable, as userland testing code can simply stub out functionality as
required.
Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 29 Jul 2024 11:50:37 +0000 (12:50 +0100)]
mm: move vma_shrink(), vma_expand() to internal header
The vma_shrink() and vma_expand() functions are internal VMA manipulation
functions which we ought to abstract for use outside of memory management
code.
To achieve this, we replace shift_arg_pages() in fs/exec.c with an
invocation of a new relocate_vma_down() function implemented in mm/mmap.c,
which enables us to also move move_page_tables() and vma_iter_prev_range()
to internal.h.
The purpose of doing this is to isolate key VMA manipulation functions in
order that we can both abstract them and later render them easily
testable.
Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 29 Jul 2024 11:50:35 +0000 (12:50 +0100)]
userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c
Patch series "Make core VMA operations internal and testable", v4.
There are a number of "core" VMA manipulation functions implemented in
mm/mmap.c, notably those concerning VMA merging, splitting, modifying,
expanding and shrinking, which logically don't belong there.
More importantly this functionality represents an internal implementation
detail of memory management and should not be exposed outside of mm/
itself.
This patch series isolates core VMA manipulation functionality into its
own file, mm/vma.c, and provides an API to the rest of the mm code in
mm/vma.h.
Importantly, it also carefully implements mm/vma_internal.h, which
specifies which headers need to be imported by vma.c, leading to the very
useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h.
This means we can then re-implement vma_internal.h in userland, adding
shims for kernel mechanisms as required, allowing us to unit test internal
VMA functionality.
This testing is useful as opposed to an e.g. kunit implementation as this
way we can avoid all external kernel side-effects while testing, run tests
VERY quickly, and iterate on and debug problems quickly.
Excitingly this opens the door to, in the future, recreating precise
problems observed in production in userland and very quickly debugging
problems that might otherwise be very difficult to reproduce.
This patch series takes advantage of existing shim logic and full userland
maple tree support contained in tools/testing/radix-tree/ and
tools/include/linux/, separating out shared components of the radix tree
implementation to provide this testing.
Kernel functionality is stubbed and shimmed as needed in
tools/testing/vma/ which contains a fully functional userland
vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly
tested from userland.
A simple, skeleton testing implementation is provided in
tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA
merge, modify (testing split), expand and shrink functionality work
correctly.
This patch (of 4):
This patch forms part of a patch series intending to separate out VMA
logic and render it testable from userspace, which requires that core
manipulation functions be exposed in an mm/-internal header file.
In order to do this, we must abstract APIs we wish to test, in this
instance functions which ultimately invoke vma_modify().
This patch therefore moves all logic which ultimately invokes vma_modify()
to mm/userfaultfd.c, trying to transfer code at a functional granularity
where possible.
[lorenzo.stoakes@oracle.com: fix user-after-free in userfaultfd_clear_vma()] Link: https://lkml.kernel.org/r/3c947ddc-b804-49b7-8fe9-3ea3ca13def5@lucifer.local Link: https://lkml.kernel.org/r/cover.1722251717.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Gow <davidgow@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rae Moar <rmoar@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Finkel [Mon, 29 Jul 2024 14:37:43 +0000 (10:37 -0400)]
mm, memcg: cg2 memory{.swap,}.peak write tests
Extend two existing tests to cover extracting memory usage through the
newly mutable memory.peak and memory.swap.peak handlers.
In particular, make sure to exercise adding and removing watchers with
overlapping lifetimes so the less-trivial logic gets tested.
The new/updated tests attempt to detect a lack of the write handler by
fstat'ing the memory.peak and memory.swap.peak files and skip the tests if
that's the case. Additionally, skip if the file doesn't exist at all.
[davidf@vimeo.com: update tests] Link: https://lkml.kernel.org/r/20240730231304.761942-3-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-3-davidf@vimeo.com Signed-off-by: David Finkel <davidf@vimeo.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Finkel [Mon, 29 Jul 2024 14:37:42 +0000 (10:37 -0400)]
mm, memcg: cg2 memory{.swap,}.peak write handlers
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
This patch (of 2):
Other mechanisms for querying the peak memory usage of either a process or
v1 memory cgroup allow for resetting the high watermark. Restore parity
with those mechanisms, but with a less racy API.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
- writing "5" to the clear_refs pseudo-file in a processes's proc
directory resets the peak RSS.
This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior, however, there were concerns about races/ownership
issues with a global reset, so instead this change makes the reset
filedescriptor-local.
Writing any non-empty string to the memory.peak and memory.swap.peak
pseudo-files reset the high watermark to the current usage for subsequent
reads through that same FD.
Notably, following Johannes's suggestion, this implementation moves the
O(FDs that have written) behavior onto the FD write(2) path. Instead, on
the page-allocation path, we simply add one additional watermark to
conditionally bump per-hierarchy level in the page-counter.
Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.
This behavior is particularly useful for work scheduling systems that need
to track memory usage of worker processes/cgroups per-work-item. Since
memory can't be squeezed like CPU can (the OOM-killer has opinions), these
systems need to track the peak memory usage to compute system/container
fullness when binpacking workitems.
Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place. To facilitate this, we track
the peak memory usage. However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while a
work-item is executing. Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.
As this data is used to ensure that binpacked work ends up with sufficient
headroom, this use-case mostly avoids the inaccuracies surrounding
reclaimable memory.
Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com Signed-off-by: David Finkel <davidf@vimeo.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: remove arch_make_page_accessible()".
Now that s390x implements arch_make_folio_accessible(), let's convert
remaining users to use arch_make_folio_accessible() instead so we can
remove arch_make_page_accessible().
This patch (of 3):
Now that s390x implements HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE, let's turn
generic arch_make_folio_accessible() into a NOP: there are no other
targets that implement HAVE_ARCH_MAKE_PAGE_ACCESSIBLE but not
HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE.
Link: https://lkml.kernel.org/r/20240729183844.388481-1-david@redhat.com Link: https://lkml.kernel.org/r/20240729183844.388481-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
powerpc/8xx: document and enforce that split PT locks are not used
Right now, we cannot have split PT locks because 8xx does not support SMP.
But for the sake of documentation *why* 8xx is fine regarding what we
documented in huge_pte_lockptr(), let's just add code to enforce it at the
same time as documenting it.
This should also make everybody who wants to copy from the 8xx approach of
supporting such unusual ways of mapping hugetlb folios aware that it gets
tricky once multiple page tables are involved.
Link: https://lkml.kernel.org/r/20240726150728.3159964-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb: enforce that PMD PT sharing has split PMD PT locks
Sharing page tables between processes but falling back to per-MM page
table locks cannot possibly work.
So, let's make sure that we do have split PMD locks by adding a new
Kconfig option and letting that depend on CONFIG_SPLIT_PMD_PTLOCKS.
Link: https://lkml.kernel.org/r/20240726150728.3159964-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig options
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications".
This series is a follow up to the fixes:
"[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking"
When working on the fixes, I wondered why 8xx is fine (-> never uses split
PT locks) and how PT locking even works properly with PMD page table
sharing (-> always requires split PMD PT locks).
Let's improve the split PT lock detection, make hugetlb properly depend on
it and make 8xx bail out if it would ever get enabled by accident.
As an alternative to patch #3 we could extend the Kconfig
SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the
code that actually implements it feels a bit nicer for documentation
purposes, and there is no need to actually disable it because it should
always be disabled (!SMP).
Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks
are still getting used where we would expect them.
Roman Gushchin [Fri, 26 Jul 2024 20:31:10 +0000 (20:31 +0000)]
mm: page_counters: initialize usage using ATOMIC_LONG_INIT() macro
When a page_counter structure is initialized, there is no need to use an
atomic set operation to initialize the usage counter because at this point
the structure is not visible to anybody else. ATOMIC_LONG_INIT() is what
should be used in such cases.
Link: https://lkml.kernel.org/r/20240726203110.1577216-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Roman Gushchin [Fri, 26 Jul 2024 20:31:09 +0000 (20:31 +0000)]
mm: page_counters: put page_counter_calculate_protection() under CONFIG_MEMCG
Put page_counter_calculate_protection() under CONFIG_MEMCG.
The protection functionality (min/low limits) is not supported by any
other cgroup subsystem, so page_counter_calculate_protection() and related
static effective_protection() can be compiled out if CONFIG_MEMCG is not
enabled.
Link: https://lkml.kernel.org/r/20240726203110.1577216-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: memcg: page counters optimizations", v3.
This patchset contains 3 independent small optimizations of page counters.
This patch (of 3):
Memory protection (min/low) requires a constant tracking of protected
memory usage. propagate_protected_usage() is called on each page counters
update and does a number of operations even in cases when the actual
memory protection functionality is not supported (e.g. hugetlb cgroups or
memcg swap counters).
It's obviously inefficient and leads to a waste of CPU cycles. It can be
addressed by calling propagate_protected_usage() only for the counters
which do support memory guarantees. As of now it's only memcg->memory -
the unified memory memcg counter.
Link: https://lkml.kernel.org/r/20240726203110.1577216-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pavel Tikhomirov [Thu, 25 Jul 2024 04:12:15 +0000 (12:12 +0800)]
kmemleak: enable tracking for percpu pointers
Patch series "kmemleak: support for percpu memory leak detect'.
This is a rework of this series:
https://lore.kernel.org/lkml/20200921020007.35803-1-chenjun102@huawei.com/
Originally I was investigating a percpu leak on our customer nodes and
having this functionality was a huge help, which lead to this fix [1].
So probably it's a good idea to have it in mainstream too, especially as
after [2] it became much easier to implement (we already have a separate
tree for percpu pointers).
[1] commit 0af8c09c89681 ("netfilter: x_tables: fix percpu counter block leak on error path when creating new netns")
[2] commit 39042079a0c24 ("kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers")
This patch (of 2):
This basically does:
- Add min_percpu_addr and max_percpu_addr to filter out unrelated data
similar to min_addr and max_addr;
- Set min_count for percpu pointers to 1 to start tracking them;
- Calculate checksum of percpu area as xor of crc32 for each cpu;
- Split pointer lookup and update refs code into separate helper and use
it twice: once as if the pointer is a virtual pointer and once as if
it's percpu.
As part of the dynamic kernel stack project, we need to know the amount of
data that can be saved by reducing the default kernel stack size [1].
Provide a kernel stack usage histogram to aid in optimizing kernel stack
sizes and minimizing memory waste in large-scale environments. The
histogram divides stack usage into power-of-two buckets and reports the
results in /proc/vmstat. This information is especially valuable in
environments with millions of machines, where even small optimizations can
have a significant impact.
The histogram data is presented in /proc/vmstat with entries like
"kstack_1k", "kstack_2k", and so on, indicating the number of threads that
exited with stack usage falling within each respective bucket.
Note: once the dynamic kernel stack is implemented it will depend on the
implementation the usability of this feature: On hardware that supports
faults on kernel stacks, we will have other metrics that show the total
number of pages allocated for stacks. On hardware where faults are not
supported, we will most likely have some optimization where only some
threads are extended, and for those, these metrics will still be very
useful.
[1] https://lwn.net/Articles/974367
Link: https://lkml.kernel.org/r/20240730150158.832783-3-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240724203322.2765486-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At the moment the valid index for the indirection tables for memcg stats
and events is < S8_MAX. These indirection tables are used in performance
critical codepaths. With the latest addition to the vm_events, the
NR_VM_EVENT_ITEMS has gone over S8_MAX. One way to resolve is to increase
the entry size of the indirection table from int8_t to int16_t but this
will increase the potential number of cachelines needed to access the
indirection table.
This patch took a different approach and make the valid index < U8_MAX.
In this way the size of the indirection tables will remain same and we
only need to invalid index check from less than 0 to equal to U8_MAX. In
this approach we have also removed a subtraction from the performance
critical codepaths.
mm: shrink skip folio mapped by an exiting process
The releasing process of the non-shared anonymous folio mapped solely by
an exiting process may go through two flows: 1) the anonymous folio is
firstly is swaped-out into swapspace and transformed into a swp_entry in
shrink_folio_list; 2) then the swp_entry is released in the process
exiting flow. This will result in the high cpu load of releasing a
non-shared anonymous folio mapped solely by an exiting process.
When the low system memory and the exiting process exist at the same time,
it will be likely to happen, because the non-shared anonymous folio mapped
solely by an exiting process may be reclaimed by shrink_folio_list.
This patch is that shrink skips the non-shared anonymous folio solely
mapped by an exting process and this folio is only released directly in
the process exiting flow, which will save swap-out time and alleviate the
load of the process exiting.
Barry provided some effectiveness testing in [1]. "I observed that
this patch effectively skipped 6114 folios (either 4KB or 64KB mTHP),
potentially reducing the swap-out by up to 92MB (97,300,480 bytes)
during the process exit. The working set size is 256MB."
Fold lru_rotate into cpu_fbatches, and rename the folio_batch and the lock
protecting it to lru_move_tail and lock_irq respectively so that all the
boilerplate can be removed at the end of this series.
Also remove data_race() around folio_batch_count(), which is out of place:
all folio_batch_count() calls on remote cpu_fbatches are subject to
data_race(), and therefore data_race() should be inside
folio_batch_count().
Rename cpu_fbatches->activate to cpu_fbatches->lru_activate, and its
handler folio_activate_fn() to lru_activate() so that all the boilerplate
can be removed at the end of this series.
Zi Yan [Wed, 24 Jul 2024 13:01:15 +0000 (09:01 -0400)]
memory tiering: count PGPROMOTE_SUCCESS when mem tiering is enabled.
memory tiering can be enabled/disabled at runtime and
sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING is used to
check it. In migrate_misplaced_folio(), the check is missing when
PGPROMOTE_SUCCESS is incremented. Add the missing check.
Link: https://lkml.kernel.org/r/20240724130115.793641-4-ziy@nvidia.com Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency") Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Closes: https://lore.kernel.org/linux-mm/f4ae2c9c-fe40-4807-bdb2-64cf2d716c1a@huawei.com/ Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If memory tiering mode is on and a folio is not in the top tier memory,
folio's cpupid field is repurposed to store page access time. Instead of
an open coded check, use a function to encapsulate the check.
Link: https://lkml.kernel.org/r/20240724130115.793641-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 24 Jul 2024 13:01:13 +0000 (09:01 -0400)]
memory tiering: read last_cpupid correctly in do_huge_pmd_numa_page()
Patch series "Various memory tiering fixes", v3.
This patch (of 3):
last_cpupid is only available when memory tiering is off or the folio is
in toptier node. Complete the check to read last_cpupid when it is
available.
Before the fix, the default last_cpupid will be used even if memory
tiering mode is turned off at runtime instead of the actual value. This
can prevent task_numa_fault() from getting right numa fault stats, but
should not cause any crash. User might see performance changes after the
fix.
Link: https://lkml.kernel.org/r/20240724130115.793641-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20240724130115.793641-2-ziy@nvidia.com Fixes: 33024536bafd ("memory tiering: hot page selection with hint page fault latency") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/9af34a6b-ca56-4a64-8aa6-ade65f109288@redhat.com/ Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Wed, 24 Jul 2024 02:00:56 +0000 (14:00 +1200)]
mm: extend 'usage' parameter so that cluster_swap_free_nr() can be reused
Extend a usage parameter so that cluster_swap_free_nr() can be reused by
both swapcache_clear() and swap_free(). __swap_entry_free() is quite
similar but more tricky as it requires the return value of
__swap_entry_free_locked() which cluster_swap_free_nr() doesn't support.
Link: https://lkml.kernel.org/r/20240724020056.65838-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josef Bacik [Thu, 18 Jul 2024 21:26:07 +0000 (17:26 -0400)]
mm: remove foll_flags in __get_user_pages
Now that we're not passing around a pointer to the flags, there's no
reason to have an extra variable for the gup_flags, simply pass the
gup_flags directly everywhere.
Josef Bacik [Thu, 18 Jul 2024 21:26:06 +0000 (17:26 -0400)]
mm: cleanup flags usage in faultin_page
Patch series "mm: some small page fault cleanups".
I was recently wreaking havoc in the page fault code and I noticed some
things that could be cleaned up. We no longer modify the gup flags in
faultin_page, so we can clean up how we pass the flags in and remove the
extra variable in __get_user_pages.
This patch (of 2):
We're passing a pointer to the foll_flags for faultin_page, however we
never modify the flags in this call. Change this to just take the flags
value instead.
mm/damon/lru_sort: adjust local variable to dynamic allocation
When KASAN is enabled and built with clang:
mm/damon/lru_sort.c:199:12: error: stack frame size (2328) exceeds
limit (2048) in 'damon_lru_sort_apply_parameters' [-Werror,-Wframe-larger-than]
static int damon_lru_sort_apply_parameters(void)
^
1 error generated.
This is because damon_lru_sort_quota contains a large array, and
assigning this variable to a local variable causes a large amount of
stack space to be occupied.
mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO
hugetlb_vmemmap_optimize_folio() and hugetlb_vmemmap_restore_folio() are
wrappers meant to be called regardless of whether HVO is enabled.
Therefore, they should not call synchronize_rcu(). Otherwise, it
regresses use cases not enabling HVO.
So move synchronize_rcu() to __hugetlb_vmemmap_optimize_folio() and
__hugetlb_vmemmap_restore_folio(), and call it once for each batch of
folios when HVO is enabled.
Link: https://lkml.kernel.org/r/20240719042503.2752316-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407091001.1250ad4a-oliver.sang@intel.com Reported-by: Janosch Frank <frankja@linux.ibm.com> Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com> Acked-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/78656.1720853990@turing-police Fixes: e93d4166b40a ("mm: memcg: put cgroup v1-specific code under a config option") Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 22 Jul 2024 05:43:19 +0000 (13:43 +0800)]
mm: shmem: move shmem_huge_global_enabled() into shmem_allowable_huge_orders()
Move shmem_huge_global_enabled() into shmem_allowable_huge_orders(), so
that shmem_allowable_huge_orders() can also help to find the allowable
huge orders for tmpfs. Moreover the shmem_huge_global_enabled() can
become static. While we are at it, passing the vma instead of mm for
shmem_huge_global_enabled() makes code cleaner.
No functional changes.
Link: https://lkml.kernel.org/r/8e825146bb29ee1a1c7bd64d2968ff3e19be7815.1721626645.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 22 Jul 2024 05:43:18 +0000 (13:43 +0800)]
mm: shmem: rename shmem_is_huge() to shmem_huge_global_enabled()
shmem_is_huge() is now used to check if the top-level huge page is
enabled, thus rename it to reflect its usage.
Link: https://lkml.kernel.org/r/da53296e0ab6359aa083561d9dc01e4223d60fbe.1721626645.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 22 Jul 2024 05:43:17 +0000 (13:43 +0800)]
mm: shmem: simplify the suitable huge orders validation for tmpfs
Patch series "Some cleanups for shmem", v3.
This series does some cleanups to reuse code, rename functions and
simplify logic to make code more clear. No functional changes are
expected.
This patch (of 3):
Move the suitable huge orders validation into shmem_suitable_orders() for
tmpfs, which can reuse some code to simplify the logic.
In addition, we don't have special handling for the error code -E2BIG when
checking for conflicts with PMD sized THP in the pagecache for tmpfs,
instead, it will just fallback to order-0 allocations like this patch
does, so this simplification will not add functional changes.
Besides the obvious (and desired) difference between krealloc() and
kvrealloc(), there is some inconsistency in their function signatures and
behavior:
- krealloc() frees the memory when the requested size is zero, whereas
kvrealloc() simply returns a pointer to the existing allocation.
- krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas
kvrealloc() does not accept a NULL pointer at all and, if passed,
would fault instead.
- krealloc() is self-contained, whereas kvrealloc() relies on the caller
to provide the size of the previous allocation.
Inconsistent behavior throughout allocation APIs is error prone, hence
make kvrealloc() behave like krealloc(), which seems superior in all
mentioned aspects.
Besides that, implementing kvrealloc() by making use of krealloc() and
vrealloc() provides oppertunities to grow (and shrink) allocations more
efficiently. For instance, vrealloc() can be optimized to allocate and
map additional pages to grow the allocation or unmap and free unused pages
to shrink the allocation.
[dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org
[dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org
[dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Align kvrealloc() with krealloc()", v2.
Besides the obvious (and desired) difference between krealloc() and
kvrealloc(), there is some inconsistency in their function signatures and
behavior:
- krealloc() frees the memory when the requested size is zero, whereas
kvrealloc() simply returns a pointer to the existing allocation.
- krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas
kvrealloc() does not accept a NULL pointer at all and, if passed, would fault
instead.
- krealloc() is self-contained, whereas kvrealloc() relies on the caller to
provide the size of the previous allocation.
Inconsistent behavior throughout allocation APIs is error prone, hence
make kvrealloc() behave like krealloc(), which seems superior in all
mentioned aspects.
In order to be able to get rid of kvrealloc()'s oldsize parameter,
introduce vrealloc() and make use of it in kvrealloc().
Making use of vrealloc() in kvrealloc() also provides oppertunities to
grow (and shrink) allocations more efficiently. For instance, vrealloc()
can be optimized to allocate and map additional pages to grow the
allocation or unmap and free unused pages to shrink the allocation.
Besides the above, those functions are required by Rust's allocator abstractons
[1] (rework based on this series in [2]). With `Vec` or `KVec` respectively,
potentially growing (and shrinking) data structures are rather common.
Currently, krealloc() requires the caller to pass the size of the previous
memory allocation, which, instead, should be self-contained.
We attempt to fix this in a subsequent patch which, in order to do so,
requires vrealloc().
Besides that, we need realloc() functions for kernel allocators in Rust
too. With `Vec` or `KVec` respectively, potentially growing (and
shrinking) data structures are rather common.
Matthew Cassell [Mon, 22 Jul 2024 17:13:16 +0000 (17:13 +0000)]
mm: add node_reclaim successes to VM event counters
/proc/vmstat currently shows the number of node_reclaim() failures when
vm.zone_reclaim_mode is set appropriately. It would be convenient to have
the number of successes right next to zone_reclaim_failed (similar to
compaction and migration).
While just a trivially addition to the vmstat file. It was helpful during
benchmarking to not have to probe node_reclaim() to observe the
success/failure ratio.
Link: https://lkml.kernel.org/r/20240722171316.7517-1-mcassell411@gmail.com Signed-off-by: Matthew Cassell <mcassell411@gmail.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'v6.11-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- copy_file_range fix
- two read fixes including read past end of file rc fix and read retry
crediting fix
- falloc zero range fix
* tag 'v6.11-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix FALLOC_FL_ZERO_RANGE to preflush buffered part of target region
cifs: Fix copy offload to flush destination region
netfs, cifs: Fix handling of short DIO read
cifs: Fix lack of credit renegotiation on read retry
Merge tag 'bcachefs-2024-08-21' of https://github.com/koverstreet/bcachefs
Push bcachefs fixes from Kent Overstreet:
"The data corruption in the buffered write path is troubling; inode
lock should not have been able to cause that...
- Fix a rare data corruption in the rebalance path, caught as a nonce
inconsistency on encrypted filesystems
- Revert lockless buffered write path
- Mark more errors as autofix"
* tag 'bcachefs-2024-08-21' of https://github.com/koverstreet/bcachefs:
bcachefs: Mark more errors as autofix
bcachefs: Revert lockless buffered IO path
bcachefs: Fix bch2_extents_match() false positive
bcachefs: Fix failure to return error in data_update_index_update()
Kent Overstreet [Thu, 22 Aug 2024 15:47:32 +0000 (11:47 -0400)]
bcachefs: Mark more errors as autofix
errors that are known to always be safe to fix should be autofix: this
should be most errors even at this point, but that will need some
thorough review.
note that errors are still logged in the superblock, so we'll still know
that they happened.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined if it's write
calls are being dropped or dirty folios.
Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.
It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisection the kernel config initially pointer the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.
Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.
My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.
Fixes: 7e64c86cdc6c ("bcachefs: Buffered write path now can avoid the inode lock") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Linus Torvalds [Sat, 31 Aug 2024 21:18:48 +0000 (09:18 +1200)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull misc fixes from Guenter Roeck.
These are fixes for regressions that Guenther has been reporting, and
the maintainers haven't picked up and sent in. With rc6 fairly imminent,
I'm taking them directly from Guenter.
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
apparmor: fix policy_unpack_test on big endian systems
Revert "MIPS: csrc-r4k: Apply verification clocksource flags"
microblaze: don't treat zero reserved memory regions as error
Linus Torvalds [Sat, 31 Aug 2024 21:07:44 +0000 (09:07 +1200)]
Merge tag 'pwrseq-fixes-for-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull power sequencing fix from Bartosz Golaszewski:
"A follow-up fix for the power sequencing subsystem. It turned out the
previous fix for this driver was incomplete and broke the WLAN support
on some platforms. This addresses the issue.
- set the direction of the wlan-enable GPIO to output after
requesting it as-is"
* tag 'pwrseq-fixes-for-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
power: sequencing: qcom-wcn: set the wlan-enable GPIO to output
power: sequencing: qcom-wcn: set the wlan-enable GPIO to output
Commit a9aaf1ff88a8 ("power: sequencing: request the WLAN enable GPIO
as-is") broke WLAN on boards on which the wlan-enable GPIO enabling the
wifi module isn't in output mode by default. We need to set direction to
output while retaining the value that was already set to keep the ath
module on if it's already started.
Linus Torvalds [Sat, 31 Aug 2024 19:00:38 +0000 (07:00 +1200)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Minor fixes only.
The sd.c one ignores a sync cache request if format is in progress
which can happen if formatting a drive across suspend/resume"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: Ignore command SYNCHRONIZE CACHE error if format in progress
scsi: aacraid: Fix double-free on probe failure
scsi: lpfc: Fix overflow build issue
Linus Torvalds [Sat, 31 Aug 2024 18:55:47 +0000 (06:55 +1200)]
Merge tag 'nfsd-6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- One more write delegation fix
* tag 'nfsd-6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix nfsd4_deleg_getattr_conflict in presence of third party lease
Linus Torvalds [Sat, 31 Aug 2024 18:48:37 +0000 (06:48 +1200)]
Merge tag 'xfs-6.11-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Do not call out v1 inodes with non-zero di_nlink field as being
corrupt
- Change xfs_finobt_count_blocks() to count "free inode btree" blocks
rather than "inode btree" blocks
- Don't report the number of trimmed bytes via FITRIM because the
underlying storage isn't required to do anything and failed discard
IOs aren't reported to the caller anyway
- Fix incorrect setting of rm_owner field in an rmap query
- Report missing disk offset range in an fsmap query
- Obtain m_growlock when extending realtime section of the filesystem
- Reset rootdir extent size hint after extending realtime section of
the filesystem
* tag 'xfs-6.11-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reset rootdir extent size hint after growfsrt
xfs: take m_growlock when running growfsrt
xfs: Fix missing interval for missing_owner in xfs fsmap
xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code
xfs: Fix the owner setting issue for rmap query in xfs fsmap
xfs: don't bother reporting blocks trimmed via FITRIM
xfs: xfs_finobt_count_blocks() walks the wrong btree
xfs: fix folio dirtying for XFILE_ALLOC callers
xfs: fix di_onlink checking for V1/V2 inodes
Linus Torvalds [Sat, 31 Aug 2024 18:42:13 +0000 (06:42 +1200)]
Merge tag 'arm-fixes-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"There is a fairly large number of bug fixes for Qualcomm platforms,
most of them addressing issues with the devicetree files for the newly
added Snapdragon X1 based laptops to make them more reliable.
The Qualcomm driver changes address a few build-time issues as well as
runtime problems in the tzmem and scm firmware, the USB Type-C driver,
and the cmd-db and pmic_glink soc drivers.
The NXP i.MX usually gets a bunch of devicetree fixes that is
proportional to the number of supported machines. This includes both
warning fixes and correctness for the 64-bit i.MX9, i.MX8 and
layerscape platforms, as well as a single fix for a 32-bit i.MX6 based
board.
The other changes are the usual minor changes, including an update to
the MAINTAINERS file, an omap3 dts file and a SoC driver for mpfs
(risc-v)"
* tag 'arm-fixes-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (50 commits)
firmware: microchip: fix incorrect error report of programming:timeout on success
soc: qcom: pd-mapper: Fix singleton refcount
firmware: qcom: tzmem: disable sdm670 platform
soc: qcom: pmic_glink: Actually communicate when remote goes down
usb: typec: ucsi: Move unregister out of atomic section
soc: qcom: pmic_glink: Fix race during initialization
firmware: qcom: qseecom: remove unused functions
firmware: qcom: tzmem: fix virtual-to-physical address conversion
firmware: qcom: scm: Mark get_wq_ctx() as atomic call
arm64: dts: qcom: x1e80100: Fix Adreno SMMU global interrupt
arm64: dts: qcom: disable GPU on x1e80100 by default
arm64: dts: imx8mm-phygate: fix typo pinctrcl-0
arm64: dts: imx95: correct L3Cache cache-sets
arm64: dts: imx95: correct a55 power-domains
arm64: dts: freescale: imx93-tqma9352-mba93xxla: fix typo
arm64: dts: freescale: imx93-tqma9352: fix CMA alloc-ranges
ARM: dts: imx6dl-yapp43: Increase LED current to match the yapp4 HW design
arm64: dts: imx93: update default value for snps,clk-csr
arm64: dts: freescale: tqma9352: Fix watchdog reset
arm64: dts: imx8mp-beacon-kit: Fix Stereo Audio on WM8962
...
Linus Torvalds [Sat, 31 Aug 2024 03:32:38 +0000 (15:32 +1200)]
Merge tag 'input-for-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fix from Dmitry Torokhov:
- a fix for Cypress PS/2 touchpad for regression introduced in 6.11
merge window where a timeout condition is incorrectly reported for
all extended Cypress commands
* tag 'input-for-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: cypress_ps2 - fix waiting for command response
Linus Torvalds [Sat, 31 Aug 2024 02:54:11 +0000 (14:54 +1200)]
Merge tag 'pci-v6.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Add Manivannan Sadhasivam as PCI native host bridge and endpoint
driver reviewer (Manivannan Sadhasivam)
- Disable MHI RAM data parity error interrupt for qcom SA8775P SoC to
work around hardware erratum that causes a constant stream of
interrupts (Manivannan Sadhasivam)
- Don't try to fall back to qcom Operating Performance Points (OPP)
support unless the platform actually supports OPP (Manivannan
Sadhasivam)
- Add imx@lists.linux.dev mailing list to MAINTAINERS for NXP
layerscape and imx6 PCI controller drivers (Frank Li)
* tag 'pci-v6.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
MAINTAINERS: PCI: Add NXP PCI controller mailing list imx@lists.linux.dev
PCI: qcom: Use OPP only if the platform supports it
PCI: qcom-ep: Disable MHI RAM data parity error interrupt for SA8775P SoC
MAINTAINERS: Add Manivannan Sadhasivam as Reviewer for PCI native host bridge and endpoint drivers
Linus Torvalds [Sat, 31 Aug 2024 01:51:27 +0000 (13:51 +1200)]
Merge tag 'io_uring-6.11-20240830' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- A fix for a regression that happened in 6.11 merge window, where the
copying of iovecs for compat mode applications got broken for certain
cases.
- Fix for a bug introduced in 6.10, where if using recv/send bundles
with classic provided buffers, the recv/send would fail to set the
right iovec count. This caused 0 byte send/recv results. Found via
code coverage testing and writing a test case to exercise it.
* tag 'io_uring-6.11-20240830' of git://git.kernel.dk/linux:
io_uring/kbuf: return correct iovec count from classic buffer peek
io_uring/rsrc: ensure compat iovecs are copied correctly
Linus Torvalds [Fri, 30 Aug 2024 18:33:59 +0000 (06:33 +1200)]
Merge tag 'lsm-pr-20240830' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm fix from Paul Moore:
"One small patch to correct a NFS permissions problem with SELinux and
Smack"
* tag 'lsm-pr-20240830' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
selinux,smack: don't bypass permissions check in inode_setsecctx hook
Linus Torvalds [Fri, 30 Aug 2024 18:25:34 +0000 (06:25 +1200)]
Merge tag 'pm-6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix three issues in the amd-pstate cpufreq driver.
Specifics:
- Remove checks for highest performance match on preferred cores when
updating preferred core ranking in amd-pstate (Mario Limonciello)
- Make amd-pstate call topology_logical_package_id() instead of
logical_die_id() to get a socked ID for a CPU (Gautham Shenoy)
- Fix uninitialized variable in amd_pstate_cpu_boost_update() (Dan
Carpenter)"
* tag 'pm-6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate-ut: Don't check for highest perf matching on prefcore
cpufreq/amd-pstate: Use topology_logical_package_id() instead of logical_die_id()
cpufreq: amd-pstate: Fix uninitialized variable in amd_pstate_cpu_boost_update()
Linus Torvalds [Fri, 30 Aug 2024 18:15:02 +0000 (06:15 +1200)]
Merge tag 'soundwire-6.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire
Pull soundwire fix from Vinod Koul:
- Single fix for non-continous port map programming
* tag 'soundwire-6.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
soundwire: stream: fix programming slave ports for non-continous port maps
Linus Torvalds [Fri, 30 Aug 2024 18:11:34 +0000 (06:11 +1200)]
Merge tag 'iommu-fixes-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Fix a device-stall problem in bad io-page-fault setups (faults
received from devices with no supporting domain attached).
- Context flush fix for Intel VT-d.
- Do not allow non-read+non-write mapping through iommufd as most
implementations can not handle that.
- Fix a possible infinite-loop issue in map_pages() path.
- Add Jean-Philippe as reviewer for SMMUv3 SVA support
* tag 'iommu-fixes-v6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
MAINTAINERS: Add Jean-Philippe as SMMUv3 SVA reviewer
iommu: Do not return 0 from map_pages if it doesn't do anything
iommufd: Do not allow creating areas without READ or WRITE
iommu/vt-d: Fix incorrect domain ID in context flush helper
iommu: Handle iommu faults for a bad iopf setup
Jens Axboe [Fri, 30 Aug 2024 16:45:54 +0000 (10:45 -0600)]
io_uring/kbuf: return correct iovec count from classic buffer peek
io_provided_buffers_select() returns 0 to indicate success, but it should
be returning 1 to indicate that 1 vec was mapped. This causes peeking
to fail with classic provided buffers, and while that's not a use case
that anyone should use, it should still work correctly.
The end result is that no buffer will be selected, and hence a completion
with '0' as the result will be posted, without a buffer attached.
NeilBrown [Wed, 28 Aug 2024 23:06:28 +0000 (09:06 +1000)]
nfsd: fix nfsd4_deleg_getattr_conflict in presence of third party lease
It is not safe to dereference fl->c.flc_owner without first confirming
fl->fl_lmops is the expected manager. nfsd4_deleg_getattr_conflict()
tests fl_lmops but largely ignores the result and assumes that flc_owner
is an nfs4_delegation anyway. This is wrong.
With this patch we restore the "!= &nfsd_lease_mng_ops" case to behave
as it did before the change mentioned below. This is the same as the
current code, but without any reference to a possible delegation.
Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Jens Axboe [Wed, 28 Aug 2024 15:42:33 +0000 (09:42 -0600)]
io_uring/rsrc: ensure compat iovecs are copied correctly
For buffer registration (or updates), a userspace iovec is copied in
and updated. If the application is within a compat syscall, then the
iovec type is compat_iovec rather than iovec. However, the type used
in __io_sqe_buffers_update() and io_sqe_buffers_register() is always
struct iovec, and hence the source is incremented by the size of a
non-compat iovec in the loop. This misses every other iovec in the
source, and will run into garbage half way through the copies and
return -EFAULT to the application.
Maintain the source address separately and assign to our user vec
pointer, so that copies always happen from the right source address.
While in there, correct a bad placement of __user which triggered
the following sparse warning prior to this fix:
io_uring/rsrc.c:981:33: warning: cast removes address space '__user' of expression
io_uring/rsrc.c:981:30: warning: incorrect type in assignment (different address spaces)
io_uring/rsrc.c:981:30: expected struct iovec const [noderef] __user *uvec
io_uring/rsrc.c:981:30: got struct iovec *[noderef] __user
Fixes: f4eaf8eda89e ("io_uring/rsrc: Drop io_copy_iov in favor of iovec API") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 30 Aug 2024 02:17:30 +0000 (14:17 +1200)]
Merge tag 'drm-fixes-2024-08-30' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Another week, another set of GPU fixes. amdgpu and vmwgfx leading the
charge, then i915 and xe changes along with v3d and some other bits.
The TTM revert is due to some stuttering graphical apps probably due
to longer stalls while prefaulting.
i915:
- Fix #11195: The external display connect via USB type-C dock stays
blank after re-connect the dock
- Make DSI backlight work for 2G version of Lenovo Yoga Tab 3 X90F
- Move ARL GuC firmware to correct version
vmwgfx:
- prevent unmapping active read buffers
- fix prime with external buffers
- disable coherent dumb buffers without 3d
v3d:
- disable preemption while updating GPU stats"
* tag 'drm-fixes-2024-08-30' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/hwmon: Fix WRITE_I1 param from u32 to u16
drm/v3d: Disable preemption while updating GPU stats
drm/amd/pm: Drop unsupported features on smu v14_0_2
drm/amd/pm: Add support for new P2S table revision
drm/amdgpu: support for gc_info table v1.3
drm/amd/display: avoid using null object of framebuffer
drm/amdgpu/gfx12: set UNORD_DISPATCH in compute MQDs
drm/amd/pm: update message interface for smu v14.0.2/3
drm/amdgpu/swsmu: always force a state reprogram on init
drm/amdgpu/smu13.0.7: print index for profiles
drm/amdgpu: align pp_power_profile_mode with kernel docs
drm/i915/dp_mst: Fix MST state after a sink reset
drm/xe: Invalidate media_gt TLBs
drm/i915: ARL requires a newer GSC firmware
drm/i915/dsi: Make Lenovo Yoga Tab 3 X90F DMI match less strict
video/aperture: optionally match the device in sysfb_disable()
drm/vmwgfx: Disable coherent dumb buffers without 3d
drm/vmwgfx: Fix prime with external buffers
drm/vmwgfx: Prevent unmapping active read buffers
Revert "drm/ttm: increase ttm pre-fault value to PMD size"