rv/include: Add deterministic automata monitor definition via C macros
In Linux terms, the runtime verification monitors are encapsulated
inside the "RV monitor" abstraction. The "RV monitor" includes a set
of instances of the monitor (per-cpu monitor, per-task monitor, and
so on), the helper functions that glue the monitor to the system
reference model, and the trace output as a reaction for event parsing
and exceptions, as depicted below:
Add the rv/da_monitor.h, enabling automatic code generation for the
*Monitor Instance(s)* using C macros, and code to support it.
Using macros for monitor synthesis has three benefits. It:
- Reduces code duplication;
- Eases bug fixes and improvements;
- Keeps developers from changing the core of the monitor code
to manipulate the model in a (let's say) non-standard way.
This initial implementation presents three different types of monitor
instances:
The first declares the functions for a global deterministic automata monitor,
the second for monitors with per-cpu instances, and the third with per-task
instances.
Link: https://lkml.kernel.org/r/51b0bf425a281e226dfeba7401d2115d6091f84e.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
rv/include: Add helper functions for deterministic automata
Formally, a deterministic automaton, denoted by G, is defined as a
quintuple:
G = { X, E, f, x_0, X_m }
where:
- X is the set of states;
- E is the finite set of events;
- x_0 is the initial state;
- X_m (subset of X) is the set of marked states.
- f : X x E -> X is the transition function. It defines the
state transition on the occurrence of an event from E in
a state from X. In the special case of deterministic automata,
the occurrence of an event from E in a state from X has a
deterministic next state in X.
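The definition above can be made concrete with a small table-driven
sketch. The state and event names below are illustrative assumptions
(a toy two-state model), not taken from this patch; INVALID is a
hypothetical marker for pairs where f is undefined.

```python
from enum import IntEnum


class State(IntEnum):
    PREEMPTIVE = 0        # x_0, the initial state (assumed name)
    NON_PREEMPTIVE = 1    # assumed name


class Event(IntEnum):
    PREEMPT_DISABLE = 0   # assumed name
    PREEMPT_ENABLE = 1    # assumed name
    SCHED_WAKING = 2      # assumed name


INVALID = -1  # hypothetical marker: f is undefined for this (state, event)

# f : X x E -> X stored as a table indexed by [state][event].
TRANSITIONS = [
    # PREEMPT_DISABLE       PREEMPT_ENABLE     SCHED_WAKING
    [State.NON_PREEMPTIVE,  INVALID,           INVALID],               # PREEMPTIVE
    [INVALID,               State.PREEMPTIVE,  State.NON_PREEMPTIVE],  # NON_PREEMPTIVE
]

INITIAL_STATE = State.PREEMPTIVE    # x_0
MARKED_STATES = {State.PREEMPTIVE}  # X_m


def next_state(state, event):
    """Return f(state, event), or INVALID if the event is not allowed."""
    return TRANSITIONS[state][event]


def run(events, state=INITIAL_STATE):
    """Feed events to the automaton; return (accepted, last valid state)."""
    for event in events:
        target = next_state(state, event)
        if target == INVALID:
            return False, state
        state = target
    return True, state
```

Because each (state, event) pair maps to at most one target, a lookup is
a single array access, which is what makes this representation cheap
enough for monitoring.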
An automaton can also be represented using a graphical format of
vertices (nodes) and edges. The open-source tool Graphviz can produce
this graphic format using the (textual) DOT language as the source code.
The dot2c tool presented in this paper:
De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.
translates a deterministic automaton in the DOT format into a C
source code representation that can be used for monitoring.
This header file implements helper functions to facilitate the usage
of the C output from dot2c/k for monitoring.
Link: https://lkml.kernel.org/r/563234f2bfa84b540f60cf9e39c2d9f0eea95a55.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A runtime monitor can cause a reaction to the detection of an
exception on the model's execution. By default, the monitors have
tracing reactions, printing the monitor output via tracepoints.
But other reactions can be added (on-demand) via this interface.
The user interface resembles the kernel tracing interface and
presents these files:
"available_reactors"
- Reading shows the available reactors, one per line.
For example:
# cat available_reactors
nop
panic
printk
"reacting_on"
- A general on/off switch for reactors. Switching it off disables
all reactions.
"monitors/MONITOR/reactors"
- Lists the available reactors, with the selected reaction for the given
MONITOR inside []. The default one is the nop (no operation)
reactor.
- Writing the name of a reactor enables it for the given
MONITOR.
RV is a lightweight (yet rigorous) method that complements classical
exhaustive verification techniques (such as model checking and
theorem proving) with a more practical approach to complex systems.
RV works by analyzing the trace of the system's actual execution,
comparing it against a formal specification of the system behavior.
RV can give precise information on the runtime behavior of the
monitored system while enabling reactions to unexpected
events, avoiding, for example, the propagation of a failure in
safety-critical systems.
The development of this interface is rooted in the work presented in
the paper:
De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.
And:
De Oliveira, Daniel Bristot. Automata-based formal analysis
and verification of the real-time Linux kernel. PhD Thesis, 2020.
The RV interface resembles the tracing/ interface on purpose. The current
path for the RV interface is /sys/kernel/tracing/rv/.
It presents these files:
"available_monitors"
- List the available monitors, one per line.
For example:
# cat available_monitors
wip
wwnr
"enabled_monitors"
- Lists the enabled monitors, one per line;
- Writing to it enables a given monitor;
- Writing a monitor name with a '!' prefix disables it;
- Truncating the file disables all enabled monitors.
Note that more than one monitor can be enabled concurrently.
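The write semantics of "enabled_monitors" listed above can be modelled
in a few lines. This is only a sketch of the described behavior, not
the kernel code: apply_write() and its parameters are hypothetical,
and both folding truncation into the same call and rejecting unknown
names with an exception are assumptions made for illustration.

```python
def apply_write(enabled, available, text, truncate=False):
    """Model one write to "enabled_monitors"; return the new enabled list.

    enabled   -- currently enabled monitor names
    available -- names listed in "available_monitors"
    text      -- what is written, e.g. "wip" or "!wip"
    truncate  -- True if the file was truncated before the write
    """
    # Truncating the file disables every enabled monitor.
    result = [] if truncate else list(enabled)

    name = text.strip()
    if not name:
        return result

    # A '!' prefix asks for the monitor to be disabled.
    disable = name.startswith("!")
    if disable:
        name = name[1:]

    if name not in available:
        raise ValueError(f"unknown monitor: {name}")  # assumed error handling

    if disable:
        result = [m for m in result if m != name]
    elif name not in result:
        result.append(name)  # several monitors may be enabled at once
    return result
```

With available = ["wip", "wwnr"], writing "wip" and then "wwnr" leaves
both enabled, and writing "!wip" afterwards leaves only "wwnr".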
"monitoring_on"
- A general on/off switch for monitoring. Note
that it does not disable enabled monitors or detach events,
but stops the per-entity monitors from monitoring the events
received from the system. It resembles the "tracing_on" switch.
"monitors/"
Each monitor has its own directory inside "monitors/", where the
monitor-specific files are presented.
The "monitors/" directory resembles the "events" directory on
tracefs.
For example:
# cd monitors/wip/
# ls
desc enable
# cat desc
wakeup in preemptive per-cpu testing monitor.
# cat enable
0
For further information, see the comments in the header of
kernel/trace/rv/rv.c from this patch.
Link: https://lkml.kernel.org/r/a4bfe038f50cb047bfb343ad0e12b0e646ab308b.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When a ftrace_bug happens (where ftrace fails to modify a location) it is
helpful to have what was at that location as well as what was expected to
be there.
But with the conversion to text_poke() the variable that records the
expected value for debugging was dropped. Unfortunately, I noticed this when I
needed it. Add it back.
Link: https://lkml.kernel.org/r/20220726101851.069d2e70@gandalf.local.home
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 768ae4406a5c ("x86/ftrace: Use text_poke()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
tracing: Use a copy of the va_list for __assign_vstr()
If an instance of tracing enables the same trace event as another
instance, or the top level instance, or even perf, then the va_list passed
into some tracepoints can be used more than once.
As va_list can only be traversed once, this can cause issues:
batman-adv: tracing: Use the new __vstring() helper
Instead of open coding a __dynamic_array() with a fixed length (which
defeats the purpose of the dynamic array in the first place), use the new
__vstring() helper, which takes a va_list and writes only as much of the
string into the ring buffer as is needed.
Link: https://lkml.kernel.org/r/20220724191650.236b1355@rorschach.local.home
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <a@unstable.cc>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: netdev@vger.kernel.org
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Factor out the common prerequisites for DT compilation into the new
target, dtbs_prepare.
Add comments to explain why include/config/kernel.release is the
prerequisite. Our policy is that installation targets must not rebuild
anything in the tree. If 'make modules_install' is executed as root,
include/config/kernel.release may be owned by root.
This option is used to reserve a shared memory region for user processes
to use for hardware memory buffers. The actual code to support the option
comes in the following patch.
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Waiman Long [Wed, 22 Jun 2022 20:04:19 +0000 (16:04 -0400)]
locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter
With commit d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more
consistent"), the writer that sets the handoff bit can be interrupted
out without clearing the bit if the wait queue isn't empty. This disables
reader and writer optimistic lock spinning and stealing.
Now if a non-first writer in the queue is somehow woken up or a new
waiter enters the slowpath, it can't acquire the lock. This was not the
case before commit d257cc8cb8d5, as the writer that set the handoff bit
would clear it when exiting via the out_nolock path. This is less
efficient as the busy rwsem stays in an unlocked state for a longer time.
In some cases, this new behavior may cause lockups as shown in [1] and
[2].
This patch allows a non-first writer to ignore the handoff bit if it
is not originally set or initiated by the first waiter. This patch is
shown to be effective in fixing the lockup problem reported in [1].
Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220622200419.778799-1-longman@redhat.com
Randy Dunlap [Sun, 24 Jul 2022 05:57:23 +0000 (22:57 -0700)]
MIPS: msi-octeon: eliminate kernel-doc warnings
Rearrange kernel-doc notation for 2 functions to eliminate
kernel-doc warnings. Use Return: notation for the function
return value description. Add function short descriptions
for both functions.
Correct 2 typos.
Fixes these kernel-doc warnings:
msi-octeon.c:49: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Called when a driver request MSI interrupts instead of the
msi-octeon.c:49: warning: missing initial short description on line:
* Called when a driver request MSI interrupts instead of the
msi-octeon.c:62: warning: No description found for return value of 'arch_setup_msi_irq'
msi-octeon.c:189: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Called when a device no longer needs its MSI interrupts. All
msi-octeon.c:189: warning: missing initial short description on line:
* Called when a device no longer needs its MSI interrupts. All
Fixes: e8635b484f64 ("MIPS: Add Cavium OCTEON PCI support.")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Aditya Srivastava <yashsri421@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: linux-mips@vger.kernel.org
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
memblock test: Modify the obsolete description in README
The VERBOSE option in the Makefile has been moved, but its description
is still left in the README. We now use the `-v` option when running the
memblock tests to print information, so replace the obsolete items with
the new one.
Philipp Jungkamp [Fri, 29 Jul 2022 16:21:03 +0000 (18:21 +0200)]
ALSA: hda/realtek: Add quirk for Lenovo Yoga9 14IAP7
The Lenovo Yoga 9 14IAP7 is set up similarly to the Thinkpad X1 7th and
8th Gen. It also has the speakers attached to NID 0x14 and the bass
speakers to NID 0x17, but here the codec misreports the NID 0x17 as
unconnected.
The pincfg and hda verbs connect and activate the bass speaker
amplifiers, but the generic driver will connect them to NID 0x06 which
has no volume control. Setting the connection list and preferred connections is
required to gain volume control.
Jakub Kicinski [Sat, 30 Jul 2022 04:39:06 +0000 (21:39 -0700)]
Merge tag 'mlx5-updates-2022-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-07-28
Misc updates to mlx5 driver:
1) Gal corrects the use of skb_tcp_all_headers() on encapsulated skbs.
2) Roi adds support for offloading standalone police actions.
3) Lama did some refactoring to minimize code coupling with the
mlx5e_priv "god object" in some of the flows, and converts some of the
objects to pointers to save memory when these objects aren't needed.
This is part one of a two-part series.
* tag 'mlx5-updates-2022-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Move mlx5e_init_l2_addr to en_main
net/mlx5e: Split en_fs ndo's and move to en_main
net/mlx5e: Separate mlx5e_set_rx_mode_work and move caller to en_main
net/mlx5e: Add mdev to flow_steering struct
net/mlx5e: Report flow steering errors with mdev err report API
net/mlx5e: Convert mlx5e_flow_steering member of mlx5e_priv to pointer
net/mlx5e: Allocate VLAN and TC for featured profiles only
net/mlx5e: Make mlx5e_tc_table private
net/mlx5e: Convert mlx5e_tc_table member of mlx5e_flow_steering to pointer
net/mlx5e: TC, Support tc action api for police
net/mlx5e: TC, Separate get/update/replace meter functions
net/mlx5e: Add red and green counters for metering
net/mlx5e: TC, Allocate post meter ft per rule
net/mlx5: DR, Add support for flow metering ASO
net/mlx5e: Fix wrong use of skb_tcp_all_headers() with encapsulation
====================
fs/dcache: Move wakeup out of i_seq_dir write held region.
__d_add() and __d_move() wake up waiters on dentry::d_wait from within
the i_dir_seq write held region. This violates the PREEMPT_RT
constraints as the wake up acquires wait_queue_head::lock which is a
"sleeping" spinlock on RT.
There is no requirement to do so. __d_lookup_unhash() has cleared
DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait
queue head pointer to the caller, so the actual wake up can be postponed
until the i_dir_seq write side critical section is left. The only
requirement is that dentry::lock is held across the whole sequence
including the wake up. The previous commit includes an analysis why this
is considered safe.
Move the wake up past end_dir_add() which leaves the i_dir_seq write side
critical section and enables preemption.
For non RT kernels there is no difference because preemption is still
disabled due to dentry::lock being held, but it shortens the time between
wake up and unlocking dentry::lock, which reduces the contention for the
woken up waiter.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
__d_lookup_done() wakes waiters on dentry->d_wait. On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquires wait_queue_head::lock, which is a "sleeping" spinlock on RT.
Calling it under dentry->d_lock is not a problem, since that is also a
"sleeping" spinlock on the same configs. Unfortunately, two of its
callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.
The key observation is that wakeup can be moved to any point before
dropping ->d_lock.
As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.
This is safe because:
Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags. As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).
->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry. Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.
With two exceptions wq lives in call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame.
One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata. It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.
Remaining exception is d_add_ci(). There wq is what we'd found in
->d_wait of d_add_ci() argument. Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry. Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.
Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock. That concludes the analysis.
Let __d_lookup_unhash():
1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
2) Unhash the dentry
3) Retrieve and clear dentry::d_wait
4) Unlock the hash and return the retrieved waitqueue head pointer
5) Let the caller handle the wake up.
6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce
build failures for OOT code that used __d_lookup_done() and is not
aware of the new return value.
This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.
An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside from Linus not being a fan of them, moving the wake up
closer to the place where dentry::lock is unlocked reduces lock contention
time for the woken up waiter.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
i_dir_seq is a sequence counter with a lock which is represented by the
lowest bit. The writer atomically updates the counter which ensures that it
can be modified by only one writer at a time. This requires preemption to
be disabled across the write side critical section.
On !PREEMPT_RT kernels this is implicit by the caller acquiring
dentry::lock. On PREEMPT_RT kernels spin_lock() does not disable preemption
which means that a preempting writer or reader would live lock. It's
therefore required to disable preemption explicitly.
An alternative solution would be to replace i_dir_seq with a seqlock_t for
PREEMPT_RT, but that comes with its own set of problems due to arbitrary
lock nesting. A pure sequence count with an associated spinlock is not
possible because the locks held by the caller are not necessarily related.
As the critical section is small, disabling preemption is a sensible
solution.
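The counter described above, where the lowest bit doubles as a lock,
can be illustrated with a single-threaded model. This is a sketch
only: the class and method names are hypothetical, and the plain
assignment in try_start_write() stands in for the atomic update the
real writer needs.

```python
class DirSeq:
    """Single-threaded model of a sequence counter whose bit 0 is a lock."""

    def __init__(self):
        self.seq = 0  # even: no writer; odd: write in progress

    def try_start_write(self):
        """Take the write side if it is free; return the old value or None."""
        n = self.seq
        if n & 1:
            return None       # lowest bit set: another writer holds it
        self.seq = n + 1      # stands in for an atomic compare-and-swap
        return n

    def end_write(self, n):
        """Release the write side and publish a new even sequence value."""
        self.seq = n + 2

    def read_begin(self):
        """Return the current sequence, or None while a write is in progress."""
        n = self.seq
        return None if n & 1 else n

    def read_retry(self, start):
        """True if a write started or finished since read_begin() gave start."""
        return self.seq != start
```

In this model, anything that waits for the counter to become even
again makes progress only once end_write() runs. If the writer is
preempted between try_start_write() and end_write() by a task that
spins on the counter on the same CPU, that spin never ends, which is
the live lock the commit message describes.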
Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-2-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Sat, 30 Jul 2022 04:29:05 +0000 (00:29 -0400)]
d_add_ci(): make sure we don't miss d_lookup_done()
All callers of d_alloc_parallel() must make sure that resulting
in-lookup dentry (if any) will encounter __d_lookup_done() before
the final dput(). d_add_ci() might end up creating in-lookup
dentries; they are fed to d_splice_alias(), which will normally
make sure they meet __d_lookup_done(). However, it is possible
to end up with d_splice_alias() failing with ERR_PTR(-ELOOP)
without having done so. It takes a corrupted ntfs or case-insensitive
xfs image, but neither should end up with memory corruption...
Jakub Kicinski [Sat, 30 Jul 2022 04:28:56 +0000 (21:28 -0700)]
Merge tag 'mlx5-fixes-2022-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-07-28
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2022-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Fix driver use of uninitialized timeout
net/mlx5: DR, Fix SMFS steering info dump format
net/mlx5: Adjust log_max_qp to be 18 at most
net/mlx5e: Modify slow path rules to go to slow fdb
net/mlx5e: Fix calculations related to max MPWQE size
net/mlx5e: xsk: Account for XSK RQ UMRs when calculating ICOSQ size
net/mlx5e: Fix the value of MLX5E_MAX_RQ_NUM_MTTS
net/mlx5e: TC, Fix post_act to not match on in_port metadata
net/mlx5e: Remove WARN_ON when trying to offload an unsupported TLS cipher/version
====================
Jakub Kicinski [Sat, 30 Jul 2022 04:26:09 +0000 (21:26 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-07-28
This series contains updates to ice driver only.
Michal allows for VF true promiscuous mode to be set for multiple VFs
and adds clearing of promiscuous filters when VF trust is removed.
Maciej refactors ice_set_features() to track/check changed features
instead of constantly checking against netdev features and adds support for
NETIF_F_LOOPBACK.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: allow toggling loopback mode via ndo_set_features callback
ice: compress branches in ice_set_features()
ice: Fix promiscuous mode not turning off
ice: Introduce enabling promiscuous mode on multiple VF's
====================
Edward Cree [Thu, 28 Jul 2022 18:57:51 +0000 (19:57 +0100)]
sfc: use a dynamic m-port for representor RX and set it promisc
Representors do not want to be subject to the PF's Ethernet address
filters, since traffic from VFs will typically have a destination
either elsewhere on the link segment or on an overlay network.
So, create a dynamic m-port with promiscuous and all-multicast
filters, and set it as the egress port of representor default rules.
Since the m-port is an alias of the calling PF's own m-port, traffic
will still be delivered to the PF's RXQs, but it will be subject to
the VNRX filter rules installed on the dynamic m-port (specified by
the v-port ID field of the filter spec).
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Edward Cree [Thu, 28 Jul 2022 18:57:50 +0000 (19:57 +0100)]
sfc: move table locking into filter_table_{probe,remove} methods
We need to be able to drop the efx->filter_sem in ef100_filter_table_up()
so that we can call functions that insert filters (and thus take that
rwsem for read), which means the efx->type->filter_table_probe method
needs to be responsible for taking the lock in the first place.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Edward Cree [Thu, 28 Jul 2022 18:57:49 +0000 (19:57 +0100)]
sfc: insert default MAE rules to connect VFs to representors
Default rules are low-priority switching rules which the hardware uses
in the absence of higher-priority rules. Each representor requires a
corresponding rule matching traffic from its representee VF and
delivering to the PF (where a check on INGRESS_MPORT in
__ef100_rx_packet() will direct it to the representor). No rule is
required in the reverse direction, because representor TX uses a TX
override descriptor to bypass the MAE and deliver directly to the VF.
Since inserting any rule into the MAE disables the firmware's own
default rules, also insert a pair of rules to connect the PF to the
physical network port and vice-versa.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Edward Cree [Thu, 28 Jul 2022 18:57:46 +0000 (19:57 +0100)]
sfc: determine wire m-port at EF100 PF probe time
Traffic delivered to the (MAE admin) PF could be from either the wire
or a VF. The INGRESS_MPORT field of the RX prefix distinguishes these;
base_mport is the value this field will have for traffic from the wire
(which should be delivered to the PF's netdevice, not a representor).
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Edward Cree [Thu, 28 Jul 2022 18:57:45 +0000 (19:57 +0100)]
sfc: ef100 representor RX top half
Representor RX uses a NAPI context driven by a 'fake interrupt': when
the parent PF receives a packet destined for the representor, it adds
it to an SKB list (efv->rx_list), and schedules NAPI if the 'fake
interrupt' is primed. The NAPI poll then pulls packets off this list
and feeds them to the stack with netif_receive_skb_list().
This scheme allows us to decouple representor RX from the parent PF's
RX fast-path.
This patch implements the 'top half', which builds an SKB, copies data
into it from the RX buffer (which can then be released), adds it to
the queue and fires the 'fake interrupt' if necessary.
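The 'fake interrupt' scheme described above can be sketched as a queue
plus a primed flag. This is a simplified model, not the driver code:
the class, its fields other than rx_list, and the re-arm rule in
poll() are assumptions made for illustration.

```python
from collections import deque


class Representor:
    """Toy model of representor RX driven by a 'fake interrupt'."""

    def __init__(self):
        self.rx_list = deque()       # packets queued by the parent PF
        self.primed = True           # is the 'fake interrupt' armed?
        self.napi_scheduled = False  # has the poll routine been scheduled?
        self.delivered = []          # stands in for the network stack

    def top_half(self, packet):
        """Called from the parent PF's RX path for each representor packet."""
        # Copy the data so the RX buffer can be released right away.
        self.rx_list.append(bytes(packet))
        if self.primed:
            # Fire the fake interrupt once; later packets just queue up.
            self.primed = False
            self.napi_scheduled = True

    def poll(self, budget):
        """NAPI-style poll: deliver up to budget packets, return the count."""
        done = 0
        while self.rx_list and done < budget:
            self.delivered.append(self.rx_list.popleft())
            done += 1
        if done < budget:
            # Queue drained within budget: stop polling and re-arm (assumed).
            self.napi_scheduled = False
            self.primed = True
        return done
```

The point of the design is visible in top_half(): the parent's fast
path only copies and queues, while delivery to the stack happens later
in poll().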
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Edward Cree [Thu, 28 Jul 2022 18:57:44 +0000 (19:57 +0100)]
sfc: ef100 representor RX NAPI poll
This patch adds the 'bottom half' napi->poll routine for representor RX.
See the next patch (with the top half) for an explanation of the 'fake
interrupt' scheme used to drive this NAPI context.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'mm-hotfixes-stable-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Two hotfixes, both cc:stable"
* tag 'mm-hotfixes-stable-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/hmm: fault non-owner device private entries
page_alloc: fix invalid watermark check on a negative value
dn_route: replace "jiffies-now>0" with "jiffies!=now"
Use "jiffies != now" to replace "jiffies - now > 0" to make the
code more readable. We want to put a limit on how long the
loop can run for before rescheduling.
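The two expressions are interchangeable only because the subtraction is
unsigned: the difference wraps around instead of going negative, so it
is zero exactly when both values are equal. A small model of that
arithmetic, assuming a 32-bit width for illustration (the argument is
the same for any width; the function names are hypothetical):

```python
BITS = 32                 # assumed width of the unsigned counter
MASK = (1 << BITS) - 1


def elapsed_positive(jiffies, now):
    """Model the unsigned expression "jiffies - now > 0"."""
    return ((jiffies - now) & MASK) > 0


def differs(jiffies, now):
    """Model "jiffies != now"."""
    return (jiffies & MASK) != (now & MASK)
```

Note that elapsed_positive(5, 6) is true as well: the wrapped
difference is a large positive number, so the old form never expressed
an ordering, only inequality.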
Jakub Kicinski [Sat, 30 Jul 2022 02:34:45 +0000 (19:34 -0700)]
Merge tag 'wireless-next-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v5.20
Fourth set of patches for v5.20, last few patches before the merge
window. Only driver changes this time, mostly just fixes and cleanup.
Major changes:
brcmfmac
- support brcm,ccode-map-trivial DT property
wcn36xx
- add debugfs file to show firmware feature strings
* tag 'wireless-next-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (36 commits)
wifi: rtw88: check the return value of alloc_workqueue()
wifi: rtw89: 8852a: adjust IMR for SER L1
wifi: rtw89: 8852a: update RF radio A/B R56
wifi: wcn36xx: Add debugfs entry to read firmware feature strings
wifi: wcn36xx: Move capability bitmap to string translation function to firmware.c
wifi: wcn36xx: Move firmware feature bit storage to dedicated firmware.c file
wifi: wcn36xx: Rename clunky firmware feature bit enum
wifi: brcmfmac: prevent double-free on hardware-reset
wifi: brcmfmac: support brcm,ccode-map-trivial DT property
dt-bindings: bcm4329-fmac: add optional brcm,ccode-map-trivial
wifi: brcmfmac: Replace default (not configured) MAC with a random MAC
wifi: brcmfmac: Add brcmf_c_set_cur_etheraddr() helper
wifi: brcmfmac: Remove #ifdef guards for PM related functions
wifi: brcmfmac: use strreplace() in brcmf_of_probe()
wifi: plfxlc: Use eth_zero_addr() to assign zero address
wifi: wilc1000: use existing iftype variable to store the interface type
wifi: wilc1000: add 'isinit' flag for SDIO bus similar to SPI
wifi: wilc1000: cancel the connect operation during interface down
wifi: wilc1000: get correct length of string WID from received config packet
wifi: wilc1000: set station_info flag only when signal value is valid
...
====================
We've added 22 non-merge commits during the last 4 day(s) which contain
a total of 27 files changed, 763 insertions(+), 120 deletions(-).
The main changes are:
1) Fixes to allow setting any source IP with bpf_skb_set_tunnel_key() helper,
from Paul Chaignon.
2) Fix for bpf_xdp_pointer() helper when doing sanity checking, from Joanne Koong.
3) Fix for XDP frame length calculation, from Lorenzo Bianconi.
4) Libbpf BPF_KSYSCALL docs improvements and fixes to selftests to accommodate
s390x quirks with socketcall(), from Ilya Leoshkevich.
5) Allow/denylist and CI configs additions to selftests/bpf to improve BPF CI,
from Daniel Müller.
6) BPF trampoline + ftrace follow up fixes, from Song Liu and Xu Kuohai.
7) Fix allocation warnings in netdevsim, from Jakub Kicinski.
8) bpf_obj_get_opts() libbpf API allowing to provide file flags, from Joe Burton.
9) vsnprintf usage fix in bpf_snprintf_btf(), from Fedor Tokarev.
10) Various small fixes and clean ups, from Daniel Müller, Rongguang Wei,
Jörn-Thorben Hinz, Yang Li.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (22 commits)
bpf: Remove unneeded semicolon
libbpf: Add bpf_obj_get_opts()
netdevsim: Avoid allocation warnings triggered from user space
bpf: Fix NULL pointer dereference when registering bpf trampoline
bpf: Fix test_progs -j error with fentry/fexit tests
selftests/bpf: Bump internal send_signal/send_signal_tracepoint timeout
bpftool: Don't try to return value from void function in skeleton
bpftool: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE macro
bpf: btf: Fix vsnprintf return value check
libbpf: Support PPC in arch_specific_syscall_pfx
selftests/bpf: Adjust vmtest.sh to use local kernel configuration
selftests/bpf: Copy over libbpf configs
selftests/bpf: Sort configuration
selftests/bpf: Attach to socketcall() in test_probe_user
libbpf: Extend BPF_KSYSCALL documentation
bpf, devmap: Compute proper xdp_frame len redirecting frames
bpf: Fix bpf_xdp_pointer return pointer
selftests/bpf: Don't assign outer source IP to host
bpf: Set flow flag to allow any source IP in bpf_tunnel_key
geneve: Use ip_tunnel_key flow flags in route lookups
...
====================
scripts/gdb: ensure the absolute path is generated on initial source
Post 'make scripts_gdb' a symbolic link to scripts/gdb/vmlinux-gdb.py is
created. Currently 'os.path.dirname(__file__)' does not generate the
absolute path to scripts/gdb resulting in the following:
(gdb) source vmlinux-gdb.py
Traceback (most recent call last):
File "scripts/gdb/vmlinux-gdb.py", line 25, in <module>
import linux.utils
ModuleNotFoundError: No module named 'linux'
This patch ensures that the absolute path to scripts/gdb in relation to
the given file is generated so each module can be located accordingly.
Link: https://lkml.kernel.org/r/20220712110248.1404125-1-atomlin@redhat.com Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Because of my new remote work setup at Google, I can no longer use command
line tools with my google.com email address, so I got a linux.dev
account. Update the mailmap to show the new alias I will be
using.
Link: https://lkml.kernel.org/r/20220725215833.789133-1-brendan.higgins@linux.dev Signed-off-by: Brendan Higgins <brendan.higgins@linux.dev> Reviewed-by: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Daniel Latypov <dlatypov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I disconnected from both Virtuozzo and OpenVZ, so this updates my email to
point to my own. I haven't used the @openvz address for patches, so let's
rewrite the line instead of adding a new one. CC all previous addresses.
Ben Dooks [Thu, 21 Jul 2022 19:55:09 +0000 (20:55 +0100)]
profile: setup_profiling_timer() is mostly not implemented
The setup_profiling_timer() is mostly un-implemented by many
architectures. In many places it isn't guarded by CONFIG_PROFILE which is
needed for it to be used. Make it a weak symbol in kernel/profile.c and
remove the 'return -EINVAL' implementations from the kernel.
There are a couple of architectures which do return 0 from the
setup_profiling_timer() function but they don't seem to do anything else
with it. To keep the /proc compatibility for now, leave these for a
future update or removal.
On ARM, this fixes the following sparse warning:
arch/arm/kernel/smp.c:793:5: warning: symbol 'setup_profiling_timer' was not declared. Should it be static?
Link: https://lkml.kernel.org/r/4d4a6786e8ad522bfad6d2401b7f6634f8af0e5d.1658436259.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use bitmap_zero() instead of hand-writing it. It is less verbose.
While at it, add an explicit #include <linux/bitmap.h>.
Link: https://lkml.kernel.org/r/86d2a027c319db12055c98f00c65f7d01e703722.1658436259.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
enum wb_congested_state and the member 'congested' in bdi_writeback are
useless since commit a88f2096d5a2 ("remove congestion tracking
framework"), so remove it.
Ben Dooks [Thu, 14 Jul 2022 07:47:44 +0000 (08:47 +0100)]
kernel/hung_task: fix address space of proc_dohung_task_timeout_secs
The proc_dohung_task_timeout_secs() function is incorrectly marked
as having a __user buffer as argument 3. However this is not the
case and it is causing multiple sparse warnings. Fix the following
warnings by removing __user from the argument:
kernel/hung_task.c:237:52: warning: incorrect type in argument 3 (different address spaces)
kernel/hung_task.c:237:52: expected void *
kernel/hung_task.c:237:52: got void [noderef] __user *buffer
kernel/hung_task.c:287:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/hung_task.c:287:35: expected int ( [usertype] *proc_handler )( ... )
kernel/hung_task.c:287:35: got int ( * )( ... )
kernel/hung_task.c:295:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/hung_task.c:295:35: expected int ( [usertype] *proc_handler )( ... )
kernel/hung_task.c:295:35: got int ( * )( ... )
Jiangshan Yi [Thu, 14 Jul 2022 01:54:41 +0000 (09:54 +0800)]
lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t()
Fix the following coccicheck warning:
lib/lzo/lzo1x_compress.c:54: WARNING opportunity for min().
lib/lzo/lzo1x_compress.c:329: WARNING opportunity for min().
The min() and min_t() macros are defined in include/linux/minmax.h. They
avoid multiple evaluations of the arguments when non-constant and perform
strict type-checking.
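The double-evaluation problem mentioned above can be illustrated with a small userspace sketch. It is not the kernel's implementation: NAIVE_MIN, SAFE_MIN and next_value() are hypothetical names, and SAFE_MIN assumes a compiler with GNU C statement expressions and __typeof__ (GCC or Clang).

```c
#include <assert.h>

/* Ternary form: the smaller argument is evaluated a second time. */
#define NAIVE_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Single-evaluation form (GNU C extensions). Each argument is copied
 * into a local once; comparing the addresses of the two locals makes
 * the compiler warn when the argument types differ.
 */
#define SAFE_MIN(a, b) ({		\
	__typeof__(a) _a = (a);		\
	__typeof__(b) _b = (b);		\
	(void)(&_a == &_b);		\
	_a < _b ? _a : _b; })

static int calls;	/* how many times next_value() has run */

/* Hypothetical argument with a side effect. */
static int next_value(void)
{
	return ++calls;
}

/* Number of times the argument runs under the ternary form. */
int naive_call_count(void)
{
	calls = 0;
	(void)NAIVE_MIN(next_value(), 10);
	return calls;
}

/* Number of times the argument runs under the single-evaluation form. */
int safe_call_count(void)
{
	calls = 0;
	(void)SAFE_MIN(next_value(), 10);
	return calls;
}

int safe_min_int(int a, int b)
{
	return SAFE_MIN(a, b);
}
```

With the ternary form the side effect in next_value() runs twice, which is the class of bug the coccicheck warning is steering away from.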
Phillip Lougher [Fri, 17 Jun 2022 08:38:15 +0000 (16:38 +0800)]
squashfs: support reading fragments in readahead call
Add a function which can be used to read fragments in the readahead call.
This function is necessary because filesystems built with the -tailends
(or -always-use-fragments) option may have fragments present which cannot
be currently handled.
Link: https://lkml.kernel.org/r/20220617083810.337573-5-hsinyi@chromium.org Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Hou Tao <houtao1@huawei.com> Cc: kernel test robot <lkp@intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miao Xie <miaoxie@huawei.com> Cc: Xiongwei Song <Xiongwei.Song@windriver.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Zheng Liang <zhengliang6@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hsin-Yi Wang [Fri, 17 Jun 2022 08:38:13 +0000 (16:38 +0800)]
squashfs: implement readahead
Implement readahead callback for squashfs. It will read datablocks which
cover pages in readahead request. For a few cases it will not mark page
as uptodate, including:
- file end is 0.
- zero filled blocks.
- current batch of pages isn't in the same datablock.
- decompressor error.
Otherwise pages will be marked as uptodate. The unhandled pages will be
updated by readpage later.
Link: https://lkml.kernel.org/r/20220617083810.337573-4-hsinyi@chromium.org Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: Xiongwei Song <Xiongwei.Song@windriver.com> Reported-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hou Tao <houtao1@huawei.com> Cc: kernel test robot <lkp@intel.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Zheng Liang <zhengliang6@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hsin-Yi Wang [Fri, 17 Jun 2022 08:38:09 +0000 (16:38 +0800)]
Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
Patch series "Implement readahead for squashfs", v7.
Commit 9eec1d897139 ("squashfs: provide backing_dev_info in order to
disable read-ahead") mitigates the performance drop issue for squashfs by
closing readahead for it.
This series implements readahead callback for squashfs.
This patch (of 4):
This reverts 9eec1d897139e5 ("squashfs: provide backing_dev_info in order
to disable read-ahead").
Revert closing the readahead to squashfs since the readahead callback for
squashfs is implemented.
Link: https://lkml.kernel.org/r/20220617083810.337573-1-hsinyi@chromium.org Link: https://lkml.kernel.org/r/20220617083810.337573-2-hsinyi@chromium.org Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org> Suggested-by: Xiongwei Song <Xiongwei.Song@windriver.com> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Zheng Liang <zhengliang6@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Hou Tao <houtao1@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 26 Jul 2022 08:10:46 +0000 (16:10 +0800)]
mm: memory-failure: convert to pr_fmt()
Use pr_fmt to prefix all pr_<level> output. However, unpoison_memory() and
soft_offline_page() are used by error injection and have their own prefixes,
such as "Unpoison:" and "soft offline:", and soft_offline_page() can also
be used by memory hotremove. So reset pr_fmt before the unpoison_pr_info
definition to keep the original output for them.
Kefeng Wang [Tue, 26 Jul 2022 13:11:35 +0000 (21:11 +0800)]
mm: use is_zone_movable_page() helper
Use is_zone_movable_page() helper to simplify code.
Link: https://lkml.kernel.org/r/20220726131135.146912-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Tue, 26 Jul 2022 14:29:18 +0000 (22:29 +0800)]
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
In some cases, e.g. when size option is not specified, f_blocks, f_bavail
and f_bfree will be set to -1 instead of 0. Likewise, when nr_inodes
isn't specified, f_files and f_ffree will be set to -1 too. Update the
comment to make this clear.
Link: https://lkml.kernel.org/r/20220726142918.51693-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Tue, 26 Jul 2022 14:29:17 +0000 (22:29 +0800)]
hugetlbfs: cleanup some comments in inode.c
The function generic_file_buffered_read has been renamed to filemap_read
since commit 87fa0f3eb267 ("mm/filemap: rename generic_file_buffered_read
to filemap_read"). Update the corresponding comment. Also remove the
duplicated "taken" in hugetlbfs_fill_super.
Link: https://lkml.kernel.org/r/20220726142918.51693-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Tue, 26 Jul 2022 14:29:16 +0000 (22:29 +0800)]
hugetlbfs: remove unneeded header file
The header file signal.h is unneeded now. Remove it.
Link: https://lkml.kernel.org/r/20220726142918.51693-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The forward declaration for hugetlbfs_ops is unnecessary. Remove it.
Link: https://lkml.kernel.org/r/20220726142918.51693-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Tue, 26 Jul 2022 14:29:14 +0000 (22:29 +0800)]
hugetlbfs: use helper macro SZ_1{K,M}
Patch series "A few cleanup and fixup patches for hugetlbfs", v2.
This series contains a few cleanup patches to remove unneeded forward
declaration, use helper macro and so on. More details can be found in the
respective changelogs.
This patch (of 5):
Use helper macro SZ_1K and SZ_1M to do the size conversion. Minor
readability improvement.
Kefeng Wang [Tue, 26 Jul 2022 13:18:16 +0000 (21:18 +0800)]
mm: cleanup is_highmem()
The CONFIG_HIGHMEM check in is_highmem() is unnecessary because it is
already done in is_highmem_idx(). Remove it and move is_highmem() close to
is_highmem_idx(). This has no functional impact.
Ralph Campbell [Mon, 25 Jul 2022 18:36:15 +0000 (11:36 -0700)]
mm/hmm: add a test for cross device private faults
Add a simple test case for when hmm_range_fault() is called with the
HMM_PFN_REQ_FAULT flag and a device private PTE is found for a device
other than the hmm_range::dev_private_owner. This should cause the page
to be faulted back to system memory from the other device and the PFN
returned in the output array.
Also, remove a piece of code that unnecessarily unmaps part of the buffer.
Peter Xu [Mon, 25 Jul 2022 14:20:48 +0000 (10:20 -0400)]
selftests: add soft-dirty into run_vmtests.sh
Link: https://lkml.kernel.org/r/20220725142048.30450-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 25 Jul 2022 14:20:47 +0000 (10:20 -0400)]
selftests: soft-dirty: add test for mprotect
Add two soft-dirty test cases for mprotect() on both anon or file.
Link: https://lkml.kernel.org/r/20220725142048.30450-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 25 Jul 2022 14:20:46 +0000 (10:20 -0400)]
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
Patch series "mm/mprotect: Fix soft-dirty checks", v4.
This patch (of 3):
The check wanted to make sure when soft-dirty tracking is enabled we won't
grant write bit by accident, as a page fault is needed for dirty tracking.
The intention is correct but we didn't check it right because
VM_SOFTDIRTY set actually means soft-dirty tracking disabled. Fix it.
Another tricky thing about soft-dirty is that we can't check the
vma flag !(vma_flags & VM_SOFTDIRTY) directly; we can only check it after
checking CONFIG_MEM_SOFT_DIRTY, because otherwise VM_SOFTDIRTY will be
defined as zero and !(vma_flags & VM_SOFTDIRTY) will always evaluate to
true. To avoid misuse, introduce a helper for checking whether vma has
soft-dirty tracking enabled.
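The pitfall described above can be reproduced in a userspace sketch. Everything prefixed with DEMO_ or demo_ is hypothetical and only models the pattern; it is not the kernel's helper or its real flag definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Set to 1 to model a build with soft-dirty support compiled in. */
#ifndef DEMO_MEM_SOFT_DIRTY
#define DEMO_MEM_SOFT_DIRTY 0
#endif

#if DEMO_MEM_SOFT_DIRTY
#define DEMO_VM_SOFTDIRTY 0x08000000UL	/* illustrative bit value */
#else
#define DEMO_VM_SOFTDIRTY 0UL		/* the flag compiles away to zero */
#endif

/*
 * Pitfall: when the flag is defined as zero, the mask is always zero,
 * so this reports "tracking enabled" for every vma.
 */
bool demo_naive_tracking_enabled(unsigned long vm_flags)
{
	return !(vm_flags & DEMO_VM_SOFTDIRTY);
}

/*
 * Helper pattern: check the build-time option first, then the flag.
 * A set flag means tracking is disabled for that vma.
 */
bool demo_tracking_enabled(unsigned long vm_flags)
{
	if (!DEMO_MEM_SOFT_DIRTY)
		return false;
	return !(vm_flags & DEMO_VM_SOFTDIRTY);
}
```

With DEMO_MEM_SOFT_DIRTY left at 0, the naive check claims tracking is enabled while the helper correctly reports that it is not.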
We can easily verify this with any exclusive anonymous page, like program
below:
Here we attach a Fixes to commit 64fe24a3e05e only for easy tracking, as
this patch won't apply to a tree before that point. However the commit
wasn't the source of problem, but instead 64e455079e1b. It's just that
after 64fe24a3e05e anonymous memory will also suffer from this problem
with mprotect().
Link: https://lkml.kernel.org/r/20220725142048.30450-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220725142048.30450-2-peterx@redhat.com Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared") Fixes: 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
syzbot is reporting GFP_KERNEL allocation with oom_lock held when
reporting memcg OOM [1]. If this allocation triggers the global OOM
situation then the system can livelock because the GFP_KERNEL
allocation with oom_lock held cannot trigger the global OOM killer
because __alloc_pages_may_oom() fails to hold oom_lock.
Fix this problem by removing the allocation from memory_stat_format()
completely, and pass static buffer when calling from memcg OOM path.
Note that the caller holding filesystem lock was the trigger for syzbot
to report this locking dependency. Doing GFP_KERNEL allocation with
filesystem lock held can deadlock the system even without involving OOM
situation.
Link: https://syzkaller.appspot.com/bug?extid=2d2aeadc6ce1e1f11d45 Link: https://lkml.kernel.org/r/86afb39f-8c65-bec2-6cfc-c5e3cd600c0b@I-love.SAKURA.ne.jp Fixes: c8713d0b23123759 ("mm: memcontrol: dump memory.stat during cgroup OOM") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+2d2aeadc6ce1e1f11d45@syzkaller.appspotmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Thu, 9 Jun 2022 14:34:35 +0000 (22:34 +0800)]
xfs: fail dax mount if reflink is enabled on a partition
Failure notification is not supported on partitions. So, when we mount a
reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.
Link: https://lkml.kernel.org/r/20220609143435.393724-1-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiebin Sun [Fri, 22 Jul 2022 16:49:49 +0000 (00:49 +0800)]
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
Remove the redundant updating of stats_flush_threshold. If the global var
stats_flush_threshold has exceeded the trigger value for
__mem_cgroup_flush_stats, further increment is unnecessary.
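The idea can be sketched in userspace with C11 atomics. This is a single-threaded illustration under assumed names (demo_threshold, demo_stat_updated(), demo_run()), not the memcg code: once the shared counter is past the trigger value, later updates only read it and skip the write.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int demo_threshold;	/* shared counter, contended in real life */
static int demo_shared_writes;		/* writes to the shared counter (demo only) */

/* Called on every stat update; 'trigger' is the value that forces a flush. */
void demo_stat_updated(int trigger)
{
	/* A flush is already due: a read is enough, skip the increment. */
	if (atomic_load(&demo_threshold) > trigger)
		return;

	atomic_fetch_add(&demo_threshold, 1);
	demo_shared_writes++;
}

/* Run 'updates' stat updates from a clean state, return the write count. */
int demo_run(int updates, int trigger)
{
	int i;

	atomic_store(&demo_threshold, 0);
	demo_shared_writes = 0;
	for (i = 0; i < updates; i++)
		demo_stat_updated(trigger);
	return demo_shared_writes;
}
```

In this model 100 updates with a trigger of 4 cause only 5 writes to the shared counter; the remaining updates are read-only.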
Apply the patch and test the pts/hackbench-1.0.0 Count:4 (160 threads).
Score gain: 1.95x
Reduce CPU cycles in __mod_memcg_lruvec_state (44.88% -> 0.12%)
CPU: ICX 8380 x 2 sockets
Core number: 40 x 2 physical cores
Benchmark: pts/hackbench-1.0.0 Count:4 (160 threads)
Link: https://lkml.kernel.org/r/20220722164949.47760-1-jiebin.sun@intel.com Signed-off-by: Jiebin Sun <jiebin.sun@intel.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Amadeusz Sawiski <amadeuszx.slawinski@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Fri, 22 Jul 2022 20:15:13 +0000 (13:15 -0700)]
userfaultfd: don't fail on unrecognized features
The basic interaction for setting up a userfaultfd is, userspace issues
a UFFDIO_API ioctl, and passes in a set of zero or more feature flags,
indicating the features they would prefer to use.
Of course, different kernels may support different sets of features
(depending on kernel version, kconfig options, architecture, etc).
Userspace's expectations may also not match: perhaps it was built
against newer kernel headers, which defined some features the kernel
it's running on doesn't support.
Currently, if userspace passes in a flag we don't recognize, the
initialization fails and we return -EINVAL. This isn't great, though.
Userspace doesn't have an obvious way to react to this; sure, one of the
features I asked for was unavailable, but which one? The only option it
has is to turn off things "at random" and hope something works.
Instead, modify UFFDIO_API to just ignore any unrecognized feature
flags. The interaction is now that the initialization will succeed, and
as always we return the *subset* of feature flags that can actually be
used back to userspace.
Now userspace has an obvious way to react: it checks if any flags it
asked for are missing. If so, it can conclude this kernel doesn't
support those, and it can either resign itself to not using them, or
fail with an error on its own, or whatever else.
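The userspace reaction described above comes down to a bitmask comparison, sketched below. The helper names and the policy enum are hypothetical; in a real program the granted value would be read back from the features the kernel reports after the UFFDIO_API ioctl.

```c
#include <assert.h>
#include <stdint.h>

/* Requested feature bits that the kernel did not report back. */
uint64_t demo_uffd_missing(uint64_t requested, uint64_t granted)
{
	return requested & ~granted;
}

enum demo_uffd_policy {
	DEMO_UFFD_USE_ALL,	/* everything requested is available */
	DEMO_UFFD_DEGRADE,	/* run without the optional missing features */
	DEMO_UFFD_GIVE_UP,	/* a feature the program cannot do without is missing */
};

/* 'mandatory' is the subset of 'requested' the program cannot run without. */
enum demo_uffd_policy demo_uffd_decide(uint64_t requested, uint64_t granted,
				       uint64_t mandatory)
{
	uint64_t missing = demo_uffd_missing(requested, granted);

	if (!missing)
		return DEMO_UFFD_USE_ALL;
	if (missing & mandatory)
		return DEMO_UFFD_GIVE_UP;
	return DEMO_UFFD_DEGRADE;
}
```

Because the kernel now reports the usable subset instead of failing, the program can tell exactly which bits are missing and choose its own policy.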
Miaohe Lin [Sat, 23 Jul 2022 07:38:04 +0000 (15:38 +0800)]
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
We forgot to set cft->private for the numa stat file. As a result, the numa
stat of hstates[0] is always shown for all hstates. Encode the hstates index
into cft->private to fix this issue.
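Packing an index and an attribute into one private value can be sketched as follows. The 16-bit split, the DEMO_ macros and the size table are assumptions made for illustration, not the hugetlb cgroup definitions.

```c
#include <assert.h>

/* Pack an hstate index and a file attribute into one private value. */
#define DEMO_PRIVATE(idx, attr)	(((unsigned long)(idx) << 16) | (attr))
#define DEMO_IDX(val)		(((val) >> 16) & 0xffff)
#define DEMO_ATTR(val)		((val) & 0xffff)

enum { DEMO_ATTR_NUMA_STAT = 3 };	/* hypothetical attribute id */

/* Hypothetical huge page sizes in KB, one entry per hstate. */
static const unsigned long demo_hstate_kb[] = { 64, 2048, 1048576 };

unsigned long demo_private(unsigned int idx, unsigned int attr)
{
	return DEMO_PRIVATE(idx, attr);
}

unsigned int demo_attr(unsigned long private_val)
{
	return DEMO_ATTR(private_val);
}

/* Which hstate a stat file reports on, given the file's private value. */
unsigned long demo_numa_stat_hstate_kb(unsigned long private_val)
{
	return demo_hstate_kb[DEMO_IDX(private_val)];
}
```

A private value left at 0 decodes to index 0, which mirrors the reported symptom: every file showed the statistics of the first hstate.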
Link: https://lkml.kernel.org/r/20220723073804.53035-1-linmiaohe@huawei.com Fixes: f47761999052 ("hugetlb: add hugetlb.*.numa_stat file") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kassey Li [Tue, 19 Jul 2022 09:15:54 +0000 (17:15 +0800)]
mm/cma_debug.c: align the name buffer length as struct cma
Avoids truncating the debugfs output to 16 chars. Potentially alters
the userspace output, but this is a debugfs interface and there are no
stability guarantees.
Link: https://lkml.kernel.org/r/20220719091554.27864-1-quic_yingangl@quicinc.com Signed-off-by: Kassey Li <quic_yingangl@quicinc.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This code just reads from memory without caring about the data itself.
However static checkers complain that "tmp" is never properly initialized.
Initialize it to zero and change the name to "dummy" to show that we
don't care about the value stored in it.
Miaohe Lin [Tue, 19 Jul 2022 11:52:33 +0000 (19:52 +0800)]
mm/mempolicy: remove unneeded out label
We can use unlock label to unlock ptl and return ret directly to remove
the unneeded out label and reduce the size of mempolicy.o. No functional
change intended.
[Before]
text data bss dec hex filename
26702 3972 6168 36842 8fea mm/mempolicy.o
[After]
text data bss dec hex filename
26662 3972 6168 36802 8fc2 mm/mempolicy.o
mm/page_alloc: correct the wrong cpuset file path in comment
cpuset.c was moved to kernel/cgroup/ in commit 201af4c0fab0 ("cgroup: move cgroup files under kernel/cgroup/").
Correct the wrong path in comment.
Jianglei Nie [Thu, 14 Jul 2022 06:37:46 +0000 (14:37 +0800)]
mm/damon/reclaim: fix potential memory leak in damon_reclaim_init()
damon_reclaim_init() allocates a memory chunk for ctx with
damon_new_ctx(). When damon_select_ops() fails, ctx is not released,
which will lead to a memory leak.
We should release the ctx with damon_destroy_ctx() when damon_select_ops()
fails to fix the memory leak.
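The shape of the fix is the usual "release on the error path" pattern, sketched here in userspace. All demo_ names are hypothetical stand-ins for the allocate, select-ops and destroy steps and do not reproduce the DAMON code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_ctx {
	int ops_id;
};

static int demo_live_ctxs;		/* contexts allocated and not yet freed */
static struct demo_ctx *demo_global_ctx;

static struct demo_ctx *demo_new_ctx(void)
{
	struct demo_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		demo_live_ctxs++;
	return ctx;
}

static void demo_destroy_ctx(struct demo_ctx *ctx)
{
	free(ctx);
	demo_live_ctxs--;
}

/* Fails for ids the demo does not know about. */
static int demo_select_ops(struct demo_ctx *ctx, int ops_id)
{
	if (ops_id < 0)
		return -EINVAL;
	ctx->ops_id = ops_id;
	return 0;
}

int demo_init(int ops_id)
{
	struct demo_ctx *ctx = demo_new_ctx();
	int err;

	if (!ctx)
		return -ENOMEM;

	err = demo_select_ops(ctx, ops_id);
	if (err) {
		/* Without this call the context would leak. */
		demo_destroy_ctx(ctx);
		return err;
	}

	demo_global_ctx = ctx;
	return 0;
}

/* Number of contexts still allocated. */
int demo_leaked(void)
{
	return demo_live_ctxs;
}

/* Tear down the context kept by a successful init; returns what remains. */
int demo_exit(void)
{
	if (demo_global_ctx) {
		demo_destroy_ctx(demo_global_ctx);
		demo_global_ctx = NULL;
	}
	return demo_live_ctxs;
}
```

After a failed demo_init() the live-context count is back to zero, which is the property the fix restores.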
Link: https://lkml.kernel.org/r/20220714063746.2343549-1-niejianglei2021@163.com Fixes: 4d69c3457821 ("mm/damon/reclaim: use damon_select_ops() instead of damon_{v,p}a_set_operations()") Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yosry Ahmed [Thu, 14 Jul 2022 06:49:18 +0000 (06:49 +0000)]
mm: vmpressure: don't count proactive reclaim in vmpressure
memory.reclaim is a cgroup v2 interface that allows users to proactively
reclaim memory from a memcg, without real memory pressure. Reclaim
operations invoke vmpressure, which is used: (a) To notify userspace of
reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being
under memory pressure for networking (see
mem_cgroup_under_socket_pressure()).
For (a), vmpressure notifications in v1 are not affected by this change
since memory.reclaim is a v2 feature.
For (b), the effects of the vmpressure signal (according to Shakeel [1])
are as follows:
1. Reducing send and receive buffers of the current socket.
2. May drop packets on the rx path.
3. May throttle current thread on the tx path.
Since proactive reclaim is invoked directly by userspace, not by memory
pressure, it makes sense not to throttle networking. Hence, this change
makes sure that proactive reclaim caused by memory.reclaim does not
trigger vmpressure.
zs_malloc returns 0 if it fails. zs_zpool_malloc will return -1 when
zs_malloc returns 0. But -1 makes the return value unclear.
For example, when zswap_frontswap_store calls zs_malloc through
zs_zpool_malloc, it will return -1 to its caller. The other return values
are -EINVAL, -ENODEV or something else.
This commit changes zs_malloc to return ERR_PTR on failure. It doesn't
just let zs_zpool_malloc return -ENOMEM because zs_malloc has two types of
failure:
- size is not OK: return -EINVAL
- memory allocation failure: return -ENOMEM.
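The convention of returning an errno encoded in the handle can be sketched in userspace as below. The demo_ functions and DEMO_MAX_ERRNO are hypothetical and only mimic the style of the kernel's error-pointer helpers; they are not the zsmalloc API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Error codes occupy the top DEMO_MAX_ERRNO values of the handle range. */
int demo_is_err_value(uintptr_t handle)
{
	return handle >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Returns a handle on success, or a negative errno encoded in the handle. */
uintptr_t demo_malloc(size_t size, size_t max_size)
{
	void *p;

	if (size == 0 || size > max_size)
		return (uintptr_t)-EINVAL;	/* size is not OK */

	p = malloc(size);
	if (!p)
		return (uintptr_t)-ENOMEM;	/* memory allocation failure */

	return (uintptr_t)p;
}

/* Wrapper in the pool style: pass the specific errno on, not a bare -1. */
int demo_pool_malloc(size_t size, size_t max_size, uintptr_t *handle)
{
	*handle = demo_malloc(size, max_size);
	if (demo_is_err_value(*handle))
		return (int)(intptr_t)*handle;
	return 0;
}

/* Allocate and immediately free; returns 0 or the decoded errno. */
int demo_store(size_t size, size_t max_size)
{
	uintptr_t handle;
	int err = demo_pool_malloc(size, max_size, &handle);

	if (!err)
		free((void *)handle);
	return err;
}
```

The caller can now tell a bad size from an allocation failure, which a single -1 could not express.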
Zhou Guanghui [Wed, 15 Jun 2022 10:27:42 +0000 (10:27 +0000)]
memblock,arm64: expand the static memblock memory table
In a system (Huawei Ascend ARM64 SoC) using HBM, when a multi-bit ECC error
occurs, the BIOS marks the corresponding area (for example, 2 MB)
as unusable. When the system next restarts, these areas are either not
reported or reported as EFI_UNUSABLE_MEMORY. Both cases lead to an
increase in the number of memblocks, with EFI_UNUSABLE_MEMORY leading to
the larger number of memblocks.
For example, if the EFI_UNUSABLE_MEMORY type is reported:
...
memory[0x92] [0x0000200834a00000-0x0000200835bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x93] [0x0000200835c00000-0x0000200835dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x94] [0x0000200835e00000-0x00002008367fffff], 0x0000000000a00000 bytes on node 7 flags: 0x0
memory[0x95] [0x0000200836800000-0x00002008369fffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x96] [0x0000200836a00000-0x0000200837bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x97] [0x0000200837c00000-0x0000200837dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x98] [0x0000200837e00000-0x000020087fffffff], 0x0000000048200000 bytes on node 7 flags: 0x0
memory[0x99] [0x0000200880000000-0x0000200bcfffffff], 0x0000000350000000 bytes on node 6 flags: 0x0
memory[0x9a] [0x0000200bd0000000-0x0000200bd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9b] [0x0000200bd0200000-0x0000200bd07fffff], 0x0000000000600000 bytes on node 6 flags: 0x0
memory[0x9c] [0x0000200bd0800000-0x0000200bd09fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9d] [0x0000200bd0a00000-0x0000200fcfffffff], 0x00000003ff600000 bytes on node 6 flags: 0x0
memory[0x9e] [0x0000200fd0000000-0x0000200fd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9f] [0x0000200fd0200000-0x0000200fffffffff], 0x000000002fe00000 bytes on node 6 flags: 0x0
...
The EFI memory map is parsed to construct the memblock arrays before the
memblock arrays can be resized. As a result, memory regions beyond
INIT_MEMBLOCK_REGIONS are lost.
Add a new macro INIT_MEMBLOCK_MEMORY_REGIONS to replace
INIT_MEMBLOCK_REGIONS to define the size of the static memblock.memory
array.
Allow overriding the memblock.memory array size with an architecture-defined
INIT_MEMBLOCK_MEMORY_REGIONS and make arm64 set
INIT_MEMBLOCK_MEMORY_REGIONS to 1024 when CONFIG_EFI is enabled.
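The override mechanism is the common "define it first in the arch header, fall back in generic code" preprocessor pattern. The sketch below models it in a single file; the DEMO_ names and the generic default of 128 are assumptions for illustration.

```c
#include <assert.h>

/* Modeled arch header: only an arch that needs more room defines this. */
#ifdef DEMO_ARCH_HAS_EFI
#define DEMO_INIT_MEMORY_REGIONS 1024
#endif

/* Modeled generic code: default size shared by the static tables. */
#define DEMO_INIT_REGIONS 128

/* Fall back to the generic default unless the arch overrode it. */
#ifndef DEMO_INIT_MEMORY_REGIONS
#define DEMO_INIT_MEMORY_REGIONS DEMO_INIT_REGIONS
#endif

struct demo_region {
	unsigned long base;
	unsigned long size;
};

/* Static table sized at build time, before any resizing is possible. */
static struct demo_region demo_memory_regions[DEMO_INIT_MEMORY_REGIONS];

unsigned long demo_memory_capacity(void)
{
	return sizeof(demo_memory_regions) / sizeof(demo_memory_regions[0]);
}
```

Building with -DDEMO_ARCH_HAS_EFI grows the table to 1024 entries; without it, the generic default is used.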
Link: https://lkml.kernel.org/r/20220615102742.96450-1-zhouguanghui1@huawei.com Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Darren Hart <darren@os.amperecomputing.com> Acked-by: Will Deacon <will@kernel.org> [arm64] Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Xu Qiang <xuqiang36@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Sat, 16 Jul 2022 08:03:59 +0000 (16:03 +0800)]
mm: remove obsolete comment in do_fault_around()
Since commit 7267ec008b5c ("mm: postpone page table allocation until we
have page to map"), do_fault_around is not called with page table lock
held. Clean up the corresponding comments.
William Lam [Mon, 11 Jul 2022 20:28:06 +0000 (21:28 +0100)]
mm: compaction: include compound page count for scanning in pageblock isolation
The number of scanned pages can be lower than the number of isolated pages
when isolating migratable or free pageblocks. The metric is being
reported in trace event and also used in vmstat.
Some example output from trace where it shows nr_taken can be greater
than nr_scanned:
mm_compaction_isolate_migratepages does not seem to have this
behaviour, but for the reason of consistency, nr_scanned should also be
taken care of in that side.
This behaviour is confusing since currently the count for isolated pages
takes account of compound page but not for the case of scanned pages. And
given that the number of isolated pages(nr_taken) reported in
mm_compaction_isolate_template trace event is on a single-page basis, the
ambiguity when reporting the number of scanned pages can be removed by
also including compound page count.
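The counting mismatch can be modeled with a short userspace sketch. The structures and the sample data are hypothetical; the point is only that counting a compound page as one scanned page while counting all of its base pages as taken lets nr_taken exceed nr_scanned.

```c
#include <assert.h>

struct demo_page {
	unsigned int order;	/* 0: base page, N: compound page of 2^N base pages */
	int isolatable;
};

struct demo_counts {
	unsigned long nr_scanned;
	unsigned long nr_taken;
};

/*
 * count_compound == 0: each entry counts as one scanned page.
 * count_compound != 0: a compound page counts all its base pages as scanned.
 * nr_taken is always counted on a single-page basis.
 */
struct demo_counts demo_isolate(const struct demo_page *pages, int n,
				int count_compound)
{
	struct demo_counts c = { 0, 0 };
	int i;

	for (i = 0; i < n; i++) {
		unsigned long nr = 1UL << pages[i].order;

		c.nr_scanned += count_compound ? nr : 1;
		if (pages[i].isolatable)
			c.nr_taken += nr;
	}
	return c;
}

/* One base page, one order-9 compound page, one page that is skipped. */
struct demo_counts demo_sample(int count_compound)
{
	static const struct demo_page pages[] = {
		{ 0, 1 }, { 9, 1 }, { 0, 0 },
	};

	return demo_isolate(pages, 3, count_compound);
}
```

In the sample, counting one scanned page per entry gives 3 scanned against 513 taken; counting the compound page in full gives 514 scanned, so the two metrics use the same unit.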
Adam Sindelar [Mon, 4 Jul 2022 12:38:13 +0000 (14:38 +0200)]
selftests/vm: skip 128TBswitch on unsupported arch
The test va_128TBswitch.c exercises a feature only supported on PPC and
x86_64, but it's run on other 64-bit archs as well. Before this patch,
the test did nothing and returned 0 for KSFT_PASS. This patch makes it
return the KSFT codes from kselftest.h, including KSFT_SKIP when
appropriate.
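A minimal sketch of the skip logic follows. The DEMO_KSFT_ constants are local copies that are assumed to mirror the exit codes in kselftest.h, and the helper names are hypothetical.

```c
#include <assert.h>

/* Local copies of the kselftest exit codes (assumed to match kselftest.h). */
#define DEMO_KSFT_PASS 0
#define DEMO_KSFT_FAIL 1
#define DEMO_KSFT_SKIP 4

/* The feature under test is only supported on PPC and x86_64. */
int demo_arch_supported(void)
{
#if defined(__x86_64__) || defined(__powerpc64__)
	return 1;
#else
	return 0;
#endif
}

/* Test bodies return 0 on success and non-zero on failure. */
int demo_body_ok(void)
{
	return 0;
}

int demo_body_bad(void)
{
	return 1;
}

/* Report SKIP instead of a misleading PASS when the arch is unsupported. */
int demo_run_test(int supported, int (*body)(void))
{
	if (!supported)
		return DEMO_KSFT_SKIP;
	return body() ? DEMO_KSFT_FAIL : DEMO_KSFT_PASS;
}
```

A test program would typically return demo_run_test(demo_arch_supported(), body) from main() so the harness can tell a skipped run from a passing one.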
Roman Gushchin [Sat, 2 Jul 2022 03:35:21 +0000 (20:35 -0700)]
mm: memcontrol: do not miss MEMCG_MAX events for enforced allocations
Yafang Shao reported an issue related to the accounting of bpf memory:
if a bpf map is charged indirectly for memory consumed from an
interrupt context and allocations are enforced, MEMCG_MAX events are
not raised.
It's not/less of an issue in a generic case because consequent
allocations from a process context will trigger the direct reclaim and
MEMCG_MAX events will be raised. However a bpf map can belong to a
dying/abandoned memory cgroup, so there will be no allocations from a
process context and no MEMCG_MAX events will be triggered.
Link: https://lkml.kernel.org/r/20220702033521.64630-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Mon, 27 Jun 2022 13:23:51 +0000 (21:23 +0800)]
filemap: minor cleanup for filemap_write_and_wait_range
Restructure the logic in filemap_write_and_wait_range to simplify the code
and make it more consistent with file_write_and_wait_range. No functional
change intended.
Miaohe Lin [Sat, 18 Jun 2022 08:20:27 +0000 (16:20 +0800)]
mm/mmap.c: fix missing call to vm_unacct_memory in mmap_region
Originally, charged was set to 0 to avoid calling vm_unacct_memory twice,
because vm_unacct_memory was already called by the preceding unmap_region.
But since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from
the unmap_vmas() interfaces"), unmap_region no longer calls
vm_unacct_memory. So charged should no longer be set to 0; otherwise the
paired call to vm_unacct_memory is missed, which leads to imbalanced
accounting.
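The imbalance can be illustrated with a toy userspace model. This is a sketch
under stated assumptions, not the kernel code: every toy_* name and the
committed_pages counter are hypothetical stand-ins, and the zero_charged flag
only selects between the old and the fixed unwind behaviour.

```c
#include <assert.h>

/* Toy stand-in for the kernel's committed-memory counter. */
static long committed_pages;

/* Toy stand-ins for the accounting helpers. */
void toy_acct_memory(long pages)   { committed_pages += pages; }
void toy_unacct_memory(long pages) { committed_pages -= pages; }

/* Models unmap_region after commit 4f74d2c8e827: it no longer unaccounts. */
void toy_unmap_region(void) { }

/*
 * Hypothetical error path of a mapping routine: charge, fail, unwind.
 * Returns the number of pages still accounted after the unwind.
 */
long toy_mmap_region_fail(long pages, int zero_charged)
{
	long charged = pages;

	committed_pages = 0;
	toy_acct_memory(charged);

	/* ... the mapping fails here, so unwind ... */
	toy_unmap_region();
	if (zero_charged)
		charged = 0;	/* old behaviour: assumes unmap_region unaccounted */
	if (charged)
		toy_unacct_memory(charged);

	return committed_pages;
}
```

With zero_charged set, the charged pages stay accounted after the failure;
with it cleared, the paired unaccount runs and the counter returns to zero.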
Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Mon, 27 Jun 2022 15:18:59 +0000 (15:18 +0000)]
android: binder: fix lockdep check on clearing vma
When munmapping a vma, the mmap_lock can be downgraded to a read lock
before calling close() on the file handle. The binder close() function
calls binder_alloc_set_vma() to clear the vma address, which now has a
lockdep check that the mmap_lock is held for writing. Change the lockdep
check to ensure the read lock is held while clearing, and keep the write
check while writing.
Link: https://lkml.kernel.org/r/20220627151857.2316964-1-Liam.Howlett@oracle.com Fixes: 472a68df605b ("android: binder: stop saving a pointer to the VMA") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: syzbot+da54fa8d793ca89c741f@syzkaller.appspotmail.com Acked-by: Todd Kjos <tkjos@google.com> Cc: "Arve Hjønnevåg" <arve@android.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hridya Valsaraju <hridya@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Martijn Coenen <maco@android.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Tue, 21 Jun 2022 01:09:09 +0000 (21:09 -0400)]
android: binder: stop saving a pointer to the VMA
Do not record a pointer to a VMA outside of the mmap_lock for later use.
This is unsafe: there are a number of failure paths during setup, *after*
the pointer is recorded, on which the VMA may be freed. There is no
callback to the driver to clear the saved pointer from generic mm code.
Furthermore, the VMA pointer may become stale if any number of VMA
operations end up freeing the VMA, so saving it was fragile to begin with.
Instead, change the binder_alloc struct to record the start address of the
VMA and use vma_lookup() to get the vma when needed. Add lockdep
mmap_lock checks on updates to the vma pointer to ensure the lock is held
and depend on that lock for synchronization of readers and writers, which
was already the case anyway, so the smp_wmb()/smp_rmb() barriers were not
necessary.
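The idea of recording an address and looking the VMA up on demand can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: the toy_* types and functions are hypothetical, the lookup is a
linear scan rather than the kernel's real data structure, and locking is
omitted entirely (the real code requires the mmap_lock to be held).

```c
#include <assert.h>
#include <stddef.h>

/* Toy VMA: a half-open address range [start, end). */
struct toy_vma {
	unsigned long start, end;
	int live;	/* 0 once the mapping has been torn down */
};

/* Toy address space holding a small table of mappings. */
struct toy_mm {
	struct toy_vma vmas[4];
};

/* Toy analogue of vma_lookup(): the live mapping containing addr, or NULL. */
struct toy_vma *toy_vma_lookup(struct toy_mm *mm, unsigned long addr)
{
	for (size_t i = 0; i < 4; i++) {
		struct toy_vma *v = &mm->vmas[i];

		if (v->live && addr >= v->start && addr < v->end)
			return v;
	}
	return NULL;
}

/* The allocator records only the start address, never the pointer. */
struct toy_alloc {
	unsigned long vma_addr;
};

/* Resolve the address at the time of use, so a freed VMA yields NULL. */
struct toy_vma *toy_alloc_get_vma(struct toy_alloc *alloc, struct toy_mm *mm)
{
	if (!alloc->vma_addr)
		return NULL;
	return toy_vma_lookup(mm, alloc->vma_addr);
}
```

Because the lookup happens at the time of use, a mapping that has since gone
away is reported as NULL instead of being reached through a stale pointer.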
[akpm@linux-foundation.org: fix drivers/android/binder_alloc_selftest.c] Link: https://lkml.kernel.org/r/20220621140212.vpkio64idahetbyf@revolver Fixes: da1b9564e85b ("android: binder: fix the race mmap and alloc_new_buf_locked") Reported-by: syzbot+58b51ac2b04e388ab7b0@syzkaller.appspotmail.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hridya Valsaraju <hridya@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Martijn Coenen <maco@android.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>