Pull procfs updates from Christian Brauner:
- Revamp fs/filesystems.c
The file was a mess with a hand-rolled linked list in desperate need
of a cleanup. The filesystems list is now RCU-ified, /proc files can
be marked permanent from outside fs/proc/, and the string emitted
when reading /proc/filesystems is pre-generated and cached instead of
pointer-chasing and printfing entry by entry on every read.
The file is read frequently because libselinux reads it and is linked
into numerous frequently used programs (even ones you would not
suspect, like sed!). Scalability also improves since reference
maintenance on open/close is bypassed.
open+read+close cycle single-threaded (ops/s):
before: 442732
after:
1063462 (+140%)
open+read+close cycle with 20 processes (ops/s):
before: 606177
after:
3300576 (+444%)
A follow-up patch adds missing unlocks in some corner cases and
tidies things up.
- Relax the mount visibility check for subset=pid mounts
When procfs is mounted with subset=pid, all static files become
unavailable and only the dynamic pid information is accessible. In
that case there is no point in imposing the full mount visibility
restrictions on the mounter - everything that can be hidden in procfs
is already inaccessible. These restrictions prevented procfs from
being mounted inside rootless containers since almost all container
implementations overmount parts of procfs to hide certain
directories.
As part of this /proc/self/net is only shown in subset=pid mounts for
CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the
SB_I_USERNS_VISIBLE superblock flag is replaced with an
FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are
recorded in a list, and the mount restrictions are finally
documented.
- Protect ptrace_may_access() with exec_update_lock in procfs
Most uses of ptrace_may_access() in procfs should hold
exec_update_lock to avoid TOCTOU issues with concurrent privileged
execve() (like setuid binary execution).
This fixes the easy cases - the owner and visibility checks and the
FD link permission checks - with the gnarlier ones to follow later.
* tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: fix ups and tidy ups to /proc/filesystems caching
proc: protect ptrace_may_access() with exec_update_lock (FD links)
proc: protect ptrace_may_access() with exec_update_lock (part 1)
docs: proc: add documentation about mount restrictions
proc: handle subset=pid separately in userns visibility checks
proc: prevent reconfiguring subset=pid
proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
fs: cache the string generated by reading /proc/filesystems
sysfs: remove trivial sysfs_get_tree() wrapper
fs: RCU-ify filesystems list
fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
proc: allow to mark /proc files permanent outside of fs/proc/
namespace: record fully visible mounts in list