git.ipfire.org Git - thirdparty/linux.git/log

sched_ext: Use offsetofend on both sides of the ops_cid layout assert

sizeof() includes trailing struct pad, offsetofend() doesn't. On
32-bit PPC, sched_ext_ops_cid tail-pads 4 bytes past @priv and the
assert trips. Use offsetofend() on both sides.
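
The difference can be reproduced in userspace. The following is a
minimal sketch, not the kernel assert itself: struct demo_ops and
demo_tail_pad() are hypothetical names, and the offsetofend() macro
below is a local definition modeled on the kernel's.

```c
#include <stddef.h>
#include <stdint.h>

/* Local definition modeled on the kernel macro: offset just past MEMBER. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/*
 * Hypothetical stand-in: the last member is narrower than the widest
 * member, so the compiler may pad the tail of the struct.
 */
struct demo_ops {
	uint64_t flags;
	uint32_t priv;
};

/* Bytes of tail padding the compiler added after @priv (may be 0). */
static size_t demo_tail_pad(void)
{
	return sizeof(struct demo_ops) - offsetofend(struct demo_ops, priv);
}
```

offsetofend(struct demo_ops, priv) is 12 regardless of ABI, while
sizeof(struct demo_ops) depends on the alignment of the widest member,
which is why comparing sizeof() on one side with offsetofend() on the
other is fragile.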

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605081637.DbH4SZ1E-lkp@intel.com/
Fixes: 7e655ed7b953 ("sched_ext: Add bpf_sched_ext_ops_cid struct_ops type")
Signed-off-by: Tejun Heo <tj@kernel.org>

selftests/sched_ext: Fix select_cpu_dfl link leak on early return

If run() exits early via SCX_EQ/SCX_ASSERT (which return directly),
bpf_link__destroy() is never reached and the BPF scheduler stays
loaded. All subsequent tests then fail to attach because SCX is not
in the DISABLED state.

Move bpf_link into a context struct so cleanup() always destroys
it, regardless of how run() exits. Also skip waitpid() for children
where fork() returned -1, avoiding waitpid(-1,...) accidentally
reaping an unrelated child and triggering the early return path.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add __printf format attributes to scx_vexit() and bstr formatters

scx_vexit() forwards (fmt, args) to vscnprintf(); bstr_format() and
__bstr_format() forward fmt to bstr_printf(); the BPF kfunc wrappers
scx_bpf_exit_bstr(), scx_bpf_error_bstr() and scx_bpf_dump_bstr() in
turn forward fmt to those formatters. None of them have __printf(),
so clang -Wmissing-format-attribute fires on the forwarded calls and
C-side callers don't get format-string checking.

Annotate the six functions with __printf(N, 0) matching the fmt
parameter position in each.
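
As a userspace illustration of the (N, 0) form, here is a minimal
sketch. DEMO_PRINTF, demo_vexit() and demo_exit() are hypothetical
names; DEMO_PRINTF is assumed to expand to the same GCC/Clang format
attribute that the kernel's __printf() uses.

```c
#include <stdarg.h>
#include <stdio.h>

/* Assumed equivalent of the kernel's __printf(a, b). */
#define DEMO_PRINTF(a, b) __attribute__((format(printf, a, b)))

/*
 * Takes a va_list, so the second index is 0: the compiler checks the
 * format string itself but has no variadic arguments to match.
 */
DEMO_PRINTF(3, 0)
static int demo_vexit(char *buf, size_t len, const char *fmt, va_list args)
{
	return vsnprintf(buf, len, fmt, args);
}

/* Variadic front end: arguments start at position 4 and are checked. */
DEMO_PRINTF(3, 4)
static int demo_exit(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = demo_vexit(buf, len, fmt, args);
	va_end(args);
	return ret;
}
```

With the attribute on demo_vexit(), forwarding fmt from demo_exit()
no longer looks like an unannotated printf-style call to the compiler.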

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/all/202605041112.Y6OG7v9r-lkp@intel.com/
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Normalize exit dump header to "on CPU N"

Use uppercase "CPU" in the exit dump header to match the UEI output.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Remove redundant rcu_read_lock/unlock() in sysrq_handle_sched_ext_reset()

sysrq_handle_sched_ext_reset() is called from __handle_sysrq(), which
already holds rcu_read_lock() while invoking the sysrq handler. Remove
the redundant rcu_read_lock/unlock() pair.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Require cid-form struct_ops for sub-sched support

Sub-scheduler support is tied to the cid-form struct_ops: sub_attach /
sub_detach will communicate allocation via cmask, and the hierarchy assumes
all participants share a single topological cid space. A cpu-form root that
accepts sub-scheds would need cpu <-> cid translation on every cross-sched
interaction, defeating the purpose.

Enforce this at validate_ops():
- A sub-scheduler (scx_parent(sch) non-NULL) must be cid-form.
- A root that exposes sub_attach / sub_detach must be cid-form.
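
The two rules above can be sketched as follows. This is an
illustration only: struct demo_sched, its boolean fields and the
-EINVAL return value are assumptions, not the kernel's actual
validate_ops() code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of what the check needs to know. */
struct demo_sched {
	const struct demo_sched *parent;	/* NULL for the root */
	bool is_cid_form;
	bool has_sub_attach;
	bool has_sub_detach;
};

static int demo_validate_ops(const struct demo_sched *sch)
{
	/* Rule 1: a sub-scheduler must be cid-form. */
	if (sch->parent && !sch->is_cid_form)
		return -EINVAL;

	/* Rule 2: exposing sub_attach / sub_detach requires cid-form. */
	if ((sch->has_sub_attach || sch->has_sub_detach) && !sch->is_cid_form)
		return -EINVAL;

	return 0;
}
```

A cpu-form root without sub_attach / sub_detach still passes, so
existing cpu-form schedulers are unaffected.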

scx_qmap, which is currently the only scheduler demoing sub-sched support,
was converted to cid-form in the preceding patch, so this doesn't cause
breakage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

tools/sched_ext: scx_qmap: Port to cid-form struct_ops

Flip qmap's struct_ops to bpf_sched_ext_ops_cid. The kernel now passes
cids and cmasks to callbacks directly, so the per-callback cpu<->cid
translations that the prior patch added drop out and cpu_ctxs[] is
reindexed by cid. Cpu-form kfunc calls switch to their cid-form
counterparts.

The cpu-only kfuncs (idle/any pick, cpumask iteration) have no cid
substitute. Their callers already moved to cmask scans against
qa_idle_cids and taskc->cpus_allowed in the prior patch, so the kfunc
calls drop here without behavior changes.

set_cmask is wired up via cmask_copy_from_kernel() to copy the
kernel-supplied cmask into the arena-resident taskc cmask. The
cpuperf monitor iterates the cid-form perf kfuncs.

v4: Match scx_bpf_cid_override()'s 2-arg form, drop the shard test
plumbing, bound nr_cpu_ids for the verifier, and switch mode 3
from bad-mono to bad-range (Changwoo, Andrea).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick

Switch qmap's idle-cpu picker from scx_bpf_pick_idle_cpu() to a
BPF-side bitmap scan, still under cpu-form struct_ops. qa_idle_cids
tracks idle cids (updated in update_idle / cpu_offline) and each
task's taskc->cpus_allowed tracks its allowed cids (built in
set_cpumask / init_task); select_cpu / enqueue scan the intersection
for an idle cid. Callbacks translate cpu <-> cid on entry; the later
cid-form port of qmap drops those translations.

The scan is bare-bones, with no core preference or other
topology-aware picks like those of the in-kernel picker, but qmap is
a demo and this is enough to exercise the plumbing.

v3: qmap_init() refuses to load when nr_cids exceeds SCX_QMAP_MAX_CPUS;
task_ctx's flex array would otherwise overflow into the next slab
entry. (Sashiko)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline

The cid mapping is built from the online cpu set at scheduler enable
and stays valid for that set; routine hotplug invalidates it. The
default cid behavior is to restart the scheduler so the mapping gets
rebuilt against the new online set, and that requires not implementing
cpu_online / cpu_offline (which suppress the kernel's ACT_RESTART).

Drop the two ops along with their print_cpus() helper - the cluster
view was only useful as a hotplug demo and is meaningless over the
dense cid space the scheduler will move to. Wire main() to handle the
ACT_RESTART exit by reopening the skel and reattaching, matching the
pattern in scx_simple / scx_central / scx_flatcg etc. Reset optind so
getopt re-parses argv into the fresh skel rodata each iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Forbid cpu-form kfuncs from cid-form schedulers

cid and cpu are both small s32s, trivially confused when a cid-form
scheduler calls a cpu-keyed kfunc. Reject cid-form programs that
reference any kfunc in the new scx_kfunc_ids_cpu_only at verifier load
time.

The reverse direction is intentionally permissive: cpu-form schedulers
can freely call cid-form kfuncs to ease a gradual cpumask -> cid
migration.

The check sits in scx_kfunc_context_filter() right after the SCX
struct_ops gate and before the any/idle allow and per-op allow-list
checks, so it catches cpu-only kfuncs regardless of which set they
belong to (any, idle, or select_cpu).

v2: Sync per-entry kfunc flags with their primary declarations (Zhao).
pahole intersects flags across BTF_ID_FLAGS() occurrences, so
omitting them drops the flags globally.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add bpf_sched_ext_ops_cid struct_ops type

cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without a full cid interface,
schedulers end up mixing forms - a subtle-bug factory.

Add sched_ext_ops_cid, which mirrors sched_ext_ops with cid/cmask
replacing cpu/cpumask in the topology-carrying callbacks.
cpu_acquire/cpu_release are deprecated and absent; a prior patch
moved them past @priv so the cid-form can omit them without
disturbing shared-field offsets.

The two structs share byte-identical layout up to @priv, so the
existing bpf_scx init/check hooks, has_op bitmap, and
scx_kf_allow_flags[] are offset-indexed and apply to both.
BUILD_BUG_ON in scx_init() pins the shared-field and renamed-callback
offsets so any future drift trips at boot.

The kernel<->BPF boundary translates between cpu and cid:

- A static key, enabled on cid-form sched load, gates the translation
  so cpu-form schedulers pay nothing.
- dispatch, update_idle, cpu_online/offline and dump_cpu translate
  the cpu arg at the callsite.
- select_cpu also translates the returned cid back to a cpu.
- set_cpumask is wrapped to synthesize a cmask in a per-cpu scratch
  before calling the cid-form callback.

All scheds in a hierarchy share one form. The static key drives the
hot-path branch.

v2: Use struct_size() for the set_cmask_scratch percpu alloc. Move
    cid-shard fields and assertions into the later cid-shard patch.

v3: Drop `static` on scx_set_cmask_scratch; add extern in ext_internal.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add cid-form kfunc wrappers alongside cpu-form

cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without full cid coverage a
scheduler has to mix cid and cpu forms, which is a subtle-bug factory.
Close the gap with a cid-native interface.

Pair every cpu-form kfunc that takes a cpu id with a cid-form
equivalent (kick, task placement, cpuperf query/set, per-cpu current
task, nr-cpu-ids). Add two cid-natives with no cpu-form sibling:
scx_bpf_this_cid() (cid of the running cpu, scx equivalent of
bpf_get_smp_processor_id) and scx_bpf_nr_online_cids().

scx_bpf_cpu_rq is deprecated; no cid-form counterpart. NUMA node info
is reachable via scx_bpf_cid_topo() on the BPF side.

Each cid-form wrapper is a thin cid -> cpu translation that delegates
to the cpu path, registered in the same context sets so usage
constraints match.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add cmask, a base-windowed bitmap over cid space

Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid
space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes
most of its bits for a small window and is awkward in BPF.

scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global
64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two
cmasks therefore address bits[] against the same global windows, so
cross-cmask word ops reduce to

dest->bits[i] OP= operand->bits[i - delta]

with no bit-shifting, at the cost of up to one extra storage word for
head misalignment. This alignment guarantee is the reason binary ops
can stay word-level; every mutating helper preserves it.
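
The word-level property can be sketched in userspace. Everything
below is hypothetical (struct demo_cmask, the fixed DEMO_MAX_WORDS
storage and the helper names); the real scx_cmask layout and helpers
may differ. Only the grid alignment of bits[] follows the description
above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_WORDS 4	/* assumption: fixed storage for the demo */

struct demo_cmask {
	uint32_t base;		/* first cid covered */
	uint32_t nr_bits;	/* covers [base, base + nr_bits) */
	uint64_t bits[DEMO_MAX_WORDS];	/* bits[0] starts at base & ~63 */
};

/* Words needed, including the possible extra word for a misaligned head. */
static uint32_t demo_nr_words(uint32_t base, uint32_t nr_bits)
{
	return ((base & 63) + nr_bits + 63) / 64;
}

static bool demo_cmask_init(struct demo_cmask *m, uint32_t base, uint32_t nr_bits)
{
	if (demo_nr_words(base, nr_bits) > DEMO_MAX_WORDS)
		return false;
	m->base = base;
	m->nr_bits = nr_bits;
	memset(m->bits, 0, sizeof(m->bits));
	return true;
}

static bool demo_cmask_covers(const struct demo_cmask *m, uint32_t cid)
{
	return cid >= m->base && cid - m->base < m->nr_bits;
}

static void demo_cmask_set(struct demo_cmask *m, uint32_t cid)
{
	if (demo_cmask_covers(m, cid))
		m->bits[cid / 64 - m->base / 64] |= 1ULL << (cid & 63);
}

static bool demo_cmask_test(const struct demo_cmask *m, uint32_t cid)
{
	return demo_cmask_covers(m, cid) &&
	       ((m->bits[cid / 64 - m->base / 64] >> (cid & 63)) & 1);
}

/* dest |= operand over the cids both cover; whole words, no shifting. */
static void demo_cmask_or(struct demo_cmask *dest, const struct demo_cmask *operand)
{
	uint32_t dend = dest->base + dest->nr_bits;
	uint32_t oend = operand->base + operand->nr_bits;
	uint32_t lo = dest->base > operand->base ? dest->base : operand->base;
	uint32_t hi = dend < oend ? dend : oend;
	uint32_t w;

	if (lo >= hi)
		return;

	for (w = lo / 64; w <= (hi - 1) / 64; w++) {
		uint64_t mask = ~0ULL;

		/* Trim the first and last words to the shared range. */
		if (lo > w * 64)
			mask &= ~0ULL << (lo - w * 64);
		if (hi < (w + 1) * 64)
			mask &= ~0ULL >> ((w + 1) * 64 - hi);

		dest->bits[w - dest->base / 64] |=
			operand->bits[w - operand->base / 64] & mask;
	}
}
```

Because both masks index bits[] against the same 64-cid grid, the two
word indexes in demo_cmask_or() differ only by a constant delta, which
is the bits[i] OP= bits[i - delta] form shown above.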

Kernel side in ext_cid.[hc]; BPF side in tools/sched_ext/include/scx/
cid.bpf.h. BPF side drops the scx_ prefix (redundant in BPF code) and
adds the extra helpers that basic idle-cpu selection needs.

No callers yet.

v2: Narrow to helpers that will be used in the planned changes;
    set/bit/find/zero ops will be added as usage develops.

v3: cmask_copy_from_kernel: validate src->base == 0 via probe-read;
    bit-level nr_bits check instead of round-up word count. (Sashiko)

v4: Bump CMASK_CAS_TRIES to 1<<23 so abort fires only after seconds
    of real spinning, not on plausible contention. Switch
    __builtin_ctzll() to the ctzll() wrapper for clang compat
    (Changwoo).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

tools/sched_ext: Add struct_size() helpers to common.bpf.h

Add flex_array_size(), struct_size() and struct_size_t() to
scx/common.bpf.h so BPF schedulers can size flex-array-containing
structs the same way kernel code does. These are abbreviated forms of
the <linux/overflow.h> macros.

v3: Use offsetof() instead of sizeof() in struct_size() to match kernel
semantics (no inflation from trailing struct padding). (Sashiko)
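
To make the padding point concrete, here is a minimal sketch of the
offsetof()-based sizing described in the v3 note. The demo_ macros and
struct demo_buf are hypothetical and are not the definitions added to
scx/common.bpf.h; they also do no overflow checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes taken by @count elements of the flex array @member. */
#define demo_flex_array_size(p, member, count) \
	((size_t)(count) * sizeof(*(p)->member))

/* Header up to the flex array plus @count elements, no tail padding. */
#define demo_struct_size(p, member, count) \
	(offsetof(__typeof__(*(p)), member) + \
	 demo_flex_array_size(p, member, count))

/* Hypothetical struct whose flex array starts inside the tail padding. */
struct demo_buf {
	uint64_t id;
	uint8_t nr;
	uint8_t vals[];
};

static size_t demo_buf_bytes(size_t count)
{
	struct demo_buf *p = NULL;	/* type carrier only, never dereferenced */

	return demo_struct_size(p, vals, count);
}
```

vals[] starts at offset 9, so three elements need 12 bytes. A
sizeof()-based formula would add the count on top of the padded
struct size and over-allocate on ABIs that pad struct demo_buf.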

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add scx_bpf_cid_override() kfunc

The auto-probed cid mapping reflects the kernel's view of topology
(node -> LLC -> core), but a BPF scheduler may want a different layout -
to align cid slices with its own partitioning, or to work around how the
kernel reports a particular machine.

Add scx_bpf_cid_override(), callable from ops.init() of the root
scheduler. It validates the caller-supplied cpu->cid array and replaces
the in-place mapping; topo info is invalidated. A compat.bpf.h wrapper
silently no-ops on kernels that lack the kfunc.

A new SCX_KF_ALLOW_INIT bit in the kfunc context filter restricts the
kfunc to ops.init() at verifier load time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>

sched_ext: Add topological CPU IDs (cids)

Raw cpu numbers are clumsy for sharding and cross-sched communication,
especially from BPF. The space is sparse, numerical closeness doesn't
track topological closeness (x86 hyperthreading often scatters SMT
siblings), and a range of cpu ids doesn't describe anything meaningful.
Sub-sched support makes this acute: cpu allocation, revocation, and
state constantly flow across sub-scheds. Passing whole cpumasks scales
poorly (every op scans 4K bits) and cpumasks are awkward in BPF.

cids assign every cpu a dense, topology-ordered id. CPUs sharing a core,
LLC, or NUMA node occupy contiguous cid ranges, so a topology unit
becomes a (start, length) slice. Communication passes slices; BPF can
process a u64 word of cids at a time.

Build the mapping once at root enable by walking online cpus node -> LLC
-> core. Possible-but-not-online cpus tail the space with no-topo cids.
Expose kfuncs to map cpu <-> cid in either direction and to query each
cid's topology metadata.
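
The resulting ordering can be sketched in userspace. Note the swap:
the sketch sorts a hypothetical cpu table by (node, LLC, core) with
qsort() rather than walking kernel topology as the patch does. struct
demo_cpu and demo_build_cids() are made-up names, and cpu numbers are
assumed to be exactly 0..nr-1.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-cpu topology record. */
struct demo_cpu {
	int cpu;
	int node;
	int llc;
	int core;
	bool online;
};

static int demo_cmp_int(int a, int b)
{
	return (a > b) - (a < b);
}

/* Online cpus first, then node -> LLC -> core, cpu number as tie-break. */
static int demo_cpu_cmp(const void *a, const void *b)
{
	const struct demo_cpu *x = a, *y = b;
	int d;

	if (x->online != y->online)
		return x->online ? -1 : 1;
	if ((d = demo_cmp_int(x->node, y->node)))
		return d;
	if ((d = demo_cmp_int(x->llc, y->llc)))
		return d;
	if ((d = demo_cmp_int(x->core, y->core)))
		return d;
	return demo_cmp_int(x->cpu, y->cpu);
}

/* Sorts cpus[] in place and fills cpu_to_cid[] with dense ids. */
static void demo_build_cids(struct demo_cpu *cpus, int nr, int *cpu_to_cid)
{
	int cid;

	qsort(cpus, nr, sizeof(*cpus), demo_cpu_cmp);
	for (cid = 0; cid < nr; cid++)
		cpu_to_cid[cpus[cid].cpu] = cid;
}
```

With SMT siblings scattered as cpus 0/2 and 1/3, the siblings end up
on adjacent cids, so each core becomes a (start, length) slice, and an
offline cpu lands at the tail of the space.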

v2: Use kzalloc_objs()/kmalloc_objs() for the three allocs in
    scx_cid_arrays_alloc() (Cheng-Yang Chou).

v3: scx_cid_init() failure path now drops cpus_read_lock();
    BUILD_BUG_ON tightened to match BPF cmask helpers' NR_CPUS<=8192.
    (Sashiko)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Make scx_enable() take scx_enable_cmd

Pass struct scx_enable_cmd to scx_enable() rather than unpacking @ops
at every call site and re-packing into a fresh cmd inside. bpf_scx_reg()
now builds the cmd on its stack and hands it in; scx_enable() just
wires up the kthread work and waits.

Relocate struct scx_enable_cmd above scx_alloc_and_add_sched() so
upcoming patches that also want the cmd can see it.

No behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops

cpu_acquire and cpu_release are deprecated and slated for removal. Move
their declarations to the end of struct sched_ext_ops so an upcoming
cid-form struct (sched_ext_ops_cid) can omit them entirely without
disturbing the offsets of the shared fields.

Switch the two SCX_HAS_OP() callers for these ops to direct field checks
since the relocated ops sit outside the SCX_OPI_END range covered by the
has_op bitmap.

scx_kf_allow_flags[] auto-sizes to the highest used SCX_OP_IDX, so
SCX_OP_IDX(cpu_release) moving to a higher index just enlarges the
sparse array; the lookup logic is unchanged.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu()

Callers that already know the cpu is valid shouldn't have to pay for a
redundant check. scx_kick_cpu() is called from the in-kernel balance loop
break-out path with the current cpu (trivially valid) and from
scx_bpf_kick_cpu() with a BPF-supplied cpu that does need validation. Move
the check out of scx_kick_cpu() into scx_bpf_kick_cpu() so the backend is
reusable by callers that have already validated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h

Things shared across multiple .c files belong in a header. scx_exit() /
scx_error() (and their scx_vexit() / scx_verror() siblings) are already
called from ext_idle.c and the upcoming ext_cid.c, and it was only
build_policy.c's textual inclusion of ext.c that made the references
resolve. Move the whole family to ext_internal.h.

Pure visibility change.

v4: Rebased over the exit_cpu plumbing. scx_exit() and scx_verror()
    are now macros wrapping raw_smp_processor_id(); move both macros
    plus the underlying __scx_exit() / scx_vexit() declarations to
    the header.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it

Rename the static ext.c helper and declare it in ext_internal.h so
ext_idle.c and the upcoming cid code can call it directly instead of
relying on build_policy.c textual inclusion.

Pure rename and visibility change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add ext_types.h for early subsystem-wide defs

Introduce kernel/sched/ext_types.h as the early-def header for the
sched_ext compilation unit. It is included from
kernel/sched/build_policy.c before ext_internal.h so every later
header and source in the unit
sees its content without re-inclusion. Later patches add their types
here (struct scx_cid_topo, scx_cmask, scx_cid_shard, etc.) so the
subsystem has one place to stash types shared across the TU.

Move enum scx_consts (SCX_DSP_DFL_MAX_BATCH, SCX_WATCHDOG_MAX_TIMEOUT,
SCX_SUB_MAX_DEPTH, etc.) here as the initial content. Ops-facing
content stays in ext_internal.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Expose exit_cpu to BPF and userspace

Extend struct user_exit_info with an exit_cpu field so BPF schedulers
and the userspace report path can see the CPU that triggered the exit,
matching the kernel-side dump.

UEI_RECORD() defaults the field to -1 before the CO-RE-gated copy so
that running against an older kernel without exit_cpu stays
distinguishable from "exit happened on CPU 0".

UEI_REPORT() appends "on CPU N" to the EXIT line when the value is
valid, surfacing the most diagnostically useful piece of exit info to
any sched_ext userspace tool without needing to crack open the debug
dump.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Dump the exit CPU first

When sched_ext is disabled by an error, the CPU that triggered the exit
is the most relevant piece of information for diagnosing the problem.
However, if there are many CPUs, the dump can get truncated and that
CPU's information may not appear in the output.

Add an exit_cpu field to scx_exit_info and thread it through scx_vexit()
/ __scx_exit(). For the watchdog stall path, populate it from cpu_of(rq)
in check_rq_for_timeouts(). For all other exit paths, define a scx_exit()
macro that wraps __scx_exit() with raw_smp_processor_id(), so the CPU
that initiated the exit is captured automatically, with no call-site
changes needed.

In scx_dump_state(), report the exit CPU in the dump header ("on cpu N")
and dump that CPU first, skipping it in the per-CPU loop, so the most
relevant CPU is never truncated out of the dump. The SysRq-D path
initializes exit_cpu to -1 so debug dumps not tied to an exit don't
arbitrarily promote CPU 0.
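
The dump ordering can be sketched as below. demo_dump_order() is a
hypothetical helper that only records the order in which CPUs would be
dumped; it is not the scx_dump_state() code.

```c
/*
 * Writes the dump order into order[] (room for nr_cpus entries) and
 * returns the number of entries. exit_cpu < 0 means "no exit CPU".
 */
static int demo_dump_order(int nr_cpus, int exit_cpu, int *order)
{
	int n = 0;
	int cpu;

	/* Dump the exit CPU first so truncation cannot cut it off. */
	if (exit_cpu >= 0 && exit_cpu < nr_cpus)
		order[n++] = exit_cpu;

	/* Then the remaining CPUs, skipping the one already dumped. */
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu == exit_cpu)
			continue;
		order[n++] = cpu;
	}
	return n;
}
```

With four CPUs and exit_cpu 2 the order is 2, 0, 1, 3. With exit_cpu
-1, as in the SysRq-D case, nothing is promoted and the order stays
0, 1, 2, 3.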

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Extract scx_dump_cpu() from scx_dump_state()

Factor out the per-CPU state dump logic from the for_each_possible_cpu
loop in scx_dump_state() into a new scx_dump_cpu() helper to improve
readability. No functional change.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Collect ext_*.c include headers in build_policy.c

Move the <linux/btf_ids.h> and "ext_idle.h" includes out of ext.c (along
with the "ext_idle.h" self-include in ext_idle.c) and into
build_policy.c. Subsequent patches add
their headers the same way for consistency.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

Merge branch 'for-7.1-fixes' into for-7.2

Pull to receive:

c0e8ddc76d54 ("sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED")

which conflicts with:

41e3312861ea ("sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup")

It's a simple context conflict. Take changes from both.

Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable

scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.

scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.

Call cpus_read_unlock() before goto err_disable.

v2: Correct Fixes: tag (Andrea Righi).

Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime

scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.

Reject !sch up front so the gate only admits callers with a resolved
scheduler.
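
The bypass reduces to a pointer comparison where both sides are NULL.
A minimal userspace sketch follows; the demo_ types and functions are
hypothetical and only model the shape of the gate, not the kfuncs
themselves.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_sched {
	int id;
};

struct demo_task {
	struct demo_sched *sched;	/* NULL for a non-scx task */
	unsigned long slice;
};

/* Models the ownership test: does @p belong to @sch? */
static bool demo_task_on_sched(const struct demo_sched *sch,
			       const struct demo_task *p)
{
	return p->sched == sch;
}

/* Before the fix: a NULL @sch matches any task whose sched is NULL. */
static bool demo_set_slice_buggy(struct demo_sched *sch, struct demo_task *p,
				 unsigned long slice)
{
	if (!demo_task_on_sched(sch, p))
		return false;
	p->slice = slice;
	return true;
}

/* After the fix: reject an unresolved scheduler before the gate. */
static bool demo_set_slice_fixed(struct demo_sched *sch, struct demo_task *p,
				 unsigned long slice)
{
	if (!sch || !demo_task_on_sched(sch, p))
		return false;
	p->slice = slice;
	return true;
}
```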

Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Refuse cross-task select_cpu_from_kfunc calls

select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.

Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().

v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
    comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
    ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
    scx_bpf_select_cpu_*() on the same task. (Andrea Righi)

Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>

sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED

Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:

- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
  in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
  matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
  only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
  reference on every root-sched enable.

- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
  SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
  compile.

GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:

- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
  put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
  with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
  lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Make bypass LB cpumasks per-scheduler

scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.

Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().

Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before

scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass task_rq(a).

Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task

scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.

Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.

Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP

SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. Correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the NULL clobbers the outer's state. The parent's BPF then
calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and
re-acquire the already-held rq.

Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.
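
The clobbering and the fix can be modeled in a few lines. This is a
sketch under assumptions: plain functions stand in for the
SCX_CALL_OP* macros, a single global stands in for
scx_locked_rq_state, and every demo_ name is hypothetical.

```c
#include <stddef.h>

struct demo_rq {
	int cpu;
};

typedef void (*demo_op_t)(void *arg);
typedef void (*demo_call_t)(struct demo_rq *locked_rq, demo_op_t op, void *arg);

/* Stand-in for the "which rq do we hold" state. */
static struct demo_rq *demo_locked_rq;

/* Old behavior: unconditionally clear on exit. */
static void demo_call_op_clear(struct demo_rq *locked_rq, demo_op_t op, void *arg)
{
	demo_locked_rq = locked_rq;
	op(arg);
	demo_locked_rq = NULL;
}

/* New behavior: snapshot on entry, restore on exit. */
static void demo_call_op_restore(struct demo_rq *locked_rq, demo_op_t op, void *arg)
{
	struct demo_rq *saved = demo_locked_rq;

	demo_locked_rq = locked_rq;
	op(arg);
	demo_locked_rq = saved;
}

struct demo_probe {
	demo_call_t call;
	struct demo_rq *rq;
	struct demo_rq *seen_after_child;
};

static void demo_child_dispatch(void *arg)
{
	(void)arg;
}

/* Parent op: invoke the child's op, then look at the locked-rq state. */
static void demo_parent_dispatch(void *arg)
{
	struct demo_probe *pr = arg;

	pr->call(pr->rq, demo_child_dispatch, NULL);
	pr->seen_after_child = demo_locked_rq;
}

/* Returns what the parent observes after the nested call returns. */
static struct demo_rq *demo_seen_by_parent(demo_call_t call, struct demo_rq *rq)
{
	struct demo_probe pr = { .call = call, .rq = rq, .seen_after_child = NULL };

	demo_locked_rq = NULL;
	call(rq, demo_parent_dispatch, &pr);
	return pr.seen_after_child;
}
```

With demo_call_op_clear() the parent sees NULL after the nested call;
with demo_call_op_restore() it still sees the rq it holds.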

Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail

dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide
whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF
iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a
reliable "no real task" check. If the last real task is unlinked while a
cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then
sees list_empty() == false and skips the first_task update, leaving
scx_bpf_dsq_peek() returning NULL for a non-empty DSQ.

Test dsq->first_task directly, which already tracks only real tasks and is
maintained under dsq->lock.
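
The failure sequence can be replayed on a toy list. This is a minimal
sketch with hypothetical demo_ types and helpers, single-threaded and
without locking; it models only the first_task bookkeeping, not the
real dispatch_enqueue().

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_node {
	struct demo_node *prev, *next;
	bool is_cursor;		/* parked iterator cursor, not a task */
};

struct demo_dsq {
	struct demo_node list;		/* circular list head */
	struct demo_node *first_task;	/* tracks real tasks only */
};

static void demo_dsq_init(struct demo_dsq *dsq)
{
	dsq->list.prev = dsq->list.next = &dsq->list;
	dsq->list.is_cursor = false;
	dsq->first_task = NULL;
}

static bool demo_list_empty(const struct demo_dsq *dsq)
{
	return dsq->list.next == &dsq->list;
}

static void demo_add_tail(struct demo_dsq *dsq, struct demo_node *n)
{
	n->prev = dsq->list.prev;
	n->next = &dsq->list;
	dsq->list.prev->next = n;
	dsq->list.prev = n;
}

static void demo_park_cursor(struct demo_dsq *dsq, struct demo_node *cursor)
{
	cursor->is_cursor = true;
	demo_add_tail(dsq, cursor);
}

/* FIFO-tail enqueue; @use_first_task selects the fixed emptiness test. */
static void demo_enqueue_tail(struct demo_dsq *dsq, struct demo_node *task,
			      bool use_first_task)
{
	bool no_task = use_first_task ? !dsq->first_task : demo_list_empty(dsq);

	task->is_cursor = false;
	demo_add_tail(dsq, task);
	if (no_task)
		dsq->first_task = task;
}

static void demo_dequeue(struct demo_dsq *dsq, struct demo_node *task)
{
	if (dsq->first_task == task) {
		struct demo_node *n = task->next;

		/* The next real task, skipping cursors, becomes first. */
		while (n != &dsq->list && n->is_cursor)
			n = n->next;
		dsq->first_task = n == &dsq->list ? NULL : n;
	}
	task->prev->next = task->next;
	task->next->prev = task->prev;
}

/* Replays the sequence; true if the new task is visible to a peek. */
static bool demo_peek_sees_task(bool use_first_task)
{
	struct demo_dsq dsq;
	struct demo_node t1, cursor, t2;

	demo_dsq_init(&dsq);
	demo_enqueue_tail(&dsq, &t1, use_first_task);
	demo_park_cursor(&dsq, &cursor);
	demo_dequeue(&dsq, &t1);	/* last real task leaves, cursor stays */
	demo_enqueue_tail(&dsq, &t2, use_first_task);
	return dsq.first_task == &t2;
}
```

demo_peek_sees_task(false) models the list_empty() test and leaves
first_task NULL; demo_peek_sees_task(true) models the fix.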

Fixes: 44f5c8ec5b9a ("sched_ext: Add lockless peek operation for DSQs")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Ryan Newton <newton@meta.com>

sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()

scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux)
and inserts the new DSQ into that scheduler's dsq_hash. Its inverse
scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were
hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only
destroy or query DSQs in the root scheduler's hash - never its own. If the
root had a DSQ with the same id, the sub-sched silently destroyed it and the
root aborted on the next dispatch ("invalid DSQ ID 0x0..").

Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the
scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq().

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters

scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring
scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs.
If the loaded scheduler is disabled and freed (via RCU work) and another is
enabled between the naked load and the rwsem acquire, the reader sees
scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one
- UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...).

scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write
(scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section
correlates @sch with the enabled snapshot.

Fixes: a5bd6ba30b33 ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations")
Cc: stable@vger.kernel.org # v6.18+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path

scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false)
without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails
partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the
marked tasks. Task state is still the parent's ENABLED, so that dispatches
to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e.
child->ops.disable() - for tasks on which child->ops.enable() never ran. A
BPF sub-scheduler allocating per-task state in enable/freeing in disable
would operate on uninitialized state.

The dying-task branch in scx_disable_and_exit_task() has the same problem,
and scx_enabling_sub_sched was cleared before the abort cleanup loop - a
task exiting during cleanup tripped the WARN and skipped both ops.exit_task
and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the
task stuck.

Introduce scx_sub_init_cancel_task() that calls ops.exit_task with
cancelled=true - matching what the top-level init path does when init_task
itself returns -errno. Use it in the abort loop and in the dying-task
branch. scx_enabling_sub_sched now stays set until the abort loop finishes
clearing SUB_INIT, so concurrent exits hitting the dying-task branch can
still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally
when seen, leaving the task unmarked even if the WARN fires.

Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()

bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without
migrating them - task_cpu() only updates when the donee later consumes the
task via move_remote_task_to_local_dsq(). If the LB timer fires again before
consumption and the new DSQ becomes a donor, @p is still on the previous CPU
and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked.

Skip such tasks.

Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
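
The skip rule can be sketched in plain C as a filter over the donor queue. The types and count_movable() below are hypothetical; the rq field simply models what task_rq() would return, and no real locking or migration is performed.

```c
#include <assert.h>
#include <stddef.h>

struct rq { int cpu; };
struct task { struct rq *rq; };		/* models task_rq(p) */

/* Count the tasks on the donor's bypass DSQ that may be moved. */
static size_t count_movable(struct task *const *dsq, size_t nr,
			    const struct rq *donor_rq)
{
	size_t moved = 0;

	for (size_t i = 0; i < nr; i++) {
		/* Stale task_rq: the donor's rq lock does not cover it. */
		if (dsq[i]->rq != donor_rq)
			continue;
		moved++;
	}
	return moved;
}

/* Task b was transferred to CPU 1's DSQ but has not been consumed yet. */
static size_t demo_bypass_lb(void)
{
	struct rq cpu0 = { .cpu = 0 }, cpu1 = { .cpu = 1 };
	struct task a = { .rq = &cpu1 }, b = { .rq = &cpu0 }, c = { .rq = &cpu1 };
	struct task *dsq[] = { &a, &b, &c };

	return count_movable(dsq, 3, &cpu1);
}
```

Only the two tasks whose rq matches the donor are counted as movable.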

sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new

bpf_iter_scx_dsq_new() clears kit->dsq on failure and
bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't -
it dereferences kit->dsq immediately, so a BPF program that calls
scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel.

Return false if kit->dsq is NULL.

Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
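
The guard amounts to an early return on a cleared iterator. The sketch below is a userspace model with hypothetical types and helpers; in particular the -ENOENT returned by iter_new() is an assumption for illustration, not the kernel's actual error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct dsq { int nr_tasks; };
struct dsq_iter { struct dsq *dsq; };

/* Models a failing iter_new: the iterator is left with a NULL dsq. */
static int iter_new(struct dsq_iter *kit, struct dsq *dsq)
{
	if (!dsq) {
		kit->dsq = NULL;
		return -ENOENT;		/* assumed error code */
	}
	kit->dsq = dsq;
	return 0;
}

static bool dsq_move(struct dsq_iter *kit, struct dsq *dst)
{
	if (!kit->dsq)			/* the added guard */
		return false;
	if (kit->dsq->nr_tasks == 0)
		return false;
	kit->dsq->nr_tasks--;
	dst->nr_tasks++;
	return true;
}

/* A caller that ignores the iter_new() result and moves anyway. */
static bool demo_move(bool iter_ok)
{
	struct dsq src = { .nr_tasks = 1 }, dst = { .nr_tasks = 0 };
	struct dsq_iter kit;

	iter_new(&kit, iter_ok ? &src : NULL);
	return dsq_move(&kit, &dst);
}
```

Without the first check in dsq_move(), the failed-iterator case would dereference a NULL pointer.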

sched_ext: Unregister sub_kset on scheduler disable

When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a
child of &sch->kobj, which pins the parent with its own reference. The
disable paths never call kset_unregister(), so the final kobject_put() in
bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs,
leaking the whole struct scx_sched on every load/unload cycle.

Unregister sub_kset in scx_root_disable() and scx_sub_disable() before
kobject_del(&sch->kobj).

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
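
The leak is a plain reference-count imbalance, which a toy model makes visible. struct kobj and its helpers below are hypothetical simplifications of kobject reference counting, written only to show why the child must be dropped before the parent's final put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct kobj {
	int refcount;
	struct kobj *parent;
	bool released;
};

static void kobj_put(struct kobj *k)
{
	if (--k->refcount > 0)
		return;
	k->released = true;		/* models the release callback */
	if (k->parent)
		kobj_put(k->parent);	/* child drops its pin on the parent */
}

static void kobj_init(struct kobj *k, struct kobj *parent)
{
	k->refcount = 1;
	k->parent = parent;
	k->released = false;
	if (parent)
		parent->refcount++;	/* child pins the parent */
}

/* Returns true if the scheduler object is released after teardown. */
static bool sched_released(bool unregister_sub_kset)
{
	struct kobj sch, sub_kset;

	kobj_init(&sch, NULL);
	kobj_init(&sub_kset, &sch);
	if (unregister_sub_kset)
		kobj_put(&sub_kset);	/* models kset_unregister() */
	kobj_put(&sch);			/* models the final kobject_put() */
	return sch.released;
}
```

When the child is never dropped, the parent's count stays at one and its release never runs.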

sched_ext: Defer scx_hardlockup() out of NMI

scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(),
which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing
it from NMI context can lead to deadlocks.

The hardlockup handler is best-effort recovery and the disable path it
triggers runs off of irq_work anyway. Move the handle_lockup() call into
an irq_work so it runs in IRQ context.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: sync disable_irq_work in bpf_scx_unreg()

When unregistering my self-written scx scheduler, the following panic
occurs.

[  229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[  229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1]  SMP
[  230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[  230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[  230.093972] Workqueue: events_unbound bpf_map_free_deferred
[  230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[  230.116843] pc : 0xffff80009bc2c1f8
[  230.120406] lr : dequeue_task_scx+0x270/0x2d0
[  230.217749] Call trace:
[  230.228515]  0xffff80009bc2c1f8 (P)
[  230.232077]  dequeue_task+0x84/0x188
[  230.235728]  sched_change_begin+0x1dc/0x250
[  230.240000]  __set_cpus_allowed_ptr_locked+0x17c/0x240
[  230.245250]  __set_cpus_allowed_ptr+0x74/0xf0
[  230.249701]  ___migrate_enable+0x4c/0xa0
[  230.253707]  bpf_map_free_deferred+0x1a4/0x1b0
[  230.258246]  process_one_work+0x184/0x540
[  230.262342]  worker_thread+0x19c/0x348
[  230.266170]  kthread+0x13c/0x150
[  230.269465]  ret_from_fork+0x10/0x20
[  230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[  230.287621] ---[ end trace 0000000000000000 ]---
[  231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt

The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.

The expected ordering during teardown is:
    bitmap_zero(sch->has_op) + synchronize_rcu()
        -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page

bpf_scx_unreg() is supposed to enforce this order, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, which makes
kthread_flush_work() a no-op. As a result, the caller drops the
struct_ops map too early and the JIT page is poisoned with
AARCH64_BREAK_FAULT before disable_workfn ever executes.

The subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
as true and calls ops.quiescent, which lands on the poisoned page and
triggers the BRK panic.

Add a helper, scx_flush_disable_work(), so that future callers that need
to flush disable_work can use it.
Also amend the calls in the error paths of scx_root_enable_workfn() and
scx_sub_enable_workfn(), which follow a similar pattern.

Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
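
The ordering problem can be modeled with three flags. This is a userspace sketch under stated assumptions: struct sched_model and the model_* helpers are hypothetical, "syncing" the irq_work is modeled by running it inline, and the real helper's body is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

struct sched_model {
	bool irq_work_pending;	/* disable request bounced to irq_work */
	bool work_queued;	/* disable work queued on the worker */
	bool ops_quiesced;	/* has_op cleared, safe to free the ops */
};

static void model_disable(struct sched_model *s)
{
	s->irq_work_pending = true;	/* the work itself is not queued yet */
}

/* Models waiting for the irq_work, which is what queues the work. */
static void model_irq_work_sync(struct sched_model *s)
{
	if (!s->irq_work_pending)
		return;
	s->irq_work_pending = false;
	s->work_queued = true;
}

/* Models flushing the work: a no-op when nothing is queued. */
static void model_flush_work(struct sched_model *s)
{
	if (!s->work_queued)
		return;
	s->work_queued = false;
	s->ops_quiesced = true;
}

/* Returns true if the ops are quiesced by the time the caller frees them. */
static bool quiesced_before_free(bool sync_irq_work_first)
{
	struct sched_model s = { false, false, false };

	model_disable(&s);
	if (sync_irq_work_first)
		model_irq_work_sync(&s);
	model_flush_work(&s);
	return s.ops_quiesced;
}
```

Flushing only the work returns immediately because the irq_work has not queued it yet; syncing the irq_work first closes that window.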

Merge branch 'for-7.1-fixes' into for-7.2

Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched

local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of
the scheduler instance actually managing the task. When
CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler
whose ops.dequeue() callback differs from root's. Using scx_root causes
the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks
dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have
SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is
never invoked, violating the custody exit semantics.

Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq()
and move_local_task_to_local_dsq(), and propagating the correct scheduler
from their callers dispatch_enqueue(), move_task_between_dsqs(), and
consume_dispatch_q().

This is consistent with dispatch_enqueue()'s non-local path which already
passes 'sch' directly to call_task_dequeue() for global/bypass DSQs.

Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
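
A minimal sketch of the difference between consulting the root and consulting the task's own scheduler. The structs and model_call_task_dequeue() are hypothetical; a counter stands in for the ops.dequeue() callback and a bool for SCX_TASK_IN_CUSTODY.

```c
#include <assert.h>
#include <stdbool.h>

struct sched { int dequeue_calls; };	/* counts ops.dequeue() calls */

struct task {
	struct sched *sched;		/* scheduler managing the task */
	bool in_custody;		/* models SCX_TASK_IN_CUSTODY */
};

static void model_call_task_dequeue(struct sched *sch, struct task *p)
{
	if (!p->in_custody)
		return;
	p->in_custody = false;
	sch->dequeue_calls++;		/* the callback of @sch is consulted */
}

/* Returns how often the sub-scheduler's dequeue callback ran. */
static int sub_dequeue_calls(bool use_task_sched)
{
	struct sched root = { 0 }, sub = { 0 };
	struct task p = { .sched = &sub, .in_custody = true };

	model_call_task_dequeue(use_task_sched ? p.sched : &root, &p);
	return sub.dequeue_calls;
}
```

Passing the root clears the custody flag without ever reaching the sub-scheduler, which is the mismatch the commit removes.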

selftests/sched_ext: Include common.bpf.h to avoid build failure

In the scx-cid patchsets, the sched_ext selftests failed to build with the
following error:

non_scx_kfunc_deny.bpf.c:17:6: error: conflicting types for 'scx_bpf_kick_cpu'
   17 | void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
      |      ^
tools/testing/selftests/sched_ext/build/include/vmlinux.h:136300:13: note: previous declaration is here
 136300 | extern void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux *aux) __weak __ksym;
        |             ^
non_scx_kfunc_deny.bpf.c:26:23: error: too few arguments to function call, expected 3, have 2
   26 |         scx_bpf_kick_cpu(0, 0);
      |         ~~~~~~~~~~~~~~~~     ^
tools/testing/selftests/sched_ext/build/include/vmlinux.h:136300:13: note: 'scx_bpf_kick_cpu' declared here
 136300 | extern void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux *aux) __weak __ksym;

The root cause is in the scx core, but we can avoid the failure by
including common.bpf.h and removing the local scx_bpf_kick_cpu()
declaration, which makes the test more robust and matches the usage in
the other *.bpf.c files.

Link: https://lore.kernel.org/sched-ext/20260421071945.3110084-1-tj@kernel.org/
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Tested-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge branch 'for-7.1-fixes' into for-7.2

Pull to receive:

05909810a946 ("tools/sched_ext: scx_qmap: Silence task_ctx lookup miss")

which conflicts with the cid-form qmap rework on for-7.2. Resolved
by applying the same silence-on-NULL semantics to the arena-backed
lookup_task_ctx() and qmap_select_cpu() on for-7.2.

Signed-off-by: Tejun Heo <tj@kernel.org>

tools/sched_ext: scx_qmap: Silence task_ctx lookup miss

scx_fork() dispatches ops.init_task to exactly one scheduler - the one
owning the forking task's cgroup. A task forked inside a sub-scheduler's
cgroup is init'd into the sub only; the root scheduler has no task_ctx
entry for it. When that task later appears as @prev in the root's
qmap_dispatch() (or flows through core-sched comparison via task_qdist),
the bpf_task_storage_get() legitimately misses.

qmap treated those misses as fatal via scx_bpf_error("task_ctx lookup
failed") and aborted the scheduler as soon as the first cross-sched
task hit the root. Drop the error in the sites where the miss is
legitimate: lookup_task_ctx() (helper; callers already check for NULL),
qmap_dispatch()'s @prev branch (bookkeeping-only), task_qdist()
(returns 0 which makes the comparison a no-op), and qmap_select_cpu()
(returns prev_cpu as a no-op fallback instead of -ESRCH). The existing
scx_error was a paranoid guard from the pre-sub-sched world where every
task was owned by the one and only scheduler.

v2: qmap_select_cpu() returns prev_cpu on NULL instead of -ESRCH, so
the root scheduler doesn't error on cross-sched tasks that pass
through it (Andrea Righi).

Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
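
The silence-on-miss behavior for the select_cpu path can be sketched as follows. struct task_ctx with a preferred_cpu field and both functions are hypothetical; they only illustrate returning prev_cpu as a no-op fallback when the per-task context is absent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_ctx { int preferred_cpu; };	/* hypothetical per-task state */

static int model_select_cpu(const struct task_ctx *taskc, int prev_cpu)
{
	/* A miss is legitimate: the task belongs to another scheduler. */
	if (!taskc)
		return prev_cpu;
	return taskc->preferred_cpu;
}

static int demo_select_cpu(bool have_ctx)
{
	struct task_ctx taskc = { .preferred_cpu = 5 };

	return model_select_cpu(have_ctx ? &taskc : NULL, 3);
}
```

A task without context stays on its previous CPU instead of aborting the scheduler.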

rhashtable: Bounce deferred worker kick through irq_work

Inserts past 75% load call schedule_work(&ht->run_work) to kick an
async resize. If a caller holds a raw spinlock (e.g. an
insecure_elasticity user), schedule_work() under that lock records

  caller_lock -> pool->lock -> pi_lock -> rq->__lock

A cycle forms if any of these locks is acquired in the reverse
direction elsewhere. sched_ext, the only current insecure_elasticity
user, hits this: it holds scx_sched_lock across rhashtable inserts of
sub-schedulers, while scx_bypass() takes rq->__lock -> scx_sched_lock.
Exercising the resize path produces:

  Chain exists of:
    &pool->lock --> &rq->__lock --> scx_sched_lock

Bounce the kick from the insert paths through irq_work so
schedule_work() runs from hard IRQ context with the caller's lock no
longer held. rht_deferred_worker()'s self-rearm on error stays on
schedule_work(&ht->run_work) - the worker runs in process context with
no caller lock held, and keeping the self-requeue on @run_work lets
cancel_work_sync() in rhashtable_free_and_destroy() drain it.

v3: Keep rht_deferred_worker()'s self-rearm on schedule_work(&run_work).
    Routing it through irq_work in v2 broke cancel_work_sync()'s
    self-requeue handling - an irq_work queued after irq_work_sync()
    returned but while cancel_work_sync() was still waiting could fire
    post-teardown.

v2: Bounce unconditionally instead of gating on insecure_elasticity,
    as suggested by Herbert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

tools/sched_ext: Remove unused nr_cpus in scx_cpu0

The nr_cpus variable is defined in scx_cpu0.bpf.c but never used in
the BPF logic. Remove it from both the BPF and userspace sides.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge branch 'for-7.1-fixes' into for-7.2

Pull to receive:

2d2b026c3ea7 ("sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs")

which modifies scx_kfunc_context_filter() to avoid conflicts with planned
changes in for-7.2.

Signed-off-by: Tejun Heo <tj@kernel.org>

selftests/sched_ext: Add non_scx_kfunc_deny test

Verify that the BPF verifier rejects a non-SCX struct_ops program
(tcp_congestion_ops) that attempts to call an SCX kfunc (scx_bpf_kick_cpu).
The test expects the load to fail with -EACCES from scx_kfunc_context_filter.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs

scx_kfunc_context_filter() currently allows non-SCX struct_ops programs
(e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong
for two reasons:

- It is semantically incorrect: a TCP congestion control program has no
  business calling SCX kfuncs such as scx_bpf_kick_cpu().

- With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call
  scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux)
  and casts the result to struct sched_ext_ops * before reading ops->priv.
  For a non-SCX struct_ops program the returned pointer is the kdata of
  that struct_ops type, which is far smaller than sched_ext_ops, making
  the read an out-of-bounds access (confirmed with KASAN).

Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as
well, and deny all SCX kfuncs for any struct_ops program that is not the
SCX struct_ops. This addresses both issues: the semantic contract is
enforced at the verifier level, and the runtime out-of-bounds access
becomes unreachable.

Fixes: d1d3c1c6ae36 ("sched_ext: Add verifier-time kfunc context filter")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
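
A heavily simplified model of the deny rule, assuming program types are reduced to a three-value enum and struct_ops types are told apart by pointer identity. None of these names come from the kernel, and the real filter applies further per-context rules that are not modeled here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum prog_type { PROG_TRACING, PROG_SYSCALL, PROG_STRUCT_OPS };

struct prog {
	enum prog_type type;
	const void *st_ops;	/* struct_ops type the program belongs to */
};

/* Distinct objects used only as identity tokens for struct_ops types. */
static const int scx_ops_id = 1;
static const int tcp_ops_id = 2;

static int model_kfunc_filter(const struct prog *prog)
{
	/* Any struct_ops program that is not the SCX one is denied. */
	if (prog->type == PROG_STRUCT_OPS && prog->st_ops != &scx_ops_id)
		return -EACCES;
	return 0;
}

static int filter_for(enum prog_type type, int is_scx)
{
	struct prog p = { .type = type, .st_ops = NULL };

	if (type == PROG_STRUCT_OPS)
		p.st_ops = is_scx ? &scx_ops_id : &tcp_ops_id;
	return model_kfunc_filter(&p);
}
```

Rejecting the program at load time means the oversized read of the foreign struct_ops data can no longer be reached.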

sched_ext: Documentation: clarify arena-backed doubly-linked lists in scx_qmap

Update scx_qmap description to reflect arena-backed doubly-linked
lists with per-queue bpf_res_spin_lock.

Also update scx_qmap.bpf.c to reflect the switch from PIDs to TIDs.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Documentation: add note about multiple ops.enqueue() calls in a row

Commit 84b1a0ea0b7c
("sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs")
introduced the possibility of ops.enqueue() being called multiple times
in a row for the same task without intervening calls to ops.dequeue().
Document this behavior as it may be surprising to some.

Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup

BPF schedulers that can't hold task_struct pointers (arena-backed ones in
particular) key tasks by pid. During exit, pid is released before the
task finishes passing through scheduler callbacks, so a dying task
becomes invisible to the BPF side mid-schedule. scx_qmap hits this: an
exiting task's dispatch callback can't recover its queue entry, stalling
dispatch until SCX_EXIT_ERROR_STALL.

Add a unique non-zero u64 p->scx.tid assigned at fork that survives the
full task lifetime including exit. scx_bpf_tid_to_task() looks up the
task; unlike bpf_task_from_pid(), it handles exiting tasks.

The lookup costs an rhashtable insert/remove under scx_tasks_lock, so
root schedulers opt in via SCX_OPS_TID_TO_TASK. Sub-schedulers that set
the flag to declare a dependency are rejected at attach if root didn't
opt in.

scx_qmap converted: keys tasks by tid and enables SCX_OPS_ENQ_EXITING.
Pre-patch it stalls within seconds under a non-leader-exec workload;
with the patch it runs cleanly.

v3: Warn on rhashtable_lookup_insert_fast() failure via new
    scx_tid_hash_insert() helper (Cheng-Yang Chou).

v2: Guard scx_root deref in scx_bpf_tid_to_task() error path. The kfunc
    is registered via scx_kfunc_set_any and reachable from tracing and
    syscall programs when no scheduler is attached (Cheng-Yang Chou).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
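
Why a separate tid helps during exit can be shown with a toy task table. Everything below is a hypothetical userspace model: a linear array replaces the rhashtable, setting pid to 0 stands in for the pid being released, and there is no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 8

struct task { int pid; uint64_t tid; };

static struct task *tasks[MAX_TASKS];
static uint64_t next_tid = 1;		/* 0 is never handed out */

static bool task_fork(struct task *p, int pid)
{
	for (size_t i = 0; i < MAX_TASKS; i++) {
		if (!tasks[i]) {
			p->pid = pid;
			p->tid = next_tid++;
			tasks[i] = p;
			return true;
		}
	}
	return false;
}

static void task_free(struct task *p)
{
	for (size_t i = 0; i < MAX_TASKS; i++)
		if (tasks[i] == p)
			tasks[i] = NULL;
}

static struct task *pid_to_task(int pid)
{
	for (size_t i = 0; i < MAX_TASKS; i++)
		if (tasks[i] && tasks[i]->pid == pid)
			return tasks[i];
	return NULL;
}

static struct task *tid_to_task(uint64_t tid)
{
	for (size_t i = 0; i < MAX_TASKS; i++)
		if (tasks[i] && tasks[i]->tid == tid)
			return tasks[i];
	return NULL;
}

/* Looks up a task that has released its pid but is not freed yet. */
static bool exiting_task_visible(bool by_tid)
{
	struct task p;
	bool found;

	task_fork(&p, 42);
	p.pid = 0;			/* pid released early during exit */
	found = (by_tid ? tid_to_task(p.tid) : pid_to_task(42)) == &p;
	task_free(&p);			/* tid entry lives until here */
	return found;
}
```

The pid lookup loses the exiting task while the tid lookup still resolves it until the task is freed.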

Merge branch 'for-7.1-fixes' into for-7.2

Pull to receive 73bd1227787b ("rhashtable: Restore insecure_elasticity
toggle") as a dependency for upcoming patches that use
.insecure_elasticity = true on their rhashtables.

Signed-off-by: Tejun Heo <tj@kernel.org>

tools/sched_ext: Remove dead -d option in scx_flatcg

The -d option was non-functional, only toggling a variable that was
echoed in the status line but never used to dump the cgroup hierarchy.
Remove the option to avoid documenting dead code as a feature.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Document the ops compat strategy in compat.h/compat.bpf.h

The comments around SCX_OPS_DEFINE() and SCX_OPS_OPEN() were vague about
how backward compatibility actually works. Expand them to describe the two
mechanisms: load-time BTF fix-up for additive changes, and multi-variant
struct_ops for incompatible ones.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

sched_ext: Mark scx_sched_hash insecure_elasticity

scx_sched_hash is inserted into under scx_sched_lock (raw_spinlock_irq)
in scx_link_sched(). rhashtable's sync grow path calls get_random_u32()
and does a GFP_ATOMIC allocation; both acquire regular spinlocks, which
is unsafe under raw_spinlock_t. Set insecure_elasticity to skip the
sync grow.

v2:
- Dropped dsq_hash changes. Insertion is not under raw_spin_lock.

- Switched from no_sync_grow flag to insecure_elasticity.

Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>

rhashtable: Restore insecure_elasticity toggle

Some users of rhashtable cannot handle insertion failures, and
are happy to accept the consequences of a hash table having
very long chains.

Restore the insecure_elasticity toggle for these users. In
addition to disabling the chain length checks, this also removes
the emergency resize that would otherwise occur when the hash
table occupancy hits 100% (an async resize is still scheduled
at 75%).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
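
The occupancy side of the toggle can be sketched as a decision function. This is an assumption-laden model: the struct, the enum and the exact comparisons (more than three quarters of the size, more than the full size) are illustrative choices based on the description above, and the chain length checks are not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>

enum resize_action { RESIZE_NONE, RESIZE_ASYNC, RESIZE_SYNC };

struct tbl_model {
	unsigned int size;		/* number of buckets */
	unsigned int nelems;		/* number of stored elements */
	bool insecure_elasticity;
};

static enum resize_action after_insert(const struct tbl_model *t)
{
	/* Emergency resize past 100% occupancy, unless the toggle is set. */
	if (!t->insecure_elasticity && t->nelems > t->size)
		return RESIZE_SYNC;
	/* The deferred resize past 75% occupancy is always scheduled. */
	if (t->nelems > t->size / 4 * 3)
		return RESIZE_ASYNC;
	return RESIZE_NONE;
}

static enum resize_action action_for(unsigned int nelems, bool insecure)
{
	struct tbl_model t = {
		.size = 64,
		.nelems = nelems,
		.insecure_elasticity = insecure,
	};

	return after_insert(&t);
}
```

With the toggle set, an overfull table in this model only ever asks for the deferred resize.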

sched_ext: Print sub-scheduler disabled log and reason

Taking scx_qmap as an example, when a sub-scheduler is attached there is
a 'BPF sub-scheduler "qmap" enabled' message, but when it is detached the
corresponding log is missing. Add a new function that prints the disabled
log and reason; it can be used by both the root scheduler and
sub-schedulers.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

tools/sched_ext: Add missing -c option in scx_qmap help

The sub-scheduler API has been added to scx_qmap, but the new -c option is
missing from the help output, which makes it hard to discover and use. Add
it to the help.

V2: add [-c CG_PATH] to the usage synopsis.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

tools/sched_ext: Handle migration-disabled tasks in scx_central

When a task calls migrate_disable(), p->cpus_ptr is not updated until
migrate_disable_switch() runs during context switch, so dispatch_to_cpu()
may dequeue such a task and dispatch it to a CPU it cannot run on.

Extend the mismatch check in dispatch_to_cpu() to also test
is_migration_disabled() alongside the cpumask check, so tasks in this
window are bounced to the fallback DSQ.

Suggested-by: Andrea Righi <arighi@nvidia.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
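
The extended mismatch check reduces to one boolean expression. In this sketch struct task_model, the bitmask representation of the allowed CPUs and both helpers are hypothetical, and cpu is assumed to be smaller than the number of bits in unsigned long.

```c
#include <assert.h>
#include <stdbool.h>

struct task_model {
	unsigned long cpus_mask;	/* bit N set: CPU N is allowed */
	bool migration_disabled;	/* models is_migration_disabled() */
};

/* True if the task must be bounced to the fallback DSQ. */
static bool must_use_fallback(const struct task_model *p, unsigned int cpu)
{
	bool allowed = (p->cpus_mask >> cpu) & 1UL;

	/* The mask alone is not enough while migration is disabled. */
	return !allowed || p->migration_disabled;
}

static bool fallback_for(unsigned long mask, bool migration_disabled,
			 unsigned int cpu)
{
	struct task_model p = {
		.cpus_mask = mask,
		.migration_disabled = migration_disabled,
	};

	return must_use_fallback(&p, cpu);
}
```

A task whose mask still lists the target CPU is bounced anyway once migration is disabled.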

sched_ext: scx_qmap: replace FIFO queue maps with arena-backed lists

Arena simplifies verification and allows more natural programming.
Convert scx_qmap to arena as preparation for further sub-sched work.

Replace the five BPF_MAP_TYPE_QUEUE maps with doubly-linked lists in
arena, threaded through task_ctx. Each queue is a struct qmap_fifo with
head/tail pointers and its own per-queue bpf_res_spin_lock.

qmap_dequeue() now properly removes tasks from the queue instead of
leaving stale entries for dispatch to skip.

v2:
- Remove duplicate QMAP_TOUCH_ARENA() in qmap_dump_task (Andrea).
- Update file-level description for arena-backed lists (Andrea).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
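
A doubly-linked FIFO threaded through the per-task context might look like the sketch below. It is plain userspace C with hypothetical field and function names; the arena annotations and the per-queue bpf_res_spin_lock are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct fifo;

struct task_ctx {
	struct task_ctx *q_prev, *q_next;
	struct fifo *fifo;		/* queue the task is on, or NULL */
	int tid;
};

struct fifo {
	struct task_ctx *head, *tail;
};

static void fifo_push(struct fifo *q, struct task_ctx *t)
{
	t->q_next = NULL;
	t->q_prev = q->tail;
	if (q->tail)
		q->tail->q_next = t;
	else
		q->head = t;
	q->tail = t;
	t->fifo = q;
}

/* Unlink @t from whatever queue it is on; works for any position. */
static void fifo_remove(struct task_ctx *t)
{
	struct fifo *q = t->fifo;

	if (!q)
		return;
	if (t->q_prev)
		t->q_prev->q_next = t->q_next;
	else
		q->head = t->q_next;
	if (t->q_next)
		t->q_next->q_prev = t->q_prev;
	else
		q->tail = t->q_prev;
	t->q_prev = t->q_next = NULL;
	t->fifo = NULL;
}

static struct task_ctx *fifo_pop(struct fifo *q)
{
	struct task_ctx *t = q->head;

	if (t)
		fifo_remove(t);
	return t;
}

/* Push 1, 2, 3, dequeue 2 from the middle, then pop the rest in order. */
static int demo_fifo(void)
{
	struct fifo q = { NULL, NULL };
	struct task_ctx a = { .tid = 1 }, b = { .tid = 2 }, c = { .tid = 3 };
	struct task_ctx *t;
	int order = 0;

	fifo_push(&q, &a);
	fifo_push(&q, &b);
	fifo_push(&q, &c);
	fifo_remove(&b);
	while ((t = fifo_pop(&q)))
		order = order * 10 + t->tid;
	return order;			/* pop order packed as digits */
}
```

Because each entry knows its neighbours, a dequeue can unlink it directly instead of leaving a stale entry for dispatch to skip.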

sched_ext: scx_qmap: move task_ctx into a BPF arena slab

Arena simplifies verification and allows more natural programming.
Convert scx_qmap to arena as preparation for further sub-sched work.

Allocate per-task context from an arena slab instead of storing it
directly in task_storage. task_ctx_stor now holds an arena pointer to
the task's slab entry. Free entries form a singly-linked list protected
by bpf_res_spin_lock; slab exhaustion triggers scx_bpf_error().

The slab size is configurable via the new -N option (default 16384).

Also add bpf_res_spin_lock/unlock declarations to common.bpf.h.

Scheduling logic unchanged.

v2: Add task_ctx_t typedef for struct task_ctx __arena (Emil).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
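
The free-list idea behind the slab can be sketched in ordinary C. The array size, struct layout and function names are hypothetical, the lock is omitted, and exhaustion is reported by returning NULL rather than by raising a scheduler error.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_ENTRIES 4

struct task_ctx {
	struct task_ctx *next_free;	/* link while on the free list */
	int payload;
};

static struct task_ctx slab[SLAB_ENTRIES];
static struct task_ctx *free_head;

static void slab_init(void)
{
	free_head = NULL;
	for (size_t i = SLAB_ENTRIES; i-- > 0; ) {
		slab[i].next_free = free_head;
		free_head = &slab[i];
	}
}

static struct task_ctx *slab_alloc(void)
{
	struct task_ctx *t = free_head;

	if (!t)
		return NULL;		/* slab exhausted */
	free_head = t->next_free;
	t->next_free = NULL;
	return t;
}

static void slab_free(struct task_ctx *t)
{
	t->next_free = free_head;
	free_head = t;
}

/* Returns 1 if allocation, exhaustion and reuse behave as expected. */
static int demo_slab(void)
{
	struct task_ctx *held[SLAB_ENTRIES];
	int ok = 1;

	slab_init();
	for (size_t i = 0; i < SLAB_ENTRIES; i++) {
		held[i] = slab_alloc();
		ok &= held[i] != NULL;
	}
	ok &= slab_alloc() == NULL;	/* fifth allocation fails */
	slab_free(held[2]);
	ok &= slab_alloc() == held[2];	/* freed entry is reused */
	return ok;
}
```

Allocation and free are both constant-time pointer swaps on the list head.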

sched_ext: scx_qmap: move globals and cpu_ctx into a BPF arena map

Arena simplifies verification and allows more natural programming.
Convert scx_qmap to arena as preparation for further sub-sched work.

Move scheduler state from BSS globals and a percpu array map
into a single BPF arena map. A shared struct qmap_arena is declared as
an __arena global so BPF accesses it directly and userspace reaches it
through skel->arena->qa.

Scheduling logic unchanged; only memory backing changes.

v2: Drop "mutable" from comments.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

sched_ext: scx_qmap: rename tctx to taskc

Rename the per-task context local variable from tctx to taskc for
consistency.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

Merge tag 'v7.1-rc-part1-smbdirect-fixes' of git://git.samba.org/ksmbd

Pull smbdirect updates from Steve French:
"Move smbdirect server and client code to common directory:

   - temporary use of smbdirect_all_c_files.c to allow micro steps

   - factor out common functions into a smbdirect.ko.

   - convert cifs.ko to use smbdirect.ko

   - convert ksmbd.ko to use smbdirect.ko

   - let smbdirect.ko use global workqueues

   - move ib_client logic from ksmbd.ko into smbdirect.ko

   - remove smbdirect_all_c_files.c hack again

   - some locking and teardown related fixes on top"

* tag 'v7.1-rc-part1-smbdirect-fixes' of git://git.samba.org/ksmbd: (145 commits)
  smb: smbdirect: let smbdirect_connection_deregister_mr_io unlock while waiting
  smb: smbdirect: fix the logic in smbdirect_socket_destroy_sync() without an error
  smb: smbdirect: fix copyright header of smbdirect.h
  smb: smbdirect: change smbdirect_socket_parameters.{initiator_depth,responder_resources} to __u16
  smb: smbdirect: remove unused SMBDIRECT_USE_INLINE_C_FILES logic
  smb: server: no longer use smbdirect_socket_set_custom_workqueue()
  smb: client: no longer use smbdirect_socket_set_custom_workqueue()
  smb: smbdirect: introduce global workqueues
  smb: smbdirect: prepare use of dedicated workqueues for different steps
  smb: smbdirect: remove unused smbdirect_connection_mr_io_recovery_work()
  smb: smbdirect: wrap rdma_disconnect() in rdma_[un]lock_handler()
  smb: server: make use of smbdirect_netdev_rdma_capable_mode_type()
  smb: smbdirect: introduce smbdirect_netdev_rdma_capable_mode_type()
  smb: server: make use of smbdirect.ko
  smb: server: remove unused ksmbd_transport_ops.prepare()
  smb: server: make use of smbdirect_socket_{listen,accept}()
  smb: server: only use public smbdirect functions
  smb: server: make use of smbdirect_socket_create_accepting()/smbdirect_socket_release()
  smb: server: make use of smbdirect_{socket_init_accepting,connection_wait_for_connected}()
  smb: server: make use of smbdirect_connection_send_iter() and related functions
  ...

Merge tag 'livepatching-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching updates from Petr Mladek:

- Add two new selftests

* tag 'livepatching-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  selftests/livepatch: add test for module function patching
  selftests: livepatch: test-ftrace: livepatch a traced function

Merge tag 'm68k-for-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k

Pull m68k updates from Geert Uytterhoeven:

- Add support for QEMU virt-ctrl, and use it for system reset
   and power off on the virt platform

- defconfig updates

- Miscellaneous fixes and improvements

* tag 'm68k-for-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: virt: Switch to qemu-virt-ctrl driver
  power: reset: Add QEMU virt-ctrl driver
  m68k: defconfig: Update defconfigs for v7.0-rc1
  m68k: emu: Replace unbounded sprintf() in nfhd_init_one()
  m68k: uapi: Add ucontext.h
  m68k: defconfig: hp300: Enable monochrome and 16-color linux logos
  m68k: q40: Remove commented out code

Merge tag 'efi-next-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:
"Again not a busy cycle for EFI, just some minor tweaks and bug fixes:

   - Enable boot graphics resource table (BGRT) on Xen/x86

   - Correct a misguided assumption in the memory attributes table
     sanity check

   - Start tagging efi_mem_reserve()'d regions as MEMBLOCK_RSRV_KERN

   - Some other minor fixes and cleanups"

* tag 'efi-next-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/capsule-loader: fix incorrect sizeof in phys array reallocation
  efi: Tag memblock reservations of boot services regions as RSRV_KERN
  memblock: Permit existing reserved regions to be marked RSRV_KERN
  efi/memattr: Fix thinko in table size sanity check
  efi: libstub: fix type of fdt 32 and 64bit variables
  efi: Drop unused efi_range_is_wc() function
  efi: Enable BGRT loading under Xen
  efi: make efi_mem_type() and efi_mem_attributes() work on Xen PV

Merge tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

- Update QAT vfio-pci variant driver for Gen 5, 420xx devices (Vijay
   Sundar Selvamani, Suman Kumar Chakraborty, Giovanni Cabiddu)

- Fix the vfio MMIO DMA mapping selftest (Alex Mastro)

- Conversions to const struct class in support of class_create()
   deprecation (Jori Koolstra)

- Improve selftest compiler compatibility by avoiding initializer on
   variable-length array (Manish Honap)

- Define new uAPI for drivers supporting migration to advise
   userspace of new initial data for reducing target startup latency.
   Implemented for mlx5 vfio-pci variant driver (Yishai Hadas)

- Enable vfio selftests on aarch64, not just cross-compiles reporting
   arm64 (Ted Logan)

- Update vfio selftest driver support to include additional DSA devices
   (Yi Lai)

- Unconditionally include debugfs root pointer in vfio device struct,
   avoiding a build failure seen in hisi_acc variant driver without
   debugfs otherwise (Arnd Bergmann)

- Add support for the s390 ISM (Internal Shared Memory) device via a
   new variant driver. The device is unique in the size of its BAR space
   (256TiB) and lack of mmap support (Julian Ruess)

- Enforce that vfio-pci drivers implement a name in their ops structure
   for use in sequestering SR-IOV VFs (Alex Williamson)

- Prune leftover group notifier code (Paolo Bonzini)

- Fix Xe vfio-pci variant driver to avoid migration support as a
   dependency in the reset path and missing release call (Michał
   Winiarski)

* tag 'vfio-v7.1-rc1' of https://github.com/awilliam/linux-vfio: (23 commits)
  vfio/xe: Add a missing vfio_pci_core_release_dev()
  vfio/xe: Reorganize the init to decouple migration from reset
  vfio: remove dead notifier code
  vfio/pci: Require vfio_device_ops.name
  MAINTAINERS: add VFIO ISM PCI DRIVER section
  vfio/ism: Implement vfio_pci driver for ISM devices
  vfio/pci: Rename vfio_config_do_rw() to vfio_pci_config_rw_single() and export it
  vfio: unhide vdev->debug_root
  vfio/qat: add support for Intel QAT 420xx VFs
  vfio: selftests: Support DMR and GNR-D DSA devices
  vfio: selftests: Build tests on aarch64
  vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  vfio/mlx5: consider inflight SAVE during PRE_COPY
  net/mlx5: Add IFC bits for migration state
  vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl
  vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2
  vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase
  vfio: selftests: Fix VLA initialisation in vfio_pci_irq_set()
  vfio: uapi: fix comment typo
  vfio: mdev: replace mtty_dev->vd_class with a const struct class
  ...

Merge branch 'for-7.1/module-function-test' into for-linus

smb: smbdirect: let smbdirect_connection_deregister_mr_io unlock while waiting

We should not hold a mutex locked during wait_for_completion();
holding a reference is enough.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: fix the logic in smbdirect_socket_destroy_sync() without an error

If smbdirect_socket_destroy_sync() is called and sc->first_error
was not set, we should set -ESHUTDOWN. That's a better condition
than doing it only implicitly with the
sc->status < SMBDIRECT_SOCKET_DISCONNECTING check.
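
The explicit condition described above can be sketched in plain
user-space C. This is only an illustration: struct sketch_socket and
sketch_destroy_sync() are hypothetical names, and the return value is
an assumption made so the behaviour is easy to observe. Only the
"set -ESHUTDOWN when first_error is unset" rule comes from the commit
message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in; the real struct smbdirect_socket has more fields. */
struct sketch_socket {
	int first_error;	/* 0 while no error has been recorded */
};

/*
 * Record -ESHUTDOWN only if no earlier error was stored, so the first
 * recorded error is never overwritten. Returning the stored value is an
 * assumption of this sketch, not something the commit states.
 */
static int sketch_destroy_sync(struct sketch_socket *sc)
{
	assert(sc != NULL);

	if (sc->first_error == 0)
		sc->first_error = -ESHUTDOWN;

	return sc->first_error;
}
```

Checking first_error directly keeps the decision independent of the
connection status, which is the point the message makes.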

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: fix copyright header of smbdirect.h

Everything in smbdirect.h was taken from my out-of-tree
prototype.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: change smbdirect_socket_parameters.{initiator_depth,responder_resources} to __u16

We still limit this to U8_MAX as the RDMA API only uses __u8
and that's also the limit for InfiniBand and RoCE*,
while iWARP would be able to support larger values at
the protocol level.

As struct smbdirect_socket_parameters will be part
of the uapi for IPPROTO_SMBDIRECT in future, change it
now even if userspace sockets won't be supported yet.
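
A minimal user-space sketch of "16-bit fields, still limited to
U8_MAX" follows. The struct and function names are hypothetical, and
clamping is an assumption: the commit message only says the values
are limited, not whether out-of-range input is clamped or rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of only the two widened fields. */
struct sketch_parameters {
	uint16_t initiator_depth;
	uint16_t responder_resources;
};

/*
 * Limit both 16-bit fields to what an 8-bit API can carry.
 * Clamping (rather than returning an error) is an assumption
 * made for this sketch.
 */
static void sketch_limit_parameters(struct sketch_parameters *p)
{
	assert(p != NULL);

	if (p->initiator_depth > UINT8_MAX)
		p->initiator_depth = UINT8_MAX;
	if (p->responder_resources > UINT8_MAX)
		p->responder_resources = UINT8_MAX;
}
```

Widening the fields first means the structure layout would not need
to change if a larger limit were ever allowed.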

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Acked-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: remove unused SMBDIRECT_USE_INLINE_C_FILES logic

We always build as standalone module (or as part of the core kernel).

This also removes unused elements from struct smbdirect_socket
and unused exports.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: no longer use smbdirect_socket_set_custom_workqueue()

smbdirect.ko has global workqueues now, so we should use these
default ones.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: no longer use smbdirect_socket_set_custom_workqueue()

smbdirect.ko has global workqueues now, so we should use these
default ones.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce global workqueues

These will be used in future and callers should no
longer use smbdirect_socket_set_custom_workqueue().

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: prepare use of dedicated workqueues for different steps

This is a preparation in order to have global workqueues in
the smbdirect module instead of having the caller
provide one.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: remove unused smbdirect_connection_mr_io_recovery_work()

This would actually never be used as we only move to
SMBDIRECT_MR_ERROR when we directly call
smbdirect_socket_schedule_cleanup().

Doing an ib_dereg_mr/ib_alloc_mr dance on a
working connection is not needed and
it's also pointless on a broken connection
as we don't reuse any ib_pd.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: wrap rdma_disconnect() in rdma_[un]lock_handler()

This might not be needed, but it controls the order
of ib_drain_qp() and rdma_disconnect().

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make use of smbdirect_netdev_rdma_capable_mode_type()

This removes what is basically the same logic.

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: smbdirect: introduce smbdirect_netdev_rdma_capable_mode_type()

This is basically a copy of ksmbd_rdma_capable_netdev() in the
server, but this also prints a message when a device is renamed.

The differences are:
- It uses rdma_for_each_port() instead of implementing the
same logic again.
- It returns RDMA_NODE_{UNSPECIFIED,IB_CA,RNIC} values instead of bool

Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make use of smbdirect.ko

This means we no longer inline the common smbdirect
.c files and use the exported functions from the
module instead.

Note the connection-specific logging is still
redirected to ksmbd.ko functions via
smbdirect_socket_set_logging().

We still don't use the real socket layer,
but we're very close...

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: remove unused ksmbd_transport_ops.prepare()

This is no longer needed for smbdirect.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make use of smbdirect_socket_{listen,accept}()

We no longer need the custom rdma listener.

The code logic is now very similar to transport_tcp.c,
using a kernel thread that loops over smbdirect_socket_accept().

This is the first step in the direction of using IPPROTO_SMBDIRECT
sockets in future.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: only use public smbdirect functions

Also remove a lot of unused includes...

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make use of smbdirect_socket_create_accepting()/smbdirect_socket_release()

With this we no longer embed struct smbdirect_socket, which will allow
us to make it private in the following commits.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make use of smbdirect_{socket_init_accepting,connection_wait_for_connected}()

This means we finally only use common functions in the server.

We still use the embedded struct smbdirect_socket and are
able to access internals, but that will be removed in the
next commits as well.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: make use of smbdirect_connection_send_iter() and related functions

This makes use of common code for sending messages, which will
allow us to make more use of common code in the next commits.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: let smb_direct_post_send_data() return data_length

This makes it easier to move to common code shared with the client.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: split out smb_direct_send_iter() out of smb_direct_writev()

This will help to move to common code in future.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: let smbdirect_map_sges_from_iter() truncate the message boundary

smbdirect_map_sges_from_iter() already handles the case that only
a limited number of sges are available. Its return value
is data_length and the remaining bytes in the iter are
remaining_data_length.

This is now much easier and will allow us to share
more code with the client soon.
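
The data_length / remaining_data_length split described above can be
illustrated with a small user-space model. Everything here is
hypothetical (the names, the result struct, and the simplification
that each segment consumes exactly one sge); it is not the signature
of smbdirect_map_sges_from_iter(), only a sketch of "map what fits,
report the rest".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result pair; the real code reports these differently. */
struct sketch_map_result {
	size_t data_length;		/* bytes covered by the available sges */
	size_t remaining_data_length;	/* bytes left for later messages */
};

/*
 * Walk the segment lengths and "map" at most max_sge of them,
 * assuming one sge per segment. Whatever does not fit is counted
 * as remaining instead of being treated as an error.
 */
static struct sketch_map_result
sketch_map_segments(const size_t *seg_len, size_t nr_segs, size_t max_sge)
{
	struct sketch_map_result res = { 0, 0 };
	size_t i;

	assert(seg_len != NULL || nr_segs == 0);

	for (i = 0; i < nr_segs; i++) {
		if (i < max_sge)
			res.data_length += seg_len[i];
		else
			res.remaining_data_length += seg_len[i];
	}

	return res;
}
```

With this shape the caller does not need to pre-compute a message
boundary; it can send data_length bytes and loop while
remaining_data_length is non-zero.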

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: inline smb_direct_create_header() into smb_direct_post_send_data()

The point is that ib_dma_map_single() is done first, but
the 'Fill in the packet header' will be done after
smbdirect_map_sges_from_iter().

This will simplify further changes in order to
share common code with the client.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: server: move iov_iter_kvec() out of smb_direct_post_send_data()

This will allow us to make the code more generic in order
to move it to code shared with the client.

Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>