]> git.ipfire.org Git - thirdparty/kernel/stable.git/log
thirdparty/kernel/stable.git
5 months agoarc/perf: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:39 +0000 (11:16 -0700)] 
arc/perf: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-12-kan.liang@linux.intel.com
5 months agoalpha/perf: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:38 +0000 (11:16 -0700)] 
alpha/perf: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-11-kan.liang@linux.intel.com
5 months agoperf/apple_m1: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:37 +0000 (11:16 -0700)] 
perf/apple_m1: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-10-kan.liang@linux.intel.com
5 months agoperf/arm: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:36 +0000 (11:16 -0700)] 
perf/arm: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-9-kan.liang@linux.intel.com
5 months agos390/perf: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:35 +0000 (11:16 -0700)] 
s390/perf: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-8-kan.liang@linux.intel.com
5 months agopowerpc/perf: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:34 +0000 (11:16 -0700)] 
powerpc/perf: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-7-kan.liang@linux.intel.com
5 months agoperf/x86/zhaoxin: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:33 +0000 (11:16 -0700)] 
perf/x86/zhaoxin: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-6-kan.liang@linux.intel.com
5 months agoperf/x86/amd: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:32 +0000 (11:16 -0700)] 
perf/x86/amd: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-5-kan.liang@linux.intel.com
5 months agoperf/x86/intel: Remove driver-specific throttle support
Kan Liang [Tue, 20 May 2025 18:16:31 +0000 (11:16 -0700)] 
perf/x86/intel: Remove driver-specific throttle support

The throttle support has been added in the generic code. Remove
the driver-specific throttle support.

Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-4-kan.liang@linux.intel.com
5 months agoperf: Only dump the throttle log for the leader
Kan Liang [Tue, 20 May 2025 18:16:30 +0000 (11:16 -0700)] 
perf: Only dump the throttle log for the leader

The PERF_RECORD_THROTTLE records are dumped for all throttled events.
It's not necessary for group events, which are throttled altogether.

Optimize it by only dump the throttle log for the leader.

The sample right after the THROTTLE record must be generated by the
actual target event. It is good enough for the perf tool to locate the
actual target event.

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
5 months agoperf: Fix the throttle logic for a group
Kan Liang [Tue, 20 May 2025 18:16:29 +0000 (11:16 -0700)] 
perf: Fix the throttle logic for a group

The current throttle logic doesn't work well with a group, e.g., the
following sampling-read case.

$ perf record -e "{cycles,cycles}:S" ...

$ perf report -D | grep THROTTLE | tail -2
            THROTTLE events:        426  ( 9.0%)
          UNTHROTTLE events:        425  ( 9.0%)

$ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
... sample_read:
.... group nr 2
..... id 0000000000000327, value 000000000cbb993a, lost 0
..... id 0000000000000328, value 00000002211c26df, lost 0

The second cycles event has a much larger value than the first cycles
event in the same group.

The current throttle logic in the generic code only logs the THROTTLE
event. It relies on the specific driver implementation to disable
events. For all ARCHs, the implementation is similar. Only the event is
disabled, rather than the group.

The logic to disable the group should be generic for all ARCHs. Add the
logic in the generic code. The following patch will remove the buggy
driver-specific implementation.

The throttle only happens when an event is overflowed. Stop the entire
group when any event in the group triggers the throttle.
The MAX_INTERRUPTS is set to all throttle events.

The unthrottled could happen in 3 places.
- event/group sched. All events in the group are scheduled one by one.
  All of them will be unthrottled eventually. Nothing needs to be
  changed.
- The perf_adjust_freq_unthr_events for each tick. Needs to restart the
  group altogether.
- The __perf_event_period(). The whole group needs to be restarted
  altogether as well.

With the fix,
$ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
... sample_read:
.... group nr 2
..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0

Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
5 months agoperf/core: Add the is_event_in_freq_mode() helper to simplify the code
Kan Liang [Fri, 16 May 2025 18:28:38 +0000 (11:28 -0700)] 
perf/core: Add the is_event_in_freq_mode() helper to simplify the code

Add a helper to check if an event is in freq mode to improve readability.

No functional changes.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250516182853.2610284-2-kan.liang@linux.intel.com
5 months agoperf/x86/intel/ds: Remove redundant assignments to sample.period
Changbin Du [Tue, 6 May 2025 09:49:07 +0000 (17:49 +0800)] 
perf/x86/intel/ds: Remove redundant assignments to sample.period

The perf_sample_data_init() has already set the period of sample, so no
need to do it again.

Signed-off-by: Changbin Du <changbin.du@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250506094907.2724-1-changbin.du@huawei.com
5 months agoperf/aux: Allocate non-contiguous AUX pages by default
Yabin Cui [Thu, 8 May 2025 23:26:42 +0000 (16:26 -0700)] 
perf/aux: Allocate non-contiguous AUX pages by default

perf always allocates contiguous AUX pages based on aux_watermark.
However, this contiguous allocation doesn't benefit all PMUs. For
instance, ARM SPE and TRBE operate with virtual pages, and Coresight
ETR allocates a separate buffer. For these PMUs, allocating contiguous
AUX pages unnecessarily exacerbates memory fragmentation. This
fragmentation can prevent their use on long-running devices.

This patch modifies the perf driver to be memory-friendly by default,
by allocating non-contiguous AUX pages. For PMUs requiring contiguous
pages (Intel BTS and some Intel PT), the existing
PERF_PMU_CAP_AUX_NO_SG capability can be used. For PMUs that don't
require but can benefit from contiguous pages (some Intel PT), a new
capability, PERF_PMU_CAP_AUX_PREFER_LARGE, is added to maintain their
existing behavior.

Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250508232642.148767-1-yabinc@google.com
5 months agoperf: Fix confusing aux iteration
Frederic Weisbecker [Thu, 24 Apr 2025 16:11:28 +0000 (18:11 +0200)] 
perf: Fix confusing aux iteration

While an event tears down all links to it as an aux, the iteration
happens on the event's group leader instead of the group itself.

If the event is a group leader, it has no effect because the event is
also its own group leader. But otherwise there would be a risk to detach
all the siblings events from the wrong group leader.

It just happens to work because each sibling's aux link is tested
against the right event before proceeding. Also the ctx lock is the same
for the events and their group leader so the iteration is safe.

Yet the iteration is confusing. Clarify the actual intent.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-5-frederic@kernel.org
5 months agoperf: Remove too early and redundant CPU hotplug handling
Frederic Weisbecker [Thu, 24 Apr 2025 16:11:27 +0000 (18:11 +0200)] 
perf: Remove too early and redundant CPU hotplug handling

The CPU hotplug handlers are called twice: at prepare and online stage.

Their role is to:

1) Enable/disable a CPU context. This is irrelevant and even buggy at
   the prepare stage because the CPU is still offline. On early
   secondary CPU up, creating an event attached to that CPU might
   silently fail because the CPU context is observed as online but the
   context installation's IPI failure is ignored.

2) Update the scope cpumasks and re-migrate the events accordingly in
   the CPU down case. This is irrelevant at the prepare stage.

3) Remove the events attached to the context of the offlining CPU. It
   even uses an (unnecessary) IPI for it. This is also irrelevant at the
   prepare stage.

Also none of the *_PREPARE and *_STARTING architecture perf related CPU
hotplug callbacks rely on CPUHP_PERF_PREPARE.

CPUHP_AP_PERF_ONLINE is enough and the right place to perform the work.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-4-frederic@kernel.org
5 months agoperf: Fix irq work dereferencing garbage
Frederic Weisbecker [Mon, 28 Apr 2025 11:11:47 +0000 (13:11 +0200)] 
perf: Fix irq work dereferencing garbage

The following commit:

da916e96e2de ("perf: Make perf_pmu_unregister() useable")

has introduced two significant event's parent lifecycle changes:

1) An event that has exited now has EVENT_TOMBSTONE as a parent.
   This can result in a situation where the delayed wakeup irq_work can
   accidentally dereference EVENT_TOMBSTONE on:

CPU 0                                          CPU 1
-----                                          -----

__schedule()
    local_irq_disable()
    rq_lock()
    <NMI>
    perf_event_overflow()
        irq_work_queue(&child->pending_irq)
    </NMI>
    perf_event_task_sched_out()
        raw_spin_lock(&ctx->lock)
        ctx_sched_out()
        ctx->is_active = 0
        event_sched_out(child)
        raw_spin_unlock(&ctx->lock)
                                              perf_event_release_kernel(parent)
                                                  perf_remove_from_context(child)
                                                  raw_spin_lock_irq(&ctx->lock)
                                                  // Sees !ctx->is_active
                                                  // Removes from context inline
                                                  __perf_remove_from_context(child)
                                                      perf_child_detach(child)
                                                          event->parent = EVENT_TOMBSTONE
    raw_spin_rq_unlock_irq(rq);
    <IRQ>
    perf_pending_irq()
        perf_event_wakeup(child)
            ring_buffer_wakeup(child)
                rcu_dereference(child->parent->rb) <--- CRASH

This also concerns the call to kill_fasync() on parent->fasync.

2) The final parent reference count decrement can now happen before the
   the final child reference count decrement. ie: the parent can now
   be freed before its child. On PREEMPT_RT, this can result in a
   situation where the delayed wakeup irq_work can accidentally
   dereference a freed parent:

CPU 0                                          CPU 1                              CPU 2
-----                                          -----                              ------

perf_pmu_unregister()
    pmu_detach_events()
       pmu_get_event()
           atomic_long_inc_not_zero(&child->refcount)

                                               <NMI>
                                               perf_event_overflow()
                                                   irq_work_queue(&child->pending_irq);
                                               </NMI>
                                               <IRQ>
                                               irq_work_run()
                                                   wake_irq_workd()
                                               </IRQ>
                                               preempt_schedule_irq()
                                               =========> SWITCH to workd
                                               irq_work_run_list()
                                                   perf_pending_irq()
                                                       perf_event_wakeup(child)
                                                           ring_buffer_wakeup(child)
                                                               event = child->parent

                                                                                  perf_event_release_kernel(parent)
                                                                                      // Not last ref, PMU holds it
                                                                                      put_event(child)
                                                                                      // Last ref
                                                                                      put_event(parent)
                                                                                          free_event()
                                                                                              call_rcu(...)
                                                                                  rcu_core()
                                                                                      free_event_rcu()

                                                               rcu_dereference(event->rb) <--- CRASH

This also concerns the call to kill_fasync() on parent->fasync.

The "easy" solution to 1) is to check that event->parent is not
EVENT_TOMBSTONE on perf_event_wakeup() (including both ring buffer
and fasync uses).

The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
run under rcu_read_lock().

However because of 2), sanity would prescribe to make event::parent
an __rcu pointer and annotate each and every users to prove they are
reliable.

Propose an alternate solution and restore the stable pointer to the
parent until all its children have called _free_event() themselves to
avoid any further accident. Also revert the EVENT_TOMBSTONE design
that is mostly here to determine which caller of perf_event_exit_event()
must perform the refcount decrement on a child event matching the
increment in inherit_event().

Arrange instead for checking the attach state of an event prior to its
removal and decrement the refcount of the child accordingly.

Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
5 months agoperf: Fix failing inherit_event() doing extra refcount decrement on parent
Frederic Weisbecker [Thu, 24 Apr 2025 16:11:25 +0000 (18:11 +0200)] 
perf: Fix failing inherit_event() doing extra refcount decrement on parent

When inherit_event() fails after the child allocation but before the
parent refcount has been incremented, calling put_event() wrongly
decrements the reference to the parent, risking to free it too early.

Also pmu_get_event() can't be holding a reference to the child
concurrently at this point since it is under pmus_srcu critical section.

Fix it with restoring the deleted free_event() function and call it on
the failing child in order to free it directly under the verified
assumption that its refcount is only 1. The refcount to the parent is
then voluntarily omitted.

Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-2-frederic@kernel.org
6 months agoperf/x86: Optimize the is_x86_event
Kan Liang [Thu, 24 Apr 2025 13:47:17 +0000 (06:47 -0700)] 
perf/x86: Optimize the is_x86_event

The current is_x86_event has to go through the hybrid_pmus list to find
the matched pmu, then check if it's a X86 PMU and a X86 event. It's not
necessary.

The X86 PMU has a unique type ID on a non-hybrid machine, and a unique
capability type. They are good enough to do the check.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424134718.311934-5-kan.liang@linux.intel.com
6 months agoperf/x86/intel: Check the X86 leader for ACR group
Kan Liang [Thu, 24 Apr 2025 13:47:16 +0000 (06:47 -0700)] 
perf/x86/intel: Check the X86 leader for ACR group

The auto counter reload group also requires a group flag in the leader.
The leader must be a X86 event.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424134718.311934-4-kan.liang@linux.intel.com
6 months agoperf/core: Fix broken throttling when max_samples_per_tick=1
Qing Wang [Sat, 5 Apr 2025 14:16:35 +0000 (22:16 +0800)] 
perf/core: Fix broken throttling when max_samples_per_tick=1

According to the throttling mechanism, the pmu interrupts number can not
exceed the max_samples_per_tick in one tick. But this mechanism is
ineffective when max_samples_per_tick=1, because the throttling check is
skipped during the first interrupt and only performed when the second
interrupt arrives.

Perhaps this bug may cause little influence in one tick, but if in a
larger time scale, the problem can not be underestimated.

When max_samples_per_tick = 1:
Allowed-interrupts-per-second max-samples-per-second  default-HZ  ARCH
200                           100                     100         X86
500                           250                     250         ARM64
...
Obviously, the pmu interrupt number far exceed the user's expect.

Fixes: e050e3f0a71b ("perf: Fix broken interrupt rate throttling")
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com
6 months agoMerge branch 'perf/urgent'
Peter Zijlstra [Fri, 25 Apr 2025 12:55:20 +0000 (14:55 +0200)] 
Merge branch 'perf/urgent'

Merge urgent fixes for dependencies.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
6 months agoperf/x86/intel/ds: Fix counter backwards of non-precise events counters-snapshotting
Kan Liang [Thu, 24 Apr 2025 13:47:18 +0000 (06:47 -0700)] 
perf/x86/intel/ds: Fix counter backwards of non-precise events counters-snapshotting

The counter backwards may be observed in the PMI handler when
counters-snapshotting some non-precise events in the freq mode.

For the non-precise events, it's possible the counters-snapshotting
records a positive value for an overflowed PEBS event. Then the HW
auto-reload mechanism reset the counter to 0 immediately. Because the
pebs_event_reset is cleared in the freq mode, which doesn't set the
PERF_X86_EVENT_AUTO_RELOAD.
In the PMI handler, 0 will be read rather than the positive value
recorded in the counters-snapshotting record.

The counters-snapshotting case has to be specially handled. Since the
event value has been updated when processing the counters-snapshotting
record, only needs to set the new period for the counter via
x86_pmu_set_period().

Fixes: e02e9b0374c3 ("perf/x86/intel: Support PEBS counters snapshotting")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424134718.311934-6-kan.liang@linux.intel.com
6 months agoperf/x86/intel: Check the X86 leader for pebs_counter_event_group
Kan Liang [Thu, 24 Apr 2025 13:47:15 +0000 (06:47 -0700)] 
perf/x86/intel: Check the X86 leader for pebs_counter_event_group

The PEBS counters snapshotting group also requires a group flag in the
leader. The leader must be a X86 event.

Fixes: e02e9b0374c3 ("perf/x86/intel: Support PEBS counters snapshotting")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424134718.311934-3-kan.liang@linux.intel.com
6 months agoperf/x86/intel: Only check the group flag for X86 leader
Kan Liang [Thu, 24 Apr 2025 13:47:14 +0000 (06:47 -0700)] 
perf/x86/intel: Only check the group flag for X86 leader

A warning in intel_pmu_lbr_counters_reorder() may be triggered by below
perf command.

perf record -e "{cpu-clock,cycles/call-graph="lbr"/}" -- sleep 1

It's because the group is mistakenly treated as a branch counter group.

The hw.flags of the leader are used to determine whether a group is a
branch counters group. However, the hw.flags is only available for a
hardware event. The field to store the flags is a union type. For a
software event, it's a hrtimer. The corresponding bit may be set if the
leader is a software event.

For a branch counter group and other groups that have a group flag
(e.g., topdown, PEBS counters snapshotting, and ACR), the leader must
be a X86 event. Check the X86 event before checking the flag.
The patch only fixes the issue for the branch counter group.
The following patch will fix the other groups.

There may be an alternative way to fix the issue by moving the hw.flags
out of the union type. It should work for now. But it's still possible
that the flags will be used by other types of events later. As long as
that type of event is used as a leader, a similar issue will be
triggered. So the alternative way is dropped.

Fixes: 33744916196b ("perf/x86/intel: Support branch counters logging")
Closes: https://lore.kernel.org/lkml/20250412091423.1839809-1-luogengkun@huaweicloud.com/
Reported-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250424134718.311934-2-kan.liang@linux.intel.com
6 months agoperf/x86: Fix non-sampling (counting) events on certain x86 platforms
Luo Gengkun [Wed, 23 Apr 2025 06:47:24 +0000 (06:47 +0000)] 
perf/x86: Fix non-sampling (counting) events on certain x86 platforms

Perf doesn't work at perf stat for hardware events on certain x86 platforms:

 $perf stat -- sleep 1
 Performance counter stats for 'sleep 1':
             16.44 msec task-clock                       #    0.016 CPUs utilized
                 2      context-switches                 #  121.691 /sec
                 0      cpu-migrations                   #    0.000 /sec
                54      page-faults                      #    3.286 K/sec
   <not supported> cycles
   <not supported> instructions
   <not supported> branches
   <not supported> branch-misses

The reason is that the check in x86_pmu_hw_config() for sampling events is
unexpectedly applied to counting events as well.

It should only impact x86 platforms with limit_period used for non-PEBS
events. For Intel platforms, it should only impact some older platforms,
e.g., HSW, BDW and NHM.

Fixes: 88ec7eedbbd2 ("perf/x86: Fix low freqency setting issue")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250423064724.3716211-1-luogengkun@huaweicloud.com
6 months agoperf/core: Change to POLLERR for pinned events with error
Namhyung Kim [Tue, 22 Apr 2025 22:33:18 +0000 (15:33 -0700)] 
perf/core: Change to POLLERR for pinned events with error

Commit:

  f4b07fd62d4d11d5 ("perf/core: Use POLLHUP for pinned events in error")

started to emit POLLHUP for pinned events in an error state.

But the POLLHUP is also used to signal events that the attached task is
terminated.  To distinguish pinned per-task events in the error state
it would need to check if the task is live.

Change it to POLLERR to make it clear.

Suggested-by: Gabriel Marin <gmx@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250422223318.180343-1-namhyung@kernel.org
6 months agoMerge tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Tue, 22 Apr 2025 17:22:38 +0000 (10:22 -0700)] 
Merge tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - subpage mode fixes:
     - access correct object (folio) when looking up bit offset
     - fix assertion condition for number of blocks per folio
     - fix upper boundary of locking range in hole punch

 - zoned fixes:
     - fix potential deadlock caught by lockdep when zone reporting and
       device freeze run in parallel
     - fix zone write pointer mismatch and NULL pointer dereference when
       metadata are converted from DUP to RAID1

 - fix error handling when reloc inode creation fails

 - in tree-checker, unify error code for header level check

 - block layer: add helpers to read zone capacity

* tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: skip reporting zone for new block group
  block: introduce zone capacity helper
  btrfs: tree-checker: adjust error code for header level check
  btrfs: fix invalid inode pointer after failure to create reloc inode
  btrfs: zoned: return EIO on RAID1 block group write pointer mismatch
  btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()
  btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
  btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()

6 months agoMerge tag 'integrity-6.15-rc3-fix' of https://github.com/linux-integrity/linux
Linus Torvalds [Tue, 22 Apr 2025 16:36:45 +0000 (09:36 -0700)] 
Merge tag 'integrity-6.15-rc3-fix' of https://github.com/linux-integrity/linux

Pull integrity fix from Roberto Sassu:
 "One performance fix to avoid unnecessarily taking the inode lock"

* tag 'integrity-6.15-rc3-fix' of https://github.com/linux-integrity/linux:
  ima: process_measurement() needlessly takes inode_lock() on MAY_READ

6 months agoima: process_measurement() needlessly takes inode_lock() on MAY_READ
Frederick Lawler [Thu, 27 Mar 2025 16:09:11 +0000 (11:09 -0500)] 
ima: process_measurement() needlessly takes inode_lock() on MAY_READ

On IMA policy update, if a measure rule exists in the policy,
IMA_MEASURE is set for ima_policy_flags which makes the violation_check
variable always true. Coupled with a no-action on MAY_READ for a
FILE_CHECK call, we're always taking the inode_lock().

This becomes a performance problem for extremely heavy read-only workloads.
Therefore, prevent this only in the case there's no action to be taken.

Signed-off-by: Frederick Lawler <fred@cloudflare.com>
Acked-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
6 months agoMerge tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Tue, 22 Apr 2025 02:16:29 +0000 (19:16 -0700)] 
Merge tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Use kvzalloc() so that large exit_dump buffer allocations don't fail
   easily

 - Remove cpu.weight / cpu.idle unimplemented warnings which are more
   annoying than helpful.

   This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary. Mark it for
   deprecation

* tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
  sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
  sched_ext: Use kvzalloc for large exit_dump allocation

6 months agoMerge tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Tue, 22 Apr 2025 02:13:25 +0000 (19:13 -0700)] 
Merge tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations

 - Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to
   make life easier for android

* tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset-v1: Add missing support for cpuset_v2_mode
  cgroup: Fix compilation issue due to cgroup_mutex not being exported

6 months agogcc-15: disable '-Wunterminated-string-initialization' entirely for now
Linus Torvalds [Sun, 20 Apr 2025 22:30:53 +0000 (15:30 -0700)] 
gcc-15: disable '-Wunterminated-string-initialization' entirely for now

I had left the warning around but as a non-fatal error to get my gcc-15
builds going, but fixed up some of the most annoying warning cases so
that it wouldn't be *too* verbose.

Because I like the _concept_ of the warning, even if I detested the
implementation to shut it up.

It turns out the implementation to shut it up is even more broken than I
thought, and my "shut up most of the warnings" patch just caused fatal
errors on gcc-14 instead.

I had tested with clang, but when I upgrade my development environment,
I try to do it on all machines because I hate having different systems
to maintain, and hadn't realized that gcc-14 now had issues.

The ACPI case is literally why I wanted to have a *type* that doesn't
trigger the warning (see commit d5d45a7f2619: "gcc-15: make
'unterminated string initialization' just a warning"), instead of
marking individual places as "__nonstring".

But gcc-14 doesn't like that __nonstring location that shut gcc-15 up,
because it's on an array of char arrays, not on one single array:

  drivers/acpi/tables.c:399:1: error: 'nonstring' attribute ignored on objects of type 'const char[][4]' [-Werror=attributes]
    399 | static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst __nonstring = {
        | ^~~~~~

and my attempts to nest it properly with a type had failed, because of
how gcc doesn't like marking the types as having attributes, only
symbols.

There may be some trick to it, but I was already annoyed by the bad
attribute design, now I'm just entirely fed up with it.

I wish gcc had a proper way to say "this type is a *byte* array, not a
string".

The obvious thing would be to distinguish between "char []" and an
explicitly signed "unsigned char []" (as opposed to an implicitly
unsigned char, which is typically an architecture-specific default, but
for the kernel is universal thanks to '-funsigned-char').

But any "we can typedef a 8-bit type to not become a string just because
it's an array" model would be fine.

But "__attribute__((nonstring))" is sadly not that sane model.

Reported-by: Chris Clayton <chris2553@googlemail.com>
Fixes: 4b4bd8c50f48 ("gcc-15: acpi: sprinkle random '__nonstring' crumbles around")
Fixes: d5d45a7f2619 ("gcc-15: make 'unterminated string initialization' just a warning")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 months agoLinux 6.15-rc3 v6.15-rc3
Linus Torvalds [Sun, 20 Apr 2025 20:43:47 +0000 (13:43 -0700)] 
Linux 6.15-rc3

6 months agogcc-15: work around sequence-point warning
Linus Torvalds [Sun, 20 Apr 2025 18:30:11 +0000 (11:30 -0700)] 
gcc-15: work around sequence-point warning

The C sequence points are complicated things, and gcc-15 has apparently
added a warning for the case where an object is both used and modified
multiple times within the same sequence point.

That's a great warning.

Or rather, it would be a great warning, except gcc-15 seems to not
really be very exact about it, and doesn't notice that the modification
are to two entirely different members of the same object: the array
counter and the array entries.

So that seems kind of silly.

That said, the code that gcc complains about is unnecessarily
complicated, so moving the array counter update into a separate
statement seems like the most straightforward fix for these warnings:

  drivers/net/wireless/intel/iwlwifi/mld/d3.c: In function ‘iwl_mld_set_netdetect_info’:
  drivers/net/wireless/intel/iwlwifi/mld/d3.c:1102:66: error: operation on ‘netdetect_info->n_matches’ may be undefined [-Werror=sequence-point]
   1102 |                 netdetect_info->matches[netdetect_info->n_matches++] = match;
        |                                         ~~~~~~~~~~~~~~~~~~~~~~~~~^~

  drivers/net/wireless/intel/iwlwifi/mld/d3.c:1120:58: error: operation on ‘match->n_channels’ may be undefined [-Werror=sequence-point]
   1120 |                         match->channels[match->n_channels++] =
        |                                         ~~~~~~~~~~~~~~~~~^~

side note: the code at that second warning is actively buggy, and only
works on little-endian machines that don't do strict alignment checks.

The code casts an array of integers into an array of unsigned long in
order to use our bitmap iterators.  That happens to work fine on any
sane architecture, but it's still wrong.

This does *not* fix that more serious problem.  This only splits the two
assignments into two statements and fixes the compiler warning.  I need
to get rid of the new warnings in order to be able to actually do any
build testing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 months agogcc-15: add '__nonstring' markers to byte arrays
Linus Torvalds [Sun, 20 Apr 2025 18:18:55 +0000 (11:18 -0700)] 
gcc-15: add '__nonstring' markers to byte arrays

All of these cases are perfectly valid and good traditional C, but hit
by the "you're not NUL-terminating your byte array" warning.

And none of the cases want any terminating NUL character.

Mark them __nonstring to shut up gcc-15 (and in the case of the ak8974
magnetometer driver, I just removed the explicit array size and let gcc
expand the 3-byte and 6-byte arrays by one extra byte, because it was
the simpler change).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 months agogcc-15: get rid of misc extra NUL character padding
Linus Torvalds [Sun, 20 Apr 2025 18:04:00 +0000 (11:04 -0700)] 
gcc-15: get rid of misc extra NUL character padding

This removes two cases of explicit NUL padding that now causes warnings
because of '-Wunterminated-string-initialization' being part of -Wextra
in gcc-15.

Gcc is being silly in this case when it says that it truncates a NUL
terminator, because in these cases there were _multiple_ NUL characters.

But we can get rid of the warning by just simplifying the two
initializers that trigger the warning for me, so this does exactly that.

I'm not sure why the power supply code did that odd

    .attr_name = #_name "\0",

pattern: it was introduced in commit 2cabeaf15129 ("power: supply: core:
Cleanup power supply sysfs attribute list"), but that 'attr_name[]'
field is an explicitly sized character array in a statically initialized
variable, and a string initializer always has a terminating NUL _and_
statically initialized character arrays are zero-padded anyway, so it
really seems to be rather extraneous belt-and-suspenders.

The zero_uuid[16] initialization in drivers/md/bcache/super.c makes
perfect sense, but it isn't necessary for the same reasons, and not
worth the new gcc warning noise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 months agogcc-15: acpi: sprinkle random '__nonstring' crumbles around
Linus Torvalds [Sun, 20 Apr 2025 18:02:18 +0000 (11:02 -0700)] 
gcc-15: acpi: sprinkle random '__nonstring' crumbles around

This is not great: I'd much rather introduce a typedef that is a "ACPI
name byte buffer", and use that to mark these special 4-byte ACPI names
that do not use NUL termination.

But as noted in the previous commit ("gcc-15: make 'unterminated string
initialization' just a warning") gcc doesn't actually seem to support
that notion, so instead you have to just mark every single array
declaration individually.

So this is not pretty, but this gets rid of the bulk of the annoying
warnings during an allmodconfig build for me.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 months agogcc-15: make 'unterminated string initialization' just a warning
Linus Torvalds [Sun, 20 Apr 2025 17:33:23 +0000 (10:33 -0700)] 
gcc-15: make 'unterminated string initialization' just a warning

gcc-15 enabling -Wunterminated-string-initialization in -Wextra by
default was done with the best intentions, but the warning is still
quite broken.

What annoys me about the warning is that this is a very traditional AND
CORRECT way to initialize fixed byte arrays in C:

unsigned char hex[16] = "0123456789abcdef";

and we use this all over the kernel.  And the warning is fine, but gcc
developers apparently never made a reasonable way to disable it.  As is
(sadly) tradition with these things.

Yes, there's "__attribute__((nonstring))", and we have a macro to make
that absolutely disgusting syntax more palatable (ie the kernel syntax
for that monstrosity is just "__nonstring").

But that attribute is misdesigned.  What you'd typically want to do is
tell the compiler that you are using a type that isn't a string but a
byte array, but that doesn't work at all:

warning: ‘nonstring’ attribute does not apply to types [-Wattributes]

and because of this fundamental mis-design, you then have to mark each
instance of that pattern.

This is particularly noticeable in our ACPI code, because ACPI has this
notion of a 4-byte "type name" that gets used all over, and is exactly
this kind of byte array.

This is a sad oversight, because the warning is useful, but really would
be so much better if gcc had also given a sane way to indicate that we
really just want a byte array type at a type level, not the broken "each
and every array definition" level.

So now instead of creating a nice "ACPI name" type using something like

typedef char acpi_name_t[4] __nonstring;

we have to do things like

char name[ACPI_NAMESEG_SIZE] __nonstring;

in every place that uses this concept and then happens to have the
typical initializers.

This is annoying me mainly because I think the warning _is_ a good
warning, which is why I'm not just turning it off in disgust.  But it is
hampered by this bad implementation detail.

[ And obviously I'm doing this now because system upgrades for me are
  something that happen in the middle of the release cycle: don't do it
  before or during travel, or just before or during the busy merge
  window period. ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 months agoMerge tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sun, 20 Apr 2025 04:46:58 +0000 (21:46 -0700)] 
Merge tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
 "16 hotfixes. 2 are cc:stable and the remainder address post-6.14
  issues or aren't considered necessary for -stable kernels.

  All patches are basically for MM although five are alterations to
  MAINTAINERS"

[ Basic counting skills are clearly not a strictly necessary requirement
  for kernel maintainers.     - Linus ]

* tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add section for locking of mm's and VMAs
  mm: vmscan: fix kswapd exit condition in defrag_mode
  mm: vmscan: restore high-cpu watermark safety in kswapd
  MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section
  mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
  mm, hugetlb: increment the number of pages to be reset on HVO
  writeback: fix false warning in inode_to_wb()
  docs: ABI: replace mcroce@microsoft.com with new Meta address
  mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()
  MAINTAINERS: add memory advice section
  MAINTAINERS: add mmap trace events to MEMORY MAPPING
  mm: memcontrol: fix swap counter leak from offline cgroup
  MAINTAINERS: add MM subsection for the page allocator
  MAINTAINERS: update SLAB ALLOCATOR maintainers
  fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
  mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()

6 months agoMerge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 19 Apr 2025 21:31:08 +0000 (14:31 -0700)] 
Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Revert the hfs{plus} deprecation warning that's also included in this
   pull request. The commit introducing the deprecation warning resides
   rather early in this branch. So simply dropping it would've rebased
   all other commits which I decided to avoid. Hence the revert in the
   same branch

   [ Background - the deprecation warning discussion resulted in people
     stepping up, and so hfs{plus} will have a maintainer taking care of
     it after all..   - Linus ]

 - Switch CONFIG_SYSFS_SYCALL default to n and decouple from
   CONFIG_EXPERT

 - Fix an audit bug caused by changes to our kernel path lookup helpers
   this cycle. Audit needs the parent path even if the dentry it tried
   to look up is negative

 - Ensure that the kernel path lookup helpers leave the passed in path
   argument clean when they return an error. This is consistent with all
   our other helpers

 - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
   information is available to kernel consumers as well

 - Don't set a timer and call schedule() if the timer will expire
   immediately in epoll

 - Make netfs lookup tables with __nonstring

* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Revert "hfs{plus}: add deprecation warning"
  fs: move the bdex_statx call to vfs_getattr_nosec
  netfs: Mark __nonstring lookup tables
  eventpoll: Set epoll timeout if it's in the future
  fs: ensure that *path_locked*() helpers leave passed path pristine
  fs: add kern_path_locked_negative()
  hfs{plus}: add deprecation warning
  Kconfig: switch CONFIG_SYSFS_SYCALL default to n

6 months agoMerge tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 19 Apr 2025 20:59:04 +0000 (13:59 -0700)] 
Merge tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

 - Address translator: fix wrong include

 - ChromeOS EC tunnel: fix potential NULL pointer dereference

* tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: atr: Fix wrong include
  i2c: cros-ec-tunnel: defer probe if parent EC is not present

6 months agoRevert "hfs{plus}: add deprecation warning"
Christian Brauner [Sat, 19 Apr 2025 20:48:59 +0000 (22:48 +0200)] 
Revert "hfs{plus}: add deprecation warning"

This reverts commit ddee68c499f76ae47c011549df5be53db0057402.

There's ongoing discussion about better maintenance of at least hfsplus.
Rever the deprecation warning for now.

Signed-off-by: Christian Brauner <brauner@kernel.org>
6 months agoMerge tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Sat, 19 Apr 2025 18:57:36 +0000 (11:57 -0700)] 
Merge tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Initialize hash variables in ftrace subops logic

   The fix that simplified the ftrace subops logic opened a path where
   some variables could be used without being initialized, and done
   subtly where the compiler did not catch it. Initialize those
   variables to the EMPTY_HASH, which is the default hash.

 - Reinitialize the hash pointers after they are freed

   Some of the hash pointers in the subop logic were freed but may still
   be referenced later. To prevent use-after-free bugs, initialize them
   back to the EMPTY_HASH.

 - Free the ftrace hashes when they are replaced

   The fix that simplified the subops logic updated some hash pointers,
   but left the original hash that they were pointing to where they are
   no longer used. This caused a memory leak. Free the hashes that are
   pointed to by the pointers when they are replaced.

 - Fix size initialization of ftrace direct function hash

   The ftrace direct function hash used by BPF initialized the hash size
   incorrectly. It checked the size of items to a hard coded 32, which
   made the hash bit size of 5. The hash size is supposed to be limited
   by the bit size of the hash, as the bitmask is allowed to be greater
   than 5. Rework the size check to first pass the number of elements to
   fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating
   the hash.

 - Fix format output of ftrace_graph_ent_entry event

   The field depth of the ftrace_graph_ent_entry event is of size 4 but
   the output showed it as unsigned long and use "%lu". Change it to
   unsigned int and use "%u" in the print format that is displayed to
   user space.

 - Fix the trace event filter on strings

   Events can be filtered on numbers or string values. The return value
   checked from strncpy_from_kernel_nofault() and
   strncpy_from_user_nofault() was used to determine if reading the
   strings would fault or not. It would return fault if the value was
   non zero, which is basically meant that it was always considering the
   read as a fault.

 - Add selftest to test trace event string filtering

   In order to catch the breakage of the string filtering, add a self
   test to make sure that it continues to work.

* tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: selftests: Add testing a user string to filters
  tracing: Fix filter string testing
  ftrace: Fix type of ftrace_graph_ent_entry.depth
  ftrace: fix incorrect hash size in register_ftrace_direct()
  ftrace: Free ftrace hashes after they are replaced in the subops code
  ftrace: Reinitialize hash to EMPTY_HASH after freeing
  ftrace: Initialize variables for ftrace_startup/shutdown_subops()

6 months agoMerge tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Sat, 19 Apr 2025 17:38:03 +0000 (10:38 -0700)] 
Merge tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - v6.15 libcrc clean-up makes invalid configurations possible

 - Fix a potential deadlock introduced during the v6.15 merge window

* tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: decrease sc_count directly if fail to queue dl_recall
  nfs: add missing selections of CONFIG_CRC32

6 months agoMerge tag 'rust-fixes-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda...
Linus Torvalds [Sat, 19 Apr 2025 17:02:43 +0000 (10:02 -0700)] 
Merge tag 'rust-fixes-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull rust fixes from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Fix missing KASAN LLVM flags on first build (and fix spurious
     rebuilds) by skipping '--target'

   - Fix Make < 4.3 build error by using '$(pound)'

   - Fix UML build error by removing 'volatile' qualifier from io
     helpers

   - Fix UML build error by adding 'dma_{alloc,free}_attrs()' helpers

   - Clean gendwarfksyms warnings by avoiding to export '__pfx' symbols

   - Clean objtool warning by adding a new 'noreturn' function for
     1.86.0

   - Disable 'needless_continue' Clippy lint due to new 1.86.0 warnings

   - Add missing 'ffi' crate to 'generate_rust_analyzer.py'

  'pin-init' crate:

   - Import a couple fixes from upstream"

* tag 'rust-fixes-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  rust: helpers: Add dma_alloc_attrs() and dma_free_attrs()
  rust: helpers: Remove volatile qualifier from io helpers
  rust: kbuild: use `pound` to support GNU Make < 4.3
  objtool/rust: add one more `noreturn` Rust function for Rust 1.86.0
  rust: kasan/kbuild: fix missing flags on first build
  rust: disable `clippy::needless_continue`
  rust: kbuild: Don't export __pfx symbols
  rust: pin-init: use Markdown autolinks in Rust comments
  rust: pin-init: alloc: restrict `impl ZeroableOption` for `Box` to `T: Sized`
  scripts: generate_rust_analyzer: Add ffi crate

6 months agoMerge tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Sat, 19 Apr 2025 16:31:21 +0000 (09:31 -0700)] 
Merge tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Easter rc3 pull request, fixes in all the usuals, amdgpu, xe, msm,
  with some i915/ivpu/mgag200/v3d fixes, then a couple of bits in
  dma-buf/gem.

  Hopefully has no easter eggs in it.

  dma-buf:
   - Correctly decrement refcounter on errors

  gem:
   - Fix test for imported buffers

  amdgpu:
   - Cleaner shader sysfs fix
   - Suspend fix
   - Fix doorbell free ordering
   - Video caps fix
   - DML2 memory allocation optimization
   - HDP fix

  i915:
   - Fix DP DSC configurations that require 3 DSC engines per pipe

  xe:
   - Fix LRC address being written too late for GuC
   - Fix notifier vs folio deadlock
   - Fix race betwen dma_buf unmap and vram eviction
   - Fix debugfs handling PXP terminations unconditionally

  msm:
   - Display:
       - Fix to call dpu_plane_atomic_check_pipe() for both SSPPs in
         case of multi-rect
       - Fix to validate plane_state pointer before using it in
         dpu_plane_virtual_atomic_check()
       - Fix to make sure dereferencing dpu_encoder_phys happens after
         making sure it is valid in _dpu_encoder_trigger_start()
       - Remove the remaining intr_tear_rd_ptr which we initialized to
         -1 because NO_IRQ indices start from 0 now
   - GPU:
       - Fix IB_SIZE overflow

  ivpu:
   - Fix debugging
   - Fixes to frequency
   - Support firmware API 3.28.3
   - Flush jobs upon reset

  mgag200:
   - Set vblank start to correct values

  v3d:
   - Fix Indirect Dispatch"

* tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel: (26 commits)
  drm/msm/a6xx+: Don't let IB_SIZE overflow
  drm/xe/pxp: do not queue unneeded terminations from debugfs
  drm/xe/dma_buf: stop relying on placement in unmap
  drm/xe/userptr: fix notifier vs folio deadlock
  drm/xe: Set LRC addresses before guc load
  drm/mgag200: Fix value in <VBLKSTR> register
  drm/gem: Internally test import_attach for imported objects
  drm/amdgpu: Use the right function for hdp flush
  drm/amd/display/dml2: use vzalloc rather than kzalloc
  drm/amdgpu: Add back JPEG to video caps for carrizo and newer
  drm/amdgpu: fix warning of drm_mm_clean
  drm/amd: Forbid suspending into non-default suspend states
  drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
  drm/i915/dp: Check for HAS_DSC_3ENGINES while configuring DSC slices
  drm/i915/display: Add macro for checking 3 DSC engines
  dma-buf/sw_sync: Decrement refcount on error in sw_sync_ioctl_get_deadline()
  accel/ivpu: Add cmdq_id to job related logs
  accel/ivpu: Show NPU frequency in sysfs
  accel/ivpu: Fix the NPU's DPU frequency calculation
  accel/ivpu: Update FW Boot API to version 3.28.3
  ...

6 months agoMerge tag 'drm-msm-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/msm into...
Dave Airlie [Sat, 19 Apr 2025 05:09:29 +0000 (15:09 +1000)] 
Merge tag 'drm-msm-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/msm into drm-fixes

Fixes for v6.15-rc3

Display:
- Fix to call dpu_plane_atomic_check_pipe() for both SSPPs in
  case of multi-rect
- Fix to validate plane_state pointer before using it in
  dpu_plane_virtual_atomic_check()
- Fix to make sure dereferencing dpu_encoder_phys happens after
  making sure it is valid in _dpu_encoder_trigger_start()
- Remove the remaining intr_tear_rd_ptr which we initialized
  to -1 because NO_IRQ indices start from 0 now

GPU:
- Fix IB_SIZE overflow

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://lore.kernel.org/r/CAF6AEGtVKXEVdzUzFWmQE8JmK3nx_hp+ynOd-5j3vnfcU-sgOA@mail.gmail.com
6 months agoMerge tag 'drm-xe-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/xe/kernel...
Dave Airlie [Sat, 19 Apr 2025 04:59:47 +0000 (14:59 +1000)] 
Merge tag 'drm-xe-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Fix LRC address being written too late for GuC
- Fix notifier vs folio deadlock
- Fix race betwen dma_buf unmap and vram eviction
- Fix debugfs handling PXP terminations unconditionally

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/ndinq644zenywaaycxyfqqivsb2xer4z7err3dlpalbz33jfkm@ttabzsg6wnet
6 months agoMerge tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 19 Apr 2025 03:10:42 +0000 (20:10 -0700)] 
Merge tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix hard link lease key problem when close is deferred

 - Revert the socket lockdep/refcount workarounds done in cifs.ko now
   that it is fixed at the socket layer

* tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  Revert "smb: client: fix TCP timers deadlock after rmmod"
  Revert "smb: client: Fix netns refcount imbalance causing leaks and use-after-free"
  smb3 client: fix open hardlink on deferred close file error

6 months agodrm/msm/a6xx+: Don't let IB_SIZE overflow
Rob Clark [Mon, 17 Mar 2025 15:00:06 +0000 (08:00 -0700)] 
drm/msm/a6xx+: Don't let IB_SIZE overflow

IB_SIZE is only b0..b19.  Starting with a6xx gen3, additional fields
were added above the IB_SIZE.  Accidentially setting them can cause
badness.  Fix this by properly defining the CP_INDIRECT_BUFFER packet
and using the generated builder macro to ensure unintended bits are not
set.

v2: add missing type attribute for IB_BASE
v3: fix offset attribute in xml

Reported-by: Connor Abbott <cwabbott0@gmail.com>
Fixes: a83366ef19ea ("drm/msm/a6xx: add A640/A650 to gpulist")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Patchwork: https://patchwork.freedesktop.org/patch/643396/

6 months agoMerge tag 'i2c-host-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Wolfram Sang [Fri, 18 Apr 2025 21:42:56 +0000 (23:42 +0200)] 
Merge tag 'i2c-host-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host-fixes for v6.15-rc3

- ChromeOS EC tunnel: fix potential NULL pointer dereference

6 months agoMerge tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 18 Apr 2025 21:04:57 +0000 (14:04 -0700)] 
Merge tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Fix hypercall detection on Xen guests

 - Extend the AMD microcode loader SHA check to Zen5, to block loading
   of any unreleased standalone Zen5 microcode patches

 - Add new Intel CPU model number for Bartlett Lake

 - Fix the workaround for AMD erratum 1054

 - Fix buggy early memory acceptance between SEV-SNP guests and the EFI
   stub

* tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/sev: Avoid shared GHCB page for early memory acceptance
  x86/cpu/amd: Fix workaround for erratum 1054
  x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
  x86/microcode/AMD: Extend the SHA check to Zen5, block loading of any unreleased standalone Zen5 microcode patches
  x86/xen: Fix __xen_hypercall_setfunc()

6 months agoMerge tag 'timers-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 18 Apr 2025 21:02:45 +0000 (14:02 -0700)] 
Merge tag 'timers-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix a lockdep false positive in the i8253 driver"

* tag 'timers-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/i8253: Call clockevent_i8253_disable() with interrupts disabled

6 months agoMerge tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 18 Apr 2025 20:35:13 +0000 (13:35 -0700)] 
Merge tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf event fixes from Ingo Molnar:
 "Miscellaneous fixes and a hardware-enabling change:

   - Fix Intel uncore PMU IIO free running counters on SPR, ICX and SNR
     systems

   - Fix Intel PEBS buffer overflow handling

   - Fix skid in Intel PEBS sampling of user-space general purpose
     registers

   - Enable Panther Lake PMU support - similar to Lunar Lake"

* tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Add Panther Lake support
  perf/x86/intel: Allow to update user space GPRs from PEBS records
  perf/x86/intel: Don't clear perf metrics overflow bit unconditionally
  perf/x86/intel/uncore: Fix the scale of IIO free running counters on SPR
  perf/x86/intel/uncore: Fix the scale of IIO free running counters on ICX
  perf/x86/intel/uncore: Fix the scale of IIO free running counters on SNR

6 months agoMerge tag 'irq-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 18 Apr 2025 20:28:41 +0000 (13:28 -0700)] 
Merge tag 'irq-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc irq fixes from Ingo Molnar:

 - Fix BCM2712 irqchip driver Kconfig dependencies required on the
   Raspberry PI5

 - Fix spurious interrupts on RZ/G3E SMARC EVK systems

 - Fix crash regression on Sun/NIU hardware

 - Apply MSI driver quirk for Sun Neptune chips

* tag 'irq-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-bcm2712-mip: Enable driver when ARCH_BCM2835 is enabled
  irqchip/renesas-rzv2h: Prevent TINT spurious interrupt
  net/niu: Niu requires MSIX ENTRY_DATA fields touch before entry reads
  PCI/MSI: Add an option to write MSIX ENTRY_DATA before any reads

6 months agoMerge tag 'core-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 18 Apr 2025 20:25:33 +0000 (13:25 -0700)] 
Merge tag 'core-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc core fixes from Ingo Molnar:
 "Fix a genksyms related bug, triggered by recent changes to the percpu
  code, and update the .clang-format file to not include obsolete
  function names"

* tag 'core-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genksyms: Handle typeof_unqual keyword and __seg_{fs,gs} qualifiers
  clang-format: Update the ForEachMacros list for v6.15-rc1

6 months agoMerge tag 'hardening-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 18 Apr 2025 20:20:20 +0000 (13:20 -0700)] 
Merge tag 'hardening-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

 - lib/prime_numbers: KUnit test should not select PRIME_NUMBERS (Geert
   Uytterhoeven)

 - ubsan: Fix panic from test_ubsan_out_of_bounds (Mostafa Saleh)

 - ubsan: Remove 'default UBSAN' from UBSAN_INTEGER_WRAP (Nathan
   Chancellor)

 - string: Add load_unaligned_zeropad() code path to sized_strscpy()
   (Peter Collingbourne)

 - kasan: Add strscpy() test to trigger tag fault on arm64 (Vincenzo
   Frascino)

 - Disable GCC randstruct for COMPILE_TEST

* tag 'hardening-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lib/prime_numbers: KUnit test should not select PRIME_NUMBERS
  ubsan: Fix panic from test_ubsan_out_of_bounds
  lib/Kconfig.ubsan: Remove 'default UBSAN' from UBSAN_INTEGER_WRAP
  hardening: Disable GCC randstruct for COMPILE_TEST
  kasan: Add strscpy() test to trigger tag fault on arm64
  string: Add load_unaligned_zeropad() code path to sized_strscpy()

6 months agoMerge tag 'gpio-fixes-for-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 18 Apr 2025 20:18:01 +0000 (13:18 -0700)] 
Merge tag 'gpio-fixes-for-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fix from Bartosz Golaszewski:

 - check for both the new AND old (deprecated) setter callback when
   changing GPIO direction to output

* tag 'gpio-fixes-for-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpiolib: Allow to use setters with return value for output-only gpios

6 months agoMerge tag 'thermal-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 18 Apr 2025 20:09:20 +0000 (13:09 -0700)] 
Merge tag 'thermal-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull thermal control fixes from Rafael Wysocki:
 "Add missing DVFS support flags for the Lunar Lake and Panther Lake
  platforms to the int340x Intel thermal driver and fix DLVR support
  for Panther Lake in it (Srinivas Pandruvada)"

* tag 'thermal-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: intel: int340x: Fix Panther Lake DLVR support
  thermal: intel: int340x: Add missing DVFS support flags

6 months agoMerge tag 'pm-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Fri, 18 Apr 2025 20:06:12 +0000 (13:06 -0700)] 
Merge tag 'pm-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These are mostly cpufreq fixes, some of which address recent
  regressions and some address older issues that have come to light
  during the last two weeks, and a runtime PM documentation correction:

   - Fix the performance-to-frequency scaling factor computation on
     systems using HWP in the intel_pstate driver after a recent
     incorrect update of it (Rafael Wysocki)

   - Fix the usage of the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
     in the schedutil cpufreq governor after a recent update of it that
     has caused frequency limits changes to be missed sometimes (Rafael
     Wysocki)

   - Address some recently discovered synchronization issues related to
     frequency limits changes in the schedutil cpufreq governor and in
     the cpufreq core (Rafael Wysocki)

   - Fix ITMT support in the amd-pstate cpufreq driver so that it is
     enabled after asym priorities have been correctly initialized for
     all CPUs (K Prateek Nayak)

   - Fix changing min/max limits in the amd-pstate cpufreq driver while
     on the performance governor (Dhananjay Ugwekar)

   - Fix a function name in the runtime PM documentation that was
     previously incorrectly updated by mistake (Sakari Ailus)"

* tag 'pm-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Avoid using inconsistent policy->min and policy->max
  cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()
  cpufreq/sched: Explicitly synchronize limits_changed flag handling
  cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS
  Documentation: PM: runtime: Fix a reference to pm_runtime_autosuspend()
  cpufreq: intel_pstate: Fix hwp_get_cpu_scaling()
  cpufreq/amd-pstate: Enable ITMT support after initializing core rankings
  cpufreq/amd-pstate: Fix min_limit perf and freq updation for performance governor

6 months agoMerge branch 'pm-docs'
Rafael J. Wysocki [Fri, 18 Apr 2025 18:55:48 +0000 (20:55 +0200)] 
Merge branch 'pm-docs'

Merge a runtime PM documentation correction for 6.15-rc3.

* pm-docs:
  Documentation: PM: runtime: Fix a reference to pm_runtime_autosuspend()

6 months agoMerge tag 'riscv-for-linus-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 18 Apr 2025 18:46:44 +0000 (11:46 -0700)] 
Merge tag 'riscv-for-linus-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - A fix for an issue where C instructions ended up in non-C builds, due
   to some broken inline assembly in the KGDB breakpoint insertion code

 - A fix to avoid spurious printk messages about misaligned access
   performance probing

 - A fix for a handful of issues with /proc/iomem's reserved region
   handling

 - A pair of fixes for module relocation processing

 - A few build-time fixes

* tag 'riscv-for-linus-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: KGDB: Remove ".option norvc/.option rvc" for kgdb_compiled_break
  riscv: KGDB: Do not inline arch_kgdb_breakpoint()
  riscv: Avoid fortify warning in syscall_get_arguments()
  riscv: Provide all alternative macros all the time
  riscv: module: Allocate PLT entries for R_RISCV_PLT32
  riscv: module: Fix out-of-bounds relocation access
  riscv: Properly export reserved regions in /proc/iomem
  riscv: Fix unaligned access info messages
  riscv: Avoid fortify warning in syscall_get_arguments()
  Documentation: riscv: Fix typo MIMPLID -> MIMPID
  riscv: Use kvmalloc_array on relocation_hashtable

6 months agoMerge tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git.kernel.org/pub/scm...
Linus Torvalds [Fri, 18 Apr 2025 18:35:11 +0000 (11:35 -0700)] 
Merge tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit fix from Shuah Khan:
 "Fixes arch sh kunit qemu_configs script sh.py to honor kunit cmdline"

* tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: qemu_configs: SH: Respect kunit cmdline

6 months agoMerge tag 'linux_kselftest-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 18 Apr 2025 18:32:31 +0000 (11:32 -0700)] 
Merge tag 'linux_kselftest-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fix from Shuah Khan:
 "Fixes dynevent_limitations.tc test failure on dash by detecting and
  handling bash and dash differences in evaluating \\"

* tag 'linux_kselftest-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/ftrace: Differentiate bash and dash in dynevent_limitations.tc

6 months agoMerge tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Linus Torvalds [Fri, 18 Apr 2025 16:37:44 +0000 (09:37 -0700)] 
Merge tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - Fix integer overflow in server disconnect deadtime calculation

 - Three fixes for potential use after frees: one for oplocks, and one
   for leases and one for kerberos authentication

 - Fix to prevent attempted write to directory

 - Fix locking warning for durable scavenger thread

* tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: Prevent integer overflow in calculation of deadtime
  ksmbd: fix the warning from __kernel_write_iter
  ksmbd: fix use-after-free in smb_break_all_levII_oplock()
  ksmbd: fix use-after-free in __smb2_lease_break_noti()
  ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"
  ksmbd: Fix dangling pointer in krb_authenticate

6 months agoMerge tag 'block-6.15-20250417' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 18 Apr 2025 16:21:14 +0000 (09:21 -0700)] 
Merge tag 'block-6.15-20250417' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD pull via Yu:
      - fix raid10 missing discard IO accounting (Yu Kuai)
      - fix bitmap stats for bitmap file (Zheng Qixing)
      - fix oops while reading all member disks failed during
        check/repair (Meir Elisha)

 - NVMe pull via Christoph:
      - fix scan failure for non-ANA multipath controllers (Hannes
        Reinecke)
      - fix multipath sysfs links creation for some cases (Hannes
        Reinecke)
      - PCIe endpoint fixes (Damien Le Moal)
      - use NULL instead of 0 in the auth code (Damien Le Moal)

 - Various ublk fixes:
      - Slew of selftest additions
      - Improvements and fixes for IO cancelation
      - Tweak to Kconfig verbiage

 - Fix for page dirtying for blk integrity mapped pages

 - loop fixes:
      - buffered IO fix
      - uevent fixes
      - request priority inheritance fix

 - Various little fixes

* tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits)
  selftests: ublk: add generic_06 for covering fault inject
  ublk: simplify aborting ublk request
  ublk: remove __ublk_quiesce_dev()
  ublk: improve detection and handling of ublk server exit
  ublk: move device reset into ublk_ch_release()
  ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
  ublk: add ublk_force_abort_dev()
  ublk: properly serialize all FETCH_REQs
  selftests: ublk: move creating UBLK_TMP into _prep_test()
  selftests: ublk: add test_stress_05.sh
  selftests: ublk: support user recovery
  selftests: ublk: support target specific command line
  selftests: ublk: increase max nr_queues and queue depth
  selftests: ublk: set queue pthread's cpu affinity
  selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
  selftests: ublk: add two stress tests for zero copy feature
  selftests: ublk: run stress tests in parallel
  selftests: ublk: make sure _add_ublk_dev can return in sub-shell
  selftests: ublk: cleanup backfile automatically
  selftests: ublk: add io_uring uapi header
  ...

6 months agoMerge tag 'io_uring-6.15-20250418' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 18 Apr 2025 16:13:52 +0000 (09:13 -0700)] 
Merge tag 'io_uring-6.15-20250418' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Correctly cap iov_iter->nr_segs for imports of registered buffers,
   both kbuf and normal ones.

   Three cleanups to make it saner first, then two fixes for each of the
   buffer types.

   This fixes a performance regression where partial buffer usage
   doesn't trim the tail number of segments, leading the block layer to
   iterate the IOs to check if it needs splitting.

 - Two patches tweaking the newly introduced zero-copy rx API, mostly to
   keep the API consistent once we add multiple interface queues per
   ring support in the 6.16 release.

 - zc rx unmapping fix for a dead device

* tag 'io_uring-6.15-20250418' of git://git.kernel.dk/linux:
  io_uring/zcrx: fix late dma unmap for a dead dev
  io_uring/rsrc: ensure segments counts are correct on kbuf buffers
  io_uring/rsrc: send exact nr_segs for fixed buffer
  io_uring/rsrc: refactor io_import_fixed
  io_uring/rsrc: separate kbuf offset adjustments
  io_uring/rsrc: don't skip offset calculation
  io_uring/zcrx: add pp to ifq conversion helper
  io_uring/zcrx: return ifq id to the user

6 months agotracing: selftests: Add testing a user string to filters
Steven Rostedt [Fri, 18 Apr 2025 14:12:08 +0000 (10:12 -0400)] 
tracing: selftests: Add testing a user string to filters

Running the following commands was broken:

  # cd /sys/kernel/tracing
  # echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
  # echo 1 > events/syscalls/sys_enter_openat/enable
  # ls /proc/$$/maps
  # cat trace

And would produce nothing when it should have produced something like:

      ls-1192    [007] .....  8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0)

Add a test to check this case so that it will be caught if it breaks
again.

Link: https://lore.kernel.org/linux-trace-kernel/20250417183003.505835fb@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/20250418101208.38dc81f5@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
6 months agox86/boot/sev: Avoid shared GHCB page for early memory acceptance
Ard Biesheuvel [Thu, 17 Apr 2025 20:21:21 +0000 (22:21 +0200)] 
x86/boot/sev: Avoid shared GHCB page for early memory acceptance

Communicating with the hypervisor using the shared GHCB page requires
clearing the C bit in the mapping of that page. When executing in the
context of the EFI boot services, the page tables are owned by the
firmware, and this manipulation is not possible.

So switch to a different API for accepting memory in SEV-SNP guests, one
which is actually supported at the point during boot where the EFI stub
may need to accept memory, but the SEV-SNP init code has not executed
yet.

For simplicity, also switch the memory acceptance carried out by the
decompressor when not booting via EFI - this only involves the
allocation for the decompressed kernel, and is generally only called
after kexec, as normal boot will jump straight into the kernel from the
EFI stub.

Fixes: 6c3211796326 ("x86/sev: Add SNP-specific unaccepted memory support")
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250404082921.2767593-8-ardb+git@google.com
Link: https://lore.kernel.org/r/20250410132850.3708703-2-ardb+git@google.com
Link: https://lore.kernel.org/r/20250417202120.1002102-2-ardb+git@google.com
6 months agox86/cpu/amd: Fix workaround for erratum 1054
Sandipan Das [Fri, 18 Apr 2025 06:19:40 +0000 (11:49 +0530)] 
x86/cpu/amd: Fix workaround for erratum 1054

Erratum 1054 affects AMD Zen processors that are a part of Family 17h
Models 00-2Fh and the workaround is to not set HWCR[IRPerfEn]. However,
when X86_FEATURE_ZEN1 was introduced, the condition to detect unaffected
processors was incorrectly changed in a way that the IRPerfEn bit gets
set only for unaffected Zen 1 processors.

Ensure that HWCR[IRPerfEn] is set for all unaffected processors. This
includes a subset of Zen 1 (Family 17h Models 30h and above) and all
later processors. Also clear X86_FEATURE_IRPERF on affected processors
so that the IRPerfCount register is not used by other entities like the
MSR PMU driver.

Fixes: 232afb557835 ("x86/CPU/AMD: Add X86_FEATURE_ZEN1")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/caa057a9d6f8ad579e2f1abaa71efbd5bd4eaf6d.1744956467.git.sandipan.das@amd.com
6 months agoio_uring/zcrx: fix late dma unmap for a dead dev
Pavel Begunkov [Fri, 18 Apr 2025 12:02:27 +0000 (13:02 +0100)] 
io_uring/zcrx: fix late dma unmap for a dead dev

There is a problem with page pools not dma-unmapping immediately when
the device is going down, and delaying it until the page pool is
destroyed, which is not allowed (see links). That just got fixed for
normal page pools, and we need to address memory providers as well.

Unmap pages in the memory provider uninstall callback, and protect it
with a new lock. There is also a gap between when a dma mapping is
created and the mp is installed, so if the device is killed in between,
io_uring would be holding on to dma mappings to a dead device with no
one to call ->uninstall. Move it to page pool init and rely on
->is_mapped to make sure it's only done once.

Link: https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/
Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 months agoperf/x86/amd/uncore: Prevent UMC counters from saturating
Sandipan Das [Fri, 18 Apr 2025 03:43:03 +0000 (09:13 +0530)] 
perf/x86/amd/uncore: Prevent UMC counters from saturating

Unlike L3 and DF counters, UMC counters (PERF_CTRs) set the Overflow bit
(bit 48) and saturate on overflow. A subsequent pmu->read() of the event
reports an incorrect accumulated count as there is no difference between
the previous and the current values of the counter.

To avoid this, inspect the current counter value and proactively reset
the corresponding PERF_CTR register on every pmu->read(). Combined with
the periodic reads initiated by the hrtimer, the counters never get a
chance saturate but the resolution reduces to 47 bits.

Fixes: 25e56847821f ("perf/x86/amd/uncore: Add memory controller support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/dee9c8af2c6d66814cf4c6224529c144c620cf2c.1744906694.git.sandipan.das@amd.com
6 months agoperf/x86/amd/uncore: Add parameter to configure hrtimer
Sandipan Das [Fri, 18 Apr 2025 03:43:02 +0000 (09:13 +0530)] 
perf/x86/amd/uncore: Add parameter to configure hrtimer

Introduce a module parameter for configuring the hrtimer duration in
milliseconds. The default duration is 60000 milliseconds and the intent
is to allow users to customize it to suit jitter tolerances. It should
be noted that a longer duration will reduce jitter but affect accuracy
if the programmed events cause the counters to overflow multiple times
in a single interval.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/6cb0101da74955fa9c8361f168ffdf481ae8a200.1744906694.git.sandipan.das@amd.com
6 months agoperf/x86/amd/uncore: Use hrtimer for handling overflows
Sandipan Das [Fri, 18 Apr 2025 03:43:01 +0000 (09:13 +0530)] 
perf/x86/amd/uncore: Use hrtimer for handling overflows

Uncore counters do not provide mechanisms like interrupts to report
overflows and the accumulated user-visible count is incorrect if there
is more than one overflow between two successive read requests for the
same event because the value of prev_count goes out-of-date for
calculating the correct delta.

To avoid this, start a hrtimer to periodically initiate a pmu->read() of
the active counters for keeping prev_count up-to-date. It should be
noted that the hrtimer duration should be lesser than the shortest time
it takes for a counter to overflow for this approach to be effective.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/8ecf5fe20452da1cd19cf3ff4954d3e7c5137468.1744906694.git.sandipan.das@amd.com
6 months agoperf/x86/intel/uncore: Use HRTIMER_MODE_HARD for detecting overflows
Sandipan Das [Fri, 18 Apr 2025 03:43:00 +0000 (09:13 +0530)] 
perf/x86/intel/uncore: Use HRTIMER_MODE_HARD for detecting overflows

hrtimer handlers can be deferred to softirq context and affect timely
detection of counter overflows. Hence switch to HRTIMER_MODE_HARD.

Disabling and re-enabling IRQs in the hrtimer handler is not required
as pmu->start() and pmu->stop() can no longer intervene while updating
event->hw.prev_count.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/0ad4698465077225769e8edd5b2c7e8f48f636d5.1744906694.git.sandipan.das@amd.com
6 months agoperf/x86/amd/uncore: Remove unused 'struct amd_uncore_ctx::node' member
Sandipan Das [Fri, 18 Apr 2025 03:42:59 +0000 (09:12 +0530)] 
perf/x86/amd/uncore: Remove unused 'struct amd_uncore_ctx::node' member

Fixes: d6389d3ccc13 ("perf/x86/amd/uncore: Refactor uncore management")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/30f9254c2de6c4318dd0809ef85a1677f68eef10.1744906694.git.sandipan.das@amd.com
6 months agoselftests/bpf: Add 5-byte NOP uprobe trigger benchmark
Jiri Olsa [Mon, 14 Apr 2025 08:36:47 +0000 (10:36 +0200)] 
selftests/bpf: Add 5-byte NOP uprobe trigger benchmark

Add a 5-byte NOP uprobe trigger benchmark (x86_64 specific) to measure
uprobes/uretprobes on top of NOP5 instructions.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20250414083647.1234007-2-jolsa@kernel.org
6 months agouprobes/x86: Add support to emulate NOP instructions
Jiri Olsa [Mon, 14 Apr 2025 08:36:46 +0000 (10:36 +0200)] 
uprobes/x86: Add support to emulate NOP instructions

Add support to emulate all NOP instructions as the original uprobe
instruction.

This change speeds up uprobe on top of all NOP instructions and is a
preparation for usdt probe optimization, that will be done on top of
NOP5 instructions.

With this change the usdt probe on top of NOP5s won't take the performance
hit compared to usdt probe on top of standard NOP instructions.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Hao Luo <haoluo@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20250414083647.1234007-1-jolsa@kernel.org
6 months agoMAINTAINERS: add section for locking of mm's and VMAs
Lorenzo Stoakes [Wed, 16 Apr 2025 10:38:37 +0000 (11:38 +0100)] 
MAINTAINERS: add section for locking of mm's and VMAs

We place this under memory mapping as related to memory mapping
abstractions in the form of mm_struct and vm_area_struct (VMA).  Now we
have separated out mmap/vma locking logic into the mmap_lock.c and
mmap_lock.h files, so this should encapsulate the majority of the mm
locking logic in the kernel.

Suren is best placed to maintain this logic as the core architect of VMA
locking as a whole.

Link: https://lkml.kernel.org/r/e6ed679a184ca444b20dfa77af96913fd8b5efa0.1744799282.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm: vmscan: fix kswapd exit condition in defrag_mode
Johannes Weiner [Wed, 16 Apr 2025 13:45:40 +0000 (09:45 -0400)] 
mm: vmscan: fix kswapd exit condition in defrag_mode

Vlastimil points out an issue with kswapd in defrag_mode not waking up
kcompactd reliably.

Background: When kswapd is woken for any higher-order request, it
initially checks those high-order watermarks to decide if work is
necesary.  However, it cannot (efficiently) meet the contiguity goal of
such a request by itself.  So once it has reclaimed a compaction gap, it
adjusts the request down to check for free order-0 pages, then wakes
kcompactd to coalesce them into larger blocks.

In defrag_mode, the initial watermark check needs to be analogously
against free pageblocks.  However, once kswapd drops the high-order to
hand off contiguity work, it also needs to fall back to base page
watermarks - otherwise it'll keep reclaiming until blocks are freed.

While it appears kcompactd is woken up frequently enough to do most of the
compaction work, kswapd ends up overreclaiming by quite a bit:

                                                     DEFRAGMODE     DEFRAGMODE-thispatch
Hugealloc Time mean                       79381.34 (    +0.00%)    88126.12 (   +11.02%)
Hugealloc Time stddev                     85852.16 (    +0.00%)   135366.75 (   +57.67%)
Kbuild Real time                            249.35 (    +0.00%)      226.71 (    -9.04%)
Kbuild User time                           1249.16 (    +0.00%)     1249.37 (    +0.02%)
Kbuild System time                          171.76 (    +0.00%)      166.93 (    -2.79%)
THP fault alloc                           51666.87 (    +0.00%)    52685.60 (    +1.97%)
THP fault fallback                        16970.00 (    +0.00%)    15951.87 (    -6.00%)
Direct compact fail                         166.53 (    +0.00%)      178.93 (    +7.40%)
Direct compact success                       17.13 (    +0.00%)        4.13 (   -71.69%)
Compact daemon scanned migrate          3095413.33 (    +0.00%)  9231239.53 (  +198.22%)
Compact daemon scanned free             2155966.53 (    +0.00%)  7053692.87 (  +227.17%)
Compact direct scanned migrate           265642.47 (    +0.00%)    68388.33 (   -74.26%)
Compact direct scanned free              130252.60 (    +0.00%)    55634.87 (   -57.29%)
Compact total migrate scanned           3361055.80 (    +0.00%)  9299627.87 (  +176.69%)
Compact total free scanned              2286219.13 (    +0.00%)  7109327.73 (  +210.96%)
Alloc stall                                1890.80 (    +0.00%)     6297.60 (  +232.94%)
Pages kswapd scanned                    9043558.80 (    +0.00%)  5952576.73 (   -34.18%)
Pages kswapd reclaimed                  1891708.67 (    +0.00%)  1030645.00 (   -45.52%)
Pages direct scanned                    1017090.60 (    +0.00%)  2688047.60 (  +164.29%)
Pages direct reclaimed                    92682.60 (    +0.00%)   309770.53 (  +234.22%)
Pages total scanned                    10060649.40 (    +0.00%)  8640624.33 (   -14.11%)
Pages total reclaimed                   1984391.27 (    +0.00%)  1340415.53 (   -32.45%)
Swap out                                 884585.73 (    +0.00%)   417781.93 (   -52.77%)
Swap in                                  287106.27 (    +0.00%)    95589.73 (   -66.71%)
File refaults                            551697.60 (    +0.00%)   426474.80 (   -22.70%)

Link: https://lkml.kernel.org/r/20250416135142.778933-3-hannes@cmpxchg.org
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm: vmscan: restore high-cpu watermark safety in kswapd
Johannes Weiner [Wed, 16 Apr 2025 13:45:39 +0000 (09:45 -0400)] 
mm: vmscan: restore high-cpu watermark safety in kswapd

Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") switched kswapd from
zone_watermark_ok_safe() to the standard, percpu-cached version of reading
free pages, thus dropping the watermark safety precautions for systems
with high CPU counts (e.g.  >212 cpus on 64G).  Restore them.

Since zone_watermark_ok_safe() is no longer the right interface, and this
was the last caller of the function anyway, open-code the
zone_page_state_snapshot() conditional and delete the function.

Link: https://lkml.kernel.org/r/20250416135142.778933-2-hannes@cmpxchg.org
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agoMAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section
Lorenzo Stoakes [Wed, 16 Apr 2025 13:53:01 +0000 (14:53 +0100)] 
MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section

Pedro has offered to review memory mapping code.  He has good experience
in this area and has provided excellent feedback on memory mapping series
in the past so I feel he'll be a great addition.

Link: https://lkml.kernel.org/r/20250416135301.43513-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
David Hildenbrand [Tue, 15 Apr 2025 09:50:07 +0000 (11:50 +0200)] 
mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization

In __folio_remove_rmap() for RMAP_LEVEL_PMD/RMAP_LEVEL_PUD and with
CONFIG_PAGE_MAPCOUNT we first decrement the folio mapcount (and recompute
mapped shared vs.  mapped exclusively) to then adjust the entire mapcount.

This means that another process might stumble in do_wp_page() over a
PTE-mapped PMD folio that is indicated as "exclusively mapped", but still
has an entire mapcount (PMD mapping), because it is racing with the
process that is unmapping the folio (PMD mapping).  Note that do_wp_page()
will back off once it detects the remaining folio reference from the
process that is in the process of unmapping the folio.

This will trigger the early VM_WARN_ON_ONCE(folio_entire_mapcount(folio))
check in do_wp_page(), that can easily be reproduced by looping a couple
of times over allocating a PMD THP, forking a child where we immediately
unmap it again, and writing in the parent concurrently to the THP.

[  252.738129][T16470] ------------[ cut here ]------------
[  252.739267][T16470] WARNING: CPU: 3 PID: 16470 at mm/memory.c:3738 do_wp_page+0x2a75/0x2c00
[  252.740968][T16470] Modules linked in:
[  252.741958][T16470] CPU: 3 UID: 0 PID: 16470 Comm: ...
...
[  252.765841][T16470]  <TASK>
[  252.766419][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[  252.767558][T16470]  ? rcu_is_watching+0x12/0x60
[  252.768525][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[  252.769645][T16470]  ? srso_alias_return_thunk+0x5/0xfbef5
[  252.770778][T16470]  ? lock_acquire+0x33/0x80
[  252.771697][T16470]  ? __handle_mm_fault+0x5e8/0x3e40
[  252.772735][T16470]  ? __handle_mm_fault+0x5e8/0x3e40
[  252.773781][T16470]  __handle_mm_fault+0x1869/0x3e40
[  252.774839][T16470]  handle_mm_fault+0x22a/0x640
[  252.775808][T16470]  do_user_addr_fault+0x618/0x1000
[  252.776847][T16470]  exc_page_fault+0x68/0xd0
[  252.777775][T16470]  asm_exc_page_fault+0x26/0x30

While we could adjust the sequence in __folio_remove_rmap(), let's rater
move the mapcount sanity checks after the mapcount vs.  refcount
stabilization phase.  With this fix, a simple reproducer is happy.

While at it, convert the two VM_WARN_ON_ONCE() we are moving to
VM_WARN_ON_ONCE_FOLIO().

Link: https://lkml.kernel.org/r/20250415095007.569836-1-david@redhat.com
Fixes: 1da190f4d0a6 ("mm: Copy-on-Write (COW) reuse support for PTE-mapped THP")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: syzbot+5e8feb543ca8e12e0ede@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/67fab4fe.050a0220.2c5fcf.0011.GAE@google.com
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm, hugetlb: increment the number of pages to be reset on HVO
Oscar Salvador [Tue, 15 Apr 2025 11:18:59 +0000 (13:18 +0200)] 
mm, hugetlb: increment the number of pages to be reset on HVO

commit 4eeec8c89a0c ("mm: move hugetlb specific things in folio to
page[3]") shifted hugetlb specific stuff, and now mapping overlaps
_hugetlb_cgroup field.

Upon restoring the vmemmap for HVO, only the first two tail pages are
reset, and this causes the check in free_tail_page_prepare() to fail as it
finds an unexpected mapping value in some tails.

Increment the number of pages to be reset to 4 (head + 3 tail pages)

Link: https://lkml.kernel.org/r/20250415111859.376302-1-osalvador@suse.de
Fixes: 4eeec8c89a0c ("mm: move hugetlb specific things in folio to page[3]")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agowriteback: fix false warning in inode_to_wb()
Andreas Gruenbacher [Sat, 12 Apr 2025 16:39:12 +0000 (18:39 +0200)] 
writeback: fix false warning in inode_to_wb()

inode_to_wb() is used also for filesystems that don't support cgroup
writeback.  For these filesystems inode->i_wb is stable during the
lifetime of the inode (it points to bdi->wb) and there's no need to hold
locks protecting the inode->i_wb dereference.  Improve the warning in
inode_to_wb() to not trigger for these filesystems.

Link: https://lkml.kernel.org/r/20250412163914.3773459-3-agruenba@redhat.com
Fixes: aaa2cacf8184 ("writeback: add lockdep annotation to inode_to_wb()")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agodocs: ABI: replace mcroce@microsoft.com with new Meta address
Ahmad Fatoum [Mon, 14 Apr 2025 07:35:31 +0000 (09:35 +0200)] 
docs: ABI: replace mcroce@microsoft.com with new Meta address

The Microsoft email address is bouncing:

    550 5.4.1 Recipient address rejected: Access denied.

So let's replace it with Matteo's current mail address.

Link: https://lkml.kernel.org/r/20250414-fix-mcroce-mail-bounce-v3-1-0aed2d71f3d7@pengutronix.de
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Acked-by: Matteo Croce <teknoraver@meta.com>
Link: https://lore.kernel.org/all/BYAPR15MB2504E4B02DFFB1E55871955DA1062@BYAPR15MB2504.namprd15.prod.outlook.com/
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matteo Croce <teknoraver@meta.com>
Cc: Sascha Hauer <kernel@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()
Baoquan He [Thu, 10 Apr 2025 03:57:14 +0000 (11:57 +0800)] 
mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()

Not like fault_in_readable() or fault_in_writeable(), in
fault_in_safe_writeable() local variable 'start' is increased page by page
to loop till the whole address range is handled.  However, it mistakenly
calculates the size of the handled range with 'uaddr - start'.

Fix it here.

Andreas said:

: In gfs2, fault_in_iov_iter_writeable() is used in
: gfs2_file_direct_read() and gfs2_file_read_iter(), so this potentially
: affects buffered as well as direct reads.  This bug could cause those
: gfs2 functions to spin in a loop.

Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20250410035717.473207-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Fixes: fe673d3f5bf1 ("mm: gup: make fault_in_safe_writeable() use fixup_user_fault()")
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Yanjun.Zhu <yanjun.zhu@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agoMAINTAINERS: add memory advice section
Lorenzo Stoakes [Fri, 11 Apr 2025 07:27:24 +0000 (08:27 +0100)] 
MAINTAINERS: add memory advice section

The madvise code straddles both VMA and page table manipulation.  As a
result, separate it out into its own section and add maintainers/reviewers
as appropriate.

We additionally include the mman-common.h file as this contains the shared
madvise flags and it is important we maintain this alongside madvise.c.

Link: https://lkml.kernel.org/r/20250411072724.10841-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agoMAINTAINERS: add mmap trace events to MEMORY MAPPING
Liam R. Howlett [Fri, 11 Apr 2025 17:33:28 +0000 (13:33 -0400)] 
MAINTAINERS: add mmap trace events to MEMORY MAPPING

MEMORY MAPPING does not list the mmap.h trace point file, but does list
the mmap.c file.  Couple the trace points with the users and authors of
the trace points for notifications of updates.

Link: https://lkml.kernel.org/r/20250411173328.8172-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm: memcontrol: fix swap counter leak from offline cgroup
Muchun Song [Thu, 10 Apr 2025 08:18:12 +0000 (16:18 +0800)] 
mm: memcontrol: fix swap counter leak from offline cgroup

commit 73f839b6d2ed addressed an issue regarding the swap counter leak
that occurred from an offline cgroup.  However, commit 89ce924f0bd4
modified the parameter from @swap_memcg to @memcg (presumably this
alteration was introduced while resolving conflicts).  Fix this problem by
reverting this minor change.

Link: https://lkml.kernel.org/r/20250410081812.10073-1-songmuchun@bytedance.com
Fixes: 89ce924f0bd4 ("mm: memcontrol: move memsw charge callbacks to v1")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agoMAINTAINERS: add MM subsection for the page allocator
Vlastimil Babka [Thu, 10 Apr 2025 09:00:23 +0000 (11:00 +0200)] 
MAINTAINERS: add MM subsection for the page allocator

Add a subsection for the page allocator, including compaction as it's
crucial for high-order allocations and works together with the
anti-fragmentation features.  Add reviewers (including myself) who
voluteered.

Link: https://lkml.kernel.org/r/20250410090021.72296-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Christoph Lameter (Ampere) <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agoMAINTAINERS: update SLAB ALLOCATOR maintainers
Vlastimil Babka [Thu, 10 Apr 2025 09:00:22 +0000 (11:00 +0200)] 
MAINTAINERS: update SLAB ALLOCATOR maintainers

With permission, reduce the number of maintainers.  Create a CREDITS entry
for Joonsoo (Pekka already has one).  Thanks for all the work!

Link: https://lkml.kernel.org/r/20250410090021.72296-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Christoph Lameter (Ampere) <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agofs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
David Hildenbrand [Thu, 10 Apr 2025 09:10:20 +0000 (11:10 +0200)] 
fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages

Alison reports an issue with fsdax when large extends end up using large
ZONE_DEVICE folios:

[  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[  417.796982] #PF: supervisor read access in kernel mode
[  417.797540] #PF: error_code(0x0000) - not-present page
[  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
[  417.800150] Tainted: [O]=OOT_MODULE
[  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[  417.801948] Code: ...
[  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  417.811353] Call Trace:
[  417.811709]  <TASK>
[  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
[  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
[  417.813132]  insert_page+0x78/0xf0
[  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
[  417.814088]  dax_fault_iter+0x484/0x7b0
[  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
[  417.815055]  dax_iomap_fault+0x39/0x40
[  417.815499]  __xfs_write_fault+0x139/0x380
[  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
[  417.816483]  xfs_write_fault+0x41/0x50
[  417.816966]  xfs_filemap_fault+0x3b/0xe0
[  417.817424]  __do_fault+0x31/0x180
[  417.817859]  __handle_mm_fault+0xee1/0x1a60
[  417.818325]  ? debug_smp_processor_id+0x17/0x20
[  417.818844]  handle_mm_fault+0xe1/0x2b0
[...]

The issue is that when we split a large ZONE_DEVICE folio to order-0 ones,
we don't reset the order/_nr_pages.  As folio->_nr_pages overlays
page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
has folio->memcg_data set.  And we never manually initialize
folio->memcg_data in fsdax code, because we never expect it to be set at
all.

When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries
to use folio->memcg_data (because it's non-NULL) but it does not actually
point at a memcg, resulting in the problem.

Alison also observed that these folios sometimes have "locked" set, which
is rather concerning (folios locked from the beginning ...).  The reason
is that the order for large folios is stored in page[1]->flags, which
become the folio->flags of a new small folio.

Let's fix it by adding a folio helper to clear order/_nr_pages for
splitting purposes.

Maybe we should reinitialize other large folio flags / folio members as
well when splitting, because they might similarly cause harm once page[1]
becomes a folio?  At least other flags in PAGE_FLAGS_SECOND should not be
set for fsdax, so at least page[1]->flags might be as expected with this
fix.

From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
re-initialize most things from a previous page[1] used by large folios
that fsdax cares about.  For example folio->private might not get
reinitialized, but maybe that's not relevant -- no traces of it's use in
fsdax code.  Needs a closer look.

Another thing that should be considered in the future is performing
similar checks as we perform in free_tail_page_prepare()
-- checking pincount etc.
-- when freeing a large fsdax folio.

Link: https://lkml.kernel.org/r/20250410091020.119116-1-david@redhat.com
Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Alison Schofield <alison.schofield@intel.com>
Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Darrick J. Wong" <djwong@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agomm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
Kirill A. Shutemov [Sat, 29 Mar 2025 17:10:29 +0000 (19:10 +0200)] 
mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()

When the last page in the zone is accepted, __accept_page() calls
static_branch_dec().  This function takes cpu_hotplug_lock, which can lead
to a deadlock if the allocation occurs during CPU bringup path as
_cpu_up() also takes the lock.

To prevent this deadlock, defer static_branch_dec() to a workqueue.

Call static_branch_dec() only when the workqueue is not yet initialized.
Workqueues are initialized before CPU bring up, so this will not conflict
with the first scenario.

Link: https://lkml.kernel.org/r/20250329171030.3942298-1-kirill.shutemov@linux.intel.com
Fixes: 55ad43e8ba0f ("mm: add a helper to accept page")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
6 months agotracing: Fix filter string testing
Steven Rostedt [Thu, 17 Apr 2025 22:30:03 +0000 (18:30 -0400)] 
tracing: Fix filter string testing

The filter string testing uses strncpy_from_kernel/user_nofault() to
retrieve the string to test the filter against. The if() statement was
incorrect as it considered 0 as a fault, when it is only negative that it
faulted.

Running the following commands:

  # cd /sys/kernel/tracing
  # echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
  # echo 1 > events/syscalls/sys_enter_openat/enable
  # ls /proc/$$/maps
  # cat trace

Would produce nothing, but with the fix it will produce something like:

      ls-1192    [007] .....  8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0)

Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home
Fixes: 77360f9bbc7e5 ("tracing: Add test for user space strings when filtering on string pointers")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
6 months agodrm/xe/pxp: do not queue unneeded terminations from debugfs
Daniele Ceraolo Spurio [Wed, 16 Apr 2025 20:16:22 +0000 (13:16 -0700)] 
drm/xe/pxp: do not queue unneeded terminations from debugfs

The PXP terminate debugfs currently unconditionally simulates a
termination, no matter what the HW status is. This is unneeded if PXP is
not in use and can cause errors if the HW init hasn't completed yet.
To solve these issues, we can simply limit the terminations to the cases
where PXP is fully initialized and in use.

v2: s/pxp_status/ready/ to avoid confusion with pxp->status (John)

Fixes: 385a8015b214 ("drm/xe/pxp: Add PXP debugfs support")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4749
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://lore.kernel.org/r/20250416201622.1295369-1-daniele.ceraolospurio@intel.com
(cherry picked from commit ba1f62a0cac84757ca35f4217e3cd3a2654233ae)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
6 months agodrm/xe/dma_buf: stop relying on placement in unmap
Matthew Auld [Thu, 10 Apr 2025 16:27:17 +0000 (17:27 +0100)] 
drm/xe/dma_buf: stop relying on placement in unmap

The is_vram() is checking the current placement, however if we consider
exported VRAM with dynamic dma-buf, it looks possible for the xe driver
to async evict the memory, notifying the importer, however importer does
not have to call unmap_attachment() immediately, but rather just as
"soon as possible", like when the dma-resv idles. Following from this we
would then pipeline the move, attaching the fence to the manager, and
then update the current placement. But when the unmap_attachment() runs
at some later point we might see that is_vram() is now false, and take
the complete wrong path when dma-unmapping the sg, leading to
explosions.

To fix this check if the sgl was mapping a struct page.

v2:
  - The attachment can be mapped multiple times it seems, so we can't
    really rely on encoding something in the attachment->priv. Instead
    see if the page_link has an encoded struct page. For vram we expect
    this to be NULL.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4563
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250410162716.159403-2-matthew.auld@intel.com
(cherry picked from commit d755887f8e5a2a18e15e6632a5193e5feea18499)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
6 months agodrm/xe/userptr: fix notifier vs folio deadlock
Matthew Auld [Mon, 14 Apr 2025 13:25:40 +0000 (14:25 +0100)] 
drm/xe/userptr: fix notifier vs folio deadlock

User is reporting what smells like notifier vs folio deadlock, where
migrate_pages_batch() on core kernel side is holding folio lock(s) and
then interacting with the mappings of it, however those mappings are
tied to some userptr, which means calling into the notifier callback and
grabbing the notifier lock. With perfect timing it looks possible that
the pages we pulled from the hmm fault can get sniped by
migrate_pages_batch() at the same time that we are holding the notifier
lock to mark the pages as accessed/dirty, but at this point we also want
to grab the folio locks(s) to mark them as dirty, but if they are
contended from notifier/migrate_pages_batch side then we deadlock since
folio lock won't be dropped until we drop the notifier lock.

Fortunately the mark_page_accessed/dirty is not really needed in the
first place it seems and should have already been done by hmm fault, so
just remove it.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4765
Fixes: 0a98219bcc96 ("drm/xe/hmm: Don't dereference struct page pointers without notifier lock")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.10+
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250414132539.26654-2-matthew.auld@intel.com
(cherry picked from commit bd7c0cb695e87c0e43247be8196b4919edbe0e85)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
6 months agodrm/xe: Set LRC addresses before guc load
Lucas De Marchi [Thu, 10 Apr 2025 04:59:34 +0000 (21:59 -0700)] 
drm/xe: Set LRC addresses before guc load

The metadata saved in the ADS is read by GuC when it's initialized.
Saving the addresses to the LRCs when they are populated is too late as
GuC will keep using the old ones.

This was causing GuC to use the RCS LRC for any engine class. It's not a
big problem on a Linux-only scenario since the they are used by GuC only
on media engines when the watchdog is triggered. However, in a
virtualization scenario with Windows as the VF, it causes the wrong LRCs
to be loaded as the watchdog is used for all engines.

Fix it by letting guc_golden_lrc_init() initialize the metadata, like
other *_init() functions, and later guc_golden_lrc_populate() to copy
the LRCs to the right places. The former is called before the second GuC
load, while the latter is called after LRCs have been recorded.

Cc: Chee Yin Wong <chee.yin.wong@intel.com>
Cc: John Harrison <john.c.harrison@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: <stable@vger.kernel.org> # v6.11+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Tested-by: Chee Yin Wong <chee.yin.wong@intel.com>
Link: https://lore.kernel.org/r/20250409-fix-guc-ads-v1-1-494135f7a5d0@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit c31a0b6402d15b530514eee9925adfcb8cfbb1c9)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>