git.ipfire.org Git - thirdparty/kernel/linux.git/log

selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor

Add a simple file stressor that lives directly in-tree. This will create
a bunch of processes that each open 500 file descriptors and then use
close_range() to close them all.

Concurrently, other processes read /proc/<pid>/fd/ which rougly does

    f = fget_task_next(p, &fd);
    if (!f)
           break;
    data.mode = f->f_mode;
    fput(f);

Which means that it'll try to get a reference to a file in another
task's file descriptor table.

Under heavy file load it is increasingly likely that the other task will
manage to close @file and @file will be recycled due to
SLAB_TYPEAFE_BY_RCU concurrently. This will trigger various warnings in
the file reference counting code.

Link: https://lore.kernel.org/r/20241021-vergab-streuen-924df15dceb9@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>

Merge branch 'work.fdtable' into vfs.file

Bring in the fdtable changes for this cycle.

Signed-off-by: Christian Brauner <brauner@kernel.org>

Merge patch series "fs: introduce file_ref_t"

Christian Brauner <brauner@kernel.org> says:

As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations and it is
in a hot path in __fget_files_rcu().

The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make this
work and requiring rcu protection for the data structure in question.
This not just scales better it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations and also require
to be built on atomic_long_t as having massive amounts of reference
isn't unheard of even if it is just an attack.

As suggested by Linus, add a file specific variant instead of making
this a generic library.

I've been testing this with will-it-scale using a multi-threaded fstat()
on the same file descriptor on a machine that Jens gave me access (thank
you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on workloads with 256+ and
more threads comparing v6.12-rc1 as base with and without these patches
applied.

* patches from https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org:
  fs: port files to file_ref
  fs: add file_ref
  fs: protect backing files with rcu

Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

fs: port files to file_ref

Port files to rely on file_ref reference to improve scaling and gain
overflow protection.

- We continue to WARN during get_file() in case a file that is already
  marked dead is revived as get_file() is only valid if the caller
  already holds a reference to the file. This hasn't changed just the
  check changes.

- The semantics for epoll and ttm's dmabuf usage have changed. Both
  epoll and ttm synchronize with __fput() to prevent the underlying file
  from beeing freed.

  (1) epoll

      Explaining epoll is straightforward using a simple diagram.
      Essentially, the mutex of the epoll instance needs to be taken in both
      __fput() and around epi_fget() preventing the file from being freed
      while it is polled or preventing the file from being resurrected.

          CPU1                                   CPU2
          fput(file)
          -> __fput(file)
             -> eventpoll_release(file)
                -> eventpoll_release_file(file)
                                                 mutex_lock(&ep->mtx)
                                                 epi_item_poll()
                                                 -> epi_fget()
                                                    -> file_ref_get(file)
                                                 mutex_unlock(&ep->mtx)
                   mutex_lock(&ep->mtx);
                   __ep_remove()
                   mutex_unlock(&ep->mtx);
             -> kmem_cache_free(file)

  (2) ttm dmabuf

      This explanation is a bit more involved. A regular dmabuf file stashed
      the dmabuf in file->private_data and the file in dmabuf->file:

          file->private_data = dmabuf;
          dmabuf->file = file;

      The generic release method of a dmabuf file handles file specific
      things:

          f_op->release::dma_buf_file_release()

      while the generic dentry release method of a dmabuf handles dmabuf
      freeing including driver specific things:

          dentry->d_release::dma_buf_release()

      During ttm dmabuf initialization in ttm_object_device_init() the ttm
      driver copies the provided struct dma_buf_ops into a private location:

          struct ttm_object_device {
                  spinlock_t object_lock;
                  struct dma_buf_ops ops;
                  void (*dmabuf_release)(struct dma_buf *dma_buf);
                  struct idr idr;
          };

          ttm_object_device_init(const struct dma_buf_ops *ops)
          {
                  // copy original dma_buf_ops in private location
                  tdev->ops = *ops;

                  // stash the release method of the original struct dma_buf_ops
                  tdev->dmabuf_release = tdev->ops.release;

                  // override the release method in the copy of the struct dma_buf_ops
                  // with ttm's own dmabuf release method
                  tdev->ops.release = ttm_prime_dmabuf_release;
          }

      When a new dmabuf is created the struct dma_buf_ops with the overriden
      release method set to ttm_prime_dmabuf_release is passed in exp_info.ops:

          DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
          exp_info.ops = &tdev->ops;
          exp_info.size = prime->size;
          exp_info.flags = flags;
          exp_info.priv = prime;

      The call to dma_buf_export() then sets

          mutex_lock_interruptible(&prime->mutex);
          dma_buf = dma_buf_export(&exp_info)
          {
                  dmabuf->ops = exp_info->ops;
          }
          mutex_unlock(&prime->mutex);

      which creates a new dmabuf file and then install a file descriptor to
      it in the callers file descriptor table:

          ret = dma_buf_fd(dma_buf, flags);

      When that dmabuf file is closed we now get:

          fput(file)
          -> __fput(file)
             -> f_op->release::dma_buf_file_release()
             -> dput()
                -> d_op->d_release::dma_buf_release()
                   -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                      mutex_lock(&prime->mutex);
                      if (prime->dma_buf == dma_buf)
                            prime->dma_buf = NULL;
                      mutex_unlock(&prime->mutex);

      Where we can see that prime->dma_buf is set to NULL. So when we have
      the following diagram:

          CPU1                                                          CPU2
          fput(file)
          -> __fput(file)
             -> f_op->release::dma_buf_file_release()
             -> dput()
                -> d_op->d_release::dma_buf_release()
                   -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                                                                        ttm_prime_handle_to_fd()
                                                                        mutex_lock_interruptible(&prime->mutex)
                                                                        dma_buf = prime->dma_buf
                                                                        dma_buf && get_dma_buf_unless_doomed(dma_buf)
                                                                        -> file_ref_get(dma_buf->file)
                                                                        mutex_unlock(&prime->mutex);

                      mutex_lock(&prime->mutex);
                      if (prime->dma_buf == dma_buf)
                            prime->dma_buf = NULL;
                      mutex_unlock(&prime->mutex);
             -> kmem_cache_free(file)

      The logic of the mechanism is the same as for epoll: sync with
      __fput() preventing the file from being freed. Here the
      synchronization happens through the ttm instance's prime->mutex.
      Basically, the lifetime of the dma_buf and the file are tighly
      coupled.

  Both (1) and (2) used to call atomic_inc_not_zero() to check whether
  the file has already been marked dead and then refuse to revive it.

  This is only safe because both (1) and (2) sync with __fput() and thus
  prevent kmem_cache_free() on the file being called and thus prevent
  the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU.

  Both (1) and (2) have been ported from atomic_inc_not_zero() to
  file_ref_get(). That means a file that is already in the process of
  being marked as FILE_REF_DEAD:

  file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FIlE_REF_NOREF)
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  can be revived again:

  CPU1                                                             CPU2
  file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FIlE_REF_NOREF)
                                                                   file_ref_get()
                                                                   // Brings reference back to FILE_REF_ONEREF
                                                                   atomic_long_add_negative()
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  This is fine and inherent to the file_ref_get()/file_ref_put()
  semantics. For both (1) and (2) this is safe because __fput() is
  prevented from making progress if file_ref_get() fails due to the
  aforementioned synchronization mechanisms.

  Two cases need to be considered that affect both (1) epoll and (2) ttm
  dmabuf:

   (i) fput()'s file_ref_put() and marks the file as FILE_REF_NOREF but
       before that fput() can mark the file as FILE_REF_DEAD someone
       manages to sneak in a file_ref_get() and brings the refcount back
       from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original
       fput() doesn't call __fput(). For epoll the poll will finish and
       for ttm dmabuf the file can be used again. For ttm dambuf this is
       actually an advantage because it avoids immediately allocating
       a new dmabuf object.

       CPU1                                                             CPU2
       file_ref_put()
       cnt = atomic_long_dec_return()
       -> __file_ref_put(cnt)
          if (cnt == FIlE_REF_NOREF)
                                                                        file_ref_get()
                                                                        // Brings reference back to FILE_REF_ONEREF
                                                                        atomic_long_add_negative()
                  atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  (ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and
       also suceeds in actually marking it FILE_REF_DEAD and then calls
       into __fput() to free the file.

       When either (1) or (2) call file_ref_get() they fail as
       atomic_long_add_negative() will return true.

       At the same time, both (1) and (2) all file_ref_get() under
       mutexes that __fput() must also acquire preventing
       kmem_cache_free() from freeing the file.

  So while this might be treated as a change in semantics for (1) and
  (2) it really isn't. It if should end up causing issues this can be
  fixed by adding a helper that does something like:

  long cnt = atomic_long_read(&ref->refcnt);
  do {
          if (cnt < 0)
                  return false;
  } while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1));
  return true;

  which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions.

- Jann correctly pointed out that kmem_cache_zalloc() cannot be used
  anymore once files have been ported to file_ref_t.

  The kmem_cache_zalloc() call will memset() the whole struct file to
  zero when it is reallocated. This will also set file->f_ref to zero
  which mens that a concurrent file_ref_get() can return true:

  CPU1                            CPU2
                                  __get_file_rcu()
                                    rcu_dereference_raw()
  close()
    [frees file]
  alloc_empty_file()
    kmem_cache_zalloc()
      [reallocates same file]
      memset(..., 0, ...)
                                    file_ref_get()
                                      [increments 0->1, returns true]
    init_file()
      file_ref_init(..., 1)
        [sets to 0]
                                    rcu_dereference_raw()
                                    fput()
                                      file_ref_put()
                                        [decrements 0->FILE_REF_NOREF, frees file]
    [UAF]

   causing a concurrent __get_file_rcu() call to acquire a reference to
   the file that is about to be reallocated and immediately freeing it
   on realizing that it has been recycled. This causes a UAF for the
   task that reallocated/recycled the file.

   This is prevented by switching from kmem_cache_zalloc() to
   kmem_cache_alloc() and initializing the fields manually. With
   file->f_ref initialized last.

   Note that a memset() also isn't guaranteed to atomically update an
   unsigned long so it's theoretically possible to see torn and
   therefore bogus counter values.

Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

fs: add file_ref

As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations and it is
in a hot path in __fget_files_rcu().

The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make this
work and requiring rcu protection for the data structure in question.
This not just scales better it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations and also require
to be built on atomic_long_t as having massive amounts of reference
isn't unheard of even if it is just an attack.

As suggested by Linus, add a file specific variant instead of making
this a generic library.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.

So when picking a file from the file descriptor table via its file
descriptor number it is thus possible to see an elevated reference count
on file->f_count even though the file has already been recycled possibly
multiple times by another task.

To guard against this the vfs will pick the file from the file
descriptor table twice. Once before the refcount increment and once
after to compare the pointers (grossly simplified). If they match then
the file is still valid. If not the caller needs to fput() it.

The unconditional increment makes the following race possible as
illustrated by rcuref:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
>      T1                              T2
>      put()                           get()
>      // ref->refcnt = ONEREF
>      if (!atomic_add_negative(-1, &ref->refcnt))
>              return false;                           <- Not taken
>
>      // ref->refcnt == NOREF
>      --> preemption
>                                      // Elevates ref->refcnt to ONEREF
>                                      if (!atomic_add_negative(1, &ref->refcnt))
>                                              return true;                    <- taken
>
>                                      if (put(&p->ref)) { <-- Succeeds
>                                              remove_pointer(p);
>                                              kfree_rcu(p, rcu);
>                                      }
>
>              RCU grace period ends, object is freed
>
>      atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files by SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
this deconstruction race. Afaict, the only interesting case would be
someone freeing the file and someone immediately recycling it within the
same grace period and reinitializing file->f_count to ONEREF while a
concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as
in the race above.

But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should
be safe from rcuref's perspective.

      T1                              T2                                                    T3
      fput()                          fget()
      // f_count->refcnt = ONEREF
      if (!atomic_add_negative(-1, &f_count->refcnt))
              return false;                           <- Not taken

      // f_count->refcnt == NOREF
      --> preemption
                                      // Elevates f_count->refcnt to ONEREF
                                      if (!atomic_add_negative(1, &f_count->refcnt))
                                              return true;                    <- taken

                                      if (put(&f_count)) { <-- Succeeds
                                              remove_pointer(p);
                                              /*
                                               * Cache is SLAB_TYPESAFE_BY_RCU
                                               * so this is freed without a grace period.
                                               */
                                              kmem_cache_free(p);
                                      }

                                                                                             kmem_cache_alloc()
                                                                                             init_file() {
                                                                                                     // Sets f_count->refcnt to ONEREF
                                                                                                     rcuref_long_init(&f->f_count, 1);
                                                                                             }

                        Object has been reused within the same grace period
                        via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.

      /*
       * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
       * it would work correctly because the atomic_cmpxchg()
       * will fail because the refcount has been reset to ONEREF by T3.
       */
      atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF

However, there are other cases to consider:

(1) Benign race due to multiple atomic_long_read()

    CPU1                                                    CPU2

    file_ref_put()
    // last reference
    // => count goes negative/FILE_REF_NOREF
    atomic_long_add_negative_release(-1, &ref->refcnt)
    -> __file_ref_put()
                                                        file_ref_get()
                                                        // goes back from negative/FILE_REF_NOREF to 0
                                                        // and file_ref_get() succeeds
                                                        atomic_long_add_negative(1, &ref->refcnt)

                                                        // This is immediately followed by file_ref_put()
                                                        // managing to set FILE_REF_DEAD
                                                        file_ref_put()

       // __file_ref_put() continues and sees
       // cnt > FILE_REF_RELEASED // and splats with
       // "imbalanced put on file reference count"
       cnt = atomic_long_read(&ref->refcnt);

    The race however is benign and the problem is the
    atomic_long_read(). Instead of performing a separate read this uses
    atomic_long_dec_return() and pass the value to __file_ref_put().
    Thanks to Linus for pointing out that braino.

(2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead

    When a file is recycled the following race exists:

    CPU1                                                       CPU2
    // @file is already dead and thus
    // cnt >= FILE_REF_RELEASED.
    file_ref_get(file)
    atomic_long_add_negative(1, &ref->refcnt)
       // We thus call into __file_ref_get()
    -> __file_ref_get()

       // which sees cnt >= FILE_REF_RELEASED
       cnt = atomic_long_read(&ref->refcnt);
                                                               // In the meantime @file gets freed
                                                               kmem_cache_free()

                                                               // and is immediately recycled
                                                               file = kmem_cache_zalloc()
                                                               // and the reference count is reinitialized
                                                               // and the file alive again in someone
                                                               // else's file descriptor table
                                                               file_ref_init(&ref->refcnt, 1);

       // the __file_ref_get() slowpath now continues
       // and as it saw earlier that cnt >= FILE_REF_RELEASED
       // it wants to ensure that we're staying in the middle
       // of the deadzone and unconditionally sets
       // FILE_REF_DEAD.
       // This marks @file dead for CPU2...
       atomic_long_set(&ref->refcnt, FILE_REF_DEAD);

                                                               // Caller issues a close() system call to close @file
                                                               close(fd)
                                                               file = file_close_fd_locked()
                                                               filp_flush()
                                                               // The caller sees that cnt >= FILE_REF_RELEASED
                                                               // and warns the first time...
                                                               CHECK_DATA_CORRUPTION(file_count(file) == 0)

                                                               // and then splats a second time because
                                                               // __file_ref_put() sees cnt >= FILE_REF_RELEASED
                                                               file_ref_put(&ref->refcnt);
                                                               -> __file_ref_put()

    My initial inclination was to replace the unconditional
    atomic_long_set() with an atomic_long_try_cmpxchg() but Linus
    pointed out that:

    > I think we should just make file_ref_get() do a simple
    >
    >        return !atomic_long_add_negative(1, &ref->refcnt));
    >
    > and nothing else. Yes, multiple CPU's can race, and you can increment
    > more than once, but the gap - even on 32-bit - between DEAD and
    > becoming close to REF_RELEASED is so big that we simply don't care.
    > That's the point of having a gap.

I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access (thank you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202410151043.5d224a27-oliver.sang@intel.com
Closes: https://lore.kernel.org/all/202410151611.f4cd71f2-oliver.sang@intel.com
Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-2-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

expand_files(): simplify calling conventions

All callers treat 0 and 1 returned by expand_files() in the same way
now since the call in alloc_fd() had been made conditional. Just make
it return 0 on success and be done with it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

make __set_open_fd() set cloexec state as well

->close_on_exec[] state is maintained only for opened descriptors;
as the result, anything that marks a descriptor opened has to
set its cloexec state explicitly.

As the result, all calls of __set_open_fd() are followed by
__set_close_on_exec(); might as well fold it into __set_open_fd()
so that cloexec state is defined as soon as the descriptor is
marked opened.

[braino fix folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs: protect backing files with rcu

Currently backing files are not under any form of rcu protection.
Switching to file_ref requires rcu protection and so does the
speculative vma lookup. Switch backing files to the same rcu slab type
as regular files. There should be no additional magic required as the
lifetime of a backing file is always tied to a regular file.

Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-1-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

file.c: merge __{set,clear}_close_on_exec()

they are always go in pairs; seeing that they are inlined, might
as well make that a single inline function taking a boolean
argument ("do we want close_on_exec set for that descriptor")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

alloc_fdtable(): change calling conventions.

First of all, tell it how many slots do we want, not which slot
is wanted.  It makes one caller (dup_fd()) more straightforward
and doesn't harm another (expand_fdtable()).

Furthermore, make it return ERR_PTR() on failure rather than
returning NULL.  Simplifies the callers.

Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG.  What the
rules boil down to is
* use the smallest power of two large enough to give us
that many slots
* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less.  128 slots means
1Kb fd array, again.
* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/file.c: add fast path in find_next_fd()

Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
word contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
searching.
(3) After fdt is expanded (the bitmap size doubled for each time of expansion),
it would never be shrunk. The search size increases but there are few open fds
available here.

This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-4-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/file.c: conditionally clear full_fds

64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that a bit in full_fds_bits has been cleared before in
__clear_open_fds()'s operation. Check the clear bit in full_fds_bits before
clearing to avoid unnecessary write and cache bouncing. See commit fc90888d07b8
("vfs: conditionally clear close-on-exec flag") for a similar optimization.
take stock kernel with patch 1 as baseline, it improves pts/blogbench-1.1.0
read for 13%, and write for 5% on Intel ICX 160 cores configuration with
v6.10-rc7.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-3-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()

alloc_fd() has a sanity check inside to make sure the struct file mapping to the
allocated fd is NULL. Remove this sanity check since it can be assured by
exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
likely/unlikely and expand_file() call avoidance to reduce the work under
file_lock.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-2-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

move close_range(2) into fs/file.c, fold __close_range() into it

We never had callers for __close_range() except for close_range(2)
itself. Nothing of that sort has appeared in four years and if any users
do show up, we can always separate those suckers again.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

close_files(): don't bother with xchg()

At that point nobody else has references to the victim files_struct;
as the matter of fact, the caller will free it immediately after
close_files() returns, with no RCU delays or anything of that sort.

That's why we are not protecting against fdtable reallocation on
expansion, not cleaning the bitmaps, etc. There's no point
zeroing the pointers in ->fd[] either, let alone make that an
atomic operation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

remove pointless includes of <linux/fdtable.h>

some of those used to be needed, some had been cargo-culted for
no reason...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

get rid of ...lookup...fdget_rcu() family

Once upon a time, predecessors of those used to do file lookup
without bumping a refcount, provided that caller held rcu_read_lock()
across the lookup and whatever it wanted to read from the struct
file found. When struct file allocation switched to SLAB_TYPESAFE_BY_RCU,
that stopped being feasible and these primitives started to bump the
file refcount for lookup result, requiring the caller to call fput()
afterwards.

But that turned them pointless - e.g.
rcu_read_lock();
file = lookup_fdget_rcu(fd);
rcu_read_unlock();
is equivalent to
file = fget_raw(fd);
and all callers of lookup_fdget_rcu() are of that form. Similarly,
task_lookup_fdget_rcu() calls can be replaced with calling fget_task().
task_lookup_next_fdget_rcu() doesn't have direct counterparts, but
its callers would be happier if we replaced it with an analogue that
deals with RCU internally.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Linux 6.12-rc2

Merge tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- Move non-boot built-in DTBs to the .rodata section

- Fix Kconfig bugs

- Fix maint scripts in the linux-image Debian package

- Import some list macros to scripts/include/

* tag 'kbuild-fixes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: deb-pkg: Remove blank first line from maint scripts
  kbuild: fix a typo dt_binding_schema -> dt_binding_schemas
  scripts: import more list macros
  kconfig: qconf: fix buffer overflow in debug links
  kconfig: qconf: move conf_read() before drawing tree pain
  kconfig: clear expr::val_is_valid when allocated
  kconfig: fix infinite loop in sym_calc_choice()
  kbuild: move non-boot built-in DTBs to .rodata section

Merge tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:

- Intel PMC fix for suspend/resume issues on some Sky and Kaby Lake
   laptops

- Intel Diamond Rapids hw-id additions

- Documentation and MAINTAINERS fixes

- Some other small fixes

* tag 'platform-drivers-x86-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
  platform/x86: wmi: Update WMI driver API documentation
  platform/x86: dell-ddv: Fix typo in documentation
  platform/x86: dell-sysman: add support for alienware products
  platform/x86/intel: power-domains: Add Diamond Rapids support
  platform/x86: ISST: Add Diamond Rapids to support list
  platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake
  platform/x86: dell-laptop: Do not fail when encountering unsupported batteries
  MAINTAINERS: Update Intel In Field Scan(IFS) entry
  platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM64:

   - Fix pKVM error path on init, making sure we do not change critical
     system registers as we're about to fail

   - Make sure that the host's vector length is at capped by a value
     common to all CPUs

   - Fix kvm_has_feat*() handling of "negative" features, as the current
     code is pretty broken

   - Promote Joey to the status of official reviewer, while James steps
     down -- hopefully only temporarly

  x86:

   - Fix compilation with KVM_INTEL=KVM_AMD=n

   - Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use

  Selftests:

   - Fix compilation on non-x86 architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  x86/reboot: emergency callbacks are now registered by common KVM code
  KVM: x86: leave kvm.ko out of the build if no vendor module is requested
  KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
  KVM: arm64: Fix kvm_has_feat*() handling of negative features
  KVM: selftests: Fix build on architectures other than x86_64
  KVM: arm64: Another reviewer reshuffle
  KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
  KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path

Merge tag 'powerpc-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:

- Allow r30 to be used in vDSO code generation of getrandom

Thanks to Jason A. Donenfeld

* tag 'powerpc-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/vdso: allow r30 in vDSO code generation of getrandom

kbuild: deb-pkg: Remove blank first line from maint scripts

The blank line causes execve() to fail:

  # strace ./postinst
  execve("./postinst", ...) = -1 ENOEXEC (Exec format error)
  strace: exec: Exec format error
  +++ exited with 1 +++

However running the scripts via shell does work (at least with bash)
because the shell attempts to execute the file as a shell script when
execve() fails.

Fixes: b611daae5efc ("kbuild: deb-pkg: split image and debug objects staging out into functions")
Signed-off-by: Aaron Thompson <dev@aaront.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

kbuild: fix a typo dt_binding_schema -> dt_binding_schemas

If we follow "make help" to "make dt_binding_schema", we will see
below error:

$ make dt_binding_schema
make[1]: *** No rule to make target 'dt_binding_schema'. Stop.
make: *** [Makefile:224: __sub-make] Error 2

It should be a typo. So this will fix it.

Fixes: 604a57ba9781 ("dt-bindings: kbuild: Add separate target/dependency for processed-schema.json")
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

scripts: import more list macros

Import list_is_first, list_is_last, list_replace, and list_replace_init.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors

x86_android_tablet_remove() frees the pdevs[] array, so it should not
be used after calling x86_android_tablet_remove().

When platform_device_register() fails, store the pdevs[x] PTR_ERR() value
into the local ret variable before calling x86_android_tablet_remove()
to avoid using pdevs[] after it has been freed.

Fixes: 5eba0141206e ("platform/x86: x86-android-tablets: Add support for instantiating platform-devs")
Fixes: e2200d3f26da ("platform/x86: x86-android-tablets: Add gpio_keys support to x86_android_tablet_init()")
Cc: stable@vger.kernel.org
Reported-by: Aleksandr Burakov <a.burakov@rosalinux.ru>
Closes: https://lore.kernel.org/platform-driver-x86/20240917120458.7300-1-a.burakov@rosalinux.ru/
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20241005130545.64136-1-hdegoede@redhat.com

platform/x86: wmi: Update WMI driver API documentation

The WMI driver core now passes the WMI event data to legacy notify
handlers, so WMI devices sharing notification IDs are now being
handled properly.

Fixes: e04e2b760ddb ("platform/x86: wmi: Pass event data directly to legacy notify handlers")
Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Link: https://lore.kernel.org/r/20241005213825.701887-1-W_Armin@gmx.de
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

platform/x86: dell-ddv: Fix typo in documentation

Fix typo in word 'diagnostics' in documentation.

Signed-off-by: Anaswara T Rajan <anaswaratrajan@gmail.com>
Reviewed-by: Armin Wolf <W_Armin@gmx.de>
Link: https://lore.kernel.org/r/20241005070056.16326-1-anaswaratrajan@gmail.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

platform/x86: dell-sysman: add support for alienware products

Alienware supports firmware-attributes and has its own OEM string.

Signed-off-by: Crag Wang <crag_wang@dell.com>
Link: https://lore.kernel.org/r/20241004152826.93992-1-crag_wang@dell.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

platform/x86/intel: power-domains: Add Diamond Rapids support

Add Diamond Rapids (INTEL_PANTHERCOVE_X) to tpmi_cpu_ids to support
domaid id mappings.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lore.kernel.org/r/20241003215554.3013807-3-srinivas.pandruvada@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

platform/x86: ISST: Add Diamond Rapids to support list

Add Diamond Rapids (INTEL_PANTHERCOVE_X) to SST support list by adding
to isst_cpu_ids.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lore.kernel.org/r/20241003215554.3013807-2-srinivas.pandruvada@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

platform/x86:intel/pmc: Disable ACPI PM Timer disabling on Sky and Kaby Lake

There have been multiple reports that the ACPI PM Timer disabling is
causing Sky and Kaby Lake systems to hang on all suspend (s2idle, s3,
hibernate) methods.

Remove the acpi_pm_tmr_ctl_offset and acpi_pm_tmr_disable_bit settings from
spt_reg_map to disable the ACPI PM Timer disabling on Sky and Kaby Lake to
fix the hang on suspend.

Fixes: e86c8186d03a ("platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Closes: https://lore.kernel.org/linux-pm/18784f62-91ff-4d88-9621-6c88eb0af2b5@molgen.mpg.de/
Reported-by: Todd Brandt <todd.e.brandt@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219346
Cc: Marek Maslanka <mmaslanka@google.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Todd Brandt <todd.e.brandt@intel.com>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> # Dell XPS 13 9360/0596KF
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20241003202614.17181-2-hdegoede@redhat.com

platform/x86: dell-laptop: Do not fail when encountering unsupported batteries

If the battery hook encounters a unsupported battery, it will
return an error. This in turn will cause the battery driver to
automatically unregister the battery hook.

On machines with multiple batteries however, this will prevent
the battery hook from handling the primary battery, since it will
always get unregistered upon encountering one of the unsupported
batteries.

Fix this by simply ignoring unsupported batteries.

Reviewed-by: Pali Rohár <pali@kernel.org>
Fixes: ab58016c68cc ("platform/x86:dell-laptop: Add knobs to change battery charge settings")
Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Link: https://lore.kernel.org/r/20241001212835.341788-4-W_Armin@gmx.de
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

MAINTAINERS: Update Intel In Field Scan(IFS) entry

Ashok is no longer with Intel and his e-mail address will start bouncing
soon. Update his email address to the new one he provided to ensure
correct contact details in the MAINTAINERS file.

Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Link: https://lore.kernel.org/r/20241001170808.203970-1-jithu.joseph@intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.12, take #1

- Fix pKVM error path on init, making sure we do not change critical
  system registers as we're about to fail

- Make sure that the host's vector length is at capped by a value
  common to all CPUs

- Fix kvm_has_feat*() handling of "negative" features, as the current
  code is pretty broken

- Promote Joey to the status of official reviewer, while James steps
  down -- hopefully only temporarly

x86/reboot: emergency callbacks are now registered by common KVM code

Guard them with CONFIG_KVM_X86_COMMON rather than the two vendor modules.
In practice this has no functional change, because CONFIG_KVM_X86_COMMON
is set if and only if at least one vendor-specific module is being built.
However, it is cleaner to specify CONFIG_KVM_X86_COMMON for functions that
are used in kvm.ko.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
Fixes: 6d55a94222db ("x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: leave kvm.ko out of the build if no vendor module is requested

kvm.ko is nothing but library code shared by kvm-intel.ko and kvm-amd.ko.
It provides no functionality on its own and it is unnecessary unless one
of the vendor-specific module is compiled. In particular, /dev/kvm is
not created until one of kvm-intel.ko or kvm-amd.ko is loaded.

Use CONFIG_KVM to decide if it is built-in or a module, but use the
vendor-specific modules for the actual decision on whether to build it.

This also fixes a build failure when CONFIG_KVM_INTEL and CONFIG_KVM_AMD
are both disabled. The cpu_emergency_register_virt_callback() function
is called from kvm.ko, but it is only defined if at least one of
CONFIG_KVM_INTEL and CONFIG_KVM_AMD is provided.

Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
"A lot of little fixes, bigger ones include:

   - bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
     changes, now fixed along with another lost wakeup

   - fragmentation LRU fixes; fsck now repairs successfully (this is the
     data structure copygc uses); along with some nice simplification.

   - Rework logged op error handling, so that if logged op replay errors
     (due to another filesystem error) we delete the logged op instead
     of going into an infinite loop)

   - Various small filesystem connectivitity repair fixes"

* tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs:
  bcachefs: Rework logged op error handling
  bcachefs: Add warn param to subvol_get_snapshot, peek_inode
  bcachefs: Kill snapshot arg to fsck_write_inode()
  bcachefs: Check for unlinked, non-empty dirs in check_inode()
  bcachefs: Check for unlinked inodes with dirents
  bcachefs: Check for directories with no backpointers
  bcachefs: Kill alloc_v4.fragmentation_lru
  bcachefs: minor lru fsck fixes
  bcachefs: Mark more errors AUTOFIX
  bcachefs: Make sure we print error that causes fsck to bail out
  bcachefs: bkey errors are only AUTOFIX during read
  bcachefs: Create lost+found in correct snapshot
  bcachefs: Fix reattach_inode()
  bcachefs: Add missing wakeup to bch2_inode_hash_remove()
  bcachefs: Fix trans_commit disk accounting revert
  bcachefs: Fix bch2_inode_is_open() check
  bcachefs: Fix return type of dirent_points_to_inode_nowarn()
  bcachefs: Fix bad shift in bch2_read_flag_list()

Merge tag 'for-linus-6.12a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fix from Juergen Gross:
"Fix Xen config issue introduced in the merge window"

* tag 'for-linus-6.12a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: Fix config option reference in XEN_PRIVCMD definition

Merge tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
"Fix some ext4 bugs and regressions relating to oneline resize and fast
  commits"

* tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix off by one issue in alloc_flex_gd()
  ext4: mark fc as ineligible using an handle in ext4_xattr_set()
  ext4: use handle to mark fc as ineligible in __track_dentry_update()

Merge tag 'cxl-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fix from Ira Weiny:

- Fix calculation for SBDF in error injection

* tag 'cxl-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
EINJ, CXL: Fix CXL device SBDF calculation

Merge tag 'i2c-for-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fix from Wolfram Sang:

- Fix potential deadlock during runtime suspend and resume (stm32f7)

* tag 'i2c-for-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: stm32f7: Do not prepare/unprepare clock during runtime suspend/resume

Merge tag 'spi-fix-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A small set of driver specific fixes that came in since the merge
  window, about half of which is fixes for correctness in the use of the
  runtime PM APIs done as part of a broader cleanup"

* tag 'spi-fix-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: s3c64xx: fix timeout counters in flush_fifo
  spi: atmel-quadspi: Fix wrong register value written to MR
  spi: spi-cadence: Fix missing spi_controller_is_target() check
  spi: spi-cadence: Fix pm_runtime_set_suspended() with runtime pm enabled
  spi: spi-imx: Fix pm_runtime_set_suspended() with runtime pm enabled

Merge tag 'hardening-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

- gcc plugins: Avoid Kconfig warnings with randstruct (Nathan
   Chancellor)

- MAINTAINERS: Add security/Kconfig.hardening to hardening section
   (Nathan Chancellor)

- MAINTAINERS: Add unsafe_memcpy() to the FORTIFY review list

* tag 'hardening-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  MAINTAINERS: Add security/Kconfig.hardening to hardening section
  hardening: Adjust dependencies in selection of MODVERSIONS
  MAINTAINERS: Add unsafe_memcpy() to the FORTIFY review list

Merge tag 'lsm-pr-20241004' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm revert from Paul Moore:
"Here is the CONFIG_SECURITY_TOMOYO_LKM revert that we've been
  discussing this week. With near unanimous agreement that the original
  TOMOYO patches were not the right way to solve the distro problem
  Tetsuo is trying the solve, reverting is our best option at this time"

* tag 'lsm-pr-20241004' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  tomoyo: revert CONFIG_SECURITY_TOMOYO_LKM support

platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug

Attaching SST PCI device to VM causes "BUG: KASAN: slab-out-of-bounds".
kasan report:
[   19.411889] ==================================================================
[   19.413702] BUG: KASAN: slab-out-of-bounds in _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[   19.415634] Read of size 8 at addr ffff888829e65200 by task cpuhp/16/113
[   19.417368]
[   19.418627] CPU: 16 PID: 113 Comm: cpuhp/16 Tainted: G            E      6.9.0 #10
[   19.420435] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20192059.B64.2207280713 07/28/2022
[   19.422687] Call Trace:
[   19.424091]  <TASK>
[   19.425448]  dump_stack_lvl+0x5d/0x80
[   19.426963]  ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[   19.428694]  print_report+0x19d/0x52e
[   19.430206]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[   19.431837]  ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[   19.433539]  kasan_report+0xf0/0x170
[   19.435019]  ? _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[   19.436709]  _isst_if_get_pci_dev+0x3d5/0x400 [isst_if_common]
[   19.438379]  ? __pfx_sched_clock_cpu+0x10/0x10
[   19.439910]  isst_if_cpu_online+0x406/0x58f [isst_if_common]
[   19.441573]  ? __pfx_isst_if_cpu_online+0x10/0x10 [isst_if_common]
[   19.443263]  ? ttwu_queue_wakelist+0x2c1/0x360
[   19.444797]  cpuhp_invoke_callback+0x221/0xec0
[   19.446337]  cpuhp_thread_fun+0x21b/0x610
[   19.447814]  ? __pfx_cpuhp_thread_fun+0x10/0x10
[   19.449354]  smpboot_thread_fn+0x2e7/0x6e0
[   19.450859]  ? __pfx_smpboot_thread_fn+0x10/0x10
[   19.452405]  kthread+0x29c/0x350
[   19.453817]  ? __pfx_kthread+0x10/0x10
[   19.455253]  ret_from_fork+0x31/0x70
[   19.456685]  ? __pfx_kthread+0x10/0x10
[   19.458114]  ret_from_fork_asm+0x1a/0x30
[   19.459573]  </TASK>
[   19.460853]
[   19.462055] Allocated by task 1198:
[   19.463410]  kasan_save_stack+0x30/0x50
[   19.464788]  kasan_save_track+0x14/0x30
[   19.466139]  __kasan_kmalloc+0xaa/0xb0
[   19.467465]  __kmalloc+0x1cd/0x470
[   19.468748]  isst_if_cdev_register+0x1da/0x350 [isst_if_common]
[   19.470233]  isst_if_mbox_init+0x108/0xff0 [isst_if_mbox_msr]
[   19.471670]  do_one_initcall+0xa4/0x380
[   19.472903]  do_init_module+0x238/0x760
[   19.474105]  load_module+0x5239/0x6f00
[   19.475285]  init_module_from_file+0xd1/0x130
[   19.476506]  idempotent_init_module+0x23b/0x650
[   19.477725]  __x64_sys_finit_module+0xbe/0x130
[   19.476506]  idempotent_init_module+0x23b/0x650
[   19.477725]  __x64_sys_finit_module+0xbe/0x130
[   19.478920]  do_syscall_64+0x82/0x160
[   19.480036]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   19.481292]
[   19.482205] The buggy address belongs to the object at ffff888829e65000
which belongs to the cache kmalloc-512 of size 512
[   19.484818] The buggy address is located 0 bytes to the right of
allocated 512-byte region [ffff888829e65000, ffff888829e65200)
[   19.487447]
[   19.488328] The buggy address belongs to the physical page:
[   19.489569] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888829e60c00 pfn:0x829e60
[   19.491140] head: order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   19.492466] anon flags: 0x57ffffc0000840(slab|head|node=1|zone=2|lastcpupid=0x1fffff)
[   19.493914] page_type: 0xffffffff()
[   19.494988] raw: 0057ffffc0000840 ffff88810004cc80 0000000000000000 0000000000000001
[   19.496451] raw: ffff888829e60c00 0000000080200018 00000001ffffffff 0000000000000000
[   19.497906] head: 0057ffffc0000840 ffff88810004cc80 0000000000000000 0000000000000001
[   19.499379] head: ffff888829e60c00 0000000080200018 00000001ffffffff 0000000000000000
[   19.500844] head: 0057ffffc0000003 ffffea0020a79801 ffffea0020a79848 00000000ffffffff
[   19.502316] head: 0000000800000000 0000000000000000 00000000ffffffff 0000000000000000
[   19.503784] page dumped because: kasan: bad access detected
[   19.505058]
[   19.505970] Memory state around the buggy address:
[   19.507172]  ffff888829e65100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   19.508599]  ffff888829e65180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   19.510013] >ffff888829e65200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   19.510014]                    ^
[   19.510016]  ffff888829e65280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   19.510018]  ffff888829e65300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   19.515367] ==================================================================

The reason for this error is physical_package_ids assigned by VMware VMM
are not continuous and have gaps. This will cause value returned by
topology_physical_package_id() to be more than topology_max_packages().

Here the allocation uses topology_max_packages(). The call to
topology_max_packages() returns maximum logical package ID not physical
ID. Hence use topology_logical_package_id() instead of
topology_physical_package_id().

Fixes: 9a1aac8a96dc ("platform/x86: ISST: PUNIT device mapping with Sub-NUMA clustering")
Cc: stable@vger.kernel.org
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Zach Wade <zachwade.k@gmail.com>
Link: https://lore.kernel.org/r/20240923144508.1764-1-zachwade.k@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>

Merge tag 'linux_kselftest-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan:
"Fixes to build warnings, install scripts, run-time error path, and git
  status cleanups to tests:

   - devices/probe: fix for Python3 regex string syntax warnings

   - clone3: removing unused macro from clone3_cap_checkpoint_restore()

   - vDSO: fix to align getrandom states to cache line

   - core and exec: add missing executables to .gitignore files

   - rtc: change to skip test if /dev/rtc0 can't be accessed

   - timers/posix: fix warn_unused_result result in __fatal_error()

   - breakpoints: fix to detect suspend successful condition correctly

   - hid: fix to install required dependencies to run the test"

* tag 'linux_kselftest-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: breakpoints: use remaining time to check if suspend succeed
  kselftest/devices/probe: Fix SyntaxWarning in regex strings for Python3
  selftest: hid: add missing run-hid-tools-tests.sh
  selftests: vDSO: align getrandom states to cache line
  selftests: exec: update gitignore for load_address
  selftests: core: add unshare_test to gitignore
  clone3: clone3_cap_checkpoint_restore: remove unused MAX_PID_NS_LEVEL macro
  selftests:timers: posix_timers: Fix warn_unused_result in __fatal_error()
  selftest: rtc: Check if could access /dev/rtc0 before testing

bcachefs: Rework logged op error handling

Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add warn param to subvol_get_snapshot, peek_inode

These shouldn't always be fatal errors - logged op resume, in
particular, and we want it as a parameter there.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill snapshot arg to fsck_write_inode()

It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshow we read
it from.

This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Check for unlinked, non-empty dirs in check_inode()

We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Check for unlinked inodes with dirents

link count works differently in bcachefs - it's only nonzero for files
with multiple hardlinks, which means we can also avoid checking it
except for files that are known to have hardlinks.

That means we need a few different checks instead; in particular, we
don't want fsck to delet a file that has a dirent pointing to it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Check for directories with no backpointers

It's legal for regular files to have missing backpointers (due to
hardlinks), and fsck should automatically add them, but for directories
this is an error that should be flagged.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill alloc_v4.fragmentation_lru

The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.

It should have been deleted at the time of the LRU rework; since it
wasn't, that left places for bugs to hide, in check/repair.

This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: minor lru fsck fixes

check_lru_key() wasn't using write buffer updates for deleting bad lru
entries - dating from before the lru btree used the btree write buffer.

And when possibly flushing the btree write buffer (to make sure we're
seeing a real inconsistency), we need to be using the modern
bch2_btree_write_buffer_maybe_flush().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Mark more errors AUTOFIX

Errors are getting marked as AUTOFIX once they've been (re)-tested and
audited.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Make sure we print error that causes fsck to bail out

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bkey errors are only AUTOFIX during read

Newly generated keys, in the transaction commit path or write path,
should not be AUTOFIX; those indicate bugs that we need to fail fast
for.

Fixes: 5612daafb764 ("bcachefs: Fix fsck warnings from bkey validation")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Create lost+found in correct snapshot

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix reattach_inode()

Ensure a copy of the lost+found inode exists in the snapshot that we're
reattaching, so that we don't trigger warnings in
lookup_inode_for_snapshot() later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Add missing wakeup to bch2_inode_hash_remove()

This fixes two different bugs:

- Looser locking with the rhashtable means we need to recheck if the
  inode is still hashed after prepare_to_wait(), and add a corresponding
  wakeup after removing from the hash table.

- da18ecbf0fb6 ("fs: add i_state helpers") changed the bit waitqueues
  used for inodes, and bcachefs wasn't updated and thus broke; this
  updates bcachefs to the new helper.

Fixes: 112d21fd1a12 ("bcachefs: switch to rhashtable for vfs inodes hash")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ext4: fix off by one issue in alloc_flex_gd()

Wesley reported an issue:

==================================================================
EXT4-fs (dm-5): resizing filesystem from 7168 to 786432 blocks
------------[ cut here ]------------
kernel BUG at fs/ext4/resize.c:324!
CPU: 9 UID: 0 PID: 3576 Comm: resize2fs Not tainted 6.11.0+ #27
RIP: 0010:ext4_resize_fs+0x1212/0x12d0
Call Trace:
__ext4_ioctl+0x4e0/0x1800
ext4_ioctl+0x12/0x20
__x64_sys_ioctl+0x99/0xd0
x64_sys_call+0x1206/0x20d0
do_syscall_64+0x72/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================

While reviewing the patch, Honza found that when adjusting resize_bg in
alloc_flex_gd(), it was possible for flex_gd->resize_bg to be bigger than
flexbg_size.

The reproduction of the problem requires the following:

o_group = flexbg_size * 2 * n;
o_size = (o_group + 1) * group_size;
n_group: [o_group + flexbg_size, o_group + flexbg_size * 2)
o_size = (n_group + 1) * group_size;

Take n=0,flexbg_size=16 as an example:

              last:15
|o---------------|--------------n-|
o_group:0    resize to      n_group:30

The corresponding reproducer is:

img=test.img
rm -f $img
truncate -s 600M $img
mkfs.ext4 -F $img -b 1024 -G 16 8M
dev=`losetup -f --show $img`
mkdir -p /tmp/test
mount $dev /tmp/test
resize2fs $dev 248M

Delete the problematic plus 1 to fix the issue, and add a WARN_ON_ONCE()
to prevent the issue from happening again.

[ Note: another reproucer which this commit fixes is:

  img=test.img
  rm -f $img
  truncate -s 25MiB $img
  mkfs.ext4 -b 4096 -E nodiscard,lazy_itable_init=0,lazy_journal_init=0 $img
  truncate -s 3GiB $img
  dev=`losetup -f --show $img`
  mkdir -p /tmp/test
  mount $dev /tmp/test
  resize2fs $dev 3G
  umount $dev
  losetup -d $dev

  -- TYT ]

Reported-by: Wesley Hershberger <wesley.hershberger@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2081231
Reported-by: Stéphane Graber <stgraber@stgraber.org>
Closes: https://lore.kernel.org/all/20240925143325.518508-1-aleksandr.mikhalitsyn@canonical.com/
Tested-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Tested-by: Eric Sandeen <sandeen@redhat.com>
Fixes: 665d3e0af4d3 ("ext4: reduce unnecessary memory allocation in alloc_flex_gd()")
Cc: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240927133329.1015041-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: mark fc as ineligible using an handle in ext4_xattr_set()

Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch moves the call to this function so that an handle
can be used. If a transaction fails to start, then there's not point in
trying to mark the filesystem as ineligible, and an error will eventually be
returned to user-space.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240923104909.18342-3-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

ext4: use handle to mark fc as ineligible in __track_dentry_update()

Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch fixes the calls to this function in
__track_dentry_update() by adding an extra parameter to the callback used in
ext4_fc_track_template().

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240923104909.18342-2-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:
"A couple of build/config issues and expanding the speculative SSBS
  workaround to more CPUs:

   - Expand the speculative SSBS workaround to cover Cortex-A715,
     Neoverse-N3 and Microsoft Azure Cobalt 100

   - Force position-independent veneers - in some kernel configurations,
     the LLD linker generates position-dependent veneers for otherwise
     position-independent code, resulting in early boot-time failures

   - Fix Kconfig selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS so that it
     is not enabled when not supported by the combination of clang and
     GNU ld"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Subscribe Microsoft Azure Cobalt 100 to erratum 3194386
  arm64: fix selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS
  arm64: errata: Expand speculative SSBS workaround once more
  arm64: cputype: Add Neoverse-N3 definitions
  arm64: Force position-independent veneers

Merge tag 'riscv-for-linus-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

- PERF_TYPE_BREAKPOINT now returns -EOPNOTSUPP instead of -ENOENT,
   which aligns to other ports and is a saner value

- The KASAN-related stack size increasing logic has been moved to a C
   header, to avoid dependency issues

* tag 'riscv-for-linus-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Fix kernel stack size when KASAN is enabled
  drivers/perf: riscv: Align errno for unsupported perf event

Merge tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Fix tp_printk command line option crashing the kernel

   With the code that can handle a buffer from a previous boot, the
   trace_check_vprintf() needed access to the delta of the address space
   used by the old buffer and the current buffer. To do so, the
   trace_array (tr) parameter was used. But when tp_printk is enabled on
   the kernel command line, no trace buffer is used and the trace event
   is sent directly to printk(). That meant the tr field of the iterator
   descriptor was NULL, and since tp_printk still uses
   trace_check_vprintf() it caused a NULL dereference.

- Add ptrace.h include to x86 ftrace file for completeness

- Fix rtla installation when done with out-of-tree build

- Fix the help messages in rtla that were incorrect

- Several fixes to fix races with the timerlat and hwlat code

   Several locking issues were discovered with the coordination between
   timerlat kthread creation and hotplug. As timerlat has callbacks from
   hotplug code to start kthreads when CPUs come online. There are also
   locking issues with grabbing the cpu_read_lock() and the locks within
   timerlat.

* tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/hwlat: Fix a race during cpuhp processing
  tracing/timerlat: Fix a race during cpuhp processing
  tracing/timerlat: Drop interface_lock in stop_kthread()
  tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
  x86/ftrace: Include <asm/ptrace.h>
  rtla: Fix the help text in osnoise and timerlat top tools
  tools/rtla: Fix installation from out-of-tree build
  tracing: Fix trace_check_vprintf() when tp_printk is used

Merge tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fixes from Vlastimil Babka:
"Fixes for issues introduced in this merge window: kobject memory leak,
  unsupressed warning and possible lockup in new slub_kunit tests,
  misleading code in kvfree_rcu_queue_batch()"

* tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  slub/kunit: skip test_kfree_rcu when the slub kunit test is built-in
  mm, slab: suppress warnings in test_leak_destroy kunit test
  rcu/kvfree: Refactor kvfree_rcu_queue_batch()
  mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()

Merge tag 'acpi-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These fix up the ACPI IRQ override quirk list and add two new entries
  to it, add a new quirk to the ACPI backlight (video) driver, and fix
  the ACPI battery driver.

  Specifics:

   - Add a quirk for Dell OptiPlex 5480 AIO to the ACPI backlight
     (video) driver (Hans de Goede)

   - Prevent the ACPI battery driver from crashing when unregistering a
     battery hook and simplify battery hook locking in it (Armin Wolf)

   - Fix up the ACPI IRQ override quirk list and add quirks for Asus
     Vivobook X1704VAP and Asus ExpertBook B2502CVA to it (Hans de
     Goede)"

* tag 'acpi-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: battery: Fix possible crash when unregistering a battery hook
  ACPI: battery: Simplify battery hook locking
  ACPI: video: Add backlight=native quirk for Dell OptiPlex 5480 AIO
  ACPI: resource: Add Asus ExpertBook B2502CVA to irq1_level_low_skip_override[]
  ACPI: resource: Add Asus Vivobook X1704VAP to irq1_level_low_skip_override[]
  ACPI: resource: Loosen the Asus E1404GAB DMI match to also cover the E1404GA
  ACPI: resource: Remove duplicate Asus E1504GAB IRQ override

Merge tag 'pm-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix two cpufreq issues, one in the core and one in the
  intel_pstate driver:

   - Fix CPU device node reference counting in the cpufreq core (Miquel
     Sabaté Solà)

   - Turn the spinlock used by the intel_pstate driver in hard IRQ
     context into a raw one to prevent the driver from crashing when
     PREEMPT_RT is enabled (Uwe Kleine-König)"

* tag 'pm-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Avoid a bad reference count on CPU node
  cpufreq: intel_pstate: Make hwp_notify_lock a raw spinlock

Merge tag 'gpio-fixes-for-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

- fix a potential NULL-pointer dereference in gpiolib core

- fix a probe() regression from the v6.12 merge window and an older bug
   leading to missed interrupts in gpio-davinci

* tag 'gpio-fixes-for-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpiolib: Fix potential NULL pointer dereference in gpiod_get_label()
  gpio: davinci: Fix condition for irqchip registration
  gpio: davinci: fix lazy disable

Merge tag 'sound-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Slightly high amount of changes in this round, partly because of my
  vacation in the last weeks. But all changes are small and nothing
  looks worrisome.

  The biggest LOCs is MAINTAINERS updates, and there is a core change
  for card-ID string creation for non-ASCII inputs. Others are rather
  device-specific, such as new quirks and device IDs for ASoC, usual
  HD-audio and USB-audio quirks and fixes, as well as regression fixes
  in HD-audio HDMI audio and Conexant codec"

* tag 'sound-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (39 commits)
  ALSA: hda/conexant: Fix conflicting quirk for System76 Pangolin
  ALSA: line6: add hw monitor volume control to POD HD500X
  ALSA: gus: Fix some error handling paths related to get_bpos() usage
  ALSA: hda: Add missing parameter description for snd_hdac_stream_timecounter_init()
  ALSA: usb-audio: Add native DSD support for Luxman D-08u
  ALSA: core: add isascii() check to card ID generator
  MAINTAINERS: ALSA: use linux-sound@vger.kernel.org list
  Revert "ALSA: hda: Conditionally use snooping for AMD HDMI"
  ASoC: intel: sof_sdw: Add check devm_kasprintf() returned value
  ASoC: imx-card: Set card.owner to avoid a warning calltrace if SND=m
  ASoC: dt-bindings: davinci-mcasp: Fix interrupts property
  ASoC: qcom: sm8250: add qrb4210-rb2-sndcard compatible string
  ASoC: dt-bindings: qcom,sm8250: add qrb4210-rb2-sndcard
  ALSA: hda: fix trigger_tstamp_latched
  ALSA: hda/realtek: Add a quirk for HP Pavilion 15z-ec200
  ALSA: hda/generic: Drop obsoleted obey_preferred_dacs flag
  ALSA: hda/generic: Unconditionally prefer preferred_dacs pairs
  ALSA: silence integer wrapping warning
  ASoC: Intel: soc-acpi: arl: Fix some missing empty terminators
  ASoC: Intel: soc-acpi-intel-rpl-match: add missing empty item
  ...

Merge tag 'drm-fixes-2024-10-04' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly fixes, xe and amdgpu lead the way, with panthor, and few core
  components getting various fixes. Nothing seems too out of the
  ordinary.

  atomic:
   - Use correct type when reading damage rectangles

  display:
   - Fix kernel docs

  dp-mst:
   - Fix DSC decompression detection

  hdmi:
   - Fix infoframe size

  sched:
   - Update maintainers
   - Fix race condition whne queueing up jobs
   - Fix locking in drm_sched_entity_modify_sched()
   - Fix pointer deref if entity queue changes

  sysfb:
   - Disable sysfb if framebuffer parent device is unknown

  amdgpu:
   - DML2 fix
   - DSC fix
   - Dispclk fix
   - eDP HDR fix
   - IPS fix
   - TBT fix

  i915:
   - One fix for bitwise and logical "and" mixup in PM code

  xe:
   - Restore pci state on resume
   - Fix locking on submission, queue and vm
   - Fix UAF on queue destruction
   - Fix resource release on freq init error path
   - Use rw_semaphore to reduce contention on ASID->VM lookup
   - Fix steering for media on Xe2_HPM
   - Tuning updates to Xe2
   - Resume TDR after GT reset to prevent jobs running forever
   - Move id allocation to avoid userspace using a guessed number to
     trigger UAF
   - Fix OA stream close preventing pbatch buffers to complete
   - Fix NPD when migrating memory on LNL
   - Fix memory leak when aborting binds

  panthor:
   - Fix locking
   - Set FOP_UNSIGNED_OFFSET in fops instance
   - Acquire lock in panthor_vm_prepare_map_op_ctx()
   - Avoid uninitialized variable in tick_ctx_cleanup()
   - Do not block scheduler queue if work is pending
   - Do not add write fences to the shared BOs

  vbox:
   - Fix VLA handling"

* tag 'drm-fixes-2024-10-04' of https://gitlab.freedesktop.org/drm/kernel: (41 commits)
  drm/xe: Fix memory leak when aborting binds
  drm/xe: Prevent null pointer access in xe_migrate_copy
  drm/xe/oa: Don't reset OAC_CONTEXT_ENABLE on OA stream close
  drm/xe/queue: move xa_alloc to prevent UAF
  drm/xe/vm: move xa_alloc to prevent UAF
  drm/xe: Clean up VM / exec queue file lock usage.
  drm/xe: Resume TDR after GT reset
  drm/xe/xe2: Add performance tuning for L3 cache flushing
  drm/xe/xe2: Extend performance tuning to media GT
  drm/xe/mcr: Use Xe2_LPM steering tables for Xe2_HPM
  drm/xe: Use helper for ASID -> VM in GPU faults and access counters
  drm/xe: Convert to USM lock to rwsem
  drm/xe: use devm_add_action_or_reset() helper
  drm/xe: fix UAF around queue destruction
  drm/xe/guc_submit: add missing locking in wedged_fini
  drm/xe: Restore pci state upon resume
  drm/amd/display: Fix system hang while resume with TBT monitor
  drm/amd/display: Enable idle workqueue for more IPS modes
  drm/amd/display: Add HDR workaround for specific eDP
  drm/amd/display: avoid set dispclk to 0
  ...

Merge tag 'block-6.12-20241004' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- Fix another use-after-free in aoe

- Fixup wrong nested non-saving irq disable/restore in blk-iocost

- Fixup a kerneldoc complaint introduced by a merge window patch

* tag 'block-6.12-20241004' of git://git.kernel.dk/linux:
  aoe: fix the potential use-after-free problem in more places
  blk_iocost: remove some duplicate irq disable/enables
  block: fix blk_rq_map_integrity_sg kernel-doc

Merge tag 'io_uring-6.12-20241004' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

- Fix an error path memory leak, if one part fails to allocate.
   Obviously not something that'll generally hit without error
   injection.

- Fix an io_req_flags_t cast to make sparse happier.

- Improve the recv multishot termination. Not a bug now, but could be
   one in the future. This makes it do the same thing that recvmsg does
   in terms of when to terminate a request or not.

* tag 'io_uring-6.12-20241004' of git://git.kernel.dk/linux:
  io_uring/net: harden multishot termination case for recv
  io_uring: fix casts to io_req_flags_t
  io_uring: fix memory leak when cache init fail

Merge tag 'fsnotify_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
"Fixes for an inotify deadlock and a data race in fsnotify"

* tag 'fsnotify_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
inotify: Fix possible deadlock in fsnotify_destroy_mark
fsnotify: Avoid data race between fsnotify_recalc_mask() and fsnotify_object_watched()

Merge tag 'fs_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF fixes from Jan Kara:
"A couple of UDF error handling fixes for issues spotted by syzbot"

* tag 'fs_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: fix uninit-value use in udf_get_fileshortad
  udf: refactor inode_bmap() to handle error
  udf: refactor udf_next_aext() to handle error
  udf: refactor udf_current_aext() to handle error

Merge tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
"A fix from Patrick for a variety of CephFS lockup scenarios caused by
  a regression in cap handling which sneaked in through the netfs helper
  library in 5.18 (marked for stable) and an unrelated one-line cleanup"

* tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client:
  ceph: fix cap ref leak via netfs init_request
  ceph: use struct_size() helper in __ceph_pool_perm_get()

Merge branches 'acpi-video' and 'acpi-battery'

Merge an ACPI backlight (video) quirk and ACPI battery driver fix and
cleanup for 6.12-rc2:

- Add a quirk for Dell OptiPlex 5480 AIO to the ACPI backlight (video)
   driver (Hans de Goede).

- Prevent the ACPI battery driver from crashing when unregistering a
   battery hook and simplify battery hook locking in it (Armin Wolf).

* acpi-video:
  ACPI: video: Add backlight=native quirk for Dell OptiPlex 5480 AIO

* acpi-battery:
  ACPI: battery: Fix possible crash when unregistering a battery hook
  ACPI: battery: Simplify battery hook locking

Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- in incremental send, fix invalid clone operation for file that got
   its size decreased

- fix __counted_by() annotation of send path cache entries, we do not
   store the terminating NUL

- fix a longstanding bug in relocation (and quite hard to hit by
   chance), drop back reference cache that can get out of sync after
   transaction commit

- wait for fixup worker kthread before finishing umount

- add missing raid-stripe-tree extent for NOCOW files, zoned mode
   cannot have NOCOW files but RST is meant to be a standalone feature

- handle transaction start error during relocation, avoid potential
   NULL pointer dereference of relocation control structure (reported by
   syzbot)

- disable module-wide rate limiting of debug level messages

- minor fix to tracepoint definition (reported by checkpatch.pl)

* tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: disable rate limiting when debug enabled
  btrfs: wait for fixup workers before stopping cleaner kthread during umount
  btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
  btrfs: send: fix invalid clone operation for file that got its size decreased
  btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
  btrfs: drop the backref cache during relocation if we commit
  btrfs: also add stripe entries for NOCOW writes
  btrfs: send: fix buffer overflow detection when copying path to cache entry

Merge tag 'v6.12-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- statfs fix (e.g. when limited access to root directory of share)

- special file handling fixes: fix packet validation to avoid buffer
   overflow for reparse points, fixes for symlink path parsing (one for
   reparse points, and one for SFU use case), and fix for cleanup after
   failed SET_REPARSE operation.

- fix for SMB2.1 signing bug introduced by recent patch to NFS symlink
   path, and NFS reparse point validation

- comment cleanup

* tag 'v6.12-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Do not convert delimiter when parsing NFS-style symlinks
  cifs: Validate content of NFS reparse point buffer
  cifs: Fix buffer overflow when parsing NFS reparse points
  smb: client: Correct typos in multiple comments across various files
  smb: client: use actual path when queryfs
  cifs: Remove intermediate object of failed create reparse call
  Revert "smb: client: make SHA-512 TFM ephemeral"
  smb: Update comments about some reparse point tags
  cifs: Check for UTF-16 null codepoint in SFU symlink target location

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull close_range() fix from Al Viro:
"Fix the logic in descriptor table trimming"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
close_range(): fix the logics in descriptor table trimming

Merge tag 'i2c-host-fixes-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host fixes for v6.12-rc2

In the stm32f7 a potential deadlock is fixed during runtime
suspend and resume.

tomoyo: revert CONFIG_SECURITY_TOMOYO_LKM support

This patch reverts two TOMOYO patches that were merged into Linus' tree
during the v6.12 merge window:

8b985bbfabbe ("tomoyo: allow building as a loadable LSM module")
268225a1de1a ("tomoyo: preparation step for building as a loadable LSM module")

Together these two patches introduced the CONFIG_SECURITY_TOMOYO_LKM
Kconfig build option which enabled a TOMOYO specific dynamic LSM loading
mechanism (see the original commits for more details).  Unfortunately,
this approach was widely rejected by the LSM community as well as some
members of the general kernel community.  Objections included concerns
over setting a bad precedent regarding individual LSMs managing their
LSM callback registrations as well as general kernel symbol exporting
practices.  With little to no support for the CONFIG_SECURITY_TOMOYO_LKM
approach outside of Tetsuo, and multiple objections, we need to revert
these changes.

Link: https://lore.kernel.org/all/0c4b443a-9c72-4800-97e8-a3816b6a9ae2@I-love.SAKURA.ne.jp
Link: https://lore.kernel.org/all/CAHC9VhR=QjdoHG3wJgHFJkKYBg7vkQH2MpffgVzQ0tAByo_wRg@mail.gmail.com
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>

arm64: Subscribe Microsoft Azure Cobalt 100 to erratum 3194386

Add the Microsoft Azure Cobalt 100 CPU to the list of CPUs suffering
from erratum 3194386 added in commit 75b3c43eab59 ("arm64: errata:
Expand speculative SSBS workaround")

CC: Mark Rutland <mark.rutland@arm.com>
CC: James More <james.morse@arm.com>
CC: Will Deacon <will@kernel.org>
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/20241003225239.321774-1-eahariha@linux.microsoft.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

ALSA: hda/conexant: Fix conflicting quirk for System76 Pangolin

We received a regression report for System76 Pangolin (pang14) due to
the recent fix for Tuxedo Sirius devices to support the top speaker.
The reason was the conflicting PCI SSID, as often seen.

As a workaround, now the codec SSID is checked and the quirk is
applied conditionally only to Sirius devices.

Fixes: 4178d78cd7a8 ("ALSA: hda/conexant: Add pincfg quirk to enable top speakers on Sirius devices")
Reported-by: Christian Heusel <christian@heusel.eu>
Reported-by: Jerry <jerryluo225@gmail.com>
Closes: https://lore.kernel.org/c930b6a6-64e5-498f-b65a-1cd5e0a1d733@heusel.eu
Link: https://patch.msgid.link/20241004082602.29016-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: line6: add hw monitor volume control to POD HD500X

Add hw monitor volume control for POD HD500X. This is done adding
LINE6_CAP_HWMON_CTL to the capabilities

Signed-off-by: Hans P. Moller <hmoller@uc.cl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20241003232828.5819-1-hmoller@uc.cl

ALSA: gus: Fix some error handling paths related to get_bpos() usage

If get_bpos() fails, it is likely that the corresponding error code should
be returned.

Fixes: a6970bb1dd99 ("ALSA: gus: Convert to the new PCM ops")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://patch.msgid.link/d9ca841edad697154afa97c73a5d7a14919330d9.1727984008.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Merge tag 'drm-xe-fixes-2024-10-03' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Restore pci state on resume (Rodrigo Vivi)
- Fix locking on submission, queue and vm (Matthew Auld, Matthew Brost)
- Fix UAF on queue destruction (Matthew Auld)
- Fix resource release on freq init error path (He Lugang)
- Use rw_semaphore to reduce contention on ASID->VM lookup (Matthew Brost)
- Fix steering for media on Xe2_HPM (Gustavo Sousa)
- Tuning updates to Xe2 (Gustavo Sousa)
- Resume TDR after GT reset to prevent jobs running forever (Matthew Brost)
- Move id allocation to avoid userspace using a guessed number
to trigger UAF (Matthew Auld, Matthew Brost)
- Fix OA stream close preventing pbatch buffers to complete (José)
- Fix NPD when migrating memory on LNL (Zhanjun Dong)
- Fix memory leak when aborting binds (Matthew Brost)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/2fiv63yanlal5mpw3mxtotte6yvkvtex74c7mkjxca4bazlyja@o4iejcfragxy

Merge tag 'rust-fixes-6.12' of https://github.com/Rust-for-Linux/linux

Pull Rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:

   - Fix/improve a couple 'depends on' on the newly added CFI/KASAN
     suppport to avoid build errors/warnings

   - Fix ARCH_SLAB_MINALIGN multiple definition error for RISC-V under
     !CONFIG_MMU

   - Clean upcoming (Rust 1.83.0) Clippy warnings

  'kernel' crate:

   - 'sync' module: fix soundness issue by requiring 'T: Sync' for
     'LockedBy::access'; and fix helpers build error under PREEMPT_RT

   - Fix trivial sorting issue ('rustfmtcheck') on the v6.12 Rust merge"

* tag 'rust-fixes-6.12' of https://github.com/Rust-for-Linux/linux:
  rust: kunit: use C-string literals to clean warning
  cfi: encode cfi normalized integers + kasan/gcov bug in Kconfig
  rust: KASAN+RETHUNK requires rustc 1.83.0
  rust: cfi: fix `patchable-function-entry` starting version
  rust: mutex: fix __mutex_init() usage in case of PREEMPT_RT
  rust: fix `ARCH_SLAB_MINALIGN` multiple definition error
  rust: sync: require `T: Sync` for `LockedBy::access`
  rust: kernel: sort Rust modules

Merge tag 'pull-fixes.ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull ufs fix from Al Viro:
"Fix ufs_rename() braino introduced this cycle.

  The 'folio_release_kmap(dir_folio, new_dir)' in ufs_rename() part of
  folio conversion should've been getting a pointer to ufs directory
  entry within the page, rather than a pointer to directory struct
  inode..."

* tag 'pull-fixes.ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_rename(): fix bogus argument of folio_release_kmap()

sched: psi: fix bogus pressure spikes from aggregation race

Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU

As was tried in commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant
pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs,
need to be zapped. All of the accounting for a shadow page is tied to the
memslot, i.e. the shadow page holds a reference to the memslot, for all
intents and purposes. Deleting the memslot without removing all relevant
shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled,
results in NULL pointer derefs when tearing down the VM.

Reintroduce from that commit the code that walks the whole memslot when
there are active shadow MMU pages.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

tracing/hwlat: Fix a race during cpuhp processing

The cpuhp online/offline processing race also exists in percpu-mode hwlat
tracer in theory, apply the fix too. That is:

    T1                       | T2
    [CPUHP_ONLINE]           | cpu_device_down()
     hwlat_hotplug_workfn()  |
                             |     cpus_write_lock()
                             |     takedown_cpu(1)
                             |     cpus_write_unlock()
    [CPUHP_OFFLINE]          |
        cpus_read_lock()     |
        start_kthread(1)     |
        cpus_read_unlock()   |

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing/timerlat: Fix a race during cpuhp processing

There is another found exception that the "timerlat/1" thread was
scheduled on CPU0, and lead to timer corruption finally:

```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
<TASK>
? __warn+0x7c/0x110
? debug_print_object+0x7d/0xb0
? report_bug+0xf1/0x1d0
? prb_read_valid+0x17/0x20
? handle_bug+0x3f/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? debug_print_object+0x7d/0xb0
? debug_print_object+0x7d/0xb0
? __pfx_timerlat_irq+0x10/0x10
__debug_object_init+0x110/0x150
hrtimer_init+0x1d/0x60
timerlat_main+0xab/0x2d0
? __pfx_timerlat_main+0x10/0x10
kthread+0xb7/0xe0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
```

After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:

T1                       | T2
[CPUHP_ONLINE]           | cpu_device_down()
osnoise_hotplug_workfn() |
                         |     cpus_write_lock()
                         |     takedown_cpu(1)
                         |     cpus_write_unlock()
[CPUHP_OFFLINE]          |
    cpus_read_lock()     |
    start_kthread(1)     |
    cpus_read_unlock()   |

To fix this, skip online processing if the CPU is already offline.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing/timerlat: Drop interface_lock in stop_kthread()

stop_kthread() is the offline callback for "trace/osnoise:online", since
commit 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing
of kthread in stop_kthread()"), the following ABBA deadlock scenario is
introduced:

T1                            | T2 [BP]               | T3 [AP]
osnoise_hotplug_workfn()      | work_for_cpu_fn()     | cpuhp_thread_fun()
                              |   _cpu_down()         |   osnoise_cpu_die()
  mutex_lock(&interface_lock) |                       |     stop_kthread()
                              |     cpus_write_lock() |       mutex_lock(&interface_lock)
  cpus_read_lock()            |     cpuhp_kick_ap()   |

As the interface_lock here in just for protecting the "kthread" field of
the osn_var, use xchg() instead to fix this issue. Also use
for_each_online_cpu() back in stop_per_cpu_kthreads() as it can take
cpu_read_lock() again.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-3-liwei391@huawei.com
Fixes: 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline

osnoise_hotplug_workfn() is the asynchronous online callback for
"trace/osnoise:online". It may be congested when a CPU goes online and
offline repeatedly and is invoked for multiple times after a certain
online.

This will lead to kthread leak and timer corruption. Add a check
in start_kthread() to prevent this situation.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

x86/ftrace: Include <asm/ptrace.h>

<asm/ftrace.h> uses struct pt_regs in several places. Include
<asm/ptrace.h> to ensure it's visible. This is needed to make sure
object files that only include <asm/asm-prototypes.h> compile.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/20240916221557.846853-2-samitolvanen@google.com
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

rtla: Fix the help text in osnoise and timerlat top tools

The help text in osnoise top and timerlat top had some minor errors
and omissions. The -d option was missing the 's' (second) abbreviation and
the error message for '-d' used '-D'.

Cc: stable@vger.kernel.org
Fixes: 1eceb2fc2ca54 ("rtla/osnoise: Add osnoise top mode")
Fixes: a828cd18bc4ad ("rtla: Add timerlat tool and timelart top mode")
Link: https://lore.kernel.org/20240813155831.384446-1-ezulian@redhat.com
Suggested-by: Tomas Glozar <tglozar@redhat.com>
Reviewed-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Eder Zulian <ezulian@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tools/rtla: Fix installation from out-of-tree build

rtla now supports out-of-tree builds, but installation fails as it
still tries to install the rtla binary from the source tree. Use the
existing macro $(RTLA) to refer to the binary.

Link: https://lore.kernel.org/ZudubuoU_JHjPZ7w@decadent.org.uk
Fixes: 01474dc706ca ("tools/rtla: Use tools/build makefiles to build rtla")
Reviewed-by: Tomas Glozar <tglozar@redhat.com>
Tested-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Ben Hutchings <benh@debian.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>