git.ipfire.org Git - thirdparty/kernel/linux.git/log

ASoC: dt-bindings: ti,tas2781: Add TAS2573 support

The TAS2573 belongs to the TAS257x device family, featuring an integrated
DSP and IV sensing capability.

Signed-off-by: Baojun Xu <baojun.xu@ti.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260602100532.6463-1-baojun.xu@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>

i2c: stm32f7: fix timing computation ignoring i2c-analog-filter

stm32f7_i2c_compute_timing() uses i2c_dev->analog_filter to pick
the analog filter delay, but i2c_dev->analog_filter is parsed from
the "i2c-analog-filter" DT property only after the compute_timing
loop in stm32f7_i2c_setup_timing(), so in practice the timing
calculations always ignore the analog filter. On an STM32MP1 board
with clock-frequency = <400000> and i2c-analog-filter set, measured
SCL frequency was ~382 kHz.

This also affects (widens) the computed SDADEL range. At high bus
clock speeds, this can select an SDADEL value that violates tVD;DAT
(data valid time).

Fix by parsing "i2c-analog-filter" before the compute_timing loop.

Fixes: 83c3408f7b9c ("i2c: stm32f7: support DT binding i2c-analog-filter")
Signed-off-by: Guillermo Rodríguez <guille.rodriguez@gmail.com>
Cc: <stable@vger.kernel.org> # v5.13+
Acked-by: Alain Volmat <alain.volmat@foss.st.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260526091210.20383-1-guille.rodriguez@gmail.com

ASoC: simple-card: remove platform data style

Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> says:

SuperH ecovec24/7724se are the last users of Simple Audio Card in
"platform data style". The card mainly supports "DT style" these days.

Simple Audio Card "platform data style" has not worked correctly for
almost 10 years, but no one has reported it.
Let's remove sound support from SuperH ecovec24/7724se, and remove
Simple Audio Card platform data style.

Link: https://patch.msgid.link/87zf1le4fu.wl-kuninori.morimoto.gx@renesas.com

ASoC: simple-card: remove platform data style

Simple-Card was first created for "platform data" style and was later
expanded to "DT style". The current Simple-Card "platform data" style
has presumably not worked for almost 10 years, but no one has reported it.

No one is using "platform data" style. Let's remove its support.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Link: https://patch.msgid.link/87v7c9e4f4.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

sh: 7724se: remove FSI/AK4642/Simple-Audio-Card support

7724se is using Simple-Audio-Card with "platform data" style
(Simple-Audio-Card mainly supports "DT style" today), but "platform data"
style has not worked correctly for almost 10 years.

7724se sound doesn't work these days, and no one has reported it.
Let's remove sound support.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Link: https://patch.msgid.link/87wlwpe4f9.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

sh: ecovec24: remove FSI/DA7210/Simple-Audio-Card support

Ecovec24 is using Simple-Audio-Card with "platform data" style
(Simple-Audio-Card mainly supports "DT style" today), but "platform data"
style has not worked correctly for almost 10 years.

In addition, the DA7210 used in Ecovec24 was a prototype version that
differs from the production version, and the driver does not account for
the difference.

Ecovec24 sound doesn't work these days, and no one has reported it.
Let's remove sound support.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Link: https://patch.msgid.link/87y0h5e4ff.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: imx-rpmsg: Add headphone jack detection and driver_name support

Chancel Liu <chancel.liu@nxp.com> says:

This series adds two features to the i.MX RPMSG ASoC card:
1. Headphone jack detection via GPIO: Introduce the "hp-det-gpios"
   device tree property and use simple_util_init_jack() to
   register a headphone jack with GPIO-based insertion detection.

2. driver_name assignment: Set driver_name on the snd_soc_card to
   "imx-audio-rpmsg", enabling userspace tools such as UCM to reliably
   identify the card by driver name regardless of the board-specific
   card name.

Link: https://patch.msgid.link/20260528020725.2265321-1-chancel.liu@nxp.com

ASoC: imx-rpmsg: Set driver_name for snd_soc_card

Set driver_name to "imx-audio-rpmsg" for the i.MX RPMSG sound card.
This allows userspace audio configuration tools (e.g., UCM) to match
the card by driver name independently of the card name, which may vary
across board configurations.

Signed-off-by: Chancel Liu <chancel.liu@nxp.com>
Link: https://patch.msgid.link/20260528020725.2265321-4-chancel.liu@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: imx-rpmsg: Support headphone jack detection

Add headphone jack detection support for i.MX RPMSG audio cards.
When the "hp-det-gpios" property is present in the device tree node,
use simple_util_init_jack() from the ASoC simple card utilities to
register a headphone jack with GPIO-based insertion detection.

Signed-off-by: Chancel Liu <chancel.liu@nxp.com>
Link: https://patch.msgid.link/20260528020725.2265321-3-chancel.liu@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: dt-bindings: fsl,rpmsg: Add hp-det-gpios property

Sound cards using the i.MX RPMSG audio interface may connect a
headphone jack with GPIO-based insertion detection. Add the
"hp-det-gpios" property to the fsl,rpmsg binding to support this
configuration.

Signed-off-by: Chancel Liu <chancel.liu@nxp.com>
Link: https://patch.msgid.link/20260528020725.2265321-2-chancel.liu@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: wm_adsp: Fix NULL dereference when removing firmware controls

In wm_adsp_control_remove() check that the priv pointer is not NULL
before attempting to cleanup what it points to.

When cs_dsp creates a control it calls wm_adsp_control_add_cb() so that
wm_adsp can create its own private control data. There are two cases
where private data is not created:

1. The control is a SYSTEM control, so an ALSA control is not created.

2. The codec driver has registered a control_add() callback that
   hides the control, so wm_adsp_control_add() is not called.

When cs_dsp_remove destroys its control list it calls
wm_adsp_control_remove() for each control. But wm_adsp_control_remove()
was attempting to cleanup the private data pointed to by cs_ctl->priv
without checking the pointer for NULL.

Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Fixes: 0700bc2fb94c ("ASoC: wm_adsp: Separate generic cs_dsp_coeff_ctl handling")
Link: https://patch.msgid.link/20260604101244.1402862-1-rf@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>

i2c: imx: fix clock and pinctrl state inconsistency in runtime PM

In i2c_imx_runtime_suspend(), the clock is disabled before switching
the pinctrl state to sleep. If pinctrl_pm_select_sleep_state() fails,
the runtime suspend is aborted but the clock remains disabled, causing
a system crash when the hardware is subsequently accessed.

Fix this by switching the pinctrl state before disabling the clock so
that a pinctrl failure leaves the clock enabled and the hardware
accessible.

In i2c_imx_runtime_resume(), restore the pinctrl state back to sleep
if clk_enable() fails, to keep the clock and pinctrl states consistent.
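
The ordering described above can be modelled with a small sketch. This is a
Python toy, not the driver code: the FakeI2cImx class, its fields, and the
-5 error value are hypothetical, and the resume path is assumed to select
the default pin state before enabling the clock.

```python
class FakeI2cImx:
    """Toy stand-in for the controller: tracks clock and pin state only."""

    def __init__(self, pinctrl_fails=False, clk_enable_fails=False):
        self.clk_enabled = True
        self.pin_state = "default"
        self.pinctrl_fails = pinctrl_fails
        self.clk_enable_fails = clk_enable_fails

    def select_pins(self, state):
        # Returns 0 on success or a negative error, like kernel helpers do.
        if self.pinctrl_fails:
            return -5
        self.pin_state = state
        return 0

    def runtime_suspend(self):
        # Pins first, so a pinctrl failure leaves the clock running.
        ret = self.select_pins("sleep")
        if ret:
            return ret
        self.clk_enabled = False
        return 0

    def runtime_resume(self):
        ret = self.select_pins("default")
        if ret:
            return ret
        if self.clk_enable_fails:
            # Roll the pins back so clock and pin state agree again.
            self.select_pins("sleep")
            return -5
        self.clk_enabled = True
        return 0
```

With pinctrl_fails=True, runtime_suspend() returns an error and the clock is
still enabled, which is the property the fix is after.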

Fixes: 576eba03c994 ("i2c: imx: switch different pinctrl state in different system power status")
Signed-off-by: Carlos Song <carlos.song@nxp.com>
Cc: <stable@vger.kernel.org> # v6.14+
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260521065038.2954998-1-carlos.song@oss.nxp.com

IB/mlx5: Push pdn above pagefault_dmabuf_mr()

Remove the mlx5_mr_pdn() inside pagefault_dmabuf_mr(), the only user of
the pdn is the init path which is inside an ioctl.

Link: https://patch.msgid.link/r/10-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Push pdn above pagefault_real_mr()

Remove the mlx5_mr_pdn() in pagefault_real_mr() by pushing the pdn up, all
the callers use 0 since they don't pass MLX5_PF_FLAGS_ENABLE except the
ioctl reg_mr path which can use the ioctl pd.

Link: https://patch.msgid.link/r/9-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Push pdn above mlx5r_umr_update_xlt()

Keep pushing the pdn higher to remove more places touching mr->pd:

- XLT combinations that don't use PDN can just pass 0
- Use local pd values instead of mr->pd
- Implicit MR does not have inplace rereg, so the mr->pd is safe

Link: https://patch.msgid.link/r/8-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Don't mangle the mr->pd inside the rereg callback

The rereg protocol expects the core code to change mr->pd and synchronize
that change with the atomics and syncs. The driver should not touch it.

mlx5 needed to update it in umr_rereg_pas() because
mlx5r_umr_update_mr_pas() required the updated mr->pd to build the
UMR.

Simply switch mlx5r_umr_update_mr_pas() to use the pdn directly from
the new pd and remove the mr->pd update.

Fixes: 56e11d628c5d ("IB/mlx5: Added support for re-registration of MRs")
Link: https://patch.msgid.link/r/7-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Pull the pdn out of the depths of the umr machinery

Instead of getting the pdn deep inside the umr code, pass it in from the
top. to_mpd(mr->ibmr.pd)->pdn is not safe due to the rereg races, so all
the call sites need some revision to obtain the pdn in a safe way.

Mark them with mlx5_mr_pdn(); following patches will go through and remove
these.

Cases where the XLT flags are known and do not require the PDN can pass 0,
such as for mlx5_ib_dmabuf_invalidate_cb().

Also extract the DMABUF data_direct special case from inside the UMR code
and into the only place that needs it, pagefault_dmabuf_mr(). The actual
mr was created directly without using the UMR flow. Ultimately this will
be moved into mlx5_ib_init_dmabuf_mr().

Link: https://patch.msgid.link/r/6-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Remove unused mkc bits in mlx5r_umr_update_mr_page_shift()

The HW only processes mkc fields selected by mkey_mask.
pd, qpn and mkey_7_0 are never selected so they can be left as zero.

This removes a racy read of mr->pd.

Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Link: https://patch.msgid.link/r/5-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/nldev: Fix locking when accessing mr->pd

Sashiko points out that, due to rereg_mr, the PD is actually variable and
all the touches in nldev are racy.

Use mr->device instead of mr->pd->device.

Getting the PD restrack ID is more tricky. To avoid disturbing all the
happy paths, add an rdma_restrack_sync() operation which is sort of like
flush_workqueue() or synchronize_irq(): after it returns, all the old
nldev touches to the mr are gone and everything sees the new PD. This
makes it safe to reach into the PD pointer.

Fixes: da5c85078215 ("RDMA/nldev: add driver-specific resource tracking")
Link: https://patch.msgid.link/r/4-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Properly support implicit ODP rereg_mr

Due to all the child mkeys in the implicit ODP configuration we cannot
change anything in place for the parent mkey. Instead the whole thing
needs to be rebuilt if any change is requested. If the user does not
specify a translation then force the implicit values which will then fall
through the logic into mlx5_ib_reg_user_mr() to allocate a completely new
MR.

Since implicit children were also touching the mr->pd, this removes
another case where the access was racy.

Fixes: ef3642c4f54d ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Link: https://sashiko.dev/#/patchset/20260427-security-bug-fixes-v3-0-4621fa52de0e%40nvidia.com?part=4
Link: https://patch.msgid.link/r/3-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

RDMA/mlx5: Create ODP EQ for non-pinned dmabuf MRs

DMABUF generally relies on the ODP EQ mechanism to safely implement the
move semantics. ODP requires a device-global one time startup of the ODP
machinery when the first MR is created, and this was missed on the DMABUF
path.

Call mlx5r_odp_create_eq() when creating an ODP'able DMABUF.

The core code prevents using IB_ACCESS_ON_DEMAND unless the driver
advertises IB_ODP_SUPPORT, so until now, mlx5r_odp_create_eq() could not
be called unless the device had ODP support.

However, DMABUF has no such protection and a second bug was allowing
DMABUFs to be created on non-ODP capable HW. Add a guard at the start of
mlx5r_odp_create_eq(). This is necessary here anyhow as the
dev->odp_eq_mutex is not initialized without IB_ODP_SUPPORT.

Link: https://patch.msgid.link/r/2-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/mlx5: Don't take the rereg_mr fallback without a new translation

Jumping to mlx5_ib_reg_user_mr() without IB_MR_REREG_TRANS set will use
garbage values for start, length, and iova. Recovering the original mr
parameters for ODP and DMABUF to properly recreate it is too hard in this
flow, so just fail it.

Fixes: ef3642c4f54d ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Link: https://patch.msgid.link/r/1-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Merge tag 'at24-updates-for-v7.2-rc1' into i2c/i2c-host

at24 updates for v7.2-rc1

- use named initializers for arrays of i2c_device_data

io_uring/kbuf: validate ring provided buffer addresses with access_ok()

Commit:

809b997a5ce9 ("x86-64/arm64/powerpc: clean up and rename __copy_from_user_flushcache")

sanitized that any provided copy helper should separately validate
destination and source addresses, but we should also ensure that
anything that is retrieved from a buffer is validated upfront. For ring
provided buffers, always include an access_ok() when grabbing a new
buffer.
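
As a rough illustration of what such an upfront range check does, here is a
Python model. It is a sketch only: TASK_SIZE_MAX is an illustrative limit
and pick_ring_buffer() is a hypothetical helper, neither taken from the
patch. The comparison is written so that addr + size is never computed and
therefore cannot wrap.

```python
TASK_SIZE_MAX = (1 << 47) - 4096  # illustrative user-space limit (assumption)


def access_ok(addr, size, limit=TASK_SIZE_MAX):
    """Model of a range check: [addr, addr + size) must lie below limit."""
    # Avoid computing addr + size, which could overflow in fixed-width C.
    return size <= limit and addr <= limit - size


def pick_ring_buffer(addr, length):
    """Return the buffer if its range passes the check, else None."""
    if not access_ok(addr, length):
        return None  # the request would be failed instead of using the buffer
    return (addr, length)
```

A buffer whose address points into the kernel half of the address space, or
whose length runs past the limit, is rejected before anything uses it.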

Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

s390: Remove GENERIC_LOCKBREAK Kconfig option

s390 selects GENERIC_LOCKBREAK if PREEMPT is enabled. The reason is an
18-year-old commit [1] which fixed a compile error for PREEMPT enabled
kernels. Back then only PREEMPT_NONE and PREEMPT_VOLUNTARY kernels were
considered to be important for s390. PREEMPT should "just work".

However, since recently PREEMPT is always enabled [2], which also causes
GENERIC_LOCKBREAK to be always enabled. For some workloads this leads to
massive performance degradation; e.g. a simple kernel compile on machines
with many CPUs may take up to four times longer.

To fix this just remove the GENERIC_LOCKBREAK from s390's Kconfig, since
the compile error from 18 years ago does not exist anymore.

[1] commit b6b40c532a36 ("[S390] Define GENERIC_LOCKBREAK.")
[2] commit 7dadeaa6e851 ("sched: Further restrict the preemption modes")

Cc: stable@vger.kernel.org
Reported-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>

riscv: traps_misaligned: Avoid redundant unaligned access speed probe

When a CPU is taken offline and then brought back online, the unaligned
access speed probe always runs even though the unaligned access speed is
already known, wasting CPU cycles.

This is because when a CPU comes online, the following happens:

  1. check_unaligned_access_emulated() is called, which clears
     misaligned_access_speed if there is no emulation.

  2. check_unaligned_access() is called because misaligned_access_speed is
     cleared, wasting CPU cycles determining something that is already
     known.

Avoid the redundant access speed probe by no longer clearing
misaligned_access_speed in (1). If access speed is already known, just
reuse it.

On my Visionfive 2, this reduces CPU bring-up time from 26ms to 0.8ms.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/aa5755142537d462a9e3d2074d82ad4eef6774ba.1780002199.git.namcao@linutronix.de
Signed-off-by: Paul Walmsley <pjw@kernel.org>

riscv: misaligned: Fix fast_unaligned_access_speed_key init

When booting with unaligned_scalar_speed=fast,
fast_unaligned_access_speed_key is initialized incorrectly.

The key is currently derived from the fast_misaligned_access cpumask, but
that mask is only populated when the unaligned access speed probe runs.
Specifying unaligned_scalar_speed=fast skips the probe entirely, leaving
the mask uninitialized.

The information tracked by fast_misaligned_access is already available in
the misaligned_access_speed per-CPU variable. Use that to initialize
fast_unaligned_access_speed_key instead and remove the redundant cpumask.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/2468816ceb433394099a00d7822f819745276b49.1780002199.git.namcao@linutronix.de
Signed-off-by: Paul Walmsley <pjw@kernel.org>

RDMA/srp: bound SRP_RSP sense copy by the received length

srp_process_rsp() copies sense data from rsp->data + resp_data_len,
where resp_data_len is the full 32-bit value supplied by the SRP target
and is never checked against the number of bytes actually received
(wc->byte_len). The copy length is bounded to SCSI_SENSE_BUFFERSIZE, so
at most 96 bytes are copied, but the source offset is not bounded.

A malicious or compromised SRP target on the InfiniBand/RoCE fabric that
the initiator has logged into can return an SRP_RSP with
SRP_RSP_FLAG_SNSVALID set and a large resp_data_len. The receive buffer
is allocated at the target-chosen max_ti_iu_len, so the source of the
sense copy lands past the bytes actually received; with resp_data_len
near 0xFFFFFFFF it is gigabytes past the buffer and the read faults.

Copy the sense data only if it has not been truncated, that is, only if
the response header, the response data, and the sense region fit within
the bytes actually received; otherwise drop the sense and log. The
in-tree iSER and NVMe-RDMA receive paths already bound their parse by
wc->byte_len; this brings ib_srp into line with them.
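
The bound can be sketched as follows. This is a Python model under stated
assumptions, not the patch itself: sense_copy_len() and its hdr_len
parameter are hypothetical names, and the check is assumed to use the full
advertised sense length rather than the clamped one.

```python
SCSI_SENSE_BUFFERSIZE = 96  # copy limit quoted in the changelog


def sense_copy_len(byte_len, hdr_len, resp_data_len, sense_data_len):
    """Return how many sense bytes may be copied, or 0 to drop the sense."""
    # Python integers do not wrap, so this sum is exact even for 32-bit
    # maxima; C code has to do the same arithmetic without overflowing.
    needed = hdr_len + resp_data_len + sense_data_len
    if needed > byte_len:
        return 0
    return min(sense_data_len, SCSI_SENSE_BUFFERSIZE)
```

For example, with 128 bytes received and resp_data_len near 0xFFFFFFFF the
function returns 0, so no read past the received bytes is attempted.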

Fixes: aef9ec39c47f ("IB: Add SCSI RDMA Protocol (SRP) initiator")
Link: https://patch.msgid.link/r/20260602220457.2542840-1-michael.bommarito@gmail.com
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN

In drivers/infiniband/ulp/isert/ib_isert.c, isert_login_recv_done()
computes the login request payload length as wc->byte_len minus
ISER_HEADERS_LEN with no lower bound, and login_req_len is a signed int.
A remote iSER initiator can post a login Send work request carrying
fewer than ISER_HEADERS_LEN (76) bytes, so the subtraction underflows
and login_req_len becomes negative.

isert_rx_login_req() then reads that negative length back into a signed
int, takes size = min(rx_buflen, MAX_KEY_VALUE_PAIRS), and because the
min() is signed it keeps the negative value; the value is then passed as
the memcpy() length and sign-extended to a multi-gigabyte size_t. The
copy into the 8192-byte login->req_buf runs far out of bounds and
faults, crashing the target node. The login phase precedes iSCSI
authentication, so no credentials are required to reach this path.

Reject any login PDU shorter than ISER_HEADERS_LEN before the
subtraction, mirroring the existing early return on a failed work
completion, so login_req_len can never go negative. The upper bound was
already safe: a posted login buffer cannot deliver more than
ISER_RX_PAYLOAD_SIZE, so the difference stays at or below
MAX_KEY_VALUE_PAIRS and the existing min() clamps it; only the missing
lower bound needs to be added.
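
The underflow and the fix can be reproduced in a small Python model. It is a
sketch: login_copy_size() and as_size_t() are hypothetical helpers, and
MAX_KEY_VALUE_PAIRS is assumed to equal the 8192-byte req_buf size mentioned
above.

```python
ISER_HEADERS_LEN = 76        # value quoted in the changelog
MAX_KEY_VALUE_PAIRS = 8192   # assumed equal to the size of login->req_buf


def as_size_t(value):
    """Mimic converting a signed int to a 64-bit size_t."""
    return value & 0xFFFFFFFFFFFFFFFF


def login_copy_size(byte_len, check_lower_bound):
    """Return the memcpy() length the login path would use, or None if rejected."""
    if check_lower_bound and byte_len < ISER_HEADERS_LEN:
        return None  # fixed behaviour: reject the short PDU early
    login_req_len = byte_len - ISER_HEADERS_LEN      # signed in the driver
    size = min(login_req_len, MAX_KEY_VALUE_PAIRS)   # signed min keeps negatives
    return as_size_t(size)
```

Without the lower bound, a 10-byte PDU yields 2**64 - 66 as the copy length;
with it, the PDU is rejected and valid lengths are unaffected.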

Fixes: b8d26b3be8b3 ("iser-target: Add iSCSI Extensions for RDMA (iSER) target driver")
Link: https://patch.msgid.link/r/20260602194642.2273217-1-michael.bommarito@gmail.com
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

cpufreq: Use policy->min/max init as QoS request

Modify cpufreq_policy_init_qos() introduced previously to use
policy->min/max set in the driver .init() callback as the initial
values for the policy min/max frequency QoS requests, respectively,
so long as they are different from 0 (which means that they have
been updated by the driver). Update the documentation in accordance
with that code change.
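
The selection rule can be sketched like this. It is a Python model, not the
kernel function: initial_qos_requests() is a hypothetical name, and the two
default values (0 and S32_MAX) are assumptions about the FREQ_QOS_*
constants.

```python
FREQ_QOS_MIN_DEFAULT_VALUE = 0            # assumed default
FREQ_QOS_MAX_DEFAULT_VALUE = 2**31 - 1    # assumed default (S32_MAX)


def initial_qos_requests(policy_min, policy_max):
    """Pick initial (min, max) QoS request values from driver-set limits."""
    # A value of 0 means the driver's .init() callback left the field alone.
    qos_min = policy_min if policy_min != 0 else FREQ_QOS_MIN_DEFAULT_VALUE
    qos_max = policy_max if policy_max != 0 else FREQ_QOS_MAX_DEFAULT_VALUE
    return qos_min, qos_max
```

A driver that sets only policy->min, as cppc-cpufreq is said to do, would get
its own minimum and the default maximum.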

This only affects the following drivers:

- gx-suspmod (min)
- cppc-cpufreq (min)
- longrun (min/max)

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
[ rjw: Changelog rewrite ]
Link: https://patch.msgid.link/20260528090913.2759118-5-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Remove driver default policy->min/max init

Prior to commit 521223d8b3ec ("cpufreq: Fix initialization of min and
max frequency QoS requests"), drivers were setting policy->min/max and
these values were used as initial policy QoS constraints.

After the above commit, these values are only used temporarily, as
cpufreq_set_policy() ultimately overrides them through:

cpufreq_policy_online()
\-cpufreq_init_policy()
  \-cpufreq_set_policy()
    \-/* Set policy->min/max */

A subsequent change will restore the previous behavior allowing
drivers to request special min/max QoS frequencies instead of
FREQ_QOS_MIN_DEFAULT_VALUE and FREQ_QOS_MAX_DEFAULT_VALUE, respectively,
if desired. For instance, the CPPC driver wants to advertise the lowest
non-linear frequency that should be used as the initial minimum
frequency QoS request.

However, for this purpose, all drivers setting policy->min/max to
policy->cpuinfo.min/max_freq, respectively, need to be updated so
their initial policy->min/max settings don't limit the frequency
scaling unnecessarily going forward (which would defeat the purpose
of commit 521223d8b3ec), so do that.

This does not actually alter the observed behavior of all of
the drivers in question because setting policy->min/max to
policy->cpuinfo.min/max_freq, respectively, is not necessary or
even useful any more after a previous change ("cpufreq: Set default
policy->min/max values for all drivers").

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Jie Zhan <zhanjie9@hisilicon.com>
[ rjw: Changelog rewrite ]
Link: https://patch.msgid.link/20260528090913.2759118-4-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Set default policy->min/max values for all drivers

Some drivers set policy->min/max in their .init() callback, but
cpufreq_set_policy() will ultimately override them through:

cpufreq_policy_online()
\-cpufreq_init_policy()
  \-cpufreq_set_policy()
    \-/* Set policy->min/max */

Thus the policy min/max values set by the drivers are only temporary.

There is an exception if CPUFREQ_NEED_INITIAL_FREQ_CHECK is set and
cpufreq_policy_online() calls __cpufreq_driver_target() which invokes
cpufreq_driver->target().

To prepare for a subsequent change that will remove all initialization
of policy->min/max in driver .init() callbacks if the min/max value is
equal to the corresponding cpuinfo.min/max_freq, set default
policy->min/max values in the core for all drivers.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
[ rjw: Edits of the new comment and changelog ]
Link: https://patch.msgid.link/20260528090913.2759118-3-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: Extract cpufreq_policy_init_qos() function

Extract the QoS-related logic from cpufreq_policy_online()
to make that function shorter/simpler.

The logic is placed in cpufreq_policy_init_qos() and is
now executed right after the following calls:

- cpufreq_driver->init()
- cpufreq_table_validate_and_sort()

This facilitates subsequent changes that will, in
cpufreq_policy_init_qos():

- Set a default policy->min/max value for all policies.
- Use the policy->min/max values set by drivers as initial request
  values for policy frequency QoS requests.

No functional change.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20260528090913.2759118-2-pierre.gondois@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

RDMA: During rereg_mr ensure that REREG_ACCESS is compatible

If IB_MR_REREG_ACCESS changes from RO to RW then the umem has to be
re-evaluated to ensure it is properly pinned as RW. Since the umem is
hidden inside each driver's mr struct add a ib_umem_check_rereg() function
that each driver has to call before processing IB_MR_REREG_ACCESS.

mlx4 has to retain its duplicate ib_access_writable check because it
implements IB_MR_REREG_ACCESS | IB_MR_REREG_TRANS by changing both items
in place sequentially while the MR is live, so it will continue to not
support this combination.

Cc: stable@vger.kernel.org
Fixes: b40656aa7d55 ("RDMA/umem: remove FOLL_FORCE usage")
Link: https://patch.msgid.link/r/0-v1-06fb1a2d6cf5+107-rereg_access_jgg@nvidia.com
Reported-by: Philip Tsukerman <philiptsukerman@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Documentation: KVM: Synchronize x86 VM types

KVM has reflected KVM_X86_SNP_VM to userspace since 1dfe571c12cf
("KVM: SEV: Add initial SEV-SNP support"), and KVM_X86_TDX_VM since
161d34609f9b ("KVM: TDX: Make TDX VM type supported"). Update the
documentation to reflect this fact.

Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Fixes: 161d34609f9b ("KVM: TDX: Make TDX VM type supported")
Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260603114504.814647-2-clopez@suse.de
[sean: use one tab instead of two]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Add regression test for mediated PMU fixed counter filter bug

Add a regression test where KVM would inadvertently ignore PMU event
filters on writes that change _some_ bits in FIXED_CTR_CTRL, but not the
enable bits for PMCs that are denied to the guest.

Link: https://patch.msgid.link/20260603231905.1738487-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Use hardware value when reprogramming for FIXED_CTR_CTRL changes

When (conditionally) reprogramming fixed counters, use the hardware value
of FIXED_CTR_CTRL to detect changes, not the guest's original value. For
guests with a mediated PMU, overwriting fixed_ctr_ctrl_hw at the start of
reprogramming without actually reacting to changes in fixed_ctr_ctrl_hw can
lead to KVM ignoring PMU event filters.

E.g. if the guest attempts to enable a fixed PMC that is disallowed, and
then toggles a different PMC in a subsequent WRMSR, KVM will update
pmu->fixed_ctr_ctrl_hw and reprogram the PMC that is changing, but not the
others that are now effectively enabled in pmu->fixed_ctr_ctrl_hw.

Note, the perf-based PMU is unaffected, as it doesn't use fixed_ctr_ctrl_hw
(which is also why keying off fixed_ctr_ctrl_hw works for both PMUs).

Note #2, fixed_ctr_ctrl_hw won't mess up pmc_in_use either, because the
latter isn't used by the mediated PMU. Its purpose is solely to release
perf events that are no longer being actively used, and the mediated PMU
obviously doesn't create perf events.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260528005419.0228F1F00A3A@smtp.kernel.org
Link: https://patch.msgid.link/20260603231905.1738487-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: hyper-v: Bound the bank index when querying sparse banks

When checking if a VP ID is included in a sparse bank set, explicitly check
that the ID can actually be contained in a sparse bank (the TLFS allows for
a maximum of 64 banks of 64 vCPUs each).  When handling a paravirtual TLB
flush for L2, the VP ID is copied verbatim from the enlightened VMCS,
without any bounds check, i.e. isn't guaranteed to be under the limit of
4096.

Failure to check the bounds of the VP ID leads to an out-of-bounds read
when testing the sparse bank, and super strictly speaking could lead to KVM
performing an unnecessary TLB flush for an L2 vCPU.

  ==================================================================
  BUG: KASAN: use-after-free in hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
  Read of size 8 at addr ffff88811ba5f598 by task hyperv_evmcs/2802

  CPU: 12 UID: 1000 PID: 2802 Comm: hyperv_evmcs Not tainted 7.1.0-rc2 #7 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x51/0x60
   print_report+0xcb/0x5d0
   kasan_report+0xb4/0xe0
   kasan_check_range+0x35/0x1b0
   hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
   kvm_hv_flush_tlb+0xe9e/0x16c0 [kvm]
   kvm_hv_hypercall+0xe6b/0x1e60 [kvm]
   vmx_handle_exit+0x485/0x1b60 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x22e3/0x5070 [kvm]
   kvm_vcpu_ioctl+0x5d0/0x10c0 [kvm]
   __x64_sys_ioctl+0x129/0x1a0
   do_syscall_64+0xb9/0xcf0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f0e62d1a9bf
   </TASK>

  The buggy address belongs to the physical page:
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x11ba5f
  flags: 0x4000000000000000(zone=1)
  raw: 4000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  raw: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff88811ba5f480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ffff88811ba5f500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  >ffff88811ba5f580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                              ^
   ffff88811ba5f600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
   ffff88811ba5f680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ==================================================================
  Disabling lock debugging due to kernel taint

Opportunistically add a compile time assertion to ensure the maximum number
of sparse banks exactly matches the number of possible bits in the passed
in mask.
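
The bounded lookup can be modelled as below. This is a Python sketch of the
logic described above, not the KVM source: the function and constant names
are assumptions, and the dense bank indexing (counting valid banks below the
requested one) is how the sparse set is assumed to be laid out.

```python
HV_VCPUS_PER_SPARSE_BANK = 64   # 64 vCPUs per bank, per the changelog
HV_MAX_SPARSE_VCPU_BANKS = 64   # 64 banks, per the changelog


def is_vp_in_sparse_set(vp_id, valid_bank_mask, sparse_banks):
    """Return True if vp_id is selected by the sparse bank set."""
    bank_no = vp_id // HV_VCPUS_PER_SPARSE_BANK
    # The added guard: IDs at or above 64 * 64 = 4096 cannot be represented.
    if bank_no >= HV_MAX_SPARSE_VCPU_BANKS:
        return False
    if not (valid_bank_mask >> bank_no) & 1:
        return False
    # Banks are stored densely: index by the number of valid banks below.
    below = valid_bank_mask & ((1 << bank_no) - 1)
    sbank = bin(below).count("1")
    bit = vp_id % HV_VCPUS_PER_SPARSE_BANK
    return bool((sparse_banks[sbank] >> bit) & 1)
```

With the guard in place, a VP ID of 4096 or above returns False before the
mask is tested, where the C code would otherwise test a bit beyond the
64-bit mask.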

Cc: stable@vger.kernel.org
Fixes: c58a318f6090 ("KVM: x86: hyper-v: L2 TLB flush")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/aiQyZIJtO-2Aj_xN@v4bel
[sean: add KASAN splat, drop comment, add assert, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: guest_memfd: fix NUMA interleave index double-counting

kvm_gmem_get_policy() sets the interleave index (the output param that's
typically named "ilx") to the full page offset (vm_pgoff + vma offset).
But get_vma_policy() adds the page offset on top of the interleave index,
and so the offset is counted twice. This causes NUMA interleaving to skip
nodes: for order-0 pages the effective index jumps by 2 for each
consecutive page.

The vm_op.get_policy() implementation should return only a per-file bias in
the interleave index (like shmem_get_policy does with inode->i_ino),
letting get_vma_policy() add the page-offset component.

Fix by setting the output interleave index to the inode number (a la shmem)
instead of the full page offset, as the index is intended to be a constant,
semi-random value for a given file, e.g. so that interleaving doesn't start
at the same node for every file, and so that allocations are round-robined
across nodes based on the page offset (the selected node would bounce/skip
around if the index isn't constant).
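
The double-counting can be illustrated with a simplified user-space
model. NR_NODES and the helper names are hypothetical, and the real
get_vma_policy() arithmetic (e.g. the order shift) is omitted.

```c
#include <stddef.h>

#define NR_NODES 4   /* hypothetical node count for the illustration */

/* Models the caller, which adds the page offset to the returned index. */
static size_t pick_node(size_t ilx, size_t pgoff)
{
    return (ilx + pgoff) % NR_NODES;
}

/* Buggy variant: the index already contains the page offset. */
static size_t node_buggy(size_t pgoff)
{
    return pick_node(pgoff, pgoff);
}

/* Fixed variant: the index is a per-file constant (an inode number here). */
static size_t node_fixed(size_t ino, size_t pgoff)
{
    return pick_node(ino, pgoff);
}
```

In this model the buggy variant only ever visits every second node,
while the fixed variant advances by one node per page from a per-file
starting point.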

Found by Sashiko (sashiko.dev) AI code review.

Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()")
Link: https://patch.msgid.link/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com
[sean: use reverse fir-tree, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

nfs: expose FMODE_NOWAIT for read-only files

NFS O_DIRECT reads already (mostly) handle async requests, with the
exception of locking the inode for direct I/O.
Handle async requests properly by using nfs_start_io_direct_nowait,
and then expose FMODE_NOWAIT since it's now supported for direct reads.

Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

nfs: add nowait version of nfs_start_io_direct

nfs_start_io_direct might block on existing operations to the same
inode. In order to support NOWAIT O_DIRECT reads, add a non-blocking
version of this nfs_start_io_direct that just returns -EAGAIN if locks
could not be taken.
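
The non-blocking pattern can be sketched in user-space C as follows.
The toy_* names are hypothetical, and a C11 atomic flag stands in for
the inode locks that the real helper takes.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-inode lock. */
struct toy_inode_lock {
    atomic_flag held;
};

static void toy_lock_init(struct toy_inode_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Non-blocking acquire: report -EAGAIN instead of waiting. */
static int toy_start_io_nowait(struct toy_inode_lock *l)
{
    if (atomic_flag_test_and_set(&l->held))
        return -EAGAIN;   /* someone else holds the lock */
    return 0;
}

static void toy_end_io(struct toy_inode_lock *l)
{
    atomic_flag_clear(&l->held);
}
```

A caller that receives -EAGAIN can retry later from a context that is
allowed to block.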

Signed-off-by: Dylan Yudaken <dyudaken@gmail.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

NFSv4/flexfiles: honor FF_FLAGS_NO_IO_THRU_MDS in pg_get_mirror_count_write

The FF_FLAGS_NO_IO_THRU_MDS flag lives on each lseg, so any fallback
decision made when there is no current lseg (e.g. between LAYOUTRETURN
and the next LAYOUTGET) cannot run the per-lseg check.

Introduce a sticky hdr-level ditto for FF_FLAGS_NO_IO_THRU_MDS in
struct nfs4_flexfile_layout::flags (NFS4_FF_HDR_NO_IO_THRU_MDS bit),
set whenever ff_layout_alloc_lseg() parses an lseg with the flag.  The
bit is never cleared for the lifetime of the layout hdr; the server is
assumed to be consistent in its no-fallback policy per file.
kzalloc() in ff_layout_alloc_layout_hdr() zero-initializes the field.

Use the new ff_layout_hdr_no_fallback_to_mds() helper to gate
ff_layout_pg_get_mirror_count_write(): when pnfs_update_layout() returns
NULL (e.g. NFS_LAYOUT_BULK_RECALL, pnfs_layout_io_test_failed,
pnfs_layoutgets_blocked) the existing code unconditionally calls
nfs_pageio_reset_write_mds().  This is a source of unwanted WRITE to
MDS.  Fix it by checking NFS4_FF_HDR_NO_IO_THRU_MDS bit, and if set
surface -EAGAIN instead; the writepage-side caller (nfs_do_writepage()
for buffered, nfs_direct_write_reschedule() for O_DIRECT) then
redirties the request so writeback retries via pNFS.
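
A rough user-space model of the sticky header bit and the gated
fallback. All toy_* names and the bit value are hypothetical; only the
control flow mirrors the description above.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_HDR_NO_IO_THRU_MDS (1UL << 0)   /* hypothetical bit value */

struct toy_layout_hdr {
    unsigned long flags;   /* zero-initialized at allocation */
};

/* Called when a segment is parsed: set the sticky bit, never clear it. */
static void toy_note_lseg(struct toy_layout_hdr *hdr, bool lseg_no_io_thru_mds)
{
    if (lseg_no_io_thru_mds)
        hdr->flags |= TOY_HDR_NO_IO_THRU_MDS;
}

/*
 * Write path with no current segment.
 * Returns 0 to fall back to the MDS, -EAGAIN to retry via pNFS.
 */
static int toy_write_without_lseg(const struct toy_layout_hdr *hdr)
{
    if (hdr->flags & TOY_HDR_NO_IO_THRU_MDS)
        return -EAGAIN;
    return 0;
}
```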

Fixes: 260074cd8413 ("pNFS/flexfiles: Add support for FF_FLAGS_NO_IO_THRU_MDS")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

NFSv4/flexfiles: honor FF_FLAGS_NO_IO_THRU_MDS on fatal DS connect errors

Commit f06bedfa62d5 ("pNFS/flexfiles: don't attempt pnfs on fatal DS
errors") teaches ff_layout_{read,write}_pagelist() to return
PNFS_NOT_ATTEMPTED when nfs4_ff_layout_prepare_ds() fails with a
nfs_error_is_fatal() errno (e.g. -ETIMEDOUT from a SOFTCONN connect
deadline, -ENOMEM, -ERESTARTSYS), so that the client gives up instead
of spinning. pnfs_do_{read,write}() then dispatches the I/O through
pnfs_{read,write}_through_mds() → nfs_pageio_reset_{read,write}_mds().

That fallback is unconditional and silently violates FF_FLAGS_NO_IO_THRU_MDS:
when the layout segment carries the flag (typically single-mirror
appliance layouts where MDS I/O is explicitly forbidden), the
out_failed: path's `&& !ds_fatal_error` clause overrides the flag's
short-circuit through ff_layout_avoid_mds_available_ds() and routes
the I/O to the MDS file handle anyway.

This is reachable in practice during a data-server restart: SOFTCONN
exhaustion produces -ETIMEDOUT, which is fatal per nfs_error_is_fatal(),
which triggers PNFS_NOT_ATTEMPTED, which silently goes to MDS.

Preserve the upstream "don't spin on fatal errors" intent for layouts
that permit MDS fallback. For layouts with FF_FLAGS_NO_IO_THRU_MDS
set, mark the layout for return and request PNFS_TRY_AGAIN instead;
if the server cannot supply a usable layout the failure now surfaces
cleanly via pnfs_update_layout(), rather than via silent MDS I/O that
contradicts the flag.

Fixes: f06bedfa62d5 ("pNFS/flexfiles: don't attempt pnfs on fatal DS errors")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

sunrpc: fix uninitialized xprt_create_args structure

The xprt_create_args structure is allocated on the stack without
initialization in rpc_sysfs_xprt_switch_add_xprt_store(). While some
fields are manually populated, critical fields like srcaddr, bc_xps,
and flags contain uninitialized stack garbage.

This can lead to:
1. Kernel panic when xs_setup_xprt() dereferences garbage srcaddr
2. Information leak if srcaddr points to sensitive stack data
3. Unpredictable behavior if flags has random bits set

The fix is to zero-initialize the structure to ensure all unused
fields are NULL/0, preventing the transport setup code from acting
on garbage data.
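
The pattern is sketched below with a hypothetical structure.
toy_create_args only borrows a few field names from the description
above and is not the real xprt_create_args layout.

```c
#include <stddef.h>

/* Hypothetical stand-in; field types are assumptions for illustration. */
struct toy_create_args {
    const char *servername;
    const void *srcaddr;
    void *bc_xps;
    unsigned long flags;
};

static struct toy_create_args toy_make_args(const char *servername)
{
    /* Zero-initialize first, so fields not set below are NULL/0. */
    struct toy_create_args args = { 0 };

    args.servername = servername;
    return args;
}
```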

Cc: stable@vger.kernel.org
Suggested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

nfs: keep PG_UPTODATE clear after read errors in page groups

When a read request is split into multiple subrequests, earlier
completions may advance PG_UPTODATE state for the page group once
their bytes fall within hdr->good_bytes. If a later subrequest in
the same group then completes with NFS_IOHDR_ERROR, the read path
needs to clear any accumulated PG_UPTODATE state and keep later
completions from rebuilding it.

Otherwise, a subsequent successful subrequest can re-enter
nfs_page_group_set_uptodate(), restore the page-group sync state,
and leave stale PG_UPTODATE behind for nfs_page_group_destroy()
to trip over in nfs_free_request().

Add a sticky page-group read-failed flag. Once any subrequest in
the group is known to be bad, mark the group failed, clear any
accumulated PG_UPTODATE state, and refuse further PG_UPTODATE
synchronization for the rest of the completion walk.
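
A simplified model of the sticky flag. The toy_* names are
hypothetical, and the per-subrequest synchronization of the real
page-group code is collapsed into a single boolean.

```c
#include <stdbool.h>

struct toy_page_group {
    bool uptodate;      /* stands in for accumulated PG_UPTODATE state */
    bool read_failed;   /* sticky: set once, never cleared */
};

/* Completion of one subrequest in the group. */
static void toy_subreq_done(struct toy_page_group *g, bool ok)
{
    if (!ok) {
        g->read_failed = true;
        g->uptodate = false;   /* drop anything accumulated so far */
        return;
    }
    if (!g->read_failed)
        g->uptodate = true;    /* only while the group is still good */
}
```

Once one subrequest fails, later successful completions no longer
rebuild the up-to-date state.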

Fixes: 67d0338edd71 ("nfs: page group syncing in read path")
Signed-off-by: Clark Wang <xiaoning.wang@nxp.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

doc: security: Add documentation of exporting and deleting IMA measurements

Add the documentation of exporting and deleting IMA measurements in
Documentation/security/IMA-export-delete.rst.

Also add the missing Documentation/security/IMA-templates.rst file in
MAINTAINERS.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Support staging and deleting N measurements records

Add support for sending a value N between 1 and ULONG_MAX to the IMA
original measurement interface. This value represents the number of
measurements that should be deleted from the current measurements list. In
this case, measurements are staged in an internal non-user visible list,
and immediately deleted.

This staging method allows the remote attestation agents to easily separate
the measurements that were verified (staged and deleted) from those that
weren't due to the race between taking a TPM quote and reading the
measurements list.

In order to minimize the locking time of ima_extend_list_mutex, deleting
N records is done in three steps: walk the current measurements list
locklessly to determine the N-th entry, cut the list at that entry under
the lock, and delete the cut records after releasing the lock.
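
The walk/cut/delete split can be sketched with a plain singly linked
list in user-space C. The toy_* names are hypothetical and locking is
only indicated in comments.

```c
#include <stdlib.h>

struct toy_rec {
    struct toy_rec *next;
};

struct toy_list {
    struct toy_rec *head;
    unsigned long len;
};

/* Add one record at the head; returns 0 on success. */
static int toy_push(struct toy_list *l)
{
    struct toy_rec *r = malloc(sizeof(*r));

    if (!r)
        return -1;
    r->next = l->head;
    l->head = r;
    l->len++;
    return 0;
}

/* Detach the first n records; NULL if n is 0 or the list is shorter. */
static struct toy_rec *toy_cut_first(struct toy_list *l, unsigned long n)
{
    struct toy_rec *last = l->head, *cut;
    unsigned long i;

    if (n == 0)
        return NULL;
    /* Step 1: find the n-th record (done without the lock). */
    for (i = 1; i < n && last; i++)
        last = last->next;
    if (!last)
        return NULL;
    /* Step 2: short critical section, only pointer updates. */
    cut = l->head;
    l->head = last->next;
    last->next = NULL;
    l->len -= n;
    return cut;
}

/* Step 3: free the detached records, after the lock is dropped. */
static unsigned long toy_free_chain(struct toy_rec *r)
{
    unsigned long freed = 0;

    while (r) {
        struct toy_rec *next = r->next;

        free(r);
        r = next;
        freed++;
    }
    return freed;
}
```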

Flushing the hash table is not supported for N records, since it would
require removing the N records one by one from the hash table under the
ima_extend_list_mutex lock, which would increase the locking time.

Link: https://github.com/linux-integrity/linux/issues/1
Co-developed-by: Steven Chen <chenste@linux.microsoft.com>
Signed-off-by: Steven Chen <chenste@linux.microsoft.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Add support for flushing the hash table when staging measurements

During staging and delete, measurements are not completely deallocated.
Their entry digest portion is kept and is still reachable with the hash
table to detect duplicate records. If the number of records is significant,
this reduces the memory saving benefit of staging.

Some users might be interested in achieving the best memory saving (the
measurements are completely deallocated) at the cost of having duplicate
records across the staged measurement lists. Duplicate records are still
avoided within the current measurement list.

Introduce the new kernel option ima_flush_htable to decide whether or not
the digests of staged measurement records are flushed from the hash table,
when they are deleted, to achieve the maximum memory saving.

When the option is enabled, replace the old hash table with a new one,
by calling ima_alloc_replace_htable(), and completely delete the
measurement records.

Note: This code derives from the Alt-IMA Huawei project, whose license is
GPL-2.0 OR MIT.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Add support for staging measurements with prompt

Introduce the ability to stage the IMA measurement list and to delete the
staged measurements with a prompt.

Staging means moving the current measurement list records to a separate
location, and allowing users to read and delete it. This causes the current
measurement list to be emptied (since records were moved) and new
measurements to be added on the empty list. Staging can be done only once
at a time. In the event of kexec(), staging is aborted and staged records
will be carried over to the new kernel.

Introduce ascii_runtime_measurements_<algo>_staged and
binary_runtime_measurements_<algo>_staged interfaces to access and delete
the measurements.

Use 'echo A > <IMA _staged interface>' and
'echo D > <IMA _staged interface>' to respectively stage and delete the
entire measurements list. Locking of these interfaces is also mediated with
a call to _ima_measurements_open() and with ima_measurements_release().

Implement the staging functionality by introducing the new global
measurements list ima_measurements_staged, and ima_queue_stage() and
ima_queue_staged_delete_all() to respectively move measurements from the
current measurements list to the staged one, and to move staged
measurements to the ima_measurements_trim list for deletion. Introduce
ima_queue_delete() to delete the measurements.

Staging is forbidden after measurement is suspended, and between staging
and deleting, so that walking the staged and current measurements list can
be done locklessly in ima_dump_measurement_list(). Strict ordering of
suspending and dumping is enforced by two reboot notifiers with different
priority. Refusing to delete staged measurements also signals to user space
that those measurements are already carried over to the secondary kernel,
so that it does not save them twice.

Finally, introduce the BINARY_STAGED and BINARY_FULL binary measurements
list types, to maintain the counters and the binary size of staged
measurements and the full measurements list (including records that were
staged). BINARY still represents the current binary measurements list.

Use the binary size for the BINARY + BINARY_STAGED types in
ima_add_kexec_buffer(), since both measurements list types are copied to
the secondary kernel during kexec. Use BINARY_FULL in
ima_measure_kexec_event(), to generate a critical data record.

It should be noted that the BINARY_FULL counter is not passed through
kexec. Thus, the number of records reported in a kexec critical data
record counts only the records added since the critical data record
generated by the previous kexec event.

Note: This code derives from the Alt-IMA Huawei project, whose license is
GPL-2.0 OR MIT.

Link: https://github.com/linux-integrity/linux/issues/1
Suggested-by: Gregory Lumen <gregorylumen@linux.microsoft.com> (staging revert)
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Tested-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce ima_dump_measurement()

Introduce ima_dump_measurement() to simplify the code of
ima_dump_measurement_list() and to avoid repeating the
ima_dump_measurement() code block if iteration occurs on multiple lists.

No functional change: only code moved to a separate function.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Use snprintf() in create_securityfs_measurement_lists

Use the more secure snprintf() function (accepting the buffer size) in
create_securityfs_measurement_lists().

No functional change: sprintf() and snprintf() have the same behavior.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Mediate open/release method of the measurements list

Introduce the ima_measure_users counter, to implement a semaphore-like
locking scheme where the binary and ASCII measurements list interfaces can
be concurrently opened by multiple readers, or alternatively by a single
writer. In addition, allow the same writer to open the other interfaces for
write or read/write, so that it can see the same measurement state across
all the interfaces.

A semaphore cannot be used because the kernel cannot return to user space
with a lock held.

Introduce the ima_measure_lock() and ima_measure_unlock() primitives, to
respectively lock/unlock the interfaces (safely with the ima_measure_users
counter, without holding a lock).
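
One way such a counter could work is sketched below. The encoding
(positive for readers, -1 for a writer), the -EBUSY return value and
the toy_* names are assumptions, and the rule that lets the same writer
open the other interfaces is left out.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed encoding: >0 readers, -1 one writer, 0 free. */
static atomic_long toy_users;

static int toy_measure_lock(bool write)
{
    long cur = atomic_load(&toy_users);

    for (;;) {
        long next;

        if (write) {
            if (cur != 0)
                return -EBUSY;   /* readers or a writer present */
            next = -1;
        } else {
            if (cur < 0)
                return -EBUSY;   /* a writer is present */
            next = cur + 1;
        }
        /* On failure, cur is reloaded and the checks are repeated. */
        if (atomic_compare_exchange_weak(&toy_users, &cur, next))
            return 0;
    }
}

static void toy_measure_unlock(bool write)
{
    if (write)
        atomic_store(&toy_users, 0);
    else
        atomic_fetch_sub(&toy_users, 1);
}
```

No lock is held between open and release; only the counter records who
is using the interface.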

Finally, introduce _ima_measurements_open() to lock the interface before
seq_open(), and call it from ima_measurements_open() and
ima_ascii_measurements_open(). And, introduce ima_measurements_release(),
to unlock the interface.

Require CAP_SYS_ADMIN if the interface is opened for write (not possible
for the current measurements interfaces, since they only have read
permission).

No functional changes: multiple readers are allowed as before.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce _ima_measurements_start() and _ima_measurements_next()

Introduce _ima_measurements_start() and _ima_measurements_next(), renamed
from ima_measurements_start() and ima_measurements_next(), to include the
list head as an additional parameter, so that iteration on different lists
can be implemented by calling those functions.

No functional change: ima_measurements_start() and ima_measurements_next()
pass the ima_measurements list head, used before. They become wrappers for
the new functions.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce per binary measurements list type binary_runtime_size value

Make binary_runtime_size an array, to have separate counters per binary
measurements list type. Currently, define the BINARY type for the existing
binary measurements list.

Introduce ima_update_binary_runtime_size() to facilitate updating a
binary_runtime_size value with a given binary measurement list type.

Also add the binary measurements list type parameter to
ima_get_binary_runtime_size(), to retrieve the desired value. Retrieving
the value is now done under the ima_extend_list_mutex, since there can be
concurrent updates.

No functional change (except for the mutex usage, that fixes the
concurrency issue): the BINARY array element is equivalent to the old
binary_runtime_size.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Introduce per binary measurements list type ima_num_records counter

Make ima_num_records an array, to have separate counters per binary
measurements list type. Currently, define the BINARY type for the existing
binary measurements list.

No functional change: the BINARY type is equivalent to the value without
the array.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Replace static htable queue with dynamically allocated array

The IMA hash table is a fixed-size array of hlist_head buckets:

struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE];

IMA_MEASURE_HTABLE_SIZE is (1 << IMA_HASH_BITS) = 1024 buckets, each a
struct hlist_head (one pointer, 8 bytes on 64-bit). That is 8 KiB allocated
in BSS for every kernel, regardless of whether IMA is ever used, and
regardless of how many measurements are actually made.

Replace the fixed-size array with an RCU-protected pointer to a dynamically
allocated array that is initialized in ima_init_htable(), which is called
from ima_init() during early boot. ima_init_htable() calls the static
function ima_alloc_replace_htable() which, other than initializing the hash
table the first time, can also hot-swap the existing hash table with a
blank one.

The allocation in ima_alloc_replace_htable() uses kcalloc() so the buckets
are zero-initialised (equivalent to HLIST_HEAD_INIT { .first = NULL }).
Callers of ima_alloc_replace_htable() must call synchronize_rcu() and free
the returned hash table.
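
A user-space sketch of the allocate-and-swap step. The toy_* names are
hypothetical, calloc() stands in for kcalloc(), and the RCU publication
and grace period are only noted in comments.

```c
#include <stdlib.h>

#define TOY_HASH_BITS   10
#define TOY_HTABLE_SIZE (1UL << TOY_HASH_BITS)   /* 1024 buckets */

struct toy_hlist_head {
    void *first;   /* one pointer per bucket */
};

/* Current table; kernel code would access this through RCU helpers. */
static struct toy_hlist_head *toy_htable;

/*
 * Install a blank table and hand the previous one back through *old
 * (NULL on the first call). The caller frees *old once no reader can
 * still be using it.
 */
static int toy_alloc_replace_htable(struct toy_hlist_head **old)
{
    struct toy_hlist_head *fresh = calloc(TOY_HTABLE_SIZE, sizeof(*fresh));

    if (!fresh)
        return -1;
    *old = toy_htable;
    toy_htable = fresh;   /* zeroed buckets, i.e. .first == NULL */
    return 0;
}
```

With 8-byte pointers the table is 1024 * 8 = 8192 bytes, the 8 KiB
that previously lived in BSS.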

Finally, access the hash table with rcu_dereference() in
ima_lookup_digest_entry() (reader side) and with
rcu_dereference_protected() in ima_add_digest_entry() (writer side).

No functional change: bucket count, hash function, and all locking remain
identical.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ima: Remove ima_h_table structure

The ima_h_table structure is a collection of IMA measurement list
metadata - number of records in the IMA measurement list, number of
integrity violations, and a hash table containing the IMA template data
hash, needed to prevent measurement list record duplication.

Removing records from the measurement list needs to be reflected in the
hash table. As a pre-req to removing records from the measurement list,
separate those counters from the hash table, remove the ima_h_table
structure, and just replace the hash table pointer.

Finally, rename ima_show_htable_value(), ima_show_htable_violations()
and ima_htable_violations_ops respectively to ima_show_counter(),
ima_show_num_violations() and ima_num_violations_ops.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

i2c: riic: fix refcount leak in riic_i2c_resume_noirq()

When riic_i2c_resume_noirq() is called, it deasserts the reset
using reset_control_deassert(), which for shared resets increments
a reference count. If pm_runtime_force_resume() then fails, the
function returns without calling reset_control_assert() to
decrement the count. This leaves the reset deasserted and the
reference count unbalanced, which can prevent other users of the
shared reset from properly asserting it later.

Fix the leak by calling reset_control_assert() on the error
handling path for a failed pm_runtime_force_resume().
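
The balanced error path is modelled below. The toy_* names are
hypothetical, and the result of pm_runtime_force_resume() is passed in
as a parameter so the failure case can be exercised.

```c
/* Hypothetical model of a shared reset with a deassert count. */
struct toy_reset {
    int deassert_count;
};

static int toy_reset_deassert(struct toy_reset *rst)
{
    rst->deassert_count++;
    return 0;
}

static void toy_reset_assert(struct toy_reset *rst)
{
    rst->deassert_count--;
}

/* force_resume_ret stands in for the runtime PM resume result. */
static int toy_resume_noirq(struct toy_reset *rst, int force_resume_ret)
{
    int ret = toy_reset_deassert(rst);

    if (ret)
        return ret;

    ret = force_resume_ret;
    if (ret) {
        toy_reset_assert(rst);   /* undo the deassert on failure */
        return ret;
    }
    return 0;
}
```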

Fixes: e383f0961422 ("i2c: riic: Move suspend handling to NOIRQ phase")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Cc: <stable@vger.kernel.org> # v6.19+
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260608071123.128964-1-vulab@iscas.ac.cn

Merge tag 'v7.1-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:

- Fix random config build failure on s390.

* tag 'v7.1-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: s390 - add select CRYPTO_AEAD for aes

power: sequencing: pcie-m2: Add PCI ID 0x1103 for WCN6855 Bluetooth

WCN6855 is a Qualcomm Wi-Fi/BT combo chip that uses PCI device ID
0x1103. Add it to pwrseq_m2_pci_ids[] alongside the existing 0x1107
(WCN7850) entry, so that the pwrseq-pcie-m2 driver creates a Bluetooth
serdev device for WCN6855 cards inserted into PCIe M.2 Key E connectors.

Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Wei Deng <wei.deng@oss.qualcomm.com>
Link: https://patch.msgid.link/20260608091702.3797437-2-wei.deng@oss.qualcomm.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

gpio: mvebu: fix NULL pointer dereference in suspend/resume

mvebu_pwm_suspend() and mvebu_pwm_resume() are called for all GPIO
banks during suspend/resume, but not all banks have PWM functionality.
GPIO banks without PWM have mvchip->mvpwm set to NULL.

Calling mvebu_pwm_suspend() with mvpwm == NULL causes a NULL pointer
dereference when it tries to access mvpwm->blink_select.

  Unable to handle kernel NULL pointer dereference at virtual address 00000020 when write
  [00000020] *pgd=00000000
  Internal error: Oops: 815 [#1] PREEMPT ARM
  Modules linked in:
  CPU: 0 UID: 0 PID: 406 Comm: sh Not tainted 6.12.74-rt12-yocto-standard-g4e96f98fb7db-dirty #353
  Hardware name: Marvell Armada 370/XP (Device Tree)
  PC is at regmap_mmio_read+0x38/0x54
  LR is at regmap_mmio_read+0x38/0x54
  pc : [<c05fd2ac>]    lr : [<c05fd2ac>]    psr: 200f0013
  sp : f0c11d10  ip : 00000000  fp : c100d2f0
  r10: c14fb854  r9 : 00000000  r8 : 00000000
  r7 : c1799c00  r6 : 00000020  r5 : 00000020  r4 : c179c7c0
  r3 : f0a231a0  r2 : 00000020  r1 : 00000020  r0 : 00000000
  Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
  Control: 10c5387d  Table: 135ec059  DAC: 00000051
  Call trace:
   regmap_mmio_read from _regmap_bus_reg_read+0x78/0xac
   _regmap_bus_reg_read from _regmap_read+0x60/0x154
   _regmap_read from regmap_read+0x3c/0x60
   regmap_read from mvebu_gpio_suspend+0xa4/0x14c
   mvebu_gpio_suspend from dpm_run_callback+0x54/0x180
   dpm_run_callback from device_suspend+0x124/0x630
   device_suspend from dpm_suspend+0x124/0x270
   dpm_suspend from dpm_suspend_start+0x64/0x6c
   dpm_suspend_start from suspend_devices_and_enter+0x140/0x8e8
   suspend_devices_and_enter from pm_suspend+0x2fc/0x308
   pm_suspend from state_store+0x6c/0xc8
   state_store from kernfs_fop_write_iter+0x10c/0x1f8
   kernfs_fop_write_iter from vfs_write+0x270/0x468
   vfs_write from ksys_write+0x70/0xf0
   ksys_write from ret_fast_syscall+0x0/0x54

Add a NULL check for mvchip->mvpwm before calling the PWM
suspend/resume functions.
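
The guard can be sketched as follows. The toy_* structures are
hypothetical and only mirror the mvpwm and blink_select names from the
description above.

```c
#include <stddef.h>

struct toy_pwm {
    unsigned int blink_select;   /* saved register value */
};

struct toy_chip {
    struct toy_pwm *mvpwm;          /* NULL on banks without PWM */
    unsigned int hw_blink_select;   /* stands in for the register */
};

/* Dereferences chip->mvpwm, so it must not be called with a NULL mvpwm. */
static void toy_pwm_suspend(struct toy_chip *chip)
{
    chip->mvpwm->blink_select = chip->hw_blink_select;
}

static void toy_gpio_suspend(struct toy_chip *chip)
{
    /* Only banks with PWM support have state to save. */
    if (chip->mvpwm)
        toy_pwm_suspend(chip);
}
```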

Fixes: 757642f9a584 ("gpio: mvebu: Add limited PWM support")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Link: https://patch.msgid.link/20260608084334.2960803-1-yun.zhou@windriver.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

io_uring/net: support registered buffer for plain send and recv

So far IORING_RECVSEND_FIXED_BUF is only honoured on the SEND_ZC path,
even though the import wiring is already present for plain send and
completely absent for recv. Targets such as ublk's NBD backend want to
push/pull I/O data directly to/from an io_uring registered buffer over a
plain send/recv on a TCP socket.

Wire IORING_RECVSEND_FIXED_BUF into the plain IORING_OP_SEND and
IORING_OP_RECV paths:

- Accept the flag in SENDMSG_FLAGS / RECVMSG_FLAGS and, at prep time,
   restrict it to the non-vectorized IORING_OP_SEND / IORING_OP_RECV
   opcodes. It is mutually exclusive with buffer select, bundles and
   (for recv) multishot, and records sqe->buf_index.

- For recv, set REQ_F_IMPORT_BUFFER in setup so the registered buffer
   is imported lazily at issue time, mirroring the send path.

- In io_send()/io_recv(), import the registered buffer via
   io_import_reg_buf() (ITER_SOURCE for send, ITER_DEST for recv) and
   clear REQ_F_IMPORT_BUFFER. The resulting bvec iter persists in
   async_data, so MSG_WAITALL partial send/recv retries resume at the
   right offset.
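
The prep-time restrictions listed above can be modelled roughly as
follows. The TOY_* flag values, the opcode enum and the -EINVAL return
code are assumptions and do not match the io_uring UAPI definitions.

```c
#include <errno.h>

/* Hypothetical opcodes and flag bits for illustration only. */
enum toy_op { TOY_OP_SEND, TOY_OP_SENDMSG, TOY_OP_RECV, TOY_OP_RECVMSG };

#define TOY_FIXED_BUF      (1u << 0)
#define TOY_BUFFER_SELECT  (1u << 1)
#define TOY_BUNDLE         (1u << 2)
#define TOY_MULTISHOT      (1u << 3)

static int toy_prep_check(enum toy_op op, unsigned int flags)
{
    if (!(flags & TOY_FIXED_BUF))
        return 0;   /* nothing to validate */

    /* Only the non-vectorized opcodes may use a registered buffer. */
    if (op != TOY_OP_SEND && op != TOY_OP_RECV)
        return -EINVAL;

    /* Mutually exclusive with buffer select and bundles. */
    if (flags & (TOY_BUFFER_SELECT | TOY_BUNDLE))
        return -EINVAL;

    /* For recv, also mutually exclusive with multishot. */
    if (op == TOY_OP_RECV && (flags & TOY_MULTISHOT))
        return -EINVAL;

    return 0;
}
```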

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260608142511.659240-2-ming.lei@redhat.com
[axboe: combine flags checks]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'hyperv-fixes-signed-20260607' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

- MSHV driver fixes from various people (Anirudh Rayabharam, Can Peng,
   Dexuan Cui, Michael Kelley, Jork Loeser, Wei Liu)

- Hyper-V user space tools fixes (Thorsten Blum)

- Allow VMBus to be unloaded after frame buffer is flushed (Michael
   Kelley)

* tag 'hyperv-fixes-signed-20260607' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  mshv: support 1G hugepages by passing them as 2M-aligned chunks
  Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
  mshv: use kmalloc_array in mshv_root_scheduler_init
  mshv: Add conditional VMBus dependency
  hyperv: Clean up and fix the guest ID comment in hvgdk.h
  drm/hyperv: During panic do VMBus unload after frame buffer is flushed
  Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
  mshv: unmap debugfs stats pages on kexec
  mshv: clean up SynIC state on kexec for L1VH
  mshv: limit SynIC management to MSHV-owned resources
  hv: utils: replace deprecated strcpy with strscpy in kvp_register
  hv: utils: handle and propagate errors in kvp_register
  mshv: add a missing padding field

NFSv4/pnfs: defer return_range callbacks until after inode unlock

Sometimes unmounting an NFS filesystem mounted with pNFS SCSI
layouts triggers the following warning:

     BUG: scheduling while atomic: umount.nfs4/...

     __schedule_bug+0xbd/0x100
     schedule_debug.constprop.0+0x19f/0x220
     __schedule+0x10d/0x10a0
     schedule+0x74/0x190
     schedule_timeout+0xf5/0x220
     io_schedule_timeout+0xd5/0x160
     __wait_for_common+0x186/0x4b0
     blk_execute_rq+0x2ef/0x3a0
     scsi_execute_cmd+0x1ff/0x700
     sd_pr_out_command.isra.0+0x242/0x380 [sd_mod]
     bl_unregister_scsi.constprop.0+0x109/0x3c0 [blocklayoutdriver]
     bl_unregister_dev+0x175/0x1c0 [blocklayoutdriver]
     bl_free_device+0x1f/0x1b0 [blocklayoutdriver]
     bl_free_deviceid_node+0x12/0x30 [blocklayoutdriver]
     nfs4_put_deviceid_node+0x171/0x360 [nfsv4]
     ext_tree_remove+0x11c/0x1d0 [blocklayoutdriver]
     _pnfs_return_layout+0x416/0x900 [nfsv4]
     nfs4_evict_inode+0x108/0x130 [nfsv4]
     evict+0x316/0x750
     dispose_list+0xf1/0x1a0
     evict_inodes+0x33f/0x440
     generic_shutdown_super+0xc9/0x4e0
     kill_anon_super+0x3a/0x90
     nfs_kill_super+0x44/0x60 [nfs]
     deactivate_locked_super+0xb8/0x1b0
     cleanup_mnt+0x25a/0x380
     task_work_run+0x13e/0x210
     exit_to_user_mode_loop+0x169/0x400
     do_syscall_64+0x467/0x1550
     entry_SYSCALL_64_after_hwframe+0x76/0x7e

The warning occurs because the block layout driver unregisters the SCSI
device while the inode lock is still held. Device unregistration issues
a SCSI PR command, which may sleep, resulting in a "scheduling while
atomic" warning.

During layout return, ext_tree_remove() invokes the layout driver's
return_range callback while holding the inode lock. For block layouts,
this callback eventually calls bl_unregister_scsi(), which may block in
scsi_execute_cmd() while issuing PR commands to the device.

Fix this by deferring the return_range callbacks until after the inode
lock has been released. The layout header reference count is incremented
before invoking return_range(), ensuring that the layout header remains
valid while the layout driver removes extents from the extent tree.
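
The deferral pattern can be sketched in user-space C. The toy_* names,
the fixed-size deferral array and the integer "locked" field are
assumptions made to keep the example self-contained.

```c
#include <stddef.h>

#define TOY_MAX_DEFERRED 8   /* hypothetical bound for the sketch */

struct toy_range {
    unsigned long offset, length;
};

struct toy_layout {
    int locked;                  /* models the inode lock */
    int refcount;                /* models the layout header reference */
    int callbacks_run_locked;    /* would be the bug */
    int callbacks_run_unlocked;
};

/* Stand-in for a return_range callback that may sleep. */
static void toy_return_range(struct toy_layout *lo, const struct toy_range *r)
{
    (void)r;
    if (lo->locked)
        lo->callbacks_run_locked++;
    else
        lo->callbacks_run_unlocked++;
}

static void toy_return_layout(struct toy_layout *lo,
                              const struct toy_range *ranges, size_t n)
{
    struct toy_range deferred[TOY_MAX_DEFERRED];
    size_t i, nr = 0;

    lo->locked = 1;
    /* Under the lock: only record which ranges need a callback. */
    for (i = 0; i < n && nr < TOY_MAX_DEFERRED; i++)
        deferred[nr++] = ranges[i];
    lo->refcount++;   /* keep the header alive past the unlock */
    lo->locked = 0;

    /* After the unlock: callbacks are free to block. */
    for (i = 0; i < nr; i++)
        toy_return_range(lo, &deferred[i]);
    lo->refcount--;
}
```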

Fixes: c88953d87f5c8 ("pnfs: add return_range method")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Remove tautological I2 assertion in rpcrdma_reply_put

rpcrdma_reply_put() sets req->rl_reply to NULL when it is
non-NULL, and skips the block when it is already NULL. The
WARN_ON_ONCE(req->rl_reply) that follows can never fire
because both paths leave rl_reply NULL.
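
The control flow in question reduces to the following sketch, in which
toy_req and toy_rep are hypothetical stand-ins and the release of the
rep itself is omitted.

```c
#include <stddef.h>

struct toy_rep {
    int unused;
};

struct toy_req {
    struct toy_rep *rl_reply;
};

/* Returns what the removed assertion would have tested. */
static int toy_reply_put(struct toy_req *req)
{
    if (req->rl_reply) {
        /* the real code also releases the rep here */
        req->rl_reply = NULL;
    }
    /* Both branches leave rl_reply NULL, so this is always 0. */
    return req->rl_reply != NULL;
}
```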

Remove the dead assertion and its comment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Fix I3 invariant comment in rpcrdma_complete_rqst

frwr_unmap_sync() and frwr_unmap_async() drain rl_registered via
rpcrdma_mr_pop() before posting invalidation Work Requests to
hardware. The WARN_ON_ONCE verifies that the list-drain step
has occurred, not that hardware unmapping has completed.

Reword the comment to match what the assertion actually checks.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Document and assert reply-handler invariants

The xprtrdma reply path has been the subject of recurring
LLM-driven review claims that 'an RPC can complete while
receive buffers are still DMA-mapped' or that 'the req can be
freed while the HCA still owns the send buffer.'  No runtime
reproducer has surfaced, but the absence of a written-down
invariant set lets each pass of automated review reach the
same hypothetical conclusion.  Subsequent fixes against
ce2f9a4d9ccc ('xprtrdma: Decouple req recycling from RPC
completion') closed the underlying races but did not document
the closure where future readers will look for it.

State the invariants explicitly in a comment above
rpcrdma_reply_handler() and back four of them with
WARN_ON_ONCE() probes positioned where each invariant is
locally checkable on the previous patch's cleaned-up
ownership state:

- I1 (Receive WR ownership): WARN at rpcrdma_post_recvs() that
  a rep pulled from rb_free_reps carries rr_rqst == NULL.

- I2 (rep attachment): WARN at rpcrdma_reply_put() that
  req->rl_reply was NULLed before the matching rep_put.

- I3 (Registered-MR fence): WARN at rpcrdma_complete_rqst()
  that req->rl_registered is empty.  Strong send-queue
  ordering of the LocalInv WR chain makes the last
  completion observe the ib_dma_unmap_sg() of every earlier
  MR, so 'list empty' implies 'all MRs unmapped'.

- I4 (Send-buffer release): WARN at rpcrdma_req_release()
  that req->rl_sendctx is NULL.  Reaching the kref release
  callback requires both the RPC-layer and Send-side
  references to have dropped; the Send-side drop runs in
  rpcrdma_sendctx_unmap(), which clears rl_sendctx
  (previous patch).  A non-NULL rl_sendctx here would mean
  the Send-side owner had not run -- a contradiction.

The XXX comment in xprt_rdma_free() about signal-driven
release racing the Send completion described the pre-decouple
state.  Replace it with a one-line note pointing at the
invariant set, since the kref scheme now holds the req across
the in-flight Send regardless of which path released the
rpc_task.

I5 (req lifecycle) is stated in the comment but not probed:
making it locally assertible would require moving kref_init
out of rpcrdma_req_release(), which in turn requires adding
kref_init to the bc_pa_list and backlog-wake reuse paths.
That restructuring is deferred -- the invariant is unchanged
either way.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Clear receive-side ownership pointers on release

Three small ownership-state cleanups land the transport in a
state that lets future reviewers reason about each pointer
locally rather than tracing the whole reply path:

rpcrdma_rep_put() clears rep->rr_rqst before the rep enters
rb_free_reps so that no rep on the free list still carries a
stale rqst pointer.  rpcrdma_reply_handler() and
rpcrdma_unpin_rqst() are the only sites that set rr_rqst;
rpcrdma_reply_handler() hands the rep through
rpcrdma_rep_put(), and rpcrdma_unpin_rqst() NULLs rr_rqst
directly because its error path abandons the rep for
teardown cleanup rather than returning it to rb_free_reps.

rpcrdma_reply_put() NULLs req->rl_reply before calling
rpcrdma_rep_put().  The previous order placed the rep on
rb_free_reps while req->rl_reply still pointed at it; the
window was harmless because xprt_rdma_free_slot() holds the
req exclusively across the pair, but closing it makes the
invariant 'rep on rb_free_reps implies no req references it'
strictly checkable.

rpcrdma_sendctx_unmap() and rpcrdma_sendctx_cancel() clear
req->rl_sendctx after dropping the sendctx pointer in the
sendctx ring.  Without this, req->rl_sendctx survives across
Send completion and points at a sendctx that may already have
been reassigned by rpcrdma_sendctx_get_locked() to a different
req.  No caller dereferences the stale pointer today --
rpcrdma_prepare_send_sges() overwrites it before the next
Send -- but a NULL is a more honest representation of 'the
Send is no longer outstanding' and lets the assertion patch
that follows trip on any future regression.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Add request-pool slack for delayed recycling

After the previous patch gates req recycling on Send completion,
a completed RPC's rpcrdma_req can remain pinned by the sendctx
ring until the next signaled Send completion releases it. The
transmitted-RPC ceiling is unchanged: xprt_request_get_cong()
gates Sends against xprt->cwnd, the RPC/RDMA credit window fed
by server-granted credits and capped at re_max_requests. The
req pool, however, must exceed max_reqs by enough that this
recycle delay does not stall a slot allocation that the credit
window would admit.

The headroom is bounded. frwr_open() sets re_send_batch to
re_max_requests >> 3 -- one in every eight Sends is signaled --
so at most re_send_batch unsignaled Sends can be outstanding
before the next signaled completion releases them. That equals
max_reqs / 8 reqs in the worst case, with a one-slot floor for
small max_reqs values where the right-shift rounds to zero.

The sendctx ring and the hardware Send Queue are not enlarged
to match. Both are sized in rpcrdma_sendctxs_create() and
frwr_query_device() for re_max_requests in-flight Sends, which
is the ceiling the credit window enforces. The pool slack does
not raise that ceiling -- it only lets allocation keep pace
with the credit window during the brief interval in which
earlier reqs are pinned waiting for the next signaled
completion. At any moment, at most re_send_batch sendctxes are
held by unswept unsignaled Sends, leaving the rest of the ring
available for newly admitted Sends.

Allocate max_reqs + DIV_ROUND_UP(max_reqs, 8) request objects
and name the slack calculation at the allocation site so the
1/8 bound stays tied to the Send-signaling batch size.
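
The slack arithmetic can be sketched in user-space C as follows. The helper
name req_pool_size() and the local DIV_ROUND_UP definition are illustrative
assumptions, not the identifiers used in the patch; only the formula
max_reqs + DIV_ROUND_UP(max_reqs, 8) is taken from the text above.

```c
/* Local stand-in for the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: number of request objects to allocate for a
 * given credit ceiling. Rounding up keeps a one-slot floor when
 * max_reqs is smaller than eight.
 */
static unsigned int req_pool_size(unsigned int max_reqs)
{
	return max_reqs + DIV_ROUND_UP(max_reqs, 8);
}
```

With this sketch, a ceiling of 128 requests yields a pool of 144, and a
ceiling of 4 yields a pool of 5.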

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Decouple req recycling from RPC completion

rl_kref formerly served two distinct lifetimes through a single
refcount: it gated when a Reply could wake its RPC task, and it
gated when an rpcrdma_req could return to its free pool. The
marshal path took the Send-side reference only when SGEs needed
DMA-unmap (sc_unmap_count > 0), which made a Send carrying only
pre-registered buffers an exception: the Reply handler dropped
rl_kref from 1 to 0 and freed the req while the HCA might still
be DMA-reading from its send buffer.

Give rl_kref a narrower job. The RPC layer takes one reference
when slot allocation hands a req out. rpcrdma_prepare_send_sges()
takes a Send-side reference unconditionally after WR preparation
succeeds. xprt_rdma_free_slot() and xprt_rdma_bc_free_rqst() drop
the RPC-layer reference; rpcrdma_sendctx_unmap() drops the
Send-side reference. The req returns to its free pool only after
both owners have signed off.
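
A minimal single-threaded model of the two-owner lifetime, assuming a plain
integer in place of the kref. All names (toy_req, toy_alloc_slot,
toy_prepare_send, toy_put) are hypothetical and exist only for this
illustration; the re-arm step in the release callback is not modeled.

```c
#include <stdbool.h>

/* Toy stand-in for rpcrdma_req; the fields are illustrative only. */
struct toy_req {
	int refs;	/* models rl_kref */
	bool in_pool;	/* true once the req is back on its free pool */
};

/* Slot allocation: the RPC layer takes the first reference. */
static void toy_alloc_slot(struct toy_req *req)
{
	req->refs = 1;
	req->in_pool = false;
}

/* Marshal path: the Send-side reference is taken unconditionally. */
static void toy_prepare_send(struct toy_req *req)
{
	req->refs++;
}

/* Drop one reference; only the last owner returns the req to the pool. */
static void toy_put(struct toy_req *req)
{
	if (--req->refs == 0)
		req->in_pool = true;
}
```

In this model, releasing the slot while the Send is still outstanding
leaves the req out of the pool; it returns only after the second toy_put().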

The existing kref_init(&req->rl_kref) call in
rpcrdma_prepare_send_sges() is removed. Initialization moves to
the slot-allocation paths (xprt_rdma_alloc_slot and
rpcrdma_bc_rqst_get), and the release callback re-arms rl_kref
before the req returns to a free pool. A re-init in the marshal
path would discard the RPC-layer reference that already exists
on entry.

Three invariants follow:

  - Any rpcrdma_req held by an rpc_rqst has rl_kref >= 1.
    xprt_rdma_alloc_slot(), rpcrdma_bc_rqst_get(), and the
    backlog-wake branch in xprt_rdma_alloc_slot() each kref_init
    rl_kref before publishing the req. Without this invariant,
    an RPC task that aborts between slot allocation and marshal
    (gss_refresh failure or signal during call_connect, for
    example) would drive xprt_release() ->
    xprt_rdma_free_slot() -> kref_put against a refcount of
    zero, saturating refcount_t and stranding the slot.

  - The Send-side reference is taken only after WR prep
    succeeds. A mapping failure in rpcrdma_prepare_send_sges()
    runs rpcrdma_sendctx_cancel(), which DMA-unmaps the sendctx
    and clears sc_req without touching rl_kref. The sendctx
    ring walks in rpcrdma_sendctx_put_locked() and
    rpcrdma_sendctxs_destroy() skip entries with sc_req == NULL,
    so a burst of -EIO marshal failures cannot hold reqs off
    rb_send_bufs.

  - The release callback re-arms rl_kref so the next consumer
    enters with the invariant satisfied.

Replies now complete the RPC directly. rpcrdma_reply_handler()
calls rpcrdma_complete_rqst() in place of kref_put on the
non-LocalInv branch. The LocalInv branch already completes the
RPC from frwr_unmap_async() and is unaffected.

Because Send-side references can now outlive RPC completion,
connection teardown drains sendctx entries whose unsignaled
Sends never had a later signaled completion to walk the ring.
rpcrdma_sendctxs_destroy() walks the active range and runs
rpcrdma_sendctx_unmap() on each entry with a non-NULL sc_req
before the request buffers are reset, and is moved ahead of
rpcrdma_reqs_reset() in rpcrdma_xprt_disconnect() so the reqs
are still in their pre-reset state when the Send-side refs are
released.

The drain creates a teardown-ordering hazard on the backchannel
path. With the new lifetime, releasing a bc_prealloc req from
rpcrdma_req_release() re-adds it to bc_pa_list. The disconnect
in xprt_rdma_destroy() runs after xprt_destroy_backchannel() has
already emptied bc_pa_list, so the drained reqs would otherwise
leak. xprt_rdma_destroy() now runs xprt_rdma_bc_destroy(xprt, 0)
a second time after the disconnect to reclaim them.

Fixes: 0ab115237025 ("xprtrdma: Wake RPCs directly in rpcrdma_wc_send path")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Use sendctx DMA state for Send signaling

Send signaling matters only when the prepared Send has page
mappings to unmap. Today that test is expressed indirectly with
rl_kref, because the Send-side reference is taken only for Sends
with mapped SGEs.

Split the SGE DMA unmap loop into its own helper and use
sc_unmap_count directly for the signaling decision. This keeps the
current behavior but removes one dependency on the old rl_kref
semantics before the request lifetime rules are changed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

pNFS: Fix use-after-free in pnfs_update_layout()

When hitting the NFS_LAYOUT_RETURN branch in pnfs_update_layout(),
the code calls pnfs_prepare_to_retry_layoutget(lo). If it succeeds,
pnfs_put_layout_hdr(lo) is called before trace_pnfs_update_layout(),
which still references 'lo'. This results in a use-after-free when the
tracepoint accesses lo's fields.

Fix this by moving the tracepoint call before pnfs_put_layout_hdr(lo).

Fixes: 2c8d5fc37fe2 ("pNFS: Stricter ordering of layoutget and layoutreturn")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

NFS: fix eof updates after NFSv4.2 fallocate/zero-range

Generic/075 reliably exposes a regression when the client holds an
NFSv4 write delegation: ZERO_RANGE/ALLOCATE extends the file on the
server, but the local inode keeps the old i_size. The test then fails
with 'Size error' because the post-op attribute refresh refuses to
touch i_size while a delegation is outstanding, and the cached EOF
was never marked stale.

Update _nfs42_proc_fallocate() so that on success it:

- bumps i_size when the operation extends the file, and
- marks NFS_INO_INVALID_BLOCKS since the block count can also change

Tested with xfstests generic/075 over NFSv4.2.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

NFS: show redacted cert_serial and privkey_serial in mount options

Mount output should not reveal the contents of the serials, only indicate
that they were provided.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

pNFS/filelayout: fix checking if a layout is striped

A layout can still be striped with num_fh = 1 as it is perfectly possible
that both MDS and DSs can handle the same filehandle. Hence check according
to stripe_count > 1, which is the correct check to begin with.

We should not be called with flseg->dsaddr = NULL, but if for some reason
we are, return our best guess, which is flseg->num_fh > 1.

Fixes: a6b9d2fa0024 ("pNFS/filelayout: Fix coalescing test for single DS")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

xprtrdma: Move long delayed work on system_dfl_long_wq

Currently the code enqueues work items with {queue|mod}_delayed_work()
on system_long_wq. This workqueue is meant for work that is expected to
run long, and it is a per-cpu workqueue.

These functions end up calling __queue_delayed_work(), which sets a global
timer that could fire anywhere, enqueuing the work where the timer fired.

Unbound work could benefit from scheduler task placement, to optimize
performance and power consumption. Long work shouldn't stick to a single
CPU.

Recently, a new unbound workqueue specific for long running work has
been added:

c116737e972e ("workqueue: Add system_dfl_long_wq for long unbound works")

Since the workqueue work doesn't rely on per-cpu variables, there is no
obvious reason that justifies the use of a per-cpu workqueue. So replace
system_long_wq with system_dfl_long_wq so that the work may benefit from
scheduler task placement.

Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

NFSv4: clear exception state on successful mkdir retry

After a server returns NFS4ERR_DELAY for an NFSv4 CREATE issued by
mkdir(2), the client correctly waits and retries.  When the retry
succeeds, however, mkdir(2) can still surface -EEXIST to userspace
even though the directory was just created on the server.

Reproducer (random 16-hex names so collisions are not the cause)
against an in-kernel Linux nfsd; reproduces under both NFSv4.0 and
NFSv4.2:

  N=2000000; base=/var/gdc/export
  for ((i=1; i<=N; i++)); do
      d=$base/$(openssl rand -hex 8)
      mkdir "$d" 2>/dev/null || echo "$(date +%T) failed loop=$i $d"
      rmdir "$d" 2>/dev/null
  done

Failures cluster at the cadence at which the server-side auth/export
cache refresh path causes nfsd to return NFS4ERR_DELAY for CREATE.

A wire trace of one failure (the three CREATE RPCs all come from a
single mkdir(2), generated by the do-while in nfs4_proc_mkdir()):

  client -> server  CREATE name=...  -> NFS4ERR_DELAY
  ~100 ms later
  client -> server  CREATE name=...  -> NFS4_OK         (dir created)
  ~80 us later
  client -> server  CREATE name=...  -> NFS4ERR_EXIST   (correct)

Since commit dd862da61e91 ("nfs: fix incorrect handling of large-number
NFS errors in nfs4_do_mkdir()"), nfs4_handle_exception() is called only
when _nfs4_proc_mkdir() returned an error.  That gate breaks retry-state
hygiene: nfs4_do_handle_exception() resets exception.{delay,recovering,
retry} to 0 on entry, so calling it on success is what previously
cleared the retry flag set by the preceding NFS4ERR_DELAY iteration.
With the gate in place, exception.retry stays at 1 after the successful
retry, the loop runs once more, and the resulting CREATE for an
already-created name yields NFS4ERR_EXIST -> -EEXIST to userspace.

Drop the conditional and call nfs4_handle_exception() unconditionally,
matching every other do-while in fs/nfs/nfs4proc.c (nfs4_proc_symlink(),
nfs4_proc_link(), etc.).  The dentry/status separation introduced by
that commit is preserved.
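
The stale-retry sequence can be reproduced with a small user-space model.
Everything below (the toy_ names, the error constants, the scripted server)
is a hypothetical simplification of the loop described above, not the code
in fs/nfs/nfs4proc.c.

```c
enum { TOY_OK = 0, TOY_DELAY = -1, TOY_EXIST = -2 };

struct toy_exception { int retry; };

/* Scripted server: delays the first CREATE, then creates the directory. */
struct toy_server { int calls; int created; };

static int toy_create(struct toy_server *s)
{
	s->calls++;
	if (s->calls == 1)
		return TOY_DELAY;
	if (s->created)
		return TOY_EXIST;
	s->created = 1;
	return TOY_OK;
}

/* Resets the retry flag on entry, then re-arms it only for a delay. */
static int toy_handle_exception(int err, struct toy_exception *exc)
{
	exc->retry = 0;
	if (err == TOY_DELAY) {
		exc->retry = 1;
		return 0;
	}
	return err;
}

/* gated != 0 models the behavior where the handler runs only on error. */
static int toy_mkdir(struct toy_server *s, int gated)
{
	struct toy_exception exc = { 0 };
	int err;

	do {
		err = toy_create(s);
		if (!gated || err)
			err = toy_handle_exception(err, &exc);
	} while (exc.retry);
	return err;
}
```

In the gated variant the model issues three CREATE calls and returns
TOY_EXIST, mirroring the wire trace; calling the handler unconditionally
stops after the second call with TOY_OK.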

Fixes: dd862da61e91 ("nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()")
Reported-and-tested-by: Jan Čípa <jan.cipa@gooddata.com>
Closes: https://lore.kernel.org/linux-nfs/CA+9S74hSp_tJu2Ffe2BPNC2T25gfkhgjjDkdgSsF5c2rnJq_wA@mail.gmail.com/
Reviewed-by: NeilBrown <neil@brown.name>
Cc: stable@vger.kernel.org
Signed-off-by: Igor Raits <igor.raits@gmail.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

NFSv4/flexfiles: reject zero filehandle version count

ff_layout_alloc_lseg() decodes the filehandle-version array count
from the flexfiles layout body. The value is used as the count for
kzalloc_objs(), and the current code only rejects NULL.

A zero count yields ZERO_SIZE_PTR, which can be stored in
dss_info->fh_versions even though later flexfiles paths assume that at
least one filehandle version exists.

Reject fh_count == 0 before the allocation, matching the existing zero
version_count validation in the flexfiles GETDEVICEINFO parser.

A QEMU/KASAN run with a malformed flexfiles layout hit:

  KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
  RIP: 0010:ff_layout_encode_ff_layoutupdate.isra.0+0x15f/0x750
  ff_layout_encode_layoutreturn+0x683/0x970
  nfs4_xdr_enc_layoutreturn+0x278/0x3a0
  Kernel panic - not syncing: Fatal exception

The patched kernel rejects the malformed layout without KASAN/oops/panic,
and a valid fh_count=1 regression still opens, reads, and unmounts cleanly.

Cc: stable@vger.kernel.org
Fixes: d67ae825a59d ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

sunrpc: Fix error handling in rpc_sysfs_xprt_switch_add_xprt_store()

xprt_create_transport() never returns NULL, only valid pointers or
error pointers. Using IS_ERR_OR_NULL() is incorrect, and PTR_ERR(NULL)
would return 0, which indicates EOF in a sysfs store function.

Fix this by using IS_ERR() instead of IS_ERR_OR_NULL().
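
To see why the NULL case maps to 0, here is a simplified user-space
re-implementation of the error-pointer helpers. These definitions are an
approximation written for this illustration (the kernel's own versions live
in include/linux/err.h and differ in detail), and they assume pointers and
long have the same width.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the integer value carried by a pointer. */
static long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MAX_ERRNO addresses. */
static int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Also true for NULL, for which PTR_ERR() yields 0 rather than an error. */
static int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}
```

With these definitions, an IS_ERR_OR_NULL() check followed by
return PTR_ERR(ptr) would return 0 for a NULL pointer, while IS_ERR() never
takes that branch for NULL.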

Fixes: df210d9b0951 ("sunrpc: Add a sysfs file for adding a new xprt")
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

nfs: remove fileid field from struct nfs_inode

Now that all NFS client code uses inode->i_ino directly to store and
access the 64-bit NFS fileid, the separate fileid field in struct
nfs_inode is unused. Remove it to save 8 bytes per NFS inode.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

nfs: replace NFS_FILEID() and nfsi->fileid with inode->i_ino

Now that inode->i_ino stores the full 64-bit NFS fileid, replace all
uses of NFS_FILEID(), set_nfs_fileid(), and direct nfsi->fileid
accesses with inode->i_ino throughout the NFS client.

Remove the NFS_FILEID() and set_nfs_fileid() helper functions from
include/linux/nfs_fs.h since they are no longer needed.

Also fix two pre-existing truncation bugs in nfs4trace.h where fileid
trace fields were declared as u32 instead of u64.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

nfs: remove nfs_compat_user_ino64() and deprecate enable_ino64

Now that inode->i_ino stores the full 64-bit NFS fileid, the
nfs_compat_user_ino64() function is no longer needed.
generic_fillattr() already copies inode->i_ino into stat->ino, so the
explicit override in nfs_getattr() is also redundant.

Also remove the now-unused nfs_fileid_to_ino_t() and
nfs_fattr_to_ino_t() helper functions that were used to XOR-fold
64-bit fileids into the old unsigned long i_ino.
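
As a rough illustration of what XOR-folding loses, the sketch below folds a
64-bit fileid into 32 bits, which is the case that matters when unsigned
long is 32 bits wide. The function name fold_fileid32() is hypothetical and
the code is not the removed helper itself.

```c
#include <stdint.h>

/* Fold the high half of a 64-bit fileid into the low half. */
static uint32_t fold_fileid32(uint64_t fileid)
{
	return (uint32_t)fileid ^ (uint32_t)(fileid >> 32);
}
```

Distinct fileids such as 0x0000000100000002 and 0x0000000200000001 fold to
the same value (3), which is why a folded i_ino could not stand in for the
fileid itself.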

Keep the enable_ino64 module parameter as a deprecated stub that
accepts but ignores the value, logging a notice when set. This avoids
breaking existing configurations that pass nfs.enable_ino64 on the
kernel command line or in modprobe.d.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

nfs: store the full NFS fileid in inode->i_ino

Now that inode->i_ino is a 64-bit value, store the full NFS fileid in
it directly instead of an XOR-folded hash. This makes NFS_FILEID() and
set_nfs_fileid() operate on inode->i_ino rather than the separate
nfsi->fileid field.

Since iget5_locked() and ilookup5() now accept a u64 hashval, pass the
full fileid as the hash parameter directly.

Convert direct nfsi->fileid accesses in nfs_check_inode_attributes(),
nfs_update_inode(), and nfs_same_file() to use inode->i_ino.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@hammerspace.com>

clk: qcom: regmap-phy-mux: Rework the implementation

The sole reason this hw exists is to let the branch clock downstream of
it keep running, with the PHY disengaged. This is not possible with the
current implementation, as the enabled status is hijacked to mean
"enabled" = "use fast/PHY source" and "disabled" = "use XO source".

This is an issue, since the mux enable state follows that of the child
branch, making the desired "child enabled, MUX @ XO" combination
impossible.

Solve that by implementing ratesetting. Because PHY clock rates may
change at runtime and aren't really deterministic from Linux, assume
ULONG_MAX as "fast clock" and 19.2 MHz as XO. All the branches in
question already set CLK_SET_RATE_PARENT, so everything works out.

Signed-off-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260409-topic-phy_fastclk-v1-1-6b4aaee56b90@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

clk: qcom: a53: Corrected frequency multiplier for 1152MHz

The 1152MHz frequency entry for the a53 currently selects a multiplier of
62, giving 1190MHz. Change the multiplier to 60, giving the intended
1152MHz.
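
The arithmetic can be checked with a short sketch. It assumes a 19.2 MHz
reference clock, which is the value implied by 1152 / 60 but is not stated
in this commit message; the helper name pll_rate_khz() is hypothetical.

```c
/* Assumed reference clock in kHz (19.2 MHz). */
#define TOY_REF_KHZ 19200UL

/* Output rate in kHz for a given integer multiplier. */
static unsigned long pll_rate_khz(unsigned int mult)
{
	return mult * TOY_REF_KHZ;
}
```

Under that assumption a multiplier of 62 gives 1190400 kHz and a multiplier
of 60 gives 1152000 kHz.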

Signed-off-by: Phillip Varney <pbvarney@protonmail.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Fixes: 0c6ab1b8f894 ("clk: qcom: Add A53 PLL support")
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260605005502.313928-1-pbvarney@protonmail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

ntfs: fix u16 truncation of restart-area length check

ntfs_check_restart_area() validates that the $LogFile restart area and
its trailing log client record array fit within the system page size:

        u16 ra_ofs, ra_len, ca_ofs;
        ...
        ra_len = ca_ofs + le16_to_cpu(ra->log_clients) *
                        sizeof(struct log_client_record);
        if (ra_ofs + ra_len > le32_to_cpu(rp->system_page_size) || ...)
                return false;

ra_len is u16, but the right-hand side is computed in size_t
(sizeof(struct log_client_record) == 160). Both ca_ofs and log_clients
come straight from the on-disk restart area. With an on-disk
log_clients of 410 the product 410 * 160 = 65600; adding ca_ofs and
storing into the u16 ra_len truncates modulo 65536 (e.g. ca_ofs 64
gives ra_len 128), so the "fits in the page" check passes even though
the client array described by log_clients extends far beyond the page.
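
The truncation can be reproduced in user space. The record size of 160 and
the example values come from the text above; the helper names and the
fits_in_page() wrapper are hypothetical and only model the shape of the
check.

```c
#include <stddef.h>
#include <stdint.h>

/* sizeof(struct log_client_record) as stated in the commit message. */
#define LOG_CLIENT_RECORD_SIZE ((size_t)160)

/* Length as the old code stored it: computed in size_t, narrowed to u16. */
static uint16_t ra_len_u16(uint16_t ca_ofs, uint16_t log_clients)
{
	return (uint16_t)(ca_ofs + log_clients * LOG_CLIENT_RECORD_SIZE);
}

/* Length kept in 32 bits, so no bits are lost for 16-bit inputs. */
static uint32_t ra_len_u32(uint16_t ca_ofs, uint16_t log_clients)
{
	return (uint32_t)(ca_ofs + log_clients * LOG_CLIENT_RECORD_SIZE);
}

/* Model of the "restart area fits in the page" bounds check. */
static int fits_in_page(uint32_t ra_ofs, uint32_t ra_len, uint32_t page_size)
{
	return ra_ofs + ra_len <= page_size;
}
```

For ca_ofs 64 and log_clients 410, the 16-bit variant yields 128 and passes
a 4096-byte page check, while the 32-bit variant yields 65664 and is
rejected.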

ntfs_check_log_client_array() then walks the array bounded only by the
on-disk log_clients count:

        cr = ca + idx;
        if (cr->prev_client != LOGFILE_NO_CLIENT) ...

For log_clients 410 it dereferences records up to ca + 409 * 160,
~64 KiB past the kvzalloc(system_page_size) restart-page buffer -- an
out-of-bounds read of attacker-controlled extent, reachable when a
crafted NTFS image is mounted (load_and_check_logfile() at mount time).
This is the in-kernel analogue of CVE-2022-30789, fixed in the ntfs-3g
userspace driver but never in this revived classic driver.

Compute the restart-area length in a u32 so the existing bounds check
rejects an over-large client array instead of being defeated by the
truncation. Widen ra_ofs and ca_ofs to u32 as well: both are loaded
from __le16 on-disk fields and every comparison already promotes to
int/size_t, so this changes no result and keeps the declaration uniform.

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

ntfs: bound the attribute-list entry in ntfs_read_inode_mount()

The $MFT attribute-list walk in ntfs_read_inode_mount() validates each
entry only with "(u8 *)al_entry + 6 > al_end" and
"(u8 *)al_entry + le16_to_cpu(al_entry->length) > al_end", but then reads
al_entry->lowest_vcn (an __le64 at offset 8) and al_entry->mft_reference
(offset 16) -- fields beyond the 6 bytes proven in range. al_entry->length
is attacker-controlled and only required non-zero, so a short entry (e.g.
length 8) placed at the tail passes both checks while the lowest_vcn /
mft_reference reads fall past al_end.

al_end is ni->attr_list + attr_list_size (the on-disk size); the buffer is
kvzalloc(round_up(attr_list_size, SECTOR_SIZE)), so the sector rounding
usually absorbs the over-read -- but when attr_list_size is a multiple of
SECTOR_SIZE there is no slack and a crafted $MFT attribute list produces an
out-of-bounds read at mount time.

Validate the entry with ntfs_attr_list_entry_is_valid() (added in patch
1/3) before dereferencing it, matching the bound the other attribute-list
walks now use. The validator already requires the length to cover the fixed
header, which makes the separate "!al_entry->length" check redundant, so
drop it too.

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

ntfs: bound the look-ahead attribute-list entry in ntfs_external_attr_find()

When resolving an attribute lookup with a non-zero @lowest_vcn,
ntfs_external_attr_find() peeks at the next $ATTRIBUTE_LIST entry to
decide whether to keep searching, but bounds that not-yet-validated
entry only with "(u8 *)next_al_entry + 6 < al_end" (which proves just
bytes 0..6 are in range) and "(u8 *)next_al_entry + length <= al_end"
with an attacker-controlled, non-8-aligned length. It then reads
next_al_entry->lowest_vcn (an __le64 at offset 8) and the name at
next_al_entry->name_offset, both of which can lie past al_end -- the
exact end of the kvmalloc'd attribute-list buffer (allocated at the
on-disk attr_list_size, no rounding). A crafted on-disk $ATTRIBUTE_LIST
whose last entry sits a few bytes before al_end therefore yields a slab
out-of-bounds read when the inode is read.

Validate the look-ahead entry with ntfs_attr_list_entry_is_valid() (added
in patch 1/3) before dereferencing lowest_vcn and the name, so the same
fixed-header, length and name bounds the main attribute-list walk uses now
guard this read too.

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

ntfs: validate resident attribute lists and harden the validator

A base inode's $ATTRIBUTE_LIST is sanity-checked by load_attribute_list()
only on the non-resident path; ntfs_read_locked_inode() copies a *resident*
attribute list into ni->attr_list with a plain memcpy() and no validation
at all. Every subsequent walk of ni->attr_list --
ntfs_external_attr_find(), ntfs_inode_attach_all_extents() and
ntfs_attrlist_need() -- then trusts the entries are well-formed and reads
attr_list_entry fixed-header fields
(lowest_vcn at offset 8, mft_reference at offset 16, and the name) with
bounds that assume validation already happened. A crafted resident
attribute list therefore reaches those walks unvalidated and can drive
out-of-bounds reads of the attribute-list buffer.

load_attribute_list() itself reads ale->name_offset (offset 7),
ale->mft_reference (offset 16) and the name length under only an
"al < al_start + size" bound, so its own validation loop can over-read the
fixed header of a truncated trailing entry by a few bytes.

Factor the per-entry validation into ntfs_attr_list_entry_is_valid(),
which requires each entry's fixed header (offsetof(struct
attr_list_entry, name)) to be in range before any field is dereferenced,
that ale->length is a multiple of 8 covering the fixed header plus the
name, and that the entry is in use and carries a live MFT reference.
ntfs_attr_list_is_valid() walks the buffer with it and checks the entries
tile it exactly. Use the list validator in load_attribute_list()
(replacing the open-coded loop, closing its own over-read) and on the
resident path in ntfs_read_locked_inode() (which previously skipped
validation entirely); patches 2/3 and 3/3 reuse the per-entry helper at the
other two attribute-list walks.
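
A rough user-space sketch of the bounds portion of such a per-entry check,
operating on a raw byte buffer. The toy_ names are hypothetical, the fixed
header size of 26 bytes is an assumption about offsetof(struct
attr_list_entry, name), and the in-use and MFT-reference checks are left
out; this is not the helper added by the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed offsetof(struct attr_list_entry, name). */
#define TOY_ALE_HDR_SIZE 26u

/* Read a little-endian 16-bit value from a byte buffer. */
static uint16_t toy_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Bounds-only check of one entry located at ale, with al_end exclusive. */
static bool toy_ale_is_valid(const uint8_t *ale, const uint8_t *al_end)
{
	size_t avail, length, name_end;

	if (ale >= al_end)
		return false;
	avail = (size_t)(al_end - ale);
	/* The whole fixed header must be in range before any field is read. */
	if (avail < TOY_ALE_HDR_SIZE)
		return false;
	length = toy_le16(ale + 4);		/* length field at offset 4 */
	if (length < TOY_ALE_HDR_SIZE || (length & 7) || length > avail)
		return false;
	if (ale[6] == 0)			/* name_length at offset 6 */
		return true;
	/* name_offset at offset 7; names use 2-byte characters. */
	name_end = ale[7] + (size_t)ale[6] * 2;
	return name_end <= length;
}
```

A short entry with length 8 placed at the tail of the buffer fails the
fixed-header test here before lowest_vcn or mft_reference would be read.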

Fixes: 1e9ea7e04472 ("Revert "fs: Remove NTFS classic"")
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

thermal: sysfs: Replace sscanf() with kstrtoul()

Replace sscanf() with kstrtoul() in cur_state_store(), as kstrto<type>
is preferred over single-variable sscanf().

Signed-off-by: Ovidiu Panait <ovidiu.panait.oss@gmail.com>
[ rjw: Changelog edits ]
Link: https://patch.msgid.link/20260606210420.2311145-3-ovidiu.panait.oss@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: testing: Replace sscanf() with kstrtoint()

Generally, kstrtoint() is preferred to sscanf() in kernel code, so
replace the latter with the former in tt_del_tz() and tt_get_tt_zone().

Signed-off-by: Ovidiu Panait <ovidiu.panait.oss@gmail.com>
[ rjw: Changelog rewrite ]
Link: https://patch.msgid.link/20260606210420.2311145-2-ovidiu.panait.oss@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

btrfs: tracepoints: add trace event for log_new_dir_dentries()

log_new_dir_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for log_all_new_ancestors()

log_all_new_ancestors() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for btrfs_log_all_parents()

btrfs_log_all_parents() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for btrfs_log_inode()

btrfs_log_inode() is one of the most important steps called during a fsync,
as well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use a named enum for the log mode in inode log functions

We use an unnamed enum for the log mode and then pass it around log
functions as an int type with the odd name "inode_only", which suggests a
boolean. So add a name to the enum, change the type everywhere to that
enum, and rename the parameters to something clearer: "log_mode".
Also move the enum into tree-log.h, since it will be used later by new
trace events.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for btrfs_log_inode_parent()

btrfs_log_inode_parent() is one of the most important steps called during
a fsync operation as well as during rename and link operations on inodes
that were previously logged. Add trace events for when entering and
exiting that function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tracepoints: add trace event for when fsync finishes

Currently we only have a trace event for when a fsync operation starts,
but this alone is not very helpful. Add a trace event for when fsync
finishes, which reports its return value, so that using tracing we can
see which other trace events happened in between (several will be added
soon for inode logging steps) and even measure execution time.

So rename the existing trace event btrfs_sync_file to
btrfs_sync_file_enter and add the trace event btrfs_sync_file_exit.
The naming is similar to what ext4 does (ext4_sync_file_enter and
ext4_sync_file_exit) and with similar information reported.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant writeback error check during fsync

If we can skip logging the inode during fsync, we check for writeback
errors in the inode's mapping by calling filemap_check_wb_err() and then
jump to the 'out_release_extents' label, which in turn jumps to the 'out'
label under which we check again for a writeback error by calling
file_check_and_advance_wb_err(). So the filemap_check_wb_err() ends up
being redundant. This happens since commit 333427a505be ("btrfs: minimal
conversion to errseq_t writeback error reporting on fsync").

Remove the filemap_check_wb_err() call.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: stop checking for greater than zero return values in btrfs_sync_file()

The value of 'ret' can never be greater than zero when we reach the end of
btrfs_sync_file() but we have this ternary operator converting any such
value into -EIO. This logic exists since the first fsync implementation,
added in 2007 by commit 8fd17795b226 ("Btrfs: early fsync support"), when
all that fsync did was simply to commit a transaction, but even a call to
btrfs_commit_transaction() could never return a value greater than zero.

So stop checking for a greater than zero value and assert that 'ret' is
never greater than zero, to catch any eventual regression during future
development.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>