Eric Biggers [Mon, 20 Apr 2026 06:34:19 +0000 (23:34 -0700)]
crypto: drbg - Change DRBG_MAX_REQUESTS to 4096
Currently a formal reseed happens only after each 1048576 requests.
That's quite a high number. Let's follow the example of BoringSSL and
use a more conservative value of 4096.
Note that in practice this makes little difference, now that we're
including 32 bytes from get_random_bytes() in the additional input on
every request anyway, which is a de facto reseed.
But for the same reason, we might as well decrease the actual reseed
interval to something more reasonable.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:18 +0000 (23:34 -0700)]
crypto: drbg - Include get_random_bytes() output in additional input
Woodage & Shumow (2018) (https://eprint.iacr.org/2018/349.pdf) showed
that contrary to the claims made by NIST in SP800-90A, HMAC_DRBG doesn't
satisfy the formal definition of forward secrecy (i.e. "backtracking
resistance") when it's called with an empty additional input string.
The actual attack seems pretty benign, as it doesn't actually give the
attacker any previous RNG output, but rather just allows them to test
whether their guess of the previous block of RNG output is correct.
Regardless, it's an annoying design flaw, and it's yet another example
of why NIST's DRBGs aren't all that great.
Meanwhile, the kernel's HMAC_DRBG code also tries to reseed itself
automatically after random.c has reseeded itself. But the
implementation is buggy, as it just checks whether 300 seconds have
elapsed, rather than looking at the actual generation counter.
Let's just follow the example of BoringSSL and use the conservative
approach of always including 32 bytes of "regular" random data in the
additional input string. This fixes both issues described above.
This does reduce performance. But this should be tolerable, since:
- Due to earlier changes, the kernel code that was previously using
drbg.c regardless of FIPS mode is now using it only in FIPS mode.
- The additional input string is processed only once per request. So
if a lot of bytes are generated at once, the cost is amortized.
- The NIST DRBGs are notoriously slow anyway.
Note that this fix should have no impact (either positive or negative)
on FIPS 140 certifiability. From FIPS's point of view the code added by
this commit simply doesn't matter: it adds zero entropy to something
that doesn't need to contain entropy.
Fixes: 541af946fe13 ("crypto: drbg - SP800-90A Deterministic Random Bit Generator") Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:17 +0000 (23:34 -0700)]
crypto: drbg - Simplify "uninstantiate" logic
drbg_kcapi_seed() calls drbg_uninstantiate() only to free drbg->jent and
set drbg->instantiated = false. However, the latter is necessary only
because drbg_kcapi_seed() sets drbg->instantiated = true too early. Fix
that, then just inline the freeing of drbg->jent.
Then, simplify the actual "uninstantiate" in drbg_kcapi_exit(). Just
free drbg->jent (note that this is a no-op on error and null pointers),
then memzero_explicit() the entire drbg_state.
Note that in reality the memzero_explicit() is redundant, since the
crypto_rng API zeroizes the memory anyway. But the way SP800-90A is
worded, it's easy to imagine that someone assessing conformance with it
would be looking for code in drbg.c that says it does an "Uninstantiate"
and does the zeroization. So it's probably worth keeping it somewhat
explicit, even though that means double zeroization in practice.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:13 +0000 (23:34 -0700)]
crypto: drbg - Put rng_alg methods in logical order
Put the DRBG implementation of the rng_alg methods in the order in which
they're called (cra_init => set_ent => seed => generate => cra_exit) so
that it's easier to understand the flow.
Also rename drbg_kcapi_random to drbg_kcapi_generate, and
drbg_kcapi_cleanup to drbg_kcapi_exit, so they match the method names.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:10 +0000 (23:34 -0700)]
crypto: drbg - Consolidate "instantiate" logic and remove drbg_state::C
Currently some of the steps of "Instantiate the DRBG" occur in a
convoluted way in different places in the call stack:
drbg_instantiate()
=> drbg_seed(reseed=0) sets the raw HMAC key drbg_state::C to its
correct initial value, and it sets the state value
drbg_state::V to an *incorrect* initial value.
=> drbg_hmac_update(reseed=0) overwrites drbg_state::V with the
correct initial value, then prepares the hmac_sha512_key
drbg_state::key from the initial raw HMAC key drbg_state::C.
Later, each time the HMAC key is updated, drbg_hmac_update() also uses
drbg_state::C to temporarily store the new raw key.
Simplify all of this by:
- Making drbg_instantiate() set the correct initial values of
drbg_state::V and drbg_state::key.
- Converting drbg_hmac_update() to generate the raw key in a
temporary on-stack array instead of drbg_state::C.
- Removing drbg_state::C.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:08 +0000 (23:34 -0700)]
crypto: drbg - Install separate seed functions for pr and nopr
Set rng_alg::seed to different functions for the prediction-resistant
and non-prediction-resistant algorithms, so that the function does not
need to parse the algorithm name to figure out which algorithm it is.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:04 +0000 (23:34 -0700)]
crypto: drbg - Move fixed values into constants
Since only one drbg_core remains, the state length, block length, and
security strength are now fixed values. Moreover, the maximum request
length, maximum additional data length, and maximum number of requests
were all already fixed values.
Simplify the code by just using #defines for all these fixed values.
In drbg_seed_from_random(), take advantage of the constant to define the
array size. Remove assertions that are no longer useful.
In the case of drbg_blocklen() and drbg_statelen(), replace these with a
single value DRBG_STATE_LEN, as for HMAC_DRBG they are the same thing.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:03 +0000 (23:34 -0700)]
crypto: drbg - De-virtualize drbg_state_ops
Now that there's only one set of state operations, use direct calls to
those operations.
No change in behavior. In particular, drbg_alloc_state() doesn't change
behavior, because the only remaining drbg_core uses HMAC_DRBG.
drbg_uninstantiate() doesn't change behavior, because a NULL d_ops
implied NULL priv_data which makes a drbg_fini_hash_kernel() a no-op.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:02 +0000 (23:34 -0700)]
crypto: drbg - Simplify algorithm registration
Now that "drbg_pr_hmac_sha512" and "drbg_nopr_hmac_sha512" are the only
crypto_rng algorithms left in crypto/drbg.c, simplify the algorithm
registration logic to register these more directly without relying on
the drbg_cores[] array (which will be removed).
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:01 +0000 (23:34 -0700)]
crypto: drbg - Remove support for HMAC-SHA256 and HMAC-SHA384
Remove support for the HMAC-SHA256 and HMAC-SHA384 variants of
HMAC_DRBG, leaving only the HMAC-SHA512 variant of HMAC_DRBG.
HMAC-SHA512 is already the default. The default did used to be
HMAC-SHA256, but several years ago it was upgraded to HMAC-SHA512 "to
support compliance with SP800-90B and SP800-90C". Given that the point
of crypto/drbg.c is compliance with those standards, and there's also no
technical reason to prefer HMAC-SHA384 in this situation even if
acceptable, there's really no point in offering anything else.
Note: now that only HMAC-SHA512 remains, a lot of unnecessary
abstractions can be removed. A later commit will do that. This commit
just straightforwardly removes the HMAC-SHA256 and HMAC-SHA384 code.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:34:00 +0000 (23:34 -0700)]
crypto: testmgr - Update test for drbg_nopr_hmac_sha512
Synchronize the drbg_nopr_hmac_sha512 test vector with the first test
vector from the latest ACVP json files, so that both of the DRBG test
vectors are pulled from a consistent source.
Note that the new test vector has a nonempty personalization string.
That should be helpful as well: Some FIPS labs require this, due to
their interpretation of SP800-90A 11.3.2 which says that a
"representative" value of the personalization string must be tested.
It also now does an explicit reseed, which makes it clearer that the
requirement to test "Reseed" is met, without having to interpret the
additional input processing as covering that.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:58 +0000 (23:33 -0700)]
crypto: drbg - Flatten the DRBG menu
Now that the menuconfig CRYPTO_DRBG_MENU has no options in it other than
the hidden symbol CRYPTO_DRBG, remove it and move CRYPTO_DRBG to its
parent menu. Give CRYPTO_DRBG an appropriate prompt and help text.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:57 +0000 (23:33 -0700)]
crypto: drbg - Remove support for HASH_DRBG
Remove the support for HASH_DRBG. It's likely unused code, seeing as
HMAC_DRBG is always enabled and prioritized over it unless
NETLINK_CRYPTO is used to change the algorithm priorities.
There's also no compelling reason to support more than one of
[HMAC_DRBG, HASH_DRBG, CTR_DRBG]. By definition, callers cannot tell
any difference in their outputs. And all are FIPS-certifiable, which is
the only point of the kernel's NIST DRBGs anyway.
Switching to HASH_DRBG doesn't seem all that compelling, either. For
one, it's more complex than HMAC_DRBG.
Thus, let's just drop HASH_DRBG support and focus on HMAC_DRBG.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:56 +0000 (23:33 -0700)]
crypto: drbg - Remove support for CTR_DRBG
Remove the support for CTR_DRBG. It's likely unused code, seeing as
HMAC_DRBG is always enabled and prioritized over it unless
NETLINK_CRYPTO is used to change the algorithm priorities.
There's also no compelling reason to support more than one of
[HMAC_DRBG, HASH_DRBG, CTR_DRBG]. By definition, callers cannot tell
any difference in their outputs. And all are FIPS-certifiable, which is
the only point of the kernel's NIST DRBGs anyway.
Switching to CTR_DRBG doesn't seem all that compelling, either. While
it's often the fastest NIST DRBG, it has several disadvantages:
- CTR_DRBG uses AES. Some platforms don't have AES acceleration at all,
causing a fallback to the table-based AES code which is very slow and
can be vulnerable to cache-timing attacks. In contrast, HMAC_DRBG
uses primitives that are consistently constant-time.
- CTR_DRBG is usually considered to be somewhat less cryptographically
robust than HMAC_DRBG. Granted, HMAC_DRBG isn't all that great
either, e.g. given the negative result from Woodage & Shumow (2018)
(https://eprint.iacr.org/2018/349.pdf), but that can be worked around.
- CTR_DRBG is more complex than HMAC_DRBG, risking bugs. Indeed, while
reviewing the CTR_DRBG code, I found two bugs, including one where it
can return success while leaving the output buffer uninitialized.
- The kernel's implementation of CTR_DRBG uses an "ctr(aes)"
crypto_skcipher and relies on it returning the next counter value.
That's fragile, and indeed historically many "ctr(aes)"
crypto_skcipher implementations haven't done that. E.g. see
commit 511306b2d075 ("crypto: arm/aes-ce - update IV after partial final CTR block"),
commit fa5fd3afc7e6 ("crypto: arm64/aes-blk - update IV after partial final CTR block"),
commit 371731ec2179 ("crypto: atmel-aes - Fix saving of IV for CTR mode"),
commit 25baaf8e2c93 ("crypto: crypto4xx - fix ctr-aes missing output IV"),
commit 334d37c9e263 ("crypto: caam - update IV using HW support"),
commit 0a4491d3febe ("crypto: chelsio - count incomplete block in IV"),
commit e8e3c1ca57d4 ("crypto: s5p - update iv after AES-CBC op end").
I.e., there were many years where the kernel's CTR_DRBG code (if it
were to have actually been used) repeated outputs on some platforms.
AES-CTR also uses a 128-bit counter, which creates overflow edge cases
that are sometimes gotten wrong. E.g. see commit 009b30ac7444
("crypto: vmx - CTR: always increment IV as quadword").
So, while switching to CTR_DRBG for performance reasons isn't completely
out of the question (notably BoringSSL uses it), it would take quite a
bit more work to create a solid implementation of it in the kernel,
including a more solid implementation of AES-CTR itself (in lib/crypto/,
with a scalar bit-sliced fallback, etc). Since HMAC_DRBG has always
been the default NIST DRBG variant in the kernel and is in a better
state, let's just standardize on it for now.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:55 +0000 (23:33 -0700)]
crypto: drbg - Remove import of crypto_cipher functions
The inclusion of <crypto/internal/cipher.h> and the import of the
internal crypto namespace became unnecessary in commit ba0570bdf1d9
("crypto: drbg - Replace AES cipher calls with library calls").
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:53 +0000 (23:33 -0700)]
crypto: drbg - Remove obsolete FIPS 140-2 continuous test
FIPS 140-2 required that a continuous test for repeated outputs be done
on both "Approved RNGs" and "Non-Approved RNGs".
That's apparently why crypto/drbg.c does such a test on the bytes it
pulls from get_random_bytes(), despite get_random_bytes() being a
"Non-Approved RNG" that is credited with zero entropy for FIPS purposes.
(From FIPS's point of view, the "Approved RNG" is jitterentropy.)
FIPS 140-3 "modernized" the continuous RNG test requirements. They're
now a bit more sophisticated, requiring both an "Adaptive Proportion
Test" and a "Repetition Count Test".
At the same time, FIPS 140-3 doesn't require continuous RNG tests on
"Non-Approved RNGs" if a "vetted conditioning component" is used. The
SP800-90A DRBGs are exactly such a vetted conditioning component, by
their design. (In the case of HASH_DRBG and CTR_DRBG, the derivation
function does have to be implemented. But the kernel does that.)
In other words: from FIPS 140-3's point of view, get_random_bytes()
still produces zero entropy, but the way the DRBG combines those bytes
with the jitterentropy bytes preserves all the "approved" entropy from
jitterentropy. Thus no test for get_random_bytes() is required.
Seeing as FIPS 140-2 certificates stopped being issued in 2021 in favor
of FIPS 140-3, this means this code is obsolete. Remove it.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:52 +0000 (23:33 -0700)]
crypto: drbg - Remove unhelpful helper functions
Fold the contents of the inline functions crypto_drbg_get_bytes_addtl(),
crypto_drbg_get_bytes_addtl_test(), and crypto_drbg_reset_test() into
their only caller in drbg_cavs_test(). It ends up being much simpler.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:51 +0000 (23:33 -0700)]
crypto: drbg - Remove broken commented-out code
This commented-out code doesn't compile. Even if it did, it wouldn't
actually do what it was apparently intended to do, seeing as the "test"
for "drbg_pr_hmac_sha512" and "drbg_pr_ctr_aes256" is alg_test_null().
Just delete it to avoid keeping broken code around, and so that there
isn't any perceived need to try to update it as the DRBG code is
refactored.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:50 +0000 (23:33 -0700)]
crypto: drbg - Remove always-enabled symbol CRYPTO_DRBG_HMAC
The kconfig symbol CRYPTO_DRBG_HMAC is always enabled when
CRYPTO_DRBG_MENU is enabled, and all checks for CRYPTO_DRBG_HMAC are in
code conditional on CRYPTO_DRBG_MENU. Thus, the only purpose of the
CRYPTO_DRBG_HMAC symbol is to select CRYPTO_HMAC and CRYPTO_SHA512.
Move those two selections to CRYPTO_DRBG_MENU, remove the checks for
CRYPTO_DRBG_HMAC, and remove the CRYPTO_DRBG_HMAC symbol itself.
Note that this also fixes an issue where CRYPTO_HMAC and CRYPTO_SHA512
were unnecessarily being forced to built-in when CRYPTO_DRBG=m.
Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:49 +0000 (23:33 -0700)]
crypto: drbg - Fix the fips_enabled priority boost
When fips_enabled=1, it seems to have been intended for one of the
algorithms defined in crypto/drbg.c to be the highest priority "stdrng"
algorithm, so that it is what is used by "stdrng" users.
However, the code only boosts the priority to 400, which is less than
the priority 500 used in drivers/crypto/caam/caamprng.c. Thus, the CAAM
RNG could be used instead.
Fix this by boosting the priority by 2000 instead of 200.
Fixes: 541af946fe13 ("crypto: drbg - SP800-90A Deterministic Random Bit Generator") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:48 +0000 (23:33 -0700)]
crypto: drbg - Fix drbg_max_addtl() on 64-bit kernels
On 64-bit kernels, drbg_max_addtl() returns 2**35 bytes. That's too
large, for two reasons:
1. SP800-90A says the maximum limit is 2**35 *bits*, not 2**35 bytes.
So the implemented limit has confused bits and bytes.
2. When drbg_kcapi_hash() calls crypto_shash_update() on the additional
information string, the length is implicitly cast to 'unsigned int'.
That truncates the additional information string to U32_MAX bytes.
Fix the maximum additional information string length to always be
U32_MAX - 1, causing an error to be returned for any longer lengths.
Fixes: 541af946fe13 ("crypto: drbg - SP800-90A Deterministic Random Bit Generator") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:47 +0000 (23:33 -0700)]
crypto: drbg - Fix ineffective sanity check
Fix drbg_healthcheck_sanity() to correctly check the return value of
drbg_generate(). drbg_generate() returns 0 on success, or a negative
errno value on failure. drbg_healthcheck_sanity() incorrectly assumed
that it returned a positive value on success.
This didn't make the sanity check fail, but it made it ineffective.
Fixes: cde001e4c3c3 ("crypto: rng - RNGs must return 0 in success case") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Eric Biggers [Mon, 20 Apr 2026 06:33:46 +0000 (23:33 -0700)]
crypto: drbg - Fix misaligned writes in CTR_DRBG and HASH_DRBG
drbg_cpu_to_be32() is being used to do a plain write to a byte array,
which doesn't have any alignment guarantee. This can cause a misaligned
write. Replace it with the correct function, put_unaligned_be32().
Fixes: 72f3e00dd67e ("crypto: drbg - replace int2byte with cpu_to_be") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Harshal Dev [Thu, 16 Apr 2026 13:07:20 +0000 (18:37 +0530)]
dt-bindings: crypto: qcom-qce: Document the Glymur crypto engine
Document the crypto engine on Glymur platform.
Signed-off-by: Harshal Dev <harshal.dev@oss.qualcomm.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested on hardware with an ATECC608B at 0x60. The device binds
successfully, passes the driver's sanity check, and registers the
ecdh-nist-p256 KPP algorithm.
The hardware ECDH path was also exercised using a minimal KPP test
module, covering private key generation, public key derivation, and
shared secret computation.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
crypto: ccp - Initialize data during __sev_snp_init_locked()
Sashiko notes:
> is the stack variable data left uninitialized when taking the else branch?
> Since data.tio_en is later evaluated unconditionally, could stack garbage
> cause it to evaluate to true, leading to erroneous attempts to allocate
> pages and initialize SEV-TIO on unsupported hardware?
If the firmware is too old to support SEV_INIT_EX, data is left
uninitialized but used in the debug logging about whether TIO is enabled or
not.
crypto: ccp - Check for page allocation failure correctly in TIO
Sashiko notes:
> if __snp_alloc_firmware_pages() returns NULL under memory pressure, is it
> safe to pass it directly to page_address()?
>
> On architectures without HASHED_PAGE_VIRTUAL, page_address(NULL) might
> compute a deterministic but invalid, non-zero virtual address. The
> subsequent if (tio_status) check would then evaluate to true, and
> sev_tsm_init_locked() would dereference the invalid pointer.
Indeed, page_address(NULL) will return non-NULL garbage here. Fix this by
checking the page allocation itself for NULL, not the resulting virtual
address.
> regarding the bounds check in snp_filter_reserved_mem_regions()
> called via walk_iomem_res_desc(): does the check
> if ((range_list->num_elements * 16 + 8) > PAGE_SIZE)
> allow an off-by-one heap buffer overflow?
>
> If range_list->num_elements is 255, 255 * 16 + 8 = 4088, which is <= 4096.
> Writing range->base (8 bytes) fills 4088-4095, but writing range->page_count
> (4 bytes) would write to 4096-4099, overflowing the kzalloc-allocated
> PAGE_SIZE buffer.
Fix this by accounting for the entry about to be written to, in addition to
the entries that are already allocated.
Fixes: 1ca5614b84ee ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP") Reported-by: Sashiko Assisted-by: Gemini:gemini-3.1-pro-preview Link: https://sashiko.dev/#/patchset/20260324161301.1353976-1-tycho%40kernel.org Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
crypto: ccp - Reverse the cleanup order in psp_dev_destroy()
Before SNP x86 shutdown [1], all HV_FIXED pages were always leaked on
module unload. Now pages can be reclaimed if they are freed before SNP
shutdown.
The SFS driver does sfs_dev_destroy() -> snp_free_hv_fixed_pages(), marking
the command buffer as free. But this happens after sev_dev_destroy() in
psp_dev_destroy(), so the pages are always leaked.
Rearrange psp_dev_destroy() to destroy things in the reverse order from
psp_init(), so that any dependencies can be unwound accordingly. This lets
SFS free the page and the subsequent SNP shutdown release it.
This was identified with use of Chris Mason's review-prompts:
https://github.com/masoncl/review-prompts
David Gow [Thu, 16 Apr 2026 06:57:43 +0000 (14:57 +0800)]
x86/boot/e820: Re-enable BIOS fallback if e820 table is empty
In commit:
157266edcc56 ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables")
the check on the number of entries in the e820 table was removed. The intention
was to support single-entry maps, but by removing the check entirely, we also
skip the fallback (to, e.g., the BIOS 88h function).
This means that if no E820 map is passed in from the bootloader (which is the
case on some bootloaders, like linld), we end up with an empty memory map, and
the kernel fails to boot (either by deadlocking on OOM, or by failing to
allocate the real mode trampoline, or similar).
Re-instate the check in append_e820_table(), but only check that nr_entries is
non-zero. This allows e820__memory_setup_default() to fall back to other memory
size sources, and doesn't affect e820__memory_setup_extended(), as the latter
ignores the return value from append_e820_table().
In doing so, we also update the return values to be proper error codes, with
-ENOENT for this case (there are no entries), and -EINVAL for the case where an
entry appears invalid. Given none of the callers check the actual value -- just
whether it's nonzero -- this is largely aesthetic in practice.
Tested against linld, and the kernel boots again fine.
[ mingo: Readability edits to the comment and the changelog. ]
Fixes: 157266edcc56 ("x86/boot/e820: Simplify append_e820_table() and remove restriction on single-entry tables") Signed-off-by: David Gow <david@davidgow.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: stable@vger.kernel.org Cc: Arnd Bergmann <arnd@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://patch.msgid.link/20260416065746.1896647-1-david@davidgow.net
Maoyi Xie [Mon, 4 May 2026 14:27:36 +0000 (22:27 +0800)]
xfrm: route MIGRATE notifications to caller's netns
xfrm_send_migrate() in net/xfrm/xfrm_user.c and pfkey_send_migrate()
in net/key/af_key.c both hardcode &init_net for the multicast that
announces a successful XFRM_MSG_MIGRATE / SADB_X_MIGRATE.
XFRM_MSG_MIGRATE arrives on a per-netns NETLINK_XFRM socket, and the
rest of the xfrm/af_key netlink path was made netns-aware in 2008.
The other 14 multicast paths in xfrm_user.c route their event using
xs_net(x), xp_net(xp) or sock_net(skb->sk); only the migrate path
was missed.
Two consequences of the init_net hardcoding:
1. The notification (selector, old/new endpoint addresses, and the
km_address) is delivered to listeners on init_net's
XFRMNLGRP_MIGRATE / pfkey BROADCAST_ALL groups rather than on
the issuing netns. An IKE daemon running in init_net therefore
receives migration notifications originating from any other
netns on the host.
2. An IKE daemon running inside a non-init netns and subscribed
to its own XFRMNLGRP_MIGRATE / pfkey groups never receives the
notification of its own migration. IKEv2 MOBIKE / address-update
handling inside a netns is silently broken.
Thread struct net through km_migrate() and the xfrm_mgr.migrate
function pointer, drop the &init_net override in xfrm_send_migrate()
and pfkey_send_migrate(), and pass the caller's net (already in
scope in xfrm_migrate() via sock_net(skb->sk)) all the way down.
struct xfrm_mgr is in-tree only and not exported as a stable API,
so the function-pointer signature change is internal.
pfkey_broadcast() is already netns-aware via net_generic(net,
pfkey_net_id) since the pernet conversion. The five other
pfkey_broadcast() callers in af_key.c already pass xs_net(x),
sock_net(sk) or a per-netns net, so this only removes the
&init_net outlier.
Fixes: 5c79de6e79cd ("[XFRM]: User interface for handling XFRM_MSG_MIGRATE") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Shuicheng Lin [Fri, 1 May 2026 17:59:56 +0000 (17:59 +0000)]
drm/gpusvm: Drop redundant @flags.* kernel-doc on struct drm_gpusvm_pages
The kernel-doc block above struct drm_gpusvm_pages duplicates the
descriptions of the bit-flags that live in struct drm_gpusvm_pages_flags
using dotted notation (@flags.migrate_devmem, @flags.unmapped, ...).
That dotted notation is intended for nested anonymous structs/unions that
the parser flattens into the parent's parameter list. Here, however,
flags is of a named external type, so the parser does not flatten its
members and the dotted entries do not match any member of
drm_gpusvm_pages. They also duplicate the canonical descriptions already
present in the kernel-doc of struct drm_gpusvm_pages_flags itself.
Drop the five @flags.* lines and replace them with a single @flags entry
that cross-references the type via kernel-doc's "&struct ..." syntax.
This eliminates the redundancy and removes warnings emitted by the new
parameterdescs check in scripts/kernel-doc:
Excess struct member 'flags.migrate_devmem' description in
'drm_gpusvm_pages'
Excess struct member 'flags.unmapped' description in 'drm_gpusvm_pages'
Excess struct member 'flags.partial_unmap' description in
'drm_gpusvm_pages'
Excess struct member 'flags.has_devmem_pages' description in
'drm_gpusvm_pages'
Excess struct member 'flags.has_dma_mapping' description in
'drm_gpusvm_pages'
No functional change.
Assisted-by: Claude:claude-opus-4.6 Cc: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260501175956.4054088-1-shuicheng.lin@intel.com
Linus Torvalds [Thu, 7 May 2026 05:02:28 +0000 (22:02 -0700)]
Merge tag 'v7.1-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix memory leak in connection free
- Fix inherited ACL ACE validation
- Minor cleanup
- Fix for share config
- Fix durable handle cleanup race
- Fix close_file_table_ids in session teardown
- smbdirect fixes:
- Fix memory region registration
- Two fixes for out-of-tree builds
* tag 'v7.1-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: validate inherited ACE SID length
ksmbd: fix kernel-doc warnings from ksmbd_conn_get/put()
ksmbd: fail share config requests when path allocation fails
ksmbd: close durable scavenger races against m_fp_list lookups
ksmbd: harden file lifetime during session teardown
ksmbd: centralize ksmbd_conn final release to plug transport leak
smb: smbdirect: fix MR registration for coalesced SG lists
smb: smbdirect: introduce and use include/linux/smbdirect.h
smb: smbdirect: make use of DEFAULT_SYMBOL_NAMESPACE and EXPORT_SYMBOL_GPL
Linus Torvalds [Thu, 7 May 2026 03:44:03 +0000 (20:44 -0700)]
Merge tag 'chrome-platform-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome-platform fix from Tzung-Bi Shih:
- Fix a NULL dereference in cros_ec_typec
* tag 'chrome-platform-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_typec: Init mutex in Thunderbolt registration
The newly added OPP is inserted into the list before its kref is
initialized. A concurrent lookup can find this OPP and increment its
reference count while it is still uninitialized, leading to refcount
corruption and a potential premature free.
Fix this by initializing ->kref and ->opp_table before making the OPP
visible via list_add(). This ensures any concurrent lookup observes a
fully initialized object.
Arnd Bergmann [Tue, 5 May 2026 18:04:59 +0000 (20:04 +0200)]
w5100: remove unused gpio link detection
Since the platform_device support is now gone, nothing ever passes a
valid gpio number, and all the link state handling can go away.
An earlier version of my patch changed this to look up the GPIO descriptor
from devicetree and convert it all to the modern interface, but there
are no users of that binding at the moment.
Remove the gpio handling, which is now one of the last users of the
legacy gpio interface in platform-independent code.
Arnd Bergmann [Tue, 5 May 2026 18:04:58 +0000 (20:04 +0200)]
w5300: remove unused driver
Unlike w5100, this driver does not support SPI mode or devicetree
bindings, and is hence entirely unusable without third-party board
support patches that likely haven't existed for any recent kernel
version.
Remove the entire driver.
If anyone is in fact using it with their custom board files, they
can bring it back and include an earlier patch I sent to add
DT based probing for the GPIO lines.
Arnd Bergmann [Tue, 5 May 2026 18:04:57 +0000 (20:04 +0200)]
w5100: remove MMIO support
This driver supports both SPI and MMIO based register access, but only
the former has devicetree support. While MMIO mode would have worked
with old-style board files, those have never defined such a device
upstream.
Remove the MMIO mode, leaving SPI as the only way to use this driver,
but leave it in two loadable modules. More cleanups can be done by
combining the two into one file.
====================
net/mlx5: Improve representor lifecycle and late IB representor loading
This series addresses two problems that have been present for years, and
fixes one representor reload error-unwind case exposed while making the
reload path reusable.
First, there is no coordination between E-Switch reconfiguration and
representor registration. The E-Switch can be mid-way through a mode
change or VF count update while mlx5_ib walks in and registers or
unregisters representors. Nothing stops them. The race window is small
and there is no field report, but it is clearly wrong.
Second, loading mlx5_ib while the device is already in switchdev mode
does not bring up the IB representors. mlx5_eswitch_register_vport_reps()
only stores callbacks; nobody triggers the actual load after registration.
The series fixes the registration race with a per-E-Switch representor
mutex. The lock is introduced first, then LAG shared-FDB and multiport
E-Switch transitions are adjusted so auxiliary device rescans and IB
representor reloads do not hold ldev->lock while taking the representor
lock. This keeps the intermediate commits bisectable before the stricter
E-Switch serialization and lock assertions are enabled.
After the LAG ordering is fixed, all E-Switch reconfiguration paths that
create, destroy, load, or unload representors take the representor mutex.
esw_mode_change() deliberately drops the mutex around
mlx5_rescan_drivers_locked(), because auxiliary probe and remove paths
re-enter mlx5_eswitch_register_vport_reps() and
mlx5_eswitch_unregister_vport_reps() on the same thread.
The shared-FDB peer IB registration path can hold one E-Switch
representor mutex and then register peer representor ops on another
E-Switch. The series annotates that case as nested locking so lockdep can
distinguish it from recursive locking on the same E-Switch.
For the missing IB representors, mlx5_eswitch_register_vport_reps() queues
a work item that acquires the devlink lock and loads all relevant
representors. This is the change that actually fixes the long-standing
bug.
The reload path also learns to track which representor types were loaded by
the current attempt, so an error does not unload representors that were
already active before the retry.
Patch 1 is cleanup. LAG and MPESW had the same representor reload
sequence duplicated in several places and the copies had started to
drift. This consolidates them into one helper.
Patch 3 adds the per-E-Switch representor lifecycle lock and helper APIs.
Patch 4 adjusts the LAG shared-FDB and multiport E-Switch transitions so
auxiliary device rescans and IB representor reloads run without
ldev->lock held while taking the representor lock.
Patch 5 protects the E-Switch reconfiguration, representor registration
and peer IB representor paths with the representor lock.
Patch 6 fixes representor load error unwind so only representor types
loaded by the current attempt are unloaded on failure.
Patch 7 moves the representor load triggered by
mlx5_eswitch_register_vport_reps() onto the work queue. This is the patch
that fixes IB representors not coming up when mlx5_ib is loaded while the
device is already in switchdev mode.
====================
Mark Bloch [Sun, 3 May 2026 20:27:26 +0000 (23:27 +0300)]
net/mlx5: E-Switch, load reps via work queue after registration
mlx5_eswitch_register_vport_reps() only installs representor callbacks and
marks the rep type as registered. If the E-Switch is already in switchdev
mode, the newly registered rep type must then be loaded for already enabled
vports.
That load path needs to run under the devlink lock, which is not held by
the auxiliary driver registration context. Queue the reload to the E-Switch
workqueue, whose handler acquires the devlink lock, and load the relevant
representors from there.
Since representor registration runs from sleepable auxiliary-driver
context, queue the late reload with GFP_KERNEL. The functions-change
notifier path remains the GFP_ATOMIC user of mlx5_esw_add_work().
The unregister path is unchanged and still unloads representors
synchronously while tearing down the registered callbacks.
Mark Bloch [Sun, 3 May 2026 20:27:25 +0000 (23:27 +0300)]
net/mlx5: E-Switch, unwind only newly loaded representor types
__esw_offloads_load_rep() may return success without invoking the
representor load callback when the representor type is already loaded.
On a later load failure, mlx5_esw_offloads_rep_load() unconditionally
unloaded all previously iterated representor types. This could unload
representor types that were already loaded before this load attempt.
Track which representor types were actually loaded by the current call and
unwind only those on error. Also restore the representor state back to
REP_REGISTERED when the load callback itself fails.
Representor callbacks can be registered and unregistered while the
E-Switch is already in switchdev mode, and the same E-Switch may also be
reconfigured by devlink, VF changes and SF changes. Serialize these paths
with the per-E-Switch representor mutex instead of relying on ad-hoc bit
state and wait queues.
Take the representor lock around the mode transition, VF/SF representor
changes and representor ops registration. Keep mode_lock and the
representor lock unnested by using the operation flag while the mode lock
is dropped. During mode changes, drop the representor lock around the
auxiliary bus rescan because driver bind/unbind may register or unregister
representor ops.
Split representor ops registration into locked public wrappers and blocked
internal helpers, clear the ops pointer on unregister, and add nested
wrappers for the shared-FDB master IB path that registers peer
representor ops while another E-Switch representor lock is already held.
On unregister, always call __unload_reps_all_vport() before marking reps
unregistered and clearing rep_ops. The per-representor state check makes
this a no-op for types that were not loaded, so unregister no longer has
to infer load state from esw->mode.
Mark Bloch [Sun, 3 May 2026 20:27:23 +0000 (23:27 +0300)]
net/mlx5: Lag, avoid LAG and representor lock cycles
The LAG shared-FDB and multiport E-Switch transitions rescan auxiliary
devices and reload IB representors while holding ldev->lock. Driver
bind/unbind paths may register or unregister E-Switch representor ops, and
representor load paths may enter LAG code, so holding ldev->lock across
those calls creates lock-order cycles with the E-Switch representor lock.
Keep the devcom component locked for the transition, but drop ldev->lock
before rescanning auxiliary devices or reloading IB representors. Mark the
LAG transition as in progress while the lock is dropped and assert the
devcom lock where the helper relies on it. This preserves LAG serialization
while avoiding ldev->lock nesting under E-Switch representor registration.
Add a per-E-Switch mutex for serializing representor lifecycle work and
provide small helpers for taking and dropping it. Initialize and destroy
the mutex with the E-Switch offloads state.
Add the lock and helper API first. Follow-up patches will take the lock in
the individual representor lifecycle components. This keeps the functional
changes split by component and leaves this patch without intended behavior
change, making the series easier to review and bisectable.
Mark Bloch [Sun, 3 May 2026 20:27:21 +0000 (23:27 +0300)]
net/mlx5: E-Switch, let esw work callers choose GFP flags
mlx5_esw_add_work() always allocates the queued work item with
GFP_ATOMIC. That is required for the E-Switch functions-change notifier,
but not every caller of this helper will run from atomic context.
Pass an allocation flag to mlx5_esw_add_work() and keep the notifier
caller using GFP_ATOMIC. This allows sleepable callers to use GFP_KERNEL
instead of unnecessarily relying on atomic reserves.
Representor reload during LAG/MPESW transitions has to be repeated in
several flows, and each open-coded loop was easy to get out of sync
when adding new flags or tweaking error handling. Move the sequencing
into a single helper so that all call sites share the same ordering
and checks.
====================
r8152: Add support for the RTL8159 10Gbit USB Ethernet chip
Add support for the RTL8159, which is a 10GBit USB-Ethernet adapter
chip in the RTL815x family of chips.
The RTL8159 re-uses the frame descriptor format and SRAM2 access introduced
with the RTL8157 as well as most of the setup and PM logic of the RTL8157.
The module was tested with a Lekuo DR59R11 USB-C 10GbE Ethernet Adapter:
[ 2502.906947] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
[ 2502.927859] usb 2-1: New USB device found, idVendor=0bda, idProduct=815a, bcdDevice=30.00
[ 2502.927867] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=7
[ 2502.927871] usb 2-1: Product: USB 10/100/1G/2.5G/5G/10G LAN
[ 2502.927873] usb 2-1: Manufacturer: Realtek
[ 2502.927875] usb 2-1: SerialNumber: 000388C9B3B5XXXX
[ 2503.063745] r8152-cfgselector 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[ 2503.123876] r8152 2-1:1.0: Requesting firmware: rtl_nic/rtl8159-1.fw
[ 2503.126267] r8152 2-1:1.0: PHY firmware installed 0 to be loaded: 20
[ 2503.156265] r8152 2-1:1.0: load rtl8159-1 v1 2026/01/01 successfully
[ 2503.270729] r8152 2-1:1.0 eth0: v1.12.13
[ 2503.289349] r8152 2-1:1.0 enx88c9b3b5xxxx: renamed from eth0
[ 2507.777055] r8152 2-1:1.0 enx88c9b3b5xxxx: carrier on
The RTL8159 adapter was tested against an AQC107 PCIe-card supporting
10GBit/s and an RTL8157 5Gbit USB-Ethernet adapter supporting 5GBit/s for
performance, link speed and EEE negotiation. Using USB3.2 Gen 2 (20GBit) with
the RTL8159 USB adapter and running iperf3 against the AQC107 PCIe
card resulted in 8.96 Gbits/sec transfer speed.
The code is based on the out-of-tree r8152 driver published by Realtek under
the GPL.
The RTL8159 requires firmware for the PHY in order to achieve a 10GBit link
speed. Without firmware, only 5GBit were achieved. The firmware can be
extracted from the out-of-tree r8152 driver-code where it is stored in the
ram17 u8-array. Code is added to use the existing firmware upload mechanism
of the driver for the RTL8157/9 PHY firmware code. The firmware will be
submitted separately to linux-firmware.
====================
Birger Koblitz [Tue, 5 May 2026 15:56:35 +0000 (17:56 +0200)]
r8152: Add firmware upload capability for RTL8157/RTL8159
The RTL8159 (RTL_VER_17) requires firmware for its PHY in order to work
at connection speeds > 5GBit. Add support for uploading firmware for
the PHY using the existing rtl8152_apply_firmware() function
in r8157_hw_phy_cfg() and set up the correct names for the firmware
files.
This also adds support for uploading firmware for the RTL8157
(RTL_VER_16) PHY, for which firmware is however not strictly necessary
to work. Still, this allows to upload newer versions of the firmware used
by this chip, e.g. to improve interoperability.
If no firmware is found, both the RTL8157 and the RTL8159 will continue
to work.
Birger Koblitz [Tue, 5 May 2026 15:56:34 +0000 (17:56 +0200)]
r8152: Add support for the RTL8159 chip
The RTL8159 re-uses the packet descriptor format introduced with the
RTL8157 and other hardware features of the RTL8157 (RTL_VER_16) such
as the SRAM access. The support therefore consists in expanding the
existing RTL8157 code for initialization and USB power management
to also be used for the RTL8159 (RTL_VER_17).
Most of the additional code is added in r8157_hw_phy_cfg() to configure
the RTL8159 PHY.
Add support for the USB device ID of Realtek RTL8159-based adapters,
for which the product ID is 0x815a. Detect the RTL8159 as RTL_VER_17
and set it up.
Birger Koblitz [Tue, 5 May 2026 15:56:33 +0000 (17:56 +0200)]
r8152: Add support for 10Gbit Link Speeds and EEE
The RTL8159 supports 10GBit Link speeds. Add support for this speed
in the setup and setting/getting through ethtool. Also add 10GBit EEE.
Add functionality for setup and ethtool get/set methods.
In gmac_rx() (drivers/net/ethernet/cortina/gemini.c), when
gmac_get_queue_page() returns NULL for the second page of a multi-page
fragment, the driver logs an error and continues — but does not free the
partially assembled skb that was being assembled via napi_build_skb() /
napi_get_frags().
Free the in-progress partially assembled skb via napi_free_frags()
and increase the number of dropped frames appropriately
and assign the skb pointer NULL to make sure it is not lingering
around, matching the pattern already used elsewhere in the driver.
Fixes: 4d5ae32f5e1e ("net: ethernet: Add a driver for Gemini gigabit ethernet") Signed-off-by: Andreas Haarmann-Thiemann <eitschman@nebelreich.de> Signed-off-by: Linus Walleij <linusw@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260505-gemini-ethernet-fix-v2-1-997c31d06079@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 7 May 2026 01:39:00 +0000 (18:39 -0700)]
Merge branch 'net-mlx5e-report-more-netdev-stats'
Tariq Toukan says:
====================
net/mlx5e: Report more netdev stats
This series by Gal extends the set of counters reported in netdev stats,
by adding:
- hw_gso_packets/bytes
- RX HW-GRO stats
- TX csum_none
- TX queue stop/wake
It also aligns the tso_bytes/tso_inner_bytes counters with the netdev
stats API and virtio spec definition.
====================
Gal Pressman [Mon, 4 May 2026 18:37:02 +0000 (21:37 +0300)]
net/mlx5e: Report RX HW-GRO netdev stats
Report RX hardware GRO statistics via the netdev queue stats API by
mapping the existing gro_packets, gro_bytes and gro_skbs counters to the
hw_gro_wire_packets, hw_gro_wire_bytes and hw_gro_packets fields.
Gal Pressman [Mon, 4 May 2026 18:37:00 +0000 (21:37 +0300)]
net/mlx5e: Count full skb length in TSO byte counters
The tso_bytes and tso_inner_bytes counters currently subtract the header
length from skb->len, counting only the payload. This is confusing and
doesn't align with the behavior of other _bytes counters in the driver.
Report the full skb length to align with this expectation.
This also makes our behavior consistent with the netdev stats API and
virtio spec definition.
Randy Dunlap [Wed, 6 May 2026 17:51:44 +0000 (10:51 -0700)]
spi: s3c64xx: fix all kernel-doc warnings
Add kernel-doc for one struct member and use the correct function name
to eliminate kernel-doc warnings:
Warning: include/linux/platform_data/spi-s3c64xx.h:40 struct member
'polling' not described in 's3c64xx_spi_info'
Warning: include/linux/platform_data/spi-s3c64xx.h:51 expecting prototype
for s3c64xx_spi_set_platdata(). Prototype was for
s3c64xx_spi0_set_platdata() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org> Link: https://patch.msgid.link/20260506175144.449364-1-rdunlap@infradead.org Signed-off-by: Mark Brown <broonie@kernel.org>
selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl
When pm_netlink.sh is executed with '-i', 'ip mptcp' is used instead of
'pm_nl_ctl'. IPRoute2 doesn't support the 'unknown' flag, which has only
been added to 'pm_nl_ctl' for this specific check: to ensure that the
kernel ignores such unsupported flag.
No reason to add this flag to 'ip mptcp'. Then, this check should be
skipped when 'ip mptcp' is used.
Using '${?}' inside the if-statement to check the returned value from
the command that was evaluated as part of the if-statement is not
correct: here, '${?}' will be linked to the previous instruction, not
the one that is expected here (${cmd}).
Instead, simply mark the error, except if an error is expected. If
that's the case, 1 can be passed as the 4th argument of this helper.
Three checks from pm_netlink.sh expect an error.
While at it, improve the error message when the command unexpectedly
fails or succeeds.
Note that we could expect a specific returned value, but the checks
currently expecting an error can be used with 'ip mptcp' or 'pm_nl_ctl',
and these two tools don't return the same error code.
When looking at the maximum RTO amongst the subflows, inactive subflows
were taken into account: that includes stale ones, and the initial one
if it has been already been closed.
Unusable subflows are now simply skipped. Stale ones are used as an
alternative: if there are only stale ones, to take their maximum RTO and
avoid to eventually fallback to net.mptcp.add_addr_timeout, which is set
to 2 minutes by default.
Fixes: 30549eebc4d8 ("mptcp: make ADD_ADDR retransmission timeout adaptive") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260505-net-mptcp-pm-fixes-7-1-rc3-v1-7-fca8091060a4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When an ADD_ADDR needs to be retransmitted and another one has already
been prepared -- e.g. multiple ADD_ADDRs have been sent in a row and
need to be retransmitted later -- this additional retransmission will
need to wait.
In this case, the timer was reset to TCP_RTO_MAX / 8, which is ~15
seconds. This delay is unnecessary long: it should just be rescheduled
at the next opportunity, e.g. after the retransmission timeout.
Without this modification, some issues can be seen from time to time in
the selftests when multiple ADD_ADDRs are sent, and the host takes time
to process them, e.g. the "signal addresses, ADD_ADDR timeout" MPTCP
Join selftest, especially with a debug kernel config.
Note that on older kernels, 'timeout' is not available. It should be
enough to replace it by one second (HZ).
When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer(),
and released at the end.
If at that moment, it was the last reference being held, the sk would
not be freed. sock_put() should then be called instead of __sock_put().
But that's not enough: if it is the last reference, sock_put() will call
sk_free(), which will end up calling sk_stop_timer_sync() on the same
timer, and waiting indefinitely to finish. So it is needed to mark that
the timer is done at the end of the timer handler when it has not been
rescheduled, not to call sk_stop_timer_sync() on "itself".
mptcp: pm: ADD_ADDR rtx: always decrease sk refcount
When an ADD_ADDR is retransmitted, the sk is held in sk_reset_timer().
It should then be released in all cases at the end.
Some (unlikely) checks were returning directly instead of calling
sock_put() to decrease the refcount. Jump to a new 'exit' label to call
__sock_put() (which will become sock_put() in the next commit) to fix
this potential leak.
While at it, drop the '!msk' check which cannot happen because it is
never reset, and explicitly mark the remaining one as "unlikely".
This mptcp_pm_add_timer() helper is executed as a timer callback in
softirq context. To avoid any data races, the socket lock needs to be
held with bh_lock_sock().
If the socket is in use, retry again soon after, similar to what is done
with the keepalive timer.
ADD_ADDR can be sent for the ID 0, which corresponds to the local
address and port linked to the initial subflow.
Indeed, this address could be removed, and re-added later on, e.g. what
is done in the "delete re-add signal" MPTCP Join selftests. So no reason
to ignore it.
mptcp: pm: kernel: correctly retransmit ADD_ADDR ID 0
When adding the ADD_ADDR to the list, the address including the IP, port
and ID are copied. On the other hand, when the endpoint corresponds to
the one from the initial subflow, the ID is set to 0, as specified by
the MPTCP protocol.
The issue is that the ID was reset after having copied the ID in the
ADD_ADDR entry. So the retransmission was done, but using a different ID
than the initial one.
Eric Dumazet [Tue, 5 May 2026 13:32:33 +0000 (13:32 +0000)]
inetpeer: add a missing read_seqretry() in inet_getpeer()
When performing a lockless lookup over the inet_peer rbtree,
if a matching node is found, inet_getpeer() returns it immediately
without validating the seqlock sequence.
This missing check introduces a race condition:
Trigger Path: When a host receives an incoming fragmented IPv4 packet,
ip4_frag_init() (in net/ipv4/ip_fragment.c) calls inet_getpeer_v4()
to track the peer.
The Race: If the packet is from a new source IP, CPU A acquires the
write_seqlock, allocates a new inet_peer node (p), sets its IP address
(daddr), and links it to the rbtree (rb_link_node).
Uninitialized Access: Due to the lack of memory barriers between
rb_link_node and the initialization of the rest of the struct
(like refcount_set(&p->refcnt, 1)), CPU A can make the node visible
to readers before its refcnt is initialized.
This is especially true on weakly-ordered architectures like ARM64
where the CPU can reorder the memory stores.
Lockless Reader: Concurrently, CPU B processes a second fragmented packet
from the same source IP. CPU B does a lockless lookup, finds the newly
inserted node, and returns it immediately.
Use-After-Free (UAF): CPU B reads p->refcnt as uninitialized garbage
(left over from previous kmalloc-128/192 allocations).
If the garbage is > 0, refcount_inc_not_zero(&p->refcnt) succeeds.
CPU A then executes refcount_set(&p->refcnt, 1), overwriting CPU B's increment.
When CPU B finishes with the fragment queue, it calls inet_putpeer(),
which drops the refcount to 0 and frees the node via RCU.
The node is now freed but remains linked in the rbtree,
resulting in a Use-After-Free in the rbtree.
Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260505133233.3039575-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
netdevsim: psp: fix init and uninit bugs
This series has three fixes. The first is a straightforward NULL
pointer dereference that is reachable by creating and destroying some
vfs on a kernel with INET_PSP enabled.
The last two patches deal with nsim_psp_rereg_write(), which is a
debugfs handler that reregisters netdevsim's psp_dev without
aquiescing and disabling tx/rx processing. This was added to enable
some tests in psp.py where a psp device is unregistered while it still
referenced by tcp socket state.
There are two issues with this code:
1. Calls to nsim_psp_uninit() are not properly serialized
2. netdevsim's psp_dev refcount can be released while nsim_do_psp() is
reading from it.
====================
Daniel Zahka [Tue, 5 May 2026 10:42:25 +0000 (03:42 -0700)]
netdevsim: psp: rcu protect psp_dev reference
There are two issues with the way psp_dev is used in nsim_do_psp():
1. There is no check for IS_ERR() on the peers psp_dev, before
dereferencing.
2. The refcount on this psp_dev can be dropped by
nsim_psp_rereg_write()
To fix this, we can make netdevsim's reference to its psp_dev an rcu
reference, and then nsim_do_psp() can read the fields it needs from an
rcu critical section.
Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation") Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260505-psd-rcu-v1-3-a8f69ec1ab96@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Tue, 5 May 2026 10:42:24 +0000 (03:42 -0700)]
netdevsim: psp: serialize calls to nsim_psp_uninit()
The debugfs write handler, nsim_psp_rereg_write(), can race against
nsim_destroy() and against itself, causing nsim_psp_uninit() to run
more than once concurrently. Two complementary changes serialize all
callers:
1. Delete the psp_rereg debugfs file from nsim_psp_uninit() before
doing the actual teardown. debugfs_remove() drains any in-flight
writers and prevents new ones from starting.
2. Add a mutex around the body of nsim_psp_rereg_write() so that two
concurrent userspace writers cannot both enter the teardown path
at once.
The teardown work itself is moved into a new __nsim_psp_uninit() that
the rereg handler calls under the mutex, while the public
nsim_psp_uninit() wraps it with the debugfs_remove()/mutex_destroy()
pair so nsim_destroy() doesn't have to know about the psp internals.
Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation") Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260505-psd-rcu-v1-2-a8f69ec1ab96@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Tue, 5 May 2026 10:42:23 +0000 (03:42 -0700)]
netdevsim: psp: only call nsim_psp_uninit() on PFs
VFs go through nsim_init_netdevsim_vf() which never calls
nsim_psp_init(), so ns->psp.dev stays NULL. nsim_psp_uninit() guards
with !IS_ERR(ns->psp.dev), so destroying a VF reaches
psp_dev_unregister(NULL) and dereferences NULL on the first
mutex_lock(&psd->lock):
Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation") Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260505-psd-rcu-v1-1-a8f69ec1ab96@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Justin Lai [Tue, 5 May 2026 06:41:21 +0000 (14:41 +0800)]
rtase: Fix flow control configuration
The hardware has two sets of registers controlling TX/RX flow control.
The effective flow control state is determined by the logical OR of
these two sets of bits.
RTASE_FORCE_TXFLOW_EN and RTASE_FORCE_RXFLOW_EN in RTASE_CPLUS_CMD are
the bits used by the driver to control TX/RX flow control according to
the ethtool pause configuration.
RTASE_TXFLOW_EN and RTASE_RXFLOW_EN in RTASE_GPHY_STD_00 are another
set of TX/RX flow control enable bits. Clear them by default so they do
not keep flow control enabled independently of the driver setting.
With the RTASE_GPHY_STD_00 bits cleared, the effective flow control
state is controlled through RTASE_CPLUS_CMD, so the ethtool setting can
take effect correctly.
1. Fix an IPv6 encapsulation error path that leaked route references
when UDPv6 ESP decapsulation resolved to an error route.
From Yilin Zhu.
2. Fix AH with ESN on async crypto paths by accounting for the extra
high-order sequence number when reconstructing the temporary
authentication layout in the completion callbacks.
From Michael Bomarito.
3. Fix XFRM output so it does not overwrite already-correct inner header
pointers when a tunnel layer such as VXLAN has already saved them.
The fix comes with new selftests. From Cosmin Ratiu.
4. Add the missing native payload size entry for XFRM_MSG_MAPPING in the
compat translation path. From Ruijie Li.
5. Harden __xfrm_state_delete() against repeated or inconsistent unhashing
of state list nodes by keying the removal on actual list membership and
using delete-and-init helpers. From Michal Kosiorek.
6. Prevent ESP from decrypting shared splice-backed skb fragments in place
by marking UDP splice frags as shared and forcing copy-on-write in ESP
input when needed. From Kuan-Ting Chen.
* tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: esp: avoid in-place decrypt on shared skb frags
xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete
xfrm: provide message size for XFRM_MSG_MAPPING
xfrm: Don't clobber inner headers when already set
tools/selftests: Add a VXLAN+IPsec traffic test
tools/selftests: Use a sensible timeout value for iperf3 client
xfrm: ah: account for ESN high bits in async callbacks
ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()
====================
Validate that the target of AVTAB_TYPE rules and file transitions are
simple types and not attributes.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com>
[PM: merge fuzz, dropped parts due to dependencies] Signed-off-by: Paul Moore <paul@paul-moore.com>
Validate the types used in bounds checks.
Replace the usage of BUG(), to avoid halting the system on malformed
polices.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com>
[PM: merge fuzz, fixed typo identified by Smalley] Signed-off-by: Paul Moore <paul@paul-moore.com>