Theodore Ts'o [Wed, 29 Jun 2022 04:00:26 +0000 (00:00 -0400)]
ext4: update the s_overhead_clusters in the backup sb's when resizing
When the EXT4_IOC_RESIZE_FS ioctl is complete, update the backup
superblocks. We don't do this for the old-style resize ioctls since
they are quite ancient, and only used by very old versions of
resize2fs --- and we don't want to update the backup superblocks every
time EXT4_IOC_GROUP_ADD is called, since it might get called a lot.
Theodore Ts'o [Wed, 29 Jun 2022 04:00:25 +0000 (00:00 -0400)]
ext4: update s_overhead_clusters in the superblock during an on-line resize
When doing an online resize, the on-disk superblock on-disk wasn't
updated. This means that when the file system is unmounted and
remounted, and the on-disk overhead value is non-zero, this would
result in the results of statfs(2) to be incorrect.
This was partially fixed by Commits 10b01ee92df5 ("ext4: fix overhead
calculation to account for the reserved gdt blocks"), 85d825dbf489
("ext4: force overhead calculation if the s_overhead_cluster makes no
sense"), and eb7054212eac ("ext4: update the cached overhead value in
the superblock").
However, since it was too expensive to forcibly recalculate the
overhead for bigalloc file systems at every mount, this didn't fix the
problem for bigalloc file systems. This commit should address the
problem when resizing file systems with the bigalloc feature enabled.
Zhang Yi [Thu, 30 Jun 2022 09:01:00 +0000 (17:01 +0800)]
ext4: fix reading leftover inlined symlinks
Since commit 6493792d3299 ("ext4: convert symlink external data block
mapping to bdev"), create new symlink with inline_data is not supported,
but it missing to handle the leftover inlined symlinks, which could
cause below error message and fail to read symlink.
ls: cannot read symbolic link 'foo': Structure needs cleaning
EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block 2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080
(length 1)
Fix this regression by adding ext4_read_inline_link(), which read the
inline data directly and convert it through a kmalloced buffer.
UBLK_IO_NEED_GET_DATA is one ublk IO command. It is designed for a user
application who wants to allocate IO buffer and set IO buffer address
only after it receives an IO request from ublksrv. This is a reasonable
scenario because these users may use a RPC framework as one IO backend
to handle IO requests passed from ublksrv. And a RPC framework may
allocate its own buffer(or memory pool).
This new feature (UBLK_F_NEED_GET_DATA) is optional for ublk users.
Related userspace code has been added in ublksrv[1] as one pull request.
Test cases for this feature are added in ublksrv and all the tests pass.
The performance result shows that this new feature does bring additional
latency because one IO is issued back to ublk_drv once again to copy data
from bio vectors to user-provided data buffer. UBLK_IO_NEED_GET_DATA is
suitable for bigger block size such as 512B or 1MB.
ublk_cmd.h: add one new ublk command: UBLK_IO_NEED_GET_DATA
Add one new ublk command: UBLK_IO_NEED_GET_DATA. It is prepared for a new
feature designed for a user application who wants to allocate IO buffer
and set IO buffer address only after it receives an IO request from
ublksrv.
Ming Lei [Sat, 30 Jul 2022 09:27:49 +0000 (17:27 +0800)]
ublk_drv: add SET_PARAMS/GET_PARAMS control command
Add two commands to set/get parameters generically.
One important goal of ublk is to provide generic framework for making
block device by userspace flexibly.
As one generic block device, there are still lots of block parameters,
such as max_sectors, write_cache/fua, discard related limits,
zoned parameters, ...., so this patch starts to add generic mechanism
for set/get device parameters.
Both generic block parameters(all kinds of queue settings) and ublk
feature parameters can be covered with this way, then it becomes quite
easy to extend in future.
Add two parameter types are used so far: basic(covers basic queue setting
and misc settings which can't be grouped easily) and discard, basic type
must be set, and discard type becomes optional now
This way provides mechanism to simulate any kind of generic block device
from userspace easily, from both block queue setting viewpoint or ublk
feature viewpoint.
The style of putting all parameters together is suggested by Christoph.
Ming Lei [Sat, 30 Jul 2022 09:27:47 +0000 (17:27 +0800)]
ublk_drv: cancel device even though disk isn't up
Each ublk queue is started before adding disk, we have to cancel queues in
ublk_stop_dev() so that ubq daemon can be exited, otherwise DEL_DEV command
may hang forever.
Also avoid to cancel queues two times by checking if queue is ready,
otherwise use-after-free on io_uring may be triggered because ublk_stop_dev
is called by ublk_remove() too.
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220730092750.1118167-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Tue, 12 Jul 2022 15:32:56 +0000 (08:32 -0700)]
block: fix leaking page ref on truncated direct io
The size being added to a bio from an iov is aligned to a block size
after the pages were gotten. If the new aligned size truncates the last
page, its reference was being leaked. Ensure all pages that were not
added to the bio have their reference released.
Since this essentially requires doing the same that bio_put_pages(), and
there was only one caller for that function, this patch makes the
put_page() loop common for everyone.
Fixes: b1a000d3b8ec5 ("block: relax direct io memory alignment") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220712153256.2202024-3-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Tue, 12 Jul 2022 15:32:55 +0000 (08:32 -0700)]
block: ensure bio_iov_add_page can't fail
Adding the page could fail on the bio_full() condition, which checks for
either exceeding the bio's max segments or total size exceeding
UINT_MAX. We already ensure the max segments can't be exceeded, so just
ensure the total size won't reach the limit. This simplifies error
handling and removes unnecessary repeated bio_full() checks.
Keith Busch [Tue, 12 Jul 2022 15:32:54 +0000 (08:32 -0700)]
block: ensure iov_iter advances for added pages
There are cases where a bio may not accept additional pages, and the iov
needs to advance to the last data length that was accepted. The zone
append used to handle this correctly, but was inadvertently broken when
the setup was made common with the normal r/w case.
Fixes: 576ed9135489c ("block: use bio_add_page in bio_iov_iter_get_pages") Fixes: c58c0074c54c2 ("block/bio: remove duplicate append pages code") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20220712153256.2202024-1-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
In line 2884, "raid5_release_stripe(sh);" drops the reference to sh and
may cause sh to be released. However, sh is subsequently used in lines
2886 "if (sh->batch_head && sh != sh->batch_head)". This may result in an
use-after-free bug.
It can be fixed by moving "raid5_release_stripe(sh);" to the bottom of
the function.
Signed-off-by: Wentao_Liang <Wentao_Liang_g@163.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid5: Ensure batch_last is released before sleeping for quiesce
A race condition exists where if raid5_quiesce() is called in the
middle of a request that has set batch_last, it will deadlock.
batch_last will hold a reference to a stripe when raid5_quiesce() is
called. This will cause the next raid5_get_active_stripe() call to
sleep waiting for the quiesce to finish, but the raid5_quiesce() thread
will wait for active_stripes to go to zero which will never happen
because request thread is waiting for the quiesce to stop.
Fix this by creating a special __raid5_get_active_stripe() function
which takes the request context and clears the last_batch before
sleeping.
While we're at it, change the arguments of raid5_get_active_stripe()
to bools.
Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch") Reported-by: David Sloan <David.Sloan@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid5: Drop unnecessary call to r5c_check_stripe_cache_usage()
Now that raid5_get_active_stripe() has been refactored it is appearant
that r5c_check_stripe_cache_usage() doesn't need to be called in
the wait_for_stripe branch.
r5c_check_stripe_cache_usage() will only conditionally call
r5l_wake_reclaim(), but that function is called two lines later.
Drop the call for cleanup.
Reported-by: Martin Oliveira <martin.oliveira@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The logic to wait_for_stripe is difficult to parse being on so many
lines and with confusing operator precedence. Move it to a helper
function to make it easier to read.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refactor the raid5_get_active_stripe() to read more linearly in
the order it's typically executed.
The init_stripe() call is called if a free stripe is found and the
function is exited early which removes a lot of if (sh) checks and
unindents the following code.
Remove the while loop in favour of the 'goto retry' pattern, which
reduces indentation further. And use a 'goto wait_for_stripe' instead
of an additional indent seeing it is the unusual path and this makes
the code easier to read.
No functional changes intended. Will make subsequent changes
in patches easier to understand.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: pass struct queue_limits to the bio splitting helpers
Allow using the splitting helpers on just a queue_limits instead of
a full request_queue structure. This will eventually allow file systems
or remapping drivers to split REQ_OP_ZONE_APPEND bios based on limits
calculated as the minimum common capabilities over multiple devices.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: move bio_allowed_max_sectors to blk-merge.c
Move this helper into the only file where it is used.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: move the call to get_max_io_size out of blk_bio_segment_split
Prepare for reusing blk_bio_segment_split for (file system controlled)
splits of REQ_OP_ZONE_APPEND bios by letting the caller control the
maximum size of the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only non-passthrough requests are split by the block layer and use the
->bio_split bio_set. Move it from the request_queue to the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Wed, 3 Aug 2022 02:50:47 +0000 (19:50 -0700)]
Merge tag 'flexible-array-transformations-UAPI-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull uapi flexible array update from Gustavo Silva:
"A treewide patch that replaces zero-length arrays with flexible-array
members in UAPI. This has been baking in linux-next for 5 weeks now.
'-fstrict-flex-arrays=3' is coming and we need to land these changes
to prevent issues like these in the short future:
fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination buffer has size 0, but the source string has length 2 (including NUL byte) [-Wfortify-source]
strcpy(de3->name, ".");
^
Since these are all [0] to [] changes, the risk to UAPI is nearly
zero. If this breaks anything, we can use a union with a new member
name"
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101836
* tag 'flexible-array-transformations-UAPI-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
treewide: uapi: Replace zero-length arrays with flexible-array members
Linus Torvalds [Wed, 3 Aug 2022 02:44:56 +0000 (19:44 -0700)]
Merge tag 'linux-kselftest-next-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest updates from Shuah Khan:
- timers test build fixes and cleanups for new tool chains
- removing khdr from kselftest framework and main Makefile
- changes to test output messages to improve reports
* tag 'linux-kselftest-next-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (24 commits)
Makefile: replace headers_install with headers for kselftest
selftests/landlock: drop deprecated headers dependency
selftests: timers: clocksource-switch: adapt to kselftest framework
selftests: timers: clocksource-switch: add 'runtime' command line parameter
selftests: timers: clocksource-switch: add command line switch to skip sanity check
selftests: timers: clocksource-switch: sort includes
selftests: timers: clocksource-switch: fix passing errors from child
selftests: timers: inconsistency-check: adapt to kselftest framework
selftests: timers: nanosleep: adapt to kselftest framework
selftests: timers: fix declarations of main()
selftests: timers: valid-adjtimex: build fix for newer toolchains
Makefile: add headers_install to kselftest targets
selftests: drop KSFT_KHDR_INSTALL make target
selftests: stop using KSFT_KHDR_INSTALL
selftests: drop khdr make target
selftests: drivers/dma-buf: Improve message in selftest summary
selftests/kcmp: Make the test output consistent and clear
selftests:timers: globals don't need initialization to 0
selftests/drivers/gpu: Add error messages to drm_mm.sh
selftests/tpm2: increase timeout for kselftests
...
Linus Torvalds [Wed, 3 Aug 2022 02:34:45 +0000 (19:34 -0700)]
Merge tag 'linux-kselftest-kunit-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
"This consists of several fixes and an important feature to discourage
running KUnit tests on production systems. Running tests on a
production system could leave the system in a bad state.
Summary:
- Add a new taint type, TAINT_TEST to signal that a test has been
run.
This should discourage people from running these tests on
production systems, and to make it easier to tell if tests have
been run accidentally (by loading the wrong configuration, etc)
- Several documentation and tool enhancements and fixes"
* tag 'linux-kselftest-kunit-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits)
Documentation: KUnit: Fix example with compilation error
Documentation: kunit: Add CLI args for kunit_tool
kcsan: test: Add a .kunitconfig to run KCSAN tests
kunit: executor: Fix a memory leak on failure in kunit_filter_tests
clk: explicitly disable CONFIG_UML_PCI_OVER_VIRTIO in .kunitconfig
mmc: sdhci-of-aspeed: test: Use kunit_test_suite() macro
nitro_enclaves: test: Use kunit_test_suite() macro
thunderbolt: test: Use kunit_test_suite() macro
kunit: flatten kunit_suite*** to kunit_suite** in .kunit_test_suites
kunit: unify module and builtin suite definitions
selftest: Taint kernel when test module loaded
module: panic: Taint the kernel when selftest modules load
Documentation: kunit: fix example run_kunit func to allow spaces in args
Documentation: kunit: Cleanup run_wrapper, fix x-ref
kunit: test.h: fix a kernel-doc markup
kunit: tool: Enable virtio/PCI by default on UML
kunit: tool: make --kunitconfig repeatable, blindly concat
kunit: add coverage_uml.config to enable GCOV on UML
kunit: tool: refactor internal kconfig handling, allow overriding
kunit: tool: introduce --qemu_args
...
Linus Torvalds [Wed, 3 Aug 2022 02:24:24 +0000 (19:24 -0700)]
Merge tag 'docs-6.0' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"This was a moderately busy cycle for documentation, but nothing
all that earth-shaking:
- More Chinese translations, and an update to the Italian
translations.
The Japanese, Korean, and traditional Chinese translations
are more-or-less unmaintained at this point, instead.
- Some build-system performance improvements.
- The removal of the archaic submitting-drivers.rst document,
with the movement of what useful material that remained into
other docs.
- Improvements to sphinx-pre-install to, hopefully, give more
useful suggestions.
- A number of build-warning fixes
Plus the usual collection of typo fixes, updates, and more"
* tag 'docs-6.0' of git://git.lwn.net/linux: (92 commits)
docs: efi-stub: Fix paths for x86 / arm stubs
Docs/zh_CN: Update the translation of sched-stats to 5.19-rc8
Docs/zh_CN: Update the translation of pci to 5.19-rc8
Docs/zh_CN: Update the translation of pci-iov-howto to 5.19-rc8
Docs/zh_CN: Update the translation of usage to 5.19-rc8
Docs/zh_CN: Update the translation of testing-overview to 5.19-rc8
Docs/zh_CN: Update the translation of sparse to 5.19-rc8
Docs/zh_CN: Update the translation of kasan to 5.19-rc8
Docs/zh_CN: Update the translation of iio_configfs to 5.19-rc8
doc:it_IT: align Italian documentation
docs: Remove spurious tag from admin-guide/mm/overcommit-accounting.rst
Documentation: process: Update email client instructions for Thunderbird
docs: ABI: correct QEMU fw_cfg spec path
doc/zh_CN: remove submitting-driver reference from docs
docs: zh_TW: align to submitting-drivers removal
docs: zh_CN: align to submitting-drivers removal
docs: ko_KR: howto: remove reference to removed submitting-drivers
docs: ja_JP: howto: remove reference to removed submitting-drivers
docs: it_IT: align to submitting-drivers removal
docs: process: remove outdated submitting-drivers.rst
...
Linus Torvalds [Wed, 3 Aug 2022 02:22:24 +0000 (19:22 -0700)]
Merge tag 'nolibc.2022.07.27a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull nolibc updates from Paul McKenney:
"This provides nolibc updates, perhaps most notably improved testing
via the 'cd tools/include/nolibc; make headers' command. This should
be considered a smoke test. More thorough testing is in the works"
* tag 'nolibc.2022.07.27a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
tools/nolibc: add a help target to list supported targets
tools/nolibc: make the default target build the headers
tools/nolibc: fix the makefile to also work as "make -C tools ..."
tools/nolibc/stdio: Add format attribute to enable printf warnings
tools/nolibc/stdlib: Support overflow checking for older compiler versions
Linus Torvalds [Wed, 3 Aug 2022 02:12:45 +0000 (19:12 -0700)]
Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes
- Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS and
Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
parameter prevents offloaded callbacks from interfering with
real-time workloads and with energy-efficiency mechanisms
- Polled grace-period updates, perhaps most notably making these APIs
account for both normal and expedited grace periods
- Tasks RCU updates, perhaps most notably reducing the CPU overhead of
RCU tasks trace grace periods by more than a factor of two on a
system with 15,000 tasks.
The reduction is expected to increase with the number of tasks, so it
seems reasonable to hypothesize that a system with 150,000 tasks
might see a 20-fold reduction in CPU overhead
- Torture-test updates
- Updates that merge RCU's dyntick-idle tracking into context tracking,
thus reducing the overhead of transitioning to kernel mode from
either idle or nohz_full userspace execution for kernels that track
context independently of RCU.
This is expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y
* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
rcu: Diagnose extended sync_rcu_do_polled_gp() loops
rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
rcutorture: Test polled expedited grace-period primitives
rcu: Add polled expedited grace-period primitives
rcutorture: Verify that polled GP API sees synchronous grace periods
rcu: Make Tiny RCU grace periods visible to polled APIs
rcu: Make polled grace-period API account for expedited grace periods
rcu: Switch polled grace-period APIs to ->gp_seq_polled
rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
rcu/nocb: Add option to opt rcuo kthreads out of RT priority
rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
rcu/nocb: Add an option to offload all CPUs on boot
rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
rcu/nocb: Add/del rdp to iterate from rcuog itself
rcu/tree: Add comment to describe GP-done condition in fqs loop
rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
rcu/kvfree: Remove useless monitor_todo flag
rcu: Cleanup RCU urgency state for offline CPU
...
Linus Torvalds [Wed, 3 Aug 2022 00:45:14 +0000 (17:45 -0700)]
Merge tag 'v5.20-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Make proc files report fips module name and version
Algorithms:
- Move generic SHA1 code into lib/crypto
- Implement Chinese Remainder Theorem for RSA
- Remove blake2s
- Add XCTR with x86/arm64 acceleration
- Add POLYVAL with x86/arm64 acceleration
- Add HCTR2
- Add ARIA
Drivers:
- Add support for new CCP/PSP device ID in ccp"
* tag 'v5.20-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (89 commits)
crypto: tcrypt - Remove the static variable initialisations to NULL
crypto: arm64/poly1305 - fix a read out-of-bound
crypto: hisilicon/zip - Use the bitmap API to allocate bitmaps
crypto: hisilicon/sec - fix auth key size error
crypto: ccree - Remove a useless dma_supported() call
crypto: ccp - Add support for new CCP/PSP device ID
crypto: inside-secure - Add missing MODULE_DEVICE_TABLE for of
crypto: hisilicon/hpre - don't use GFP_KERNEL to alloc mem during softirq
crypto: testmgr - some more fixes to RSA test vectors
cyrpto: powerpc/aes - delete the rebundant word "block" in comments
hwrng: via - Fix comment typo
crypto: twofish - Fix comment typo
crypto: rmd160 - fix Kconfig "its" grammar
crypto: keembay-ocs-ecc - Drop if with an always false condition
Documentation: qat: rewrite description
Documentation: qat: Use code block for qat sysfs example
crypto: lib - add module license to libsha1
crypto: lib - make the sha1 library optional
crypto: lib - move lib/sha1.c into lib/crypto/
crypto: fips - make proc files report fips module name and version
...
Linus Torvalds [Wed, 3 Aug 2022 00:31:35 +0000 (17:31 -0700)]
Merge tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"Though there's been a decent amount of RNG-related development during
this last cycle, not all of it is coming through this tree, as this
cycle saw a shift toward tackling early boot time seeding issues,
which took place in other trees as well.
Here's a summary of the various patches:
- The CONFIG_ARCH_RANDOM .config option and the "nordrand" boot
option have been removed, as they overlapped with the more widely
supported and more sensible options, CONFIG_RANDOM_TRUST_CPU and
"random.trust_cpu". This change allowed simplifying a bit of arch
code.
- x86's RDRAND boot time test has been made a bit more robust, with
RDRAND disabled if it's clearly producing bogus results. This would
be a tip.git commit, technically, but I took it through random.git
to avoid a large merge conflict.
- The RNG has long since mixed in a timestamp very early in boot, on
the premise that a computer that does the same things, but does so
starting at different points in wall time, could be made to still
produce a different RNG state. Unfortunately, the clock isn't set
early in boot on all systems, so now we mix in that timestamp when
the time is actually set.
- User Mode Linux now uses the host OS's getrandom() syscall to
generate a bootloader RNG seed and later on treats getrandom() as
the platform's RDRAND-like faculty.
- The arch_get_random_{seed_,}_long() family of functions is now
arch_get_random_{seed_,}_longs(), which enables certain platforms,
such as s390, to exploit considerable performance advantages from
requesting multiple CPU random numbers at once, while at the same
time compiling down to the same code as before on platforms like
x86.
- A small cleanup changing a cmpxchg() into a try_cmpxchg(), from
Uros.
- A comment spelling fix"
More info about other random number changes that come in through various
architecture trees in the full commentary in the pull request:
* tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: correct spelling of "overwrites"
random: handle archrandom with multiple longs
um: seed rng using host OS rng
random: use try_cmpxchg in _credit_init_bits
timekeeping: contribute wall clock to rng on time change
x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu"
random: remove CONFIG_ARCH_RANDOM
block: change the blk_queue_bounce calling convention
The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220727162300.3089193-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: change the blk_queue_split calling convention
The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.
Also give it and the helpers used to implement it more descriptive names.
nvmet-tcp: fix lockdep complaint on nvmet_tcp_wq flush during queue teardown
We probably need nvmet_tcp_wq to have MEM_RECLAIM as we are
sending/receiving for the socket from works on this workqueue.
Also this eliminates lockdep complaints:
--
[ 6174.010200] workqueue: WQ_MEM_RECLAIM
nvmet-wq:nvmet_tcp_release_queue_work [nvmet_tcp] is flushing
!WQ_MEM_RECLAIM nvmet_tcp_wq:nvmet_tcp_io_work [nvmet_tcp]
[ 6174.010216] WARNING: CPU: 20 PID: 14456 at kernel/workqueue.c:2628
check_flush_dependency+0x110/0x14c
Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Joel Granados [Tue, 12 Jul 2022 18:33:04 +0000 (20:33 +0200)]
nvme: enable generic interface (/dev/ngXnY) for unknown command sets
Extend nvme_alloc_ns() and nvme_validate_ns() for unknown command-set as
well. Both are made to use a new helper (nvme_update_ns_info_cs_indep)
which is similar to nvme_update_ns_info but performs fewer operations
to get the generic interface up.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
[hch: rebased on other refactoring patches] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change nvme_ns_scan to gather all information needed for generic
namespace setup into a nvme_ns_info structure. This structure is filled
from the Command Set Idependent Identify Namespace data structure if
it is available or else the legacy Identify namespace structure.
With that everything related to the NVM command set (and the ZNS command
set derived from it) can be encapsulated in the nvme_update_ns_info_block
function while keeping the rest of the namespace probing generic.
The downside is that we now always issue two Identify Namespace calls for
each probed namespace instead of usually just a single one previously.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme: generalize the nvme_multi_css check in nvme_scan_ns
Check for multiple command set support early on an error out if is
not supported when a !NVM command set namespace is found. This
prepares for adding command set independent passthrough support.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme: catch -ENODEV from nvme_revalidate_zones again
nvme_revalidate_zones can also return -ENODEV if e.g. zone sizes aren't
constant or not a power of two. In that case we should jump to marking
the gendisk hidden and only support pass through.
Fixes: 602e57c9799c ("nvme: also mark passthrough-only namespaces ready in nvme_update_ns_info") Reported-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvmet-auth: select the intended CRYPTO_DH_RFC7919_GROUPS
Commit 71ebe3842ebe ("nvmet-auth: Diffie-Hellman key exchange support")
intends to select 'Support for RFC 7919 FFDHE group parameters' for using
FFDHE groups for NVMe In-Band Authentication.
It however selects CRYPTO_DH_GROUPS_RFC7919, instead of the intended
CRYPTO_DH_RFC7919_GROUPS; notice the swapping of words here.
Correct the select to the intended config option.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split nvme_dev_add into a helper to actually allocate the tag set, and
one that just update the number of queues. Add a local variable for
the tag_set to clean up the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split nvme_alloc_admin_tags into a helper to actually allocate the
tag set, and one that just restarts the admin queue. Add a local
variable for the tag_set to clean up the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nick Bowler [Thu, 21 Jul 2022 03:57:35 +0000 (23:57 -0400)]
nvme: define compat_ioctl again to unbreak 32-bit userspace.
Commit 89b3d6e60550 ("nvme: simplify the compat ioctl handling") removed
the initialization of compat_ioctl from the nvme block_device_operations
structures.
Presumably the expectation was that 32-bit ioctls would be directed
through the regular handler but this is not the case: failing to assign
.compat_ioctl actually means that the compat case is disabled entirely,
and any attempt to submit nvme ioctls from 32-bit userspace fails
outright with -ENOTTY.
Bean Huo [Fri, 15 Jul 2022 21:27:21 +0000 (23:27 +0200)]
nvme: use command_id instead of req->tag in trace_nvme_complete_rq()
Use command_id instead of req->tag in trace_nvme_complete_rq(),
because of commit e7006de6c238 ("nvme: code command_id with a genctr
for use authentication after release"), cmd->common.command_id is set to
((genctl & 0xf)< 12 | req->tag), no longer req->tag, which makes cid in
trace_nvme_complete_rq and trace_nvme_setup_cmd are not the same.
Fixes: e7006de6c238 ("nvme: code command_id with a genctr for use authentication after release") Signed-off-by: Bean Huo <beanhuo@micron.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a KASAN warning in raid10_remove_disk when running the lvm
test lvconvert-raid-reshape.sh. We fix this warning by verifying that the
value "number" is valid.
BUG: KASAN: slab-out-of-bounds in raid10_remove_disk+0x61/0x2a0 [raid10]
Read of size 8 at addr ffff889108f3d300 by task mdX_raid10/124682
Last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
L __fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40
__kasan_record_aux_stack+0x9e/0xc0
kvfree_call_rcu+0x84/0x480
timerfd_release+0x82/0x140
__fput+0xfa/0x400
task_work_run+0x80/0xc0
exit_to_user_mode_prepare+0x155/0x160
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x42/0xc0
entry_SYSCALL_64_after_hwframe+0x46/0xb0
The buggy address belongs to the object at ffff889108f3d200
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 0 bytes to the right of
256-byte region [ffff889108f3d200, ffff889108f3d300)
Memory state around the buggy address: ffff889108f3d200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff889108f3d280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff889108f3d300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff889108f3d380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff889108f3d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md-raid: destroy the bitmap after destroying the thread
When we ran the lvm test "shell/integrity-blocksize-3.sh" on a kernel with
kasan, we got failure in write_page.
The reason for the failure is that md_bitmap_destroy is called before
destroying the thread and the thread may be waiting in the function
write_page for the bio to complete. When the thread finishes waiting, it
executes "if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))", which
triggers the kasan warning.
Note that the commit 48df498daf62 that caused this bug claims that it is
neede for md-cluster, you should check md-cluster and possibly find
another bugfix for it.
BUG: KASAN: use-after-free in write_page+0x18d/0x680 [md_mod]
Read of size 8 at addr ffff889162030c78 by task mdX_raid1/5539
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop") Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Two callers of md_alloc want to use the newly allocated devices, so
return it instead of letting them find it cumbersomely after the
allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
autorun_devices should not be limited to the controls for the legacy
probe on open, so just call md_alloc directly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-and-tested-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yang Li [Fri, 22 Jul 2022 00:27:55 +0000 (08:27 +0800)]
md: remove unneeded semicolon
Eliminate the following coccicheck warning:
./drivers/md/md.c:8208:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
This driver is for fairly obscure hardware, and has only seen random
drive-by changes after the maintainer stopped working on it in 2005
(about a year and a half after it was introduced). It has some
"interesting" block layer interactions, so let's just drop it unless
anyone complains.
Stephen Rothwell [Thu, 21 Jul 2022 03:11:32 +0000 (13:11 +1000)]
md: fix build failure for !MODULE
After merging the block tree, today's linux-next build (x86_64
allmodconfig) failed like this:
drivers/md/md.c:717:22: error: 'mddev_find' defined but not used [-Werror=unused-function]
717 | static struct mddev *mddev_find(dev_t unit)
| ^~~~~~~~~~
cc1: all warnings being treated as errors
Now that devices are on the all_mddevs list until the gendisk is freed,
there can't be any duplicates. Remove the global list lookup and just
grab a reference.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md: only delete entries from all_mddevs when the disk is freed
This ensures device names don't get prematurely reused. Instead add a
deleted flag to skip already deleted devices in mddev_get and other
places that only want to see live mddevs.
Reported-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock and delete the now unused for_each_mddev
macro.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a simple list_for_each_entry_safe on all_mddevs, and only grab a
reference when we drop the lock.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just do a plain list_for_each that only grabs a mddev reference in
the case where the thread sleeps and restarts the list iteration.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md: factor out the rdev overlaps check from rdev_size_store
This splits the code into nicely readable chunks and also avoids
the refcount inc/dec manipulations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The md_free name is rather misleading, so pick a better one.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that all private data is only freed once all accesses are done.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Error handling in md_alloc is a mess. Untangle it to just free the mddev
directly before add_disk is called and thus the gendisk is globally
visible. After that clear the hold flag and let the mddev_put take care
of cleaning up the mddev through the usual mechanisms.
Fixes: 5e55e2f5fc95 ("[PATCH] md: convert compile time warnings into runtime warnings") Fixes: 9be68dd7ac0e ("md: add error handling support for add_disk()") Fixes: 7ad1069166c0 ("md: properly unwind when failing to add the kobject in md_alloc") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Once a kobject is initialized, the containing object should not be
directly freed. So delay initialization until it is added. Also
remove the kobject_del call as the last put will remove the kobject as
well. The explicitly delete isn't needed here, and dropping it will
simplify further fixes.
With this md_free now does not need to check that ->gendisk is non-NULL
as it is always set by the time that kobject_init is called on
mddev->kobj.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid5: Convert prepare_to_wait() to wait_woken() api
raid5_get_active_stripe() can sleep in various situations and it
is called by make_stripe_request() while inside the
prepare_to_wait()/finish_wait() section. Nested waits like this are
not supported.
This was noticed while making other changes that add different sleeps
to raid5_get_active_stripe() that caused a WARNING with
CONFIG_DEBUG_ATOMIC_SLEEP.
No ill effects have been noticed with the code as is, but theoretically
a nested and here could cause a dead lock so it should be fixed.
To fix this, convert the prepare_to_wait() call to use wake_woken()
which supports nested sleeps.
Link: https://lwn.net/Articles/628628/ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
md/raid5: Fix sectors_to_do bitmap overflow in raid5_make_request()
For unaligned IO that have nearly maximum sectors, the number of stripes
will end up being one greater than the size of the bitmap. When this
happens, the last stripe in the IO will not be processed as it should
be, resulting in data corruption.
However, this is not normally seen when the backing block devices have
4K physical block sizes since the block layer will split the request
before that happens.
To fix this increase the bitmap size by one bit and ensure the full
number of stripes are checked when calling find_first_bit().
Reported-by: David Sloan <David.Sloan@eideticom.com> Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Tue, 19 Jul 2022 04:27:24 +0000 (12:27 +0800)]
bcache: remove EXPERIMENTAL for Kconfig option 'Asynchronous device registration'
The "Asynchronous device registration (EXPERIMENTAL)" Kconfig option is
for 2+ years, it is used when registration takes too much time for
massive amount of cached data, to avoid udev task timeout during boot
time.
Many users and products enable this Kconfig option for quite long time
(e.g. SUSE Linux) and it works as expected and no issue reported.
It is time to remove the "EXPERIMENTAL" tag from this Kconfig item.
commit 1243172d5894 ("nbd: use pr_err to output error message") tries
to define pr_fmt and use short pr_err() to output error message,
however, the definition is missed.
This patch also remove existing "nbd:" inside pr_err().
Dan Carpenter [Fri, 15 Jul 2022 08:12:14 +0000 (11:12 +0300)]
null_blk: fix ida error handling in null_add_dev()
There needs to be some error checking if ida_simple_get() fails.
Also call ida_free() if there are errors later.
Fixes: 94bc02e30fb8 ("nullb: use ida to manage index") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/YtEhXsr6vJeoiYhd@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guixin Liu [Fri, 8 Jul 2022 03:06:05 +0000 (11:06 +0800)]
nvme-apple: use nvme core helper to cancel requests in tagset
Use nvme core helper nvme_cancel_tagset and nvme_cancel_admin_tagset
instead of same logic code.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ruozhu Li <liruozhu@huawei.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, command data is only sent in-capsule on the for admin or I/O
commands on queues that indicate support for it. Send fabrics command
data in-capsule for I/O queues as well to avoid needing a separate
H2CData PDU for the connect command.
This is optimization. Without this change, we send the connect command
capsule and data in separate PDUs (CapsuleCmd and H2CData), and must wait
for the controller to respond with an R2T PDU before sending the H2CData.
With the change, we send a single CapsuleCmd PDU that includes the data.
This reduces the number of bytes (and likely packets) sent across the network,
and simplifies the send state machine handling in the driver.
Israel Rukshin [Sun, 15 May 2022 15:04:40 +0000 (18:04 +0300)]
nvme-rdma: remove timeout for getting RDMA-CM established event
In case many controllers start error recovery at the same time (i.e.,
when port is down and up), they may never succeed to reconnect again.
This is because the target can't handle all the connect requests at
three seconds (the arbitrary value set today). Even if some of the
connections are established, when a single queue fails to connect,
all the controller's queues are destroyed as well. So, on the
following reconnection attempts the number of connect requests may
remain the same. To fix this, remove the timeout and wait for RDMA-CM
event to abort/complete the connect request. RDMA-CM sends unreachable
event when a timeout of ~90 seconds is expired. This approach is used
at other RDMA-CM users like SRP and iSER at blocking mode. The commit
also renames NVME_RDMA_CONNECT_TIMEOUT_MS to NVME_RDMA_CM_TIMEOUT_MS.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block/rnbd-srv: Replace sess_dev_list with index_idr
The structure rnbd_srv_session maintains a list and an xarray of
rnbd_srv_dev. There is no need to keep both as one of them can serve the
purpose.
Since one of the places where the lookup of rnbd_srv_dev using
rnbd_srv_session is IO path, an xarray would serve us better than a list
traversal. Hence remove sess_dev_list from rnbd_srv_session, and replace
its uses from xarray.
block/rnbd-srv: Set keep_id to true after mutex_trylock
After setting keep_id if the mutex trylock fails, the keep_id stays set
for the rest of the sess_dev lifetime.
Therefore, set keep_id to true after mutex_trylock succeeds, so that a
failure of trylock does'nt touch keep_id.
Fixes: b168e1d85cf3 ("block/rnbd-srv: Prevent a deadlock generated by accessing sysfs in parallel") Cc: gi-oh.kim@ionos.com Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20220707143122.460362-2-haris.iqbal@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:07 +0000 (11:52 +0200)]
nvmet-auth: expire authentication sessions
Each authentication step is required to be completed within the
KATO interval (or two minutes if not set). So add a workqueue function
to reset the transaction ID and the expected next protocol step;
this will automatically the next authentication command referring
to the terminated authentication.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:06 +0000 (11:52 +0200)]
nvmet-auth: Diffie-Hellman key exchange support
Implement Diffie-Hellman key exchange using FFDHE groups for NVMe
In-Band Authentication.
This patch adds a new host configfs attribute 'dhchap_dhgroup' to
select the FFDHE group to use.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:05 +0000 (11:52 +0200)]
nvmet: implement basic In-Band Authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds three additional configfs entries 'dhchap_key',
'dhchap_ctrl_key', and 'dhchap_hash' to the 'host' configfs directory.
The 'dhchap_key' and 'dhchap_ctrl_key' entries need to be in the ASCII
format as specified in NVMe Base Specification v2.0 section 8.13.5.8
'Secret representation'.
'dhchap_hash' defaults to 'hmac(sha256)', and can be written to to
switch to a different HMAC algorithm.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:04 +0000 (11:52 +0200)]
nvmet: parse fabrics commands on io queues
Some fabrics commands can be sent via io queues, so add a new
function nvmet_parse_fabrics_io_cmd() and rename the existing
nvmet_parse_fabrics_cmd() to nvmet_parse_fabrics_admin_cmd().
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:02 +0000 (11:52 +0200)]
nvme: implement In-Band authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds two new fabric options 'dhchap_secret' to specify the
pre-shared key (in ASCII respresentation according to NVMe 2.0 section
8.13.5.8 'Secret representation') and 'dhchap_ctrl_secret' to specify
the pre-shared controller key for bi-directional authentication of both
the host and the controller.
Re-authentication can be triggered by writing the PSK into the new
controller sysfs attribute 'dhchap_secret' or 'dhchap_ctrl_secret'.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: fold in clang build fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>