Merge pull request #33192 from DaanDeMeyer/packaging

[thirdparty/systemd.git] / docs / BLOCK_DEVICE_LOCKING.md
diff --git a/docs/BLOCK_DEVICE_LOCKING.md b/docs/BLOCK_DEVICE_LOCKING.md

index f122eda417280cf10fa91354a1fc7dcea7622229..a6e3374bc7932dcfea6de6b761dbcfcb45e282c7 100644 (file)
--- a/docs/BLOCK_DEVICE_LOCKING.md
+++ b/docs/BLOCK_DEVICE_LOCKING.md
@@ -2,12 +2,13 @@
  title: Locking Block Device Access
  category: Interfaces
  layout: default
+SPDX-License-Identifier: LGPL-2.1-or-later
  ---
  
  # Locking Block Device Access
  
  *TL;DR: Use BSD file locks
-[(`flock(2)`)](http://man7.org/linux/man-pages/man2/flock.2.html) on block
+[(`flock(2)`)](https://man7.org/linux/man-pages/man2/flock.2.html) on block
  device nodes to synchronize access for partitioning and file system formatting
  tools.*
  
@@ -21,30 +22,41 @@ Applications manipulating a block device can temporarily stop `systemd-udevd`
  from processing rules on it — and thus bar it from probing the device — by
  taking a BSD file lock on the block device node. Specifically, whenever
  `systemd-udevd` starts processing a block device it takes a `LOCK_SH|LOCK_NB`
-lock using [`flock(2)`](http://man7.org/linux/man-pages/man2/flock.2.html) on
+lock using [`flock(2)`](https://man7.org/linux/man-pages/man2/flock.2.html) on
  the main block device (i.e. never on any partition block device, but on the
  device the partition belongs to). If this lock cannot be taken (i.e. `flock()`
-returns `EBUSY`), it refrains from processing the device. If it manages to take
+returns `EAGAIN`), it refrains from processing the device. If it manages to take
  the lock it is kept for the entire time the device is processed.
  
  Note that `systemd-udevd` also watches all block device nodes it manages for
-`inotify()` `IN_CLOSE` events: whenever such an event is seen, this is used as
-trigger to re-run the rule-set for the device.
+`inotify()` `IN_CLOSE_WRITE` events: whenever such an event is seen, this is
+used as trigger to re-run the rule-set for the device.
  
  These two concepts allow tools such as disk partitioners or file system
  formatting tools to safely and easily take exclusive ownership of a block
  device while operating: before starting work on the block device, they should
  take an `LOCK_EX` lock on it. This has two effects: first of all, in case
  `systemd-udevd` is still processing the device the tool will wait for it to
-finish. Second, after the lock is taken, it can be sure that
-`systemd-udevd` will refrain from processing the block device, and thus all
-other client applications subscribed to it won't get device notifications from
-potentially half-written data either. After the operation is complete the
+finish. Second, after the lock is taken, it can be sure that `systemd-udevd`
+will refrain from processing the block device, and thus all other client
+applications subscribed to it won't get device notifications from potentially
+half-written data either. After the operation is complete the
  partitioner/formatter can simply close the device node. This has two effects:
  it implicitly releases the lock, so that `systemd-udevd` can process events on
-the device node again. Secondly, it results an `IN_CLOSE` event, which causes
-`systemd-udevd` to immediately re-process the device — seeing all changes the
-tool made — and notify subscribed clients about it.
+the device node again. Secondly, it results an `IN_CLOSE_WRITE` event, which
+causes `systemd-udevd` to immediately re-process the device — seeing all
+changes the tool made — and notify subscribed clients about it.
+
+Ideally, `systemd-udevd` would explicitly watch block devices for `LOCK_EX`
+locks being released. Such monitoring is not supported on Linux however, which
+is why it watches for `IN_CLOSE_WRITE` instead, i.e. for `close()` calls to
+writable file descriptors referring to the block device. In almost all cases,
+the difference between these two events does not matter much, as any locks
+taken are implicitly released by `close()`. However, it should be noted that if
+an application unlocks a device after completing its work without closing it,
+i.e. while keeping the file descriptor open for further, longer time, then
+`systemd-udevd` will not notice this and not retrigger and thus reprobe the
+device.
  
  Besides synchronizing block device access between `systemd-udevd` and such
  tools this scheme may also be used to synchronize access between those tools
@@ -63,7 +75,169 @@ And please keep in mind: BSD file locks (`flock()`) and POSIX file locks
  orthogonal. The scheme discussed above uses the former and not the latter,
  because these types of locks more closely match the required semantics.
  
+If multiple devices are to be locked at the same time (for example in order to
+format a RAID file system), the devices should be locked in the order of the
+the device nodes' major numbers (primary ordering key, ascending) and minor
+numbers (secondary ordering key, ditto), in order to avoid ABBA locking issues
+between subsystems.
+
+Note that the locks should only be taken while the device is repartitioned,
+file systems formatted or `dd`'ed in, and similar cases that
+apply/remove/change superblocks/partition information. It should not be held
+during normal operation, i.e. while file systems on it are mounted for
+application use.
+
+The [`udevadm
+lock`](https://www.freedesktop.org/software/systemd/man/udevadm.html) command
+is provided to lock block devices following this scheme from the command line,
+for the use in scripts and similar. (Note though that it's typically preferable
+to use native support for block device locking in tools where that's
+available.)
+
  Summarizing: it is recommended to take `LOCK_EX` BSD file locks when
  manipulating block devices in all tools that change file system block devices
  (`mkfs`, `fsck`, …) or partition tables (`fdisk`, `parted`, …), right after
  opening the node.
+
+# Example of Locking The Whole Disk
+
+The following is an example to leverage `libsystemd` infrastructure to get the whole disk of a block device and take a BSD lock on it.
+
+## Compile and Execute
+**Note that this example requires `libsystemd` version 251 or newer.**
+
+Place the code in a source file, e.g. `take_BSD_lock.c` and run the following commands:
+```
+$ gcc -o take_BSD_lock -lsystemd take_BSD_lock.c
+
+$ ./take_BSD_lock /dev/sda1
+Successfully took a BSD lock: /dev/sda
+
+$ flock -x /dev/sda ./take_BSD_lock /dev/sda1
+Failed to take a BSD lock on /dev/sda: Resource temporarily unavailable
+```
+
+## Code
+```c
+/* SPDX-License-Identifier: MIT-0 */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/file.h>
+#include <systemd/sd-device.h>
+#include <unistd.h>
+
+static inline void closep(int *fd) {
+    if (*fd >= 0)
+        close(*fd);
+}
+
+/**
+ * lock_whole_disk_from_devname
+ * @devname: devname of a block device, e.g., /dev/sda or /dev/sda1
+ * @open_flags: the flags to open the device, e.g., O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY
+ * @flock_operation: the operation to call flock, e.g., LOCK_EX|LOCK_NB
+ *
+ * given the devname of a block device, take a BSD lock of the whole disk
+ *
+ * Returns: negative errno value on error, or non-negative fd if the lock was taken successfully.
+ **/
+int lock_whole_disk_from_devname(const char *devname, int open_flags, int flock_operation) {
+    __attribute__((cleanup(sd_device_unrefp))) sd_device *dev = NULL;
+    sd_device *whole_dev;
+    const char *whole_disk_devname, *subsystem, *devtype;
+    int r;
+
+    // create a sd_device instance from devname
+    r = sd_device_new_from_devname(&dev, devname);
+    if (r < 0) {
+        errno = -r;
+        fprintf(stderr, "Failed to create sd_device: %m\n");
+        return r;
+    }
+
+    // if the subsystem of dev is block, but its devtype is not disk, find its parent
+    r = sd_device_get_subsystem(dev, &subsystem);
+    if (r < 0) {
+        errno = -r;
+        fprintf(stderr, "Failed to get the subsystem: %m\n");
+        return r;
+    }
+    if (strcmp(subsystem, "block") != 0) {
+        fprintf(stderr, "%s is not a block device, refusing.\n", devname);
+        return -EINVAL;
+    }
+
+    r = sd_device_get_devtype(dev, &devtype);
+    if (r < 0) {
+        errno = -r;
+        fprintf(stderr, "Failed to get the devtype: %m\n");
+        return r;
+    }
+    if (strcmp(devtype, "disk") == 0)
+        whole_dev = dev;
+    else {
+        r = sd_device_get_parent_with_subsystem_devtype(dev, "block", "disk", &whole_dev);
+        if (r < 0) {
+            errno = -r;
+            fprintf(stderr, "Failed to get the parent device: %m\n");
+            return r;
+        }
+    }
+
+    // open the whole disk device node
+    __attribute__((cleanup(closep))) int fd = sd_device_open(whole_dev, open_flags);
+    if (fd < 0) {
+        errno = -fd;
+        fprintf(stderr, "Failed to open the device: %m\n");
+        return fd;
+    }
+
+    // get the whole disk devname
+    r = sd_device_get_devname(whole_dev, &whole_disk_devname);
+    if (r < 0) {
+        errno = -r;
+        fprintf(stderr, "Failed to get the whole disk name: %m\n");
+        return r;
+    }
+
+    // take a BSD lock of the whole disk device node
+    if (flock(fd, flock_operation) < 0) {
+        r = -errno;
+        fprintf(stderr, "Failed to take a BSD lock on %s: %m\n", whole_disk_devname);
+        return r;
+    }
+
+    printf("Successfully took a BSD lock: %s\n", whole_disk_devname);
+
+    // take the fd to avoid automatic cleanup
+    int ret_fd = fd;
+    fd = -EBADF;
+    return ret_fd;
+}
+
+int main(int argc, char **argv) {
+    if (argc != 2) {
+        fprintf(stderr, "Invalid number of parameters.\n");
+        return EXIT_FAILURE;
+    }
+
+    // try to take an exclusive and nonblocking BSD lock
+    __attribute__((cleanup(closep))) int fd =
+        lock_whole_disk_from_devname(
+            argv[1],
+            O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY,
+            LOCK_EX|LOCK_NB);
+
+    if (fd < 0)
+        return EXIT_FAILURE;
+
+    /**
+     * The device is now locked until the return below.
+     * Now you can safely manipulate the block device.
+     **/
+
+    return EXIT_SUCCESS;
+}
+```