libblkid: fix spurious ext superblock checksum mismatches

Reads of ext superblocks can race with updates. If libblkid observes a
checksum mismatch, re-read the superblock with O_DIRECT in order to get
a consistent view of its contents. Report a checksum failure only if
the O_DIRECT read also fails verification.

This fixes a problem where devices that were named by filesystem label
failed to be found when systemd attempted to mount them on boot. The
problem was caused by systemd-udevd using libblkid. If a read of a
superblock results in a checksum mismatch, udev removes the by-label
links, which causes the mount call to fail to find the device. The
checksum mismatch that triggered the problem was spurious: when we use
O_DIRECT, or even just retry the read, the superblock is read
correctly. This resulted in a failure to mount /boot in roughly one out
of every 2,000 attempts in our environment.

e2fsprogs fixed[1] an identical version of this bug that afflicted
resize2fs during online grow operations when run from cloud-init. The
fix there was also to use O_DIRECT in order to read the superblock.
This patch uses a similar approach: read the superblock with O_DIRECT in
the case where a bad checksum is detected.
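For illustration only, and not the code in this patch, the retry logic
can be sketched as below. Every name here (probe_superblock,
read_direct_from_path, the callback types, the fake device and the toy
checksum) is hypothetical, and the 4096-byte IO_CHUNK is an assumed
alignment that is typically enough for O_DIRECT. The sketch reads the
first aligned chunk of the device and copies out the superblock, which
starts 1024 bytes in.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SB_OFFSET 1024 /* the ext superblock starts 1024 bytes in */
#define IO_CHUNK  4096 /* assumed to satisfy O_DIRECT alignment */

/* Hypothetical callback types for this sketch; not libblkid interfaces. */
typedef int (*sb_read_fn)(void *ctx, unsigned char *buf, size_t len);
typedef int (*sb_csum_ok_fn)(const unsigned char *buf, size_t len);

/* Returns 0 on a valid superblock, 1 on a confirmed mismatch, -1 on I/O error. */
int probe_superblock(sb_read_fn read_cached, sb_read_fn read_direct, void *ctx,
                     sb_csum_ok_fn csum_ok, unsigned char *buf, size_t len)
{
    if (read_cached(ctx, buf, len) != 0)
        return -1;
    if (csum_ok(buf, len))
        return 0;
    /* The mismatch may come from racing with an update: re-read directly. */
    if (read_direct(ctx, buf, len) != 0)
        return -1;
    return csum_ok(buf, len) ? 0 : 1;
}

/* One possible read_direct callback; ctx is the device path. */
int read_direct_from_path(void *ctx, unsigned char *buf, size_t len)
{
    void *chunk = NULL;
    int flags = O_RDONLY, rc = -1, fd;

    if (len > IO_CHUNK - SB_OFFSET)
        return -1;
#ifdef O_DIRECT
    flags |= O_DIRECT;
#endif
    fd = open((const char *)ctx, flags);
    if (fd < 0)
        return -1;
    /* O_DIRECT wants an aligned buffer, offset and length, so read the
     * first aligned chunk and copy the superblock out of it. */
    if (posix_memalign(&chunk, IO_CHUNK, IO_CHUNK) == 0) {
        if (pread(fd, chunk, IO_CHUNK, 0) == (ssize_t)IO_CHUNK) {
            memcpy(buf, (unsigned char *)chunk + SB_OFFSET, len);
            rc = 0;
        }
        free(chunk);
    }
    close(fd);
    return rc;
}

/* Test doubles: a fake device whose cached copy may differ from disk. */
struct fake_dev { const unsigned char *cached, *on_disk; };

int fake_read_cached(void *ctx, unsigned char *buf, size_t len)
{
    memcpy(buf, ((struct fake_dev *)ctx)->cached, len);
    return 0;
}

int fake_read_direct(void *ctx, unsigned char *buf, size_t len)
{
    memcpy(buf, ((struct fake_dev *)ctx)->on_disk, len);
    return 0;
}

/* Toy checksum: the last byte must equal the sum of the preceding bytes. */
int toy_csum_ok(const unsigned char *buf, size_t len)
{
    unsigned char sum = 0;
    for (size_t i = 0; i + 1 < len; i++)
        sum += buf[i];
    return len > 0 && sum == buf[len - 1];
}

/* The cached copy is always torn; the on-disk copy is valid on request. */
int demo_probe(int disk_is_valid)
{
    static const unsigned char torn[4] = {1, 2, 9, 6};
    static const unsigned char good[4] = {1, 2, 3, 6};
    struct fake_dev dev = { torn, disk_is_valid ? good : torn };
    unsigned char buf[4];

    return probe_superblock(fake_read_cached, fake_read_direct, &dev,
                            toy_csum_ok, buf, sizeof(buf));
}
```

With these test doubles, demo_probe(1) models a spurious mismatch (torn
cached copy, valid on-disk copy) and returns 0, while demo_probe(0)
models a genuinely bad superblock and returns 1.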
[1] https://lore.kernel.org/linux-ext4/20230609042239.GA1436857@mit.edu/

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>