Yu Watanabe [Tue, 17 Feb 2026 14:23:23 +0000 (23:23 +0900)]
Fixes & improvements for using homed-luks on 4k disks (#35776)
Mostly consists of fixes to
- Use the same sector_size as the fdisk context we are using when
converting sectors returned by libfdisk to bytes. Fixes #30394,
Fixes #30393
- Use the explicit sector size, if specified in the home record, when we
are probing the image file using libblkid. Fixes #30393
Also contains some other improvements with using physical block devices.
- Automatically probe sector size of physical block device, if user does
not pass luks-sector-size explicitly.
- Align partitions to 1 MiB boundaries, as is the standard practice
followed by common tools such as fdisk, gptfdisk and GNU Parted.
- Avoid stacking of loop device on top of physical block device in
home_create_luks as it leads to degradation of discard operations, and
mkfs getting stuck.
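The two sector-related points above can be illustrated with a minimal sketch. It is not the homework code: the helper names `sectors_to_bytes` and `align_up` are hypothetical, and the only assumption is that a partitioning library reports positions in units of its own sector size.

```python
MIB = 1024 * 1024


def sectors_to_bytes(sectors: int, sector_size: int) -> int:
    """Convert a sector count to bytes using the sector size it was counted in."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a positive power of two")
    return sectors * sector_size


def align_up(offset: int, alignment: int = MIB) -> int:
    """Round a byte offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


# The same sector number maps to different byte offsets per sector size,
# so mixing up 512 and 4096 puts a partition in the wrong place.
print(sectors_to_bytes(2048, 512))   # 1048576
print(sectors_to_bytes(2048, 4096))  # 8388608

# A start offset right after a 512-byte GPT (34 sectors) moves up to 1 MiB.
print(align_up(34 * 512))            # 1048576
```

Keeping the conversion in one helper that always takes the sector size explicitly makes it harder to fall back to an implicit 512-byte assumption.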
Yu Watanabe [Tue, 17 Feb 2026 13:20:38 +0000 (22:20 +0900)]
Sensor cleanup 1st pass (#40675)
This is a general cleanup of the sensors hwdb file, divided into several
commits, one per brand.
I have merged the devices that use the same matrices, cleaned up some
DMI matches a little, and added an inline comment naming the device
where that is clearly the best way to display it.
My idea is to do more cleanup passes, but completing the DMI matches
will require more effort. I can do it given a little time and some
consensus on comment styling.
About the comment styling: I think we could follow two rules at the
same time. Use an inline comment when the DMI match is short and there
is no information to add beyond the model, and a comment above the DMI
match when the match is too long or more information needs to be
added.
scarlet-storm [Sat, 28 Dec 2024 08:44:11 +0000 (14:14 +0530)]
Use sectorsize for partition tables on block devs
Fix for the specific case in #30393 where a 4k-sector LUKS container is
created on a 512-byte-sector device. In this case the partition table
needs to use 512-byte sectors, else the kernel will not be able to find
the partition, and we would have to create a loop device to translate
the partition table to 4k.
scarlet-storm [Sat, 28 Dec 2024 07:55:25 +0000 (13:25 +0530)]
homework: Ensure we don't stack block devices
Ensure we don't create a loop device on top of a physical block device.
This leads to huge performance degradation of discard operations if the
physical device does not support discard_on_zeroes.
- Historical loop device semantics dictate that once a range of the
device is discarded, reads from it must return zero data. This can be
implemented easily on a filesystem, since fallocate zero-range
returns immediately & the holes are handled at the filesystem
level to return zero data on read.
- For a raw block device, the feature (discard_zeroes_data) depends on
the capabilities of the physical device that is exposed to the
block layer by the driver. This means that, to guarantee that a loop
device stacked on a block device returns zero on discarded data,
the loop driver needs to convert each discarded range into a
write_zero op on the block device.
https://github.com/torvalds/linux/blob/63676eefb7a026d04b51dcb7aaf54f358517a2ec/drivers/block/loop.c#L773
For example on one of my local nvme I can see the following:
cat /sys/class/block/nvme1n1/queue/write_zeroes_max_bytes
131072
cat /sys/class/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040
This means the maximum size of a write_zero operation is 128 KiB & the
maximum size of a discard operation is about 2 TiB on the block device.
So discarding, for example, 1 TiB of data, which would be a single block
device operation, gets split into roughly 8.4 million (2^23) block
device operations when issued on top of a stacked loop device.
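The arithmetic behind that estimate can be checked with a short sketch. The two limits are the sysfs values quoted above; the helper name `ops_needed` is hypothetical.

```python
WRITE_ZEROES_MAX_BYTES = 131072        # write_zeroes_max_bytes quoted above
DISCARD_MAX_HW_BYTES = 2199023255040   # discard_max_hw_bytes quoted above


def ops_needed(total_bytes: int, max_bytes_per_op: int) -> int:
    """Number of operations needed when each one covers at most max_bytes_per_op."""
    return -(-total_bytes // max_bytes_per_op)  # integer ceiling division


one_tib = 2**40
print(ops_needed(one_tib, DISCARD_MAX_HW_BYTES))    # 1
print(ops_needed(one_tib, WRITE_ZEROES_MAX_BYTES))  # 8388608
```

With these limits, a discard that the device could take in one request becomes 2^23 write-zeroes requests once it has to go through a stacked loop device.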
scarlet-storm [Sat, 28 Dec 2024 03:43:34 +0000 (09:13 +0530)]
homework: Use same sector size when probing the device
If there is an explicit sector size specified in the user record,
use the same when probing the file using libblkid. The default
is 512 bytes, with which the probe will not find the signatures if the
partition table on the regular file was created assuming 4k sectors.
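Why the sector size matters for probing can be shown with a toy example. This is not libblkid: it only relies on the fact that a GPT header lives at LBA 1, i.e. at byte offset equal to the sector size, and starts with the signature "EFI PART". The helpers `make_fake_image` and `has_gpt_header` are hypothetical, and a real probe checks much more than the signature.

```python
import io

GPT_SIGNATURE = b"EFI PART"


def make_fake_image(sector_size: int, total_sectors: int = 64) -> io.BytesIO:
    """Build an in-memory image with only a GPT signature placed at LBA 1."""
    buf = bytearray(sector_size * total_sectors)
    buf[sector_size:sector_size + len(GPT_SIGNATURE)] = GPT_SIGNATURE
    return io.BytesIO(bytes(buf))


def has_gpt_header(image: io.BytesIO, sector_size: int) -> bool:
    """Look for the GPT signature at LBA 1 under the assumed sector size."""
    image.seek(sector_size)
    return image.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE


image = make_fake_image(4096)        # table written assuming 4k sectors
print(has_gpt_header(image, 512))    # False
print(has_gpt_header(image, 4096))   # True
```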
For safety, prefer the board product name, which always has the short
name, over the system product name, which in a few models is a very long
string with the short name at the end.
The following models, added at the time of this commit, need a leading
wildcard when matched via pn: BR1100FKA, RC72LA and TP412UA.
Unmerged Q502LAB, Q551LB and Q551LN; in the merged match there are many
more unreported models.
Chris Down [Tue, 17 Feb 2026 06:58:44 +0000 (14:58 +0800)]
oomd: Fix unnecessary delays during OOM kills with pending kills present
Let's say a user has two services with ManagedOOMMemoryPressure=kill,
perhaps a web server under system.slice and a batch job under
user.slice. Both exceed their pressure limits. On the previous timer
tick, oomd has already queued the web server's candidate for killing,
but the prekill hook has not yet responded, so the kill is still
pending.
In the code, monitor_memory_pressure_contexts_handler() iterates over
all pressure targets that have exceeded their limits. When it reaches
the web server target, it calls oomd_cgroup_kill_mark(), which returns 0
because that cgroup is already queued. The code treats this the same as
a successful new kill: it resets the 15 second delay timer and returns
from the function, exiting the loop.
This loop is handled by SET_FOREACH and the iteration order is
hash-dependent. As such, if the web server target happens coincidentally
to be visited first, oomd never evaluates the batch job target at all.
The effect is twofold:
1. oomd stalls for 15 seconds despite not having initiated any new kill.
That can unnecessarily delay further action to stem increases in
memory pressure. The delay exists to let stale pressure counters
settle after a kill, but no kill has happened here.
2. It non-deterministically skips pressure targets that may have
unqueued candidates, dangerously allowing memory pressure to persist
for longer than it should.
Fix this by skipping cgroups that are already queued so the loop
proceeds to try other pressure targets. We should only delay when a new
kill mark is actually created.
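The control-flow difference can be modelled in a few lines. This is a toy sketch of the behaviour described above, not the oomd code: the `Monitor` class and its methods are hypothetical, and a list stands in for one unlucky hash iteration order.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Toy model of the handler state; every name here is hypothetical."""
    queued: set = field(default_factory=set)
    delay_armed: bool = False

    def kill_mark(self, cgroup: str) -> int:
        """Return 1 if a new kill was queued, 0 if the cgroup was already queued."""
        if cgroup in self.queued:
            return 0
        self.queued.add(cgroup)
        return 1

    def handle_before(self, candidates: list) -> None:
        """Described old behaviour: a 0 result is treated like a fresh kill."""
        for cgroup in candidates:
            self.kill_mark(cgroup)
            self.delay_armed = True
            return

    def handle_after(self, candidates: list) -> None:
        """Described fix: skip already-queued cgroups, delay only on a new mark."""
        for cgroup in candidates:
            if self.kill_mark(cgroup) == 0:
                continue
            self.delay_armed = True
            return


# The already-queued target happens to be visited first.
order = ["web-server", "batch-job"]

before = Monitor(queued={"web-server"})
before.handle_before(order)
print(sorted(before.queued), before.delay_armed)  # ['web-server'] True

after = Monitor(queued={"web-server"})
after.handle_after(order)
print(sorted(after.queued), after.delay_armed)    # ['batch-job', 'web-server'] True
```

In the first run the delay is armed although nothing new was queued; in the second the batch job is reached and the delay follows an actual new mark.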
Chris Down [Tue, 17 Feb 2026 06:30:16 +0000 (14:30 +0800)]
oomd: Fix silent failure to find bad cgroups when another cgroup dies
Consider a workload slice with several sibling cgroups. Imagine that one
of those cgroups is removed between the moment oomd enumerates the
directory and the moment it reads memory.oom.group. This is actually
relatively plausible under the high memory pressure conditions where
oomd is most needed.
In this case, the failed read prompts us to `return 0`, which exits the
entire enumeration loop in recursively_get_cgroup_context(). As a
result, all remaining sibling cgroups are silently dropped from the
candidate list for that monitoring cycle.
The effect is that oomd can fail to identify and kill the actual
offending cgroup, allowing memory pressure to persist until a subsequent
cycle where the race doesn't occur.
Fix this by instead proceeding to evaluate further sibling cgroups.
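A small model of the enumeration shows the effect of bailing out versus moving on. It is only a sketch of the described behaviour: `collect_contexts` is a hypothetical helper, and a missing dictionary key stands in for a cgroup whose memory.oom.group can no longer be read.

```python
def collect_contexts(children: list, oom_group: dict, stop_on_error: bool) -> list:
    """Gather (name, value) pairs for each child cgroup that can still be read.

    stop_on_error=True models the early return described above;
    stop_on_error=False models the fix, which moves on to the next sibling.
    """
    found = []
    for name in children:
        try:
            value = oom_group[name]   # stand-in for reading memory.oom.group
        except KeyError:              # the cgroup vanished after enumeration
            if stop_on_error:
                return found
            continue
        found.append((name, value))
    return found


children = ["a.service", "b.service", "c.service"]
readable = {"a.service": 0, "c.service": 1}   # b.service was removed mid-scan

print(collect_contexts(children, readable, stop_on_error=True))
# [('a.service', 0)]
print(collect_contexts(children, readable, stop_on_error=False))
# [('a.service', 0), ('c.service', 1)]
```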
Let's say a user has two services with ManagedOOMMemoryPressure=kill,
one for a web server under system.slice, and one for a batch job under
user.slice. The batch job is causing severe memory pressure, whereas the
web server's cgroup has no processes with significant pgscan activity.
In the code, monitor_memory_pressure_contexts_handler() iterates over
all pressure targets that have exceeded their limits. When
oomd_select_by_pgscan_rate() returns 0 (that is, no candidates) for a
target, we return from the entire SET_FOREACH loop instead of moving to
the next target. Since SET_FOREACH iteration order is hash-dependent, if
the web server target happens to be visited first, oomd will find no
kill candidates for it and exit the loop. The batch job target that is
actually slamming the machine will never even be evaluated, and can
continue to wreak havoc without any intervention.
The effect is that oomd non-deterministically and silently fails to kill
cgroups that it should actually kill, allowing memory pressure to
persist and dangerously build up on the machine.
The fix is simple: keep evaluating the remaining targets when one does
not yield any candidates.
These were introduced as part of the sd-executor worker pool
effort (#29566), which never landed because the performance
improvement was insignificant. Let's just remove the unused
helpers. If that work ever gets resurrected they can be
restored from this commit pretty easily.
Yu Watanabe [Tue, 17 Feb 2026 05:53:46 +0000 (14:53 +0900)]
oomd: Fix Killed signal reason being lost (#40689)
Emitting "oom" doesn't mesh with the org.freedesktop.oom1.Manager
Killed() API contract, which defines "memory-used" and "memory-pressure"
as possible reasons. Consumers that key off reason thus will either lose
policy attribution or may reject the unknown value completely.
Plumb the reason through so it is visible to consumers.
Chris Down [Sun, 15 Feb 2026 17:42:51 +0000 (01:42 +0800)]
oomd: Fix Killed signal reason being lost
Emitting "oom" doesn't mesh with the org.freedesktop.oom1.Manager
Killed() API contract, which defines "memory-used" and "memory-pressure"
as possible reasons. Consumers that key off reason thus will either lose
policy attribution or may reject the unknown value completely.
Plumb the reason through so it is visible to consumers.