NeilBrown [Thu, 1 Aug 2013 05:59:24 +0000 (15:59 +1000)]
mdmon: don't lie to systemd.
Now that mdmon responds fairly well to SIGTERM, stop lying to
systemd about being started on the initrd.
Note that if mdmon is rerun (--takeover) for some reason, and systemd
chooses to kill processes before remounting / readonly, then the
unmount will hang.
If systemd ever lets us tell it that we don't want to be killed until
root is readonly, then we should do that.
NeilBrown [Thu, 1 Aug 2013 04:32:04 +0000 (14:32 +1000)]
Introduce devid2kname - slightly different to devid2devnm.
The purpose od devid2devnm is to return a kernel name of an
md device, whether that device is a whole device or a partition,
we want the whole device. md4, never md4p2.
In one place I was using devid2devnm where I really wanted the
partition if there was one ... and wasn't really interested in it
being an md device.
So introduce a new 'devid2kname' for that case.
NeilBrown [Thu, 1 Aug 2013 04:04:07 +0000 (14:04 +1000)]
Don't lie to systemd about mdadm's status.
Telling systemd that mdadm was started from the initrd
is often a lie and never necessary. Now that the reshape monitoring
thread handles SIGTERM gracefully it is OK for system to kill
and mdadm that it finds running.
mdmon still have a bit of a question mark over it so I won't remove
the '@' from there just yet.
NeilBrown [Thu, 1 Aug 2013 01:16:14 +0000 (11:16 +1000)]
Grow: exit background thread cleanly on SIGTERM.
If the mdadm thread that monitors a reshape gets SIGTERM it should
exit cleanly and clear the 'suspended' region of the array.
However it mustn't clear 'sync_max' as that would allow the
reshape to continue unmonitored.
If the thread ever does get killed, the array should really be
shutdown soon after if possible.
Martin Wilck [Tue, 30 Jul 2013 21:18:33 +0000 (23:18 +0200)]
mdmon: manage_member: fix race condition during slow meta data writes
In order to track kernel state changes, the monitor needs to
notice changes in sysfs. If the changes are transient, and the
monitor is busy writing meta data, it can happen that the changes
are missed. This will cause the meta data to be inconsistent with
the real state of the array.
I can reproduce this in a test scenario with a DDF container and
two subarrays, where I set a disk to "failed" and then add a global
hot-spare. On a typical MD test setup with loop devices, I can
reliably reproduce a failure where the metadata show degraded members
although the kernel finished the recovery successfully.
This patch fixes this problem by applying two changes. First, when
a metadata update is queued, wait until it is certain that the monitor
actually applied these meta data (the for loop is actually needed to
avoid failures completely in my test case). Second, after triggering the
recovery, set prev_state of the changed array to "recover", in case
the monitor misses the transient "recover" state.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Martin Wilck [Tue, 30 Jul 2013 21:18:31 +0000 (23:18 +0200)]
mdmon: wait_and_act: fix debug message for SIGUSR1
Correctly print out wake reason if it was a signal. Previous code
would print misleading select events (pselect(2) man page says the
fdsets become undefined in case of error).
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Martin Wilck [Tue, 30 Jul 2013 21:18:30 +0000 (23:18 +0200)]
monitor: read_and_act: log status when called
read_and_act() currently prints a debug message only very late.
Print the status seen by mdmon right away, to track mdmon's
actions more closely. Add a time stamp to observe long delays
between read_and_act calls, e.g. caused by meta data writes.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
_avail_space1() is calls from both avail_space1() and validate_geometry1()
and does slightly different things.
The partial code sharing doesn't really help. In particularly the
responsibility for setting the size of the array is currently
confused.
So duplicate the code into the two locations - one where 'super' is
always NULL (validate_geometry1) and one where it is never NULL
(avail_space1), and simplify.
This call to validate_geometry is really rather gratuitous.
It is purely about the fact that super0 cannot use more than 4TB.
So just make it an explicit test - less confusing that way.
With this, validate_geometry is only called from Create, which
makes it easier to reason about.
Also validate_geometry is now never passed NULL for the 'chunk'
parameter, so we can remove those annoying tests for NULL.
Grow: don't hold array open while waiting for reshape.
If we will need to change array level when a reshape completes, a copy
of mdadm waits in the background.
Currently this copy hold the device (/dev/mdX) open. This prevents
the array from being stopped.
So close the file descriptor and re-open after the reshape completes.
super1: update data_size when performing "revert-reshape".
The "data_size" is with respect to "data_offset". When the kernel
changes "data_offset" it modifies "data_size" to match - see
md_finish_reshape() in the kernel.
So when mdadm switches the data_offset for the new data_offset, it
must update data_size correspondingly.
Let the first created array be RAID5 rather than RAID0. This makes
the test harder than before, because everything after the first
Create has do be done indirectly through mdmon.
This was waiting on "reshape_position" which doesn't
get update events.
Before sysfs_wait was introduced, the code to wait didn't
wait at all, so it spun.
With sysfs_wait, it would wait forever.
Change to wait in sync_completed which does get events.
* Aligned the spelling of “RAID” to use captial letters in all places.
* Aligned the spelling of the RAID level names (LINEAR, RAID1, …) to use capital
letters in all places, except for the string “faulty” in places where not the
RAID level was meant.
Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Stop: fix up synchronising end of reshape to good boundary.
If we stop too soon after reshape starts (probably only during
testing), we can get confused by the status of the reshape.
If that might be happening - sleep a bit longer.
Also allow for reshape going unusually slowly (again, probably only
during testing).
Grow: use mdstat_wait to wait for delayed reshape.
Having a fix time for a wait is clumsy and can make us
wait much too long.
So use mdstat_wait and keep the mdstat_fd open.
This requires an 'mdstat_close' so it doesn't stay open
forever.
DDF load headers: if primary is invalid, don't check fields.
Currently we compare fields between primary and secondary
superblocks, before we check if the primary is even valid.
This is a bit backwards, so reverse it.
The "indirect" code path for adding VDs was not working correctly
for secondary RAID level. The "other BVDs" were not transmitted
to mdmon. Thus mdmon wouldn't build up correct information, and
RAID creation would fail when mdmon was already running on the container.
The return value of disk.raid_disk may be wrong.
The old code was using raiddisk, which is only valid with auto
layout. This leads to errors when arrays are created with
specified disks and mdmon is already running, like this:
Implement kill_subarray, for mdmon running and not running.
The way Kill_subarray() is implemented, this requires that the
DDF layer uses "currentconf" to remember the last subarray
queried with container_content(), and use it as the one to kill.
I don't like this much but IMSM does it the same way.
DDF: write_init_super_ddf: don't zero superblocks for subarrays
commit d682f344 inserted this call to "Kill" in write_init_super_ddf:
"Matching the functionality already in super0 and super1, when
we first create a container, remove any other recognisable metadata to
ensure it doesn't cause confusion."
But we should do this only at first container creation, not when
subarrays are created later.
Monitor: Don't write metadata in inactive array state
The kernel docs state that meta data is never written in states
clear, inactive, suspended, readonly, and read_auto.
Why should this be different for containers?
We need to write metadata when the array is disabled, though.
Tested with the DDF (10*) and IMSM (9*) tests, works.
DDF: add_to_super_ddf: Use same amount of workspace as other disks
If there are already disks in the container, reserve the same amount
of workspace as the first disk. Fill in the primary/secondary/
workspace LBA values.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
DDF: use LBA_OFFSET macro instead of lba_offset field
Remove the lba_offset field from struct vcl. This field acted as
a "cache" for the address of the lba_offset field in the vd_config
structure. This isn't useful any more if there are multiple
vd_configs in a vcl.
This patch also adds __cpu_to_be64 in two places where it has been
quite obviously forgotten (ddf_set_disk, ddf_activate_spare).
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Instead of allocating the other_bvds array and every element
separately, allocate all in a single chunk. Also, move allocation
in a subroutine as it's used in several places.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Support for RAID 10 makes it necessary to rewrite the algorithm
for deriving DDF layout from MD layout. The functions level_to_prl
and layout_to_rlq are combined in a single function that takes
md layout parameters and converts them to DDF.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
DDF: layout_ddf2md: new DDF->md RAID layout conversion
layout_ddf2md() is a new RAID layout conversion routine.
It obsoletes the previous separate routines for obtaining
md level and layout (map_num1, rlq_to_layout).
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
The DDF code was assuming that the VD slots 0..populated_vdes
were used and the rest was unused. Remove this assumption and
deal with empty slots instead.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
If secondary RAID level is taken into account, translation between
the md RAID member (raid_disk) and the index of a physical disk
in a BVD becomes more complex.
Also, take into account that the member list can have unused entries
(this is independent of secondary RAID level).
Adapt usage of find_vdcr() accordingly
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
This patch implements the previously unsupported case where
store_super_ddf is called with a non-empty superblock.
For DDF, writing meta data to just one disk makes no sense.
We would run the risk of writing inconsistent meta data
to the devices. So just call __write_init_super_ddf and
write to all devices, including the one passed by the caller.
This patch assumes that the device to store the superblock on
has already been added to the DDF structure. Otherwise, an
error message will be emitted.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Assemble: avoid a consistency check when --force is given.
mdadm will normally not include a device into an array if that device
reports that the "best" device has failed, as this normally implies
some sort of inconsistency.
However when --force is given it means that the given drives really
should be assembled if at all possible so in that case the test should
be avoided.
The particular case where this was a problem was a RAID5 were all
devices had the same event count but three of them reported that the
first two had failed.
As they all had the same event count the first was taken as the 'best'
and that caused the later ones to be excluded. Listing one of the
later ones first allowed the array to be assembled. So in this case
the test clearly just got in the way and did nothing useful.
Grow: notice when --stop is synchronising a reshape and don't mess it up.
--stop now tries to wait for a reshape to be at just the right spot.
However for a reducing reshape, mdadm will be running in the
background watching, and might adjust sync_max and mess things up.
So teach "progress_reshape" to notice when "sync_max" is modified, and
leave it alone.