mwilck@arcor.de [Tue, 6 Aug 2013 21:38:02 +0000 (23:38 +0200)]
DDF: ddf_open_new: check device status for new subarray
It is possible that mdadm creates a new subarray containing failed
devices. This may happen if a device has failed, but the meta data
containing that information hasn't been written out yet.
This code tests for this situation, and handles it in the monitor.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
mwilck@arcor.de [Tue, 6 Aug 2013 21:38:01 +0000 (23:38 +0200)]
tests/10ddf-fail-create-race: test handling of fail/create race
If a disk fails and simulaneously a new array is created, a race
condition may arise because the meta data on disk doesn't reflect
the disk failure yet. This is a test for that case.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Wed, 7 Aug 2013 06:59:26 +0000 (16:59 +1000)]
DDF: Write new conf entries with a single write.
The recent change to skip over invalid conf entries was bad because
it could leave garbage on the disk.
But we don't to write each entry separately as the writes a O_DIRECT
and so synchronous so it takes way too long.
So allocate a large buffer (probably the one used to read the config records)
and fill that then write it all at once.
Reported-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
mwilck@arcor.de [Mon, 5 Aug 2013 20:37:51 +0000 (22:37 +0200)]
test: allow LVM volumes or RAM disks as test devices
Allow other device types for testing; this allows to test on
a larger variety of devices.
Option --dev=[loop|lvm|ram] selects loop device (default), lvm,
and ram disk, respecively. To use RAM disks with DDF,
the kernel parameter ramdisk_size=65536 must be used.
For LVM, use --volgroup=<vg> to specify the name of the volume
group in which the test LVs will be created.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Mon, 5 Aug 2013 06:39:45 +0000 (16:39 +1000)]
Makefile: check that 'run' directory exists.
mdadm default to using /run/mdadm. However not all distros
provide /run yet. This can confuse people who build their own
mdadm.
So have "make" complain if the given directory doesn't exist.
This will make it harder to build an mdadm which doesn't work.
Reported-by: Albert Pauw <albert.pauw@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Mon, 5 Aug 2013 04:56:23 +0000 (14:56 +1000)]
DDF: fix removal of failed devices.
Commit c7079c84 arrange for DDF to forget about any device
that is failed and not still marked as part of any array.
However such devices could still be part of the container and this
removal and updating of 'pdnum' can result in multiple devices having
the same pdnum. This in turn easily leads to confusion and
corruption.
So only discard pd entries for devices which are failed, not listed in
any virtual device, and for which we don't have a handle on the
device.
pd entries will not get removed until a new device is added after
the device has been removed from the container, either by
"mdadm --remove" or by assembling without the failed devices.
Reported-by: Albert Pauw <albert.pauw@gmail.com> Analysed-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Mon, 5 Aug 2013 04:21:10 +0000 (14:21 +1000)]
DDF: fix writing metadata updates.
Recent commit 273989b93a3185c0e4d54f0d1bc404248a92d157
skipped writing some large blocks of 0xFF, but didn't seek
over the space, so subsequent data was written wrongly.
NeilBrown [Thu, 1 Aug 2013 05:59:24 +0000 (15:59 +1000)]
mdmon: don't lie to systemd.
Now that mdmon responds fairly well to SIGTERM, stop lying to
systemd about being started on the initrd.
Note that if mdmon is rerun (--takeover) for some reason, and systemd
chooses to kill processes before remounting / readonly, then the
unmount will hang.
If systemd ever lets us tell it that we don't want to be killed until
root is readonly, then we should do that.
NeilBrown [Thu, 1 Aug 2013 04:32:04 +0000 (14:32 +1000)]
Introduce devid2kname - slightly different to devid2devnm.
The purpose od devid2devnm is to return a kernel name of an
md device, whether that device is a whole device or a partition,
we want the whole device. md4, never md4p2.
In one place I was using devid2devnm where I really wanted the
partition if there was one ... and wasn't really interested in it
being an md device.
So introduce a new 'devid2kname' for that case.
NeilBrown [Thu, 1 Aug 2013 04:04:07 +0000 (14:04 +1000)]
Don't lie to systemd about mdadm's status.
Telling systemd that mdadm was started from the initrd
is often a lie and never necessary. Now that the reshape monitoring
thread handles SIGTERM gracefully it is OK for system to kill
and mdadm that it finds running.
mdmon still have a bit of a question mark over it so I won't remove
the '@' from there just yet.
NeilBrown [Thu, 1 Aug 2013 01:16:14 +0000 (11:16 +1000)]
Grow: exit background thread cleanly on SIGTERM.
If the mdadm thread that monitors a reshape gets SIGTERM it should
exit cleanly and clear the 'suspended' region of the array.
However it mustn't clear 'sync_max' as that would allow the
reshape to continue unmonitored.
If the thread ever does get killed, the array should really be
shutdown soon after if possible.
Martin Wilck [Tue, 30 Jul 2013 21:18:33 +0000 (23:18 +0200)]
mdmon: manage_member: fix race condition during slow meta data writes
In order to track kernel state changes, the monitor needs to
notice changes in sysfs. If the changes are transient, and the
monitor is busy writing meta data, it can happen that the changes
are missed. This will cause the meta data to be inconsistent with
the real state of the array.
I can reproduce this in a test scenario with a DDF container and
two subarrays, where I set a disk to "failed" and then add a global
hot-spare. On a typical MD test setup with loop devices, I can
reliably reproduce a failure where the metadata show degraded members
although the kernel finished the recovery successfully.
This patch fixes this problem by applying two changes. First, when
a metadata update is queued, wait until it is certain that the monitor
actually applied these meta data (the for loop is actually needed to
avoid failures completely in my test case). Second, after triggering the
recovery, set prev_state of the changed array to "recover", in case
the monitor misses the transient "recover" state.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Martin Wilck [Tue, 30 Jul 2013 21:18:31 +0000 (23:18 +0200)]
mdmon: wait_and_act: fix debug message for SIGUSR1
Correctly print out wake reason if it was a signal. Previous code
would print misleading select events (pselect(2) man page says the
fdsets become undefined in case of error).
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
Martin Wilck [Tue, 30 Jul 2013 21:18:30 +0000 (23:18 +0200)]
monitor: read_and_act: log status when called
read_and_act() currently prints a debug message only very late.
Print the status seen by mdmon right away, to track mdmon's
actions more closely. Add a time stamp to observe long delays
between read_and_act calls, e.g. caused by meta data writes.
Signed-off-by: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
_avail_space1() is calls from both avail_space1() and validate_geometry1()
and does slightly different things.
The partial code sharing doesn't really help. In particularly the
responsibility for setting the size of the array is currently
confused.
So duplicate the code into the two locations - one where 'super' is
always NULL (validate_geometry1) and one where it is never NULL
(avail_space1), and simplify.
This call to validate_geometry is really rather gratuitous.
It is purely about the fact that super0 cannot use more than 4TB.
So just make it an explicit test - less confusing that way.
With this, validate_geometry is only called from Create, which
makes it easier to reason about.
Also validate_geometry is now never passed NULL for the 'chunk'
parameter, so we can remove those annoying tests for NULL.
Grow: don't hold array open while waiting for reshape.
If we will need to change array level when a reshape completes, a copy
of mdadm waits in the background.
Currently this copy hold the device (/dev/mdX) open. This prevents
the array from being stopped.
So close the file descriptor and re-open after the reshape completes.
super1: update data_size when performing "revert-reshape".
The "data_size" is with respect to "data_offset". When the kernel
changes "data_offset" it modifies "data_size" to match - see
md_finish_reshape() in the kernel.
So when mdadm switches the data_offset for the new data_offset, it
must update data_size correspondingly.
Let the first created array be RAID5 rather than RAID0. This makes
the test harder than before, because everything after the first
Create has do be done indirectly through mdmon.
This was waiting on "reshape_position" which doesn't
get update events.
Before sysfs_wait was introduced, the code to wait didn't
wait at all, so it spun.
With sysfs_wait, it would wait forever.
Change to wait in sync_completed which does get events.
* Aligned the spelling of “RAID” to use captial letters in all places.
* Aligned the spelling of the RAID level names (LINEAR, RAID1, …) to use capital
letters in all places, except for the string “faulty” in places where not the
RAID level was meant.
Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Stop: fix up synchronising end of reshape to good boundary.
If we stop too soon after reshape starts (probably only during
testing), we can get confused by the status of the reshape.
If that might be happening - sleep a bit longer.
Also allow for reshape going unusually slowly (again, probably only
during testing).
Grow: use mdstat_wait to wait for delayed reshape.
Having a fix time for a wait is clumsy and can make us
wait much too long.
So use mdstat_wait and keep the mdstat_fd open.
This requires an 'mdstat_close' so it doesn't stay open
forever.
DDF load headers: if primary is invalid, don't check fields.
Currently we compare fields between primary and secondary
superblocks, before we check if the primary is even valid.
This is a bit backwards, so reverse it.
The "indirect" code path for adding VDs was not working correctly
for secondary RAID level. The "other BVDs" were not transmitted
to mdmon. Thus mdmon wouldn't build up correct information, and
RAID creation would fail when mdmon was already running on the container.
The return value of disk.raid_disk may be wrong.
The old code was using raiddisk, which is only valid with auto
layout. This leads to errors when arrays are created with
specified disks and mdmon is already running, like this:
Implement kill_subarray, for mdmon running and not running.
The way Kill_subarray() is implemented, this requires that the
DDF layer uses "currentconf" to remember the last subarray
queried with container_content(), and use it as the one to kill.
I don't like this much but IMSM does it the same way.
DDF: write_init_super_ddf: don't zero superblocks for subarrays
commit d682f344 inserted this call to "Kill" in write_init_super_ddf:
"Matching the functionality already in super0 and super1, when
we first create a container, remove any other recognisable metadata to
ensure it doesn't cause confusion."
But we should do this only at first container creation, not when
subarrays are created later.
Monitor: Don't write metadata in inactive array state
The kernel docs state that meta data is never written in states
clear, inactive, suspended, readonly, and read_auto.
Why should this be different for containers?
We need to write metadata when the array is disabled, though.
Tested with the DDF (10*) and IMSM (9*) tests, works.