Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
imsm: enable checkpointing of migration (resync/rebuild)
When the array is shut down, or when mdadm --wait-clean is called, any
active resync process will be idled allowing mdmon to record the current
resync position.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
Extend --wait-clean to checkpoint resync
Root file systems backed by external metadata arrays need to be
explicitly checkpointed near the time the rootfs is marked readonly as
userspace will not have an opportunity to react to the final shutdown of
the array.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
--wait-clean: shorten timeout
Set the safemode timeout to a small value to get the array marked clean as
soon as possible. We don't write 'clean' directly as it may cause mdmon to
miss a 'write-pending' event.
Include a couple of fixes to sysfs_set_safemode():
1/ zero-pad the milliseconds field
2/ work around input truncation in the kernel
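A minimal sketch of the zero-padding fix, assuming the delay is handed to the kernel as a "seconds.milliseconds" string. format_safemode_delay() is a hypothetical helper, not the mdadm function, and the kernel truncation workaround is not modeled here.

```c
#include <stdio.h>

/* Hypothetical helper: render a delay given in milliseconds as
 * "sec.msec". The %03lu conversion zero-pads the fraction, so 50 ms
 * becomes "0.050" rather than "0.50" (which would read as 500 ms). */
int format_safemode_delay(unsigned long ms, char *buf, size_t len)
{
	unsigned long sec = ms / 1000;
	unsigned long msec = ms % 1000;

	return snprintf(buf, len, "%lu.%03lu", sec, msec);
}
```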
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
monitor: protect against CONFIG_LBD=n
md/resync_start reports different terminal values depending on kernel
configuration (~0UL versus ~0ULL). Make detection of the
resync-complete state more robust by comparing against array size.
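A rough sketch of the comparison, assuming both values are in the same units (sectors). resync_is_complete() is a hypothetical name used only for illustration.

```c
/* Hypothetical helper: the kernel may report ~0UL (32-bit sector_t,
 * CONFIG_LBD=n) or ~0ULL as the "resync finished" value. Rather than
 * matching either sentinel exactly, treat any position at or beyond
 * the end of the array as complete. */
int resync_is_complete(unsigned long long resync_start,
		       unsigned long long array_sectors)
{
	return resync_start >= array_sectors;
}
```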
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:03 +0000 (12:12 -0700)]
imsm: trust sector reservation from metadata
On ich6r the option-rom appears to reserve only 432 sectors rather than
the 418+4096 of newer implementations. For compatibility trust the
metadata in these cases.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Don't try to set_array_info when -I finds new devices for an array.
When -I gets a new device for a container and tries to incrementally
assemble the container array, it calls sysfs_set_array to create the
array without first checking if it already exists. This produces
unpleasant error messages.
Allow metadata handler to report that it doesn't record homehost.
For now, this means that the lack of a homehost doesn't always prevent
assembly.
Soon we will allow assembly anyway, but have different messages if
homehost isn't supported.
Don't allow spares when creating 'external' arrays.
It is meaningless when creating the container, and for
subarrays, the container is responsible for assigning
spares.
Also, don't do the 'spare' fiddle for raid5 as we cannot
set up a spare at this point yet. Later maybe just create
the array degraded and let the container sort it out.
The variety of approaches to 'add_disk' are factored out into
a separate function, and Incremental mode benefits by being
closer to supporting the assembly of containers.
Also move the adding-to-array-data-structure step out of sysfs_add_disk
and into add_disk.
And add some tests for --incremental mode to make sure we don't break it.
Dan Williams [Tue, 16 Sep 2008 03:58:43 +0000 (20:58 -0700)]
mdmon: recreate socket/pid file on SIGHUP
Allow mdmon to start while /var/run/mdadm is readonly. Later a SIGHUP
can trigger mdmon to drop its pid and socket once /var/run/mdadm is
writable. Of course one needs the pid to send a HUP; that can be stored
in a distribution-specific rw-init directory... For now, rely on a
killall -HUP mdmon to get the files dumped.
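A minimal sketch of the deferred pid file idea, assuming a flag set from the signal handler and polled from the main loop. install_hup_handler() and maybe_write_pidfile() are hypothetical names, and the socket side is left out.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t hup_pending;

static void on_hup(int sig)
{
	(void)sig;
	hup_pending = 1;	/* only set a flag; do the I/O later */
}

void install_hup_handler(void)
{
	signal(SIGHUP, on_hup);
}

/* Call from the main loop. Returns 0 if no HUP is pending, 1 if the
 * pid file was written, and -1 if the directory is still not writable
 * (the request stays pending until another call succeeds). */
int maybe_write_pidfile(const char *path)
{
	FILE *f;

	if (!hup_pending)
		return 0;
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)getpid());
	fclose(f);
	hup_pending = 0;
	return 1;
}
```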
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:43 +0000 (20:58 -0700)]
ping_manager() to prevent 'add' before 'remove' completes
It is currently possible to remove a device and re-add it without the
manager noticing, i.e. without detecting a mismatch between
mdstat->devcnt and container->devcnt. Introduce ping_manager() to arrange for
mdmon to run manage_container() prior to mdadm dropping the exclusive
open() on the container. Despite these precautions sysfs_read() may
still fail. If this happens invalidate container->devcnt to ensure
manage_container() runs at the next event.
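A small sketch of the invalidation trick, assuming the manager reruns manage_container() whenever the two counts differ. The struct and function names are hypothetical, and -1 is an assumed "never matches" value.

```c
/* Hypothetical stand-in for the container bookkeeping. */
struct container_sketch {
	int devcnt;	/* device count seen at the last manage pass */
};

/* The manager acts only when the counts disagree. */
int container_needs_manage(const struct container_sketch *c, int mdstat_devcnt)
{
	return c->devcnt != mdstat_devcnt;
}

/* After a failed sysfs read, store a value no real count can equal so
 * the next event is guaranteed to trigger a manage pass. */
void invalidate_devcnt(struct container_sketch *c)
{
	c->devcnt = -1;
}
```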
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:43 +0000 (20:58 -0700)]
sysfs: detect disks that are in the process of being removed
When removing a disk there is a window where the 'slot' attribute of
md/dev-$name will return -EBUSY to read attempts. When this happens
look at the 'block' link; if it is gone then we can be sure the
device has been removed, as opposed to some other error.
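A sketch of that check, assuming the caller already has the errno from the failed 'slot' read and the path of the md/dev-$name directory. Both function names are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if nothing exists at path any more, 0 if it is still
 * present or the lookup failed for some other reason. */
int link_is_gone(const char *path)
{
	if (access(path, F_OK) == 0)
		return 0;
	return errno == ENOENT;
}

/* Treat -EBUSY on 'slot' as "disk removed" only when the 'block' link
 * under the device directory has disappeared. */
int disk_removed_on_ebusy(const char *dev_dir, int slot_read_errno)
{
	char path[512];

	if (slot_read_errno != EBUSY)
		return 0;
	snprintf(path, sizeof(path), "%s/block", dev_dir);
	return link_is_gone(path);
}
```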
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:42 +0000 (20:58 -0700)]
'mdadm --wait-clean' wait for array to be marked clean
For use in distro shutdown scripts with a RAID root file system.
Returns immediately if the array is 'readonly', or not an externally
managed array. It is up to the distro's scripts to make sure no new
writes hit the device after this returns 'true'.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:42 +0000 (20:58 -0700)]
Add ping_monitor() to mdadm --wait
The action we are waiting for may not be complete until the monitor has
had a chance to take action on the result.
The following script can now remove the device on the first attempt,
versus a few attempts with the original Wait():
#!/bin/bash
#export MDADM_NO_MDMON=1
export IMSM_DEVNAME_AS_SERIAL=1
./mdadm -Ss
./mdadm --zero-superblock /dev/loop[0-3]
echo 2 > /proc/sys/dev/raid/speed_limit_max
./mdadm --create /dev/imsm /dev/loop[0-3] -n 4 -e imsm -a md
./mdadm --create /dev/md/r1 /dev/loop[0-3] -n 4 -l 5 --force -a mdp
./mdadm --fail /dev/md/r1 /dev/loop3
./mdadm --wait /dev/md/r1
x=0
while ! ./mdadm --remove /dev/imsm /dev/loop3 > /dev/null 2>&1
do
    x=$((x+1))
done
echo "removed after $x attempts"
./mdadm --add /dev/imsm /dev/loop3
Include 2 small cleanups:
* remove the almost open coded fd2devnum() in Wait() by introducing a
new utility routine stat2devnum()
* teach connect_monitor() to parse the container device from a subarray
string
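A sketch of the subarray parsing, assuming a subarray is named by a string of the form "/<container>/<index>" such as "/md127/0". That format and the helper name are assumptions for illustration, not taken from connect_monitor() itself.

```c
#include <string.h>

/* Hypothetical helper: copy the container name out of a subarray
 * string. Returns 0 on success, -1 if the text is not of the assumed
 * form or the name does not fit in out. */
int container_from_subarray(const char *text, char *out, size_t len)
{
	const char *start, *end;
	size_t n;

	if (text[0] != '/')
		return -1;
	start = text + 1;
	end = strchr(start, '/');
	if (!end || end == start)
		return -1;
	n = (size_t)(end - start);
	if (n >= len)
		return -1;
	memcpy(out, start, n);
	out[n] = '\0';
	return 0;
}
```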
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:41 +0000 (20:58 -0700)]
imsm: rectify map handling
The secondary map is used to reflect the migration state of the array
i.e. from dev->vol.map[1] to dev->vol.map[0]. Ensure a rebuilding /
initializing array is marked in the second map, while normal status is
reflected in the first map. Also mark rebuilding drives with
IMSM_ORD_REBUILD.
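A sketch of flagging a rebuilding drive in a map's ordinal table, assuming each entry is a 32-bit value whose low bits hold the disk index and whose top byte holds flags. The bit position used here is an assumption; check the real IMSM_ORD_REBUILD definition before relying on it.

```c
#include <stdint.h>

#define ORD_REBUILD_SKETCH (1u << 24)	/* assumed flag bit */
#define ORD_FLAGS_MASK     (0xffu << 24)

/* Tag an ordinal entry as belonging to a drive under rebuild. */
uint32_t ord_mark_rebuild(uint32_t ord)
{
	return ord | ORD_REBUILD_SKETCH;
}

/* Strip the flag byte to recover the plain disk index. */
uint32_t ord_to_idx(uint32_t ord)
{
	return ord & ~ORD_FLAGS_MASK;
}

int ord_is_rebuilding(uint32_t ord)
{
	return (ord & ORD_REBUILD_SKETCH) != 0;
}
```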
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:55:34 +0000 (20:55 -0700)]
imsm: mark failures like the Matrix driver
* Truncate the first character of the serial number
* Set 'scsi_id' to all f's
* Expect to find disk entries with unmatchable serial numbers, i.e.
expect get_imsm_disk() to return NULL in some situations
* Allow discrepancies between mpb->num_disks and len(super->disks)
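A sketch of the first two points, reading "truncate the first character" as shifting the serial left by one byte. That reading, the 16-byte serial length and the struct layout are assumptions, and byte-order conversion of scsi_id is ignored.

```c
#include <stdint.h>
#include <string.h>

#define SERIAL_LEN_SKETCH 16	/* assumed serial field length */

/* Hypothetical, simplified disk record. */
struct disk_sketch {
	uint8_t  serial[SERIAL_LEN_SKETCH];
	uint32_t scsi_id;
};

void mark_failure_sketch(struct disk_sketch *d)
{
	/* Drop the first character so the serial no longer matches any
	 * physical disk, then clear the byte that was vacated. */
	memmove(&d->serial[0], &d->serial[1], SERIAL_LEN_SKETCH - 1);
	d->serial[SERIAL_LEN_SKETCH - 1] = 0;

	/* All f's marks the slot as not backed by a real device. */
	d->scsi_id = ~(uint32_t)0;
}
```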
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 19 Aug 2008 07:19:51 +0000 (17:19 +1000)]
mdadm: add device to a container
Adding a device updates the container and then mdmon takes action upon
noticing a change in devices. This reuses the container version of
add_to_super to create a new record for the device.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Dan Williams [Tue, 19 Aug 2008 04:55:12 +0000 (14:55 +1000)]
mdmon: remove devices from container
Once the monitor thread has kicked a drive from all managed arrays mdadm
-r is permitted. We are guaranteed that the drive is marked failed at
this point, so allow the drive to be re-added as a spare.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 19 Aug 2008 04:55:10 +0000 (14:55 +1000)]
imsm: delete kicked disks
When we have determined that a disk is no longer of any value, remove
it from the data structure. This is now safe because the manager
will back off while any metadata update is pending in the monitor.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Tue, 19 Aug 2008 04:55:03 +0000 (14:55 +1000)]
Make metadata updates from manage to monitor 'synchronous'
A metadata update may modify the data structure of the metadata
including freeing things, so it is not safe for the manager to touch
the metadata while an update is pending in the monitor.
So when an update has been submitted, don't do anything else in the
manager until it is complete.
NeilBrown [Tue, 19 Aug 2008 04:54:55 +0000 (14:54 +1000)]
Extra option for set_array_state: you choose dirty or clean.
When we first start an array, it might be good to start recovery
straight away. That requires setting the array to 'dirty', but
only the metadata handler can know if that is required or not.
So have a third possible 'consistent' option to set_array_state.
Either 'no' or 'yes' or 'you choose'.
Return value indicates what was chosen.
'1' (no) should be chosen unless there is a good reason.
Dan Williams [Tue, 12 Aug 2008 09:25:49 +0000 (02:25 -0700)]
imsm: fix up assembly of disks that are not in-sync
1/ Do not assemble !in_sync or failed devices in container_content.
2/ Prevent activation of failed or configured devices in activate_spare.
3/ Be sure to avoid dirty degraded if the array was shut down cleanly.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 12 Aug 2008 09:25:46 +0000 (02:25 -0700)]
mdmon: use activate spare for re-add
Disks that are not in-sync or failed are not assembled into member
arrays by mdadm. Teach mdmon to resolve this situation by checking for
spares at start. imsm_activate_spare() is updated to prefer devices
that can be re-added versus new spares.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 10 Aug 2008 03:28:24 +0000 (20:28 -0700)]
imsm: fix handling of the 'migr_state' and 'migr_type' bits
The option-rom and the Matrix driver mark resyncs/rebuilds with the
migrate state bits. Update sizeof_imsm_dev to allow allocation of
imsm_dev entries large enough to grow if migr_state is later set.
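A sketch of the sizing idea, assuming a device record is a fixed header followed by one map, with a second map of equal size present only during migration. The struct layout and both function names are hypothetical and much simpler than the real metadata.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map header; one 32-bit ordinal per member follows it. */
struct map_hdr_sketch {
	uint32_t blocks_per_member;
	uint8_t  num_members;
	uint8_t  raid_level;
};

size_t map_size_sketch(unsigned int num_members)
{
	return sizeof(struct map_hdr_sketch) + num_members * sizeof(uint32_t);
}

/* Size of a record holding map[0], plus room for map[1] when a
 * migration is in progress or when the caller reserves space so that
 * migr_state can be set later without reallocating. */
size_t dev_size_sketch(size_t dev_hdr_size, unsigned int num_members,
		       int migr_state, int reserve_second_map)
{
	size_t size = dev_hdr_size + map_size_sketch(num_members);

	if (migr_state || reserve_second_map)
		size += map_size_sketch(num_members);
	return size;
}
```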
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Wed, 6 Aug 2008 16:09:25 +0000 (09:09 -0700)]
imsm: spare devices are represented as single disk containers
This poses a small problem for the case of handling multiple raid1 arrays
across separate disk pairs i.e. 2 mirrors on 4 disks. The option-ROM will
configure this as two containers. We may need the capability for one
container to ask for an unused spare in another container. For now spares
will just maintain the affinity established at assemble time.
To support this configuration spare devices must be allowed to be assembled
into the container even though the metadata indicates the disk belongs to a
different family.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Jacek Danecki [Thu, 7 Aug 2008 06:55:53 +0000 (23:55 -0700)]
imsm: bad block management (phase1)
This is the initial defensive implementation of bad block management
support. It simply precludes assembly if there are entries in the bad
block logs. This is sufficient for now as the conditions that lead to
an entry in the bad block log would cause the array to be failed by MD
(as of 2.6.27).
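The defensive check amounts to something like the following sketch. The struct and its entry_count field are hypothetical stand-ins for the bad block log in the metadata.

```c
/* Hypothetical stand-in for the bad block log header. */
struct bbm_log_sketch {
	unsigned int entry_count;
};

/* Phase 1 policy: a missing or empty log is fine; any recorded bad
 * block precludes assembly. */
int assembly_allowed(const struct bbm_log_sketch *log)
{
	return !log || log->entry_count == 0;
}
```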
[dan.j.williams@intel.com: general cleanups]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Wed, 30 Jul 2008 02:01:06 +0000 (19:01 -0700)]
mdmon: ignore inactive arrays and other manage_new() cleanups
While mdadm is constructing an array mdmon may see an intermediate state
(some disks not yet added / redundancy attributes like sync_action not
available). Waiting for mdstat->active == true ensures that the array
is ready to be handled. This fixes a bug in create array via mdmon
update whereby failures are not detected in the new array.
Introduce aa_ready() to catch cases where the active_array is not
correctly initialized. Barring a kernel bug this should never trigger,
nonetheless it precludes a class of bugs like the one mentioned above
from triggering.
Clean up the exit paths and only call replace_array when the new array is
ready to be inserted into container->arrays.
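A sketch of the kind of check aa_ready() performs, assuming "correctly initialized" means every sysfs attribute file descriptor the monitor needs is open. The structs and field names here are simplified and hypothetical.

```c
#include <stddef.h>

/* Hypothetical, simplified views of a member device and an array. */
struct dev_sketch {
	int state_fd;
	struct dev_sketch *next;
};

struct array_sketch {
	int level;
	int state_fd;
	int action_fd;		/* sync_action; redundant levels only */
	int resync_start_fd;
	struct dev_sketch *devs;
};

/* Returns 1 only when every descriptor the monitor relies on is open. */
int array_ready_sketch(const struct array_sketch *a)
{
	const struct dev_sketch *d;

	for (d = a->devs; d; d = d->next)
		if (d->state_fd < 0)
			return 0;
	if (a->state_fd < 0)
		return 0;
	if (a->level > 0 && (a->action_fd < 0 || a->resync_start_fd < 0))
		return 0;
	return 1;
}
```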
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Fri, 25 Jul 2008 23:59:47 +0000 (16:59 -0700)]
imsm: refactor mpb handling into parse and coalesce
Maintaining a single global buffer is unwieldy when extending/rewriting
sections of the metadata. Parse the metadata into component data
structures upon reading and coalesce to a coherent buffer before
writing.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Fri, 25 Jul 2008 00:26:24 +0000 (17:26 -0700)]
sysfs: deprecate sysfs_disk_to_sg
The cmd_filter patch merged for 2.6.27 broke retrieving the serial
number via an ioctl to /dev/sgN. In debugging this I found that other
utilities like sdparm simply run the ioctl on /dev/sdX. So just convert
to that for safety in numbers, but scream on the mailing list for
the inconvenience grr...
Signed-off-by: Dan Williams <dan.j.williams@intel.com>