NeilBrown [Mon, 3 Nov 2008 23:35:08 +0000 (10:35 +1100)]
Move recently merged /sys/dev/ lookup into stat2devnum.
Both sysfs_init and stat2devnum try to convert stat information
into an md devnum. Combine the value of both pieces of code
into stat2devnum and have sysfs_init call that.
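A minimal sketch of the kind of shared helper described above, not the actual mdadm code. MD_MAJOR being 9 is the standard md major; the mdp major is passed in because it is assigned dynamically. The negative encoding for partitioned arrays, the 64-minors-per-array shift, the NOT_MD sentinel, and both function names are assumptions made for illustration.

```c
#include <limits.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

#define MD_MAJOR 9          /* major number of non-partitioned md devices */
#define MDP_MINOR_SHIFT 6   /* assumed: 64 minors per partitioned array */
#define NOT_MD INT_MIN      /* hypothetical "not an md device" sentinel */

/* Map a (major, minor) pair to a devnum.
 * Non-partitioned arrays map to their minor number; partitioned
 * arrays are assumed to be encoded as negative numbers. */
int majmin_to_devnum(unsigned maj, unsigned min, unsigned mdp_major)
{
    if (maj == MD_MAJOR)
        return (int)min;
    if (maj == mdp_major)
        return -1 - (int)(min >> MDP_MINOR_SHIFT);
    return NOT_MD;
}

/* Single entry point that other code (e.g. a sysfs_init-like caller)
 * can use instead of duplicating the conversion. */
int stat_to_devnum(const struct stat *st, unsigned mdp_major)
{
    if (!S_ISBLK(st->st_mode))
        return NOT_MD;
    return majmin_to_devnum(major(st->st_rdev), minor(st->st_rdev),
                            mdp_major);
}
```

With one conversion routine, a caller that already holds a struct stat only needs stat_to_devnum(&st, mdp_major) and a check against NOT_MD.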
NeilBrown [Sun, 2 Nov 2008 20:19:37 +0000 (07:19 +1100)]
mapfile: fix bug in testing for /var/run/mdadm/
If /var/run/mdadm/ did not exist as a directory, the map file
should have been created as /var/run/mdadm.map, but due to a bug
in the directory test it was never created.
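The intended fallback can be sketched as below. This is an illustration only: the function name is hypothetical, and the "<dir>/map" name used when the directory exists is an assumption (the text above only names the /var/run/mdadm.map fallback).

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Pick the map file location for a given run directory.
 * If 'dir' exists and is a directory, use "<dir>/map" (assumed name);
 * otherwise fall back to "<dir>.map". Returns 'buf'. */
const char *choose_map_path(const char *dir, char *buf, size_t len)
{
    struct stat st;

    if (stat(dir, &st) == 0 && S_ISDIR(st.st_mode))
        snprintf(buf, len, "%s/map", dir);
    else
        snprintf(buf, len, "%s.map", dir);
    return buf;
}
```

Called with "/var/run/mdadm", this yields /var/run/mdadm.map whenever the directory is missing, which is the case the fix restores.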
NeilBrown [Sun, 2 Nov 2008 19:39:02 +0000 (06:39 +1100)]
Incremental: change precedence order for autof setting.
It doesn't really make sense for the --auto setting to ever over-ride
the setting on an ARRAY line. That could cause failure if the
ARRAY line has a 'standard' name. So revert to the array line having
precedence over command line, then CREATE line last.
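The precedence order described above can be sketched as a simple first-set-wins selection. The function name and the convention that 0 means "not set" are assumptions for illustration, not taken from the mdadm source.

```c
/* Return the effective auto setting.
 * Precedence: ARRAY line first, then the command line (--auto),
 * then the CREATE line. A value of 0 is assumed to mean "not set". */
int effective_autof(int array_line, int cmdline, int create_line)
{
    if (array_line)
        return array_line;
    if (cmdline)
        return cmdline;
    return create_line;
}
```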
NeilBrown [Thu, 30 Oct 2008 05:37:29 +0000 (16:37 +1100)]
Adjust major number testing to allow for extended minor number in 2.6.28
From 2.6.28, normal md devices will be able to have partitions. These
partitions will have a different major number. Sometimes mdadm tests
the major number and so can get confused.
Change these tests to test against get_mdp_major(). mdp does not use
extended minor number and so this test will always be accurate.
Also use /sys/dev links to map major/minor to devnum in sysfs.
NeilBrown [Wed, 29 Oct 2008 22:48:18 +0000 (09:48 +1100)]
Incremental: allow assembly of foreign array.
If a foreign (i.e. not known to be local) array is discovered
by --incremental assembly, we now assemble it. However we ignore
any name information in the array so as not to potentially create
a name that conflicts with a 'local' array.
Also, foreign arrays are always assembled 'read-auto' to avoid writing
anything until the array is actually used.
NeilBrown [Wed, 29 Oct 2008 22:34:04 +0000 (09:34 +1100)]
Fix --incremental assembly of partitioned arrays.
If incremental assembly finds an array mentioned in mdadm.conf,
with a 'standard partitioned' name like /dev/md_d0 or /dev/md/d0,
it will not create a partitioned array like it should.
This is because it mishandled the 'devnum' returned by
is_standard.
That is a devnum that does not have the partition-or-not encoded
into it. So we need to check the actual return value of
is_standard and encode the partition-or-not info into the devnum.
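A hedged sketch of the encoding step. It assumes that is_standard reports a positive value for a partitioned name, a negative value for a non-partitioned name, and hands back the bare number separately, and that partitioned arrays are encoded as negative devnums. Both conventions and the function name are assumptions for illustration.

```c
/* Fold the partition-or-not result into the devnum.
 * 'std' is the (assumed) is_standard-style result:
 *   > 0  name indicates a partitioned array (e.g. /dev/md_d0)
 *   < 0  name indicates a non-partitioned array (e.g. /dev/md0)
 * 'num' is the bare number parsed from the name. */
int encode_devnum(int std, int num)
{
    if (std > 0)
        return -1 - num;   /* assumed encoding for partitioned arrays */
    return num;            /* non-partitioned: devnum is the number */
}
```

Under these assumptions /dev/md_d0 (std > 0, num 0) becomes devnum -1, while /dev/md0 (std < 0, num 0) stays 0, so the two can no longer be confused.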
Doug Ledford [Wed, 29 Oct 2008 19:05:36 +0000 (15:05 -0400)]
Fix NULL pointer oops
RAID10 is the only raid level that uses the avail char array pointer
during the enough() operation, so it was the only one that saw this.
The code in incremental assumes unconditionally that count_active will
allocate the avail char array, that it might be used by enough, and that
it will need to be freed afterward. Once you make count_active actually
do that, then the oops goes away.
Doug Ledford [Wed, 29 Oct 2008 19:05:35 +0000 (15:05 -0400)]
Fix bad metadata formatting
Certain operations (Detail.c mainly) would print out the metadata of
an array in a format that the scan operation in super0.c and super1.c
would later reject as unknown when it was found in the mdadm.conf file.
Use a consistent format, but also modify the super0 and super1 match
methods to accept the other format without complaint.
NeilBrown [Sat, 25 Oct 2008 07:20:49 +0000 (18:20 +1100)]
Allow WRITEMOSTLY to be cleared on --readd using --readwrite.
Previously it was possible to set the WRITEMOSTLY flag when
adding a device to an array, but not to clear the flag when re-adding.
This is now possible with --readwrite.
NeilBrown [Fri, 17 Oct 2008 00:52:38 +0000 (11:52 +1100)]
Remove .UR .UE macros from man page because they don't do what we want.
.UR URL
text
.UE
is meant to create a hyperlink from the 'text' to the 'URL'.
But I wanted just to have the URL, so UR isn't really the right
tool - the URL gets displayed twice.
So just display the URL in bold and assume man2html etc can recognise
it and do the right thing.
Dan Williams [Fri, 3 Oct 2008 05:26:00 +0000 (22:26 -0700)]
mdmon: suicide prevention
mdmon cannot remove the pidfile at shutdown because it needs to stay
running across the "mount -o remount,ro /" event. When it relaunches
after a reboot there is a good chance that the pid will match what was
there previously. The result is that the "take over for unresponsive
mdmon" logic results in self termination.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
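For the pid-reuse problem in the commit above, the guard can be sketched as a check made before signalling whatever pid a stale pidfile names. The function name is hypothetical and this is not the actual mdmon logic.

```c
#include <sys/types.h>
#include <unistd.h>

/* Decide whether the pid found in an old pidfile may be signalled.
 * Returns 0 when the recorded pid is invalid or is our own pid
 * (the pid was reused after reboot), 1 otherwise. */
int may_kill_recorded_pid(pid_t recorded, pid_t self)
{
    if (recorded <= 0)
        return 0;
    return recorded != self;
}
```

A caller would pass getpid() as 'self' and only proceed to kill(2) when the function returns 1.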
Dan Williams [Thu, 2 Oct 2008 22:50:23 +0000 (15:50 -0700)]
mdmon: --switch-root
For raid rootfs we cannot run the array unmonitored for any length of
time. At least XFS will not mount/replay the journal if the underlying
block device is readonly (FIXME it also seems that XFS does not always
honor the ro status of the backing device as I was able to hit the
BUG_ON(mddev->ro == 1) in md_write_start... but I digress).
So we need to start mdmon in the initramfs before '/' is mounted and
then restart it after the real rootfs is available. Upon seeing the
--switch-root option, mdmon will kill any victims in the current
/var/run/mdadm directory and then chroot(2) before continuing.
The option is deliberately called 'switch-root' instead of 'chroot' to
hopefully indicate that this is different than doing "chroot mdmon
/dev/imsm".
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Thu, 2 Oct 2008 22:42:57 +0000 (15:42 -0700)]
mdmon: wait after trying to kill
Now that mdmon handles SIGTERM, if another monitor wants to take over it
should wait until all managed arrays are clean. So make WaitClean()
available to mdmon and teach try_kill_monitor() to wait on each subarray
in the container.
...since we may be communicating with a dying process, we need to
block SIGPIPE earlier.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
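Blocking SIGPIPE before talking to a peer that may exit can be sketched with sigprocmask(2), so that a write to a closed socket fails with EPIPE instead of killing the writer. The helper names are hypothetical; this is one way to do it, not necessarily what mdmon does.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Block SIGPIPE for the calling thread. Returns 0 on success. */
int block_sigpipe(void)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGPIPE);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Report whether SIGPIPE is currently blocked: 1 yes, 0 no, -1 error. */
int sigpipe_is_blocked(void)
{
    sigset_t cur;

    /* A NULL 'set' leaves the mask unchanged and only queries it. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, SIGPIPE);
}
```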
Dan Williams [Thu, 2 Oct 2008 13:32:08 +0000 (06:32 -0700)]
mdmon: terminate clean
We generally don't want mdmon to be terminated, but if a SIGTERM gets
through try to leave the monitored arrays in a clean state, block
attempts to mark the array dirty, and stop servicing the socket.
When we are killed by SIGTERM, don't remove the pidfile; let that be
cleaned up by the next monitor.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Thu, 2 Oct 2008 01:50:44 +0000 (18:50 -0700)]
Treat all devices at the container level as spares
Raid disk and disk number information is not relevant at the container
level, especially for imsm. So arrange for getinfo_super_imsm() to
always publish devices as spares and report the number of spares at
Assemble() time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Thu, 2 Oct 2008 01:50:43 +0000 (18:50 -0700)]
mdmon: periodically retry to create the socket
If initial socket creation fails with EROFS, set a periodic alarm to wake up
the manager and retry. Include a kernel patch that will wake us up if
the mount flags are changed.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:07 +0000 (12:12 -0700)]
trivial warn_unused_result squashing
Made the mistake of recompiling the F9 mdadm rpm which has a patch to
remove -Werror and add "-Wp,-D_FORTIFY_SOURCE -O2" which turns on lots
of errors:
config.c:568: warning: ignoring return value of asprintf
Assemble.c:411: warning: ignoring return value of asprintf
Assemble.c:413: warning: ignoring return value of asprintf
super0.c:549: warning: ignoring return value of posix_memalign
super0.c:742: warning: ignoring return value of posix_memalign
super0.c:812: warning: ignoring return value of posix_memalign
super1.c:692: warning: ignoring return value of posix_memalign
super1.c:1039: warning: ignoring return value of posix_memalign
super1.c:1155: warning: ignoring return value of posix_memalign
super-ddf.c:508: warning: ignoring return value of posix_memalign
super-ddf.c:645: warning: ignoring return value of posix_memalign
super-ddf.c:696: warning: ignoring return value of posix_memalign
super-ddf.c:715: warning: ignoring return value of posix_memalign
super-ddf.c:1476: warning: ignoring return value of posix_memalign
super-ddf.c:1603: warning: ignoring return value of posix_memalign
super-ddf.c:1614: warning: ignoring return value of posix_memalign
super-ddf.c:1842: warning: ignoring return value of posix_memalign
super-ddf.c:2013: warning: ignoring return value of posix_memalign
super-ddf.c:2140: warning: ignoring return value of write
super-ddf.c:2143: warning: ignoring return value of write
super-ddf.c:2147: warning: ignoring return value of write
super-ddf.c:2150: warning: ignoring return value of write
super-ddf.c:2162: warning: ignoring return value of write
super-ddf.c:2169: warning: ignoring return value of write
super-ddf.c:2172: warning: ignoring return value of write
super-ddf.c:2176: warning: ignoring return value of write
super-ddf.c:2181: warning: ignoring return value of write
super-ddf.c:2686: warning: ignoring return value of posix_memalign
super-ddf.c:2690: warning: ignoring return value of write
super-ddf.c:3070: warning: ignoring return value of posix_memalign
super-ddf.c:3254: warning: ignoring return value of posix_memalign
bitmap.c:128: warning: ignoring return value of posix_memalign
mdmon.c:94: warning: ignoring return value of write
mdmon.c:221: warning: ignoring return value of pipe
mdmon.c:327: warning: ignoring return value of write
mdmon.c:330: warning: ignoring return value of chdir
mdmon.c:335: warning: ignoring return value of dup
monitor.c:415: warning: rv may be used uninitialized in this function
...some of these like the write() ones are not so trivial so save those
fixes for the next patch.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
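For the posix_memalign warnings listed above, the usual remedy is to check the return value, which is an error number rather than -1 with errno. A minimal sketch with a hypothetical wrapper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate 'size' bytes aligned to 'align'.
 * Returns NULL (after reporting the error) if the allocation fails
 * or 'align' is not a power of two multiple of sizeof(void *). */
void *alloc_aligned(size_t align, size_t size)
{
    void *buf = NULL;
    int err = posix_memalign(&buf, align, size);

    if (err != 0) {
        /* posix_memalign returns the error code directly. */
        fprintf(stderr, "posix_memalign: %s\n", strerror(err));
        return NULL;
    }
    return buf;
}
```

Callers then test the returned pointer before use and release it with free().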
Dan Williams [Sun, 28 Sep 2008 19:12:07 +0000 (12:12 -0700)]
imsm: determine failed indexes from the most up-to-date disk
load_imsm_disk() currently notices if spares missed their activation
update, but we allow a stale failed disk back in to the array because its
serial number is clobbered in the most up-to-date disk.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:07 +0000 (12:12 -0700)]
imsm: manage a list of missing disks
If a drive is removed while mdmon is not running we need a way to
identify what is missing and mark that disk as failed in the metadata.
At ->load_super() time create a list of missing disks defined as a disk
that is marked in-sync yet does not appear in super->disks.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
imsm: enable checkpointing of migration (resync/rebuild)
When the array is shut down, or when mdadm --wait-clean is called, any
active resync process will be idled allowing mdmon to record the current
resync position.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
Extend --wait-clean to checkpoint resync
Root file systems backed by external metadata arrays need to be
explicitly checkpointed near the time the rootfs is marked readonly as
userspace will not have an opportunity to react to the final shutdown of
the array.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
--wait-clean: shorten timeout
Set the safemode timeout to a small value to get the array marked clean as
soon as possible. We don't write 'clean' directly as it may cause mdmon to
miss a 'write-pending' event.
Include a couple fixes to sysfs_set_safemode():
1/ 0 pad the milliseconds field
2/ workaround input truncation in the kernel
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
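The zero-padding fix matters because a delay of 50 ms printed as "0.50" reads as half a second. A sketch of the formatting, assuming the value is written as seconds with a three-digit fractional part; the function name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Format a delay given in milliseconds as "<sec>.<msec>" with the
 * millisecond field zero-padded to three digits. Returns 'buf'. */
const char *format_safemode_delay(unsigned long ms, char *buf, size_t len)
{
    snprintf(buf, len, "%lu.%03lu", ms / 1000, ms % 1000);
    return buf;
}
```

For example, 50 ms becomes "0.050" and 1500 ms becomes "1.500".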
Dan Williams [Sun, 28 Sep 2008 19:12:06 +0000 (12:12 -0700)]
monitor: protect against CONFIG_LBD=n
md/resync_start reports different terminal values depending on kernel
configuration (~0UL versus ~0ULL). Make detection of the
resync-complete state more robust by comparing against array size.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
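Comparing against the array size avoids depending on which "all ones" value the kernel reports. A minimal sketch, assuming both values are in the same unit (sectors here); the function name is hypothetical.

```c
/* Resync is treated as complete once resync_start is at or beyond
 * the end of the array. This holds for both ~0UL (32-bit sector_t)
 * and ~0ULL terminal values. */
int resync_is_complete(unsigned long long resync_start,
                       unsigned long long array_sectors)
{
    return resync_start >= array_sectors;
}
```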
Dan Williams [Sun, 28 Sep 2008 19:12:03 +0000 (12:12 -0700)]
imsm: trust sector reservation from metadata
On ich6r the option-rom appears to reserve only 432 sectors rather than
the 418+4096 of newer implementations. For compatibility trust the
metadata in these cases.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
NeilBrown [Wed, 15 Oct 2008 03:34:18 +0000 (14:34 +1100)]
Grow: Fix linear-growth when devices are not all the same size.
If we add a device to a linear array which is a different size
to the other devices in the array then, for v1.x metadata, we need to
make sure the size is correctly reflected in the superblock.
NeilBrown [Mon, 13 Oct 2008 05:15:16 +0000 (16:15 +1100)]
Manage: allow adding device that is just large enough to v1.x array.
When adding a device to an array, we check that it is large enough.
Currently the check makes sure there is also room for a reasonably
sized bitmap. But if the array doesn't have a bitmap, then this test
might be too restrictive.
So when adding, only insist there is enough space for the current
bitmap.
When Creating, still require room for the standard sized bitmap.
Don't try to set_array_info when -I finds new devices for an array.
When -I gets a new device for a container and tries to incrementally
assemble the container array, it calls sysfs_set_array to create the
array without first checking if it already exists. This produces
unpleasant error messages.
Allow metadata handler to report that it doesn't record homehost.
For now, this means that the lack of a homehost doesn't always prevent
assembly.
Soon we will allow assembly anyway, but have different messages if
homehost isn't supported.
Don't allow spares when creating 'external' arrays.
It is meaningless when creating the container, and for
subarrays, the container is responsible for assigning
spares.
Also, don't do the 'spare' fiddle for raid5 as we cannot
set up a spare at this point yet. Later maybe just create
the array degraded and let the container sort it out.
The variety of approaches to 'add_disk' are factored out into
a separate function, and Incremental mode benefits by being
closer to supporting the assembly of containers.
Also remove the adding-to-array-data-structure out of sysfs_add_disk
and into add_disk.
And add some tests for --incremental mode to make sure we don't break it.
Dan Williams [Tue, 16 Sep 2008 03:58:43 +0000 (20:58 -0700)]
mdmon: recreate socket/pid file on SIGHUP
Allow mdmon to start while /var/run/mdadm is readonly. Later a SIGHUP
can trigger mdmon to drop its pid and socket once /var/run/mdadm is
writable. Of course one needs the pid to send a HUP; that can be stored
in a distribution specific rw-init directory... For now, rely on a
killall -HUP mdmon to get the files dumped.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:43 +0000 (20:58 -0700)]
ping_manager() to prevent 'add' before 'remove' completes
It is currently possible to remove a device and re-add it without the
manager noticing, i.e. without detecting a mismatch between
mdstat->devcnt and container->devcnt. Introduce ping_manager() to arrange for
mdmon to run manage_container() prior to mdadm dropping the exclusive
open() on the container. Despite these precautions sysfs_read() may
still fail. If this happens invalidate container->devcnt to ensure
manage_container() runs at the next event.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:43 +0000 (20:58 -0700)]
sysfs: detect disks that are in the process of being removed
When removing a disk there is a window where the 'slot' attribute of
md/dev-$name will return -EBUSY to read attempts. When this happens
look at the 'block' link, if it is removed then we can be sure the
device has been removed, versus some other error.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
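The check in the commit above can be sketched as follows: when reading 'slot' fails with EBUSY, look for the 'block' link in the same directory. The function name and the way the directory path is passed in are assumptions for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* 'dev_dir' is the md/dev-$name directory in sysfs (path supplied by
 * the caller); 'read_errno' is the errno from the failed 'slot' read.
 * Returns 1 if the failure was EBUSY and the 'block' link is gone,
 * meaning the device has been removed; 0 for any other situation. */
int dev_was_removed(const char *dev_dir, int read_errno)
{
    char path[4096];
    int n;

    if (read_errno != EBUSY)
        return 0;
    n = snprintf(path, sizeof(path), "%s/block", dev_dir);
    if (n < 0 || n >= (int)sizeof(path))
        return 0;   /* path too long: cannot tell, report "not removed" */
    return access(path, F_OK) != 0;
}
```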
Dan Williams [Tue, 16 Sep 2008 03:58:42 +0000 (20:58 -0700)]
'mdadm --wait-clean' wait for array to be marked clean
For use in distro shutdown scripts with a RAID root file system.
Returns immediately if the array is 'readonly', or not an externally
managed array. It is up to the distro's scripts to make sure no new
writes hit the device after this returns 'true'.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:42 +0000 (20:58 -0700)]
Add ping_monitor() to mdadm --wait
The action we are waiting for may not be complete until the monitor has
had a chance to take action on the result.
The following script can now remove the device on the first attempt,
versus a few attempts with the original Wait():
#!/bin/bash
#export MDADM_NO_MDMON=1
export IMSM_DEVNAME_AS_SERIAL=1
./mdadm -Ss
./mdadm --zero-superblock /dev/loop[0-3]
echo 2 > /proc/sys/dev/raid/speed_limit_max
./mdadm --create /dev/imsm /dev/loop[0-3] -n 4 -e imsm -a md
./mdadm --create /dev/md/r1 /dev/loop[0-3] -n 4 -l 5 --force -a mdp
./mdadm --fail /dev/md/r1 /dev/loop3
./mdadm --wait /dev/md/r1
x=0
while ! ./mdadm --remove /dev/imsm /dev/loop3 > /dev/null 2>&1
do
    x=$((x+1))
done
echo "removed after $x attempts"
./mdadm --add /dev/imsm /dev/loop3
Include 2 small cleanups:
* remove the almost open coded fd2devnum() in Wait() by introducing a
new utility routine stat2devnum()
* teach connect_monitor() to parse the container device from a subarray
string
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Tue, 16 Sep 2008 03:58:41 +0000 (20:58 -0700)]
imsm: rectify map handling
The secondary map is used to reflect the migration state of the array
i.e. from dev->vol.map[1] to dev->vol.map[0]. Ensure a rebuilding /
initializing array is marked in the second map, while normal status is
reflected in the first map. Also mark rebuilding drives with
IMSM_ORD_REBUILD.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>