NeilBrown [Thu, 27 Jun 2013 00:12:31 +0000 (10:12 +1000)]
Grow: report better message when --grow --chunk cannot work.
When changing the chunksize of an array, the new chunksize must
divide the device size.
If it doesn't we report a very brief message.
Make this message a bit longer and suggest a way forward by reducing
the size of the array.
Reported-by: Mark Knecht <markknecht@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Tue, 25 Jun 2013 05:56:22 +0000 (15:56 +1000)]
Make wait_for and open_dev_excl faster
When we create or assemble an array, we wait for udev to create the
device file in /dev so that as soon as mdadm completes, the device can
be used.
This waiting is performed in multiples of 200ms, which can sometimes
be too long to wait.
So change to an exponential backoff. Wait 1, then 2, then 4 msec etc.
Once we get to 256msec, stop backing off and continue waiting 256ms at
a time until we reach the limit which is now 4.608sec rather than 5sec
which it was before.
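
A minimal sketch of the delay schedule described above, not the actual
mdadm code. The 256 ms cap comes from the message; the helper names and
the attempt count of 25 are assumptions, chosen because 25 attempts sum
to 4607 ms, which is close to the 4.608 sec limit quoted.

```c
/* The cap is taken from the commit message; the names are hypothetical. */
enum { BACKOFF_CAP_MS = 256 };

/* Delay to use after a wait of cur_ms: double it, but never exceed the cap. */
unsigned next_delay_ms(unsigned cur_ms)
{
    unsigned next = cur_ms * 2;

    return next > BACKOFF_CAP_MS ? BACKOFF_CAP_MS : next;
}

/* Total time spent if every one of `attempts` waits is used up.
 * The sequence is 1, 2, 4, ... 128, then 256 for each remaining attempt. */
unsigned total_wait_ms(unsigned attempts)
{
    unsigned total = 0, delay = 1, i;

    for (i = 0; i < attempts; i++) {
        total += delay;
        delay = next_delay_ms(delay);
    }
    return total;
}
```

With the assumed 25 attempts, total_wait_ms(25) is 255 ms of backoff plus
17 waits of 256 ms.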
NeilBrown [Tue, 25 Jun 2013 05:52:58 +0000 (15:52 +1000)]
Grow: fix bug in raid0 -> raid5 conversion.
The moment we change a RAID0 to a RAID5 it will try to recover. This
will abort quite quickly as there are no spare devices, but it could
confuse the attempt to freeze the array.
So allow 'freeze' to work even on a recovering array.
NeilBrown [Mon, 24 Jun 2013 06:59:37 +0000 (16:59 +1000)]
Make: CXFLAGS should be conditionally assigned.
As the Makefile encourages users to set CXFLAGS for extra flags,
we should only conditionally set it.
That way it can be over-ridden in the environment as well as on
the command line.
mwilck@arcor.de [Thu, 20 Jun 2013 20:21:05 +0000 (22:21 +0200)]
Detail: deterministic ordering in --brief --verbose
Have mdadm --detail --brief --verbose print the list of devices in
alphabetical order.
This is useful for debugging purposes. E.g. the test script
10ddf-create compares the output of two mdadm -Dbv calls which
may be different if the order is not deterministic.
(I confess: I use a modified "test" script that always runs
"mdadm --verbose" rather than "mdadm --quiet", otherwise this
wouldn't happen in 10ddf-create).
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
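
One way to get a deterministic, alphabetical device list is to sort the
names before printing them. This is a sketch using qsort() and strcmp();
the function names are hypothetical and it is not necessarily how the
patch itself does it.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
int cmp_devname(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcmp(*sa, *sb);
}

/* Sort n device names in place into strcmp (byte-wise) order. */
void sort_devnames(char **names, size_t n)
{
    qsort(names, n, sizeof(names[0]), cmp_devname);
}
```

Note that strcmp order is byte-wise, so "/dev/sdaa" sorts before
"/dev/sdb"; that is still deterministic, which is what matters for
comparing two runs.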
NeilBrown [Mon, 24 Jun 2013 04:08:41 +0000 (14:08 +1000)]
Grow: remove excess drives when converting to RAID0.
When converting to RAID0, all spares and non-data drives
need to be removed first.
It is possible that the first HOT_REMOVE_DISK will fail because the
personality hasn't let go of it yet, so retry a few times.
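
The retry can be sketched as a bounded loop around the removal attempt.
Everything here is illustrative: the callback type stands in for the
HOT_REMOVE_DISK ioctl, busy_then_ok() is a fake used only to exercise the
loop, and treating EBUSY as the "not let go yet" error is an assumption.
A real caller would probably also pause briefly between attempts.

```c
#include <errno.h>

/* Hypothetical callback: returns 0 on success, -1 with errno set. */
typedef int (*remove_fn)(void *ctx);

/* Try the removal up to max_tries times, retrying only while the
 * failure is EBUSY (assumed here to mean "still held").
 * Returns 0 on success, -1 otherwise. */
int remove_with_retry(remove_fn try_remove, void *ctx, int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (try_remove(ctx) == 0)
            return 0;
        if (errno != EBUSY)
            return -1;  /* a different error: retrying will not help */
    }
    return -1;
}

/* Fake removal for demonstration: fails with EBUSY *ctx times, then succeeds. */
int busy_then_ok(void *ctx)
{
    int *left = ctx;

    if (*left > 0) {
        (*left)--;
        errno = EBUSY;
        return -1;
    }
    return 0;
}
```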
NeilBrown [Mon, 24 Jun 2013 03:04:38 +0000 (13:04 +1000)]
Grow: fix two problems with new_data_offset
1/ ignore failed devices - obviously
2/ We need to tell the kernel which direction the reshape should
progress even if we didn't choose the particular data_offset
to use.
NeilBrown [Mon, 24 Jun 2013 03:02:35 +0000 (13:02 +1000)]
Grow: Try hard to set new_offset.
Setting new_offset can fail if the v1.x "data_size" is too small.
So if that happens, try increasing it first by writing "0".
That can fail on spare devices due to a kernel bug, so if writing "0"
doesn't work, try writing the correct number of sectors.
NeilBrown [Wed, 19 Jun 2013 01:09:33 +0000 (11:09 +1000)]
Assemble: when forcing a single-degraded RAID6 array, trigger a 'repair'.
When an active/degraded RAID6 array is force-started we clear the
'active' flag, but it is still possible that some parity is
not in sync. This is because there are two parity blocks.
It would be nice to be able to tell the kernel "P is OK, Q maybe not".
But that is not possible.
So when we force-assemble such an array, trigger a 'repair' to fix up
any errant Q blocks.
This is not ideal, as a repair interrupted by a restart will not be
continued afterwards, but it is the best we can do without kernel help.
NeilBrown [Wed, 19 Jun 2013 00:33:47 +0000 (10:33 +1000)]
sysfs_read: return devices in same order as in filesystem.
When we read devices from sysfs (../md/dev-*), store them in the same
order that they appear. That makes more sense when exposed to a
human (as the next patch will).
Bernd Schubert [Tue, 18 Jun 2013 09:09:26 +0000 (11:09 +0200)]
raid6check: Fix memory leaks detected by valgrind
==2389947== 24 bytes in 1 blocks are definitely lost in loss record 1 of 10
==2389947== at 0x4C2B3F8: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2389947== by 0x408067: xmalloc (xmalloc.c:36)
==2389947== by 0x401B19: check_stripes (raid6check.c:151)
==2389947== by 0x4030C6: main (raid6check.c:521)
==2389947==
==2389947== 24 bytes in 1 blocks are definitely lost in loss record 2 of 10
==2389947== at 0x4C2B3F8: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2389947== by 0x408067: xmalloc (xmalloc.c:36)
==2389947== by 0x401B67: check_stripes (raid6check.c:155)
==2389947== by 0x4030C6: main (raid6check.c:521)
==2389947==
Bernd Schubert [Tue, 18 Jun 2013 09:09:16 +0000 (11:09 +0200)]
raid6check: Fix build of raid6check
After a recent git pull, 'make raid6check' did not work anymore, as
sysfs_read() was called with a wrong argument and as check_env()
was used by use_udev(), but not defined.
Replace sysfs_read(..., -1, ...) by sysfs_read(..., NULL, ...)
NeilBrown [Mon, 17 Jun 2013 06:55:31 +0000 (16:55 +1000)]
Assemble/Incr: Don't include spares with too-high event count.
Some failure scenarios can leave a spare with a higher event count
than an in-sync device. Assembling an array like this will confuse
the kernel.
So detect spares with event counts higher than the best non-spare
event count and exclude them from the array.
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
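
The check described above amounts to two passes: find the best non-spare
event count, then drop any spare that exceeds it. A simplified sketch
follows; struct member and filter_spares() are hypothetical and do not
mirror mdadm's real data structures.

```c
#include <stddef.h>

/* Hypothetical, simplified view of one member device. */
struct member {
    unsigned long long events;  /* event count from its superblock */
    int is_spare;               /* non-zero if the device is a spare */
    int excluded;               /* set by filter_spares() */
};

/* Mark every spare whose event count is higher than the best non-spare
 * event count. Returns how many spares were excluded. If there is no
 * non-spare device at all, nothing is excluded. */
size_t filter_spares(struct member *devs, size_t n)
{
    unsigned long long best = 0;
    int have_best = 0;
    size_t i, dropped = 0;

    for (i = 0; i < n; i++)
        if (!devs[i].is_spare && (!have_best || devs[i].events > best)) {
            best = devs[i].events;
            have_best = 1;
        }

    for (i = 0; i < n; i++) {
        devs[i].excluded = 0;
        if (have_best && devs[i].is_spare && devs[i].events > best) {
            devs[i].excluded = 1;
            dropped++;
        }
    }
    return dropped;
}
```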
NeilBrown [Mon, 27 May 2013 05:09:38 +0000 (15:09 +1000)]
Grow: allow for different sized devices when updating data_offset.
It is possible that the devices in an array have different sizes, and
different data_offsets. So the 'before_space' and 'after_space' may
be different from drive to drive.
Any decisions about how much to change the data_offset must work on
all devices, so must be based on the minimum available space on
any device.
So find this minimum first, then do the calculation.
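
Finding the minimum first could look like the sketch below. The field
names follow the commit text, but struct dev_space and min_space() are
hypothetical and only illustrate the idea.

```c
#include <stddef.h>

/* Hypothetical per-device record; field names follow the commit text. */
struct dev_space {
    unsigned long long before_space;
    unsigned long long after_space;
};

/* Store the smallest before_space and after_space over all n devices.
 * Returns 0 on success, -1 if there are no devices. */
int min_space(const struct dev_space *devs, size_t n,
              unsigned long long *min_before, unsigned long long *min_after)
{
    size_t i;

    if (n == 0)
        return -1;

    *min_before = devs[0].before_space;
    *min_after = devs[0].after_space;
    for (i = 1; i < n; i++) {
        if (devs[i].before_space < *min_before)
            *min_before = devs[i].before_space;
        if (devs[i].after_space < *min_after)
            *min_after = devs[i].after_space;
    }
    return 0;
}
```

Any data_offset change that fits within both minima then fits on every
device.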
NeilBrown [Thu, 23 May 2013 04:41:29 +0000 (14:41 +1000)]
Assemble: --update=metadata converts v0.90 to v1.0
This allows the smooth conversion of legacy 0.90 arrays
to 1.0 metadata.
Old metadata is likely to remain but will be ignored.
It can be removed with
mdadm --zero-superblock --metadata=0.90 /dev/whatever
NeilBrown [Tue, 21 May 2013 06:28:23 +0000 (16:28 +1000)]
Grow: use new_data_offset instead of backups for raid4/5/6 reshape.
If we can modify the data_offset, we can avoid doing any backups at all.
If we can't, fall back on the old approach, but not if --data-offset
was requested.
NeilBrown [Wed, 22 May 2013 02:17:32 +0000 (12:17 +1000)]
Grow: introduce min_offset_change to struct reshape.
raid10 currently uses the 'backup_blocks' field to store something
else: a minimum offset change.
This is bad practice, and we will shortly need to have both for RAID5/6,
so make a separate field.
NeilBrown [Wed, 15 May 2013 01:40:27 +0000 (11:40 +1000)]
Create: over-ride "start_ro" setting when creating an array.
If module parameter start_ro is set, arrays start readonly.
This is OK when assembling, but is very surprising when creating
an array as the resync won't start.
So over-ride the setting (unless --read-only was given) to make
arrays RW when created.
NeilBrown [Wed, 15 May 2013 01:10:54 +0000 (11:10 +1000)]
Suppress error messages from systemctl.
We call systemctl to see if systemd will run mdmon for us.
If it cannot, we run mdmon directly, so we aren't interested
in the error message.
So redirect stderr to /dev/null.
NeilBrown [Wed, 15 May 2013 01:03:25 +0000 (11:03 +1000)]
create_mddev: add support for /dev/md_XXX non-numeric names.
With the 'devnm' infrastructure fixed, it is quite easy to support
names like "md_home" for md arrays.
This currently defaults to "off" and can be enabled in mdadm.conf with
CREATE names=yes
This is in case other tools get confused by the new names.
NeilBrown [Mon, 13 May 2013 02:56:38 +0000 (12:56 +1000)]
misc_scan: don't trust the mapping file too much for device names.
misc_scan assumes that any device name found in the 'mapping' file
is usable. Usually it is but sometimes not, such as for inactive
devices.
Depending on it isn't really robust, so when a name is found, check that
it exists. If not, fall back on map_dev.
This will allow "--detail --scan" to notice inactive devices.
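
The "check that it exists" step can be sketched with stat(). Here
usable_name() is a hypothetical helper, and its fallback argument stands
in for whatever map_dev would return; the real code may check existence
differently.

```c
#include <sys/stat.h>

/* Return `mapped` if that path exists, otherwise `fallback`.
 * A null `mapped` is treated as "no name found". */
const char *usable_name(const char *mapped, const char *fallback)
{
    struct stat st;

    if (mapped && stat(mapped, &st) == 0)
        return mapped;
    return fallback;
}
```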
NeilBrown [Mon, 13 May 2013 02:07:40 +0000 (12:07 +1000)]
Incremental: tell udisks to unmount when array looks to have disappeared.
If a device is removed which appears to be busy in an md array, then
it is very likely the array cannot be used.
We currently try to stop it, but that could fail if udisks had
automatically mounted it.
So tell udisks to unmount it, but ignore any error.
NeilBrown [Wed, 1 May 2013 00:23:40 +0000 (10:23 +1000)]
Wait: also wait if an action is about to start.
If a sync/recover action is about to start but hasn't actually begun
yet, /proc/mdstat won't show it, but md/sync_action will (it checks
MD_RECOVERY_NEEDED).
So when /proc/mdstat seems to say nothing is happening, double check
with md/sync_action.
Linux 3.10 will allow more "--add" to be handled as "--re-add".
To be sure the tests work correctly we sometimes need to zero
the device to ensure it really is an --add that happens.
mwilck@arcor.de [Fri, 25 Oct 2013 10:07:37 +0000 (12:07 +0200)]
monitor: read_and_act: handle race conditions for resync_start
When arrays are stopped, sysfs attributes may be deleted by
the kernel, and attempts to read these attributes will fail.
Setting resync_start to 0 is wrong in this case, because it
may make is_resync_complete() erroneously return
FALSE for a clean array. It is better to leave resync_start
untouched (the previously read value for this array).
Otherwise set_array_state() will pass the wrong state information
to the metadata handler, which will write it to disk, and at
the next restart an unnecessary recovery is started for the
array.
It is also possible that resync_start is actually *not* deleted
yet when read_and_act is running, and an apparently valid
value of "0" is read from it, with the same effect as described
above. This happens if the kernel has already called md_clean()
on the array (setting recovery_cp = 0), but the delayed removal
of "resync_start" hasn't happened yet. Therefore, in "clear"
state, "resync_start" shouldn't be read at all.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
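
The rule described above can be summarised as a small decision function.
This is only a sketch of the logic; the enum, the function name and the
parameters are hypothetical and do not match the actual read_and_act()
code.

```c
/* Hypothetical result of trying to read the sysfs attribute. */
enum read_result { READ_OK, READ_FAILED };

/* Decide which resync_start value to keep after one monitoring pass.
 * `prev` is the value remembered from the previous pass. */
unsigned long long update_resync_start(unsigned long long prev,
                                       int array_is_clear,
                                       enum read_result res,
                                       unsigned long long value_read)
{
    if (array_is_clear)
        return prev;    /* "clear" state: the attribute is not trusted */
    if (res == READ_FAILED)
        return prev;    /* attribute already deleted: keep the old value */
    return value_read;
}
```

In both problem cases the previously read value survives, so a clean
array is not later reported as needing a resync.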