If someone has an IMSM array, and disables RAID in the BIOS
and uses the devices for some other purpose, then they really don't
want mdadm to start syncing the array.
So don't assemble if OROM doesn't confirm it is OK.
There can still be problems for crash-dump not being able to find
the OROM. Some explicit work-around might be needed for that
rather than a more general workaround that can corrupt data.
NeilBrown [Mon, 3 Aug 2015 01:53:01 +0000 (11:53 +1000)]
mdassemble: don't try to perform cluster check.
mdassemble is meant to be small and simple, so avoid
trying to check for a cluster.
Currently it doesn't, but it still includes the code,
which doesn't build because the library isn't provided.
So just exclude the get_cluster_name code from mdassemble.
Sometimes the removed device is re-added before the writes
get all the way to the md device - so the array doesn't need
any recovery and the test fails.
So flush first to be safe.
Newer versions of mkfs.extX ask before creating a filesystem
on a device which appears to already have a filesystem.
We don't want that, so add the -F flag.
Also be explicit about fs type as one shouldn't depend on defaults.
restripe: fix data block order in raid6_2_data_recov
... rather than relying on the caller getting them in the
correct order.
This is better engineering and fixes a bug, because the
failed_slotX numbers are used later on the assumption that
they weren't swapped.
- document meaning of various arrays. In particular:
stripes[]
blocks[]
blocks_page[]
block_index_for_slot[]
It needs to be clear if these are indexed by raid_disk
number or syndrome number.
- changed meaning of block_index_for_slot[]. It didn't seem
to be used consistently. It also made use of the block numbers
in array data ordering, which is not directly relevant for syndrome
calculations.
- reduced number of args to autorepair and manual_repair
They don't need both stripes[] and blocks[]. And they don't need
diskP or diskQ.
blocks[-1] is the P chunk, blocks[-2] is the Q chunk.
block_index_for_slot[] can be used to find the target device for
a particular syndrome block.
- remove stripe locking from within manual_repair, and instead
use the global stripe locking used for check and autorepair.
- this necessitated changes to raid6_datap_recov and raid6_2data_recov
so the P and Q blocks could be before or after the data blocks.
raid6check: get device ordering correct for syndrome calculation.
The order of devices used for the syndrome calculation is not
the same as the order of data in the array.
The D block immediately after Q is first, then they continue
cyclically in raid-disk order, skipping over the P disk if it is seen.
This gets the 'check' right for all layouts other than DDF, which is
quite different.
I haven't confirmed that this doesn't break repair.
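The ordering rule described above can be sketched as follows. This is only an illustration of the stated rule for non-DDF layouts, not the code in raid6check; the function name and its parameters are hypothetical.

```python
def syndrome_order(raid_disks, p_disk, q_disk):
    """Return the raid-disk numbers of the data blocks in syndrome order.

    Start with the disk immediately after Q, continue cyclically in
    raid-disk order, and skip the P disk if it is seen. Q itself is
    never reached because only raid_disks - 1 steps are taken.
    (Hypothetical helper; DDF layouts are not handled.)
    """
    order = []
    for step in range(1, raid_disks):
        disk = (q_disk + step) % raid_disks
        if disk == p_disk:
            continue  # P does not take part in the data ordering
        order.append(disk)
    return order
```

For a 5-disk stripe with P on disk 3 and Q on disk 4, the walk starts at disk 0 and stops before P, so the data blocks are taken from disks 0, 1 and 2 in that order.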
tests: slow down --stop a bit to allow revert-inplace to work.
revert-inplace would sometimes find that the original reshape had
finished.
So slow down the reshaping during --stop (which needs to be a little
bit fast so that stop doesn't timeout waiting) and don't wait quite
so long before stopping.
If a read fills the whole buffer, then we possibly
missed something off the end, and we definitely shouldn't
put a '\0' beyond the end, so just return an error.
This should never happen anyway.
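A minimal sketch of that check, written in Python rather than C. The name read_attr and the default buffer size are assumptions made for illustration; the point is only that a read which fills the whole buffer is treated as an error instead of being terminated past the end.

```python
import os


def read_attr(fd, bufsize=1024):
    """Read a short text value from fd, failing if it may be truncated.

    Hypothetical helper: in C the buffer would need one spare byte for
    the terminating '\\0', so a read that fills the buffer is an error.
    """
    data = os.read(fd, bufsize)
    if len(data) >= bufsize:
        # The buffer is completely full: something may have been missed
        # off the end, so report an error rather than return partial data.
        raise ValueError("value does not fit in buffer")
    return data.decode()
```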
A 'devnm' never starts with '/', so this test is pointless.
The code should use the passed-in devname unless it is clearly
not usable. So fix it to do that.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:12 +0000 (13:42 +0800)]
mdadm: change the num of cluster node
This extends the nodes option to assemble mode, so that the number of
cluster nodes can be changed by the user.
Before doing that, it is necessary to ensure there is enough space
for those nodes; calc_bitmap_size is introduced to calculate
the bitmap size of each node.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:11 +0000 (13:42 +0800)]
mdadm: add the ability to change cluster name
To support changing the cluster name, this commit does the following:
1. extend the original write_bitmap function for the new scenario.
2. add the scenario that handles modification of the cluster's name
in write_bitmap1.
3. let the cluster name also show in examine_super1 and detail_super1
Guoqing Jiang [Wed, 10 Jun 2015 05:42:09 +0000 (13:42 +0800)]
Convert a bitmap=none device to clustered
This adds the ability to convert a regular md without bitmap
(--bitmap=none) to a clustered device (--bitmap=clustered).
To convert a device with --bitmap=internal or --bitmap=external,
you have to convert to --bitmap=none and then re-execute the
command with --bitmap=clustered.
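The conversion path above can be sketched as a small planner that lists the commands to run. The helper name is hypothetical, and the use of --grow as the mode carrying --bitmap= is an assumption; the commit text itself only names the --bitmap values.

```python
def clustered_conversion_commands(device, current_bitmap):
    """List the commands needed to reach --bitmap=clustered.

    Hypothetical helper. Assumes the bitmap is changed through
    'mdadm --grow'; only the --bitmap values come from the commit text.
    """
    if current_bitmap == "clustered":
        steps = []
    elif current_bitmap == "none":
        steps = ["clustered"]
    elif current_bitmap in ("internal", "external"):
        # No direct conversion: drop the existing bitmap first.
        steps = ["none", "clustered"]
    else:
        raise ValueError("unknown bitmap type: %r" % (current_bitmap,))
    return ["mdadm --grow %s --bitmap=%s" % (device, step) for step in steps]
```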
Guoqing Jiang [Wed, 10 Jun 2015 05:42:08 +0000 (13:42 +0800)]
Add a new clustered disk
A clustered disk is added by the traditional --add sequence.
However, other nodes need to acknowledge that they can "see"
the device. This is done by --cluster-confirm:
--cluster-confirm SLOTNUM:/dev/whatever (if disk is found)
or
--cluster-confirm SLOTNUM:missing (if disk is not found)
The node initiating the --add has the disk state tagged with
MD_DISK_CLUSTER_ADD, and the node confirming tags the disk with
MD_DISK_CANDIDATE.
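As an illustration of the argument format shown above, here is a sketch of how a SLOTNUM:device value could be split. It is not mdadm's parser; the function name and the choice to represent "missing" as None are assumptions.

```python
def parse_cluster_confirm(arg):
    """Split 'SLOTNUM:/dev/whatever' or 'SLOTNUM:missing'.

    Returns (slot, device), with device set to None when the disk was
    not found. Hypothetical helper for illustration only.
    """
    slot_text, sep, device = arg.partition(":")
    if not sep or not slot_text.isdigit() or not device:
        raise ValueError("expected SLOTNUM:DEVICE or SLOTNUM:missing")
    return int(slot_text), (None if device == "missing" else device)
```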
Guoqing Jiang [Wed, 10 Jun 2015 05:42:06 +0000 (13:42 +0800)]
Set home-cluster while creating an array
The home-cluster is stored in the bitmap super block of the
array. The device can be assembled on a cluster whose
name is the same as the one recorded in the bitmap.
If home-cluster is not specified, it is auto-detected by
loading the corosync cmap library with dlopen.
neilb: allow code to compile when corosync-devel is not installed.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:05 +0000 (13:42 +0800)]
Add nodes option while creating md
Specifies the maximum number of nodes in the cluster that may use
this device simultaneously. This is equivalent to the number of
bitmaps created in the internal superblock (patches to follow).
NeilBrown [Thu, 28 May 2015 06:53:26 +0000 (16:53 +1000)]
test: make 'check wait' more reliable.
'recover' etc doesn't appear in /proc/mdstat immediately.
The "sync" thread must be started first.
But 'sync_action' shows it as soon as MD_RECOVERY_NEEDED is set
in the kernel. So look there too.
Now maybe I can get rid of some of those silly 'sleep' calls.
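The idea of the check can be sketched like this. The test suite itself is shell; this Python version only illustrates looking in both places. The function name is hypothetical, and the caller is assumed to pass in the contents of /proc/mdstat and of the array's md/sync_action sysfs file.

```python
def action_in_progress(action, mdstat_text, sync_action_text):
    """Report whether `action` (e.g. 'recover') is visible yet.

    /proc/mdstat only shows the action once the sync thread has
    started, while sync_action reports it earlier, so either source
    counts. Hypothetical helper; both texts are supplied by the caller.
    """
    return action in mdstat_text or sync_action_text.strip() == action
```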
NeilBrown [Thu, 28 May 2015 06:43:15 +0000 (16:43 +1000)]
Grow: fix problem with --grow --continue
If an array is being reshaped using backup space on a 'spare' device,
then
mdadm --grow --continue
won't find it as by the time it runs, nothing looks like a spare any
more. The spare has been added to the array, but has no data yet.
So allow reshape_prepare_fdlist to find a newly-incorporated spare and
report this so it can be used.
Reported-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Mon, 25 May 2015 06:33:45 +0000 (16:33 +1000)]
Grow: another attempt to fix stop-during-reshape race.
When the array is stopped during a critical section, we sometimes
erase the backup, which is bad.
This happens when 'completed' is zero.
This can happen easily when 'stop' freezes reshape.
So try to be more careful and check 'reshape_position'.
Sergey Vidishev [Tue, 19 May 2015 19:02:46 +0000 (22:02 +0300)]
mdadm: monitor: fix nullptr dereference when get_md_name() returns NULL
Function add_new_arrays() expects get_md_name() to return a
pointer to the devname, but get_md_name() may also return NULL. So
check the pointer before using it in add_new_arrays().
NeilBrown [Fri, 15 May 2015 05:11:48 +0000 (15:11 +1000)]
Grow: be even more careful about handing a '0' completed value.
Some old kernels set 'completed' to '0' too soon.
But modern kernels don't.
And when 'mdadm --stop' freezes and resumes the grow,
'completed' goes back to zero briefly, which can confuse this
logic.
So only think '0' might be wrong from an old kernel when
the reshape has gone idle.
NeilBrown [Thu, 14 May 2015 01:17:39 +0000 (11:17 +1000)]
Grow: only warn about incompatible metadata when no fallback available.
We might be trying to set_new_data_offset() for RAID10, when it is
a necessary requirement, or for RAID5 where it is optional.
In the latter case, a message about metadata versions is not helpful.
NeilBrown [Wed, 13 May 2015 04:08:41 +0000 (14:08 +1000)]
Manage: when re-adding, don't check avail size if ->sb cannot be found.
avail_size1 requires ->sb, so we must only call it if ->sb
was loaded.
If ->sb wasn't loaded, then we are only proceeding on the basis that
the kernel might be able to work something out - we don't need to
do any tests on size.
Martin Wilck [Mon, 11 May 2015 14:09:44 +0000 (16:09 +0200)]
DDF: _write_super_to_disk: fix anchor header type
Since commit 30bee0201, the anchor is updated from the active
DDF header. This requires fixing the header type before the
anchor is written.
The LSI Software RAID code will reject DDF meta data with wrong
anchor type and will erase all meta data when it encounters
such a broken anchor. Thus starting Linux md once on a system
with LSI RAID BIOS may cause the meta data to get destroyed.
NeilBrown [Wed, 6 May 2015 05:03:50 +0000 (15:03 +1000)]
Manage: fix test for 'is array failed'.
'active_disks' does not count spares, so if array is rebuilding,
this will not necessarily find all devices, so may report an array
as failed when it isn't.
Until now, active arrays with IMSM metadata have been counted per HBA.
This is wrong now that an OROM can be shared between multiple
controllers: more arrays can be created than the OROM supports.
This patch changes the way arrays are counted, so the result is the
sum of arrays under every HBA supported by a specific OROM.
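The change in counting can be illustrated with a short sketch. The data layout here (a list of dicts with "orom" and "active_arrays" keys) is invented for the example and does not reflect the IMSM structures in mdadm.

```python
def active_arrays_for_orom(hbas, orom_id):
    """Sum the active arrays over every HBA served by one OROM.

    Hypothetical data model: each HBA is a dict with an 'orom'
    identifier and an 'active_arrays' count.
    """
    return sum(hba["active_arrays"] for hba in hbas if hba["orom"] == orom_id)
```

With two controllers sharing one OROM, the limit is compared against the combined count rather than against each controller's own count.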
Assemble/force: make it possible to "force" a new device in a reshape.
Normally we do not "force"-assemble devices which are in the
middle of recovery, as they are unlikely to have useful data.
However, when a reshape increases the number of devices,
the newly added devices appear to be recovering because they
do not have complete data on them yet, but then they aren't expected
to until the reshape completes.
So in this case, it can be appropriate to force-assemble them.
Some kernels allow new_offset to be set, but don't then allow a RAID5
to be reshaped to change that offset.
Due to selective backports, this includes the SLES11-SP3 kernel.
It is quite easy to handle this case in mdadm, so we do.
Specifically: if the reshape with data-offset fails with EINVAL,
abort the data-offset change and try the "old" way.
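The fallback can be sketched as below. The three callables are placeholders supplied by the caller and are not mdadm functions; only the EINVAL handling reflects the description above.

```python
import errno


def reshape_with_fallback(reshape_with_offset, abort_offset_change, reshape_old_way):
    """Try the data-offset reshape; on EINVAL fall back to the old way.

    All three arguments are hypothetical callables. Any error other
    than EINVAL is passed on unchanged.
    """
    try:
        return reshape_with_offset()
    except OSError as err:
        if err.errno != errno.EINVAL:
            raise
    # The kernel accepted new_offset but refused the reshape itself:
    # undo the data-offset change and retry without it.
    abort_offset_change()
    return reshape_old_way()
```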
Pawel Baldysiak [Fri, 27 Feb 2015 14:47:54 +0000 (15:47 +0100)]
IncRemove: Set "auto-read" only after successful excl open.
"mdadm -If", triggered from udev rules when a disk is removed from the OS,
tries to set the array to auto-read-only mode. This can interrupt a rebuild
process which was started automatically, e.g. if the array is mounted and
a spare disk is available (the I/O error is detected before mdadm removes
the failed disk).
This patch prevents "mdadm -If" from setting the array to "auto-read-only"
unless the exclusive open succeeds.