Anna Czarnowska [Mon, 22 Nov 2010 09:58:06 +0000 (20:58 +1100)]
mdadm: added --no-sharing option for Monitor mode
--no-sharing option disables moving spares between arrays/containers.
Without the option spares are moved if needed according to config rules.
Only one spare-moving process, started with the --scan option, is allowed.
If such a process is running and another instance of Monitor
is starting without --scan, then we issue a warning but allow it
to continue.
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Anna Czarnowska [Mon, 22 Nov 2010 09:58:06 +0000 (20:58 +1100)]
Monitor: set err on arrays not in mdstat
mse can be NULL when the array was not in mdstat when we read it
but existed in statelist and was recreated after reading mdstat.
In this case we set err as we can't get full update on this array
this time.
If the same array is given twice on the command line it appears twice
in statelist. The first one will mark mse->devnum=INT_MAX
so the second one can't find mse. We set err on the second one as
it's not needed. Also if it becomes degraded we would look for spares
twice for the same array.
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
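The duplicate-array case described above can be sketched roughly as follows. This is a minimal illustration, not the Monitor code itself: `struct mdstat_entry`, `claim_entry` and `check_item` are hypothetical stand-ins, and only the idea of marking a consumed entry with devnum=INT_MAX comes from the commit message.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one parsed mdstat entry. */
struct mdstat_entry {
    int devnum;
    struct mdstat_entry *next;
};

/*
 * Find the entry for 'devnum' and mark it as consumed by setting
 * devnum to INT_MAX, so that a second statelist item naming the
 * same array finds nothing. Returns NULL when no entry matches.
 */
struct mdstat_entry *claim_entry(struct mdstat_entry *list, int devnum)
{
    struct mdstat_entry *mse;

    for (mse = list; mse != NULL; mse = mse->next) {
        if (mse->devnum == devnum) {
            mse->devnum = INT_MAX;
            return mse;
        }
    }
    return NULL;
}

/* Returns the 'err' flag for one statelist item: 1 means skip it this time. */
int check_item(struct mdstat_entry *list, int devnum)
{
    return claim_entry(list, devnum) == NULL;
}
```

With this shape, the first lookup for an array succeeds and the second one reports err, which also avoids searching for spares twice for the same array.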
NeilBrown [Mon, 22 Nov 2010 09:58:06 +0000 (20:58 +1100)]
Add action=spare-same-slot policy.
When "mdadm -I" is given a device with no metadata, mdadm tries to add
it as a 'spare' somewhere based on policy.
This patch changes the behaviour in two ways:
1/ If the device is at a 'path' where a previous device was removed
from an array or container, then we preferentially add the spare to
that array or container.
2/ Previously only 'bare' devices were considered for adding as
spares. Now if action=spare-same-slot is active, we will add
non-bare devices, but *only* if the path was previously in use
for some array, and the device will only be added to that array.
Based on code
From: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
NeilBrown [Mon, 22 Nov 2010 09:58:06 +0000 (20:58 +1100)]
incr/spare: recheck allowed action for each metadata.
The current act_spare tests only test if it is allowed for some
metadata.
As we check each array or partitioning type, we need to double-check
that sparing is allowed for that array or partitioning type.
extension of IncrementalRemove to store location (path-id) of removed device
If the disk is taken out of its port, the port information is
lost. Only a udev rule can provide us with this information, and then we
have to store it somehow. This patch writes a 'cookie' file in the
/dev/.mdadm/failed-slots directory. The file is named with the value of
f<path-id> and contains the metadata type and uuid of the array (or
container) that the device was a member of. The uuid is in exactly
the same format as in the mapfile.
FAILED_SLOTS_DIR constant has been added to hold the location of
cookie files.
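A rough sketch of writing such a cookie is given here. It is not the code from the patch: `write_slot_cookie` is a hypothetical helper, the exact file name and the "metadata uuid" line layout are assumptions, and the uuid is treated as an opaque string that the caller has already formatted as in the mapfile.

```c
#include <stdio.h>

/* Assumed value, taken from the directory named in the commit message. */
#define FAILED_SLOTS_DIR "/dev/.mdadm/failed-slots"

/*
 * Write a cookie for a removed device. 'dir' would normally be
 * FAILED_SLOTS_DIR and must already exist. 'path_id' is the udev
 * path-id, 'metadata' the metadata type, 'uuid' the array uuid as
 * a preformatted string. Returns 0 on success, -1 on failure.
 */
int write_slot_cookie(const char *dir, const char *path_id,
                      const char *metadata, const char *uuid)
{
    char name[4096];
    FILE *f;
    int n;

    n = snprintf(name, sizeof(name), "%s/%s", dir, path_id);
    if (n < 0 || (size_t)n >= sizeof(name))
        return -1;

    f = fopen(name, "w");
    if (f == NULL)
        return -1;

    if (fprintf(f, "%s %s\n", metadata, uuid) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

A later insertion at the same path-id could then look for a file of that name to learn which array or container the slot belonged to.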
added --path <path_id> to give the information on the 'path-id' of removed device
<path-id> identifies the port into which a given device is plugged.
In case of hot-removal, udev can pass this information for future
use (e.g. writing this name as a 'cookie' so that reinsertion of a
device into the same port can be detected).
A --path <path-id> parameter has been added to the device removal handler
(and char *path has been added to IncrementalRemove() to carry this
value) in order to pass path-id to this handler.
NeilBrown [Mon, 22 Nov 2010 09:58:06 +0000 (20:58 +1100)]
Assemble: simplify the handling of is_member_busy.
This is somewhat inconsistent with the last member of a
container getting special handling.
Just simplify it so the code seems to make sense and, importantly,
is easy to follow.
NeilBrown [Mon, 22 Nov 2010 09:58:05 +0000 (20:58 +1100)]
Remove content from mddev_dev
Now that the next_member loop is much smaller it is easy to
just use 'content' rather than stashing it in 'tmpdev->content'.
So we can remove the 'content' field from 'struct mddev_dev'.
NeilBrown [Mon, 22 Nov 2010 09:24:35 +0000 (20:24 +1100)]
Use new container_content rather than passing subarray to load_super.
Now that we can ask container_content for a specific subarray,
we don't need to pass the subarray name to load_super, and have it
secretly modify the returned state.
NeilBrown [Mon, 22 Nov 2010 08:35:25 +0000 (19:35 +1100)]
mapinfo: simplify subarray handling.
We don't need ->container_dev here, and we will soon be passing
subarray as an explicit arg to load_super.
So simplify extraction of subarray and move the strcpy close to
->load_super.
NeilBrown [Mon, 22 Nov 2010 08:35:25 +0000 (19:35 +1100)]
Assemble - avoid including wayward devices.
If a device - typically in a mirrored set - is assembled independently
of the other devices, and an attempt is then made to bring it back into
the set, it could contain inconsistent data. It should not be included.
So detect this situation by ensuring that the 'most recent' device is
believed to be active by every other device. If a device is wayward,
it will only consider fellow wayward devices to be active and will
think all others are failed or missing.
This patch only fixes --assemble, not --incremental
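The check described above might look roughly like this. It is a simplified sketch under assumptions, not the Assemble code: the `active` matrix, `most_recent_device` and `is_wayward` are hypothetical names, and each device's view of its peers is reduced to a flat array of flags.

```c
#include <stddef.h>

/* Index of the device with the highest event count (n must be > 0). */
size_t most_recent_device(const unsigned long long *events, size_t n)
{
    size_t best = 0, i;

    for (i = 1; i < n; i++)
        if (events[i] > events[best])
            best = i;
    return best;
}

/*
 * active[j * n + k] is non-zero when device j's metadata records
 * device k as active. Device j is treated as wayward when it does
 * not believe that the most recent device is active.
 */
int is_wayward(const unsigned char *active, size_t n,
               size_t j, size_t most_recent)
{
    if (j == most_recent)
        return 0;
    return active[j * n + most_recent] == 0;
}
```

For example, with three devices where the third was assembled on its own and records the other two as failed, only the third is flagged and the remaining pair can still be assembled.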
NeilBrown [Mon, 22 Nov 2010 08:35:25 +0000 (19:35 +1100)]
Manage: be more careful about --add attempts.
If an --add is requested and a re-add looks promising but fails or
cannot possibly succeed, then don't try the add. This avoids
inadvertently turning devices into spares when an array is failed but
the devices seem to actually work.
If a device is bare and policy suggests that it can be used as a spare
for a virtual 'partitions' array, find an appropriate partition table
and write it to the device.
Allow disk-policy to be computed given the path and
disk type explicitly. This can be used when hunting through
/dev/disk/by-path for something interesting.
First steps to supporting auto-spare-add to groups of partitioned devices.
Adding a spare to a group of partitioned devices is quite different
from adding one to an array. So detect which option is worth trying
based on policy and then try one or the other - or possibly both - as
appropriate.
NeilBrown [Thu, 19 Aug 2010 06:48:34 +0000 (16:48 +1000)]
Add mbr pseudo metadata handler.
To support incorporating a new bare device into a collection of arrays -
one partition each - mdadm needs a modest understanding of partition
tables.
The main need is to be able to recognise a partition table on one device
and copy it onto another.
This will be done using pseudo metadata types 'mbr' and 'gpt'.
NeilBrown [Mon, 23 Aug 2010 05:54:13 +0000 (15:54 +1000)]
Use action policy to keep recently-disconnected devices in the array.
When we find a device that was recently part of the array but is now
out of date (based on the event count) we might want to add it back in
(like --re-add) if the likely cause was a connection problem or we
might not if the likely cause was device failure.
So make this a policy issue: if action=re-add or better, try to re-add
any device that looks like it might be part of the array.
This applies:
when we assemble the array: old devices will be evicted by the
kernel and need to be re-added.
when we assemble the array during --incr for the same reason.
when we find a device that could be added to a running array.
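The decision above can be sketched as a small predicate. This is an illustration under assumptions, not the code from the patch: the enum names and `should_try_readd` are hypothetical, and the ordering of actions (each one permitting everything the previous one does) is assumed from the phrase "re-add or better".

```c
/* Hypothetical action levels, assumed to be ordered from least to most permissive. */
enum sketch_action {
    ACT_INCLUDE,
    ACT_RE_ADD,
    ACT_SPARE,
    ACT_SPARE_SAME_SLOT,
    ACT_FORCE_SPARE
};

/*
 * Decide whether to try re-adding a device that looks like a former
 * member. The device counts as out of date when its event count is
 * lower than the array's; the attempt is made only when policy allows
 * action=re-add or better.
 */
int should_try_readd(unsigned long long dev_events,
                     unsigned long long array_events,
                     enum sketch_action allowed)
{
    if (dev_events >= array_events)
        return 0; /* not out of date, nothing to re-add */
    return allowed >= ACT_RE_ADD;
}
```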
This doesn't affect arrays with external metadata at all.
For such arrays:
When the container is assembled, the most recent instance of each
device is included without reference to whether it is too old or not.
Then the metadata handler must decide which slices of which devices to
include in which array and with what state. So the
->container_content should probably check the policy and compare the
sequence numbers/event counts.
When a device is added (--add) to a container with active arrays
we only add as a 'spare'. --re-add doesn't seem to be an option.
When a device is added with -I ->container_content gets another
chance to assess things again. So again it should check the policy.
This defines two distinct policies which apply to any disk (but not
partition) device reached through the pci device 0000:00:1f.2.
The policies are "action=ignore" which means certain actions will
ignore the device, and "domain=onboard" which means all such devices
are treated as being united under the name 'onboard'.
This patch just adds data structures and code to read and
manipulate them. Future patches will actually use them.
NeilBrown [Tue, 31 Aug 2010 05:21:40 +0000 (15:21 +1000)]
Don't remove md devices with standard names.
If udev is not in use, we create devices in /dev when assembling
arrays and remove them when stopping the array.
However it may not always be correct to remove the device. If
the array was started with kernel auto-detect, then mdadm didn't
create anything and so shouldn't remove anything.
We don't record whether we created things, so just don't remove
anything with a 'standard' name. Only remove symlinks to the
standard name as we almost certainly created those.
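The rule above could be sketched as follows. This is a hedged illustration, not the actual implementation: `has_standard_name` and `should_remove_on_stop` are hypothetical helpers, the set of name patterns treated as 'standard' (/dev/mdN, /dev/md/N, /dev/md_dN, /dev/md/dN) is an assumption, and the caller is assumed to have already determined whether the path is a symlink.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Non-zero when 's' is a non-empty string of decimal digits. */
static int all_digits(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

/* Assumed notion of a 'standard' md device name: a known prefix plus a number. */
int has_standard_name(const char *dev)
{
    static const char *const prefixes[] = {
        "/dev/md_d", "/dev/md/d", "/dev/md/", "/dev/md"
    };
    size_t i;

    for (i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
        size_t len = strlen(prefixes[i]);

        if (strncmp(dev, prefixes[i], len) == 0 && all_digits(dev + len))
            return 1;
    }
    return 0;
}

/*
 * Symlinks are removed because we almost certainly created them;
 * device nodes with a standard name are left alone.
 */
int should_remove_on_stop(const char *dev, int is_symlink)
{
    if (is_symlink)
        return 1;
    return !has_standard_name(dev);
}
```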
NeilBrown [Sun, 29 Aug 2010 22:48:48 +0000 (08:48 +1000)]
Allow dev_open to work on read-only /dev
/dev could be read-only in which case we cannot make devices
there.
So dev_open should first try to use an existing device name,
and if that doesn't work try creating a node in /dev or /tmp.
NeilBrown [Thu, 12 Aug 2010 01:41:41 +0000 (11:41 +1000)]
Allow --incremental to add spares to an array.
Commit 3a6ec29ad56 stopped us from adding apparently-working devices
to an active array with --incremental as there is a good chance that they
are actually old/failed devices.
Unfortunately it also stopped spares from being added to an active
array, which is wrong. This patch refines the test to be more
careful.
Reported-by: <fibreraid@gmail.com>
Analysed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Dan Williams [Mon, 9 Aug 2010 17:26:24 +0000 (10:26 -0700)]
Incremental: accept '--no-degraded' as a deprecated option
Commit 3288b419 (Revert "Incremental: honor --no-degraded to delay assembly")
killed the --no-degraded flag since commit 97b4d0e9 (Incremental: honor
an 'enough' flag from external handlers) made this the default behavior
of -I, and brought -I usage for external/container formats in line with
native metadata. However, this breaks existing usages of '-I
--no-degraded', so allow it as a deprecated option.
Starting a degraded container, like the native metadata case, requires -R.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Ignacy Kasperowicz <ignacy.kasperowicz@intel.com>
Dan Williams [Tue, 10 Aug 2010 15:44:45 +0000 (08:44 -0700)]
Incremental: return success in 'container not enough' case
Commit 97b4d0e9 "Incremental: honor an 'enough' flag from external
handlers" introduced a regression in that it changed the error return
code for successful invocations.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Ignacy Kasperowicz <ignacy.kasperowicz@intel.com>
NeilBrown [Fri, 6 Aug 2010 04:40:53 +0000 (14:40 +1000)]
Grow: use raid_disks, not nr_disks
nr_disks is just wrong here - the arrays need room for all disk slots,
even if some are empty, plus spares, plus a possible backup file.
So raid_disks is correct.
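The sizing rule can be illustrated with a small allocation helper. This is only a sketch: `alloc_grow_fds` is a hypothetical name and the use of an array of file descriptors is an assumption; the point taken from the commit message is that the count is based on raid_disks (every slot, present or not), plus spares, plus one for a possible backup file.

```c
#include <stdlib.h>

/*
 * Allocate one file-descriptor slot per raid disk slot (even empty
 * ones), one per spare, and one more for a possible backup file.
 * All entries start as -1. Returns NULL on allocation failure.
 */
int *alloc_grow_fds(int raid_disks, int spares, size_t *count)
{
    size_t n = (size_t)raid_disks + (size_t)spares + 1;
    int *fds = malloc(n * sizeof(*fds));
    size_t i;

    if (fds == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        fds[i] = -1;
    *count = n;
    return fds;
}
```

Sizing from a count of devices currently present would under-allocate for a degraded array, since the empty slots still need an index.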
NeilBrown [Thu, 5 Aug 2010 01:44:26 +0000 (11:44 +1000)]
Fix test for imsm prodigal member scenario
The 'container_enough' changes flipped the default from assembling
an array as soon as we possibly could, to assembling only when all
expected devices are present.
This broke 09imsm-assemble which expects the original default.
So change from "-I" to "-IR" to restore the expected behaviour.
Switch from /lib/init/rw to /dev for early-boot files.
It turns out that /lib/init/rw doesn't exist in early boot
like I thought. So give up on that idea and just use
/dev/.mdadm/ for files that must persist from early-boot
to regular boot.
While we attempt to use a lockfile to grant exclusive access to the
mapfile, our implementation is buggy. Specifically, we create a lockfile,
then lock it, at which point new instances can open the lockfile and
attempt to lock it, which will cause them to block. However, when we are
ready to unlock it, we unlink the file. This causes existing lock waiters
to get a lock on an unlinked inode while a different instance may now
create a new lockfile and get an exclusive lock on it.
There are several possible fixes. The chosen one is to test if
->st_nlink is zero after we get the lock and to retry if it is.
This means:
- failing to unlink a file doesn't leave a stale lock
- we can block waiting to get a lock rather than busy-waiting
- we don't need to leave a lock file permanently in place.
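The lock-then-check approach described above can be sketched like this. It is a minimal illustration rather than the mapfile code: `lock_path` and `unlock_path` are hypothetical helpers, and the use of flock() with fstat() is an assumption about how the lock and link-count test are done.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open (creating if needed) and exclusively lock 'path'.
 * After the lock is granted, check the link count: zero means the
 * previous holder unlinked the file while we were waiting, so our
 * lock is on a stale inode and we must start again.
 * Returns a locked file descriptor, or -1 on error.
 */
int lock_path(const char *path)
{
    for (;;) {
        struct stat st;
        int fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX) != 0) { /* blocks until the lock is free */
            close(fd);
            return -1;
        }
        if (fstat(fd, &st) != 0) {
            close(fd);
            return -1;
        }
        if (st.st_nlink > 0)
            return fd;  /* file still exists: the lock is valid */
        close(fd);      /* stale lock on an unlinked inode: retry */
    }
}

/* Remove the lock file, then drop the lock by closing the descriptor. */
void unlock_path(const char *path, int fd)
{
    unlink(path);
    close(fd);
}
```

Because waiters block in flock() and re-check after waking, no busy-waiting is needed, and a failed unlink merely leaves a file that the next caller can lock normally.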
The watch option to udev tells udev to watch our mdadm device file for
close events and on a close it rechecks the device. This means that if,
for example, we use mdadm to --grow the array from a 4 disk to 5 disk
array, when mdadm closes the array, udev will re-read the superblock and
update its internal database with the new information. This can also be
used to cause udev to create new device special files if, for example, a
partitioning program is used to modify the partition table on the actual
md device and that program does not call the syscall to reread the
partition table.