NeilBrown [Fri, 18 Dec 2015 02:51:54 +0000 (13:51 +1100)]
Detail: don't assume a particular 'disk' number of missing devices.
When a particular raid-disk is missing, we don't know which disk number
it should have, and reporting a number could result in duplicate
numbers (with v1.x metadata - never with the old 0.90).
So set the default to -1 and recognise that when printing.
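A minimal sketch of the "recognise -1 when printing" idea. The helper name format_disk_number and the column width of 6 are assumptions for illustration, not the actual Detail.c code:

```c
#include <stdio.h>

/* Hypothetical helper (not from mdadm): format the "Number" column for
 * one device.  A negative value means the disk number is unknown, so a
 * "-" is printed instead of a possibly duplicated number. */
static int format_disk_number(char *buf, size_t len, int number)
{
	if (number < 0)
		return snprintf(buf, len, "%6s", "-");
	return snprintf(buf, len, "%6d", number);
}
```

Calling format_disk_number(buf, sizeof(buf), -1) leaves a right-aligned "-" in buf, while a known number is printed as usual.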
NeilBrown [Fri, 18 Dec 2015 02:49:30 +0000 (13:49 +1100)]
Detail: report correct raid-disk for removed drives.
Back in
Commit: 8057db46a15d ("Detail: fix handling of 'disks' array.")
when we doubled the size of the 'disks' array to handle primary and
replacement, we should have halved the setting of the default raid_disk
number.
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
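The halving described above can be sketched as follows. The struct and function names are hypothetical, and the "even index is primary, odd index is replacement" layout is an assumption about how the doubled array is used:

```c
/* Hypothetical slot record, for illustration only. */
struct disk_slot {
	int raid_disk;	/* role of this slot within the array */
};

/* The array holds two slots per raid-disk (assumed: primary at 2*n,
 * replacement at 2*n+1), so the default raid_disk for slot d is d/2
 * rather than d. */
static void init_slots(struct disk_slot *disks, int raid_disks)
{
	for (int d = 0; d < raid_disks * 2; d++)
		disks[d].raid_disk = d / 2;
}
```

With 4 raid-disks, slots 0 and 1 both default to raid_disk 0 and slots 6 and 7 to raid_disk 3, so a removed drive is reported with the right role.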
Guoqing Jiang [Wed, 16 Dec 2015 17:54:26 +0000 (01:54 +0800)]
mdadm: improve the safeguard for change cluster raid's sb
This commit does the following:
1. rename is_clustered to dlm_funs_ready since it matches the
function better.
2. st->cluster_name can't be used to identify whether the raid is
clustered or not; we should check the bitmap's version to
perform the identification.
3. for the cluster_get_dlmlock/cluster_release_dlmlock funcs, both
of them just need the lockid as a parameter since the cluster
name can be obtained via get_cluster_name().
Guoqing Jiang [Wed, 16 Dec 2015 17:54:25 +0000 (01:54 +0800)]
mdadm: do not try to hold dlm lock in free_super1
free_super1 doesn't actually change the sb; it
just frees the address space of the sb. free_super1 is
also called in lots of places within mdadm, so remove the
dlm lock code, since the func doesn't need the protection;
this also reduces latency.
Guoqing Jiang [Tue, 1 Dec 2015 16:30:12 +0000 (00:30 +0800)]
mdadm: do not display bitmap info if it is cleared
"mdadm -X DISK" is used to report information about a bitmap
file. It is better not to display the related info if the
bitmap was cleared with "--bitmap=none" in grow mode.
To do that, locate_bitmap is changed slightly to have a
return value based on MD_FEATURE_BITMAP_OFFSET.
Guoqing Jiang [Tue, 1 Dec 2015 16:30:10 +0000 (00:30 +0800)]
mdadm: output info more precisely when change bitmap to none
When changing the bitmap to none, the messages can be more accurate
if they are based on the existing bitmap type.
Also, s->bitmap_file is passed from the command line ("--bitmap=TYPE"),
so remove s->bitmap_file from the error message, since printing it
suggests that changing the bitmap to that type failed rather than that
the type is already present.
Deepa Dinamani [Tue, 8 Dec 2015 23:10:21 +0000 (15:10 -0800)]
mdadm: Change timestamps to unsigned data type.
32 bit signed timestamps will overflow in the year 2038.
Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
Add time_after/time_before and supporting typecheck from
the kernel to take care of unsigned time wraparound.
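A minimal sketch of a wraparound-safe comparison for unsigned 32-bit timestamps, modelled on the idea behind the kernel's time_after/time_before. The names ts_after and ts_before are hypothetical, and the typecheck part is omitted here:

```c
#include <stdint.h>

/* True if timestamp a is later than b, even when the unsigned counter
 * has wrapped between them.  Works as long as the two values are less
 * than 2^31 seconds apart. */
static inline int ts_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* True if timestamp a is earlier than b. */
static inline int ts_before(uint32_t a, uint32_t b)
{
	return ts_after(b, a);
}
```

The subtraction is done in unsigned arithmetic, so it is well defined on wraparound; only the sign of the difference is inspected.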
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.
v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.
Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types.
Song Liu [Wed, 21 Oct 2015 18:35:14 +0000 (11:35 -0700)]
mdadm: refactor write journal code in Assemble and Incremental
As discussed, standalone require_journal() in struct superswitch
is not a very good idea. Instead, journal related information
fits well in struct mdinfo.
This patch simplifies journal support code in Assemble and
Incremental as:
- Add journal_device_required and journal_clean to struct mdinfo;
- Remove function require_journal from struct superswitch;
- Update Assemble and Incremental to use journal_device_required
and journal_clean from struct mdinfo (instead of separate var).
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Guoqing Jiang [Mon, 19 Oct 2015 08:03:19 +0000 (16:03 +0800)]
Safeguard against writing to an active device of another node
Modifying an existing device's superblock or creating a new superblock
on an existing device needs to be checked because the device could be
in use by another node in another array. So, we check this by taking
all superblock locks in userspace so that we don't step onto an active
device used by another node and safeguard against accidental edits.
After the edit is complete, we release all locks and the lockspace so
that it can be used by the kernel space.
Song Liu [Fri, 9 Oct 2015 05:51:44 +0000 (22:51 -0700)]
Assemble array with write journal
Example output:
./mdadm --assemble /dev/md0 /dev/sd[c-f] /dev/sdb1
mdadm: /dev/md0 has been started with 4 drives and 1 journal.
mdadm checks the superblock for journal devices. If the journal device
is missing or faulty, mdadm will show a warning:
./mdadm --assemble /dev/md0 /dev/sd[c-q] /dev/sdb1
mdadm: Not safe to assemble with missing or stale journal device, consider --force.
The user can insist on starting the array (read only) with --force:
./mdadm --assemble /dev/md0 /dev/sd[c-q] /dev/sdb1 --force
mdadm: Journal is missing or stale, starting array read only.
mdadm: /dev/md0 has been started with 15 drives.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
   Number   Major   Minor   RaidDevice   State
      0       8      32         0        active sync   /dev/sdc
      1       8      48         1        active sync   /dev/sdd
      2       8      64         2        active sync   /dev/sde
      3       8      80         3        active sync   /dev/sdf
      4       8      17         -        journal       /dev/sdb1
./mdadm -E /dev/sdb2
/dev/sdb2:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x201
Array UUID : 562b2334:35b9bcc1:add50892:1f30c4bd
Name : 0
Creation Time : Thu Aug 27 12:55:26 2015
Raid Level : raid5
Raid Devices : 15
Avail Dev Size : 249796608 (119.11 GiB 127.90 GB)
Array Size : 54696423936 (52162.57 GiB 56009.14 GB)
Used Dev Size : 7813774848 (3725.90 GiB 4000.65 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=0 sectors
State : active
Device UUID : 5015e522:d39ba566:5909cf3c:9c51f2ff
Internal Bitmap : 8 sectors from superblock
Update Time : Thu Aug 27 13:16:55 2015
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 4e6fd76d - correct
Events : 262
Layout : left-symmetric
Chunk Size : 256K
Device Role : Journal
Array State : AAAAAAAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
imsm: don't call abort_reshape() in imsm_manage_reshape()
Calling abort_reshape() in imsm_manage_reshape() is unnecessary in case
of an error because it is handled by reshape_array(). Calling it when
reshape completes successfully is also unnecessary and leads to a race
condition:
- reshape ends
- mdadm calls abort_reshape() -> sets sync_action to idle
- MD_RECOVERY_INTR is set and md_reap_sync_thread() does not finish the
reshape
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Konrad Dabrowski <konrad.dabrowski@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Commit 06bd679317a2 ("Skip clustered devices in incremental")
disabled incremental completely on clustered arrays.
What we really want is that mdadm should not start or create
a clustered array but still be able to add or readd to an existing
device. This would enable udev scripts to automatically add
or re-add a device after transient errors.
If someone has an IMSM array, and disables RAID in the BIOS
and uses the devices for some other purpose, then they really don't
want mdadm to start syncing the array.
So don't assemble if OROM doesn't confirm it is OK.
There can still be problems for crash-dump not being able to find
the OROM. Some explicit work-around might be needed for that
rather than a more general workaround that can corrupt data.
NeilBrown [Mon, 3 Aug 2015 01:53:01 +0000 (11:53 +1000)]
mdassemble: don't try to perform cluster check.
mdassemble is meant to be small and simple, so avoid
trying to check for a cluster.
Currently it doesn't, but it still includes the code,
which doesn't build because the library isn't provided.
So just exclude the get_cluster_name code from mdassemble.
sometimes the removed device is re-added before the writes
get all the way to the md device - so the array doesn't need
any recovery and the test fails.
So flush first to be safe.
newer versions of mkfs.extX ask before creating a filesystem
on a device which appears to already have a filesystem.
We don't want that, so add the -F flag.
Also be explicit about fs type as one shouldn't depend on defaults.
restripe: fix data block order in raid6_2_data_recov
... rather than relying on the caller getting them in the
correct order.
This is better engineering and fixes a bug, but because the
failed_slotX numbers are used later with assumption that
they weren't swapped
- document meaning of various arrays. In particular:
stripes[]
blocks[]
blocks_page[]
block_index_for_slot[]
It needs to be clear if these are indexed by raid_disk
number or syndrome number.
- changed meaning of block_index_for_slot[]. It didn't seem
to be used consistently. It also made use of the block numbers
in array data ordering, which is not directly relevant for syndrome
calculations.
- reduced number of args to autorepair and manual_repair
They don't need both stripes[] and blocks[]. And they don't need
diskP or diskQ.
blocks[-1] is the P chunk, blocks[-2] is the Q chunk.
block_index_for_slot[] can be used to find the target device for
a particular syndrome block.
- remove stripe locking from within manual_repair, and instead
use the global stripe locking used for check and autorepair.
- this necessitated changes to raid6_datap_recov and raid6_2data_recov
so the P and Q blocks could be before or after the data blocks.
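The blocks[-1] / blocks[-2] convention in the list above can be sketched with plain pointer arithmetic. The backing array layout [Q][P][D0][D1]... and the helper name data_view are assumptions for illustration, not the raid6check code:

```c
#include <stdint.h>

/* Hypothetical helper: 'all' holds pointers laid out as
 * [Q][P][D0][D1]...  Returning all + 2 gives a view where
 * blocks[0] is the first data block, blocks[-1] is the P chunk
 * and blocks[-2] is the Q chunk. */
static uint8_t **data_view(uint8_t **all)
{
	return all + 2;
}
```

Negative indices are valid here because the returned pointer still points into the same array, two elements past its start.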
raid6check: get device ordering correct for syndrome calculation.
The order of devices used for the syndrome calculation is not
the same as the order of data in the array.
The D block immediately after Q is first, then they continue
cyclically in raid-disk order, skipping over the P disk if it is seen.
This gets the 'check' right for all layouts other than DDF, which is
quite different.
I haven't confirmed that this doesn't break repair.
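A minimal sketch of the ordering rule described above, for non-DDF layouts only. The function name syndrome_order and its parameters are hypothetical; it simply walks the raid-disks cyclically starting after Q and skips P:

```c
/* Fill order[] with the raid-disk numbers of the data blocks in the
 * order used for the syndrome calculation: start with the disk
 * immediately after Q, continue cyclically, and skip the P disk.
 * Returns the number of data blocks written (raid_disks - 2). */
static int syndrome_order(int raid_disks, int p_disk, int q_disk, int *order)
{
	int n = 0;

	for (int i = 1; i < raid_disks; i++) {
		int d = (q_disk + i) % raid_disks;

		if (d == p_disk)
			continue;
		order[n++] = d;
	}
	return n;
}
```

For a 5-disk array with P on disk 0 and Q on disk 1, this yields disks 2, 3, 4 in that order.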
tests: slow down --stop a bit to allow revert-inplace to work.
revert-inplace would sometimes find that the original reshape had
finished.
So slow down the reshaping during --stop (which needs to be a little
bit fast so that stop doesn't timeout waiting) and don't wait quite
so long before stopping.
If a read fills the whole buffer, then we possibly
missed something at the end, and we definitely shouldn't
put a '\0' beyond the end, so just return an error.
This should never happen anyway.
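The check can be sketched like this. The helper name read_attr is hypothetical and this is not the patched mdadm function, only an illustration of treating a completely filled buffer as an error:

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a short text value into buf and
 * NUL-terminate it.  If the read fails, or fills the whole buffer
 * (so the value may be truncated and there is no room for '\0'),
 * return -1 instead of writing past the end. */
static int read_attr(int fd, char *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	if (n < 0 || (size_t)n >= len)
		return -1;
	buf[n] = '\0';
	return (int)n;
}
```

A caller that sees -1 cannot tell truncation from an I/O error, which matches the "just return an error" choice in the message above.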
A 'devnm' never starts with '/', so this test is pointless.
The code should use the passed-in devname unless it is clearly
not usable. So fix it to do that.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:12 +0000 (13:42 +0800)]
mdadm: change the num of cluster node
This extends the nodes option to assemble mode, so that the number of
cluster nodes can be changed by the user.
Before that, it is necessary to ensure there is enough space
for those nodes; calc_bitmap_size is introduced to calculate
the bitmap size of each node.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:11 +0000 (13:42 +0800)]
mdadm: add the ability to change cluster name
To support changing the cluster name, the commit does the following:
1. extend the original write_bitmap function for the new scenario.
2. add the scenario to handle the modification of the cluster's name
in write_bitmap1.
3. let the cluster name also show in examine_super1 and detail_super1
Guoqing Jiang [Wed, 10 Jun 2015 05:42:09 +0000 (13:42 +0800)]
Convert a bitmap=none device to clustered
This adds the ability to convert a regular md without bitmap
(--bitmap=none) to a clustered device (--bitmap=clustered).
To convert a device with --bitmap=internal or --bitmap=external,
you have to convert to --bitmap=none and then re-execute the
command with --bitmap=clustered.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:08 +0000 (13:42 +0800)]
Add a new clustered disk
A clustered disk is added by the traditional --add sequence.
However, other nodes need to acknowledge that they can "see"
the device. This is done by --cluster-confirm:
--cluster-confirm SLOTNUM:/dev/whatever (if disk is found)
or
--cluster-confirm SLOTNUM:missing (if disk is not found)
The node initiating the --add has the disk state tagged with
MD_DISK_CLUSTER_ADD, and the one confirming tags the disk with
MD_DISK_CANDIDATE.
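A minimal sketch of splitting a --cluster-confirm argument of the form SLOTNUM:/dev/whatever or SLOTNUM:missing. The function name parse_cluster_confirm and its return convention are hypothetical, not mdadm's actual option parsing:

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser: on success returns 0, stores the slot number in
 * *slot and the device path in *dev (NULL when the argument says
 * "missing").  Returns -1 on malformed input. */
static int parse_cluster_confirm(const char *arg, int *slot, const char **dev)
{
	char *end;
	long n = strtol(arg, &end, 10);

	if (end == arg || *end != ':' || end[1] == '\0')
		return -1;
	if (n < 0 || n > INT_MAX)
		return -1;
	*slot = (int)n;
	*dev = strcmp(end + 1, "missing") == 0 ? NULL : end + 1;
	return 0;
}
```

For "2:/dev/sdc" this gives slot 2 and the path "/dev/sdc"; for "3:missing" it gives slot 3 and a NULL device.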
Guoqing Jiang [Wed, 10 Jun 2015 05:42:06 +0000 (13:42 +0800)]
Set home-cluster while creating an array
The home-cluster is stored in the bitmap super block of the
array. The device can be assembled on a cluster whose
name matches the one recorded in the bitmap.
If home-cluster is not specified, it is auto-detected using
the corosync cmap library, loaded via dlopen.
neilb: allow code to compile when corosync-devel is not installed.
Guoqing Jiang [Wed, 10 Jun 2015 05:42:05 +0000 (13:42 +0800)]
Add nodes option while creating md
Specifies the maximum number of nodes in the cluster that may use
this device simultaneously. This is equivalent to the number of
bitmaps created in the internal superblock (patches to follow).
NeilBrown [Thu, 28 May 2015 06:53:26 +0000 (16:53 +1000)]
test: make 'check wait' more reliable.
'recover' etc doesn't appear in /proc/mdstat immediately.
The "sync" thread must be started first.
But 'sync_action' shows it as soon as MD_RECOVERY_NEEDED is set
in the kernel. So look there too.
Now maybe I can get rid of some of those silly 'sleep' calls.