external-reshape-design.txt

   1 External Reshape
   2
   3 1 Problem statement
   4
   5 External (third-party metadata) reshape differs from native-metadata
   6 reshape in three key ways:
   7
   8 1.1 Format-specific constraints
   9
  10 In the native case reshape is limited by what is implemented in the
  11 generic reshape routine (Grow_reshape()) and what is supported by the
  12 kernel.  There are exceptional cases where Grow_reshape() may block
  13 operations when it knows that the kernel implementation is broken, but
  14 otherwise the kernel is relied upon to be the final arbiter of what
  15 reshape operations are supported.
  16
  17 In the external case the kernel, and the generic checks in
  18 Grow_reshape(), become the superset of what reshapes are possible.  The
  19 metadata format may not support, or may not yet implement, a given
  20 reshape type.  The implication for Grow_reshape() is that it must query
  21 the metadata handler and effect changes in the metadata before the new
  22 geometry is posted to the kernel.  The ->reshape_super method allows
  23 Grow_reshape() to validate the requested operation and post the metadata
  24 update.
  25
  26 1.2 Scope of reshape
  27
  28 Native metadata reshape is always performed at the array scope (no
  29 metadata relationship with sibling arrays on the same disks).  External
  30 reshape, depending on the format, may not allow the number of member
  31 disks to be changed in a subarray unless the change is simultaneously
  32 applied to all subarrays in the container.  For example the imsm format
  33 requires every member disk to belong to every subarray, so a 4-disk
  34 raid5 in a container that also houses a 4-disk raid10 array could not be
  35 reshaped to 5 disks as the imsm format does not support a 5-disk raid10
  36 representation.  This requires the ->reshape_super method to check the
  37 contents of the array and ask the user to run the reshape at container
  38 scope (if all subarrays are agreeable to the change), or report an
  39 error in the case where one subarray cannot support the change.
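The scope check described above can be sketched as follows.  This is a
minimal model, not the imsm handler: check_grow_scope(), level_supports()
and the ILLUSTRATIVE_LIMITS table are hypothetical names, and the only
limit taken from this document is that imsm has no 5-disk raid10
representation.

```python
# Hypothetical table: member-disk counts a format can represent per level.
# Only "no 5-disk raid10 in imsm" comes from the text; a level missing
# from the table is treated as unrestricted for this sketch.
ILLUSTRATIVE_LIMITS = {"raid10": {4}}


def level_supports(level, disks):
    """True if the (assumed) format can represent `level` on `disks` disks."""
    allowed = ILLUSTRATIVE_LIMITS.get(level)
    return allowed is None or disks in allowed


def check_grow_scope(subarrays, new_disks, container_scope):
    """subarrays: list of (name, level) pairs sharing one container.

    Returns (ok, message).  Assumes an imsm-like format in which every
    member disk belongs to every subarray.
    """
    blockers = [name for name, level in subarrays
                if not level_supports(level, new_disks)]
    if blockers:
        # One subarray cannot follow the change: report an error.
        return False, "cannot reshape: unsupported by " + ", ".join(blockers)
    if len(subarrays) > 1 and not container_scope:
        # All subarrays agree, but the request must cover the container.
        return False, "re-run the reshape at container scope"
    return True, "ok"
```

With a raid5 and a raid10 subarray in one container, a request for 5
disks is refused outright; with two raid5 subarrays the user is asked
to repeat the request at container scope.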
  40
  41 1.3 Monitoring / checkpointing
  42
  43 Reshape, unlike rebuild/resync, requires strict checkpointing to survive
  44 interrupted reshape operations.  For example when expanding a raid5
  45 array the first few stripes of the array will be overwritten in a
  46 destructive manner.  When restarting the reshape process we need to know
  47 the exact location of the last successfully written stripe, and we need
  48 to restore the data in any partially overwritten stripe.  Native
  49 metadata stores this backup data in the unused portion of spares that
  50 are being promoted to array members, or in an external backup file
  51 (located on a non-involved block device).
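To see why only the first few stripes are at risk, a simplified
chunk-level model may help.  It ignores parity rotation and the larger
units the kernel actually works in, and overwrites_own_source() is a
hypothetical name used only for this illustration.

```python
def overwrites_own_source(stripe, old_data_disks, new_data_disks):
    """True if writing new stripe `stripe` lands on data it still needs.

    Simplified model: new stripe s is written at per-device chunk
    offset s.  Its first source chunk is data chunk s * new_data_disks,
    which sits in old stripe (s * new_data_disks) // old_data_disks.
    When growing, that old stripe is never before s; if it equals s,
    the write destroys part of its own source.
    """
    if new_data_disks <= old_data_disks:
        raise ValueError("model only covers growing the array")
    first_source = (stripe * new_data_disks) // old_data_disks
    return first_source == stripe
```

Under this model, growing a 4-disk raid5 (3 data disks) to 5 disks
(4 data disks) puts stripes 0, 1 and 2 at risk; from stripe 3 onwards
the source data lies strictly ahead of the write position.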
  52
  53 The kernel is in charge of recording checkpoints of reshape progress,
  54 but mdadm is delegated the task of managing the backup space which
  55 involves:
  56 1/ Identifying what data will be overwritten in the next unit of reshape
  57    operation
  58 2/ Suspending access to that region so that a snapshot of the data can
  59    be transferred to the backup space.
  60 3/ Allowing the kernel to reshape the saved region and setting the
  61    boundary for the next backup.
  62
  63 In the external reshape case we want to preserve this mdadm
  64 'reshape-manager' arrangement, but have a third actor, mdmon, to
  65 consider.  It is tempting to give the role of managing reshape to mdmon,
  66 but that is counter to its role as a monitor, and conflicts with the
  67 existing capabilities and role of mdadm to manage the progress of
  68 reshape.  For clarity the external reshape implementation maintains the
  69 role of mdmon as a (mostly) passive recorder of raid events, and mdadm
  70 treats it as it would the kernel in the native reshape case (modulo
  71 needing to send explicit metadata update messages and checking that
  72 mdmon took the expected action).
  73
  74 External reshape can use the generic md backup file as a fallback, but in the
  75 optimal/firmware-compatible case the reshape-manager will use the metadata
  76 specific areas for managing reshape.  The implementation also needs to spawn a
  77 reshape-manager per subarray when the reshape is being carried out at the
  78 container level.  For these two reasons the ->manage_reshape() method is
  79 introduced.  This method, in addition to the base tasks mentioned above:
  80 1/ Processes each subarray one at a time in series - where appropriate.
  81 2/ Uses either generic routines in Grow.c for md-style backup file
  82    support, or uses the metadata-format specific location for storing
  83    recovery data.
  84 This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
  85 optionally take advantage of generic infrastructure in Grow.c.
  86
  87 2 Details for specific reshape requests
  88
  89 There are quite a few moving pieces spread out across md, mdadm, and mdmon for
  90 the support of external reshape, and there are several different types of
  91 reshape that need to be comprehended by the implementation.  A rundown of
  92 these details follows.
  93
  94 2.0 General provisions:
  95
  96 Obtain an exclusive open on the container to make sure we are not
  97 running concurrently with a Create() event.
  98
  99 2.1 Freezing sync_action
 100
 101    Before making any attempt at a reshape we 'freeze' every array in
 102    the container to ensure no spare assignment or recovery happens.
 103    This involves writing 'frozen' to sync_action and changing the '/'
 104    after 'external:' in metadata_version to a '-'. mdmon knows that
 105    this means not to perform any management.
 106
 107    Before doing this we check that all sync_actions are 'idle', which
 108    is racy but still useful.
 109    Afterwards we check that all member arrays have no spares
 110    or partial spares (recovery_start != 'none') which would indicate a
 111    race.  If they do, we unfreeze again.
 112
 113    Once this completes we know all the arrays are stable.  They may
 114    still have failed devices as devices can fail at any time.  However
 115    we treat those like failures that happen during the reshape.
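The string manipulation and the race check can be sketched as below.
The helper names are hypothetical, and the sample value
"external:/md127/0" is only an assumed example of what metadata_version
might contain for a subarray.

```python
PREFIX = "external:"


def freeze_version(version):
    """Turn 'external:/...' into 'external:-...' (the frozen form)."""
    if not version.startswith(PREFIX + "/"):
        raise ValueError("not an unfrozen external subarray: %r" % version)
    return PREFIX + "-" + version[len(PREFIX) + 1:]


def unfreeze_version(version):
    """Inverse of freeze_version(): restore the '/' after 'external:'."""
    if not version.startswith(PREFIX + "-"):
        raise ValueError("not a frozen external subarray: %r" % version)
    return PREFIX + "/" + version[len(PREFIX) + 1:]


def freeze_raced(recovery_starts):
    """True if any member shows recovery_start != 'none'.

    Such a device is a spare or partial spare, which means a recovery
    slipped in before the freeze took effect and we must unfreeze.
    """
    return any(value != "none" for value in recovery_starts)
```

A caller would apply freeze_version() to every array in the container,
then test freeze_raced() on each and fall back to unfreeze_version()
if any check fails.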
 116
 117 2.2 Reshape size
 118
 119    1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
 120       initializes st->update_tail
 121    2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
 122       change is allowed (being performed at subarray scope / enough room) and
 123       prepares a metadata update
 124    3/ mdadm::Grow_reshape(): flushes the metadata update (via
 125       flush_metadata_update(), or ->sync_metadata())
 126    4/ mdadm::Grow_reshape(): posts the new size to the kernel
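The ordering of these four steps is the important part: the kernel is
only told about the new size after the metadata handler has accepted
and recorded it.  A toy model follows; ToyHandler and reshape_size()
are hypothetical, and the choice of flush path based on whether mdmon
is running is an assumption of this sketch.

```python
class ToyHandler:
    """Stand-in for a metadata handler; not a real mdadm superswitch."""

    def __init__(self, free_space):
        self.free_space = free_space
        self.size = 0

    def reshape_super(self, new_size, update_tail):
        if new_size > self.free_space:
            return False                       # not enough room
        if update_tail is not None:
            update_tail.append(("size", new_size))   # queue for mdmon
        else:
            self.size = new_size               # update our own copy
        return True


def reshape_size(handler, mdmon_running, new_size):
    """Return the sequence of actions taken, in order."""
    trace = []
    update_tail = [] if mdmon_running else None          # step 1
    if not handler.reshape_super(new_size, update_tail):  # step 2
        return ["rejected"]                   # kernel is never touched
    trace.append("reshape_super")
    if update_tail is not None:                           # step 3
        trace.append("flush_metadata_update")
    else:
        trace.append("sync_metadata")
    trace.append("post size to kernel")                   # step 4
    return trace
```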
 127
 128
 129 2.3 Reshape level (simple-takeover)
 130
 131 "simple-takeover" implies the level change can be satisfied without touching
 132 sync_action.
 133
 134     1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
 135        initializes st->update_tail
 136     2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
 137        change is allowed (being performed at subarray scope) and prepares a
 138        metadata update
 139        2a/ raid10 --> raid0: degrade all mirror legs prior to calling
 140            ->reshape_super
 141     3/ mdadm::Grow_reshape(): flushes the metadata update (via
 142        flush_metadata_update(), or ->sync_metadata())
 143     4/ mdadm::Grow_reshape(): posts the new level to the kernel
 144
 145 2.4 Reshape chunk, layout
 146
 147 2.5 Reshape raid disks (grow)
 148
 149     1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
 150        because only redundant raid levels can modify the number of raid disks
 151     2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the
 152        raid disks change is allowed (being performed at proper scope /
 153        permissible geometry / proper spares available in the container),
 154        chooses the spares to use, and prepares a metadata update.
 155     3/ mdadm::Grow_reshape(): converts each subarray in the container to the
 156        raid level that can perform the reshape and starts mdmon.
 157     4/ mdadm::Grow_reshape(): pushes the update to mdmon.
 158     5/ mdadm::Grow_reshape(): uses container_content to find details of
 159        the spares and passes them to the kernel.
 160     6/ mdadm::Grow_reshape(): gives raid_disks update to the kernel,
 161        sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
 162        and starts the reshape by writing 'reshape' to sync_action.
 163     7/ mdmon::monitor notices the sync_action change and tells
 164        managemon to check for new devices.  managemon notices the new
 165        devices, opens the relevant sysfs files, and passes them all to
 166        monitor.
 167     8/ mdadm::Grow_reshape() calls ->manage_reshape to oversee the
 168        rest of the reshape.
 169
 170     9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
 171        the kernel to either the backup file or the metadata specific location,
 172        advances sync_max, waits for reshape, pings mdmon, and repeats.
 173        Meanwhile mdmon::read_and_act(): records checkpoints.
 174        Specifically:
 175
 176        9a/ if the 'next' stripe to be reshaped will over-write
 177            itself during reshape then:
 178         9a.1/ increase suspend_hi to cover a suitable number of
 179            stripes.
 180         9a.2/ back up those stripes safely.
 181         9a.3/ advance sync_max to allow those stripes to be reshaped
 182         9a.4/ when sync_completed indicates that those stripes have
 183            been reshaped, manage_reshape must ping_manager
 184         9a.5/ when mdmon notices that sync_completed has been updated,
 185            it records the new checkpoint in the metadata
 186         9a.6/ after the ping_manager, manage_reshape will increase
 187            suspend_lo to allow access to those stripes again
 188
 189        9b/ if the 'next' stripe to be reshaped will over-write unused
 190            space during reshape then we apply same process as above,
 191            except that there is no need to back anything up.
 192            Note that we *do* need to keep suspend_hi progressing as
 193            it is not safe to write to the area-under-reshape.  For
 194            kernel-managed-metadata this protection is provided by
 195            ->reshape_safe, but that does not protect us in the case
 196            of user-space-managed-metadata.
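The loop in 9a/9b can be modelled as below.  This is a sketch only:
manage_reshape() here is a toy function, not the handler method, the
kernel is replaced by a single assignment to sync_completed, and the
mdmon checkpoint is just an entry appended to a list.

```python
def manage_reshape(total, batch, overwrites_itself):
    """Walk the array in `batch`-stripe units and return what happened.

    overwrites_itself(pos) is a caller-supplied predicate saying
    whether the unit starting at stripe `pos` needs a backup (case 9a)
    or writes into unused space (case 9b).
    """
    sysfs = {"suspend_lo": 0, "suspend_hi": 0,
             "sync_max": 0, "sync_completed": 0}
    backups, checkpoints = [], []
    pos = 0
    while pos < total:
        end = min(pos + batch, total)
        sysfs["suspend_hi"] = end                 # 9a.1: block access
        if overwrites_itself(pos):
            backups.append((pos, end))            # 9a.2: skipped in 9b
        sysfs["sync_max"] = end                   # 9a.3: let reshape run
        sysfs["sync_completed"] = sysfs["sync_max"]   # stand-in for kernel
        checkpoints.append(sysfs["sync_completed"])   # 9a.4 / 9a.5
        sysfs["suspend_lo"] = end                 # 9a.6: allow access again
        pos = end
    return sysfs, backups, checkpoints
```

Note that suspend_hi advances on every pass, including passes where
nothing is backed up, which mirrors the point made in 9b.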
 197
 198    10/ mdadm::<format>->manage_reshape(): once reshape completes, changes the raid
 199        level back to the nominal raid level (if necessary)
 200
 201        FIXME: native metadata does not have the capability to record the original
 202        raid level in reshape-restart case because the kernel always records current
 203        raid level to the metadata, whereas external metadata can masquerade at an
 204        alternate level based on the reshape state.
 205
 206 2.6 Reshape raid disks (shrink)
 207
 208 3 Interaction with metadata handler
 209
 210   The following calls are made into the metadata handler to assist
 211   with initiating and monitoring a 'reshape'.
 212
 213   1/ ->reshape_super is called quite early (after only minimal
 214      checks) to make sure that the metadata can record the new shape
 215      and any necessary transitions.  It may be passed a 'container'
 216      or an individual array within a container, and it should notice
 217      the difference and act accordingly.
 218      When a reshape is requested against a container it is expected
 219      that it should be applied to every array in the container,
 220      however it is up to the metadata handler to determine final
 221      policy.
 222
 223      If the reshape is supportable, the internal copy of the metadata
 224      should be updated, and a metadata update suitable for sending
 225      to mdmon should be queued.
 226
 227      If the reshape will involve converting spares into array members,
 228      this must be recorded in the metadata too.
 229
 230   2/ ->container_content will be called to find out the new state
 231      of the array, or of all arrays in the container.  Any newly
 232      added devices (with state==0 and raid_disk >= 0) will be added
 233      to the array as spares with the relevant slot number.
 234
 235      It is likely that the info returned by ->container_content will
 236      have ->reshape_active set, ->reshape_progress set to e.g. 0, and
 237      new_* set appropriately.  mdadm will use this information to
 238      cause the correct reshape to start at an appropriate time.
 239
 240   3/ ->set_array_state will be called by mdmon when reshape has
 241      started and again periodically as it progresses.  This should
 242      record the ->last_checkpoint as the point where reshape has
 243      progressed to.  When the reshape finishes this will be called
 244      again and it should notice that ->curr_action is no longer
 245      'reshape' and so should record that the reshape has finished
 246      provided 'last_checkpoint' has progressed suitably.
 247
 248   4/ ->manage_reshape will be called once the reshape has been set
 249      up in the kernel but before sync_max has been moved from 0, so
 250      no actual reshape will have happened.
 251
 252      ->manage_reshape should call progress_reshape() to allow the
 253      reshape to progress, and should back-up any data as indicated
 254      by the return value.  See the documentation of that function
 255      for more details.
 256      ->manage_reshape will be called multiple times when a
 257      container is being reshaped, once for each member array in
 258      the container.
 259
 260
 261    The progress of the metadata is as follows:
 262     1/ mdadm sends a metadata update to mdmon which marks the array
 263        as undergoing a reshape.  This is set up by
 264        ->reshape_super and applied by ->process_update.
 265        For container-wide reshape, this happens once for the whole
 266        container.
 267     2/ mdmon notices progress via the sysfs files and calls
 268        ->set_array_state to update the state periodically.
 269        For container-wide reshape, this happens repeatedly for
 270        one array, then repeatedly for the next, etc.
 271     3/ mdmon notices when reshape has finished and calls
 272        ->set_array_state to record that the reshape is complete.
 273        For container-wide reshape, this happens once for each
 274        member array.
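The metadata progression for one member array can be pictured as a
small state machine.  ToySubarray is a hypothetical class, and reading
"progressed suitably" as "reached the end of the array" is an
assumption made for this sketch.

```python
class ToySubarray:
    """Toy view of one member array's reshape state in the metadata."""

    def __init__(self, end):
        self.end = end            # position at which reshape is complete
        self.reshaping = False
        self.checkpoint = 0

    def process_update(self):
        # 1/ the update queued by ->reshape_super marks the array
        self.reshaping = True
        self.checkpoint = 0

    def set_array_state(self, curr_action, last_checkpoint):
        if not self.reshaping:
            return
        # 2/ record how far the reshape has progressed
        self.checkpoint = last_checkpoint
        # 3/ only declare completion if the action has ended AND the
        #    checkpoint reached the end (assumed meaning of "suitably")
        if curr_action != "reshape" and last_checkpoint >= self.end:
            self.reshaping = False
```

An array whose sync action stops before the checkpoint reaches the end
stays marked as reshaping, so the recorded checkpoint remains available
for a restart.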
 275
 276
 277
 278 ...
 279
 280 [1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/