How to speed up migration from RAID 5 to RAID 6 with mdadm?

Today I started the migration of my RAID 5 array to RAID 6 by adding a new disk (going from 7 to 8 disks, all 3 TB). The reshape is now in progress (output of cat /proc/mdstat):

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 sdi[9] sdh[8] sdf[7] sdc[6] sdd[4] sda[0] sde[5] sdb[1]
      17581590528 blocks super 1.2 level 6, 512k chunk, algorithm 18 [8/7] [UUUUUUU_]
      [>....................]  reshape =  2.3% (69393920/2930265088) finish=6697.7min speed=7118K/sec

unused devices: <none>

but it is slow as hell: it will take almost 5 days to complete. Previous reshapes of this array took about 1 day, but this one crawls along at a very low speed. The backup file is on an SSD.

I have already changed the stripe size and the minimum and maximum speed limits, and it did not change a thing.
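For reference, what I changed looks roughly like this (a sketch only, assuming the standard md tunables, that "stripe size" refers to the md stripe_cache_size setting, and that /dev/md0 is the array shown above; the numeric values are just examples):

    # Raise the per-device resync/reshape speed limits (values in KB/s, example numbers)
    echo 50000  > /proc/sys/dev/raid/speed_limit_min
    echo 200000 > /proc/sys/dev/raid/speed_limit_max

    # Enlarge the stripe cache for md0 (counted in stripe entries, not bytes)
    echo 8192 > /sys/block/md0/md/stripe_cache_size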

Is there any way I can speed up the process to a reasonable amount of time, or do I have to wait 5 days for it to finish?

Update: iostat -kx 10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    1.68   22.07    0.00   76.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda            1675.90  1723.00   27.60   23.90 13875.20  6970.00   809.52     0.55   10.79    8.59   13.33   7.97  41.02
sdb            1675.90  1723.10   27.20   23.80 13670.40  6970.00   809.43     0.55   10.80    8.96   12.90   8.12  41.43
sdc            1675.90  1723.60   27.50   23.30 13824.00  6970.00   818.66     0.65   12.85   10.48   15.65   9.83  49.94
sdd            1675.90  1723.10   27.60   23.80 13875.20  6970.00   811.10     0.55   10.80    8.93   12.98   8.16  41.95
sde            1675.90  1723.10   27.20   23.80 13670.40  6970.00   809.43     0.60   11.79    9.17   14.79   9.19  46.87
sdf            1675.90  1723.80   27.70   23.10 13926.40  6970.00   822.69     0.72   14.28   11.65   17.43  10.12  51.40
sdg               0.00     4.10    0.00   93.20     0.00 39391.20   845.30     6.07   65.14    0.00   65.14   2.71  25.29
dm-0              0.00     0.00    0.00    4.30     0.00    18.40     8.56     0.00    0.07    0.00    0.07   0.02   0.01
dm-1              0.00     0.00    0.00   89.60     0.00 39372.80   878.86     6.07   67.78    0.00   67.78   2.82  25.28
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdh            1583.50  1631.70  216.50  115.90 13824.00  6970.00   125.11     1.56    4.73    5.36    3.55   0.43  14.41
sdi               0.00  1631.70    0.00  115.90     0.00  6970.00   120.28     0.21    1.77    0.00    1.77   0.28   3.25
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

sdi is the disk I added last, sdg is the SSD, and the dm-X devices are the LVM logical volumes.

Asked By: Baptiste Wicht


It seems that the slowness is related to the mdadm migration from RAID 5 to RAID 6 itself. I have just added a new disk to the array, and that grow runs at a speed (40000K/s) that is totally reasonable for my hardware.

Answered By: Baptiste Wicht

According to this blog post by Neil Brown (the creator of mdadm), you can avoid the speed penalty caused by mdadm's block-range backup process by doing the following:

  1. Increase the number of RAID devices (e.g. reshape from a 4-disk RAID5 to a 5-disk RAID6):
    mdadm --grow /dev/md0 --level=6 --raid-devices=5
  2. Do not specify the --backup-file option.

The reasoning he details in his blog post is that a backup file is unnecessary when another drive has been added, because the reshape process differs slightly in that case: there is generally a gap between the old and new layouts, which can be used to back up the old-layout data being operated on during the reshape.
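Applied to the 7-to-8-disk array from the question, the whole sequence would look roughly like this (a sketch only; /dev/sdi is the newly added disk and /dev/md0 the array name, both taken from the question):

    # Add the new disk as a spare, then reshape to an 8-device RAID6
    # without giving mdadm a --backup-file
    mdadm --add /dev/md0 /dev/sdi
    mdadm --grow /dev/md0 --level=6 --raid-devices=8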

This excerpt from his article explains this in more detail:

How Level Changing Works

If we think of “RAID5” as a little more generic than the standard definition, and allow it to be any layout which stripes data plus 1 parity block across a number of devices, then we can think of RAID4 as just a special case of RAID5. Then we can imagine a conversion from RAID0 to RAID5 as taking two steps. The first converts to RAID5 using the RAID4 layout with the parity disk as the last disk. This clearly doesn’t require any data to be relocated so the change can be instant. It creates a degraded RAID5 in a RAID4 layout so it is not complete, but it is clearly a step in the right direction.
I’m sure you can see what comes next. After converting the RAID0 to a degraded RAID5 with an unusual layout we would use the new change-the-layout functionality to convert to a real RAID5.

It is a very similar process that can now be used to convert a RAID5 to a RAID6. We first change the RAID5 to RAID6 with a non-standard layout that has the parity blocks distributed as normal, but the Q blocks all on the last device (a new device). So this is RAID6 using the RAID6 driver, but with a non-RAID6 layout. So we “simply” change the layout and the job is done.

A RAID6 can be converted to a RAID5 by the reverse process. First we change the layout to a layout that is almost RAID5 but with an extra Q disk. Then we convert to real RAID5 by forgetting about the Q disk.

Complexities of re-striping data

In all of this the messiest part is ensuring that the data survives a crash or other system shutdown. With the first reshape which just allowed increasing the number of devices, this was quite easy. For most of the time there is a gap in the devices between where data in the old layout is being read, and where data in the new layout is being written. This gap allows us to have two copies of that data. If we disable writes to a small section while it is being reshaped, then after a crash we know that the old layout still has good data, and simply re-layout the last few stripes from where-ever we recorded that we were up to.

This doesn’t work for the first few stripes as they require writing the new layout over the old layout. So after a crash the old layout is probably corrupted and the new layout may be incomplete. So mdadm takes care to make a backup of those first few stripes and when it assembles an array that was still in the early phase of a reshape it first restores from the backup.

For a reshape that does not change the number of devices, such as changing chunksize or layout, every write will be over-writing the old layout of that same data so after a crash there will definitely be a range of blocks that we cannot know whether they are in the old layout or the new layout or a bit of both. So we need to always have a backup of the range of blocks that are currently being reshaped.

This is the most complex part of the new functionality in mdadm 3.1 (which is not released yet but can be found in the devel-3.1 branch of git://neil.brown.name/mdadm). mdadm monitors the reshape, setting an upper bound of how far it can progress at any time and making sure the area that it is allowed to rearrange has writes disabled and has been backed up.

This means that all the data is copied twice, once to the backup and once to the new layout on the array. This clearly means that such a reshape will go very slowly. But that is the price we have to pay for safety. It is like insurance. You might hate having to pay it, but you would hate it much more if you didn’t and found that you needed it.
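For completeness: the two-step conversion described above can also be driven explicitly through the --layout option of recent mdadm versions. This is only a sketch, and the exact behaviour of these values should be verified against mdadm(8) for your version:

    # Step 1: switch to RAID6 with all the Q blocks on the new last disk.
    # No data is relocated, so this step is quick and needs no backup file.
    mdadm --grow /dev/md0 --level=6 --raid-devices=8 --layout=preserve

    # Step 2 (optionally, later): restripe to a standard RAID6 layout.
    # This is the slow, continuously backed-up re-striping pass described above.
    mdadm --grow /dev/md0 --layout=normalise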

Answered By: TrinitronX