RAID can be implemented either by dedicated hardware (RAID modules integrated into SCSI or SATA controller cards) or by software abstraction (the kernel). Whether hardware or software, a RAID system with enough redundancy can transparently stay operational when a disk fails; the upper layers of the stack (applications) can even keep accessing the data in spite of the failure. Of course, this “degraded mode” can have an impact on performance, and redundancy is reduced, so a further disk failure can lead to data loss. In practice, therefore, one will strive to only stay in this degraded mode for as long as it takes to replace the failed disk. Once the new disk is in place, the RAID system can reconstruct the required data so as to return to a safe mode. The applications won't notice anything, apart from potentially reduced access speed, while the array is in degraded mode or during the reconstruction phase.
When RAID is implemented by hardware, its configuration generally happens within the BIOS setup tool, and the kernel will consider a RAID array as a single disk, which will work as a standard physical disk, although the device name may be different. For instance, the kernel in Squeeze made some hardware RAID arrays available as /dev/cciss/c0d0
; the kernel in Wheezy changed this name to the more natural /dev/sda
, but other RAID controllers may still behave differently.
12.1.1.2. Setting up RAID
Setting up RAID volumes requires the
mdadm
package; it provides the
mdadm
command, which allows creating and manipulating RAID arrays, as well as scripts and tools integrating it to the rest of the system, including the monitoring system.
Our example will be a server with a number of disks, some of which are already used, the rest being available to setup RAID. We initially have the following disks and partitions:
the sdb
disk, 4 GB, is entirely available;
the sdc
disk, 4 GB, is also entirely available;
on the sdd
disk, only partition sdd2
(about 4 GB) is available;
finally, a sde
disk, still 4 GB, entirely available.
We're going to use these physical elements to build two volumes, one RAID-0 and one mirror (RAID-1). Let's start with the RAID-0 volume:
#
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
#
mdadm --query /dev/md0
/dev/md0: 8.00GiB raid0 2 devices, 0 spares. Use mdadm --detail for more detail.
#
mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Jan 17 15:56:55 2013
Raid Level : raid0
Array Size : 8387584 (8.00 GiB 8.59 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Jan 17 15:56:55 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Chunk Size : 512K
Name : mirwiz:0 (local to host mirwiz)
UUID : bb085b35:28e821bd:20d697c9:650152bb
Events : 0
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
#
mkfs.ext4 /dev/md0
mke2fs 1.42.5 (29-Jul-2012)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=256 blocks
524288 inodes, 2096896 blocks
104844 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2147483648
64 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
#
mkdir /srv/raid-0
#
mount /dev/md0 /srv/raid-0
#
df -h /srv/raid-0
Filesystem Size Used Avail Use% Mounted on
/dev/md0 7.9G 146M 7.4G 2% /srv/raid-0
The mdadm --create
command requires several parameters: the name of the volume to create (/dev/md*
, with MD standing for Multiple Device), the RAID level, the number of disks (which is compulsory despite being mostly meaningful only with RAID-1 and above), and the physical drives to use. Once the device is created, we can use it like we'd use a normal partition, create a filesystem on it, mount that filesystem, and so on. Note that our creation of a RAID-0 volume on md0
is nothing but coincidence, and the numbering of the array doesn't need to be correlated to the chosen amount of redundancy. It's also possible to create named RAID arrays, by giving mdadm
parameters such as /dev/md/linear
instead of /dev/md0
.
Creation of a RAID-1 follows a similar fashion, the differences only being noticeable after the creation:
#
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd2 /dev/sde
mdadm: Note: this array has metadata at the start and
may not be suitable as a boot device. If you plan to
store '/boot' on this device please ensure that
your boot-loader understands md/v1.x metadata, or use
--metadata=0.90
mdadm: largest drive (/dev/sdd2) exceeds size (4192192K) by more than 1%
Continue creating array?
y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
#
mdadm --query /dev/md1
/dev/md1: 4.00GiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.
#
mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Thu Jan 17 16:13:04 2013
Raid Level : raid1
Array Size : 4192192 (4.00 GiB 4.29 GB)
Used Dev Size : 4192192 (4.00 GiB 4.29 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Jan 17 16:13:04 2013
State : clean, resyncing (PENDING)
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : mirwiz:1 (local to host mirwiz)
UUID : 6ec558ca:0c2c04a0:19bca283:95f67464
Events : 0
Number Major Minor RaidDevice State
0 8 50 0 active sync /dev/sdd2
1 8 64 1 active sync /dev/sde
#
mdadm --detail /dev/md1
/dev/md1:
[...]
State : clean
[...]
A few remarks are in order. First, mdadm
notices that the physical elements have different sizes; since this implies that some space will be lost on the bigger element, a confirmation is required.
More importantly, note the state of the mirror. The normal state of a RAID mirror is that both disks have exactly the same contents. However, nothing guarantees this is the case when the volume is first created. The RAID subsystem will therefore provide that guarantee itself, and there will be a synchronization phase as soon as the RAID device is created. After some time (the exact amount will depend on the actual size of the disks…), the RAID array switches to the “active” state. Note that during this reconstruction phase, the mirror is in a degraded mode, and redundancy isn't assured. A disk failing during that risk window could lead to losing all the data. Large amounts of critical data, however, are rarely stored on a freshly created RAID array before its initial synchronization. Note that even in degraded mode, the /dev/md1
is usable, and a filesystem can be created on it, as well as some data copied on it.
Now let's see what happens when one of the elements of the RAID-1 array fails. mdadm
, in particular its --fail
option, allows simulating such a disk failure:
#
mdadm /dev/md1 --fail /dev/sde
mdadm: set /dev/sde faulty in /dev/md1
#
mdadm --detail /dev/md1
/dev/md1:
[...]
Update Time : Thu Jan 17 16:14:09 2013
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : mirwiz:1 (local to host mirwiz)
UUID : 6ec558ca:0c2c04a0:19bca283:95f67464
Events : 19
Number Major Minor RaidDevice State
0 8 50 0 active sync /dev/sdd2
1 0 0 1 removed
1 8 64 - faulty spare /dev/sde
The contents of the volume are still accessible (and, if it is mounted, the applications don't notice a thing), but the data safety isn't assured anymore: should the sdd
disk fail in turn, the data would be lost. We want to avoid that risk, so we'll replace the failed disk with a new one, sdf
:
#
mdadm /dev/md1 --add /dev/sdf
mdadm: added /dev/sdf
#
mdadm --detail /dev/md1
/dev/md1:
[...]
Raid Devices : 2
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Thu Jan 17 16:15:32 2013
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 1
Spare Devices : 1
Rebuild Status : 28% complete
Name : mirwiz:1 (local to host mirwiz)
UUID : 6ec558ca:0c2c04a0:19bca283:95f67464
Events : 26
Number Major Minor RaidDevice State
0 8 50 0 active sync /dev/sdd2
2 8 80 1 spare rebuilding /dev/sdf
1 8 64 - faulty spare /dev/sde
#
[...]
[...]
#
mdadm --detail /dev/md1
/dev/md1:
[...]
Update Time : Thu Jan 17 16:16:36 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Name : mirwiz:1 (local to host mirwiz)
UUID : 6ec558ca:0c2c04a0:19bca283:95f67464
Events : 41
Number Major Minor RaidDevice State
0 8 50 0 active sync /dev/sdd2
2 8 80 1 active sync /dev/sdf
1 8 64 - faulty spare /dev/sde
Here again, the kernel automatically triggers a reconstruction phase during which the volume, although still accessible, is in a degraded mode. Once the reconstruction is over, the RAID array is back to a normal state. One can then tell the system that the sde
disk is about to be removed from the array, so as to end up with a classical RAID mirror on two disks:
#
mdadm /dev/md1 --remove /dev/sde
mdadm: hot removed /dev/sde from /dev/md1
#
mdadm --detail /dev/md1
/dev/md1:
[...]
Number Major Minor RaidDevice State
0 8 50 0 active sync /dev/sdd2
2 8 80 1 active sync /dev/sdf
From then on, the drive can be physically removed when the server is next switched off, or even hot-removed when the hardware configuration allows hot-swap. Such configurations include some SCSI controllers, most SATA disks, and external drives operating on USB or Firewire.
12.1.1.3. Backing up the Configuration
Most of the meta-data concerning RAID volumes are saved directly on the disks that make up these arrays, so that the kernel can detect the arrays and their components and assemble them automatically when the system starts up. However, backing up this configuration is encouraged, because this detection isn't fail-proof, and it is only expected that it will fail precisely in sensitive circumstances. In our example, if the sde
disk failure had been real (instead of simulated) and the system had been restarted without removing this sde
disk, this disk could start working again due to having been probed during the reboot. The kernel would then have three physical elements, each claiming to contain half of the same RAID volume. Another source of confusion can come when RAID volumes from two servers are consolidated onto one server only. If these arrays were running normally before the disks were moved, the kernel would be able to detect and reassemble the pairs properly; but if the moved disks had been aggregated into an md1
on the old server, and the new server already has an md1
, one of the mirrors would be renamed.
Backing up the configuration is therefore important, if only for reference. The standard way to do it is by editing the /etc/mdadm/mdadm.conf
file, an example of which is listed here:
Example 12.1. mdadm
configuration file
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
DEVICE /dev/sd*
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
ARRAY /dev/md0 metadata=1.2 name=mirwiz:0 UUID=bb085b35:28e821bd:20d697c9:650152bb
ARRAY /dev/md1 metadata=1.2 name=mirwiz:1 UUID=6ec558ca:0c2c04a0:19bca283:95f67464
# This configuration was auto-generated on Thu, 17 Jan 2013 16:21:01 +0100
# by mkconf 3.2.5-3
One of the most useful details is the DEVICE
option, which lists the devices where the system will automatically look for components of RAID volumes at start-up time. In our example, we replaced the default value, partitions containers
, with an explicit list of device files, since we chose to use entire disks and not only partitions, for some volumes.
The last two lines in our example are those allowing the kernel to safely pick which volume number to assign to which array. The metadata stored on the disks themselves are enough to re-assemble the volumes, but not to determine the volume number (and the matching /dev/md*
device name).
Fortunately, these lines can be generated automatically:
#
mdadm --misc --detail --brief /dev/md?
ARRAY /dev/md0 metadata=1.2 name=mirwiz:0 UUID=bb085b35:28e821bd:20d697c9:650152bb
ARRAY /dev/md1 metadata=1.2 name=mirwiz:1 UUID=6ec558ca:0c2c04a0:19bca283:95f67464
The contents of these last two lines doesn't depend on the list of disks included in the volume. It is therefore not necessary to regenerate these lines when replacing a failed disk with a new one. On the other hand, care must be taken to update the file when creating or deleting a RAID array.