HOWTO install Linux 2.2 and 2.4 using RAID-Level 1 for Data- and Root/Boot-Partitions


-2. Preamble
-1. Prerequisites
-0. Introduction
1. Install Debian
2. The Kernel (Kernel 2.2.20, Kernel 2.2.25, Kernel 2.2.19, Kernel 2.4 (Debian Woody))
3. Raidtools2
3a. mdadm
3b. raidutils
4. Partitions
5. /etc/raidtab
6. ROOT fs on RAID
7. BOOT from RAID
8. Swap Space
9. Testing
10. Monitoring your RAID-Devices
11. In Case of Disk-Failure
12. Trailer


-2. Preamble
This document is based on the installation of AQUA (one of our servers) and describes how to install a Linux system using RAID-1 devices for the /data partition as well as for the system's root ( "/" ) itself.

Document Versions
04-10-27:
Added a link to another Debian-install-on-RAID-guide.
Added a short note about Debian Sarge.
04-06-21:
Added a link to raidutils.
04-03-29:
Added a paragraph to "6. ROOT fs on RAID" about a situation where "mount" could produce wrong output.
03-12-26:
Updated information about the Linux 2.4 kernel series (which works perfectly fine now).
03-08-26:
Added information on the section about mdadm.
03-07-11:
Added a second monitoring script which works for both Linux 2.2 and 2.4.
03-06-06:
Added a section about mdadm.
03-04-24:
Added link to raidreconf.
03-03-17:
Added a section for Linux 2.2.25.
03-03-01:
Minor changes to clear some things up.
Changed everything to mixed-case in order to improve readability.
03-01-24:
Restructured some HTML in order to be "HTML 4.01 Strict" valid.
Created a chapter "the kernel" to hold the specific sections.
03-01-23:
Publishing the document under the GNU Free Documentation License — finally. :)
03-01-19:
Changed this page’s home to a new URL (nevertheless it is still accessible using the old one too).
Repaired two links to "tldp".
The page got a new look.

License
Copyright (c) Markus Amersdorfer and Bruno Randolf, subnet.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A current copy of the license should be available here. In case of failure, Version 1.2 of the GNU FDL is also available locally.

Disclaimer
This document comes without any warranty and does not claim to be complete, nor does it necessarily hold correct information. The authors cannot be held liable for any loss of data, corrupted hardware or any other inconvenience resulting from information in this document. Use it at your own risk!

Authors
Markus Amersdorfer (max _at_ subnet)
Bruno Randolf (br1 _at_ subnet)
-1. Prerequisites
Software (used/needed):
Debian (2.2r3 aka "Debian Potato" or 3.0 aka "Debian Woody")
Linux 2.2.19, 2.2.20 or 2.2.25 (official Linux kernel)
raid-2.2.19-A1 or raid-2.2.20-A0
raidtools2 (from the according Debian-tree)
(LILO-21.5beta which comes with Debian 2.2r3, or later.)
Hardware:
SCSI-Disks:
2x 18,2GB IBM DDYS-T18350 (SCSI-LVD2)
2x 9,1GB IBM DDYS-T09170 (SCSI-LVD2)
SCSI-Controller:
Adaptec 19160 (v.2.57.2)
HOWTOs and other resources:
http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO.html
http://www.linuxdoc.org/HOWTO/Boot%2BRoot%2BRaid%2BLILO.html
http://www.fokus.gmd.de/linux/dlhp/aktuell/DE-Software-RAID-HOWTO.html (German)
RAID Reconfiguration Tool
Installing Debian – a dense description of how to install Debian Woody on a RAID-1-device using Knoppix and debootstrap.
0. Introduction
The two 18 GB disks shall be mirrored (RAID-1) and contain the data. The two 9 GB disks shall be mirrored (RAID-1) and contain the system-files and the root-fs. The system shall be bootable even if one of the two system-disks failed.

We’ll use a patched linux-2.2.19 kernel (see below for info on kernels 2.2.20 or 2.2.25) in order to have the newer raid-0.90 (kernel 2.2.x comes with an old RAID system, raid-0.42 I believe). Among the advantages of raid-0.90 is the possibility of having the kernel "auto-detect" RAID partitions on the harddrives.

1. Install Debian
Well … do that. :)
Install the system as usual on the first of the mentioned system harddrives.

Here are some notes of what we did:
swap: /dev/sda2
root: /dev/sda1
no ext2 backwards compatibility
modules: net: 3c59x
(… net config …)
hardware clock set to GMT
lilo in MBR
reboot
md5 passwords
scan cdrom 1/2/3
software install: simple (C Dev, Samba, Tcltk)
exim: 1/rbl-filter
samba: daemon
login
dselect: +nano +vim +bzip2 +libncurses5-dev
dselect: install (this installs all standard packages as well)
We strongly recommend you create a boot disk, in case your system is not bootable anymore at some point during this work.
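One way to do this on Debian (a sketch, assuming a floppy drive at /dev/fd0 and the mkboot script from debianutils; its exact behaviour may differ between versions) is:

  # insert an empty floppy first; the kernel image is the stock Debian one at this point
  mkboot /boot/vmlinuz-2.2.19pre17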

Debian Sarge
Update — 04-10-27:
The new installer coming with the release of Debian Sarge supports RAID devices out of the box. Many of the steps below to get such a system up and running will no longer have to be performed manually.

2. The Kernel
2.1.a Kernel 2.2.20
Kernel 2.2.20 (as usual check out http://www.kernel.org/ for where to find it) above all fixes two rather serious bugs present in 2.2.19 and earlier: one makes your machine vulnerable to a DoS attack, and a cracker who knows how to exploit the other can even gain root privileges.
Recommendation as usual: Update your kernel!

Just get the kernel and the patch raid-2.2.20-A0 … the installation process is basically the same as with kernel 2.2.19 :) (which will be used in the description below).

2.1.b Kernel 2.2.25
Linux 2.2 up to and including 2.2.24 (as well as Linux 2.4 up to and including 2.4.20) are vulnerable to a local root exploit, so it’s a good idea to install 2.2.25 on a production (and any other) system.

As kernel 2.2.20 is the latest kernel (as of 2003-03-17) for which a separate raid-patch exists, we’ll have to use the 2.2.20 patch with later kernels of the 2.2 series. This should basically work fine, as the 2.2 series no longer changes kernel internals.
My experience with a 2.2.23 kernel has been flawless, so I suppose it also works with the 2.2.25 kernel.

Just get the kernel and the patch. Patch the kernel and note the two messages "Hunk #1 succeeded at 88 with fuzz 2" and "patching file Makefile -- Hunk #1 FAILED at 1". The first hunk succeeded, so that part is fine; the failed hunk only concerns the EXTRAVERSION string in /usr/src/linux/Makefile. So edit the file and add raid as the variable's value.
(I wrote "raid" in lower case, as this complies with the Debian policy… I think… :) …)

Once you have finished patching the kernel, go on and configure, compile, install and use it as described below for version 2.2.19.

2.2. Kernel 2.2.19
Get the kernel linux-2.2.19.tar.gz and copy it to /usr/src. (Would be nice to download the source from a mirror near you, check out http://www.kernel.org/mirrors/ to find one.)

  cd /usr/src
  [ if necessary:  mv linux linux-old ]
  tar xzf linux-2.2.19.tar.gz
  mv linux linux-2.2.19
  ln -s linux-2.2.19 linux

Get the raid-0.90 patch "raid-2.2.19-A1" from http://people.redhat.com/mingo/raid-patches/ and copy it to /usr/src.

  cd /usr/src
  patch -p0 < raid-2.2.19-A1

The "hunk [...] succeeded"-messages are fine :) . If no error is reported, go on and config your kernel as you wish to:

  cd /usr/src/linux
  make xconfig  [ or "make menuconfig" ]

Be sure to say "y" to the following (in section "block devices"):

  y  multiple devices driver support
  y  autodetect RAID partitions
  y  RAID-1 (mirroring) mode

All other RAID-related options are set to "n".

Store config to file, save and exit.
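To double-check that the RAID options really made it into the configuration, you could grep the freshly written .config. (A sketch; the symbol names below are the ones used by the 2.4 tree and the raid-0.90 patch, and might differ slightly in your kernel version.)

  cd /usr/src/linux
  grep -E 'CONFIG_BLK_DEV_MD|CONFIG_MD_AUTODETECT|CONFIG_MD_RAID1' .config
  # expected output, roughly:
  #   CONFIG_BLK_DEV_MD=y
  #   CONFIG_MD_AUTODETECT=y
  #   CONFIG_MD_RAID1=y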
Compile and install the kernel manually as described below, or follow the instructions here to do it the Debian-way...

  make dep
  make bzImage
  [ if necessary:  mv /lib/modules/2.2.19 /lib/modules/2.2.19_old ]
  make modules
  make modules_install

  cp System.map /boot/System.map-2.2.19
  cp arch/i386/boot/bzImage /boot/vmlinuz-2.2.19

  cd /boot
  ln -s System.map-2.2.19 System.map
  ln -s vmlinuz-2.2.19 vmlinuz

Edit /etc/lilo.conf accordingly and execute lilo (and don't forget the prompt option, which is quite useful). At this time, LILO should still install just to the MBR of /dev/sda.
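For orientation, lilo.conf could look roughly like this at this stage (a sketch using the device names from this document; keep the original Debian kernel entry around as a fallback until the new kernel has proven itself):

  # /etc/lilo.conf -- excerpt, still booting plain /dev/sda at this point
  boot=/dev/sda
  root=/dev/sda1
  prompt
  delay=20

  image=/boot/vmlinuz-2.2.19
          label=2.2.19
          read-only

  image=/boot/vmlinuz-2.2.19pre17
          label=debian
          read-only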

Reboot and see your shiny new RAID-1-capable kernel come up :) .

2.3. Kernel 2.4 (Debian Woody)
03-12-26, Update:
I've repeatedly had problems with our development server back then when trying to have it run Linux 2.4.17 and 2.4.18. (With the latter, the machine kept crashing (it just froze during normal operation), the last time resulting in a non-bootable system. No error-logs or similar were available.
It took me some time to undo the damage.) I've never found out exactly what the problem was.

Nevertheless, I do think the problem was due to some basic 2.4-problems with this one machine's hardware and had nothing to do with having the machine use Linux' SW-RAID. Meanwhile, all our servers are properly running Linux 2.4.23, including the software RAID mirrored ones!

To install Linux 2.4 with SW-RAID enabled, just recompile the kernel activating RAID capabilities. You don't have to patch anything, as the 2.4 series already has the 0.90-series RAID as its default.

Mind 1: When using Linux 2.4, do use Debian 3.0 (aka Debian Woody) or later, as Potato is not 2.4-ready.
Mind 2: When using Linux 2.4, be sure to have raidtools2 at version 0.90.20010914-9 or better. (This version was uploaded to Woody between the beginning of January '02 and the middle of February '02; an up-to-date version is thus part of the official Debian 3.0 release.)
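You can quickly check which raidtools2 version you have (or would get from your sources) with the standard Debian tools, for example:

  dpkg -l raidtools2            # version of the installed package
  apt-cache policy raidtools2   # version available from your sources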

3. Raidtools2
Get the package raidtools2 (which is the one for 0.90-RAID, don’t use the package "raidtools" which is for old-style RAID):

 aqua:~# cat /etc/apt/sources.list
  deb http://http.us.debian.org/debian stable main contrib non-free
  deb http://non-us.debian.org/debian-non-US stable/non-US main contrib non-free
  deb http://security.debian.org stable/updates main contrib non-free
  [...]

and do

 apt-get update
 apt-get install raidtools2

Fortunately we took the time to review the package list in dselect and installed some essential packages such as bzip2. This would be a good time to do so in order to get a really nice system :) .

3a. mdadm
"mdadm" is a tool with similar functionality to the raidtools. I have never used it thoroughly yet, but I just wanted to mention it.

Once a partition of type "fd" has been part of an MD device (e.g. /dev/md0), even changing its type back to the normal "83" will not prevent the kernel (with the autodetection functionality described below) from recognizing this partition as belonging to md0. This is because of the special RAID superblock written near the end of the partition.
mdadm provides a convenient way to get rid of this RAID superblock without losing the data stored in the corresponding partition. I tried this once on a different machine than the one used in the rest of this HOWTO (its /dev/md0 of course had nothing to do with the one described in later sections). Though it worked for me, be sure to have backups!
No other partition was part of the RAID device than the one I wanted to erase the superblock from: /proc/mdstat showed /dev/md0 to consist only of /dev/hdc1 (which had been set to type "83" already before booting the machine). I stopped the RAID device using raidstop /dev/md0 (you might have to unmount the device first). Next, I ran mdadm --zero-superblock /dev/hdc1 and rebooted the machine.
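Summarized as commands, that procedure looks roughly like this (device names as in the example above; double-check which partition you point mdadm at, since the superblock really is gone afterwards):

  # umount /mnt/somewhere        # only if the MD device is still mounted (mount point is just an example)
  raidstop /dev/md0
  mdadm --zero-superblock /dev/hdc1
  reboot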

3b. raidutils
I don’t have any experience with raidutils (the Adaptec I2O compliant RAID controller management utilities), but I wanted to note them here in order not to forget they exist … :)
4. Partitions
Partition the second system-drive exactly the same as the first one.
We did:

  sda1:  8916.21 (Linux, aka "/")
  sda2:  254.99  (swap)

Set the partition type of all partitions you would like to be part of the RAID to "fd" using fdisk. This enables the kernel's autodetection function. In this case, it's just sda1 and sdb1 (sda being the first and sdb the second system drive).
Partition the 3rd and 4th HDD (which will hold the user-data) as you wish (of course the 4th must be partitioned the same as the 3rd) and set the partition-type to "fd".
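If the disks in each pair are of identical size, one convenient way to clone the partition layout is sfdisk (a sketch; otherwise simply partition the second disk by hand with fdisk as described above):

  sfdisk -d /dev/sda | sfdisk /dev/sdb    # copy the system disk's partition table to its mirror
  sfdisk -d /dev/sdc | sfdisk /dev/sdd    # same for the two data disks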

5. /etc/raidtab
  # raidtab config-file for aqua.subnet.at

  # 17 GB RAID-1 (mirroring) array — DATA
  raiddev /dev/md1
         raid-level      1
         nr-raid-disks   2
         nr-spare-disks  0
         chunk-size      4
         persistent-superblock 1
         device          /dev/sdc1
         raid-disk       0
         device          /dev/sdd1
         raid-disk       1

  # 8 GB RAID-1 (mirroring) array — BOOT/ROOT
  raiddev /dev/md0
         raid-level      1
         nr-raid-disks   2
         nr-spare-disks  0
         chunk-size      4
         persistent-superblock 1
         device          /dev/sdb1
         raid-disk       0
         device          /dev/sda1
         failed-disk     1
  #      ^^^^^^^^^^^^^^^^^
  #      assuming that you installed the system on this hdd!
  # from the HOWTO: Don’t put the failed-disk as the first disk in the raidtab,
  #                 that will give you problems with starting the RAID.

/dev/md1 should be configured just fine now: two previously unused partitions (/dev/sdc1 and /dev/sdd1) which will make up the RAID partition /dev/md1.
Make sure again that both partitions have the correct partition type set (fdisk -l /dev/sdc should give fd as the Id, same goes for /dev/sdd).
If not, set the type correctly, temporarily rename /etc/raidtab so the system cannot find it under that name when booting, and reboot the machine. Verify the settings, move the file back to /etc/raidtab and then continue with activating one of the two RAID partitions.
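For illustration, the relevant line of "fdisk -l /dev/sdc" should look roughly like this (the size values are illustrative, only the Id column matters here):

     Device Boot    Start       End    Blocks   Id  System
  /dev/sdc1             1      2231  17920476   fd  Linux raid autodetect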

We’ll now activate it:

  mkraid /dev/md1

From the HOWTO (it speaks of /dev/md0; in our case the same applies to /dev/md1):
"Check out the /proc/mdstat file. It should tell you that the /dev/md0 device has been started, that the mirror is being reconstructed, and an ETA of the completion of the reconstruction.

Reconstruction is done using idle I/O bandwidth. So, your system should still be fairly responsive, although your disk LEDs should be glowing nicely.

The reconstruction process is transparent, so you can actually use the device even though the mirror is currently under reconstruction.

Try formatting the device, while the reconstruction is running. It will work. Also you can mount it and use it while reconstruction is running. Of Course, if the wrong disk breaks while the reconstruction is running, you’re out of luck."

So the next step is to format the device:

  mke2fs /dev/md1

The data-disks’ RAID-1 is now ready to use. The RAID-device could already be mounted using a command similar to

  mkdir /mnt/data-raid
  mount -t ext2 /dev/md1 /mnt/data-raid
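If you want the data array mounted automatically at boot time, an /etc/fstab line along these lines would do (the mount point /mnt/data-raid is just the example directory created above):

  /dev/md1        /mnt/data-raid  ext2    defaults                     0 2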

/dev/md0 is going to hold the system and needs additional attention now …

6. ROOT fs on RAID
What you want to do now is to have the root-filesystem ("/") reside on a RAID-device, in our case /dev/md0.
This means: LILO will still be installed on just one normal harddrive (in our case in the MBR of /dev/sda), but when booting, your system will mount "/" from /dev/md0 instead of from /dev/sda1 or anything else. This way, if for example sdb crashes, you will still be able to boot your system as the RAID-partition /dev/md0 will be used as "/". (Of course, /dev/md0 will run in degraded mode, but that's what the RAID is for actually :) …)

So, let’s go:
"Create the RAID, and put a filesystem on it:"

  mkraid /dev/md0
  mke2fs /dev/md0

… and wait until the RAID-system has finished building up (check /proc/mdstat).
This creates the RAID1-device which will be used for our Linux-system "/". Just /dev/sdb1 is used at this moment, as sda1 (which already holds the current Linux-system) is marked "Failed" (see /etc/raidtab for details on the configuration).

"Try rebooting and see if the RAID comes up as it should."
(Check /proc/mdstat for the status of /dev/md0. It should be in degraded mode, but otherwise run smoothly. This means it should have [U_] or [_U] as its status.)
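For illustration, a degraded /dev/md0 with only sdb1 active would show up in /proc/mdstat roughly like this (the block count is taken from the outputs later in this document):

  md0 : active raid1 sdb1[0] 8707072 blocks [2/1] [U_]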

"Copy the system files, and reconfigure the system to use the RAID as root-device:"

  mkdir /mnt/newroot
  mount -t ext2 /dev/md0 /mnt/newroot
  cd /
  find . -xdev | cpio -pm /mnt/newroot

Being in the directory "/", the last command copies everything from the source-partition mounted to "/" to the destination-partition which is mounted to /mnt/newroot. Content from other partitions and filesystems which might be mounted is not copied to /mnt/newroot due to find’s option "-xdev".

Now edit /mnt/newroot/etc/fstab to use the correct device (the /dev/md0 root device) for the root filesystem "/".
If you don’t do that, you'll end up with LILO telling the kernel that "/" is to be found on /dev/md0 (see next paragraph) but fstab still pointing to the old partition (e.g. /dev/sda1). In this case /dev/md0 seems to be used, but "mount" will (falsely) tell you that the old partition was mounted. Make sure that lilo.conf's root=… and fstab's "/" point to the same device!
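The root line in /mnt/newroot/etc/fstab would then read something like this (mount options as you prefer; errors=remount-ro is the usual Debian default for "/"):

  /dev/md0        /               ext2    defaults,errors=remount-ro   0 1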

Add a new "image="-section to /mnt/newroot/etc/lilo.conf which says "root=/dev/md0" (HOWTO: "The boot device must still be a regular disk (non-RAID device), but the root device should point to your new RAID"), then run

  lilo -t -C /mnt/newroot/etc/lilo.conf

if it doesn’t complain, run

  lilo -C /mnt/newroot/etc/lilo.conf

and if it still doesn't complain about anything, reboot with this new LILO entry and see your shiny new system come up with LILO residing in sda's MBR and (that's what's new) the root fs "/" on the RAID-1 mirrored /dev/md0 :) .
(Mind: /dev/md0 is still in degraded mode…)

Now check & double check, that your root is really on /dev/md0 and that the new /etc/lilo.conf is correct and /etc/fstab is correct (both pointing to /dev/md0 as "/" or "root=")…
(you could use "df -h" and "mount" for this.)

From the HOWTO again:
"When your system successfully boots from the RAID, you can modify the raidtab file to include the previously failed-disk as a normal raid-disk. Now, < raidhotadd > the disk to your RAID."

This will integrate the partition you initially installed Debian to into the RAID device /dev/md0, making it complete and letting it run in fully mirrored (non-degraded) mode.
Make sure again that the new partition has the correct partition type set (fdisk -l /dev/sda should give "fd" as the Id).
If not, set the type correctly and reboot the machine. Verify the settings and then continue with integrating the new partition as described below.

  # raidtab config-file for aqua.subnet.at
  [...]

  # 8 GB RAID-1 (mirroring) array — BOOT/ROOT
  raiddev /dev/md0
  [...]
  device          /dev/sdb1
  raid-disk       0
  device          /dev/sda1
  raid-disk       1

And then:

  raidhotadd /dev/md0 /dev/sda1

And wait until the recovery process is finished. Use

  cat /proc/mdstat

to check the current state of the process.

IMPORTANT: Run LILO again!!! (Otherwise you won’t be able to boot your system but instead have a broken LILO and LOTS of work!)
And be careful not to boot any image that uses /dev/sda1 as "/" (this will cause fs inconsistencies)!! From this point on, your Linux system shall only be booted using /dev/md0 as "/", which requires a RAID-capable kernel as well as the LILO entry "root=/dev/md0".

… you should now have a system that can boot normally with a root fs ( "/" ) on a non-degraded RAID. :)

7. BOOT from RAID
You now have to reconfigure LILO again.
Until now, LILO just installs to the first system-disk (sda). If this one fails, there is no boot-manager left to boot your machine.
Therefore, you need to install LILO also on the second drive (sdb), which also forms part of our system RAID device /dev/md0, in order to be able to boot correctly even if sda has failed.
(Mind: If the first harddrive fails in a way that it just can't boot anymore (e.g. due to sector errors) but the BIOS still recognizes the device as present and therefore tries to load this drive's boot loader, you might need to physically remove the broken harddrive in order to be able to boot from the "second" drive. Installing LILO on both harddrives holding "/" and "/boot" as described here thus does not mean your computer will automatically reboot in every situation that might occur. Nevertheless, I think it's a good idea to install LILO on the second drive too, as it can save you a lot of time and work if the first disk crashes…)

You could do this by having two lilo.conf's and executing LILO twice whenever something changes. However, this doesn't seem very tolerant of sysadmin errors. ;)

Another and IMHO much better possibility is to have LILO do this automatically. Fortunately, Debian Potato's lilo-21.5beta is already able to deal with RAID-1 devices. (For Debian Woody's LILO 22.2-3 see below!)
The relevant option is "boot=". Set it to:

  boot=/dev/md0

This way LILO recognizes that /dev/md0 consists of two RAID-1 component devices and automatically installs itself on both of them :) .

Here is our /etc/lilo.conf with both "boot=" and "root=" set to "/dev/md0":

  # lilo.conf
  # boot and root on /dev/md0 (raid root partition)

  lba32
  boot=/dev/md0
  root=/dev/md0
  install=/boot/boot.b
  map=/boot/map
  delay=20
  prompt
  timeout=100
  vga=normal

  default=2.2.19

  image=/boot/vmlinuz-2.2.19
          label=2.2.19
          read-only

  # !! DONT BOOT THIS IMAGE !!
  #
  # this is the standard debian-2.2r3 kernel.
  # this will damage the RAID-1 partitions and corrupt the filesystem !!
  #
  #image=/boot/vmlinuz-2.2.19pre17
  #       label=debian
  #       read-only
  #
  # !! DONT BOOT THIS IMAGE !!

Executing LILO should give an output similar to the following:

  aqua:/home/max# lilo
  boot = /dev/sdb, map = /boot/map.0811
  Added 2.2.19 *
  boot = /dev/sda, map = /boot/map.0801
  Added 2.2.19 *

… this means, LILO was installed on both disks. Your system should now come up even if either one of the two system-disks failed.
Reboot your system now to see if everything works fine.

Debian Woody’s LILO 22.2-3
This new version of LILO needs an additional option for RAID-1 devices which has to be added to the general section of lilo.conf:

  raid-extra-boot=/dev/sda,/dev/sdb

From "man lilo.conf":
"Use of an explicit list of devices, forces writing of auxiliary boot records only on those devices enumerated, in addition to the boot record on the RAID1 device. Since the version 22 RAID1 codes will never automatically write a boot record on the MBR of device 0×80, if such a boot record is desired, this is the way to have it written."

8. Swap Space
If one of the two system-disks fails, the system still needs a swap-partition to work properly — what a surprise :) .
Generally, there are differing opinions on this topic. For example, you could again set up a RAID-1 device for the swap partitions.

The easier way is to create and activate two swap partitions and have the system use them both. If one of them isn't available (e.g. due to a disk crash), that's no problem because there's still the other one.

Edit /etc/fstab:

  /dev/sda2       none            swap    sw                           0 0
  /dev/sdb2       none            swap    sw                           0 0

Initialize the swap space:

  mkswap /dev/sdb2
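If you don't want to wait for the reboot, the new swap area can also be activated right away (the reboot below is still a good test that everything comes up on its own):

  swapon /dev/sdb2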

Reboot again.

  aqua:/# swapon -s
  Filename                        Type            Size    Used    Priority
  /dev/sda2                       partition       248996  0       -1
  /dev/sdb2                       partition       248996  0       -2

9. Testing
You're done now. The system's ready :) .
If all went well you can now test your RAID-functionality to whatever extent you want.
Play around disconnecting different disks and see how the system reacts.

Remember the functionality of raidhotadd, raidhotremove, raidstart, raidstop and the other commands and how to use them. (For a short German description of these commands check out http://www.fokus.gmd.de/linux/dlhp/aktuell/DE-Software-RAID-HOWTO-3.html.)
Especially mind that after re-connecting the disk and re-booting the machine, you’ll have to do something like

  raidhotadd /dev/md0 /dev/sda1

in order to have the new disk added to the running RAID-array again.

For details about the usage of especially raidhotadd, see 11. In Case of Disk-Failure below.
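If you'd rather simulate a failure in software than pull cables, newer raidtools2 versions also ship a raidsetfaulty command (check whether your version has it); a sketch of such a test run on the data array, watching /proc/mdstat between the steps:

  raidsetfaulty /dev/md1 /dev/sdd1    # mark the disk as failed
  raidhotremove /dev/md1 /dev/sdd1    # remove it from the array
  raidhotadd /dev/md1 /dev/sdd1       # add it back and watch the reconstruction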

10. Monitoring your RAID-devices
Here’s a small script which is sufficient for our two simple devices. Executed by cron it sends you a mail if there’s an error:

  #!/bin/sh
  # print every md line whose status is not [UU] -- cron mails any output
  /bin/cat /proc/mdstat |
  /bin/grep ^md |
  /bin/grep -v '\[UU\]'

Here is another one which works for both Linux 2.2 and Linux 2.4:

#!/bin/sh
#
# monitor raid-devices (every night at 4 am)
# cron automatically sends mail if there's an error
#
# http://homex.subnet.at/~max/
#

# checking for stuff like [_U] or [U_]:
/bin/grep -q '\[.*_.*\]' /proc/mdstat

if [ $? -eq 0 ]; then
 # found something like [_U]:

 mdstat=`/bin/cat /proc/mdstat`
 machine=`/bin/hostname`

 /bin/echo "WARNING for ${machine}: Some RAID arrays are running in degraded mode!"
 /bin/echo "Below is the content of /proc/mdstat:"
 /bin/echo
 /bin/echo "$mdstat"
fi

# checking for (F):
/bin/grep -q '(F)' /proc/mdstat

if [ $? -eq 0 ]; then
 # found (F):

 mdstat=`/bin/cat /proc/mdstat`
 machine=`/bin/hostname`

 /bin/echo "WARNING for ${machine}: Some disks seem to have failed!"
 /bin/echo "Below is the content of /proc/mdstat:"
 /bin/echo
 /bin/echo "$mdstat"
fi
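To have cron run such a script every night at 4 am and mail any output to root, an /etc/crontab entry like the following would do (the script path is just an example, adjust it to where you saved the script):

  0 4 * * *   root    /usr/local/sbin/check-raid.sh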

11. In Case of Disk-Failure …
Intro
For the exact worklog concerning our first disk-failure, check out this.

The text here is a detailed summary of our two disk failures — in both cases our software RAID enabled us to have the machine up and running despite the hardware problems :)

Here’s what /proc/mdstat said after our first crash (it was drive sdc):

  # cat /proc/mdstat
  Personalities : [raid1]
  read_ahead 1024 sectors
  md0 : active raid1 sdb1[0] sda1[1] 8707072 blocks [2/2] [UU]
  md1 : active raid1 sdd1[1] sdc1[0](F) 17920384 blocks [2/1] [_U]
  unused devices: (none)

Well, as I probably don't have to tell you:

The (F) means failure
[2/1] tells us that only 1 out of 2 disks of this RAID device is working normally
[_U] shows us "graphically" (*g*) that the first disk is down (which isn't funny, actually)
Summary
Edit /etc/raidtab and change the entry for the damaged disk from "raid-disk" to "failed-disk" (see below for details)
Shut down the computer
Remove the disk
Restart and see your RAID-1 coming up "normally" in degraded mode
As soon as you got your new disk: shut down, insert the disk and restart
fdisk it the same as the old one
Reboot the machine (!)
Edit /etc/raidtab and replace "failed-disk" with "raid-disk" again
# raidhotadd /dev/md1 /dev/sdc1
A more detailed Version
First I changed /etc/raidtab to:

  [...]
  # 17 GB RAID-1 (mirroring) array — DATA
  raiddev /dev/md1
          raid-level      1
          nr-raid-disks   2
          nr-spare-disks  0
          chunk-size      4
          persistent-superblock 1
          device          /dev/sdd1
          raid-disk       0
          device          /dev/sdc1
          failed-disk     1
  #       ^^^^^^^^^^^^^^^^^

After doing this, you could immediately shut down your machine, replace the damaged disk with a new one and reboot.
As we first had to get our hands on a new one, we rebooted the machine just after removing sdc.
As you may have noticed, the still-working "sdd" temporarily became the new "sdc". Nevertheless, the RAID system noticed this change and worked fine with just one drive for /dev/md1 (although the original "sdc" was marked failed at this time). The kernel said "md: device name has changed from sdd1 to sdc1 since last import!"
For more details check out the exact worklog from the link above.

The status of the system’s RAID devices looked like this:

  # cat /proc/mdstat
  Personalities : [raid1]
  read_ahead 1024 sectors
  md1 : active raid1 sdd1[1] 17920384 blocks [2/1] [_U]
  md0 : active raid1 sdb1[0] sda1[1] 8707072 blocks [2/2] [UU]
  unused devices: [none]

After finally inserting the new disk, the kernel again recognized the change in device names while booting ("sdc" became "sdd", new "sdc" available).

The next important step was to partition the disk exactly as the current "sdd" which held the data.
Just to be sure everything is initialized correctly with the disk I rebooted the machine again.
01-12-12: The reboot at this stage of work is essential! Otherwise your system may seem to be in perfect health, but all changes made with fdisk on the new disk will be gone after your next reboot (which will eventually come).
A great thanks to Carsten Grohmann for telling me about this. He probably saved a lot of someone's bits and bytes with this :) .

Having everything prepared, I reedited /etc/raidtab to say:

  [...]
  # 17 GB RAID-1 (mirroring) array — DATA
  raiddev /dev/md1
    raid-level      1
    nr-raid-disks   2
    nr-spare-disks  0
    chunk-size      4
    persistent-superblock 1
    device          /dev/sdd1
    raid-disk       0
    device          /dev/sdc1
    raid-disk       1
  # ^^^^^^^^^^^^^^^^^

Without formatting the new partition (i.e. I did not create a new ext2-filesystem on it explicitly):

  # raidhotadd /dev/md1 /dev/sdc1

  # cat /proc/mdstat
  Personalities : [raid1]
  read_ahead 1024 sectors
  md0 : active raid1 sdb1[0] sda1[1] 8707072 blocks [2/2] [UU]
  md1 : active raid1 sdc1[2] sdd1[1] 17920384 blocks [2/1] [_U] recovery=3% finish=10.2min
  unused devices: [none]

After the recovery-process was finished, I tested the RAID with either disk removed and raidhotadd’ing it afterwards … everything has worked fine again and still does. :)

Some Details concerning the second Disk-Failure
The second disk which crashed was "sdd". As you can see in /etc/raidtab above, sdd is the first disk in the array.
http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO-4.html#ss4.12 says:

"Don’t put the failed-disk as the first disk in the raidtab, that will give you problems with starting the RAID."

Well, I somehow had to mark the disk as failed, resulting in two options:

  [...]
  device          /dev/sdd1
  failed-disk     0
  device          /dev/sdc1
  raid-disk       1

or

  [...]
  device          /dev/sdc1
  raid-disk       0
  device          /dev/sdd1
  failed-disk     1

The first won’t comply with the hint of the official HOWTO, while with the second block one would end up with swapped RAID-disks :( .
For some reason and despite the warning from the HOWTO, I tried the first block… resulting in the following /etc/raidtab:

  # 17 GB RAID-1 (mirroring) array — DATA
  raiddev /dev/md1
    raid-level      1
    nr-raid-disks   2
    nr-spare-disks  0
    chunk-size      4
    persistent-superblock 1
    device          /dev/sdd1
    failed-disk     0
#   ^^^^^^^^^^^^^^^^^
    device          /dev/sdc1
    raid-disk       1

In short: Everything worked fine for me.
I rebooted the machine with the new disk built in, checked that everything came up fine, partitioned the new drive (and probably rebooted the computer), re-edited /etc/raidtab to include it as "raid-disk" again (instead of "failed-disk") and – again without formatting – just did a "raidhotadd /dev/md1 /dev/sdd1".

12. Trailer
We hope this document could help you in some way.
If you have some additions, corrections or other critics (pro or contra), feel free to contact us.

You may have fun with your system now. :)
