TUX Linux XFS Recovery Boot Tool

Overview

Under XFS, there is currently no way to repair a damaged filesystem while it is mounted, which makes it rather hard to recover the root (/) partition without some form of recovery disk.  This script, plus a few specific configuration items, provides automatic and manual repair services for XFS filesystems.

Requirements

  1. /boot needs to be its own partition.  I tend to do this for safety reasons anyway.  I also mount it read-only under normal operation, but that is not required.
  2. The current scripts assume devfs support in the recovery kernel, with devfs being mounted automatically.
  3. The current scripts assume /proc support in the recovery kernel.
  4. The current scripts assume tmpfs support in the recovery kernel.
  5. The current install script assumes lilo as the boot loader.
It turns out that my normal kernel already has all of these settings, so I can use a tested production kernel as the recovery kernel.
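
One way to see whether a running kernel covers the filesystem items above is to look at /proc/filesystems.  The following is only a minimal sketch, not part of the install script; the function name missing_fs_support is made up for illustration, and it assumes /proc is mounted on the kernel being tested.

```shell
#!/bin/sh
# Hypothetical helper: print each requested filesystem type that is NOT
# listed in a /proc/filesystems-style file.
# Usage: missing_fs_support FILE fstype...
missing_fs_support() {
    fs_list=$1
    shift
    for fs in "$@"; do
        # Each line is "[nodev]<TAB>name"; compare the last field exactly.
        if ! awk -v want="$fs" \
            '$NF == want { found = 1 } END { exit !found }' "$fs_list"
        then
            echo "$fs"
        fi
    done
}

# Example (run on the candidate recovery kernel):
#   missing_fs_support /proc/filesystems devfs proc tmpfs xfs
```

An empty result means every listed type is registered with that kernel.  Types built as modules only show up once the module is loaded.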

Background and how it works

The single script you can download here builds all of the structures and files that it needs, other than the actual kernel, into the /boot directory tree.  The kernel file name it assumes is "linux.lastchance", and that kernel should be built before you run the install process.  The install process grabs the modules from /lib/modules, so they need to be there as well.  Two new lilo.conf entries are created (if they are not already there) to add the auto recovery and manual recovery options.
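
To give a rough idea of the "add the entry only if it is not already there" step, here is a sketch of how such an entry could be appended to lilo.conf.  It is not the install script's actual code.  The function name add_lilo_entry, the kernel path /boot/linux.lastchance and the append options are assumptions for illustration; only the file name linux.lastchance and the Autofix label come from this page.

```shell
#!/bin/sh
# Hypothetical helper: append an image entry for LABEL to a lilo.conf
# file unless an entry with that label already exists.
# Usage: add_lilo_entry CONF LABEL KERNEL APPEND_OPTIONS
add_lilo_entry() {
    conf=$1
    label=$2
    kernel=$3
    opts=$4
    # Skip if a "label=LABEL" line is already present, so that running
    # the install twice does not duplicate the entry.
    if grep -q "^[[:space:]]*label[[:space:]]*=[[:space:]]*$label[[:space:]]*\$" \
        "$conf" 2>/dev/null
    then
        return 0
    fi
    cat >> "$conf" <<EOF

image=$kernel
    label=$label
    read-only
    append="$opts"
EOF
}

# Example (the append value is a placeholder, not the real option):
#   add_lilo_entry /etc/lilo.conf Autofix /boot/linux.lastchance "init=/placeholder"
```

After lilo.conf changes, lilo itself has to be run again so that the new entries are written to the boot map.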

The recovery process

Auto recovery starts by having lilo (or grub) boot with the recovery kernel and the correct options to start the recovery script.  The script then sets up enough of an environment such that the XFS check and repair scripts can run.  This includes adding swap space since the XFS repair tools can take quite a bit of VM space on large filesystems.

The recovery script then searches the system for partitions that hold XFS filesystems.  This is done by trying to mount them, which has the very important side effect of playing back any log entries left over from an unclean unmount.  This builds the list of filesystems to check.
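
The trial-mount search can be sketched roughly as follows.  This is an illustration under assumptions, not the recovery script itself: the function name find_xfs_filesystems and the MOUNT/UMOUNT override variables are invented here, and the caller is assumed to supply the candidate device names and a scratch mount point.

```shell
#!/bin/sh
# The mount and umount commands can be overridden, e.g. for a dry run.
: "${MOUNT:=mount}" "${UMOUNT:=umount}"

# Hypothetical helper: print every candidate device that mounts as XFS.
# Usage: find_xfs_filesystems MOUNTPOINT device...
find_xfs_filesystems() {
    mnt=$1
    shift
    for dev in "$@"; do
        # A successful mount both identifies the filesystem as XFS and
        # replays any pending log entries.
        if $MOUNT -t xfs "$dev" "$mnt" 2>/dev/null; then
            # Unmount again so the filesystem can be checked offline.
            $UMOUNT "$mnt"
            echo "$dev"
        fi
    done
}

# Example (device names are placeholders):
#   find_xfs_filesystems /tmp/probe /dev/hda1 /dev/hda2
```

Forcing the type with -t xfs means partitions holding other filesystems simply fail to mount and are left out of the list.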

Next, xfs_check is run on each of these filesystems.  Each filesystem for which xfs_check returns non-zero is added to the list of filesystems needing repair.

Finally, in automatic mode, xfs_repair is run for each of the filesystems that were on the repair list and then the system cleanly reboots.
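
The check and repair passes described above boil down to two loops.  The sketch below is an approximation, not the real script: the function names needs_repair and repair_all and the XFS_CHECK/XFS_REPAIR override variables are made up for illustration, and the final reboot is left out.

```shell
#!/bin/sh
# The tools can be overridden, e.g. to try the logic without touching disks.
: "${XFS_CHECK:=xfs_check}" "${XFS_REPAIR:=xfs_repair}"

# Hypothetical helper: print each device whose check exits non-zero.
# Usage: needs_repair device...
needs_repair() {
    for dev in "$@"; do
        $XFS_CHECK "$dev" >/dev/null 2>&1 || echo "$dev"
    done
}

# Hypothetical helper: run the repair tool on each device.
# Returns non-zero if any single repair failed.
# Usage: repair_all device...
repair_all() {
    status=0
    for dev in "$@"; do
        $XFS_REPAIR "$dev" || status=1
    done
    return $status
}

# Example (device names are placeholders):
#   repair_all $(needs_repair /dev/hda1 /dev/hda2)
```

Keeping the check and repair passes separate means filesystems that pass the check are never touched by the repair tool.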

In manual mode, the current status is displayed and you are dropped into a shell prompt.  At this point you can try to effect repairs yourself by running xfs_repair manually, or you can run the repair script, which will do the auto-repair processing for you.  The manual mode is mainly there for those cases when things are worse than expected or when some other repair operation is needed.  In manual mode, /tmp is a tmpfs filesystem, and you can use it as storage and to make mount points for mounting your other filesystems in case some file needs repair or recovery.

Tricks

One of the tricks I use for remote servers (headless) is to use lilo with the -R option to run the auto-repair on the server and then to reboot the server back into normal mode.  I issue the command "lilo -R Autofix" such that the next time the server reboots, it runs the auto-recovery mode.  Then a "reboot" command when you are ready to reboot it is all that is needed.

Compatibility

I have been using this system on all of my Linux boxes for over a year now.  These range from simple IDE-only desktops to complex multi-drive, multi-partition SCSI setups.  I am sure there are configurations that do not work, but I have yet to run into one.  (For example, I am sure that software RAID would require some additional work, but I don't have such a setup here to test.)

Some notes on security:

The physical console of the machine will be a security risk during the recovery process.  There are recovery shells started on separate virtual consoles that would let someone do nasty things if they wanted to.  Manual recovery mode is even more insecure as it drops the user into a shell after having done the system scan.  So, physical security for the console would be needed to keep the system secure.

Everything that is installed into the boot process comes from the core system and the install script.  The install script does not check (MD5-sum) the parts, as it does not know what the source files should MD5 as.  Everything put into /boot has root-only access, but the security of the source files is up to the system administrator.

During the recovery process, networking is not up and running.  This improves security and removes the chance of a network-induced failure/crash causing even more damage.