Overview
Under XFS, there is currently no way to repair a damaged filesystem while it is
mounted, which makes it rather hard to recover the root (/) partition without
some form of recovery disk. This script, plus some specific configuration
items, provides automatic and manual repair services for XFS filesystems.
Requirements
- /boot needs to be its own partition. I tend to do this for safety reasons
  anyway. I also mount it as Read-Only under normal operation, but that is not
  required.
- The current scripts assume devfs support in the recovery kernel, with devfs
  automatically being mounted.
- The current scripts assume /proc support in the recovery kernel.
- The current scripts assume tmpfs support in the recovery kernel.
- The current install script assumes lilo as the boot loader.
It turns out that my normal kernel has all of these settings in it already so I
can use a tested production kernel as a recovery kernel.
Background and how it works
The single script you can download here
builds all of the structures and files that it needs into the /boot directory
tree, other than the actual kernel. The kernel file name it assumes is
"linux.lastchance", and the kernel should be built before you run the install
process. The install process will grab the modules from /lib/modules, so they
will need to be there. Two new lilo.conf entries are created (if they are not
already there) that add the auto recovery and manual recovery options.
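As a rough illustration of what two such entries might look like, here is a sketch in lilo.conf syntax. Only the kernel file name and the "Autofix" label come from this page. The kernel path under /boot, the "Manualfix" label, the root device, and the init= paths are all assumptions made for the example, so the entries the install script actually writes may differ.

```
# Hypothetical auto-recovery entry.
# root= and append= values are placeholders, not the install script's output.
image=/boot/linux.lastchance
    label=Autofix
    root=/dev/hda1
    append="init=/recovery/auto"
    read-only

# Hypothetical manual-recovery entry; the label name is an assumption.
image=/boot/linux.lastchance
    label=Manualfix
    root=/dev/hda1
    append="init=/recovery/manual"
    read-only
```

After editing lilo.conf by hand, lilo has to be re-run so that the boot map picks up the new entries.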
The recovery process
Auto recovery starts by having lilo (or grub) boot the recovery kernel with
the correct options to start the recovery script. The script then sets up
enough of an environment for the XFS check and repair tools to run.
This includes adding swap space, since the XFS repair tools can take quite a bit
of VM space on large filesystems.
The recovery script then searches the system for partitions that hold XFS
filesystems. This is done by trying to mount them, which has the very important
side-effect of playing back any log entries that may be pending after an unclean
unmount. This builds a list of filesystems to check.
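The search described above could be sketched roughly as follows. This is not the actual recovery script: the function names, the PROBE_DIR mount point, and the MOUNT/UMOUNT override variables are hypothetical, and the sketch assumes that candidate devices are taken from /proc/partitions and that each filesystem is unmounted again right after a successful trial mount.

```shell
#!/bin/sh
# Sketch: find XFS partitions by trial-mounting block devices.
# Function names, PROBE_DIR, MOUNT and UMOUNT are hypothetical.

# Print the device names (4th column) from /proc/partitions-style input,
# skipping the header line and the blank line that follows it.
list_partitions() {
    awk 'NR > 2 && NF == 4 { print $4 }'
}

# Read device names on stdin and try to mount each one as XFS.
# A successful mount replays the log, so the filesystem is unmounted
# again right away and its device path is printed.
find_xfs() {
    probe_dir=${PROBE_DIR:-/tmp/probe}
    mkdir -p "$probe_dir"
    while read -r name; do
        dev="/dev/$name"
        if ${MOUNT:-mount} -t xfs "$dev" "$probe_dir" 2>/dev/null; then
            ${UMOUNT:-umount} "$probe_dir"
            echo "$dev"
        fi
    done
}
```

A call such as `list_partitions </proc/partitions | find_xfs` would then print one device path per XFS filesystem found. Devices that hold no XFS filesystem simply fail the trial mount and are skipped.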
Next, xfs_check is run on each of these filesystems. Each filesystem for which
xfs_check returns non-zero is added to the list of filesystems needing
repair.
Finally, in automatic mode, xfs_repair is run for each of the filesystems that
were on the repair list and then the system cleanly reboots.
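The check and repair passes can be sketched as two small shell functions. This is an illustration under assumptions, not the recovery script itself: the function names and the XFS_CHECK/XFS_REPAIR override variables are hypothetical, and the device list is assumed to arrive one path per line on standard input.

```shell
#!/bin/sh
# Sketch of the check-then-repair pass. Device paths are read from stdin.
# XFS_CHECK and XFS_REPAIR are hypothetical override hooks.

# Print every device for which the checker exits non-zero.
needs_repair() {
    while read -r dev; do
        # </dev/null keeps the checker from consuming the device list.
        if ! ${XFS_CHECK:-xfs_check} "$dev" </dev/null >/dev/null 2>&1; then
            echo "$dev"
        fi
    done
}

# Run the repair tool on each listed device; report failures but keep going.
repair_all() {
    status=0
    while read -r dev; do
        if ! ${XFS_REPAIR:-xfs_repair} "$dev" </dev/null; then
            echo "repair failed: $dev" >&2
            status=1
        fi
    done
    return $status
}
```

With a file of device paths (here a hypothetical xfs.list), `needs_repair <xfs.list | repair_all` runs both passes. On newer xfsprogs releases that no longer ship xfs_check, setting `XFS_CHECK="xfs_repair -n"` should give an equivalent read-only check, since that mode exits non-zero when corruption is detected.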
In manual mode, the current status is displayed and you are dropped into a shell
prompt. At this point you can try to effect repairs yourself by running
xfs_repair manually, or you can run the repair script, which will do
the auto-repair processing for you. The manual mode is mainly there for
those cases when things are worse than expected or if some other repair
operation is needed. In manual mode, /tmp is a tmpfs
filesystem and you can use it as storage and to make mount points to mount your
other filesystems in case some file needs repair or recovery.
Tricks
One of the tricks I use for remote (headless) servers is to use lilo
with the -R option to run the auto-repair on the server and then
reboot the server back into normal mode. I issue the command "lilo -R Autofix"
so that the next time the server reboots, it runs the auto-recovery
mode. Because -R applies to the next boot only, the reboot that follows the
repair returns to normal mode. After that, a "reboot" command, issued when you
are ready, is all that is needed.
Compatibility
I have been using this system on all of my Linux boxes for over a year now.
These range from simple IDE-only desktops to complex multi-drive, multi-partition
SCSI setups. I am sure there are configurations that do not work, but I have
yet to run into one. (For example, I am sure that
software RAID would require some additional work but I don't have such a set up
here to test.)
Some notes on security:
The physical console of the machine will be a security risk during the recovery
process. There are recovery shells started on separate virtual consoles
that would let someone do nasty things if they wanted to. Manual recovery
mode is even more insecure as it drops the user into a shell after having done
the system scan. So, physical security for the console would be needed to
keep the system secure.
Everything that is installed into the boot process comes from the core system
and the install script. The install script does not check (MD5-sum) the
parts as it does not know what the source files should MD5 as. Everything
put into /boot has root-only access, but security of the source files is up to
the system administrator.
During the recovery process, networking is not up and running. This
improves security and removes the chance of a network-induced failure/crash
causing even more damage.