Bleeding on the Bleeding-Edge :(

The unthinkable happened today -- my hard drive crashed! The machine stopped booting and showed scary-looking messages I don't recall ever seeing before. I searched desperately among the noise for something familiar, even a plain kernel panic, but not even that was to be found.

It all started so innocently. I updated my Xgl portage tree, just like any other morning, and went about compiling the updated packages. Now, my machine is known to be quite flaky and to reboot at random, especially during disk- or processor-intensive tasks. Over time I've learned to live with it, blaming it on bad hardware (after all, Linux is stable, right?). But the random reboots were only a mere annoyance: thanks to ReiserFS, a journaling filesystem, I had never met with any sort of corruption -- that is, until today.

The gruelling path to recovery


I booted off a Taprobane live CD and went about fscking the root partition. The exact command was:


reiserfsck /dev/hda3


After checking for over an hour, it came up with a long report that basically said my partition was severely corrupted! It advised me to re-run the command with the --rebuild-tree option, but the man page had some alarming text about this option:


--rebuild-tree
This option rebuilds the entire filesystem tree using leaf nodes
found on the device. Normally you only need this option if the
reiserfsck --check reports "Running with --rebuild-tree is
required". You are strongly encouraged to make a backup copy of
the whole partition before attempting the --rebuild-tree option.
Once reiserfsck --rebuild-tree is started it must finish its
work (and you should not interrupt it), otherwise the filesystem
will be left in the unmountable state to avoid subsequent data
corruptions.


So instead I decided to use the safer --fix-fixable option, which only repairs the corruptions that can be fixed without rebuilding the tree. Well, that wasn't enough, and I did end up running --rebuild-tree after all.
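In hindsight, the order the man page suggests can be sketched as a small script. This is only a sketch under assumptions: it prints the commands instead of running them, /dev/hda3 is the root partition from above, and the backup path /mnt/backup/hda3.img is a made-up location that would need enough free space on another disk.

```shell
#!/bin/sh
# Sketch only: prints a staged repair plan, it does not touch any disk.
# PART defaults to the root partition from this post.
# IMAGE is a hypothetical backup location on a different disk.
PART="${PART:-/dev/hda3}"
IMAGE="${IMAGE:-/mnt/backup/hda3.img}"

repair_plan() {
    # 1. Raw copy of the partition first, as the man page urges
    #    before any --rebuild-tree run.
    echo "dd if=$PART of=$IMAGE bs=1M conv=noerror,sync"
    # 2. Check pass to see how bad things are.
    echo "reiserfsck --check $PART"
    # 3. Repair whatever can be fixed without rebuilding the tree.
    echo "reiserfsck --fix-fixable $PART"
    # 4. Last resort, only if the check says it is required.
    #    Once started, it must not be interrupted.
    echo "reiserfsck --rebuild-tree $PART"
}

repair_plan
```

Each printed line would be run by hand from the live CD, with the partition unmounted, stopping as soon as a check comes back clean.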

Luckily, my /home partition and the smaller /boot passed in good health after fscking them :) All seemed well now, so I rebooted, eagerly expecting things to work. It turned out I was still far from recovery.

At first I could see the kernel boot and load drivers. I immediately noticed that the sound driver was giving an error, but that didn't seem important enough to worry about. Then came the following message:


Your system seems to be missing critical device files
in /dev ! Although you may be running udev or devfs,
the root partition is missing these required files !
...


I did get a login prompt but, surprisingly, it never asked me for a password! I couldn't log in :(. So I rebooted again, this time with the Gentoo Xgl CD, and copied all the /dev files from the live CD into the /dev directory on the hard drive.

That's when I noticed the lost+found directory. I've rarely seen this directory on a reiserfs partition (it's quite common on an ext[23] partition). Surprise, surprise: there were three to four dozen files that had been recovered under their inode numbers -- pretty useless without the file names. Suddenly it occurred to me that I might have to reinstall Gentoo, something that could take a week or more.

Gentoo to the rescue


I rebooted, only to get the same message and the same login problem. I booted the live CD again and chrooted into the root partition. I first ran revdep-rebuild -p to check reverse dependencies, but that only reported xterm as broken. Trying to emerge xterm revealed another problem: the Gentoo package database was inconsistent. At least it told me the exact command to execute:


/usr/lib/portage/bin/fix-db.py


That produced a list of packages flagged as corrupted. So I unmerged (removed) some, re-merged others, and rebooted the system. Still no luck, so back to the live CD. Again I chrooted into the root (/) partition and started looking at the logs, which revealed more kernel drivers that had failed (acpi, for example). Looks like I'll need to recompile the kernel. But that still didn't explain the login problem. Looking through the log files, one finally revealed the cause:

# tail -n 100 /var/log/auth.log
agetty[2933]: tty1: can't exec /bin/login: No such file or directory


Sure enough, /bin/login was in fact missing. A bit of esearching hinted that pam-login was responsible for that file, so emerging it and rebooting fixed the problem. Phew, a sigh of relief as X started up into Enlightenment (e17).
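Another way to find out which package owns a missing file is to look through Portage's record of installed files. The following is a rough sketch, not what I actually ran: it assumes the installed-package database lives under /var/db/pkg with one CONTENTS file per package, and that the entry for the file survived the corruption. The owner_of function is a name made up for this example.

```shell
#!/bin/sh
# owner_of PATH [DBROOT]
# Prints category/package-version for every installed package whose
# CONTENTS file lists PATH. DBROOT defaults to /var/db/pkg (assumed layout).
owner_of() {
    path="$1"
    dbroot="${2:-/var/db/pkg}"
    for contents in "$dbroot"/*/*/CONTENTS; do
        # Skip the unexpanded glob when the database is empty or missing.
        [ -f "$contents" ] || continue
        # CONTENTS lines start with a type (obj, dir, sym) followed by the
        # path, so compare the second field exactly. Paths containing
        # spaces are not handled in this sketch.
        if awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$contents"; then
            pkgdir="${contents%/CONTENTS}"
            echo "${pkgdir#"$dbroot"/}"
        fi
    done
}
```

Run from inside the chroot, owner_of /bin/login would print the owning package if the database still knows about it. If nothing comes back, searching the portage tree as I did is the fallback.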

I still have to recompile the kernel and stay on the lookout for other (possibly) broken packages, but at least I've saved myself the days or weeks that starting from scratch would have taken!

I guess you can't live on the bleeding edge without bleeding a little (or a lot in this case) sometimes :)
