IBM, reiserfs and my data

How my IBM hard disk started making strange sounds, how the reiser filesystem could no longer read the partition, and how I somehow managed to rescue the data with the dd_rescue tool. The whole article is more or less just me whimpering, so you might want to jump directly to the conclusion and its quintessence.

The Story

The events described here date back to spring 2003, so I am not quite sure whether I recollect everything correctly.

In February 2003, the 60 GB hard disk in my Slackware 8.1 box, one from the infamous IBM DTLA series, started to make strange sounds. After halting the system and rebooting, I was unable to mount the reiser filesystem. Great. There were about 30 GB of data on that partition.

I can't remember the exact error message from mount, but basically it could not find a reiserfs on the partition. So I started to mess around a little with the reiserfsprogs. I went wild and called reiserfsck with --check, --fix-fixable and --rebuild-tree, all without success. As far as I understand the error messages in retrospect, reiserfsck tried to rebuild the filesystem right there, on the damaged hard disk blocks. The situation might not have been that serious, but unfortunately exactly the blocks holding the superblock were broken.
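For the curious: mount and the reiserfs tools recognize the filesystem by a magic string in its superblock. Below is a small Python sketch that looks for it in a partition or image file. The offsets are my assumption (superblock 64 KiB into the partition, magic string 52 bytes into the superblock, as I understand the reiserfs 3.x on-disk format), so check them against the reiserfsprogs sources; `find_reiserfs_magic` is a hypothetical helper, not part of reiserfsprogs, and reading a real device needs root. If exactly those sectors are unreadable, as on my disk, there is simply nothing to find.

```python
import tempfile

# Assumed layout of the reiserfs 3.x on-disk format (verify before relying on it):
SUPERBLOCK_OFFSET = 64 * 1024   # superblock starts 64 KiB into the partition
MAGIC_OFFSET = 52               # magic string sits 52 bytes into the superblock
MAGICS = (b"ReIsErFs", b"ReIsEr2Fs", b"ReIsEr3Fs")  # assumed 3.x magic strings


def find_reiserfs_magic(path):
    """Return the reiserfs magic string at the expected offset, or None.

    Only the 64 KiB superblock location is checked.
    """
    try:
        with open(path, "rb") as f:
            f.seek(SUPERBLOCK_OFFSET + MAGIC_OFFSET)
            raw = f.read(10)
    except OSError:  # e.g. an I/O error because those sectors are unreadable
        return None
    for magic in MAGICS:
        if raw.startswith(magic):
            return magic.decode("ascii")
    return None


# Usage with two synthetic "images" (on Linux a NamedTemporaryFile can be reopened by name).
with tempfile.NamedTemporaryFile() as img:
    img.write(bytes(SUPERBLOCK_OFFSET + MAGIC_OFFSET) + b"ReIsEr2Fs\x00")
    img.flush()
    found = find_reiserfs_magic(img.name)      # "ReIsEr2Fs"

with tempfile.NamedTemporaryFile() as blank:
    blank.write(bytes(SUPERBLOCK_OFFSET + 1024))  # all zeros: no magic anywhere
    blank.flush()
    missing = find_reiserfs_magic(blank.name)  # None
```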

So I went to Google, read all kinds of arguments for and against reiserfs and ext2/3, and subscribed to the reiserfs mailing list, reading it for two months or so. I got myself the latest version of reiserfsprogs (3.6.7), still without success.

I had been using reiserfs ever since I started out with SuSE (yes, I am from Germany), which has been pushing reiserfs for a long time. I never had any complaints about it, and I liked it more than ext2. Not that the choice really matters much on my desktop machine.

I was kind of wrapped up in other business back then and did not totally depend on the data on that partition, so I did not press it. But finally I found a report about a situation similar to mine. I think it was in some SuSE support forum.

The Solution

The proposed solution was to mirror the broken partition to another disk with dd_rescue. This small but fine piece of software works more or less like dd, but it does not stop at broken blocks. It keeps retrying a failing block and, if it still cannot read it, moves on to the next one. It copies the raw bytes of a partition (e.g. /dev/hdXX) one after another, without caring about the filesystem on it.
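To make the idea concrete, here is a minimal Python sketch of an error-tolerant block copy in the spirit of dd_rescue. It is not dd_rescue's code and uses none of its options: `rescue_copy`, `read_block` and `device_reader` are hypothetical names, the 512-byte block size and three retries are assumed defaults, and writing zeros for unreadable blocks is my simplification. The important part is that input and output positions stay in step, so everything readable ends up at the same offset in the image and a filesystem checker can make sense of it later.

```python
import io
import os


def rescue_copy(read_block, size, out, block_size=512, retries=3):
    """Copy `size` bytes block by block without giving up on read errors.

    read_block(offset, length) must return `length` bytes or raise OSError.
    A block that still fails after `retries` extra attempts is written as
    zeros (a simplification) and the copy moves on to the next block.
    """
    stats = {"errs": 0, "errxfer": 0, "succxfer": 0}
    pos = 0
    while pos < size:
        length = min(block_size, size - pos)
        data = None
        for _attempt in range(1 + retries):
            try:
                data = read_block(pos, length)
                break
            except OSError:
                pass  # try the same block again
        if data is None:
            stats["errs"] += 1
            stats["errxfer"] += length
            data = bytes(length)
        else:
            stats["succxfer"] += length
        out.seek(pos)  # keep input and output positions in step
        out.write(data)
        pos += length
    return stats


def device_reader(fd):
    """Adapter for a device opened with os.open(); os.pread raises OSError on I/O errors."""
    return lambda offset, length: os.pread(fd, length, offset)


# Usage: an in-memory "disk" of four 512-byte blocks whose third block never reads.
source = bytes(range(256)) * 8


def flaky(offset, length):
    if offset == 1024:
        raise OSError(5, "Input/output error")
    return source[offset:offset + length]


out = io.BytesIO()
stats = rescue_copy(flaky, len(source), out)
# stats == {"errs": 1, "errxfer": 512, "succxfer": 1536}

# A block that fails twice and then reads fine is rescued by the retries.
calls = []


def fails_twice(offset, length):
    calls.append(offset)
    if offset == 0 and calls.count(0) <= 2:
        raise OSError(5, "Input/output error")
    return source[offset:offset + length]


out2 = io.BytesIO()
stats2 = rescue_copy(fails_twice, len(source), out2)
```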

By then, a new hard disk had arrived. I went with Western Digital this time and got myself an 80 GB one. Time to start dd_rescue. It has all kinds of parameters; I decided to give the program some time and set the number of retries on broken blocks pretty high. It ran for several days right next to my bed, making strange noises. Here is the summary (there were only 30 GB of data on that partition):

Summary for /dev/hdc1 -> /data/daten/backup:
dd_rescue: (info): ipos:  32387195.5k, opos:  32387195.5k, xferd:  27655710.5k
             -     errs:  18009, errxfer:      9004.5k, succxfer:  27646706.0k
             +curr.rate:        0kB/s, avg.rate:       56kB/s, avg.load: -0.0%
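The numbers in that summary add up, which is a nice sanity check. A few lines of Python (the regular expression and variable names are mine) show that the successfully copied and the failed amounts sum to the total transferred, that each of the 18009 errors accounts for 0.5k (presumably one 512-byte sector), and that roughly 0.03% of the transferred data was lost. Dividing the transferred amount by the rounded average rate of 56 kB/s gives about 5.7 days of running time, which fits "several days", assuming the average is taken over the whole run.

```python
import re

# The summary as printed above (copied verbatim).
SUMMARY = """\
dd_rescue: (info): ipos:  32387195.5k, opos:  32387195.5k, xferd:  27655710.5k
             -     errs:  18009, errxfer:      9004.5k, succxfer:  27646706.0k
             +curr.rate:        0kB/s, avg.rate:       56kB/s, avg.load: -0.0%
"""


def parse_summary(text):
    """Collect every `name: number` pair; units such as k and kB/s are dropped."""
    return {name: float(value)
            for name, value in re.findall(r"([\w.]+):\s+(-?[\d.]+)", text)}


s = parse_summary(SUMMARY)

# Every transferred kB was either read successfully or counted as an error.
balanced = s["succxfer"] + s["errxfer"] == s["xferd"]   # True
per_error_k = s["errxfer"] / s["errs"]                  # 0.5, i.e. 512 bytes
lost_percent = 100 * s["errxfer"] / s["xferd"]          # about 0.0326 %
# Rough running time, assuming avg.rate is averaged over the whole run.
elapsed_days = s["xferd"] / s["avg.rate"] / 86400       # about 5.7 days
```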

Now I had an image of my partition, which I could attach as a loopback device (/dev/loop0). I let reiserfsck loose on it again, and it happily rebuilt trees and whatnot.

root@hastur:~# reiserfsck --rebuild-tree --logfile rebuild.log /dev/loop0

<-------------reiserfsck, 2003------------->
reiserfsprogs 3.6.7

.....  cut out some comments from output .......

Will rebuild the filesystem (/dev/loop0) tree
Will put log info to 'rebuild.log'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal..
0 transactions replayed
###########
reiserfsck --rebuild-tree started at Tue May 27 08:21:11 2003
###########

Pass 0:
Loading on-disk bitmap .. ok, 6514462 blocks marked used
Skipping 8454 blocks (super block, journal, bitmaps) 6506008 blocks will be read
0%....20%....40%....60%....80%....100%                       left 0, 10376 /sec
Selected hash ("r5") does not match to the hash set in the super block (not set).
        "r5" hash is selected
Flushing..finished
        Read blocks (but not data blocks) 6506008
                Leaves among those 17546
                        - corrected leaves 12
                pointers in indirect items to wrong area 11645 (zeroed)
                Objectids found 38907

Pass 1 (will try to insert 17546 leaves):
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100%                        left 0, 2924 /sec
Flushing..finished
        17546 leaves read
                17538 inserted
                8 not inserted
        non-unique pointers in indirect items (zeroed) 1289

Pass 2:
0%....20%....40%....60%....80%....100%                           left 0, 0 /sec
Flushing..finished
        Leaves inserted item by item 8
Pass 3 (semantic):
Flushing..finished
        Files found: 0
        Directories found: 2
Pass 3a (looking for lost dir/files):
Looking for lost directories:
Looking for lost files:
Flushing..finishede 0, 0 /sec
        Objects without names 2765
        Empty lost dirs removed 37248
        Dirs linked to /lost+found: 161
                Dirs without stat data found 2
        Files linked to /lost+found 1015
Pass 4 - finisheddone 13118, 1874 /sec
        Deleted unreachable items 13
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Tue May 27 08:31:59 2003
###########

That looked like positive feedback. My mood was rising!

Casualties

Data

As you can see in the reiserfsck output, there were losses of different kinds. Apparently some files could be recovered, while others could not be put back in the right place with the right name.

I never missed whatever made up the 13 deleted unreachable items. I don't know what they were.

Then I entered the lost+found directory. What a mess! 2765 objects had lost their names and got some number instead. Great. Well, better than losing them altogether. Below are the commands I used to list the lost+found directory with ls and file. In the end it boiled down to using Konqueror's preview function and renaming the files one by one by hand. Most of that is done now. I could not identify some MP3s I had downloaded from independent electronic artists. Oh well.

root@hastur:~# losetup -d /dev/loop0

root@hastur:~# mount -t reiserfs -o rw,loop=/dev/loop0 /data/daten/backup /mnt/loop

root@hastur:~# cd /mnt/loop
root@hastur:/mnt/loop# ls
lost+found/

root@hastur:/mnt/loop# cd lost+found/

root@hastur:/mnt/loop/lost+found# ls -l > ~/listing_lost+found 

root@hastur:/mnt/loop/lost+found# file * > ~/file_lost+found

root@hastur:/mnt/loop/lost+found# ls -l | wc -l
   1175
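The `file *` listing did most of the identification work for me. As a rough illustration of what it does at its core, here is a Python sketch that reads the first bytes of each nameless file and proposes an extension from a few well-known signatures. The signature table is a tiny subset picked by me, `suggest_names` only suggests names rather than renaming anything, and the file names in the example are made up. It would not have helped with my unidentified MP3s, though: an extension does not tell you the artist.

```python
import tempfile
from pathlib import Path

# A handful of well-known leading byte signatures; file(1) knows many more.
SIGNATURES = [
    (b"ID3", ".mp3"),                # MP3 starting with an ID3v2 tag
    (b"\xff\xfb", ".mp3"),           # MPEG-1 Layer III frame header, no tag
    (b"\xff\xd8\xff", ".jpg"),       # JPEG
    (b"\x89PNG\r\n\x1a\n", ".png"),  # PNG
    (b"%PDF-", ".pdf"),              # PDF
    (b"PK\x03\x04", ".zip"),         # ZIP (also OpenOffice documents, jars, ...)
    (b"\x1f\x8b", ".gz"),            # gzip
]


def guess_extension(path):
    """Return an extension based on the first bytes of the file, or None."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, ext in SIGNATURES:
        if head.startswith(magic):
            return ext
    return None


def suggest_names(directory):
    """Map each extension-less regular file to a suggested name; renames nothing."""
    suggestions = {}
    for p in sorted(Path(directory).iterdir()):
        if p.is_file() and not p.suffix:
            ext = guess_extension(p)
            if ext:
                suggestions[p.name] = p.name + ext
    return suggestions


# Usage on a throwaway directory with made-up numeric names.
with tempfile.TemporaryDirectory() as d:
    Path(d, "1021_4711").write_bytes(b"ID3\x03\x00" + bytes(20))
    Path(d, "1021_4712").write_bytes(b"%PDF-1.3\n")
    Path(d, "1021_4713").write_bytes(b"just some text")
    result = suggest_names(d)
# result == {"1021_4711": "1021_4711.mp3", "1021_4712": "1021_4712.pdf"}
```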

The rest of the files were okay, so I basically did not lose any data.

Hardware

IBM gave up its desktop hard disk line and merged it with Hitachi's. I still had about a year and a half left on the three-year warranty, so I could send the drive in. After running some diagnostic software from an IBM DOS boot disk and giving them the resulting error code, I got another identical 60 GB drive from Mitsumi.

Summing-up

No data lost. Hardware replaced. I am using ext3 for a change, even though I don't blame reiserfs for the affair; I just want to see whether it makes any difference on my desktop system. So far, the answer is no.
My next hard disk was from Western Digital. Can somebody tell me why I wrote this article?

Conclusion

Don't trust hard disks for data storage. Duh.

Yes, this is no new finding. No, I would not have lost very important data on that partition; nevertheless, it would have hurt. To everybody out there who has heard about disk failures but never experienced one: ask yourself what would happen if one of your disks crashed, and then get yourself some sort of backup scheme.

Quintessence of the solution I used:

The hard disk degraded and the reiserfs partition could not be mounted any more; reiserfsck would not work on the broken disk.
Use dd_rescue to copy the data to an image on another disk.
Attach the image as a loop device and run reiserfsck on it, on the good disk.
Be happy, be busy with lost file names, and start thinking about a sensible backup solution.

Recommendation

Anything is better than losing data, even 5¼" 360 KB floppies or printouts. What I am using right now is mirrordir and a second hard disk in my box. The name mirrordir speaks for itself: it mirrors complete directory trees. It is a powerful *nix command-line utility with a lot of options to control its behavior.

So right now I secure my data by mirroring the relevant directories to a second disk. It is a desktop system that is not running 24/7, so I began by adding some lines to my system halt scripts. But even though mirrordir is very efficient, since it only copies changed data, mirroring on every shutdown was too much: sometimes it is nice to have two different versions around. Now I have a small script that mounts the backup disk, mirrors some directories and then unmounts the backup disk. I run the script from time to time: not so often when nothing changes, but I don't hesitate to run it either, since it is no hassle at all while it runs in the background. And when I go to a LAN party over the weekend, I just make a backup beforehand.
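The core of such a mirror run is simple enough to sketch. The Python below is neither mirrordir nor rsync; it is a minimal stand-in that, assuming a changed size or modification time means a changed file, copies new and changed files and deletes whatever has disappeared from the source. `mirror` and `needs_copy` are hypothetical names, symlinks and special files are not handled, and mounting and unmounting the backup disk (which my script does around the mirror step) is left out.

```python
import os
import shutil
import tempfile
from pathlib import Path


def needs_copy(src: Path, dst: Path) -> bool:
    """Treat a file as changed when size or modification time differ (a cheap heuristic)."""
    if not dst.exists():
        return True
    s, d = src.stat(), dst.stat()
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def mirror(src_root, dst_root):
    """Make dst_root a copy of src_root; return (copied, deleted) relative paths."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied, deleted = [], []
    # 1. Copy new and changed files.
    for dirpath, _dirs, files in os.walk(src_root):
        rel_dir = Path(dirpath).relative_to(src_root)
        (dst_root / rel_dir).mkdir(parents=True, exist_ok=True)
        for name in files:
            src, dst = Path(dirpath) / name, dst_root / rel_dir / name
            if needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also carries over the mtime
                copied.append(str(rel_dir / name))
    # 2. Delete what no longer exists in the source (bottom-up, so dirs are empty first).
    for dirpath, dirs, files in os.walk(dst_root, topdown=False):
        rel_dir = Path(dirpath).relative_to(dst_root)
        for name in files:
            if not (src_root / rel_dir / name).is_file():
                os.remove(Path(dirpath) / name)
                deleted.append(str(rel_dir / name))
        for name in dirs:
            if not (src_root / rel_dir / name).is_dir():
                os.rmdir(Path(dirpath) / name)
    return sorted(copied), sorted(deleted)


# Usage on two throwaway directories standing in for the data and backup disks.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "docs").mkdir()
    Path(src, "docs", "a.txt").write_text("hello")
    Path(src, "b.txt").write_text("world")
    first = mirror(src, dst)    # everything is new
    second = mirror(src, dst)   # nothing changed, nothing copied
    Path(src, "b.txt").unlink()
    third = mirror(src, dst)    # b.txt vanished from the source
```

Comparing size and modification time instead of contents keeps repeated runs cheap, at the price of missing changes that keep both unchanged.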

Yes, this is no off-site backup, and it might not protect my data from evil system intruders. (Maybe I could add encryption to the backup disk for that?) Basically, it protects me from disk failures and from accidentally deleting data, provided I notice early enough. By the way: rsync should be able to do the same thing.

So, kids, remember: in the information age, you've gotta protect your data!
