Of how my IBM harddisk made strange sounds, the reiser filesystem could not read the partition any more and I was somehow able to rescue the data with the dd_rescue tool. The whole article is more or less just me whimpering, so you might want to jump directly to the conclusion with the quintessence.
The described events date back to Spring 2003, so I am not quite sure whether I recollect everything described here correctly.
In February 2003 my 60GB harddisk, from the infamous IBM DTTA series, in my Slackware 8.1 box, started to make strange sounds. After halting the system and rebooting again, I was unable to mount the reiser filesystem. Great. There was about 30gig of data on that partition.
I can't remember the exact mount error message, but basically it could not identify a reiserfs on the partition. So I started to mess around a little bit with the reiserfsprogs. I went wild and called reiserfsck with --check, --fix-fixable or --rebuild-tree, without success. As far as I understand the error messages in retrospect, reiserfsck tried to rebuild the filesystem, but did so right there, on the damaged harddisk blocks. The situation might not have been that serious, but unfortunately the very harddisk blocks where the superblock was located were broken.
So I went to Google, read all kinds of arguments for and against reiserfs or ext2/3, and subscribed to the reiserfs mailing list, reading it for two months or so. I got myself the latest version of the reiserfsprogs (3.6.7), without success.
I have been using reiserfs ever since I started out with SuSE (yes, I am from Germany), as they have been pushing reiserfs for a long time. I never had any complaints about it. I liked it more than ext2. Not that it is really crucial on my desktop machine.
I was kind of wrapped up in other business back then and not totally dependent on the data on the partition, so I did not press it. But finally I found a report about a situation similar to the one I was in. I think it was in some SuSE support forum.
The proposed solution was to mirror the broken partition to another disk with the program dd_rescue. This small but fine piece of software works more or less like dd, but does not stop on broken blocks on a disk. It keeps on trying, or finally goes on to the next block after it fails with a block. It basically copies the bits and bytes of a partition (e.g. /dev/hdXX) one by one, not caring about the existing filesystem.
By then, a new harddisk had arrived. I tried to go with Western Digital this time and got myself one with 80gig. Time to start dd_rescue. There are all kinds of parameters. I decided to give the prog some time and set the retry number on broken blocks to something pretty high. It ran for several days, right next to my bed, making strange noises. Here is the summary (there were only 30gig on that partition):
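To give a feeling for the "keep going on errors" idea, here is a minimal sketch using plain dd rather than dd_rescue itself (I don't want to guess at dd_rescue's options from memory). The scratch file and its paths are made up for illustration; on a real rescue the input would be the broken partition. With conv=noerror,sync, dd continues after a read error and pads the failed block with zeros, so everything after it keeps its offset in the copy. Unlike dd_rescue, it does not retry a block.

```shell
# Stand-in for a partition: a 1 MiB scratch file (hypothetical paths).
workdir=$(mktemp -d)
src="$workdir/source.img"
dst="$workdir/copy.img"
dd if=/dev/urandom of="$src" bs=1024 count=1024 2>/dev/null

# conv=noerror: continue after a read error instead of aborting.
# conv=sync:    pad short or failed reads with zeros up to bs, so data
#               behind a bad block stays at the same offset in the copy.
# bs=512:       small blocks, so one bad sector costs only 512 bytes.
dd if="$src" of="$dst" bs=512 conv=noerror,sync 2>/dev/null

# On a healthy source the copy is bit-for-bit identical.
cmp "$src" "$dst" && echo "copy is identical"

# Remove the scratch directory when done: rm -rf "$workdir"
```

The important design point is the same as with dd_rescue: the copy goes to a different, healthy disk, and any repair attempts happen on the copy only.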
Summary for /dev/hdc1 -> /data/daten/backup:
dd_rescue: (info): ipos: 32387195.5k, opos: 32387195.5k, xferd: 27655710.5k
- errs: 18009, errxfer: 9004.5k, succxfer: 27646706.0k
+curr.rate: 0kB/s, avg.rate: 56kB/s, avg.load: -0.0%
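The numbers in that summary can be cross-checked with a bit of arithmetic. The following sketch only uses the figures printed above; the variable names are mine. It suggests that every error cost 0.5k, which would match one 512-byte sector per error, assuming dd_rescue's "k" means 1024 bytes.

```shell
# Figures copied from the dd_rescue summary (all in "k").
xferd=27655710.5
succxfer=27646706.0
errxfer=9004.5
errs=18009

# Successful plus failed transfer should add up to the total.
sum=$(awk -v s="$succxfer" -v e="$errxfer" 'BEGIN { printf "%.1f", s + e }')
# Amount of data given up per error.
per_err=$(awk -v e="$errxfer" -v n="$errs" 'BEGIN { printf "%.1f", e / n }')
# Share of the transferred data that could not be read, in percent.
lost_pct=$(awk -v e="$errxfer" -v x="$xferd" 'BEGIN { printf "%.4f", 100 * e / x }')

echo "succxfer + errxfer = ${sum}k (xferd: ${xferd}k)"
echo "lost per error:      ${per_err}k"
echo "unreadable share:    ${lost_pct}%"
```

So roughly 9 MB out of about 27 GB could not be read, a tiny share of the partition.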
Now I had an image of my partition which I could mount as a loopback device. I let reiserfsck loose again, this time on the image instead of the broken disk. And it was happily rebuilding trees and whatnot.
root@hastur:~# reiserfsck --rebuild-tree --logfile rebuild.log /dev/loop0
<-------------reiserfsck, 2003------------->
reiserfsprogs 3.6.7
..... cut out some comments from output .......
Will rebuild the filesystem (/dev/loop0) tree
Will put log info to 'rebuild.log'
Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal..
0 transactions replayed
###########
reiserfsck --rebuild-tree started at Tue May 27 08:21:11 2003
###########
Pass 0:
Loading on-disk bitmap .. ok, 6514462 blocks marked used
Skipping 8454 blocks (super block, journal, bitmaps) 6506008 blocks will be read
0%....20%....40%....60%....80%....100% left 0, 10376 /sec
Selected hash ("r5") does not match to the hash set in the super block (not set).
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 6506008
Leaves among those 17546
- corrected leaves 12
pointers in indirect items to wrong area 11645 (zeroed)
Objectids found 38907
Pass 1 (will try to insert 17546 leaves):
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100% left 0, 2924 /sec
Flushing..finished
17546 leaves read
17538 inserted
8 not inserted
non-unique pointers in indirect items (zeroed) 1289
Pass 2:
0%....20%....40%....60%....80%....100% left 0, 0 /sec
Flushing..finished
Leaves inserted item by item 8
Pass 3 (semantic):
Flushing..finished
Files found: 0
Directories found: 2
Pass 3a (looking for lost dir/files):
Looking for lost directories:
Looking for lost files:
Flushing..finishede 0, 0 /sec
Objects without names 2765
Empty lost dirs removed 37248
Dirs linked to /lost+found: 161
Dirs without stat data found 2
Files linked to /lost+found 1015
Pass 4 - finisheddone 13118, 1874 /sec
Deleted unreachable items 13
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Tue May 27 08:31:59 2003
###########
That seemed to be positive feedback. My mood was rising!
As you saw in the output from reiserfsck, there were some losses of different kinds. Apparently some files could be recreated, while others could not be correctly placed and named.
I never missed the 13 seemingly deleted files. I don't know what they were.
Then I entered the lost+found directory. What a mess! 2765 files got their filenames replaced by some number. Great. Well, better than losing them altogether. There is a listing of the file command for the files below. It boiled down to me using the preview functionality of Konqueror and naming the files one by one by hand. Most of the stuff is done. I could not identify some mp3s I downloaded from some independent electronic artists. Oh well.
root@hastur:~# losetup -d /dev/loop0
root@hastur:~# mount -t reiserfs -o rw,loop=/dev/loop0 /data/daten/backup /mnt/loop
root@hastur:~# cd /mnt/loop
root@hastur:/mnt/loop# ls
lost+found/
root@hastur:/mnt/loop# cd lost+found/
root@hastur:/mnt/loop/lost+found# ls -l > ~/listing_lost+found
root@hastur:/mnt/loop/lost+found# file * > ~/file_lost+found
root@hastur:/mnt/loop/lost+found# ls -l | wc -l
1175
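In hindsight, the output of file could have done some of the presorting for me. The following is only a sketch of that idea, not what I did back then: the function name sort_by_type and the directory layout are made up, and it assumes a file command that understands --mime-type. It copies each nameless file into a subdirectory named after its detected type, so that all the mp3s or all the images end up next to each other before the renaming by hand starts.

```shell
# sort_by_type SRC DST (hypothetical helper)
# Copies every regular file in SRC into DST/<type>/, where <type> is
# the MIME type reported by file(1), e.g. "audio/mpeg" -> "audio_mpeg".
# The originals in SRC are left untouched.
sort_by_type() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        [ -f "$f" ] || continue              # skip directories etc.
        kind=$(file -b --mime-type "$f" | tr '/' '_')
        mkdir -p "$dst/$kind"
        cp -p "$f" "$dst/$kind/"             # -p keeps the timestamps
    done
}
```

It would be called as something like sort_by_type /mnt/loop/lost+found /data/sorted, where the second path is just an example. The timestamps are kept on purpose, as they can be a hint when guessing what a file once was.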
The rest of the files were okay, so I basically did not lose any data.
IBM gave up its desktop harddisk line and merged it with Hitachi's. I still had about one and a half years left of the three-year warranty and could send the disk in. I got another identical 60gig drive from Mitsumi after running some analysis software from some IBM-DOS bootdisk and giving them some error code.
No data lost. Hardware got replaced. I am using ext3 for a change - even though I don't blame reiserfs for the affair. I just want to see if it makes any difference for my desktop system. The answer so far is: no.
My next harddisk was from Western Digital. Can somebody tell me why I wrote this article?
Don't trust harddisks for data storage. Duh.
Yes, this is no new finding. No, I would not have lost very important data on that partition; nevertheless, it would have hurt. To everybody out there who has heard about disk failures but never experienced one: ask yourself what would happen if one of your disks crashed, and then get yourself some sort of backup scheme.
Anything is better than losing data, even 5¼" 360kb disks or printouts. What I am using right now is mirrordir and another harddisk in my box. The name mirrordir speaks for itself: it mirrors complete directory trees. It is a powerful *nix command-line utility with a lot of optional parameters to affect its behavior.
So right now I am securing my data by mirroring the relevant directories to a second disk. It is a desktop system which is not up and running 24/7, so I began by adding some lines to my system halt scripts. But even though mirrordir is very efficient, as it only mirrors changed data, mirroring on every shutdown was too much. Sometimes it is nice to have two different versions. Right now, I have a small script which mounts the backup disk, mirrors some directories and then unmounts the backup disk. I call the script from time to time. Not so often when nothing changes, but I don't hesitate to call it either, as it is no hassle at all when it is running in the background. Or when I go to a LAN party over the weekend, I just make a backup beforehand.
Yes, this is no off-site backup. It might not protect my data from evil system intruders. I could add encryption to my backup disk for that? It basically protects me from disk failures and the unintended deletion of some data, assuming I notice it early enough. By the way: rsync should be able to do the same thing.
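For the curious, the mount, mirror, unmount routine could look roughly like the sketch below. It is not my actual script: it uses rsync as a stand-in for mirrordir, the mount point /mnt/backup is a made-up default, and it assumes the backup disk has an entry in /etc/fstab so that mount only needs the mount point.

```shell
#!/bin/sh
# Hypothetical mount point of the backup disk; needs an /etc/fstab entry.
BACKUP_MNT=${BACKUP_MNT:-/mnt/backup}

# mirror_tree SRC DST: make DST an exact mirror of SRC.
# -a        keep permissions, owners, timestamps and symlinks
# --delete  remove files from DST that no longer exist in SRC
mirror_tree() {
    rsync -a --delete "$1/" "$2/"
}

# run_backup DIR...: mount the backup disk, mirror each directory into
# a subdirectory of the same name, then unmount the disk again.
run_backup() {
    status=0
    mount "$BACKUP_MNT" || return 1
    for dir in "$@"; do
        mirror_tree "$dir" "$BACKUP_MNT/$(basename "$dir")" || status=1
    done
    umount "$BACKUP_MNT"
    return $status
}
```

A call like run_backup /home /etc (as root) would then do the whole round trip. Keep in mind that --delete makes the copy a true mirror: a file deleted by accident also vanishes from the backup on the next run, which is exactly why the backup only helps if the mistake is noticed early enough.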