Sunday morning I was awoken by a call from a co-worker saying that he couldn’t connect to our main file server. I went in to work a bit later and discovered an orange blinky light on the machine. One of the disks in the RAID5 array was unwell. After two calls to Dell tech support and hours’ worth of diagnostics, I left at 7:00 PM thinking that the problem was fixed.
Yesterday evening I got a call from my boss saying that, you guessed it, he couldn’t access the main file server. So I went in to work again. This time a different hard disk was reporting errors. So I replaced it and did a rebuild. I went home at midnight thinking (or maybe just hoping this time) that the problem was fixed.
This morning I got a call at 5:30 AM. Yes, 5:30 AM. Once again the server was offline. I groggily pulled on yesterday’s clothes and headed out. As soon as the locked door closed behind me I realized that I had forgotten my keys. I couldn’t get into the office without them. With weary regret, I banged on the door to rouse my housemates so they could let me back in. Thank Peep, Emily came quickly and didn’t seem angry. Keys in pocket, I trudged toward work.
In the server room, it was the same old problem. I was so exhausted that I could think of nothing to do but call Dell again. This time the guidance was limited. The hardware wasn’t reporting any errors, so there wasn’t much they could do. At one point, I freaked out when I thought that a driver update had failed and made the disk array entirely inaccessible. I was almost shouting at the guy on the other end of the line. Poor sod. Turns out I had left a disk in the floppy drive. I really felt like a heel.
My boss arrived at about 8:30. We talked game plans and I just got more stressed. The volume in question is the most important resource on our network apart from our Oracle database. If it is offline for an extended period, a LOT of people can’t do their work. And as we know, time is money. Tick, tick, tick…
It seemed like nothing good would ever happen again.
My boss and I both talked with techs-for-hire, whose suggestions all seemed equally relevant and equally daunting. We held off on scheduling anyone to come in. Strangely, the drive came up fine upon reboot, so we asked people to copy whatever files they needed to their local drives and work that way. I eventually realized that our backups had failed at the same file twice in a row. Maybe there were just a few corrupt files causing the problem? Jake and I started copying huge directories to another drive to see what happened. Wow. A lot was copying with no problem. But it took so long…
I went downstairs at 11:30 to have a rest on a conference room couch. I was SO tired. It was tempting to sack out entirely. But I couldn’t. Someone had the conference room scheduled for a meeting at noon.
Hours and hours passed. I was surprised at 2:50 by an alarm reminding me that in ten minutes I had a meeting with a guy from an IT tech company to do a “security audit”. Too late to cancel! I thought I could use a break, so why not? I might have been a little weird in the meeting, but at least I could keep two thoughts together. When I got back to my desk and saw the results of the file copying, it seemed evident that most of the data on the drive was fine after all, and that it really was just a small number of corrupted files causing the problems. Exactly which files were to blame was still not nailed down, but things seemed… dare I say it? Better.
I didn’t leave the office until 7:30. A fourteen-hour day. But my reward is to see (via a remote-access internet connection) that my carefully created backup job has now been running for three and a half hours without a crash. Hooray! There’s hope. Tonight I plan on getting some real sleep. And I won’t be going to the dentist appointment I forgot I had scheduled for tomorrow morning at 8:30. Cancellation fee? Who cares!