Home Backup Project - Part 5: Evaluating Incremental Backup Options

The most crucial piece of this project is keeping the primary family computer, a 20" iMac, backed up. This project might as well never have happened if this one task were left incomplete. I evaluated several options for keeping the machine backed up:

mozy.com
JungleDisk and other Amazon S3 utilities
rdiff-backup
Time Machine

mozy

Since one of the muses for this project was a post over at Computer Zen, I followed his lead initially and signed up for a free account at mozy.com (using his referral code, of course). Overall, the concept sounded fantastic. They have clients for Windows and Mac, unlimited backups for $5/month (or less, when purchased in bulk) and a fairly solid fanbase of users who are very pleased.

From a security standpoint, mozy also seemed very good. An article over at MacApper noted that files are encrypted with a 448-bit key and all transactions take place via SSL.¹

I installed the Mac client and started to play with it and was instantly turned off. There's a huge design flaw right up front: the software does not ask me what I want to do, it just starts doing what it thinks I want it to do, spidering my hard drive for likely content to be backed up. Now, even iTunes and WMP ask before taking up significant CPU resources in such a way. Not mozy, no sir. It knows that you want it to do so!

I may be an uncommon geek, but I'm fairly well organized. A child could probably find a file on my computer (without Spotlight) simply by knowing what they're looking for. Is it a document? Yes, it's probably in Documents.

Anyway, all I wanted was the dialog that says "pick your files to back up", which is the second part of the interface. I had to wait almost ten minutes before I could do that. But, for the sake of argument, I waited. After that one annoying "feature" everything else about mozy is actually pretty nice. It tells you how much space you have left, how much your backup will take (the trial is limited to 2 GB) and what files you're backing up. You tell it when to do its work, and it goes off into the background and does its business.

So, for $0, I had 1.8 GB of data backed up, which really just amounted to my Documents folder and my iTunes library files. I need about ten times that for my images and other data I want to safeguard this way.

What I liked: Speed of backup; Easy way to download backed-up files or order DVDs as needed; Cost

What I disliked: Annoying up-front user behavior assumption; Immature client - it started crashing randomly after about a week

I'm told that this was a fairly new client, and that the Windows client is supposed to be much better, so that's good to hear. For my purposes, however, all of the data I care about in my house is on my Mac. (My wife's laptop and my Linux server are both recoverable with little-to-no effort and don't contain any data that isn't stored somewhere else. My office workstation gets backed up at work, so I don't need to worry about it here.) I don't much need the Windows client, so to me, the killer app is the Mac client, which was somewhat lousy. Too bad... Scott had such nice things to say about mozy.

One additional note, however. If I were in Scott's position, and were backing up multiple computers with multiple platforms, mozy might make even more sense. If, for instance, I had a true family approach to this, and were trying to keep my and my wife's families' computers all backed up in one place, then this might make sense. For us, on one Mac, mozy didn't completely fit my needs, purely from a client perspective. Were the client overall more usable and stable, I think I might have stayed.

JungleDisk / S3

Initially, mozy was the only online option I was investigating intensively. I had looked at others (Carbonite, iDisk, box.com, etc.) but didn't like various things about their services and stuck with mozy. About a week into the process, I read a comment thread over at LifeHacker about this very idea. Several commenters noted that they had great success with JungleDisk, which uses Amazon's S3 service for storage. The S3 pricing model is pretty cheap, so its barrier to entry was quite low.

JungleDisk installs as an internet drive (similar to iDisk), though it's actually connecting to a service running on your machine. When you add/change a file on this drive, it caches it locally and then queues it for upload to S3. First of all, the upload speed is SLOW. There were times, during my initial 20 GB push, that the speed dropped to less than 1 kbps. To me, that's abysmal. I had to let it run for over a week to complete the upload. It might be faster to download data (I should probably test that, huh?), and that's what's really important now that the backup has completed.
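For a rough sense of scale, here is the back-of-the-envelope arithmetic for that upload as a small shell sketch. It assumes the push was exactly 20 GB (binary gigabytes) and took exactly seven days; I only know it was "over a week", so the real average was somewhat lower than this.

```shell
#!/bin/sh
# Average throughput for the initial JungleDisk push.
# Assumptions (not measurements): 20 GB uploaded, exactly 7 days elapsed.
size_kb=$((20 * 1024 * 1024))         # 20 GB expressed in KB
seconds=$((7 * 24 * 3600))            # one week in seconds

avg_kbytes=$((size_kb / seconds))     # integer KB per second
avg_kbits=$((size_kb * 8 / seconds))  # integer kilobits per second

echo "average: ${avg_kbytes} KB/s (${avg_kbits} kbit/s)"
# prints: average: 34 KB/s (277 kbit/s)
```

That average is far above the sub-1 kbps troughs, which suggests the worst patches were stalls rather than the steady rate.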

One nice thing about S3, however, was that I could upload data to it, and not leave a copy on my Mac. I have about 5 GB of old archived data that, if you remember, I "lost" for a few weeks. I don't need it around all of the time, but I sure do want it somewhere. S3 sounded like a good place.

The client is robust and backs up exactly what you ask it to back up. It runs on whatever schedule you set. I have it running nightly. It costs about $20 after a 30-day trial, but for backup software it's probably worth it.

What I liked: Overall storage cost; Set it and forget it; Usable for non-synchronized data

What I disliked: Client cost; Upload speed

rdiff-backup

rdiff-backup is a Unix command-line util to perform backups by, gasp, recursively diff-ing. I know. The name is a real misnomer.

It's written in Python, and it's pretty darn snappy. It backed up about 20 GB in 90 minutes (which doesn't seem that fast, but I was copying to Mr. Slow, my external "says it's 2.0 but acts like 1.1" removable drive), and then about 12 hours later ran an incremental update in 4 minutes. Not too shabby, I suppose. I also set it up to run in cron every night:

45 04 * * * rdiff-backup -v5 --print-statistics --exclude /Users/shelton/rdiff-backup.log --include /Users/shelton/Documents --include /Users/shelton/Pictures --include /Users/shelton/Music --exclude '**' /Users/shelton /Volumes/backup/BACKUP >> /Users/shelton/rdiff-backup.log

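Then, since space is limited on that drive (though not too much), I also set it up to remove increments that are older than 120 days, or roughly four months. The --force flag matters here: without it, rdiff-backup refuses to remove more than one increment in a single run. I can always trim the window down if space becomes a premium:

00 04 01 * * rdiff-backup --force --remove-older-than 120D /Volumes/backup/BACKUP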

...and that's it. Pretty darn simple, I must say. One of the really nice things about rdiff-backup is that the most recent revision is always sitting there as normal files. I can use any standard file system command (cp, mv, tar, etc.) to move/copy files out of that archive in case I lose something. It's not as space-conscious as gzipped incremental tarballs, but it's what makes most sense for me.

rdiff-backup also has some nifty restore features that seem promising.
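To make the two recovery paths concrete, here is a minimal sketch. The file name and the restore location are hypothetical, the flags are the ones I know from the classic rdiff-backup command line (newer releases may spell them differently), and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: builds and prints the commands instead of running them.
BACKUP=/Volumes/backup/BACKUP              # destination from the cron entry
FILE=Documents/taxes.pdf                   # hypothetical file to recover
TARGET=/Users/shelton/taxes-restored.pdf   # hypothetical restore location

# Latest version: the mirror holds plain files, so cp is enough.
latest_cmd="cp -p $BACKUP/$FILE $TARGET"

# Older versions: list the increments, then restore the file as of 3 days ago.
list_cmd="rdiff-backup --list-increments $BACKUP/$FILE"
older_cmd="rdiff-backup --restore-as-of 3D $BACKUP/$FILE $TARGET"

printf '%s\n' "$latest_cmd" "$list_cmd" "$older_cmd"
```

Restoring to a separate target instead of over the original leaves whatever is currently on disk untouched until I've checked the recovered copy.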

After using it for a few weeks, however, I noticed that the include/exclude flags don’t really work as I’d like them to. For instance, I want to include all of ~/Documents, but not ~/Documents/Parallels (because I have more than one Parallels VM, and that takes up a LOT of space). Shouldn’t this work properly?

--include /Users/shelton/Documents
--exclude /Users/shelton/Documents/Parallels

I guess not. As far as I can tell, rdiff-backup checks the conditions in the order they appear on the command line and the first one that matches wins, so since my --include for /Users/shelton/Documents came first and also matches /Users/shelton/Documents/Parallels, I’m out of luck. (Listing the --exclude ahead of the --include should have done the trick.) I ended up thinking different and moved Parallels up a directory and symlinked back to it, so now I have /Users/shelton/Documents/Parallels -> /Users/shelton/Parallels. It doesn’t know the difference because, well, UNIX is teh r0x. I added an exclude statement for that directory and I’m in business, with about 20 GB reclaimed.
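Here is a toy model of that behavior, going by my reading of the rdiff-backup man page: selection conditions are checked in command-line order and the first match decides. The decide helper and the VM file name are hypothetical, and the script never calls rdiff-backup; it only mimics the rule ordering.

```shell
#!/bin/sh
# decide PATH RULE... prints "include" or "exclude" for PATH.
# Each RULE is "include:<dir>" or "exclude:<dir>"; '**' matches everything.
decide() {
    path=$1
    shift
    for rule in "$@"; do
        action=${rule%%:*}   # text before the first colon
        pattern=${rule#*:}   # text after the first colon
        case "$path" in
            "$pattern" | "$pattern"/*)
                echo "$action"
                return 0 ;;
        esac
        if [ "$pattern" = '**' ]; then
            echo "$action"
            return 0
        fi
    done
    echo include             # nothing matched: included by default
}

VM=/Users/shelton/Documents/Parallels/winxp.hdd   # hypothetical file name

# Include listed first: the VM gets swept up with the rest of Documents.
decide "$VM" include:/Users/shelton/Documents \
    exclude:/Users/shelton/Documents/Parallels 'exclude:**'
# prints: include

# Exclude listed first: the VM is skipped, the rest of Documents still matches.
decide "$VM" exclude:/Users/shelton/Documents/Parallels \
    include:/Users/shelton/Documents 'exclude:**'
# prints: exclude
```

If that reading is right, moving the --exclude for the Parallels directory ahead of the --include for Documents in the cron line would get the same result without the symlink.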

While figuring out that mess, I noticed that the pruning task wasn’t quite effective. My drive filled up within two weeks, and I needed to set the prune to 14 days, then 7 days, and then 5 days to recover enough space to even start the following backup. I ended up running the prune task 45 minutes before the backup task every day and setting it to remove anything older than two days, which somewhat defeats the purpose of keeping previous versions in the first place. I may be too busy (or out of town) to notice within two days that something has gone missing. Again, I could just buy a new hard drive, but I’m also aiming for “cheap” in this project. After getting rid of Parallels, I was able to put this back to 7 days, so all in all, this isn’t such a bad option.
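Rather than relying on two cron entries 45 minutes apart, the prune and the backup could live in one wrapper so the order is guaranteed. This is only a sketch: the DRY_RUN switch and the run and nightly helpers are my own hypothetical names, the 7D window is the one I settled on above, and --force is there on the assumption that more than one increment may need removing in a single run.

```shell
#!/bin/sh
# Sketch: prune first, then back up, in a single nightly job.
BACKUP=/Volumes/backup/BACKUP
KEEP=7D                     # retention window
DRY_RUN=${DRY_RUN:-1}       # prints by default; set DRY_RUN=0 to execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

nightly() {
    # The backup only starts if the prune succeeded.
    run rdiff-backup --remove-older-than "$KEEP" --force "$BACKUP" &&
    run rdiff-backup \
        --include /Users/shelton/Documents \
        --include /Users/shelton/Pictures \
        --include /Users/shelton/Music \
        --exclude '**' \
        /Users/shelton "$BACKUP"
}

nightly
```

Cron would then need a single entry pointing at this script, with DRY_RUN=0 set once the printed commands look right.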

rdiff-backup can be used to back up remotely, but you need to provide a remote machine and, unless you can mount it as a drive, you need rdiff-backup to work properly on the remote machine as well. I could not get it to work on my Ubuntu server in the basement, so that was somewhat annoying. I thought about using it to sync to S3, but the overall speed of S3 was so slow that it wasn’t worth it.
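For reference, this is roughly what the remote setup looks like, as far as I understand the rdiff-backup syntax. The account, host name, and remote directory are hypothetical, and the script only prints the commands. Comparing versions on both ends is one thing worth ruling out, since mismatched releases are a commonly reported cause of remote failures; I have not confirmed that was my problem.

```shell
#!/bin/sh
# Sketch only: prints the commands; nothing is contacted.
REMOTE_USER=shelton            # hypothetical account on the server
REMOTE_HOST=basement.local     # hypothetical host name
REMOTE_DIR=/srv/backup/imac    # hypothetical directory on the server

# rdiff-backup addresses a remote directory as user@host::/path
REMOTE="${REMOTE_USER}@${REMOTE_HOST}::${REMOTE_DIR}"

# Compare versions on both ends before anything else.
version_local="rdiff-backup --version"
version_remote="ssh ${REMOTE_USER}@${REMOTE_HOST} rdiff-backup --version"

# Check that a compatible server answers, then run the backup over ssh.
test_cmd="rdiff-backup --test-server $REMOTE"
backup_cmd="rdiff-backup /Users/shelton/Documents $REMOTE"

printf '%s\n' "$version_local" "$version_remote" "$test_cmd" "$backup_cmd"
```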

Time Machine

Leopard isn’t out yet, so I couldn’t fully evaluate Time Machine. However, everything I’ve seen and read leads me to believe that when I do upgrade, I’ll also buy an external drive 2-4 times the size of my internal drive and point Time Machine at it and just pretend it isn’t there. It also doesn’t satisfy the “remote” aspect of this part of the project, but it might be yet another way to keep my data safe.

Conclusion

So what did I end up doing? Well, a little of column A, and a little of column B.

My disgruntled-ness with mozy’s poor client was enough of a detracting factor for me that I ditched it entirely. If I had a Windows household, I probably would have kept it. In the end, I stuck with rdiff-backup running every morning. I’m almost certain I’ll switch over to Time Machine once I upgrade to 10.5, but if I do, I’ll probably want a bigger backup drive than what I have.

¹ The last part seems a no-brainer, but having my data encrypted makes me feel all warm and fuzzy inside.

Scribblings and Geekery