Data Storage Thoughts

Recent events have me thinking about data storage: “Pocket”, the need to back up a machine, and an overflowing NAS.

The State of Storage: Thoughts on the Filesystem

Back in the days of paper (like those are bygone, ha!) we had sheets made of wood pulp covered with writing. A single, coherent set of sheets could be assembled to constitute a “document”, which might be kept with related documents in a manila folder or a binder; this collection of related documents was called a “file”. Collections of files were organized in a specialized drawer system called “file cabinets”, which might be in the “file room” if the number of files necessitated many file cabinets.

When computers came along, we had decks of punch cards and tapes, which were filed in new ways because their size and shape differed from traditional paper. Nevertheless, the same filing idea was there: a “document” was filed together in some particular way, such that it could be retrieved whenever needed and its contents utilized.

As we progressed into online storage, we invented the “filesystem”: initially a flat system akin to a single folder holding multiple documents; then a single level of folders or directories was introduced; and eventually, as storage space grew and with it the need for organization, filesystems bloomed into a hierarchical scheme for storing all manner of documents and data.

Although the old file cabinets were easy to understand in principle, many found them hard to actually operate: in file rooms, documents had to go in the right file cabinet, in the right order; failure to do this correctly was punished with mind-numbing linear poring through files for one misplaced.

Despite the parallel to the real world, many find the concept of filesystems hard to understand; perhaps the computer world’s nomenclature of “files” and “directories” instead of “documents” and “folders” obscured the similarity. To its credit, Apple was good about recognizing this issue and adjusting their terminology up-front to help users understand filesystems are just the computer version of a file room.

As smartphones were introduced, it looked like the filesystem might disappear. Applications had lists of documents, but no way to assemble related ones. No way to keep letters to clients with the bills prepared with a spreadsheet, or keep the project documents with the letters or spreadsheets. Documents were associated with their applications, and couldn’t be moved between applications. As application capability has increased, some of these limitations have disappeared while some filing aspects have reappeared. A single list of documents is not a good organization scheme, even if it is alphabetized.

There’s a feature called “Pocket” in Firefox: “Click the Pocket Button in your Firefox toolbar and save any article, video or page you want to come back to later. It’ll be organized, easy to find, and all in one place.” Huh… so a lot like a filesystem. “Pocket lets you read saved articles offline, making it indispensable for anywhere you find yourself without an Internet connection.” So… just like “Save as” to my filesystem?

I admit that Pocket’s ability to sync data across devices is something my filesystem doesn’t do, but I can’t help wondering: perhaps it should?

We’ve been building filesystems for 50 years; we understand what they are and what they do. In most of the abstractions that pretend they aren’t filesystems, they’re actually just hiding the fact that there’s a 1:1 mapping to some document in a folder somewhere in the filesystem. Except now, it’s organized by Pocket, probably given some screwy indistinguishable name, and limited by Pocket’s organizational limits.

We’ve been working on the organizational issues of filesystems for ages. I find myself concluding that it’s not because filesystems are inherently troublesome or overly complicated. The real trouble is that storing a large volume of heterogeneous data in a coherent way to allow reliable retrieval in the future takes effort and skill. We will never solve complexities posed by filesystems by doing away with them or hiding them: for users with simple needs a list may be fine, but that doesn’t work as use scales. All the tools we have in the world do not make up for human-imposed organization. I mean, I love grep(1) and locate(1) and Spotlight (an easy-to-use, system-wide search on the Mac). They’re indispensable.

But imagine a search for the estate files for Joe Smith.

And if I find one, do I know it’s the latest? Or is it the draft?

And what could I learn when I pull up “Smith, Joe” and see there’s also a real estate file and a trust?

Okay, I don’t work in a law office (though I did for a while), but I think my point holds: filesystems help me organize related ideas so when I go back to them, or someone else does, they find the whole thing. Not just bits and pieces that show up early in the search, or that happen to be done with this application.

When we’re (re)inventing stupid things like Pocket, we’re approaching it the wrong way. Rather than starting again from scratch, we should be thinking about data synchronization for existing filesystems, where we can continue using the past 50 years’ worth of tools built to manage all this data. Where we don’t have to reinvent those tools all over again. Where data can be stored in a coherent, understandable manner, for anyone willing to put in a little time. Where data can be moved from one application to another, instead of being locked into its creator.

Sometimes, when things begin changing and technology surges ahead, I wonder if I’m a dinosaur, bouncing between command line and windowed applications. Am I making do with ancient tech because I’m cheap, or am I stubborn or just set in my ways?

Sometimes I think maybe I should try some of this new tech. I admit, I adore parts of the data synchronization features provided by cloud solutions.

But then, Twitter expands tweets beyond 140 characters because they’re too limiting. Pocket shows up with a laughable reinvention of the organizational abilities of my filesystem. If I bought a tablet, I could get a keyboard peripheral so I could type on it like a laptop. People with their phones are in a continual upgrade cycle, like us techies were in the ’90s and first few years of the millennium. We finally dug our way out of that crazy cycle, and these nutters decided to jump in and give it a try. It’s funny, having been called a nerd or a geek all those years, and now the shoe is on the other foot, isn’t it?

So, as I sit here writing this on Stefanie (a 6+-year-old laptop with all the compute power I need), writing in MultiMarkdown (a typewriter-like format that I can run through a command-line tool to make it look pretty), with a command line window open in the background (ready to translate this into a web page later), the screen’s background an image from Wunderground’s World View (retrieved to my filesystem fresh each week by a script I wrote 10ish years ago, using a 40-year-old editor vi, in a 40-year-old language sh), I remain confounded and amused by the insanity of these new devices and their attempts to reinvent the world in new and better ways. Because from my viewpoint, it looks like a lot of it, some day down the road, will eventually catch up to where I am today: doing what I need it to do, and working well.

Backups and Archiving

I have long been pedantic about backups, a paranoia perhaps instilled by almost losing major amounts of data twice in the ’90s. I was amazingly lucky both times: once, the hard drive gave me a reprieve just long enough to back up the system before it quit for good; the other had a sort of electronic stroke but betrayed its failing condition before it had a chance to lose all the data.

But also dating to the ’90s, or perhaps even earlier, was an “Archive” directory: where things went when I was done with them but thought I might want them again. It still lives on today, with a linked list implementation I wrote for WITR’s album database in 1992 safely tucked away in case I ever want to use it again (though I’d rather use C++’s STL containers).

Despite the history, the distinction of backups vs archives has only struck me recently; there certainly is similarity and overlap. Between 2006 and 2014, I duplicated my personal files to my hosting provider nightly, where a script organized them by date. Each night, it replicated the work, aliasing unchanged files (using UNIX hard links) to prepare snapshots of my data.
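The hard-link snapshot idea can be sketched as follows. This is a minimal illustration in Python, not the shell scripts described above; the function name `snapshot` and the directory layout are assumptions made for the example. rsync's `--link-dest` option provides comparable behaviour, though nothing here confirms it is what the original scripts used.

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, previous: Optional[Path], dest: Path) -> None:
    """Copy `source` into `dest`, hard-linking files unchanged since `previous`.

    `previous` is the prior snapshot directory (or None for the first run).
    Both snapshots must live on the same filesystem for hard links to work.
    """
    for root, _dirs, files in os.walk(source):
        rel = Path(root).relative_to(source)
        (dest / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            src = Path(root) / name
            new = dest / rel / name
            old = previous / rel / name if previous is not None else None
            if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
                # Unchanged: the new snapshot shares the old snapshot's data.
                os.link(old, new)
            else:
                # New or modified: store a fresh copy (with timestamps).
                shutil.copy2(src, new)
```

Each snapshot directory then looks like a complete copy of the data, while an unchanged file costs only a directory entry rather than a second copy on disk.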

With limited space before I had to pay extra, there was incentive to watch my data weight, so not all my data went. Excluded were audio files, because they’re big and bulky, and also most derived data, since it could be rebuilt with not-too-much effort, and a single input change might affect many derived documents. After a few weeks, my script culled the data down to every fifth day (easier to pattern match than weekly), then only every tenth day.

Periodically, I manually culled the older parts of the collection down to monthly, then quarterly.
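The automated part of that thinning schedule might look something like this in Python. It is only a sketch under assumptions: snapshots are taken to be named by ISO date (such as `2008-03-15`), the cutoffs `DAILY_WINDOW` and `FIFTH_WINDOW` are invented for illustration, and "every fifth day" is read as a day-of-month ending in 0 or 5, which is one plausible reading of "easier to pattern match". The manual monthly and quarterly culling is not modelled.

```python
from datetime import date

DAILY_WINDOW = 21  # assumed: keep every snapshot for "a few weeks"
FIFTH_WINDOW = 90  # assumed: after this many days, thin to every tenth day


def keep(snapshot_day: date, today: date) -> bool:
    """Decide whether a snapshot taken on `snapshot_day` survives culling."""
    age = (today - snapshot_day).days
    if age <= DAILY_WINDOW:
        return True
    if age <= FIFTH_WINDOW:
        return snapshot_day.day % 5 == 0  # day-of-month ends in 0 or 5
    return snapshot_day.day % 10 == 0     # day-of-month ends in 0


def cull(names, today):
    """Return the snapshot names (ISO dates) that should be deleted."""
    return [n for n in names if not keep(date.fromisoformat(n), today)]
```

Keying on the day-of-month rather than on a rolling count means the same snapshots survive no matter when the cull runs, at the cost of slightly uneven spacing around month ends.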

In reality, what I was creating was an archive, not a backup. But I still didn’t notice the difference.

In 2014, Dreamhost demanded I move my backups to a different server, stripping my ability to use the shell scripts that had been managing backups for 8 or 9 years. I downloaded my archives so I didn’t lose anything, and switched to Arq, which does something similar to my previous home-brew solution but in a manner quite like Time Machine.

Arq bothered me though: the files aren’t in a filesystem. They’re instead in some mysterious encrypted format, in “buckets” without any relation to the original structure. Arq manages where they come from and where they go back, and Arq is the only way to manage the contents. But it’s easy and automated.

Recently, I wanted a script I wrote years ago that got deleted; I thought I would never need it again. Searching my formerly-Dreamhost archive, quickly I found it among web site files from 2008. It got me thinking, finally leading me to differentiate archives and backups.

Backups: Backups cover disasters, and that’s pretty much it. Backups are what you break out when you trash or mangle a file, have a drive go bad, or your laptop is stolen. Backups include everything, and restoring one means you are back where you were at the time of the backup.

Archives: Archives keep things you might need someday, and in a pinch, you could do a full restore from them, with effort.

Here’s a table of the simple differences:

Attribute               Rsync †       Time Machine   Arq
Frequency               Configurable  Hourly, fixed  Configurable
Reliability             Yes           No †3          Maybe?
Technical longevity †2  Yes           If locked in   Maybe?
Data volume             Up to you     Gets crazy     Up to you
Simple recovery         Medium        Easy           Easy
Search                  Medium        Impossible     Impossible
Manipulation (ease)     Medium        Slow, medium   Slow, medium
Manipulation (capable)  Lots          Limited        Limited
Retention policy        My control    Preset/manual  Preset/manual
So what is it for?      Archiving     Backups        Backups

† “rsync” is my home-brew archiving solution.

†2 Apple is large enough that Time Machine will probably be long-lived, provided I’m willing to be locked into Apple’s software/hardware ecosystem. Arq is made by a smaller company; it’s a decent product so it will probably be around a long time, but there’s no telling.

†3 Time Machine over a network periodically corrupts the backup set, requiring starting over and losing past snapshots.

Here’s what those differences mean:

  • Backups are not good for archiving. Every operating system or software update ends up in the backup, forever. Is there any reason I would want to go back to Firefox 28, 29, 30, 31, 31.1, etc.? No. They’re junk; I want the current one. Maybe one edition or two back, just in case I find a major bug. Replicate that same problem to Safari, Mail, Chrome, Skype, Money, BBEdit, Cornerstone, Arq and the operating system itself, and you have a bulky archive of outdated software that isn’t needed.

  • Archives require attention, backups should not. Backups should be easy to set up, and take care of themselves without worry. Archives require differentiating worthwhile data vs junk, with occasional curating and culling. For example, I like having a history of client websites, and archive snapshots are perfect for referring back to them or reusing something temporarily abandoned. But if a client leaves or becomes defunct, those older editions become extraneous. Maintaining an archive takes time and judgment.

  • Archives are for long-term storage, backups are not. Backups cannot be researched effectively. That script I wanted from a few years ago? I didn’t remember when. It wasn’t for my machine, so I wasn’t sure where I filed it. Trying to find something when you don’t know where to look, on a slow backup system, is worse than linear searching file cabinets for a missing document.

  • Archives can be researched, backups cannot. If you’re looking for something like phrases in an e-mail in a backup, good luck. But I can, and have, researched and found things like these in my archives. It requires ugly command line work, but it beats digging around blindly.

  • Archives can be readily examined, backups cannot. I can see what’s taking up space in my archive. When it’s time for culling, space use suggests good places to look for dead data. Meanwhile, Time Machine backups of 185GB are taking up 800GB on my file server, but I have no idea why and no way to find out. It is similarly impossible to find out what’s taking up space in Arq.

  • Backups are proprietary, archives are not. Backup systems are often roach motels: data goes in, but does not (easily) come back out. With Arq, there’s no way to ever move my data anywhere else without lots and lots of manual intervention. Time Machine allows some command line use, which could extract data wholesale—but then it wouldn’t be in Time Machine, it would be plain-old files, just like what’s already in my archive.

  • Backups are bulky and cumbersome, archives are awkward and cumbersome. My archive from 2006–2014 was about 15GB, and my 2014–2016 archive was 10GB, so I can move the data from place to place easily. Backups, on the other hand, even after excluding some derived data and easily-replaced downloaded data, start at 60GB and steadily grow: costly and difficult to manage or move because of their size.

  • Backups are for recovery, archives are not. Backup recovery is easy: start it, go away for a few hours, and it’s ready; to restore a file you broke, browse the repository and restore it. A full recovery from an archive, though possible, requires reinstalling the operating system from scratch, reinstalling all applications, then restoring your files from archive. Then you would rebuild any derived data, including re-ripping any compact discs or other media.
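Two of the archive operations in the list above, searching file contents and seeing what takes up space, need nothing more than an ordinary directory tree. The Python below is a rough sketch of both; the function names and the one-directory-per-snapshot layout are assumptions for illustration, and it stands in for the "ugly command line work" rather than reproducing it.

```python
import os
from pathlib import Path


def find_phrase(archive: Path, phrase: str):
    """Yield every file under `archive` whose contents include `phrase`."""
    needle = phrase.encode("utf-8")
    for root, _dirs, files in os.walk(archive):
        for name in files:
            path = Path(root) / name
            try:
                # Reads the whole file into memory; fine for a sketch.
                if needle in path.read_bytes():
                    yield path
            except OSError:
                continue  # unreadable file: skip it


def size_by_subdir(archive: Path) -> dict:
    """Bytes used per top-level entry, counting hard-linked data only once."""
    seen = set()
    totals = {}
    for top in sorted(archive.iterdir()):
        if top.is_dir():
            paths = [Path(r) / f for r, _d, fs in os.walk(top) for f in fs]
        else:
            paths = [top]
        total = 0
        for path in paths:
            st = path.lstat()
            key = (st.st_dev, st.st_ino)
            if key not in seen:  # data first seen here is charged to this entry
                seen.add(key)
                total += st.st_size
        totals[top.name] = total
    return totals
```

Because hard-linked files are charged to the earliest entry that contains them, a later snapshot's total shows only the data that was new in that snapshot, which is a useful hint when deciding what to cull.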

I am glad to finally see the difference. I could see excessive data hoarding as potential trouble, an opportunity to get mired in maintaining the past instead of being in the present and moving forward; I have a friend who is like this. But I also know the times I’ve needed something, or unexpectedly needed to refer back to some document that I deleted, or wanted to compare with an earlier version. Trying to save everything and have it at-the-ready would also be problematic, so having an archive is a good compromise.

Both backups and archives have their purposes. Considering the programs I listed above:

  • Given Time Machine’s unreliability over networks, occasionally imposing starting over from scratch, a second scheme or backup destination is essential should hardware fail during the window of an “initial” backup. Time Machine’s whole-disk strategy is not great for long-term data preservation.

  • Arq is hybridish, but it lacks search capabilities and the ability to easily move or manipulate data it has stored, and it uses proprietary mechanisms. It could be a fallback to Time Machine, and it’s easy to use. It works well as an away-from-home Time Machine.

  • rsync is command line, and consequently harder to use. But it can store data in unencrypted, uncomplicated ways in trusty-old, time-tested filesystems ready to be searched and manipulated. I can move, search, duplicate and even delete it just as I could before archiving. It is far more likely the data could be retrieved by a different type of machine, should it be necessary. Paralleling the archival-quality paper and inks that differentiate a cheap print from archival-quality production, these are the attributes that identify viability of long-term data storage.

I think for many, managing an archive is sufficiently time-consuming that it’s easier just to throw money at it and treat backups as archives. The tendency toward streaming purchased media instead of keeping local copies might make it easier, but will likely be offset by growth in personal media weight (pictures and videos, especially), although much of that’s going to online services, at least for the masses.

Perhaps in the future, someone will invent a single solution that can meet all needs, but I don’t think we’re there yet. It would mean tri-state logic for inclusion/retention (never back up, back up and retain only temporarily, and retain indefinitely). It would necessitate improved search capabilities across date ranges in an archive, and not just for file names but file contents. It would need better ability to size, manipulate, and trim archives. Meanwhile, I’ll continue using my existing, long-lived archive in a filesystem.
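The tri-state inclusion/retention idea could be expressed as a small rule table. The Python below is a hypothetical sketch: the `Retention` names, the patterns in `RULES`, and the first-match-wins ordering are all assumptions chosen to echo examples from earlier in this post (audio files, derived data, application updates), not a description of any existing tool.

```python
from enum import Enum
from fnmatch import fnmatchcase


class Retention(Enum):
    NEVER = "never back up"
    TEMPORARY = "back up and retain only temporarily"
    INDEFINITE = "retain indefinitely"


# Hypothetical rules; the first matching pattern wins.
# Note that fnmatch's "*" also matches "/" in paths.
RULES = [
    ("*.mp3", Retention.NEVER),               # bulky audio, can be re-ripped
    ("build/*", Retention.NEVER),             # derived data, can be rebuilt
    ("Applications/*", Retention.TEMPORARY),  # only recent versions matter
    ("Documents/*", Retention.INDEFINITE),    # the actual archive
]


def classify(path: str, default: Retention = Retention.TEMPORARY) -> Retention:
    """Return the retention state for a path relative to the backup root."""
    for pattern, state in RULES:
        if fnmatchcase(path, pattern):
            return state
    return default
```

Defaulting unmatched paths to temporary retention keeps the backup side safe (everything is recoverable for a while) while leaving the archive side opt-in, which matches the point that archives need human judgment.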