Data Storage Thoughts

Recent events have me thinking about data storage: “Pocket”, the need to back up a machine, and an overflowing NAS.

The State of Storage: Thoughts on the Filesystem

Back in the days of paper (like those are bygone, ha!) we had sheets made of wood pulp covered with writing. A single, coherent set of sheets could be assembled to constitute a “document”, which might be kept with related documents in a manila folder or a binder; this collection of related documents was called a “file”. Collections of files were organized in a specialized drawer system called “file cabinets”, which might be in the “file room” if the number of files necessitated many file cabinets.

When computers came along, we had decks of punch cards and tapes, which were filed in new ways because their size and shape differed from traditional paper. Nevertheless, the same filing idea was there: a “document” was filed together in some particular way, such that it could be retrieved whenever needed and its contents utilized.

As we progressed into online storage, we invented the “filesystem”: initially a flat system akin to a single folder holding multiple documents; then a single level of folders or directories was introduced; and eventually, as storage space grew and with it the need for organization, filesystems bloomed into a hierarchical scheme for storing all manner of documents and data.

Although the old file cabinets were easy to understand in principle, many found them hard to actually operate: in file rooms, documents had to go in the right file cabinet, in the right order; failure to do this correctly was punished with a mind-numbing linear search through the files for the one that was misplaced.

Despite the parallel to the real world, many find the concept of filesystems hard to understand; perhaps the computer world’s nomenclature of “files” and “directories” instead of “documents” and “folders” obscured the similarity. To its credit, Apple was good about recognizing this issue and adjusting their terminology up-front to help users understand filesystems are just the computer version of a file room.

As smartphones were introduced, it looked like the filesystem might disappear. Applications had lists of documents, but no way to assemble related ones. No way to keep letters to clients with the bills prepared with a spreadsheet, or keep the project documents with the letters or spreadsheets. Documents were associated with their applications, and couldn’t be moved between applications. As application capability increased, some of these limitations have disappeared while some filing aspects have reappeared. A single list of documents is not a good organization scheme, even if it is alphabetized.

There’s a feature called “Pocket” in Firefox: “Click the Pocket Button in your Firefox toolbar and save any article, video or page you want to come back to later. It’ll be organized, easy to find, and all in one place.” Huh… so a lot like a filesystem. “Pocket lets you read saved articles offline, making it indispensable for anywhere you find yourself without an Internet connection.” So… just like “Save as” to my filesystem?

I admit that Pocket’s ability to sync data across devices is something my filesystem doesn’t do, but I can’t help wondering: perhaps it should?

We’ve been building filesystems for 50 years; we understand what they are and what they do. In most of the abstractions that pretend they aren’t filesystems, they’re actually just hiding the fact that there’s a 1:1 mapping to some document in a folder somewhere in the filesystem. Except now, it’s organized by Pocket, probably given some screwy indistinguishable name, and limited by Pocket’s organizational limits.

We’ve been working on the organizational issues of filesystems for ages. I find myself concluding it’s not because filesystems are inherently troublesome or overly complicated. The real trouble is that storing a large volume of heterogeneous data in a coherent way to allow reliable retrieval in the future takes effort and skill. We will never solve the complexities posed by filesystems by doing away with them or hiding them. For users with simple needs a list may be fine, but that doesn’t work as use scales, and all the tools in the world do not make up for human-imposed organization. I mean, I love grep(1) and locate(1) and Spotlight (an easy-to-use, system-wide search on the Mac). They’re indispensable.

But imagine a search for the estate files for Joe Smith.

And if I find one, do I know it’s the latest? Or is it the draft?

And what could I learn when I pull up “Smith, Joe” and see there’s also a real estate file and a Trust?

Okay, I don’t work in a law office (though I did for a while), but I think my point holds: filesystems help me organize related ideas so when I go back to them, or someone else does, they find the whole thing. Not just bits and pieces that show up early in the search, or that happen to be done with this application.

When we’re (re)inventing stupid things like Pocket, we’re approaching it the wrong way. Rather than starting again from scratch, we should be thinking about data synchronization for existing filesystems, where we can continue using the past 50 years’ worth of tools built to manage all this data. Where we don’t have to reinvent those tools all over again. Where data can be stored in a coherent, understandable manner, for anyone willing to put in a little time. Where data can be moved from one application to another, instead of being locked into its creator.

Sometimes, when things begin changing and technology surges ahead, I wonder if I’m a dinosaur, bouncing between command line and windowed applications. Am I making do with ancient tech because I’m cheap, or am I stubborn, or just set in my ways?

Sometimes I think maybe I should try some of this new tech. I admit, I adore parts of the data synchronization features provided by cloud solutions.

But then, Twitter expands tweets beyond 140 characters because they’re too limiting. Pocket shows up with a laughable reinvention of the organizational abilities of my filesystem. If I bought a tablet, I could get a keyboard peripheral so I could type on it like a laptop. People with their phones are in a continual upgrade cycle, like us techies were in the ’90s and first few years of the millennium. We finally dug our way out of that crazy cycle, and these nutters decided to jump in and give it a try. It’s funny, having been called a nerd or a geek all those years, and now the shoe is on the other foot, isn’t it?

So, as I sit here writing this on Stefanie (a 6+-year-old laptop with all the compute power I need), writing in MultiMarkdown (a typewriter-like format that I can run through a command-line tool to make it look pretty), with a command line window open in the background (ready to translate this into a web page later), the screen’s background an image from Wunderground’s World View (retrieved to my filesystem fresh each week by a script I wrote 10ish years ago, using a 40-year-old editor vi, in a 40-year-old language sh), I remain confounded and amused by the insanity of these new devices and their attempts to reinvent the world in new and better ways. Because from my viewpoint, it looks like a lot of it, some day down the road, will eventually catch up to where I am today: doing what I need it to do, and working well.

Backups and Archiving

I have long been pedantic about backups, a paranoia perhaps instilled by almost losing major amounts of data twice in the ’90s. I was amazingly lucky both times: once, the hard drive gave me a reprieve just long enough to back up the system before it quit for good; the other had a sort of electronic stroke but betrayed its failing condition before it had a chance to lose all the data.

But also dating to the ’90s, or perhaps even earlier, was an “Archive” directory: where things went when I was done with them, but thought I might want them again. It still lives on today, with a linked-list implementation I wrote for WITR’s album database in 1992 safely tucked away in case I ever want to use it again (though I’d rather use C++’s STL containers).

Despite the history, the distinction of backups vs archives has only struck me recently; there certainly is similarity and overlap. Between 2006 and 2014, I duplicated my personal files to my hosting provider nightly, where a script organized them by date. Each night, it replicated the work, aliasing unchanged files (using UNIX hard links) to prepare snapshots of my data.
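The hard-link snapshot idea can be sketched in a few lines. This is a minimal illustration in Python, not the script described above: the function name `snapshot` and its parameters are made up for the example, and it assumes the target directory is fresh and lives on the same filesystem as the previous snapshot (hard links cannot cross filesystems).

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(source: Path, previous: Optional[Path], target: Path) -> None:
    """Copy `source` into `target`, hard-linking files unchanged since `previous`.

    Hypothetical helper for illustration; assumes `target` does not exist yet.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (target / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            src = Path(dirpath) / name
            dst = target / rel / name
            old = previous / rel / name if previous is not None else None
            if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
                # Unchanged since the last snapshot: add another name for the
                # same data instead of storing a second copy.
                os.link(old, dst)
            else:
                # New or modified file: store a real copy, keeping timestamps.
                shutil.copy2(src, dst)
```

Each snapshot directory then looks like a complete copy of the data, while unchanged files cost almost no extra space. rsync offers comparable behaviour through its `--link-dest` option.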

With limited space before I had to pay extra, there was incentive to watch my data weight, so not all my data went: excluded were audio files, because they’re big and bulky; and also most derived data, since it could be rebuilt with not-too-much effort, and a single input change might affect many derived documents. After a few weeks, my script culled the data down to every fifth day (easier to pattern match than weekly), then only every tenth day.

Periodically, I manually culled the older parts of the collection down to monthly, then quarterly.
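The scripted thinning described above amounts to a small retention rule. The sketch below is an illustration in Python, not the original script. The cut-offs (`dense_days`, `fifth_days`) are assumptions, since the text only says “a few weeks”, and “every fifth day” is interpreted as days of the month ending in 0 or 5, which is what makes it easy to pattern match on a date-named directory. The manual monthly and quarterly passes are left out.

```python
from datetime import date


def keep_snapshot(snap: date, today: date,
                  dense_days: int = 21, fifth_days: int = 90) -> bool:
    """Decide whether the snapshot taken on `snap` should be kept.

    Hypothetical policy: the thresholds are assumed values, not the author's.
    """
    age = (today - snap).days
    if age <= dense_days:
        return True                 # recent: keep every nightly snapshot
    if age <= fifth_days:
        return snap.day % 5 == 0    # older: keep the 5th, 10th, 15th, ...
    return snap.day % 10 == 0       # oldest: keep the 10th, 20th, 30th
```

A nightly job would call this for each snapshot directory and delete the ones for which it returns `False`. Because unchanged files are hard links, deleting a snapshot only frees the data that no other snapshot still references.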

In reality, what I was creating was an archive, not a backup. But I still didn’t notice the difference.

In 2014, Dreamhost demanded I move my backups to a different server, stripping my ability to use the shell scripts that had been managing backups for 8 or 9 years. I downloaded my archives so I didn’t lose anything, and switched to Arq, which does something similar to my previous home-brew solution but in a manner quite like Time Machine.

Arq bothered me though: the files aren’t in a filesystem. They’re instead in some mysterious encrypted format, in “buckets” without any relation to the original structure. Arq manages where they come from and where they go back, and Arq is the only way to manage the contents. But it’s easy and automated.

Recently, I wanted a script I wrote years ago that got deleted; I thought I would never need it again. Searching my formerly-Dreamhost archive, I quickly found it among web site files from 2008. It got me thinking, finally leading me to differentiate archives and backups.

Backups: Backups cover disasters, and that’s pretty much it. Backups are what you break out when you trash or mangle a file, have a drive go bad, or your laptop is stolen. Backups include everything, and restoring one means you are back where you were at the time of the backup.

Archives: Archives keep things you might need someday, and in a pinch, you could do a full restore from them, with effort.

Here’s a table of the simple differences:

Attribute              | Rsync †      | Time Machine  | Arq
Frequency              | Configurable | Hourly, fixed | Configurable
Reliability            | Yes          | No †3         | Maybe?
Technical longevity †2 | Yes          | If locked in  | Maybe?
Data volume            | Up to you    | Gets crazy    | Up to you
Simple recovery        | Medium       | Easy          | Easy
Search                 | Medium       | Impossible    | Impossible
Manipulation (ease)    | Medium       | Slow, medium  | Slow, medium
Manipulation (capable) | Lots         | Limited       | Limited
Retention policy       | My control   | Preset/manual | Preset/manual
So what is it for?     | Archiving    | Backups       | Backups

† “rsync” is my home-brew archiving solution.

†2 Apple is large enough that Time Machine will probably be long-lived, provided I’m willing to be locked into Apple’s software/hardware ecosystem. Arq is made by a smaller company; it’s a decent product so it will probably be around a long time, but there’s no telling.

†3 Time Machine over a network periodically corrupts the backup set, requiring starting over and losing past snapshots.

Here’s what those differences mean:

I am glad to finally see the difference. I could see excessive data hoarding as potential trouble, an opportunity to get mired in maintaining the past instead of being in the present and moving forward; I have a friend who is like this. But I also know the times I’ve unexpectedly needed to refer back to some document that I deleted, or wanted to compare with an earlier version. Trying to save everything and have it at-the-ready would also be problematic, so having an archive is a good compromise.

Both backups and archives have their purposes. Considering the programs I listed above:

I think for many, managing an archive is sufficiently time-consuming that it’s easier just to throw money at it and treat backups as archives. The tendency toward streaming purchased media instead of keeping local copies might make it easier, but will likely be offset by growth in personal media weight (pictures and videos, especially), although much of that’s going to online services, at least for the masses.

Perhaps in the future, someone will invent a single solution that can meet all needs, but I don’t think we’re there yet. It would mean tri-state logic for inclusion/retention (never back up, back up and retain only temporarily, and retain indefinitely). It would necessitate improved search capabilities across date ranges in an archive, and not just for file names but file contents. It would need better ability to size, manipulate, and trim archives. Meanwhile, I’ll continue using my existing, long-lived archive in a filesystem.
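The tri-state inclusion/retention idea could look something like the following. This is a rough sketch in Python under stated assumptions: the names `Retention`, `RULES`, and `classify` are invented for illustration, the glob patterns are examples only, and defaulting to “retain indefinitely” is an arbitrary choice rather than anything a real tool does.

```python
from enum import Enum
from fnmatch import fnmatchcase


class Retention(Enum):
    NEVER = "never back up"
    TEMPORARY = "back up and retain only temporarily"
    INDEFINITE = "retain indefinitely"


# Hypothetical rules, checked in order; the first matching pattern wins.
RULES = [
    ("*.mp3", Retention.NEVER),        # bulky audio: leave it out entirely
    ("build/*", Retention.TEMPORARY),  # derived data: rebuildable, so short-lived
]


def classify(path: str, rules=RULES,
             default: Retention = Retention.INDEFINITE) -> Retention:
    """Return the retention state for a path relative to the backup root."""
    for pattern, policy in rules:
        if fnmatchcase(path, pattern):
            return policy
    return default
```

For example, `classify("music/song.mp3")` yields `Retention.NEVER`, while a path that matches no rule falls through to the default. A real tool would still need the culling and search machinery on top of this.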