How I would store 100TB so that it is available for the next 100 years

In the summer of 2012 I applied to and interviewed at an intriguing startup in Palo Alto. I'm still not 100% sure what they were working on, except that it involved distributed computing and storage of some kind, they were working with Blocks in C, and they were definitely "thinking big". And the leadership of the company was something of a dream team. They had posted a fascinating challenge to their website for job candidates, which I attacked with gusto. Unfortunately, they were not interested in me. Perhaps if I had spent as much time talking about content addressable storage concepts instead of physical hardware I'd be in like Flynn.

*shrug*
Such is life.

But their loss is your gain: My answer to their challenge is posted here for your entertainment!


tl;dr: Disk, backed by tape, backed by an 11' microfiche cube. But ultimately this is a problem of institutions and processes, not technology.

"Available" in this context is a tricky word. Availability implies a system that can provide the data immediately, which brings mechanical and electrical systems into play. These systems require ongoing maintenance, and since almost none of the adults alive today will be around for the next 100 years to provide this maintenance, the stewardship of the availability of this system comes down to the institutions and processes that are entrusted with this responsibility.

100TB is not that big of a challenge. It's the 100 years part that really complicates matters. Filesystems, databases, and so on: there are many viable options, but the design of a long-lived physical storage system is quite a bit harder, so I'll focus my thoughts on that.

I'm going to assume that the 100TB is already deduped and compressed. If not, that's the first step.

The only way I know of to reliably store 100TB of data that is at all manageable for that amount of time is to store it on microfiche. Yes, really! Optical disc formats like CDs and DVDs oxidize at an alarming rate, even "archival" quality media. Flash degrades on the order of decades, too. Even programmable ROM ICs can be difficult to extract data from only 30 years later. But that ancient film-negative junk you find at a library with the back-lit viewer and everything is great for the longevity requirement.

Microfiche is an A6 film card, 105mm × 148mm, with a .03mm thickness. Silver halide film on a polyester base has an expected usable lifetime of 500 years, plenty for this project. What I'm really talking about here is "Computer Output Microform", a well-established and growing industry. But despite excellent work being done to analyze the possibilities of microform in hybrid digital/analog roles [1] [2] (i.e. storage of some number of "pages"), there isn't as much use of the medium in a pure digital application. Today the medium is usually 16mm or 35mm film, but I'll stick with our A6 sheets to make the math easier.

[1]: http://www.dcc.ac.uk/webfm_send/355
[2]: http://www.degruyter.com/dg/viewarticle/j$002fmdr.2012.41.issue-2$002fmir-2012-0008$002fmir-2012-0008.xml

Research conducted at the Institute for Communications Technology at Technische Universität Braunschweig [3] gives us some guidelines as to the limits of data storage on microform media. Using just the green channel for data encoding with a 9 µm dot size, bit-error rates of 10^-4 can be achieved. That's pretty high, but with Reed-Solomon coding, such as is used in QR codes (and lots of other places), we can sacrifice some of our capacity for robust error correction. I would also advocate printing metadata and additional error-correction data in the red and blue channels of the film card, respectively. This metadata would be analog and human-readable, to ease the burden on our maintenance engineers (or a 22nd-century archivist who discovers our 100TB cache).

[3]: http://www.imaging.org/ist/publications/reporter/articles/REP25_3_ARCH2010_VOGES.pdf
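
For a sense of how comfortable that error budget really is, here's a quick back-of-envelope sketch in Python. This is my own sanity check, not something from the Braunschweig paper, and it assumes a textbook RS(255,223) code (32 parity bytes per 255-byte codeword, correcting up to 16 bad bytes); the 20% budget I use below is even more generous.

```python
import math

# Sanity check: how does a raw bit-error rate of 1e-4 look against a plain
# RS(255,223) code? (My assumed code: corrects up to 16 bad bytes out of
# every 255, at ~12.5% parity overhead.)
BER = 1e-4
N, K, T = 255, 223, 16

# Chance that at least one of a byte's 8 bits is flipped.
symbol_error_rate = 1 - (1 - BER) ** 8

# Expected corrupted bytes per 255-byte codeword (~0.2).
lam = N * symbol_error_rate
print(f"expected bad bytes per codeword: {lam:.2f}")

# Poisson-approximate the chance of more than T bad bytes, i.e. an
# uncorrectable codeword. Summing the tail directly avoids cancellation.
p_fail = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(T + 1, T + 60))
print(f"P(uncorrectable codeword) ~ {p_fail:.1e}")

print(f"parity overhead: {(N - K) / N:.1%}  (vs the 20% I budget below)")
```

In other words, the raw bit-error rate is nowhere near the limiting factor; a generous parity budget is really there to survive gross physical damage to cards.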

How much physical space do we need to store 100TB on microfiche? I'll round down and up against my favor, where appropriate. We use the recommended 10 µm dot pitch (~2500 dpi), sacrifice a 5mm margin around the edges of each card, and give up 20% of our raw capacity to error correction. A single card can then hold: 95mm × 138mm ≈ 13K mm², at 8Kbit per mm² after error correction, or roughly 105Mbit (~13MB) per card.
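
Spelled out as a quick Python calculation, using the constants from the paragraph above:

```python
# Usable capacity of one A6 card: 10 um dot pitch, 5 mm margins all around,
# one data bit per dot in the green channel, 20% of raw bits given to EC.
CARD_W_MM, CARD_H_MM = 105, 148
MARGIN_MM = 5
DOT_PITCH_MM = 0.010                 # 10 um
EC_OVERHEAD = 0.20

usable_area_mm2 = (CARD_W_MM - 2 * MARGIN_MM) * (CARD_H_MM - 2 * MARGIN_MM)
raw_bits_per_mm2 = (1 / DOT_PITCH_MM) ** 2          # 10,000 dots per mm^2
usable_bits = usable_area_mm2 * raw_bits_per_mm2 * (1 - EC_OVERHEAD)

print(f"usable area: {usable_area_mm2:,} mm^2")                 # 13,110 mm^2
print(f"per card   : {usable_bits / 1e6:.0f} Mbit "
      f"(~{usable_bits / 8e6:.0f} MB)")                         # ~105 Mbit, ~13 MB
```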

A stack of 1,000 jacketed cards, 13GB, works out to roughly 105mm × 148mm × 300mm (call it 0.3mm of shelf height per card, with jackets and handling slack), which gives 13GB per ~4.7M mm³, or about 2.8KB per cubic millimeter. So at that information density we need roughly 36 billion mm³ to store 100TB. This works out to a cube roughly 3.3 meters (just under 11') on each side. It won't be the most "online" of data sources, but it will easily last 100 years and is more durable than any electro-mechanical device.
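
And the size of the whole cache, assuming ~13MB per card and the 0.3mm of shelf height per jacketed card implied by the 1,000-cards-per-300mm stack:

```python
# Total card count, volume, and the edge of the resulting cube.
TOTAL_BYTES = 100e12
BYTES_PER_CARD = 13e6
CARD_FOOTPRINT_MM2 = 105 * 148
MM_PER_CARD = 0.3                    # shelf height per jacketed card (assumed)

cards = TOTAL_BYTES / BYTES_PER_CARD
volume_mm3 = cards * CARD_FOOTPRINT_MM2 * MM_PER_CARD
edge_m = volume_mm3 ** (1 / 3) / 1000

print(f"cards : {cards / 1e6:.1f} million")                     # ~7.7 million
print(f"volume: {volume_mm3 / 1e9:.0f} billion mm^3 "
      f"= {volume_mm3 / 1e9:.0f} m^3")                          # ~36 m^3
print(f"cube  : {edge_m:.2f} m on a side (~{edge_m * 3.281:.1f} ft)")
```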

Of course, to make this cache available on demand we could add a front end; let's call it the "availability subsystem". Probably a fast array of disks, with triple parity if we can get it: a 100TB array takes a long time to rebuild, and a double read error during a rebuild would be catastrophic for RAID6. I'd probably use ZFS, but there are a few options here. Behind the disks I'd put a tape backup on StorageTek T10000C-class cartridges (5TB per tape); twenty tapes would hold the entirety of it, plus a few more for indexes and the like. I'd also include 4× spares of disks, tapes, and anything else mechanical or electrical. That should last long enough for new technology to replace components of this system.
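
Putting rough numbers on the tape tier and on a worst-case full restore (the per-cartridge capacity and drive throughput below are my ballpark assumptions for a T10000C-class setup, not vendor specs):

```python
import math

TOTAL_TB = 100
TB_PER_TAPE = 5                      # assumed native capacity per cartridge
DRIVE_MB_S = 250                     # assumed sustained streaming rate
SPARE_FACTOR = 4                     # the 4x rule for anything mechanical

data_tapes = math.ceil(TOTAL_TB / TB_PER_TAPE)
restore_days = TOTAL_TB * 1e6 / DRIVE_MB_S / 3600 / 24   # one drive, no overhead

print(f"data cartridges : {data_tapes} "
      f"(stock {data_tapes * SPARE_FACTOR} counting spares)")
print(f"full restore    : ~{restore_days:.1f} days on a single drive")
```

Which is the point of the disk array: it is what actually makes the cache feel available, while the tape tier exists so the fiche almost never has to be read back in bulk.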

Finally, mechanical, electrical and system maintenance. There are published standards for this, and the owner's manuals for the disk array and tape systems cover most of it. But the microfiche hardcopy archive should survive a failure of the availability subsystem. At some point in the next 20 years those systems will need to be replaced or availability will suffer. On an ongoing basis, a small section of the microfiche cache is scanned and compared against the online array, both to catch data errors and to verify that the microfiche itself remains stable. A stack of 5K cards every week would cover the entire cache roughly every 30 years and should be easily doable by maintenance staff.
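
A quick check on that rotation, using the same ~13MB-per-card figure as above:

```python
# How long does a 5,000-cards-per-week scan take to cover the whole cache?
TOTAL_CARDS = 100e12 / 13e6          # ~7.7 million cards
CARDS_PER_WEEK = 5_000

weeks = TOTAL_CARDS / CARDS_PER_WEEK
print(f"full verification pass: ~{weeks:,.0f} weeks "
      f"(~{weeks / 52:.0f} years)")                  # ~1,538 weeks, ~30 years
```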

Let's not stop there. I would actually advocate that the base physical unit for our Terabyte-scale microfiche storage system is the ISO-standard shipping container. The microfiche block comes to roughly 36 m³ of film, which is more than the ~33 m³ interior of a 20' box (and its roughly 2.4m ceiling rules out a literal 3.3m cube anyway), so I'd spec a 40' container: rack the film along its length and there is still room for insulation, a rack to hold our disk array, tape drives, scanners to read the microfiche, many many extra disks/tapes/drives/systems/cabling, and plenty of documentation. By arranging the data in the microfiche block so that there is sufficient error correction distributed throughout, we should achieve additional robustness against side or corner damage, heat or water intrusion, pests, and fungus. And by choosing the standard shipping container we can leverage the existing infrastructure for moving and storing these units, so that we could geographically distribute multiple copies of our archive.
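
A rough fit-check of the film volume against container interiors. The internal dimensions below are the commonly published figures for standard dry boxes, and I'm ignoring racking overhead:

```python
# Does one 100TB copy fit in a single ISO box?
FICHE_M3 = 36                                   # film volume from the cube math
CONTAINERS = {
    "20 ft": (5.90, 2.35, 2.39),                # interior L x W x H, metres
    "40 ft": (12.03, 2.35, 2.39),
}

for name, (length, width, height) in CONTAINERS.items():
    interior = length * width * height
    verdict = "fits" if interior > FICHE_M3 else "too small"
    print(f"{name}: {interior:.1f} m^3 interior vs {FICHE_M3} m^3 of film -> {verdict}")
```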

It's assumed that anything worth storing for 100 years is something that someone might want to destroy as well, and the physical design shouldn't have to count on institutional protection. So we make copies. Twenty-five 100TB shipping containers, distributed in dry, stable climates around the globe, would help ensure that our cache survives political instability, natural disasters, and man-made accidents (it's even conceivable that we could put one in orbit... just in case). Geographical redundancy is probably the only real protection against humanity's worst inclinations.
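
The arithmetic behind "so we make copies" is brutally simple. Purely for illustration, assume a 10% chance that any single container is lost over the century and that the sites fail independently, which is exactly what the geographic and political spread is meant to buy us:

```python
# Probability that every copy is lost, if sites fail independently.
P_SITE_LOSS = 0.10                   # assumed per-site chance of loss per century
for copies in (1, 3, 25):
    print(f"{copies:2d} copies -> P(total loss) = {P_SITE_LOSS ** copies:.0e}")
```

Correlated failures (a global war, the collapse of the institution minding the boxes) are what break that independence assumption, which is why the paragraphs that follow matter more than this one.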

And this gets back to the real issue. What I've described above is a technical solution to a problem that isn't necessarily technical. 100 years is more than a human lifetime (for now), and any person who works on such a system will not live long enough to shepherd it through that entire time. Therefore the care and protection of such a cache needs to be established at a societal level. Institutions for the preservation of knowledge are some of our most durable and cherished endeavors, and for good reason. It is those institutions and their action (or inaction!) that would best serve to protect our data for 100 years.

By 2020 it's likely that we'll have single 100TB drives. At that point, making the data persist for the remaining 92 years largely comes down to making sure it is actively error-corrected, replicated, and verified on a regular basis, in the same way that it wouldn't be difficult to maintain a 1TB backup today, given a continuous process backed by a reliable institution. It's easy to imagine that transferring 100TB from an aging format to a fresher one in 2030 will be nearly instantaneous, routine, even boring. Eventually we'll be able to throw around our 100TB archive with the same ease that we can copy an MP3 today. But the key question will remain: who will be maintaining this 100TB cache on our behalf, and why?

In other words, it's all about our institutions and processes. The technology just makes it possible, and over time easier.