Harvard Writ Large

I've been saying for a long time that the two big families of OSes, Windows and the Unix derivatives, aren't well-architected for updating. That's not to say that there aren't good solutions for both. In fact, current versions of Windows and MacOS both ship excellent vendor-supplied updating applications, and many Linux distributions do as well. What I mean is that the mechanism for updating a system is largely brute-force: scan the system libraries and executables for out-of-date components, update existing libraries or install new ones, change symlinks or registry entries, maybe reboot, and hope for the best. The success of this system has more to do with good hygiene on the part of the OS vendors than with the elegance of the mechanisms.
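
To caricature that flow in a few lines of Python (everything here is hypothetical: the manifest, the version check, and the paths; it's a sketch of the pattern, not any real updater):

    import shutil
    from pathlib import Path

    # Hypothetical manifest: component name -> (wanted version, staged replacement file).
    MANIFEST = {
        "libexample.so.2": ("2.4.1", Path("/tmp/staging/libexample.so.2")),
    }

    def installed_version(path: Path) -> str:
        # Stand-in for however a real updater sniffs a component's version.
        return path.read_text().strip() if path.exists() else "0.0.0"

    def brute_force_update(system_dir: Path) -> None:
        """Scan, overwrite in place, and hope for the best."""
        for name, (wanted, replacement) in MANIFEST.items():
            target = system_dir / name
            if installed_version(target) != wanted:
                shutil.copy2(replacement, target)  # rewrite the live component in place
        # ...then fix up symlinks or registry entries, maybe reboot, and hope.

The point isn't the details; it's that the live system and the update share the same writable storage, so a botched step can leave you with neither a clean old system nor a clean new one.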

I have some ideas and hunches about what a better system would entail, but I hesitate to put a stake in the ground regarding them, as I think most would require rearchitecting core parts of many OSes, such as linkers, binary file formats, and even the native process runtime. I haven't thought it through enough, though, so that crazy talk will have to wait for another day. What I can say is that a core principle of being able to cleanly update a system is the separation of code and data.

Unix got this right early on with the concept of a user's /home directory. Microsoft attempted this with the C:\My Documents folder in Windows 95 OSR2, but they failed to provide strong enough guidance (never mind enforcement) for developers, and it took several years before best practices for storing per-user settings and data were widespread. It's only in the last several years that updating an OS has stopped being something that fills users with dread. It's easy to forget just how perilous a system upgrade used to be, and experienced IT professionals still take precautions before upgrades, just in case the updater isn't as smart as the vendor thinks it is.
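
The practice that eventually won is easy to sketch (in Python here, with a hypothetical application name): the program's own files stay wherever the OS installed them, and anything the user configures or creates goes under the user's own directory.

    import json
    from pathlib import Path

    APP_NAME = "exampleapp"  # hypothetical application name

    def settings_path() -> Path:
        # Per-user data belongs under the user's home directory,
        # never alongside the application's executables.
        return Path.home() / f".{APP_NAME}" / "settings.json"

    def save_settings(settings: dict) -> None:
        path = settings_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(settings, indent=2))

    save_settings({"theme": "dark"})  # lands in ~/.exampleapp/settings.json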

In a sense, the ultimate systems for cut-and-dried OS updates were the old OS-in-ROM personal computers of the late '70s and early '80s. Most of the OS was read-only, etched in silicon, and all of your data was on removable storage. Not all of those systems had upgradable ROMs, but for those that did, upgrading didn't mean a complex upgrade application writing and rewriting information on the same media that held your precious data. There was no confusion about which was which; your data and applications were over here, the code that ran the system was over there. They were physically two different things.

Of course, that didn't last. The industry's split into separate OS and hardware vendors, users' demands for faster and more frequent OS upgrades, and the technical problem of ROM never keeping pace with RAM in performance or capacity all added up to combining the OS, the applications, and the user data on a single storage medium. And the conceptual confusion of the time meant that the lines between the three got very blurred. But I see a couple of developments on the horizon that mark a return to this idea of the separation of code and data.

The first is the rumor that Apple will ship MacOS X on a separate drive in a future MacBook design. It's just a rumor, but the idea has merit. Apple has done more than Microsoft and the various Linux vendors to maintain the hygiene of its OS and keep applications, user data, and MacOS X itself cleanly separated, and it is in a better position than the others to take the next step.

The rationale for this feature is that putting the OS on a separate solid-state drive (SSD) would let the system boot and generally perform better without requiring the entire system's storage to be a larger, more expensive SSD. But moving the OS to separate storage would also allow Apple to better guarantee the integrity of the OS, make the upgrade process more predictable for both Apple and its users, and allow for a safer factory-reset feature. It would take much less engineering effort to build an updating application for a system where the OS is guaranteed to be one of a few "official" versions.

If this rumor turns out to be true and the strategy succeeds, it's possible to imagine future point-release versions of the Macintosh OS coming from Apple on something like a microSD card for end-user swapping. This would be the 21st-century version of an upgrade ROM. Technologists who are used to modifying products beyond their stock configurations may be uneasy about this idea, but I believe consumers will be comfortable with it, especially if it results in greater convenience and reliability.

The second development is the trend toward applications in the "cloud". It's unclear how much code and data are separated inside GMail's or flickr's systems, but the conceptual line exists pretty clearly for users: the "OS" is the local system plus the browser, the application is the site and the services it provides, and the data is whatever has been uploaded and shared (or processed, or sorted, etc.). A user has every reason to expect that any of these can be seamlessly upgraded without affecting the operation of the others. And for the most part, that's true! I think this separation will continue and become even more defined. And from the system-integration perspective, I think advances in the state of the art for global-scale applications will continue to be split between the disciplines of data processing and data storage (in terms of Google's technology portfolio, MapReduce and GoogleFS, respectively). The ongoing challenge for architects at this scale is to keep track of the innovations on both sides and to know how to integrate them in the right way, for the right price.
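
As a toy illustration of that split, here's a word count in the MapReduce style, sketched in plain Python (no real MapReduce framework or GoogleFS involved): the map and reduce functions are the code, the records they chew through are the data, and neither function knows or cares where those records physically live.

    from collections import defaultdict
    from typing import Iterable, Iterator, Tuple

    # The "code" side: small pure functions shipped to wherever the data lives.
    def map_words(line: str) -> Iterator[Tuple[str, int]]:
        for word in line.split():
            yield word.lower(), 1

    def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
        return word, sum(counts)

    # The "data" side: an iterable of records handed over by the storage layer
    # (here just an in-memory list standing in for a distributed file system).
    def word_count(lines: Iterable[str]) -> dict:
        groups = defaultdict(list)
        for line in lines:
            for word, count in map_words(line):
                groups[word].append(count)
        return dict(reduce_counts(w, c) for w, c in groups.items())

    print(word_count(["the cat sat", "the cat napped"]))
    # {'the': 2, 'cat': 2, 'sat': 1, 'napped': 1}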

I call this trend "Harvard Writ Large", named for the difference between the Harvard and von Neumann computing architectures. The Harvard architecture gets its name from the "Harvard Mark I", an early computer that used entirely different hardware for program code and for the data it was processing. In contrast, machines in the von Neumann tradition store code and data in the same memory and make no distinction between them; program code is just another kind of data to process. Modern CPUs generally aren't entirely one architecture or the other, but blend the two approaches in different parts of the chip. The Harvard architecture tends to influence design at the lower-cost end of the CPU market, while von Neumann is stronger in higher-cost, more general-purpose CPUs.
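
A tiny Python example makes the von Neumann view concrete: a function's compiled body is just a sequence of bytes that other code can inspect like any other data.

    def greet(name: str) -> str:
        return "hello, " + name

    # Code as data: the function's compiled body is a plain bytes object
    # that we can measure and slice like anything else in memory.
    body = greet.__code__.co_code
    print(type(body), len(body), body[:8])

    # A Harvard-style design would instead keep those instruction bytes in a
    # separate (often read-only) store, out of reach of ordinary data reads.
    print(greet("world"))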

But I propose that as our focus shifts from CPUs to stand-alone systems and further up to global-scale computing, the Harvard architecture tends to dominate. For computer scientists the abstraction of code-as-data is essential, but at the end of the day what most people really care about is the data. Furthermore, data now dwarfs code in its storage requirements. Long gone is the day when your applications took up more hard drive space than your documents. It's now the opposite, and the gap between them keeps growing. If anything, many software vendors have embraced a "smaller is better" aesthetic (thankfully!).

I predict that this trend will continue and that we'll begin to see the distinction between code storage and data storage made clearer, from system-architecture wonks all the way to consumer-level marketing. Code storage will become more permanent, more protected, and faster. Data storage will remain fungible, less protected (but better backed up!), more portable, and generally slower. How will the market respond to this clarification between the two? Cloud-based storage services proliferating? More emphasis on privacy concerns? Will innovations in storage split in the market, with speed and capacity advances applied separately to code and data storage? System architecture change is afoot!