A Guide To Storage for ADD Types


If you read my bio on the About page, you can see that I called myself an ‘architect’. I made an explicit promise to apologize for that in a dedicated post at some point. According to a 2003 article by Martin Fowler, I may never be able to work at ThoughtWorks if this IBM thing does not work out, unless Dave Rice has changed his mind in the ensuing 10 years and is not in a grumpy mood any more. Now, if you check my tag line, it says that I care 30% about design, 70% about code, and there are no percentages left for caring how things are stored. As far as I am concerned, they can be dumped in the attic, next to a painting of me getting younger.

Nevertheless, as an architect I should care about storage, and that means going beyond putting obligatory cylinder-shaped objects on my slides. How did I ever get this far without having to worry about it? Through the genius of specialization. You see, when you are participating in the development of large, multi-component Web applications that customers install on premise, through the sheer economy of scale it makes sense that I worry about UIs and somebody else worries about the storage gore. You get to interact with nice objects that know how to persist themselves. You are not lazy, you are efficient. But the world is rapidly changing, going towards hosted deployment, and from big monolithic web applications to collections of apps running inside the warm bosom of CloudFoundry, Heroku and other cloud platforms. In this architecture, we all need to be polymaths and fret about all aspects of modern app development, soup to nuts.

On the cloud, storage is considered ‘implementation detail’, something you are directed at via attached resources. Adam Wiggins has channeled his inner L. Ron Hubbard by establishing his Church of 12 Factors (kudos for pulling it off without one reference to space ships), and using storage as an attached resource is one of the 12 commandments. What does all this mean? It means that in this particular game of specialization chicken, I lost and have to deeply care about storage for the first time. I don’t need to install and manage databases (cloud platforms have a buffet to choose from), but nobody is going to make that network request for me any more.
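
To make ‘attached resource’ a bit more concrete, here is a minimal sketch of an app that finds its storage through the environment instead of hard-coding it. The variable name `DATABASE_URL` and the helper `storageUrl` are assumptions for illustration; each platform decides what it actually injects.

```javascript
// Resolve the storage location from the environment so that the same code
// runs unchanged against a local database or a platform-provided one.
// DATABASE_URL is an assumed variable name, not a guarantee of any platform.
function storageUrl(env) {
  return env.DATABASE_URL || 'mongodb://localhost/test';
}

// In a real app you would pass process.env.
console.log(storageUrl(process.env));
```

Swapping one database instance for another then becomes a configuration change, not a code change.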

And would you look at that? Just when I needed to care about databases, somebody threw a big friggin’ rock into the sleepy storage pond. For many years, you could use any particular DB type, as long as it was relational, where people like to shout in their SQL commands. Then Johan Oskarsson wanted to organize an event to discuss open source distributed databases, and used a #nosql hashtag. The rest is, as they say, history, and we now have to consider the weird and wonderful world of schemaless storage as well. Particularly if we care about clustering, because everybody knows that “Mongo DB is Web Scale” (you will need to Google this yourself, those foul-mouthed bears are NSFW).

His conflicted thoughts on architects notwithstanding, I found Martin Fowler a great resource in my deranged binge of DB knowledge drinking. His introduction to NoSQL databases is great, and so is the 1 hour video from the GOTO Aarhus conference linked from that page. But my general feeling of catching up with the world was akin to coming to a party 6 hours late, where most of the food is gone, some guests are not feeling well (to put it mildly) and through their personal experience I know which cocktails not to drink and why Vodka and Red Bull are not such great bedfellows.

Here is a sample of what I have learned, in no particular order. This is a big topic so consider this part #1 of a longer article. If you are like me and you like to spend your time on the ‘outside in’ aspect of your apps, this may help you pick your storage poison and be done with it. If you are an expert in DBs, feel free to point at me and laugh in contempt:

  1. Objects in the relational view mirror are harder than they appear. Objects are round, tables are square, and I don’t need to work hard to turn that into a tired phrase, do I? ORM can never be perfect, so there is no use arguing that your approximation is 2% closer to the asymptote than mine. Throwing more resources at it can turn out so badly that Ted Neward declared it the Vietnam of Computer Science in 2004. To the phrase ‘all is fair in love and war’ we can add ‘and also in ORM’ and be done with it. One popular solution is to do shallow mapping of your key properties to columns to allow for fast queries and stuff the rest of the aggregate data hierarchy into a serialized LOB. If you really, really need to query the data in the LOB later, you can build an inverted index like folks at FriendFeed did and blogged about in 2009.
  2. Don’t fake schemaless storage by using EAV (entity-attribute-value) tables. Bill Karwin was on a mission at some point to convince everybody who cared that they sucked. I can only add that if you truly need that kind of open-ended freedom, RDF triple stores seem to be a better fit. And you get to use SPARQL query language, which always makes me laugh because it reminds me of Mr. Sparkle from a Simpsons episode.
  3. Do store trees of nodes in tables. The curse of EAVs is limited to aggregate data, not to uniform trees of nodes. It is OK if some rows in well-designed tables have additional columns for capturing parent-child relationships. Joe Celko seems to be a go-to authority on adjacency lists so it is wise to start there, then fan out as needed.
  4. We are moving away from databases as integration platforms. The DBAs’ rise to the level of demigods was facilitated by the fact that a well-designed RDB can serve as an integration platform for many applications written to slice and dice the data in new and unexpected ways. Today, services are the portals where we go for our data, and as long as interchange formats are stable, databases can be relegated to the implementation detail. Both RDF and JSON-LD are designed to give you a level of ‘unexpected data utility’ of the web of linked data that no single database can give you.
  5. RDBs are hard to scale. The whole movement towards NoSQL databases was inspired by their affinity to data partitioning (sharding for capacity scaling and/or replication for fault tolerance). On a cloud platform, not only can your app be clustered as demand grows, but so can an array of NoSQL DB nodes if you start producing a ton of data.
  6. NoSQL databases have no JOINs, and have dropped ACID. According to the CAP theorem, a distributed system cannot guarantee all three of: consistency, availability and partition tolerance. This leads to a system where your app is eventually consistent, but can occasionally have brief intervals of stupidity. There is anecdotal evidence of NoSQL databases losing data in real-world deployments. It is disconcerting to view your database as an airline that can lose your luggage every once in a while. Now you may need a second database as a backup in case your first database partied too hard and cannot quite remember where all your JSON documents went.
  7. NoSQL databases are attractive to all-JS outfits. The flip side – imagine a system where you write your client side using a collection of JavaScript toolkits, use XHR to send objects as JSON to the server, which is itself written using Node.js. After preparing your data you send it to MongoDB, as JSON again, where it will be stored as a JSON document. You can then use JavaScript to apply map/reduce and fetch interesting projections on the data, or query on JSON properties, or simply fetch the entire JSON document back when you need all the data. Apart from clustering, NoSQL DBs are just a nicer environment for app developers, particularly the recently weaned ones that have no attention span for strongly typed languages.
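
The shallow mapping from point 1 is easier to see than to describe, so here is a rough sketch. The `user` object, the choice of `id` and `email` as the queryable columns, and the helper names `toRow` and `fromRow` are all made up for illustration; a real schema would pick whatever properties its queries filter on.

```javascript
// Promote only the properties we expect to query on to real columns and
// serialize the rest of the aggregate into a single text/LOB column.
function toRow(user) {
  const { id, email, ...rest } = user;
  return { id, email, body: JSON.stringify(rest) };
}

// Rebuild the full aggregate from the columns plus the serialized remainder.
function fromRow(row) {
  return { id: row.id, email: row.email, ...JSON.parse(row.body) };
}

// Hypothetical aggregate with a nested hierarchy.
const user = {
  id: 7,
  email: 'a@example.com',
  profile: { city: 'Toronto', tags: ['ui'] }
};

const row = toRow(user);
console.log(row.body);     // the part the database cannot query directly
console.log(fromRow(row)); // round-trips to the original shape
```

The price is that anything inside `body` is opaque to SQL, which is exactly where the inverted-index trick comes in.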

This can give you a flavor of where my head is right now. I am just trying to be practical about it. I know that there is no free lunch, but ‘kids these days’ know how to make things new and shiny. I know an adult in me should still prefer RDBs for their battle-hardened dependability and predictability, but look at the Mongoose JS library and tell me you are not swayed by the simplicity (and nobody is yelling at you):

    var mongoose = require('mongoose');
    mongoose.connect('mongodb://localhost/test');

    var Cat = mongoose.model('Cat', { name: String });
    var kitty = new Cat({ name: 'Zildjian' });
    kitty.save(function (err) {
       if (err) {
          return console.error(err); // save failed, so do not meow
       }
       console.log('meow');
    });

This is as close as I have come to replicating that ‘objects that know how to persist themselves’ feeling of yore, minus the quagmire of ORM. It is hard to resist the siren call of NoSQL. But I think the biggest lesson of my foray into the wonderful world of storage is that today you can change your mind about it. As long as you truly hide your storage choices as implementation detail, you can switch later, so stop agonizing, pick one and move on to the more interesting areas, like user experience.
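
One way to read ‘hide your storage choices’ is to let the app talk to a tiny contract and nothing else. The sketch below is an assumption-laden illustration, not a recipe: `createMemoryCatStore`, `adopt`, and the two-method contract (`save`, `findByName`) are hypothetical names, and the in-memory `Map` stands in for whichever database you end up picking.

```javascript
// Hypothetical storage contract: save(cat) and findByName(name), both
// returning promises. This implementation keeps everything in memory.
function createMemoryCatStore() {
  const cats = new Map();
  return {
    save(cat) {
      cats.set(cat.name, { ...cat });
      return Promise.resolve();
    },
    findByName(name) {
      return Promise.resolve(cats.get(name) || null);
    }
  };
}

// Application code depends only on the contract, never on the database.
async function adopt(store, name) {
  await store.save({ name });
  return store.findByName(name);
}

adopt(createMemoryCatStore(), 'Zildjian').then(function (cat) {
  console.log(cat.name);
});
```

A store backed by MongoDB or a relational database would expose the same two methods, and `adopt` would not need to change.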

© Dejan Glozic, 2013
