Node.js Apps and Periodic Tasks

When working on a distributed system of any size, sooner or later you will hit a problem and proclaim ‘well, this is a first’. My second proclamation in such situations is ‘this is a nice topic for the blog’. True to form, I do it again, this time with the issue of running periodic tasks, and the twist that clustering and high availability efforts add to the mix.

First, to frame the problem: a primary pattern you will surely encounter in a Web application is Request/Response. It is a road well traveled. Any ‘Hello, World’ web app is waving you a hello in a response to your request.

Now add clustering to the mix. You want to ensure that no matter what is happening to the underlying hardware, or how many people hunger for your ‘hello’, you will be able to deliver. You add more instances of your app, and they ‘divide and conquer’ the incoming requests. No cry for a warm reply is left unanswered.

Then you decide that you want to tell a more complex message to the world because that’s the kind of person you are: complex and multifaceted. You don’t want to be reduced to a boring slogan. You store a long and growing list of replies in a database. Because you are busy and have no time for standing up databases, you use one hosted by somebody else, already set up for high availability. Then each of your clustered nodes talks to the same database. You set the ‘message of the day’ marker, and every node fetches it. Thousands of people receive the same message.

Because we are writing our system in Node.js, there are several ways to do this, and I have already written about it. Of course, a real system is not an exercise in measuring HWPS (Hello World Per Second). We want to perform complex tasks, serve a multitude of pages, provide APIs and be flexible and enable parallel development by multiple teams. We use micro-services to do all this, and life is good.

I have also written about the need to use messaging in a micro-service system to bring down the inter-service chatter. When we added clustering into the mix, we discovered that we need to pay special attention to ensure task dispatching similar to what Node.js clustering or proxy load balancing is providing us. We found our solution in round-robin dispatching provided by worker queues.

Timers are something else

Then we hit timers. As long as information flow in a distributed system is driven by user events, clustering works well because dispatching policies (most often round-robin) are implemented by both the Node.js clustering and proxy load balancer. However, there is a distinct class of tasks in a distributed system that is not user-driven: periodic tasks.

Periodic tasks are tasks that are done on a timer, outside of any external stimulus. There are many reasons why you would want to do it, but most periodic tasks service databases. In a FIFO of a limited size, they delete old entries, collapse duplicates, extract data for analysis, report them to other services etc.

For periodic tasks, there are two key problems to solve:

  1. Something needs to count the time and initiate triggers
  2. Tasks need to be written to execute when initiated by these triggers

The simplest way to trigger the tasks is known by every Unix admin – cron. You set up a somewhat quirky cron table, and tasks are executed according to the schedule.

The actual job to execute needs to be provided as a command line task, which means your app that normally accesses the database needs to provide an additional CLI entry point sharing most of the code. This is important in order to keep with factor XII of the 12-factors, which insists that one-off tasks share the same code and config as the long running processes.

 

There are two problems with cron in the context of the cloud:

  1. If the machine running cron jobs malfunctions, all the periodic tasks will stop
  2. If you are running your system on a PaaS, you don’t have access to the OS in order to set up cron

The first problem is not a huge issue since these jobs run only periodically and normally provide online status when done – it is relatively easy for an admin to notice when they stop. For high availability and failover, Google has experimented with a tool called rcron for setting up cron over a cluster of machines.

Node cron

The second problem is more serious – in a PaaS, you will need to rely on a solution that involves your apps. This means we will need to set up a small app just to run an alternative to cron that is PaaS friendly. As usual, there are several options, but the node-cron library seems fairly popular and has graduated past version 1.0. If you run it in an app backed by supervisor or PM2, it will keep running and executing tasks.

You can execute tasks in the same app where node-cron is running, provided these tasks have enough async calls themselves to allow the event loop to execute other callbacks in the app. However, if the tasks are CPU intensive, they will block the event loop and should be extracted out.

A good way of solving this problem would be to hook up the app running node-cron to a message broker such as RabbitMQ (which we already use for other MQ needs in our micro-service system anyway). The only thing the node-cron app will do is publish task requests to the predefined topics. The workers listening to these topics should do the actual work.

The problem with this approach is that a new task request can arrive while a worker has not finished running the previous task. Care should be taken to avoid workers stepping over each other.
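To make the overlap problem concrete, here is a minimal sketch of a worker that skips a task request arriving while its previous run is still in flight. It is an illustration under loud assumptions, not how our system is wired: an in-process EventEmitter stands in for RabbitMQ, and the names createWorker, publishTaskRequest and the topic name are made up for the example.

```javascript
// Sketch only: an in-process EventEmitter stands in for a real message broker
// such as RabbitMQ; the function and topic names are hypothetical.
const { EventEmitter } = require('events');

const broker = new EventEmitter();

// Wraps an async task so that a request arriving while the previous run is
// still in flight is skipped rather than started in parallel.
function createWorker(task) {
  let running = false;
  const stats = { started: 0, skipped: 0 };

  async function onRequest(payload) {
    if (running) {
      stats.skipped += 1;
      return;
    }
    running = true;
    stats.started += 1;
    try {
      await task(payload);
    } finally {
      running = false; // cleared even if the task throws
    }
  }

  return { onRequest, stats };
}

// The 'clock' side: all it does is publish a task request to a topic.
function publishTaskRequest(topic) {
  broker.emit(topic, { requestedAt: Date.now() });
}
```

A worker would subscribe with broker.on('tasks.prune', worker.onRequest), and the clock app would call publishTaskRequest('tasks.prune') on its schedule. Note that the running flag only protects a single worker process; several workers consuming the same queue would still need some kind of shared lock.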

Interestingly enough, a hint at this approach can be found in the aforementioned 12-factors, in the section on concurrency. You will notice a ‘clock’ app in the picture, indicating an app whose job is to ‘poke’ other apps at periodic intervals.

There can be only one

A ‘headless’ version of this approach can be achieved by running multiple apps in a cluster and letting them individually keep track of periodic tasks by calling ‘setTimeout’. Since these apps share nothing, they will run according to the local server clock that may or may not be in sync with other servers. All the apps may attempt to execute the same task (since they are clones of each other). In order to prevent duplication, each app should attempt to write a ‘lock’ record in the database before starting. To avoid deadlock, apps should wait a random amount of time before retrying.

Obviously, if the lock is already there, apps should fail to create their own. Therefore, only one app will win in securing the lock before executing the task. However, the lock should be set to expire after a small multiple of the time normally required to finish the task, in order to avoid orphaned locks due to crashed workers. If the worker has not crashed but is just taking longer than usual, it should renew the lock to prevent it from expiring.
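The lock rules described here (fail if a live lock exists, expire orphaned locks, let the holder renew) can be sketched in a few lines. This is only an illustration: a Map stands in for the shared database, the function and parameter names are hypothetical, and the current time is passed in so the behaviour is easy to follow.

```javascript
// Sketch only: a Map stands in for the shared database. A real store must
// make the 'create if absent or expired' step atomic (for example through a
// unique key), which this single-process version gets for free.
function createLockStore() {
  const locks = new Map(); // task name -> { owner, expiresAt }

  return {
    // Returns true if `owner` now holds the lock for `name`.
    tryAcquire(name, owner, ttlMs, now = Date.now()) {
      const current = locks.get(name);
      if (current && current.expiresAt > now) {
        return false; // somebody holds a lock that has not expired yet
      }
      locks.set(name, { owner, expiresAt: now + ttlMs });
      return true;
    },

    // Extends the lock, but only for the app that still holds it.
    renew(name, owner, ttlMs, now = Date.now()) {
      const current = locks.get(name);
      if (!current || current.owner !== owner || current.expiresAt <= now) {
        return false;
      }
      current.expiresAt = now + ttlMs;
      return true;
    },

    // Removes the lock once the task is done; only the holder may do so.
    release(name, owner) {
      const current = locks.get(name);
      if (current && current.owner === owner) {
        locks.delete(name);
        return true;
      }
      return false;
    },
  };
}
```

An app that loses the tryAcquire race simply skips this round, while the winner calls renew if the task runs long and release when it is done. If the winner crashes, the lock expires on its own and another app can pick up the next round.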

The advantage of this approach is that we will only schedule the next task once the current one has finished, avoiding the problem that the worker queue approach has.
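Here is a minimal sketch of that 'schedule the next run only when the current one is done' loop. The function name runPeriodically is hypothetical, and the schedule parameter (defaulting to setTimeout) exists only so the flow can be exercised without real timers.

```javascript
// Sketch only: `task` is any function returning a promise; `schedule`
// defaults to setTimeout and is injectable purely to make the flow testable.
function runPeriodically(task, intervalMs, schedule = setTimeout) {
  let stopped = false;

  async function tick() {
    if (stopped) return;
    try {
      await task();
    } catch (err) {
      // A failed run should not kill the loop; log and carry on.
      console.error('periodic task failed:', err);
    }
    // The next run is queued only after the current one has settled,
    // so two runs in the same app can never overlap.
    if (!stopped) schedule(tick, intervalMs);
  }

  schedule(tick, intervalMs);

  // Calling the returned function stops any further runs.
  return () => { stopped = true; };
}
```

Unlike setInterval, the gap between runs here is measured from the end of one run to the start of the next, so a slow task simply pushes the following run later instead of piling up.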

Note that in this approach, we are not achieving scalability, just high availability. Of the several running apps, at least one app will succeed in securing the lock and executing the task. The presence of other apps ensures execution but does not increase scalability.

I have conveniently omitted many details about writing and removing the lock, retries etc.

Phew…

I guarantee you that once you start dealing with periodic tasks, you will be surprised by the complexity of executing them in the cloud. A mix of cloud, clustering and high availability makes running periodic tasks a fairly non-trivial problem. Limitations of PaaS environments compound this complexity.

If you visit TJ’s tweet above, you will find dozens of people offering alternatives in the replies (most of them being variations of *ron). The plethora of different solutions will be a dead giveaway that this is a thorny problem. It is not fully solved today (at least not in the context of the cloud and micro-service systems), hence so many alternatives. If you use something that works well for you, do share in the ‘Reply’ section.

© Dejan Glozic, 2014

For Once, Being Reactive is Good

5 Gum – React

Apple said Monday that it sold more than 300,000 iPads on the first day of its launch, ushering a new era of people buying things in order to find out what they are.

 

SNL Weekend Update, season 35, episode 18

All my life, I thought ‘reaction’ was a bad word. Ever since the French Revolution, being ‘reactionary’ could get you into a lot of trouble. More recently (and less detrimental to your health and limb count), being in ‘reactionary’ mode is considered merely an anti-pattern. We were all in situations in life where we felt like we were merely reacting to changes foisted upon us, like tall grass helplessly flailing on a windy day. We all want to be the wind, not the grass.

As you could read only everywhere, including my own blog post, the Agile movement has been declared dead (although ‘agility’ is still fine and dandy, thank you). Being communal people and in need of an idea to gather around, and not liking the traditional organized religions’ early hours, we looked for a more suitable replacement.

Not that others were not trying, and even before Agile’s passing. For example, Adam Wiggins of Heroku extraction has channeled his inner L. Ron Hubbard by establishing his Church of 12 Factors (kudos for pulling it off without one reference to space ships). It is chock-full of Cloud-y goodness and is actually quite good and useful. I think Adam is now beating himself up for not waiting a bit and slapping ‘micro-services’ all over the manifesto, because that is totally what ’12 factors’ are about.

According to Adam, 12-factor apps:

  • Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
  • Have a clean contract with the underlying operating system, offering maximum portability between execution environments;
  • Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration;
  • Minimize divergence between development and production, enabling continuous deployment for maximum agility;
  • And can scale up without significant changes to tooling, architecture, or development practices.

So what went wrong? It is still a worthwhile document, and I keep revisiting it often, but it lacks the galvanizing force that true movements have. Maybe there are just too many factors (even actual religions knew to stop at 10), or maybe because it sounds too much like inspirational lifestyle articles such as 12 Lifestyle Factors That Make You Feel Depressed.

Then the Agile thing happened, and it was time to get cracking. It was not a long wait – behold The Reactive Manifesto.

Now, I make it sound like it all happened in a neat chronological order (something my buddy Adrian Rossouw would organize in a Wayfinder timeline), but it did not. The first version of the manifesto was published by Jonas Bonér and friends and described on the Typesafe blog in July 2013. It was uploaded to GitHub and the community was invited to help tweak the document. The current version (1.1) dates from September 2013 and is signed by thousands of believers (I meant ‘supporters’). In Jonas’ own words, the motivation for putting the manifesto forward was:

The primary motivation for this manifesto is to come up with a name for these new type of applications (similar to NOSQL, Big Data, SOA and REST) and to define a common vocabulary for describing them — both in terms of business values and technical concepts. Names matter and hopefully this will help bring the communities together and make it easier for users and vendors to talk about things and understand each other.

 

Up to now the usual way to describe this type of application has been to use a mix of technical and business buzzwords; asynchronous, non-blocking, real-time, highly-available, loosely coupled, scalable, fault-tolerant, concurrent, reactive, event-driven, push instead of pull, distributed, low latency, high throughput, etc. This does not help communication, quite the contrary, it greatly hinders it.

The four traits

At its core, The Reactive Manifesto puts forward four reactive traits a modern distributed system should possess (notice the small number of key tenets – that’s how it’s done). Let’s see how these qualities intersect with the modern micro-service based systems I was writing about in the last few months:

  1. Reactive to events – a modern micro-service based distributed system is asynchronous by nature, with each service sitting dormant until an event wakes it up. Lest it turns out we are talking REST only, a loosely coupled system using some kind of message broker is a better fit because it offers further decoupling. Publishing into a pub/sub topic does not require any knowledge of the possible consumers of the message in a way that A->B REST calls do. And of course, while the authors of the manifesto seem to be coming from the Scala background (with Play framework also playing a part), it is easy to notice that Node.js is an even better fit. Its asynchronous nature ‘all the way down’ ensures a non-blocking way of reacting to events.
  2. Reactive to load – a corollary of the micro-service based system is the freedom to scale out each service independently of the rest of the system. The ability to instrument the nodes and cluster hot spots while leaving the less popular services as-is is of great help in the cloud environments. Cloud resources are finite and cost money. Knowing which nodes to cluster (and more importantly, where that would be overkill) is essential to arriving at a system that is reactive to load while still keeping the monthly bill reasonable.
  3. Reactive to failure – when there are many moving parts, failure is inevitable. Successful ‘born on the net’ companies with complex distributed systems not only guard against failure, they openly embrace it (who hasn’t heard about Netflix’s Chaos Monkey?).
  4. Reactive to users – this part is a bit confusing. You would think that the four reactive traits are like ‘four pillars of heaven’ or ‘the four elements’ (minus Milla Jovovich). As it turns out, the previous three reactive properties are preconditions to the system being reactive to users – by providing real time, engaging, performant user experiences. Being reactive to events, load and failure will simply increase your chances to be reactive to users in a way that will keep them from leaving in frustration.

Reactive reactions

When you research a topic, and the second Google search hit after the actual topic is “Reactive Manifesto bulls**t”, you cannot resist clicking on the link. In it, Paul Chiusano argues that the Reactive Manifesto is not only not right, it’s not even wrong. It looks like the ultimate insult is being banished into the binary system limbo, where you are neither right nor wrong, you are, well, nothing (it’s like the Louis CK joke that because he owes $50, he needs to raise $50 just to be broke).

Of course, there are positive reactions and people who don’t really care – the usual spectrum.

Here is what I think: I am actually positive about the intent of Reactive Manifesto. First off, it redefines ‘reactive’ as something positive. Of course it didn’t invent anything new, but that is not the first time in history somebody came and put a name on something we were already doing but didn’t know it. Remember JJG and Ajax? He didn’t invent it – he just put a name on a technique that is the bedrock of any modern client side application. How about ‘micro-services’? Many people fail to see how they are different from SOA or just plain ‘services’ or ‘distributed systems’ – a lot of haters in that camp too.

But here is what my first reaction was: coming from a former communist country, my first association was with a guy whose beard is every hipster’s dream – Karl Marx. When he and his buddy Engels came forward with the Communist Manifesto, they didn’t invent alienation, oppression and capitalism’s seedy underbelly. They just articulated them succinctly and gave something to millions of disgruntled workers around the world to rally around. It didn’t turn out all that well in hindsight, but notice what I am getting at here – manifestos don’t invent new things, they put names on concepts, helping the adherents galvanize around them and form movements.

So to those who say ‘The Reactive Manifesto’ didn’t invent anything new, you are right: that was not the intention. Go read your history or Google ‘manifesto’.

The ‘What, Why, How’ trifecta

I would like to circle back to ’12 factors’ and ‘micro-services’ and claim that we now have a fairly complete set of answers to big questions we may ask while we build modern distributed systems:

  1. What are we building? A micro-service based distributed system.
  2. Why are we building it? Because we want it to be reactive to events, load, failure and users.
  3. How are we building it? Using the techniques and recommendations outlined in the 12-factors.

There you go – no need to choose one over the other – a simple ‘1 + 4 + 12’ formula that will bring you happiness and make you rich.

As for me, I am taking a leaf from the SNL book – I am going to the first Toronto Reactive Meetup. In fact, I am raising SNL by not only participating but actually speaking at it – a new era of people presenting on topics to find out what they are.

If you are in Toronto on June 24, join us – in addition to reading my musings, you can see me deliver them in full HD.

Social Micro-Services: Activity Streams 2.0

Awkward social encounters, Wikimedia Commons, Chrisrobertsantieau

If I had a dollar for every time somebody mentioned the phrase ‘social’ to me in the last couple of years, my pants would suffer since I tend to carry all my coins in my right pocket. With my soft credit card wallet. I carry my iPhone in my left pocket, where I carry cash in paper bills (paper bills will not scratch the ‘naked’ iPhone, while metal coins might). No, I don’t have Asperger’s, why do you ask?

Tapping into the opportunity to exploit user activity is now considered a norm in any system of a decent complexity. Users of a system create a social trail that can be put to good use to improve collaboration and insight.

More conveniently, adding a social dimension to a system that already provides value on its own is an order of magnitude easier. Think about it: dedicated social networks such as Facebook and Twitter vitally depend on users generating primary data by posting updates, pictures, videos and otherwise exposing themselves to the advertisers. In a system where users go to accomplish a primary task (track and plan, write code, build and deploy code, monitor running apps etc.), social events are happening as a side effect of user activity – they are not the sole value proposition of the system. All we need to do is capture and distribute these social events – no extra user effort required.

Another characteristic of value systems (in contrast to pure social systems) is that social activity is not limited to users. A part of the system can kick in, perform a task and then notify interested users about it. Before we called this ‘social’, they were ‘events’ and/or ‘notifications’. We now understand that activity of programmatic agents can easily and usefully appear on your social stream, mixed with the updates of actual users.

Social streams and micro-services

If you have followed this blog recently, you already know we now prefer to build our systems using micro-services. Extracting social streams out of such a system seems plausible for the following reasons:

  1. Overall activity of a micro-service-based system is a combination of activities of each individual service
  2. We are already using a message broker to ensure efficient and flexible message passing between micro-services
  3. Micro-services already publish messages when their state changes, mostly as a result of user actions
  4. Having a dedicated activity stream micro-service to add social dimension to the system is in itself consistent with micro-service architecture

OK, sounds like a plan. As I have already written in my post on clustering and messaging, we need a dedicated service that aggregates social activities from all the corners of your micro-service system. Our first instinct may be to tap into existing messages already flowing around, but this may turn out not to be a good idea:

  1. Messages that are published by the micro-services tend to supplement REST API. Not every CRUD event in every service is a social event worthy of appearing in activity feeds.
  2. There may not be enough data in the CRUD messages to build up a complete activity record.

For these reasons, it is a better practice to dedicate a separate messaging channel (or ‘topic’) for social activities and let micro-services choose which subset of the CRUD message traffic is social-worthy, and if so, publish a message that contains all the additional information required by the social stream.

Anatomy of an activity

What would that information be? We don’t have to guess – there is a public specification available to follow. An activity typically consists of an actor, verb and object, and optionally a target. In an activity that can be expressed as “Jane posted a picture to the album ‘Jane’s Vacation'”, we can see all four (Jane is the actor, ‘post’ is the verb, ‘picture’ is the object and ‘album’ is the target). Expressed using Activity Stream draft 2.0 JSON syntax, it could look like this:

{
   "verb": "post",
   "published": "2011-02-10T15:04:55Z",
   "language": "en",
   "actor": {
     "objectType": "person",
     "id": "urn:example:person:jane",
     "displayName": "Jane Doe",
     "url": "http://example.org/jane",
     "image": {
       "url": "http://example.org/jane/image.jpg",
       "mediaType": "image/jpeg",
       "width": 250,
       "height": 250
     }
   },
   "object" : {
     "objectType": "picture",
     "id": "urn:example:picture:abc123/xyz",
     "url": "http://example.org/pictures/2011/02/pic1",
     "displayName": "Jane jumping into water"
   },
   "target" : {
     "objectType": "album",
     "id": "urn:example:albums:abc123",
     "displayName": "Jane's Vacation",
     "url": "http://example.org/pictures/albums/janes_vacation/"
   }
}

Notice that an equivalent CRUD message produced as a result of a new picture resource being added in a Picture micro-service that manages images would follow the REST POST action that was performed to add the picture:


POST /pictures/albums/janes_vacation

In the command above, the new picture that Jane added to the album was in the HTTP request body.

As you may have noticed, CRUD messages are resource-centric, while activities are actor-centric. A Web page rendering the ‘Jane’s Vacation’ album will want to refresh to include a new picture (possibly using Web Sockets), but does not care who initiated the action. This is why it is hard to ‘synthesize’ activities out of CRUD messages – it is much better for the micro-service at the source to fire a clean, well formed activity object according to the public spec from the get go. It is virtually impossible to synthesize an activity example as shown above unless you are the service owning the data.

A vital part of firing a new activity is audience targeting. Let’s say that there is a micro-service that manages projects in a system. The project owner has decided to change the project description. Who should receive this activity on their personal social stream? There are two ways to implement this – user-centric and service-centric:

In a user-centric implementation, each user has a social graph of relationships. When an activity is performed by a node in her social graph, it should end up on her personal social stream. This approach looks very logical but is actually hard to implement if you are not Facebook or Twitter. I don’t think it is actually necessary in a system where social enhances the primary value, rather than being the value itself.

In a service-centric implementation, we assume that when an event occurs that is deemed social, the service has all the information it needs to determine the activity’s primary and secondary audience. It so happens that the activity stream specification has just such a feature. In our example with changing the project description, the service already knows all the members of the project, and all the users who are ‘subscribed’ or ‘watching’ the project somehow. Therefore, it should fire an activity like this:

{
   ....
   "to":  [{ "objectType": "person",
             "id": "johndoe"},
           { "objectType": "project",
             "id": "xxzqIHH_556X" }
   ],
   "cc":  [{ "objectType": "person",
             "id": "fredf"},
           { "objectType": "person",
             "id": "jasonj"}
   ]
}

In the example above, the activity is addressed to John Doe (the owner of the project) and the project’s dedicated activity stream, while “Fred F” and “Jason J” who are ‘watching’ the project will receive the update by the virtue of being on the “cc” list. This illustrates another powerful feature of the audience targeting – the ability to target object types other than ‘person’.

When such an activity arrives at the dedicated activity stream micro-service, it can simply store a copy in the social stream of each of the targets in the target audience. The publishing service has done all the work by identifying the audience – the activity stream service will simply honor the directive.
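Honoring the directive can be as simple as the following sketch. It is a hedged illustration, not the specification's prescribed behaviour: a Map of arrays stands in for the activity store, and the function names and the 'objectType:id' stream key are assumptions made for the example.

```javascript
// Sketch only: a Map of arrays stands in for the activity store, and the
// stream key format ('objectType:id') is an assumption for illustration.
function createActivityStreams() {
  const streams = new Map(); // stream key -> array of activities

  // Files the activity under every target named in 'to' and 'cc'.
  function deliver(activity) {
    const audience = [...(activity.to || []), ...(activity.cc || [])];
    const seen = new Set();
    for (const target of audience) {
      const key = `${target.objectType}:${target.id}`;
      if (seen.has(key)) continue; // listed in both 'to' and 'cc'
      seen.add(key);
      if (!streams.has(key)) streams.set(key, []);
      streams.get(key).push(activity);
    }
    return seen.size; // number of streams that received the activity
  }

  function streamFor(objectType, id) {
    return streams.get(`${objectType}:${id}`) || [];
  }

  return { deliver, streamFor };
}
```

Because the key includes the object type, the project's own stream and the personal streams of John, Fred and Jason are treated uniformly. The service never has to work out who cares about the project; it only files what it is handed.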

Social streams can be used to mix events from various sources. For example, system-wide alerts and broadcasts can end up on personal streams as well (things like ‘maintenance restart in 5 minutes’) for awareness and audit purposes.

Similarly, activities performed by various engines can be mixed with activities performed by actual users – the activity stream specification is flexible enough to allow actors other than persons. That’s why you can have an activity such as ‘Continuous Integration started a build #45’ as well as ‘Build #45 failed with 45 errors’.

Filtering

Finally, the activity stream specification fits micro-services like a glove when it comes to filtering. Aggregating all the system chatter produces a fire-hose activity stream that ends up ignored due to its multitude of entries. This is where the semantics of activity streams are superior to the RSS/ATOM feeds. Each micro-service can provide a definition of verbs and object types it intends to use in its activities. Since the core set of verbs and object types can easily become inadequate for a typical complex system, the definitions of extensions are vital to allow for powerful filtering based on verbs and object types, something like:

  • Hide all ‘build’ updates – filtering based on object type
  • Hide all ‘build succeeded’ updates – filtering based on object type and verb combination
  • Hide all updates like this – filtering of future occurrences of something you don’t care about

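The first two kinds of filter could be expressed roughly like this. The rule shape ({ verb, objectType }) and the function names are hypothetical, not something the activity stream specification defines.

```javascript
// Sketch only: the rule shape ({ verb, objectType }) is hypothetical.
// A rule with only one property set matches any value for the other.
function isHidden(activity, rules) {
  const objectType = activity.object && activity.object.objectType;
  return rules.some((rule) =>
    (rule.objectType === undefined || rule.objectType === objectType) &&
    (rule.verb === undefined || rule.verb === activity.verb));
}

// Returns only the activities that no 'hide' rule matches.
function visibleActivities(activities, rules) {
  return activities.filter((activity) => !isHidden(activity, rules));
}
```

A rule such as { objectType: 'build' } hides every build update, while { objectType: 'build', verb: 'succeed' } hides only the boring ones. 'Hide all updates like this' could then be implemented by deriving such a rule from the activity the user pointed at.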
Housekeeping

A micro-service system of a decent size can quickly produce a lot of activity chatter. If some of these activities target multiple users, copies per user add to an ever-growing database. Obviously we need to draw a line somewhere.

Again, social streams in a system where social data is not the primary value are less critical when it comes to data preservation. The assumption is that the primary data is safely stored, and if data changes need to be preserved for audit purposes, this audit trail is itself safely stored. Social streams are just an echo of these auditable changes, and do not need to be preserved in long term storage.

You can experiment with your activity stream micro-service storage, but keeping a week’s worth of social streams may be plenty. Alternatively, you can draw a line at a number of activities, or storage size, or a combination of all three.

Whichever method you pick, you need to run a ‘pruner’ task that deletes old activities from the database. In a distributed system based on micro-services, the 12-factors come to the rescue with a recommendation for running admin tasks as one-off processes.
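The core of such a pruner might look like the sketch below, combining the age and count limits. It works on an in-memory array purely for illustration; a real pruner would issue the equivalent delete against the database, and the option names are my own invention.

```javascript
// Sketch only: an in-memory array stands in for the activity store.
// `published` follows the ISO 8601 timestamps used in the activity examples.
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Returns the activities worth keeping: no older than maxAgeMs and at most
// maxCount of them, newest first. The input array is left untouched.
function prune(activities, { maxAgeMs = WEEK_MS, maxCount = Infinity, now = Date.now() } = {}) {
  const fresh = activities
    .filter((a) => now - Date.parse(a.published) <= maxAgeMs)
    .sort((a, b) => Date.parse(b.published) - Date.parse(a.published));
  return fresh.slice(0, maxCount);
}
```

Wrapped in a small CLI entry point, this is the kind of task that can be run as a one-off process, sharing code and config with the activity stream service itself.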

And there you have it. In a distributed system based around micro-services, there are already messages flying around. Opening up a social channel and collecting dedicated messages into a social stream is just an extra step that will help your users with the added insight into the activity of the system, and the actions of other users and agents they should know about.

In addition to the activity streams draft 2.0 spec, there is now a GitHub project with both client and server side implementation. It appears to be in its early days, but if you don’t want to write everything from scratch, a Java as well as a Node.js implementation is readily available – give it a test drive and let me know what you think.

A Guide To Storage for ADD Types

If you read my bio on the About page, you can see that I called myself an ‘architect’. I made an explicit promise to apologize for that in a dedicated post at some point. According to a 2003 article by Martin Fowler, I may never be able to work at ThoughtWorks if this IBM thing does not work out, unless Dave Rice has changed his mind in the ensuing 10 years and is not in a grumpy mood any more. Now, if you check my tag line, it says that I care 30% about design, 70% about code, and there are no percentages left for caring how things are stored. As far as I am concerned, they can be dumped in the attic, next to a painting of me getting younger.

Nevertheless, as an architect I should care about storage, and that means going beyond putting obligatory cylinder-shaped objects on my slides. How did I ever get this far without having to worry about it? Through the genius of specialization. You see, when you are participating in the development of large, multi-component Web applications that customers install on premise, through the sheer economy of scale it makes sense that I worry about UIs and somebody else worries about the storage gore. You get to interact with nice objects that know how to persist themselves. You are not lazy, you are efficient. But the world is rapidly changing, going towards the hosted deployment, and from big monolithic web applications to collections of apps running inside a warm bosom of CloudFoundry, Heroku and other cloud platforms. In this architecture, we all need to be polymaths and fret about all aspects of modern app development, soup to nuts.

On the cloud, storage is considered ‘implementation detail’, something you are directed at via attached resources. Adam Wiggins has channeled his inner L. Ron Hubbard by establishing his Church of 12 Factors (kudos for pulling it off without one reference to space ships), and using storage as an attached resource is one of the 12 commandments. What does all this mean? It means that in this particular game of specialization chicken, I lost and have to deeply care about storage for the first time. I don’t need to install and manage databases (cloud platforms have a buffet to choose from), but nobody is going to make that network request for me any more.

And would you look at that? Just when I needed to care about databases, somebody threw a big friggin’ rock into the sleepy storage pond. For many years, you could use any particular DB type, as long as it was relational, where people like to shout in their SQL commands. Then Johan Oskarsson wanted to organize an event to discuss open source distributed databases, and used a #nosql hashtag. The rest is, as they say, history, and we now have to consider the weird and wonderful world of schemaless storage as well. Particularly if we care about clustering, because everybody knows that “Mongo DB is Web Scale” (you will need to Google this yourself, those foul-mouthed bears are NSFW).

His conflicted thoughts on architects notwithstanding, I found Martin Fowler a great resource in my deranged binge of DB knowledge drinking. His introduction to NoSQL databases is great, and so is the one-hour video from the GOTO Aarhus conference linked from that page. But my general feeling of catching up with the world was akin to coming to a party 6 hours late, where most of the food is gone, some guests are not feeling well (to put it mildly) and through their personal experience I know which cocktails not to drink and why Vodka and Red Bull are not such great bedfellows.

Here is a sample of what I have learned, in no particular order. This is a big topic so consider this part #1 of a longer article. If you are like me and you like to spend your time on the ‘outside in’ aspect of your apps, this may help you pick your storage poison and be done with it. If you are an expert in DBs, feel free to point at me and laugh in contempt:

  1. Objects in the relational view mirror are harder than they appear. Objects are round, tables are square, and I don’t need to work hard to turn that into a tired phrase, do I? ORM can never be perfect, so there is no use arguing that your approximation is 2% closer to the asymptote than mine. Throwing more resources at it can turn so bad that Ted Neward declared it the Vietnam of Computer Science in 2004. To the phrase ‘all is fair in love and war’ we can add ‘and also in ORM’ and be done with it. One popular solution is to do shallow mapping of your key properties to columns to allow for fast queries and stuff the rest of the aggregate data hierarchy into a serialized LOB. If you really, really need to query the data in the LOB later, you can build an inverted index like the folks at FriendFeed did and blogged about in 2009.
  2. Don’t fake schemaless storage by using EAV (entity-attribute-value) tables. Bill Karwin was on a mission at some point to convince everybody who cared that they sucked. I can only add that if you truly need that kind of open-ended freedom, RDF triple stores seem to be a better fit. And you get to use SPARQL query language, which always makes me laugh because it reminds me of Mr. Sparkle from a Simpsons episode.
  3. Do store trees of nodes in tables. The curse of EAVs is limited to aggregate data, not to uniform trees of nodes. It is OK if some rows in well-designed tables have additional columns for capturing parent-child relationships. Joe Celko seems to be a go-to authority on adjacency lists so it is wise to start there, then fan out as needed.
  4. We are moving away from databases as integration platforms. The rise of DBAs to the level of demigods was facilitated by the fact that a well-designed RDB can serve as an integration platform for many applications written to slice and dice the data in new and unexpected ways. Today, services are the portals where we go for our data, and as long as interchange formats are stable, databases can be relegated to an implementation detail. Both RDF and JSON-LD are designed to give you a level of ‘unexpected data utility’ of the web of linked data that no single database can give you.
  5. RDBs are hard to scale. The whole movement towards NoSQL databases was inspired by their affinity for data partitioning (sharding for capacity scaling and/or replication for fault tolerance). On a cloud platform, not only can your app be clustered as demand grows, but so can an array of NoSQL DB nodes if you start producing a ton of data.
  6. NoSQL databases typically have no JOINs, and most have dropped ACID. According to the CAP theorem, a distributed system cannot guarantee all three of: consistency, availability and partition tolerance. This leads to a system where your app is eventually consistent, but can occasionally have brief intervals of stupidity. There is anecdotal evidence of NoSQL databases losing data in real-world deployments. It is disconcerting to view your database as an airline that can lose your luggage every once in a while. Now you may need a second database as a backup in case your first database partied too hard and cannot quite remember where all your JSON documents went.
  7. NoSQL databases are attractive to all-JS outfits. The flip side – imagine a system where you write your client side using a collection of JavaScript toolkits, use XHR to send objects as JSON to the server, which is itself written using Node.js. After preparing your data you send it to MongoDB, as JSON again, where it will be stored as a JSON document. You can then use JavaScript to apply map/reduce and fetch interesting projections on the data, or query on JSON properties, or simply fetch the entire JSON document back when you need all the data. Apart from clustering, NoSQL DBs are just a nicer environment for app developers, particularly the recently weaned ones that have no attention span for strongly typed languages.
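
To make the ‘shallow mapping plus serialized LOB’ idea from item 1 a bit more tangible, here is a minimal sketch in plain JavaScript. It is not tied to any particular database or ORM: the `INDEXED_KEYS` list, the `toRow`/`fromRow` helpers and the `body` column name are all made up for illustration, and a real table would of course live in an actual RDB rather than in a JavaScript object.

```javascript
// Hypothetical list of properties we expect to query on; each one would
// become a real, indexable column in the table.
var INDEXED_KEYS = ['id', 'name', 'owner'];

// Shallow-map the key properties to columns and stuff the whole
// aggregate into a serialized 'body' column (the LOB).
function toRow(aggregate) {
    var row = {};
    INDEXED_KEYS.forEach(function (key) {
        row[key] = aggregate[key];
    });
    row.body = JSON.stringify(aggregate);
    return row;
}

// Rebuild the full aggregate from the LOB; the columns are only
// there to make queries fast.
function fromRow(row) {
    return JSON.parse(row.body);
}

var cat = {
    id: 7,
    name: 'Zildjian',
    owner: 'dejan',
    toys: [{ kind: 'mouse', squeaks: true }]
};

var row = toRow(cat);
console.log(Object.keys(row));          // the columns the table would need
console.log(fromRow(row).toys[0].kind); // nested data survives the round trip
```

The price of this shortcut is that anything living only inside `body` (the toys, in this sketch) is invisible to ordinary queries, which is exactly where the inverted index trick comes in.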

The list above should give you a flavor of where my head is right now. I am just trying to be practical about it. I know that there is no free lunch, but ‘kids these days’ know how to make things new and shiny. I know the adult in me should still prefer RDBs for their battle-hardened dependability and predictability, but look at the Mongoose JS library and tell me you are not swayed by the simplicity (and nobody is yelling at you):

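    var mongoose = require('mongoose');
    mongoose.connect('mongodb://localhost/test');

    // Define a model with a one-property schema, then create a document.
    var Cat = mongoose.model('Cat', { name: String });
    var kitty = new Cat({ name: 'Zildjian' });
    kitty.save(function (err) {
       // Bail out on failure so 'meow' is only logged after a successful save.
       if (err) {
          return console.error(err);
       }
       console.log('meow');
    });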

This is as close as I have come to replicating that ‘objects that know how to persist themselves’ feeling of yore, minus the quagmire of ORM. It is hard to resist the siren call of NoSQL. But I think the biggest lesson of my foray into the wonderful world of storage is that today you can change your mind about it. As long as you truly hide your storage choices as an implementation detail, you can switch later, so stop agonizing, pick one and move on to the more interesting areas, like user experience.
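
What could ‘hiding your storage choices’ look like in practice? One possible sketch, under my own assumptions: the app talks only to a tiny contract (`save(id, doc, cb)` and `load(id, cb)`), and whatever sits behind it is swappable. The names `createMemoryStore` and `createCatRegistry` are hypothetical, and the in-memory store is just a stand-in for whichever database you end up picking.

```javascript
// One possible implementation of the hypothetical storage contract:
// save(id, doc, cb) and load(id, cb), both using Node-style callbacks.
function createMemoryStore() {
    var docs = {};
    return {
        save: function (id, doc, cb) {
            // Serialize so callers cannot mutate what has been stored.
            docs[id] = JSON.stringify(doc);
            process.nextTick(function () { cb(null); });
        },
        load: function (id, cb) {
            var found = Object.prototype.hasOwnProperty.call(docs, id);
            var doc = found ? JSON.parse(docs[id]) : null;
            process.nextTick(function () { cb(null, doc); });
        }
    };
}

// Application code sees only the contract, never a driver or a schema.
function createCatRegistry(store) {
    return {
        adopt: function (name, cb) {
            store.save('cat:' + name, { name: name }, cb);
        },
        find: function (name, cb) {
            store.load('cat:' + name, cb);
        }
    };
}

var registry = createCatRegistry(createMemoryStore());
registry.adopt('Zildjian', function (err) {
    if (err) { return console.error(err); }
    registry.find('Zildjian', function (err, cat) {
        if (err) { return console.error(err); }
        console.log(cat.name);
    });
});
```

Changing your mind later would then mean writing another object with the same two functions (backed by MongoDB, an RDB, or anything else) and passing it to `createCatRegistry` instead, while the rest of the app stays blissfully unaware.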

© Dejan Glozic, 2013