It’s a new year, which means it is time to update the copyright at the bottom. I have been busy lately and I didn’t want to write fluffy blogs just to fill the space. But this time I actually have real content, so here it goes.
If you followed my blog in the past, you know I was very bullish on message brokers in the context of microservice architecture. I claimed that REST APIs alone are not sufficient to sustain a successful microservice system. The event collaboration pattern is necessary to ensure a scalable and robust system where microservices that own a resource lifecycle don’t need to be burdened with executing all the code driven by that lifecycle. You can read more about it in my article about REST/MQ mirroring.
Not so fast
There is a fly in this particular ointment, however. While HTTP REST is well understood, messaging protocols are numerous. You may know from my writing that I liked MQTT for a while due to its simplicity and wide support. However, MQTT is actually not a great protocol for gluing your microservice system. It is designed to be lightweight and run on the smallest of IoT devices, and lacks some critical features such as anycast support and manual acknowledgement of messages.
Another popular protocol (AMQP) suffered a schism of sorts. The version supported by the popular RabbitMQ broker (0-9-1) is a very different protocol from the actual open standard, AMQP 1.0, which is not as widely supported. This is a pity because I really like AMQP 1.0. It is like MQTT but with anycast and manual acknowledgement – perfect as a microservice system glue, yet still very simple to work with from within clients.
And of course, there is Apache Kafka, a very powerful but odd choice given its origins in high-throughput distributed log aggregation. Kafka is unapologetically Java-centric and has a proprietary protocol, and the bolt-on REST APIs for connecting other client languages are still fairly low level. For example, where AMQP 1.0 guarantees quality of service and requires implementations to provide a buffer of messages that are re-delivered to a client that crashed, Kafka simply allows you to maintain a pointer in the queue, but it is your job to work up the queue and catch up after restarting. You pay for performance by working at a level that is fairly low for general purpose application messaging.
Authentication pains
Choosing the messaging solution and protocol is only one part of the problem. If you want your system to be extensible and allow third-party integrations, you need to make it fairly easy for new clients to connect. On the other hand, you need to secure the clients because you don’t want a rogue app to eavesdrop on the events in the system without authorization (otherwise you have this egg on the bot face). Maintaining a large list of clients connected to the same topics can also be a scalability issue for some brokers.
For all these reasons, it has been widely acknowledged that message brokers are not a good match for external integrations. A cursory scan of popular cloud applications with large ecosystems points to a more client-friendly alternative – WebHooks.
WebHooks to the rescue
You would think that if something is so popular and has a catchy name, there is an actual well written protocol for it. Wrong! WebHooks is the least common denominator you could imagine. In a nutshell, this is all there is:
- You publish a list of valid events for which you will notify clients
- You provide an API (and/or UI) for clients to register URLs for one or more of those events
- When an event happens, you execute HTTP POST on the URLs registered for that event type.
That’s all. Granted, message brokers have topics you can publish to and subscribe to, and the actual messages they pass around are free-form, so this is not very different. But absent from WebHooks are things such as anycast, Quality of Service, manual acknowledgement etc.
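The three steps above can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular product does it: the event names, the `WebHookRegistry` class and the injected `post` function are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical list of published events; a real system would document its own.
VALID_EVENTS = {"order.created", "order.shipped"}


class WebHookRegistry:
    """Maps event types to registered URLs and fans out on notify."""

    def __init__(self, post: Callable[[str, dict], None]):
        self._post = post  # function that performs the actual HTTP POST
        self._urls = defaultdict(set)

    def register(self, event_type: str, url: str) -> None:
        # Only events from the published list can be subscribed to.
        if event_type not in VALID_EVENTS:
            raise ValueError(f"unknown event type: {event_type}")
        self._urls[event_type].add(url)

    def notify(self, event_type: str, payload: dict) -> int:
        """POST the payload to every URL registered for this event type."""
        targets = sorted(self._urls.get(event_type, ()))
        for url in targets:
            self._post(url, {"event": event_type, "data": payload})
        return len(targets)
```

Injecting `post` keeps the sketch independent of any HTTP library; in practice it would wrap whatever client you already use.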
I don’t intend to go into the details of what various implementations of WebHooks in apps such as GitHub, Slack etc. provide because thankfully Giuliano Iacobelli already wrote such an article. My interest here is to apply this knowledge to a microservice system we are building and try to anticipate pros and cons of going with WebHooks.
What it would take
The first thing that comes to mind is that in order to support WebHooks, we would need to write a new WebHook service. Its role would be to accept registrations, and store URL and event type mappings for subsequent invocation. Right there, my first thought is about the difference between external and internal clients. External clients would most likely use the UI to register a URL of their integration. This is how you register your script in GitHub so that it runs on every commit, for example.
However, with internal clients we would have a funny problem: every time I restart a microservice instance, it would need to register again somewhere during startup. That would make a POST endpoint a nonstarter, because I don’t want to keep creating new registrations. Instead, a PUT with a client ID would work better, where an existing registration for the same ID would just be updated if already there.
Other than that, the service would offer a POST for a new message into the provided event type that would be delivered to all registered URLs for that type. Obviously it would need to guard against 404s, 502s and URLs that take too long to return a response, giving up on them after a set timeout.
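To illustrate why PUT fits better here, this is a rough sketch of create-or-replace semantics keyed by client ID. The `RegistrationStore` class and its field names are hypothetical; a real service would back this with a database and sit behind an HTTP route.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    client_id: str
    url: str
    event_types: frozenset


class RegistrationStore:
    def __init__(self):
        self._by_client = {}  # client_id -> Registration

    def put(self, client_id, url, event_types):
        """PUT semantics: create or replace. Returns True if newly created."""
        created = client_id not in self._by_client
        self._by_client[client_id] = Registration(
            client_id, url, frozenset(event_types)
        )
        return created

    def urls_for(self, event_type):
        """All URLs whose registration includes the given event type."""
        return sorted(
            r.url for r in self._by_client.values() if event_type in r.event_types
        )
```

A microservice instance can call `put` on every startup with the same client ID, and the store still holds exactly one registration for it.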
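The delivery side might look something like the sketch below, which uses only the Python standard library. The five second timeout is an assumed value, and the `urlopen` parameter exists only so the function can be exercised without a network.

```python
import json
import socket
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 5  # assumed value; tune to your own tolerance


def deliver(url, payload, timeout=TIMEOUT_SECONDS, urlopen=urllib.request.urlopen):
    """POST payload as JSON; return True on a 2xx response, False otherwise."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        return False  # 404, 502 and other error statuses
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        return False  # unreachable host or a URL that took too long: give up
```

Returning a boolean rather than raising lets the caller decide what to do with a failed URL, such as logging it or disabling the registration after repeated failures.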
The best of both worlds
The set timeout brings back the topic of the quality of service, implying that WebHooks are great for external integration but not that great as reliable glue for a microservice system. Why don’t we marry the two then? We could continue to use a message broker for reliable delivery of internal messages, and hook it up to a WebHook service that would notify external integrations without the need to support our particular protocol, or get too much access into the sensitive innards. Hooking up a WebHook service to a message broker would have the added benefit of buffering the service itself so that it can be restarted and updated without interruption and missed events.
In the diagram above, our microservice system has the normal architecture with a common routing proxy providing a single domain entry into the microservices. The microservices use normal message broker clients to publish to topics. A subset of these topics deemed suitable for external integrations is also listened to by the WebHook service, and for each of those messages it reaches into the stored list of registrations and calls HTTP POST on the registered URLs. If the WebHook service crashes, a reputable message broker will maintain a buffer of messages to re-deliver them upon restart. For performance reasons, the WebHook service can choose to keep a subset of registrations in an in-memory cache depending on how frequently they are used.
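The glue between the broker and the WebHook service could be sketched roughly as follows. No specific broker client is assumed: `on_message` stands in for whatever callback your client library invokes, `ack` for its manual acknowledgement, and the topic names are made up.

```python
# Hypothetical subset of broker topics deemed safe for external integrations.
EXTERNAL_TOPICS = {"order.created", "order.shipped"}


class WebHookBridge:
    def __init__(self, lookup_urls, post):
        self._lookup_urls = lookup_urls  # e.g. a query against stored registrations
        self._post = post                # performs the HTTP POST
        self._cache = {}                 # topic -> list of URLs

    def invalidate(self, topic):
        """Call when registrations for a topic change."""
        self._cache.pop(topic, None)

    def on_message(self, topic, message, ack):
        """Broker callback; ack() tells the broker the message is handled."""
        if topic in EXTERNAL_TOPICS:
            if topic not in self._cache:
                self._cache[topic] = list(self._lookup_urls(topic))
            for url in self._cache[topic]:
                self._post(url, message)
        # Acknowledge only after forwarding, so a crash before this point
        # leaves the message with the broker for re-delivery.
        ack()
```

The cache here holds every topic it has seen; evicting rarely used entries is left out to keep the sketch short.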
Discussion
Obviously registering a URL with an HTTP PUT is much easier to implement than a message broker client, and providing a single POST endpoint to handle the event lowers the barrier of entry for external integrations. In fact, hooking up code to react to a single POST could very well be done using serverless architecture.
Are we losing something in the process? Inserting another service into the flow will add a bit of a delay, but external notifications are normally for events that are not happening many times a second, so the tiny delay is a more than acceptable tradeoff. In addition, if the client providing the WebHook URL is itself load-balanced, this delivery will be hardcoded to anycast (the event notification will only hit one of the instances in the cluster).
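To give a sense of how little an external integration needs, here is a sketch of a receiver built on Python's standard `http.server`. The payload shape (a JSON object with an `event` field), the port and the status codes returned are assumptions, not part of any WebHook standard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_event(body: bytes) -> int:
    """React to one notification; return the HTTP status to send back."""
    try:
        event = json.loads(body)
    except ValueError:
        return 400  # not valid JSON
    if not isinstance(event, dict) or "event" not in event:
        return 400  # not the shape we assumed
    print("received", event["event"])  # replace with the real reaction
    return 204


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = handle_event(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever, handling incoming notifications."""
    HTTPServer(("", port), HookHandler).serve_forever()
```

Keeping `handle_event` separate from the HTTP plumbing means the same function could be dropped into a serverless handler largely unchanged.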
Finally, this creates two classes of clients – ‘inner circle’ and external, segregated clients. Inner circle is hooked up directly to the message broker, while the external clients go through the service. In addition to this being an acceptable price to pay for easier integration, it is useful to be able to only expose a subset of events externally – some highly sensitive internal events may only be available to ‘trusted’ clients subscribing to message broker topics and having internal credentials.
Since the WebHook service will normally not keep retrying to deliver an event to an unresponsive URL, it is possible to miss an event. If this is a problem, the external system would need to fashion a ‘belt and suspenders’ fortification, where the event driven approach is augmented with a periodic REST API call to ‘compare notes’ and ensure the baseline it is working against is up to date.
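The ‘compare notes’ step could look something like this sketch, run on a timer by the external system. It assumes the REST API can return the full list of resources as dictionaries with an `id` field; `fetch_all` is a placeholder for that call.

```python
def reconcile(local: dict, fetch_all) -> dict:
    """Bring local state in line with the API and report what was missed.

    local maps resource id -> resource, as built up from received events.
    fetch_all is a callable returning the authoritative list of resources.
    """
    remote = {item["id"]: item for item in fetch_all()}
    missed = {
        "added": sorted(remote.keys() - local.keys()),
        "removed": sorted(local.keys() - remote.keys()),
        "changed": sorted(
            k for k in remote.keys() & local.keys() if remote[k] != local[k]
        ),
    }
    # The API is the source of truth, so overwrite the local baseline.
    local.clear()
    local.update(remote)
    return missed
```

An empty report on every run suggests events are arriving reliably; anything else points at a missed notification.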
© Dejan Glozic, 2017