Software engineering is an evidence-based practice. We are always seeking facts and sequencing them logically to generate knowledge. And yet, sometimes we are completely oblivious to the state of our system in production. No amount of testing can guarantee 100% bug-free software in production. Things will go wrong. The real test of a system begins when it is deployed to production. Are we prepared for that?

Typical Enterprise Experience

How often are changes pushed to production without properly collecting data around the issue? We wait to make a change until someone complains. Then the investigation begins. We look at the logs. We "fix" the problem by making code changes, throwing more hardware at it, and so on, and life moves on. How reactionary is that!

Then someone says, "We need monitoring!". That's largely because of some random compliance. Some tools are installed, they gather info about CPU and RAM usage. They look at the spikes, they complain, they log a ticket, nothing happens, and everyone moves on.

Then there are alerts. Something goes wrong and a million of them fill up people's inboxes, even during a planned outage. They quickly become annoying. They are often compared to a fire alarm but are anything but: unlike a fire alarm, they go off in the middle of the problem instead of at the first sign of smoke. What's the point of a loud alarm in the middle of a raging fire? People create rules to ignore them and move on.

Sometimes teams also track a service's uptime with heartbeat checks, build some nice charts, and put them on a TV. Is that enough? Can a service that returns 200 OK to a heartbeat (i.e. it is "up") actually perform its critical actions? I also hear goals like not making the system any slower than it is today, yet nobody can say what slow or fast actually means.

All of this points to a lack of value in typical enterprise monitoring. These systems were put in place because somebody wanted them or because some compliance requirement demanded them; they lack purpose. So how do we fix this? How do we change our systems to make them more observable?

What is Observability?

Wikipedia says:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Now, if we apply this to some of the scenarios we saw above, we can see how incomplete the external outputs tend to be. They don't scale very well either.

Applying to Microservices

In the world of distributed systems and microservices, we deal with eventual consistency, where data may not be in an identical state everywhere at once. We deal with services that change constantly, each at its own pace. Sometimes certain functionality is simply unavailable. All of this can make triaging problems tricky.

Increasing Observability

Let's see how to achieve this:

1) Come up with service level objectives and indicators: This helps us put together a clear, practical definition of what a healthy service looks like. Google's Site Reliability Engineering book has good advice on how to put these numbers together and how to avoid the pitfalls. Afterwards, we can gather actual numbers and build intelligence around them. (A minimal SLI/SLO sketch appears after this list.)

2) Log aggregation: Most software produces a lot of logs. These logs can be aggregated in one place, such as Kibana, Splunk, or Honeycomb. If we can make holistic sense of them, we are in much better shape.

3) Traceable individual event lifecycles: This can be achieved by stamping events with correlation IDs and their source of origin and logging them. Messaging headers can carry this information between services; NServiceBus, for example, supports these kinds of headers. (A structured-logging sketch covering this and the previous point follows the list.)

4) Business goal tracking: We need to gather users' activity in our application so that we can gauge the importance of features based on their usage. A/B testing can provide great insight into this; Optimizely is one such tool that I have seen used effectively.

5) Host performance tracking: There is value in gathering information about the host running our service. However, to make it effective, we should emit enough logging context that any anomaly can be tracked down to the piece of code that caused it; otherwise, such issues are hard to recreate and resolve. (See the host-metrics sketch after this list.)

6) Meaningful notifications: I'd create fewer but more meaningful notifications. My guiding principle is to generate a notification when something is about to go wrong, not after it has. That way people can take action before the situation escalates. (A leading-indicator sketch follows the list.)

7) Custom approach: Sometimes there is no getting away from writing a few custom lines of code somewhere to proactively capture the data you need. In some of my systems I had to take this approach to collect what I needed, and it worked out far better than having customers tell us about our problems.

8) Automation: If someone must manually hit a button, that's not good; it is not going to happen and will quickly be ignored. We need to build a culture of automation.

9) Autonomy: The observability tooling should neither bring the system down nor go down with the system. In the past, one of our application servers started to slow down because the monitoring tool itself was consuming too many resources.
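
To make point 1 concrete, here is a minimal sketch, in Python, of computing an availability SLI over a window of request outcomes and comparing it to an SLO. The 99.9% target, the Request shape, and the sample data are illustrative assumptions, not numbers from any particular service.

```python
from dataclasses import dataclass

@dataclass
class Request:
    succeeded: bool      # e.g. a non-5xx response within the latency target
    duration_ms: float

# Illustrative SLO: 99.9% of requests succeed over the evaluation window.
SLO_TARGET = 0.999

def availability_sli(requests: list[Request]) -> float:
    """SLI = good events / valid events for the window."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.succeeded)
    return good / len(requests)

def error_budget_remaining(sli: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, < 0 = blown)."""
    budget = 1.0 - slo
    spent = 1.0 - sli
    return 1.0 - (spent / budget)

if __name__ == "__main__":
    # Fake window: 1 failure in every 500 requests.
    window = [Request(succeeded=i % 500 != 0, duration_ms=120.0) for i in range(10_000)]
    sli = availability_sli(window)
    print(f"SLI: {sli:.4%}, SLO: {SLO_TARGET:.1%}, "
          f"error budget remaining: {error_budget_remaining(sli):.1%}")
```

Once numbers like these are collected continuously, they become the baseline the rest of the points build on.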
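
For points 2 and 3, a sketch of structured, JSON-formatted logging with a correlation ID carried in a context variable so every log line from one unit of work can be stitched together in the aggregator. The field names (correlation_id, origin), the "orders-service" name, and the CorrelationId header key are assumptions for illustration; in a real system the ID would come from the incoming message header (NServiceBus and most messaging libraries expose such headers) and be forwarded on every outgoing call.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Correlation ID for the current unit of work (e.g. copied from an incoming message header).
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="unknown")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log aggregator can index the fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "origin": "orders-service",   # hypothetical service name
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_message(headers: dict) -> None:
    # Reuse the caller's correlation ID if present, otherwise start a new one.
    correlation_id.set(headers.get("CorrelationId", str(uuid.uuid4())))
    log.info("place-order received")
    log.info("payment requested")   # downstream calls would forward the same header

if __name__ == "__main__":
    handle_message({"CorrelationId": "3f2c9a"})
```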
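
For point 5, a sketch of tagging host measurements with the operation currently running, so a CPU or memory spike can be traced back to a code path instead of remaining an anonymous graph. It assumes the third-party psutil package is available; the operation name is made up.

```python
import logging
import time
from contextlib import contextmanager

import psutil  # third-party: pip install psutil

log = logging.getLogger("host-metrics")
logging.basicConfig(level=logging.INFO)

@contextmanager
def tracked(operation: str):
    """Log host-level resource usage around a named block of code."""
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        rss_delta = proc.memory_info().rss - rss_before
        # Note: cpu_percent(interval=None) reports usage since the previous call,
        # so the very first reading in a process is 0.0.
        log.info(
            "op=%s duration_s=%.3f cpu_percent=%.1f rss_delta_bytes=%d",
            operation, elapsed, psutil.cpu_percent(interval=None), rss_delta,
        )

if __name__ == "__main__":
    with tracked("rebuild-product-cache"):   # hypothetical operation
        _ = [x * x for x in range(1_000_000)]
```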
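
For point 6, a sketch of notifying on a leading indicator rather than after the outage: it projects when a disk will fill up from a recent growth rate and raises a notification only while there is still time to act. The 48-hour threshold, the growth rate, and the notify stand-in are all assumptions.

```python
import shutil

HOURS_AHEAD_TO_WARN = 48  # illustrative threshold

def notify(message: str) -> None:
    """Stand-in for a paging or chat integration."""
    print(f"NOTIFY: {message}")

def check_disk(path: str, growth_bytes_per_hour: float) -> None:
    """Warn while there is still time to act, not when the disk is already full."""
    usage = shutil.disk_usage(path)
    if growth_bytes_per_hour <= 0:
        return  # not growing; nothing to predict
    hours_left = usage.free / growth_bytes_per_hour
    if hours_left < HOURS_AHEAD_TO_WARN:
        notify(f"{path}: projected to fill in {hours_left:.1f}h "
               f"({usage.free / 2**30:.1f} GiB free)")

if __name__ == "__main__":
    # The growth rate would normally be measured from recent samples; 5 GiB/hour is made up.
    check_disk("/", growth_bytes_per_hour=5 * 2**30)
```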

Avoid these traps

Cargo cult: Blindly following practices that worked for one successful organization may seem to work at first, but it can backfire and introduce unnecessary complications. Next time you catch yourself saying "Netflix did it that way" or "Amazon does it this way", pause and ask whether it really applies to you. Some of those ideas are great and noteworthy, but they still need to be evaluated for validity in your situation. Every environment is different.

Another form of cargo cult is "We have always done it this way" or "This is our pattern." The fix is the same as above: always evaluate whether past lessons still apply, or whether we can do a better job this time around.

Tool obsession: Whenever I have had conversations about monitoring, within a few minutes they turned into a battle over the various tools out there. Yes, the capabilities of tools can give us ideas, but before spending too much money on them, we need to get some of the basics right and evaluate whether those tools can actually help us.

Information I found useful

Sam Newman, in his Principles of Microservices talk, briefly touches on observability as one of the core principles of building microservices. Charity Majors goes into a lot more detail in her Monitoring is dead talk. Her book Database Reliability Engineering also has a couple of good chapters relevant to monitoring. Google's SRE book, mentioned above, is another good source. Microsoft's best-practices guidance and the book Practical Monitoring take you deeper into monitoring.

Parting thoughts

With increased observability, when someone says our system is slow, we can answer with how slow it is and data pointing to the cause. That is incredibly powerful for fixing problems as they arise. It results in a more resilient system and a positive impact on the business.