Business case for observability

In the previous post, we took a look at what observability is and how to build it into your application. Just to recap, observability is developing insights into the system based on external signs. For example, the car dashboard has a service engine light, low-pressure indicators, RPM for the engine, etc. All of this can quickly help us determine if the car is in a condition to drive. We don’t need to actually look at every individual part of the car every single day.

As with anything in our industry, software exists to serve business needs. We always need to be cognizant of that. Any initiative that doesn't align with the business needs loses steam quickly, and rightly so. After all, we have a limited amount of resources to spend at any given time. The domain-driven design movement plays right into this. So, we need to understand why having a more observable system can help us.

Confidence in the system

When we hear people say their system works, what does that mean? Does it mean their smoke tests worked? Does it mean there are zero errors? Can it withstand Chaos Monkey? I highly recommend taking a look at Principles of Chaos Engineering to better understand the Chaos Monkey tool. That deserves a separate discussion in and of itself for some other time.

Sometimes a system may not be throwing any errors but may actually be doing something it is not supposed to. How do we catch that? Very often, this unexpected behavior is not captured in the logs or the monitoring tools. So, going by a "no news is good news" attitude, the team claims that the system is working. It is similar to walking into a messy room with the lights turned off: since I can't see anything, I am conveniently going to assume everything is just great.

Having multiple levels of health checks, connectivity checks, and performance checks along with observing demo data points helps us provide a basis for the "it’s working" claim. It is not a claim anymore, since we have data to back it up. It becomes a fact. This can reduce the number of customer complaints related to code issues and help the business in building confidence in the system.

For example, on an e-commerce site, along with the typical telemetry around errors, availability, and response times, we can periodically create a dummy order and check the data around it to make sure everything is fine. If any part of the order process goes beyond the acceptable thresholds, we can raise the appropriate alarms.
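The dummy order check can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the type names (OrderStepTiming, SyntheticOrderCheck), the step names, and the thresholds are all hypothetical, and how the timings are collected is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical measurement taken while a dummy order moves through one step.
public record OrderStepTiming(string Step, TimeSpan Elapsed, bool Succeeded);

public static class SyntheticOrderCheck
{
    // Returns one alarm message per step that failed or exceeded its threshold.
    // Steps without a configured threshold are only checked for success.
    public static List<string> Evaluate(
        IEnumerable<OrderStepTiming> timings,
        IReadOnlyDictionary<string, TimeSpan> thresholds)
    {
        var alarms = new List<string>();
        foreach (var timing in timings)
        {
            if (!timing.Succeeded)
            {
                alarms.Add($"{timing.Step}: step failed");
            }
            else if (thresholds.TryGetValue(timing.Step, out var limit) && timing.Elapsed > limit)
            {
                alarms.Add($"{timing.Step}: took longer than the allowed {limit}");
            }
        }
        return alarms;
    }
}
```

A scheduler could run the dummy order every few minutes, pass the collected timings to Evaluate, and forward any returned messages to whatever alerting channel the team already uses.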

This is how we can build confidence in our system with knowledge instead of guesses.

Faster Deployments

Higher confidence in the system paves the way for more advanced deployment models such as canary deployment and blue-green deployment. Both of these deployment models require some level of testing before the new features go completely live. If we can subject new nodes to production loads and observe how the system behaves with the new changes, we reduce the friction between the existing codebase and the new changes coming in. All of this means we can deploy new code more reliably and rapidly with minimal to no downtime, thus achieving true continuous deployment for the system.

Understand changes that affect business KPIs (key performance indicators)

KPIs tell us how the business is doing, as they point to key health issues that need to be addressed. Some examples of KPIs could be the number of active customers, cost per customer, customer attrition, etc. Let’s think of customer attrition from the perspective of a social media site, Twitter. What if it took more than 30-40 seconds to make a tweet live? That would significantly impact how many tweets can be generated, affect Twitter’s popularity, and eventually cause customer attrition. In this case, we can see a relation between latency and customer attrition. Understanding this, we can see why Twitter must have made the move from Rails to Scala and the JVM. This is not a trivial undertaking and the company's existence can depend on it. Could they have done it without gathering performance metrics? Could anyone do this without having a before and after picture?

Observability brings the backend problems to the forefront by making them measurable, which is beneficial because those problems can be absolutely detrimental to the business. On the other hand, fixing those issues proactively can drive the business forward.

Tangible targets

We saw a relation between technical metrics and KPIs. This means if I want to improve my KPIs, I can target my technical metrics because I can tie them to a piece of code. So when I hear the system is slow, I can quickly create an understanding of what that means based on the metrics and start the analysis.

Let’s take a look at an example. If my response time for a request averages 100 milliseconds over a week but jumps to 2-3 seconds after I push a new feature, then I know the system's performance has been affected significantly. It might not be a rollback-worthy deployment, but it is certainly worth a look to ensure it doesn’t get any worse. I am also likely to know the cause, based on the timelines and other observability metrics and logs. What does that do?

I can now clearly see what I need to achieve. If I can throw hardware at something, I can try that and I have ways to test that. If it requires a code change, I can push those changes through the same rigor. I know that I am not done until I bring the response time down to the acceptable range. Without realistically knowing that range, this would have been hard to achieve.
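A before-and-after comparison like the one above can be reduced to a small check. The sketch below is only an illustration: the class name, the use of a simple average, and the allowedRatio parameter are assumptions, and a real system would more likely compare percentiles over a longer window.

```csharp
using System;
using System.Linq;

public static class ResponseTimeCheck
{
    // Returns true when the post-deployment average exceeds the baseline
    // average by more than allowedRatio (for example 1.5 = 50% slower).
    public static bool HasRegressed(double[] baselineMs, double[] afterDeployMs, double allowedRatio)
    {
        if (baselineMs.Length == 0 || afterDeployMs.Length == 0)
        {
            throw new ArgumentException("Both samples need at least one measurement.");
        }
        return afterDeployMs.Average() > baselineMs.Average() * allowedRatio;
    }
}
```

With a baseline around 100 ms and post-deployment samples of 2-3 seconds, the check flags a regression; once the fix brings the samples back near the baseline, the same check tells me I am done.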

Faster software reflects positively on sales

With a clear understanding of what fast or slow means for the application, let’s take a look at how that can affect the bottom line of the business.

On a social media site, how expensive is the complete shutdown of the site? How many users does the site lose to regular sluggish performance alone? It is very easy to get a bad reputation, and the downward spiral begins there. Would you buy claims of performance tuning from a software consulting firm if their own site often suffers from serious lag, random crashes, etc.? I may see an item I am interested in on a vendor site but end up on Amazon anyway because they made it incredibly easy and fast to search for items and purchase them. They couldn't have arrived at a premier experience without gathering tons of metrics about usage, the performance of the pages, etc. What would happen if Amazon experienced just a 10% slowdown in their checkout process? How many carts would get abandoned? All of this has a direct impact on the business and its viability. As software engineers, it is important to understand the impact of these things on the business.

As explained by an Amazon Developer here, Amazon found that revenue increased by 1% for every 100 ms of load time improvement. You can see how Amazon Prime Day 2017 fared for them in this post.

Here’s a result from Pinterest’s frontend performance project in March 2017: a 40% drop in perceived wait time and a 15% increase in SEO traffic yielded a 15% increase in signups. They go into the details in this Pinterest Engineering blog post.

Both of these examples were originally mentioned in the Practical Monitoring book by Mike Julian.

Business priorities determination

It is very common for development teams to be in discussion with product teams about the next set of features. We can always make these conversations more data-driven if we have the usage statistics. In our e-commerce example, if we find that most users are utilizing the search functionality on the site for discovery instead of the navigation system, we can use this insight to make the search faster and more useful. We can then reduce the priority of everything related to the navigation system.

Justification for refactoring

Everyone cringes when they look at their old code. Engineers want to immediately start refactoring. The business doesn't see any value in it, as they think nothing changes from the end user's perspective. The maintenance argument works only with folks who have some development experience. The friction creeps in. How can we allow time for this activity from the business perspective? We can find a middle ground in the observability metrics. For example, we can prioritize activities that are going to improve the performance of the login process over other types of changes.

Generate a complete picture for A/B testing

The observable nature of the system can contribute to A/B test experiments effectively. Technical data metrics can be tied to usage to generate information that can help understand the stress points in the system. Fortunately, there are tons of tools to conduct these experiments. Optimizely is one such tool I have seen being used effectively.

Provide steps towards auto-healing

Observability is knowing the speed of your car by looking at the speedometer and not the spinning wheel. It won’t necessarily fix problems, but it will provide good insight into those problems that could prove crucial in resolving them. Auto-healing can be hard to generalize because it changes per context, per architecture, per tech stack, etc. Coming up with a truly auto-healing system can be a daunting task, but the path to that destination goes through observability.

Parting thoughts

We have seen how observability helps you build better software to drive value for users and the business. When we understand a system’s behavior we can operate better. We can deploy faster, build greater confidence in our system, understand KPIs better and drive sales.


Increasing observability of microservices

Software engineering is an evidence-based practice. We are always seeking facts. We are logically sequencing them together to generate knowledge out of them. And yet, sometimes we are completely oblivious to the state of our system in production. No amount of testing can guarantee a 100% bug free production software. Things will go wrong. The real test of the system begins when it gets deployed in production. Are we prepared for this?

Typical Enterprise Experience

How often are changes pushed to production without completely collecting data around the issue? We wait to change until someone complains. Then the investigation begins. We look at the logs. We "fix" the problem by making code changes, throwing more hardware at it, etc., and life moves on. How reactive is this!

Then someone says, "We need monitoring!". That's largely because of some random compliance requirement. Some tools are installed, and they gather info about CPU and RAM usage. People look at the spikes, they complain, they log a ticket, nothing happens, and everyone moves on.

Then there are alerts. When something goes wrong, a million of them fill up people's inboxes, even during a planned outage. They quickly become annoying. Moreover, they are compared to a fire alarm but are anything but. These alerts, unlike a fire alarm, go off during the problem instead of at the first sign of smoke. What’s the point of a loud alarm in the middle of a raging fire? People create rules to ignore these and move on.

Sometimes, teams also gather a service's uptime using heartbeat reports, turn it into some nice charts, and put them on a TV. Is that enough? Can a service returning 200 OK with a perfect heartbeat (i.e. it is up) perform its critical actions? I also hear about not making systems any slower than they are today; however, nobody can tell what slow or fast means.

All of this points toward a lack of value provided by typical monitoring in the enterprise. These systems were put in place because somebody wanted them, or they came in as part of some compliance requirement. They lack purpose. So, how do we fix this? How do we change our systems to make them more observable?

What is Observability?

Wikipedia says:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Now, if we apply this to some of the scenarios we saw above, we can see how incomplete the external outputs tend to be. They don't scale very well either.

Applying to Microservices

In the distributed systems/Microservices world, we are dealing with eventual consistency, where data may not be in the identical state throughout. We are dealing with services changing constantly at their own pace. Sometimes there could be a situation when certain functionalities are not available. All of this can make triaging problems tricky.

Increasing Observability

Let's see how to achieve this:

1) Come up with service level objectives and indicators: This helps us put together a clear and practical definition of what a healthy service looks like. Google's Site Reliability Engineering book has good advice on how to put these numbers together and how to avoid pitfalls. Afterwards, we can gather actual numbers and build intelligence around them.

2) Log aggregation: Most software produces a lot of logs. These logs can be aggregated into one place. This one place could be Kibana, Splunk or Honeycomb. If we can make holistic sense of these logs, we should be good.

3) Traceable individual event lifecycles: This can be achieved by generating a correlation ID and a source of origin for each event and logging both. We can use messaging headers to pass this information around. NServiceBus supports these types of headers.

4) Business goal tracking: We need to gather users' activity on our application so that we can figure out the importance of features based on their usage. A/B testing can provide great insight into this. Optimizely is one such tool that I have seen used effectively.

5) Host performance tracking: There is value in gathering info about the host running our service. However, to make it effective, we should create enough log information so that any anomaly can be tracked down to the piece of code that caused it. Otherwise, any issues arising from this are hard to recreate and resolve.

6) Meaningful notifications: I'd create fewer but more meaningful notifications. My proposed principle for this would be to generate a notification when something is about to go wrong. This way, individuals can take the necessary actions to avoid further escalation.

7) Custom approach: Sometimes there is no getting away from writing a few custom lines of code somewhere to proactively capture the data you need. In some of my systems, I had to take this approach to collect what I needed. It worked out better than customers telling us about our problems.

8) Automation: If someone must manually hit a button, that's not good. It is not going to happen. It will be ignored quickly. We need to build a culture of Automation.

9) Autonomy: The monitoring should neither bring the system down nor go down with the system. In the past, one of our application servers started to slow down because the monitoring tool was consuming too many resources.
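Items 1 and 6 in the list above can be sketched together: a service level indicator measured against an objective, with a warning raised before the objective is actually breached. This is a minimal sketch under stated assumptions. The names (SloStatus, SloEvaluator), the choice of a success-ratio indicator, and the warnAtBudgetUsed parameter are all hypothetical and are not taken from any particular tool.

```csharp
public enum SloStatus { Healthy, Warning, Breached }

public static class SloEvaluator
{
    // objective: target success ratio, e.g. 0.999.
    // warnAtBudgetUsed: fraction of the allowed failures after which we
    // want an early warning, e.g. 0.5 for "half of the error budget is gone".
    public static SloStatus Evaluate(long good, long total, double objective, double warnAtBudgetUsed)
    {
        double allowedFailures = total * (1.0 - objective);
        long failures = total - good;

        if (failures > allowedFailures)
        {
            return SloStatus.Breached;
        }
        if (failures > 0 && failures >= allowedFailures * warnAtBudgetUsed)
        {
            // Something is about to go wrong: notify while there is still time to act.
            return SloStatus.Warning;
        }
        return SloStatus.Healthy;
    }
}
```

The Warning state is the notification worth sending: it fires at the first sign of smoke, while Breached corresponds to the alarm in the middle of the fire.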

Avoid these traps

Cargo cult: Blindly following practices that worked for one successful organization may seem to be working out in the beginning, but that can backfire and introduce unnecessary complications. Next time, when you are using terms such as "Netflix did it that way" or "Amazon does it this way", catch yourself and see if it really applies to you. Some of those ideas are great and noteworthy but they still need to be correctly evaluated for validity in your situation. Every environment is different.

Another form of Cargo Cult is, "We have always done it this way." or "This is our pattern." The fix here is the same as above. We always need to evaluate if past lessons are still applicable or if we can do a better job this time around.

Tool Obsessions: Whenever I had conversations around monitoring, within a few minutes, they turned into a battle of different tools out there. Yes, the capabilities of tools can give us ideas but before spending too much money on these, we need to get some of the basics right and evaluate if those tools can help us or not.

Information I found useful

Sam Newman, in his Principles of Microservices talk, briefly touches upon observability being one of the core principles of building microservices. Charity Majors goes into a lot more detail in her Monitoring is dead talk. Her book Database Reliability Engineering also has a couple of good chapters relevant to monitoring. Google's SRE book is another good source, as mentioned above. Microsoft's best practices and the Practical Monitoring book take you deeper into monitoring.

Parting thoughts

With increased observability, when someone says our system is slow, we can add to that with information about how slow it is and data about the cause. This can be incredibly powerful in fixing problems as they come. It will increase the resiliency of our system and make a positive impact on the business.


Message store clean up strategy

In the previous post, we saw how to use a message store strategy to make a receiver idempotent. As we are using a table in SQL Server, it can fill up quickly if you have multiple types of messages getting stored. How do we keep it from filling up?

Instead of relying on any strategy outside our endpoint, I decided to use NServiceBus scheduling. I configured it to send a message every 24 hours. This can be configured easily. My send code looks like below:

await endpointInstance.ScheduleEvery(
    timeSpan: TimeSpan.FromHours(24),
    task: context =>
    {
        // Send the cleanup command each time the schedule fires.
        var message = new CleanupStoreMessage();
        return context.Send(message);
    });

The CleanupStoreMessageHandler picks it up afterwards. The handler goes back the configured number of days and removes the older messages from the store. That code looks like below:

 var messagesToRemove = dbContext.MyMessageStore.Where(
     m =>  m.Timestamp <= DateTime.UtcNow.AddDays(-10));

NServiceBus scheduling comes with several caveats. A relevant one is below:

Since the Scheduler is non-durable, if the process restarts, all scheduled tasks (that are created during the endpoint's startup) are recreated and given new identifiers.

To get around this, my strategy is to keep the TimeSpan for the ScheduleEvery method small enough that the task gets executed at least once a day. Since the handler goes back a fixed number of days from the time of execution, if I receive more than one message, the later messages just won't find any data to clean up, and that's OK for me. My query here goes back 10 days and I am sending the CleanupStoreMessage every 24 hours. If I reduce the TimeSpan to 12 hours, I will get 2 messages per day, and the second one may not find any data to clean. This way my handler has a greater chance of executing, which mitigates the limitation.
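The reason extra cleanup messages are harmless can be shown with an in-memory stand-in for the store. This is a sketch only: StoredMessage and MessageStoreCleanup are hypothetical names, and a List replaces the SQL Server table so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a row in the message store table.
public class StoredMessage
{
    public Guid MessageId { get; set; }
    public DateTime TimeStamp { get; set; }
}

public static class MessageStoreCleanup
{
    // Removes entries at or before the cutoff (utcNow minus retentionDays)
    // and returns how many entries were removed.
    public static int RemoveExpired(List<StoredMessage> store, DateTime utcNow, int retentionDays)
    {
        var cutoff = utcNow.AddDays(-retentionDays);
        return store.RemoveAll(m => m.TimeStamp <= cutoff);
    }
}
```

Running RemoveExpired a second time shortly after the first simply returns 0, which is why sending the cleanup message more often than strictly needed costs little.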

Another advantage here is that I am cleaning up my own data without relying on any other clever database archiving strategy or some other process to clean it up. I am less likely to forget about it and it requires less maintenance.


Idempotent receivers using message store

In Microservices/Distributed systems, messaging is a preferred form of integration. I covered some of the reasons in one of my previous posts. In these types of systems, whether we have pub-sub/event notification/sending commands, it is a good practice to make the receiving endpoints idempotent. It simply means the endpoint can receive the same message multiple times and end up in the same state as if it had received it once.

One way of ensuring this is by designing the message body itself. For example, in an accounting system, instead of sending a message to deduct the withdrawn amount, we can send the resulting total. To explain further, if I had $100 in my account and I took $20 out, sending a message that sets the total to $80 would be naturally more idempotent than sending a deduct-by-$20 message.
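The difference between the two message designs shows up as soon as a message is delivered twice. The sketch below uses hypothetical names (Account, AccountMessages) purely for illustration.

```csharp
public class Account
{
    public decimal Balance { get; set; }
}

public static class AccountMessages
{
    // "Set the total" message: applying it again leaves the same balance.
    public static void ApplySetTotal(Account account, decimal newTotal)
    {
        account.Balance = newTotal;
    }

    // "Deduct" message: every extra delivery deducts the amount again.
    public static void ApplyDeduct(Account account, decimal amount)
    {
        account.Balance -= amount;
    }
}
```

Starting from $100, applying the set-to-$80 message twice leaves $80, while applying the deduct-by-$20 message twice leaves $60.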

However, there are times when this is not possible. In the simple accounting system described above, to send a message that sets the total to $80, the sender needs to know the current state of the destination. This can bring in a whole new set of complexities. What if this is a pub-sub/event notification system? Are we going to keep track of everything happening in subscribers? Even if this was a command, we need to be sure of the current reality of the destination. This means increased coupling between the source and the destination. We are also increasing the importance of the order of messages, and the bugs around this are never fun to resolve.

Inadvertently, we could be pushing this distributed system to be more consistent, and that is always challenging given the several explanations of the CAP theorem. I highly recommend reading everything around this topic in the book Designing Data-Intensive Applications.

So, it sounds easy to say, "just send a set-the-total-to-$80 message", but the reality is often a little more complex than that, especially when we need the system to be more available than consistent or when we are dealing with an event notification system with many subscribers. In these situations, it could be more practical to send a deduct-by-$20 message. Now what?

In my system, I am working with NServiceBus with RabbitMQ as a transport. NServiceBus provides Outbox functionality. It looks good, but it comes with some caveats.

Because the Outbox uses a single (non-distributed) database transaction to store all data, the business data and Outbox storage must exist in the same database.

I can't guarantee this in my system. Also, what if I want to use a NoSQL store for this?

The Outbox feature works only for messages sent from NServiceBus message handlers.

This is very limiting. RabbitMQ is a very capable queuing system. It comes with very easy to use clients and enqueuing a message directly is common.

In comes the simple message store approach. We can store processed messages. In my case, my model looks like below:

public class MessageStore
{
    public Guid MessageId { get; set; }
    // Identifier of the record the message refers to.
    public Guid GenericIdentifier { get; set; }
    public DateTime TimeStamp { get; set; }
    public string MessageType { get; set; }
}

I can use any type of data store for this. I used SQL Server.

Now, in my NServiceBus message handlers, I can access this store as needed. To check if the message was already processed, I can do a simple check based on the message Id. Duplicate deliveries can happen for several reasons. For example, we can fetch a message out of a RabbitMQ queue and process it, but the positive ack never reaches the cluster because of a network blip.
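The message Id check can be sketched with an in-memory set standing in for the store. ProcessedMessageTracker is a hypothetical name, and the sketch assumes a single handler instance; in the real handler the lookup would go against the message store table instead.

```csharp
using System;
using System.Collections.Generic;

public class ProcessedMessageTracker
{
    private readonly HashSet<Guid> _processedIds = new HashSet<Guid>();

    // Runs handle only the first time messageId is seen.
    // Returns false when the message is a duplicate and was skipped.
    public bool TryProcess(Guid messageId, Action handle)
    {
        if (_processedIds.Contains(messageId))
        {
            return false;
        }

        handle();

        // Record the id only after the handler succeeded, so a message
        // whose handler threw can still be retried later.
        _processedIds.Add(messageId);
        return true;
    }
}
```

One design point worth noting: with a real database, the business change and the store entry would ideally be written together, otherwise a crash between the two can still let a duplicate through.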

Often, especially in data update scenarios, we want to discard a message if we have a more recent update processed. This can happen if the message is going through a retry logic and before it comes up again, another more recent update arrives and gets processed. In this situation, I can do a check like below:

var topProcessedMessage = dbContext.MessageStore
    .Where(m => m.MessageType.Equals(nameof(MyMessage)) && m.GenericIdentifier.Equals(message.MyRecordIdentifier))
    .OrderByDescending(m => m.TimeStamp)
    .FirstOrDefault();

I can compare the Timestamp of that stored message against the incoming message and discard the incoming message if its Timestamp is earlier than the stored one. Otherwise, I process the message and store the necessary values in the MessageStore.
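The stale message rule can be sketched the same way, with a dictionary standing in for the query above. LatestUpdateTracker and TryAccept are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public class LatestUpdateTracker
{
    // Latest processed timestamp per record identifier.
    private readonly Dictionary<Guid, DateTime> _latestByRecord = new Dictionary<Guid, DateTime>();

    // Returns false when a more recent update for the same record was
    // already processed; otherwise records the timestamp and returns true.
    public bool TryAccept(Guid recordId, DateTime messageTimestamp)
    {
        if (_latestByRecord.TryGetValue(recordId, out var latest) && messageTimestamp < latest)
        {
            return false;
        }

        _latestByRecord[recordId] = messageTimestamp;
        return true;
    }
}
```

A message that comes back from a retry after a newer update has been processed gets rejected here, which matches the discard rule described above.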

This way I can worry about what's relevant for my subscriber/handler. I can easily discard processed and stale messages. If multiple messages are going to make updates to the same entity, I won’t have to clutter it with many different types of LastUpdated Timestamp columns. This store can be easily expanded to store serialized messages, their checksums, etc. As we are not adding this to a pipeline, we are keeping this behavior optional and not applying by default to all the handlers.

The downside of this approach is, of course, that the store can get out of hand quickly in terms of the number of records. We will see how to handle this situation in the next post.


Why should you avoid get calls across microservices?

In the previous posts in this series, we saw what microservices are and how to start the journey towards broken out services from the monolithic application.

In the second post, I talked about not having get calls across microservices. On that, I received some questions. Here's one:

"Nice article on microservices.
But I did not get the reasoning behind
"You shouldn't make a query/get call to another microservice.""

My reply to that is below:

As a service, you should have your own store for the data you need. If you don't, what happens if the service you're using for get calls goes down for an extended period of time? You are blocked. Your function that depends on it fails now. What if you have multiple of these? You just increased your reasons to fail by that factor.

To explain this further, too many blocking get calls may result in a system that can get completely stalled. What if the individual calls that are happening have different SLAs? Do we block the entire request for the slowest call?

One of the projects I worked on in the past had these types of calls spread throughout the system. In the worst cases, the HTTP requests took well over a minute to return. Is this acceptable?

To counter this, the team went towards caching the entire relational data stores. Pulling all that information down became very tricky. The data sync was happening once a month. The sync job itself took days to finish, pushing the changes further back. I have seen many project teams going towards this type of solution and eventually running into the scale problem. Then they settle on syncing the data using messaging.

So, what is the trade off? I explained it below:

The data duplication that may come with this is an acceptable trade off for scale. In other words, you are giving up some consistency and moving towards eventual consistency for scale and availability.

One of the teams I worked with did a good job differentiating the types of messages. Some of the changes had to be pushed forward at a higher priority, some required a little more processing, etc. We built different routes for those messages so that we could maintain our SLA for each of them. The data was still cached, but we were able to get rid of the sync jobs that were trying to sync all the data at once. Messaging enables different kinds of patterns. I recommend the Enterprise Integration Patterns book, which goes deep into the different kinds of messaging patterns.

I used this example afterwards:

For a banking app, you can say there is $100 in your account as of the last sync time instead of $0 because you can't hit that service. People are much more likely to freak out in the latter scenario than in the former. This method scales too, since we have reduced the consistency level to eventual. It sounds radical, but you are trading in strong consistency for scale and user experience. I hope that clears things up for you. I think I should write another post on this :)
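The banking example can be sketched as a local store that is fed by messages and answers reads on its own. Everything here is hypothetical and simplified: the BalanceChanged event, the LocalBalanceStore class, and the in-memory dictionary are assumptions made for illustration, not a description of any real banking system.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event published by the service that owns account balances.
public record BalanceChanged(Guid AccountId, decimal NewBalance, DateTime OccurredAtUtc);

public class LocalBalanceStore
{
    private readonly Dictionary<Guid, BalanceChanged> _latest = new Dictionary<Guid, BalanceChanged>();

    // Called by the message handler whenever a BalanceChanged event arrives.
    public void Apply(BalanceChanged evt)
    {
        // Ignore events older than what we already hold, since delivery order is not guaranteed.
        if (_latest.TryGetValue(evt.AccountId, out var known) && evt.OccurredAtUtc < known.OccurredAtUtc)
        {
            return;
        }
        _latest[evt.AccountId] = evt;
    }

    // Reads never leave this service; the caller also learns how fresh the value is.
    public bool TryGetBalance(Guid accountId, out decimal balance, out DateTime asOfUtc)
    {
        if (_latest.TryGetValue(accountId, out var known))
        {
            balance = known.NewBalance;
            asOfUtc = known.OccurredAtUtc;
            return true;
        }
        balance = 0m;
        asOfUtc = default;
        return false;
    }
}
```

Returning the as-of time along with the balance is what lets the app say "$100 as of the last sync" even while the owning service is unreachable.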

I went on further to express the points below:

Another reason is the network. You can't take it for granted. In any of these situations, you don't want your customers to get affected by this implementation detail. The way to avoid this is by building your own store beforehand and keeping it synced through messaging.

You don't have a distributed system if the system is not resilient enough to consider the network loss. This is something that needs to be considered on day one. Your customers are not going to care that your network failed. They want your app to work.

If your system needs too much cross-service chatter, then it could be a sign of wrong system boundaries. It means that the services are just too fine-grained. Your services could be suffering from the nanoservices anti-pattern. It is a subtle problem. The SOA Patterns book goes into the details of that. Some of the services may require merging.

The Netflix approach

They go the hybrid route. They hit the local cache store first; if it is not available, they fall back on cross-boundary calls. Josh Evans explains that very eloquently in this talk.
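A cache-first read with a fallback can be sketched as below. This is a generic illustration and not Netflix's implementation: the CacheFirstReader name is hypothetical, the remote call is passed in as a delegate, and "not available" is interpreted here as a cache miss, which is an assumption.

```csharp
using System;
using System.Collections.Generic;

public class CacheFirstReader<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _cache = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _remoteCall;

    // remoteCall stands in for the cross-boundary call to the owning service.
    public CacheFirstReader(Func<TKey, TValue> remoteCall)
    {
        _remoteCall = remoteCall;
    }

    public TValue Get(TKey key)
    {
        // Serve from the local cache whenever the value is there.
        if (_cache.TryGetValue(key, out var cached))
        {
            return cached;
        }

        // Fall back on the remote service and remember the answer locally.
        var value = _remoteCall(key);
        _cache[key] = value;
        return value;
    }
}
```

The sketch leaves out expiry and remote failure handling, both of which a production version would need.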

To summarize, we are giving up some consistency for autonomy. We can solve the consistency problem using data pumps or sync jobs easily but autonomy has to remain as strong as possible. Without autonomy, you won't be able to materialize any benefits of moving towards microservices architecture.
