Is Kafka or RabbitMQ the right messaging tool for you?

In a previous post on microservices integration patterns, we talked briefly about messaging. Messaging comes with many options and patterns, and one of the most critical decisions you’ll make is choosing between message brokers. RabbitMQ and Kafka are the leading options, seen as representing queueing and streaming, respectively. If you search for a comparison between the two, you are unlikely to get an unbiased view: vendors on both sides have flooded the internet with praise of their preferred tool. The answer is hardly the slam dunk that some posts or talks suggest. In many of our clients’ experience, choosing the wrong option only brings on more problems. So, how do you make the right choice? Instead of providing a prescriptive answer, we’ll look at the evaluation criteria and provide a decision matrix that you can use to arrive at the right solution for your unique situation.

How does RabbitMQ work?

RabbitMQ is an implementation of the Advanced Message Queuing Protocol (AMQP). It brings in concepts for the advanced routing of messages, such as Topic, Direct, and Fanout exchanges. Publishers send messages to these exchanges, and the exchanges are bound to subscriber queues.

In the diagram above, we have a publisher, the Users service, that intends to generate a UserProfileUpdated event. It publishes to the UserProfileUpdated Fanout exchange. There are two subscribers to this exchange: the Transactions and Credit Score services. When they start up and indicate the intention of subscribing to this event, a binding is made between each subscriber’s queue and the exchange. After that, when the publisher sends an event to the exchange, RabbitMQ delivers the event to all bound queues in the order in which it was received. Each bound queue gets its own copy. The event isn’t dequeued until the subscriber sends a positive acknowledgment to its queue. We can easily add another subscriber and store these events in an Event Store if needed. Messages that a subscriber repeatedly fails to handle can be moved to another exchange, called a dead letter exchange, which can be managed separately. We can achieve high throughput by adding multiple competing consumers to the same queue and managing the routing.
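The flow above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the RabbitMQ client API: the `FanoutExchange` class, its method names, and the `max_attempts` threshold are all hypothetical, and a real broker handles persistence, redelivery, and dead-lettering through its own configuration.

```python
from collections import deque


class FanoutExchange:
    """In-memory stand-in for a fanout exchange (hypothetical, not the RabbitMQ API)."""

    def __init__(self, max_attempts=3):
        self.queues = {}          # subscriber name -> its own queue
        self.dead_letters = []    # (subscriber, message) pairs that kept failing
        self.max_attempts = max_attempts

    def bind(self, subscriber):
        # Each subscriber gets its own queue bound to the exchange.
        self.queues[subscriber] = deque()

    def publish(self, message):
        # Every bound queue receives its own copy, in arrival order.
        for queue in self.queues.values():
            queue.append({"body": dict(message), "attempts": 0})

    def deliver(self, subscriber, handler):
        """Hand the head of the queue to handler; dequeue only on a positive ack."""
        queue = self.queues[subscriber]
        if not queue:
            return False
        envelope = queue[0]
        envelope["attempts"] += 1
        try:
            acked = bool(handler(envelope["body"]))
        except Exception:
            acked = False
        if acked:
            queue.popleft()
        elif envelope["attempts"] >= self.max_attempts:
            # Repeated failures: move the message to the dead letter store.
            queue.popleft()
            self.dead_letters.append((subscriber, envelope["body"]))
        return acked


exchange = FanoutExchange(max_attempts=2)
exchange.bind("transactions")
exchange.bind("credit_score")
exchange.publish({"event": "UserProfileUpdated", "user_id": 42})

exchange.deliver("transactions", lambda msg: True)    # acknowledged, dequeued
exchange.deliver("credit_score", lambda msg: False)   # first failure, stays queued
exchange.deliver("credit_score", lambda msg: False)   # second failure, dead-lettered
```

The point of the sketch is that a message leaves a queue only through an acknowledgment or a dead-letter decision, and that one subscriber’s failures never touch another subscriber’s copy.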

How does Kafka work?


This diagram represents the same scenario implemented in Kafka. Kafka is an event log: when the publisher (the Users service) sends an event, it simply gets appended to a stream, similar to how a log entry would be made. The consumers pick up messages from their specific position (offset) in the stream and consume everything afterward sequentially. The diagram above shows that the Transactions service’s offset is 2, so it gets the event sitting at that position and continues. The Credit Score service’s offset is 1, so it picks up that message and continues. This way, consumers can freely move back and forth as needed. Consuming an event does not take it off the stream; how long an event stays in the stream is configurable. If a consumer fails to process an event, it can easily consume that event again. A topic is split into partitions. Within a consumer group, each partition is assigned to only one consumer at a time, so the degree of parallelism is controlled by the number of partitions. This is how Kafka can support large volumes of data. The delivery of messages to these partitions is handled by Kafka. The consumers are completely unaware of the internal routing and related intricacies.
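To make the offset mechanics concrete, here is a minimal in-memory sketch of a single partition. It is an illustration only, not the Kafka client API: the `EventLog` class and its `append`, `poll`, and `seek` methods are hypothetical names, and retention, partitioning, and consumer groups are left out.

```python
class EventLog:
    """In-memory stand-in for a single partition (hypothetical, not the Kafka API)."""

    def __init__(self):
        self.events = []      # append-only; reading never removes entries
        self.offsets = {}     # consumer name -> next position to read

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1   # offset of the new event

    def poll(self, consumer):
        """Return the next event for this consumer and advance its offset."""
        position = self.offsets.get(consumer, 0)
        if position >= len(self.events):
            return None
        self.offsets[consumer] = position + 1
        return self.events[position]

    def seek(self, consumer, offset):
        # Moving the offset back is what makes replay possible.
        self.offsets[consumer] = offset


log = EventLog()
for version in range(4):
    log.append({"event": "UserProfileUpdated", "version": version})

log.seek("transactions", 2)
log.seek("credit_score", 1)

first_for_transactions = log.poll("transactions")   # event at offset 2
first_for_credit_score = log.poll("credit_score")   # event at offset 1

log.seek("credit_score", 1)                         # processing failed: rewind
replayed = log.poll("credit_score")                 # the same event again
```

Each consumer only moves its own position; the log itself is untouched by reads, which is why a failed consumer can simply rewind and try again.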

Now that we see what a typical publish-subscribe with events looks like in both Kafka and RabbitMQ, let’s compare some high-level features.

How do they compare head to head?

  • RabbitMQ cannot be used as a store; Kafka can.
  • In RabbitMQ, ordering is not guaranteed once we have multiple consumers. Kafka guarantees order for a partition in a topic.
  • Messages can’t be replayed by RabbitMQ; they have to be resent from the sending side. We do this with the Message Outbox pattern. Kafka stores data in the order it comes in and supports message replay with the help of offsets. However, it introduces other tradeoffs around data compaction, how long to keep the data on the streams, what to do if the data required predates the stream, etc.
  • RabbitMQ doesn’t support transactions natively; it uses acknowledgments. Kafka supports transactions.
  • RabbitMQ has great .NET support—it completely outshines Kafka in this regard. Kafka treats .NET support as a secondary priority.
  • RabbitMQ has good tooling for management on Windows. Kafka does not.
  • RabbitMQ implements the Advanced Message Queuing Protocol. The protocol’s guardrails help you fall into a pit of success. With Kafka, you will have to implement a lot of these patterns and disciplines yourself.
  • RabbitMQ doesn’t need an outside process running. Kafka requires a running Zookeeper instance for its broker management. Zookeeper is responsible for assigning a broker for the topic.
  • Out of the box, RabbitMQ is behind Kafka in multithreading support, but not by much. Since NServiceBus works with RabbitMQ and has good support for multithreading, this is less of a problem for RabbitMQ. In both worlds, ordering is not guaranteed if the consumers are scaled out or fetch records using multiple threads.
  • RabbitMQ has a lot of plugins to support your needs. Kafka is not as mature and therefore doesn’t have as many plugin options.

There are a lot of features to compare, and folding them into an overall decision can be challenging. The evaluation criteria we’ve developed can help you weigh the options together and end up with an empirical answer.

How do you choose one, or both?

As Caitie McCaffrey, one of the most well-known distributed systems architects, puts it in this tweet, there can only be trade-offs within different contexts. Building a scoring sheet can help you evaluate your options. The considerations you choose will vary with your context, research, and comfort level. Below is an example of an actual evaluation that we performed. You assign a “1” to the tool that is stronger in each scenario. If neither outranks the other, you assign a “0” to both. Tabulating the totals will give you an idea of how well one suits your needs over the other.
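The tabulation itself is simple enough to sketch in code. The criteria and scores below are hypothetical placeholders chosen for illustration, not the values from our evaluation; the `tally` helper is likewise just an assumed name.

```python
def tally(scores):
    """Sum per-criterion points; each row holds (rabbitmq_points, kafka_points)."""
    rabbitmq_total = sum(rabbit for rabbit, _ in scores.values())
    kafka_total = sum(kafka for _, kafka in scores.values())
    return rabbitmq_total, kafka_total


# Hypothetical criteria and scores, for illustration only.
scores = {
    "message replay": (0, 1),
    ".NET client support": (1, 0),
    "ordering guarantees": (0, 1),
    "routing features": (1, 0),
    "throughput": (0, 0),   # neither outranks the other
}

rabbitmq_total, kafka_total = tally(scores)
```

With these made-up inputs the totals come out even, which is a realistic outcome: the sheet surfaces the trade-offs rather than guaranteeing a winner.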

Sample Scoring Sheet


*Kafka requires a dependency on Zookeeper

As this case demonstrates, there may not always be a clear winner, but using both or transitioning gradually can help you cover all of your bases. If you’re leaning towards having both in the environment or introducing Kafka slowly, you can make use of a connector between RabbitMQ and Kafka.

When you choose these tools, you also need to be cognizant of some manual enhancements you may need to make to keep them developer-friendly. For example, if you choose RabbitMQ and still need the Event Store, you will need to build message handlers to populate the store. Similarly, if you choose Kafka and need process management, you will have to do extra work, perhaps in a homegrown library, to support that. Accounting for extra work like this will help ensure you make the correct decision.

Have you answered these prerequisite questions?

No tool or framework will necessarily address all of your underlying architectural problems. If database integration was the norm in the past, the chance of repeating its mistakes is high. We need to think differently when we are dealing with tools like Kafka. For example, we don’t necessarily need a new topic per message type, as Martin Kleppmann points out in his post on event types for Kafka topics.

A change in mindset starts with the questions below—they need to be answered before you implement either tool. This is just a starting point, but if these considerations are not addressed early on, the odds of messaging success will be against you.

  • Do you have observability/monitoring in place? Can you demonstrate a need for scale?
  • Have you answered underlying architectural concerns in your system?
  • Do you have proper business and data boundaries in place?
  • Do you have a regular check of the business and data boundaries process?
  • Do you have operational and standards concerns answered? Is your choice compliant with them?
  • Have you documented your hardware constraints? Is your choice compliant with them?
  • Have you documented your security constraints? Is your choice compliant with them?
  • Have you evaluated other platform-as-a-service options (such as Event Hubs on Azure)?
  • Have you implemented the Outbox pattern correctly? Data loss is possible with both RabbitMQ and Kafka.

Parting thoughts

Both RabbitMQ and Kafka are powerful tools, but you need to be able to evaluate them objectively per your needs. Instead of making a “gut choice,” be a little more data-driven in your evaluation. The criteria you apply may vary depending on the context, and that is to be expected. Controlled consideration will represent the reality a lot more closely than the vendors of these tools can.


Advanced Message Queuing Protocol:
Dead Letter Exchange:
Message Outbox:
The .NET client for Kafka is behind
High throughput can be achieved with both tools. Some easily available benchmarks for these tools are here:
Producer Side Idempotency:
Kafka can lose data too:
Martin Kleppmann’s talk on Kafka:
Kafka/RabbitMQ connector:
My past posts on observability:
Kafka documentation on Zookeeper:
Nice blog post series on Kafka vs RabbitMQ comparison:


How integration patterns impact your microservices architecture

When the world wide web first emerged, integrating different types of operating systems was a core challenge. Hypertext Transfer Protocol (HTTP) created communication channels by sharing hypertext. These systems started speaking a common language over an accepted protocol, and the internet as we know it today was born.

When creating a microservices architecture, the integration challenges are not very different: multiple implementation technologies are physically separated by a network and need to communicate with each other. Microservices integration plays a vital role in creating a seamless experience of the system from the end user’s perspective. Correctly integrated systems also help realize the benefits of distributed systems: they enable scaling at the service level, improve efficiency, and have the potential to reduce infrastructure costs while serving business needs [1].

On the other hand, incorrectly integrated systems completely undermine the benefits of a microservices architecture: they can result in painful data loss and integrity issues. The problems are usually very hard to track down, and in the meantime, users are impacted adversely.

Seamless integration depends on a number of considerations, which we looked at in our previous post, Principles for Microservices Integration. These serve as a guide to choosing the type of integration that will offer the most autonomy and scale. We’ll cover the various options and their pros and cons below.

Database Integration

In this pattern, two or more services read and write data out of one central data store. All of the services go to this central data store. We can illustrate this using the banking application example from our previous post, which takes login, user profile, transactions, notifications, credit score, and spending reports as separate services defined by the business functionalities.

One of the significant advantages of database integration is simplicity. The transaction management is more straightforward as compared to other patterns. This is perhaps the most widely-used pattern of integration—but also the most abused.

This pattern couples services together undesirably, making the microservices architecture more difficult to change and expensive to scale. Defining the data ownership and updating the schema can become a messy process—every change requires re-compiling and deploying all of the services. It pushes towards highly orchestrated, big bang style deployments. This type of integration can create significant obstacles in maintaining the autonomy of microservices.

Under high-scale demands, the only option is to throw more hardware at the database, and even then it becomes difficult to avoid deadlocks and row-level contention in the database.

Ideally, we don’t recommend this pattern for inter-services communication. It can be used in one of the early phases of a phased microservices rollout. In other words: If you use it, lose it soon.

Synchronous API calls

In this integration pattern, the services communicate synchronously through an API. All the access to each other’s data is coordinated through an API in a request-response fashion, and the service waits for data from the API to perform its action. In the example above, if the transaction service needs to read the user profile data, it calls the user profile API and gets what it needs.

This provides a decent abstraction over a direct database call and offers excellent flexibility in terms of technical choices. It also hides many implementation details: the abstraction gives us the freedom to change technologies without affecting our clients. For example, the user profile service could use Java and MySQL, while the transaction service could use SQL Server and .NET, and they can still easily speak to each other through the API.

However, this is not much different from the direct database integration pattern. Adding another network hop on top of a database call can also inhibit scale: the extra workload degrades performance, and the design ignores most of the fallacies of distributed computing [2]. This integration pattern also makes transaction management difficult and inhibits autonomy, as services depend on one another’s uptimes. In microservices, if you have to read data synchronously outside of your system boundary, that is a service-oriented architecture smell [3].

In some cases, this integration pattern is the best option or is unavoidable. Security tokens are a prime use case for synchronous API integration because those tokens are short-lived and can’t be generated before they are required. Synchronous API calls should be used sparingly. When they are used, they should be versioned and wrapped in a circuit-breaking mechanism such as Polly [4].
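To show what a circuit breaker does, here is a minimal hand-rolled sketch in Python. It is not Polly (which is a .NET library) and does not mirror its API; the `CircuitBreaker` class, the `failure_threshold` and `reset_after` parameters, and the `flaky_profile_api` function are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is short-circuited instead of hitting the remote API."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping remote call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_profile_api(user_id):
    # Hypothetical remote call that is currently failing.
    raise ConnectionError("user profile service unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_profile_api, 42)
    except CircuitOpenError:
        outcomes.append("short-circuited")
    except ConnectionError:
        outcomes.append("failed")
```

After two consecutive failures the breaker stops calling the struggling service, so the caller fails fast instead of piling more load onto a dependency that is already down.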

ETL (Extract, Transform, and Load)

ETL entails synchronizing data via background processes on a predefined schedule. This data can be pushed or pulled. Only backend ETL processes need database access. It is asynchronous, meaning services can execute without waiting for a “callback.”

This integration pattern also hides implementation details nicely. It provides reasonable decoupling because the services are not dependent upon one another’s uptimes. Live users don’t get affected by the uptimes or the processing time.

The ETL processes have to change with the source and destination databases. With ETL integrations, data consistency depends on the schedule and duration. Figuring out the change delta could get too complicated. In these situations, the teams fall back on pushing the entire dataset out. That makes processes very long-running, significantly undermining their usefulness.
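One common way to compute the change delta is a watermark on a last-modified column. The sketch below assumes every source row carries a reliable `updated_at` value, which is an assumption and not something every schema offers; the `extract_delta` function and the sample rows are hypothetical.

```python
def extract_delta(rows, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    changed = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max((row["updated_at"] for row in changed), default=last_watermark)
    return changed, new_watermark


# Hypothetical source table; updated_at is a monotonically increasing tick.
source_rows = [
    {"id": 1, "balance": 100, "updated_at": 10},
    {"id": 2, "balance": 250, "updated_at": 25},
    {"id": 3, "balance": 75, "updated_at": 31},
]

changed, watermark = extract_delta(source_rows, last_watermark=20)
second_run, _ = extract_delta(source_rows, watermark)   # nothing new to move
```

Even this simple scheme shows where the complexity creeps in: deleted rows never appear in the delta, and rows without a trustworthy timestamp cannot be tracked at all.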

Reporting services are a natural fit for this type of integration. These processes have their place but usually grow more complicated over time. They should be used only when stale data is acceptable in the system.


Messaging

In this pattern, services exchange meaningful messages with each other through what are called commands or integration events. They are sent over message transports such as RabbitMQ, MSMQ, or Azure Service Bus. In the example above, the transactions service generates an “account balance changed” event and puts it on a message broker. The rewards service, credit score service, and notification service each subscribe to that event and react to it as necessary. This is a publish-subscribe pattern, and there are plenty of other useful patterns for messaging. Enterprise Integration Patterns is an excellent resource for learning more [5].

If done correctly, messaging provides very good decoupling. It offers complete flexibility in terms of technology choices, as long as they can communicate correctly with the transport. Pushing data to subscribers makes the sending part simpler, and the sender remains completely unaffected by the processing details on the subscriber’s side.

Incorrect service or transaction boundaries can complicate messaging implementations. Also, the loose coupling this pattern provides can come at the cost of consistency. It requires a high level of discipline to implement messaging correctly, and inexperienced teams may struggle with it.

Two categories of messaging

Typical messaging solutions are built on top of the properties of the transport. At a high level, these transports could be divided into two categories: message queueing and message streaming.

The queueing solution involves dealing with live data. Once a message is processed successfully, it is off the queue. As long as the processing can keep up, the queues won’t build up and won’t require too much space. However, message ordering can’t be guaranteed when subscribers are scaled out.

When a new subscriber comes up, we have to involve the source to bring it up to speed. New subscribers need a specific migration path, which could be challenging depending on the scale and the business domain.

In the streaming solution, the messages are stored on a stream in the order they arrive. That happens on the message transport itself. The subscriber’s position on the stream is maintained on the transport, and the subscriber can rewind or move forward on the stream as necessary. This is very advantageous for failure and new subscription scenarios. However, this depends on how long the stream is. It requires a lot more configuration from a storage perspective, necessitating the archiving of streams. Azure Event Hubs and Kafka are examples of this. Some databases, such as Cassandra, support generating events from the transaction log. Kafka is often looked at as a direct replacement for ETL jobs [6].

The typical use cases here assume that the system is able to work with somewhat stale data. If that’s not the case, we need to analyze the business domain a bit more and understand why. In our example, when the transaction update happens, the credit score update doesn’t necessarily need to happen in real time. Messaging fits very well here. We can set the user’s expectation accordingly and allow the services to manage their own load.

Parting thoughts

Every microservices architecture will be different, and there are no perfectly prescribed solutions for integration. We need to keep failure scenarios in mind when we use these patterns, and that often leads to a combination of them. For example, Netflix uses messaging to move data and falls back to a synchronous API if messaging is not available or data is still in transit [7]. In each case, the ideal is achieving the most flexible and scalable microservices architecture, but you have to consider the implementation details and your own capabilities first. The chart below shows how some integration patterns are more desirable from a microservices standpoint but carry inherent complexities that your development team must be prepared to deal with:
The key takeaway here is to use asynchronous patterns when needed for scale and autonomy. To achieve this, you need solid service boundaries and clear data ownership. Otherwise, you end up with complex and unsustainable integration scenarios. Modeling the business processes is important. That will let you know which use cases and processes are inherently asynchronous and suitable for messaging. How do we model the business processes and define the boundaries? We will look into that in future posts.


[1] – Microsoft .NET Microservices Architecture E-book
[2] – Fallacies of Distributed Computing
[3] – What are Microservices?
[4] – The Polly Project
[5] – Enterprise Integration Patterns
[6] – ETL is dead
[7] – Mastering Chaos at Netflix


Principles for microservices integration

Out of the many advantages of microservices, the most significant motivations are scale and autonomy for business units. These go hand in hand. However, we still need to create an integrated experience that makes sense for the end user. It’s important to keep both these aims in mind when developing strategies for the interactions between microservices. Those are the strategies that can make or break your effort.

How we map each microservice determines how autonomous it will be. Microservices modeled on bounded contexts [1] or business capabilities lend themselves to autonomy more naturally than ones based on technical abilities. Let’s consider the example of a banking application. Some bounded contexts it would typically have are Login and Security, Profile Management, Transaction Services (one service for debit and credit because they are closely tied), Spending Reports, and external services such as Credit Report Check or Rewards Check. These contexts may have many technical implementations that are similar: for example, logging. However, if we create logging as its own service, almost all of the other services are going to be dependent on it. It can become a linchpin: you take that down, and the business stops. Instead, we can push the logging implementation into a library, create services based on the contexts, and make use of the logging library where needed.

Mapping services in vertical business slices with their own databases is only the beginning. We still need to integrate them in a way that creates a cohesive experience and share the data between those services. How do we achieve this while maintaining the autonomy? Before we look into how we could integrate, we must first assess the myriad of interactions between individual services that will influence our integration decisions.

Creating loose coupling and high cohesion

To ensure autonomy and scale, individual services should be highly cohesive (grouping similar functionalities) and loosely coupled [2]. “Coupling” in computer science describes the interdependence between modules [3]. Loosely coupled systems share well-defined data in the form of messages, and that’s all. They don’t worry about one another’s states, uptimes, performance levels, or technical implementations.

From our banking example, if credit and debit services are separate, they become very dependent on each other because they tend to affect the same data piece: your account balance. If there’s a discrepancy between balances shown, which one is right? These services must be incredibly consistent, which results in a lot of back and forth network chatter. Instead, we can merge these two functionalities into one cohesive service and avoid the complexity.

Iterating business boundaries

Services that depend too much on other services’ data, implementations, and uptimes could be a symptom of wrong or outdated business boundaries. Businesses always change, which is why we need to revisit boundary assumptions periodically. This ensures we aren’t creating services that are too fine-grained, i.e., nanoservices [4]. These nanoservices tend to have fragmented logic and poor performance. They add a lot of maintenance overhead. Horizontal services that are based on technical implementation rather than business boundaries fall into this pit. The division of credit and debit functionalities into their own services fits this, too. There is no need to break the cohesion and introduce a network between them. What’s wrong with the network?

Knowing the network limitations

If there is a chance of something breaking down, good engineering practice demands a plan to deal with it. Communication over the network is a prime example [5]. Services across business boundaries connect with each other over the network. We need to understand the impact of this because very little outside of the business boundary is in the maintenance team’s control. Hence, we should keep network communication to a minimum. For example, in the banking application, we need to let the Spending Report service know of a debit transaction. An incorrect implementation would first call that service to ask whether such an operation is possible, or perhaps to validate the input parameters, and then make a second call to report the actual balance change. These two calls can easily be merged into one, cutting the chatter in half. This idea is based on the Tell, Don’t Ask principle [6]. All of this may seem small, but these little things add up quickly. It is essential to understand the implications of putting APIs on HTTP.
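The difference between the two call styles can be sketched as follows. The `SpendingReportService` class is a hypothetical stand-in for the remote service, and each method call on it represents one network round trip.

```python
class SpendingReportService:
    """Stand-in for a remote service; each method call represents one network hop."""

    def __init__(self):
        self.calls = 0
        self.debits = []

    def can_record_debit(self, amount):      # "ask" endpoint
        self.calls += 1
        return amount > 0

    def record_debit(self, amount):          # "tell" endpoint; validates internally
        self.calls += 1
        if amount <= 0:
            raise ValueError("debit amount must be positive")
        self.debits.append(amount)


def report_debit_chatty(service, amount):
    # Ask first, then tell: two round trips per debit.
    if service.can_record_debit(amount):
        service.record_debit(amount)


def report_debit_tell(service, amount):
    # Tell only: the receiver owns validation, so one round trip suffices.
    service.record_debit(amount)


chatty, direct = SpendingReportService(), SpendingReportService()
report_debit_chatty(chatty, 40)
report_debit_tell(direct, 40)
```

Both versions record the same debit, but the second does it with half the network traffic and leaves validation with the service that owns the data.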

Having a contract-oriented mindset

It’s important to think about the consumer of your API at all times, no matter which kind of integration we decide to go with. Code written with the service’s consumer in mind has better encapsulation and hides the implementation details very well. Test-driven development (TDD) can be helpful in this regard. With TDD, we can write the consumer contracts first and then write code that satisfies those contracts. PACT [7] can help us share these contracts between services.
It becomes tough to draw boundaries in code that’s not written this way, for example, CRUD or repository pattern based APIs. They are concerned with the database entities and span business functionalities, producing tighter coupling. At that point, redesigning them first is a better idea.
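As a rough illustration of a consumer-first contract, the sketch below checks a provider response against the fields a consumer relies on. It is a hand-rolled example and not the PACT API; the `satisfies_contract` helper, the contract, and the sample responses are all hypothetical.

```python
def satisfies_contract(response, contract):
    """True if response has every field the consumer relies on, with the expected type."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )


# Hypothetical contract: the fields the Transactions service reads from a user profile.
transactions_contract = {"user_id": int, "display_name": str}

# Extra fields are fine; the consumer only cares about what it actually uses.
provider_response = {"user_id": 42, "display_name": "Ada", "phone": "555-0100"}

# Renaming user_id to id breaks the consumer even though the data is still there.
breaking_response = {"id": 42, "display_name": "Ada"}

compatible = satisfies_contract(provider_response, transactions_contract)
broken = satisfies_contract(breaking_response, transactions_contract)
```

Running a check like this in the provider’s test suite means a breaking change is caught before deployment rather than by the consumer in production.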

Understanding CAP theorem and database technologies

The primary goal of distributed systems is to scale better. In an ideal world, the data shared by loosely coupled services could be replicated without any trouble. That would require optimal consistency, availability, and partition tolerance, which means 1) every reader gets the latest write, 2) every request receives a non-error response, and 3) because the network separates microservices, they must be able to handle an arbitrary number of messages getting dropped. However, this is restricted by the CAP theorem [8], which states that only two of these three conditions can be optimally met in any distributed system.

Because availability and partition tolerance are critical in the distributed world, we must deal with weaker consistency, as shown below:

However, consistency itself has many levels. Distributed database technologies such as Azure Cosmos DB support five of them [9]. Google Cloud Spanner, on the other hand, challenges the CAP theorem by claiming to offer high consistency along with availability and partition tolerance [10]. We need to keep these conditions in mind while deciding on database technologies for our systems.

Understanding transactions and transactional boundaries

Distributed transactions across multiple services are hard to get right because they go through multiple phases before the data is committed [11]. They require orchestration that makes the systems very fragile. After all of this hassle, we end up with something that can’t scale well, and the database choices may differ between services anyway. Now what?

Instead, we can let new database technologies such as Cosmos DB or Cloud Spanner handle the complexity behind the scenes. If that’s not an option, we can support transactional guarantees within the service boundary and generate events with the Outbox pattern for everyone else to consume [12]. Using our banking example, when a user changes their phone number in the profile, we can commit that info in the User Profile service’s own data store and generate events for the other systems to consume. After the successful consumption of that message, our Notification service can notify the user of account changes, as shown below:
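In code, the Outbox pattern for the phone number example might look like the following minimal sketch. It uses Python’s built-in `sqlite3` module as a stand-in for the User Profile service’s data store; the table layout, the `change_phone` and `relay_outbox` functions, and the event shape are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, phone TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO profiles VALUES (42, '555-0100')")
conn.commit()


def change_phone(conn, user_id, phone):
    # One local transaction covers both the state change and the event record.
    with conn:
        conn.execute("UPDATE profiles SET phone = ? WHERE user_id = ?", (phone, user_id))
        event = {"type": "PhoneNumberChanged", "user_id": user_id, "phone": phone}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_outbox(conn, publish):
    """Publish unsent events in order; mark each sent only after publish returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))   # may raise; the row then stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))


published = []                          # stands in for the message broker
change_phone(conn, 42, "555-0199")
relay_outbox(conn, published.append)
```

Because the relay marks a row as sent only after publishing, a crash between those two steps leads to the event being published again on the next run. That is at-least-once delivery, so subscribers such as the Notification service should be prepared to see the same event twice.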


Being careful with synchronous (blocking) APIs

We must consider the limitations of our networks before putting services there. Synchronous calls across services usually take place over HTTP, which can become very tricky to manage. What happens if the HTTP service goes down? How do we know it’s down? How do we handle the failures? How can we roll back synchronously applied changes? Where does the cache live? How many types of caches will be managed? One per consumer? One per call? All of these questions can result in a tangled architecture, with everyone calling each other.

Synchronous services have higher expectations on response times, making them more challenging to scale and maintain. Less is more here. Synchronous API calls usually lead to more orchestrated solutions. Sometimes we need physical obstacles to prevent incorrect usages from creeping into the systems [13]. What’s wrong with the orchestration? What can we do instead?

Considering choreography over orchestration

Any system that requires a lot of central management, or any service that plays a role of that kind, can become problematic. Such services become too important to go down. Everything is funneled through them, increasing the coupling in the system. This highly coordinated approach is known as orchestration. In contrast, a choreographed approach lets services decide what to do when an event happens. These services don’t need handholding from a central manager. Back to the banking application example: upon debiting money from your account, the Transactions service can call the Rewards service, which can call the Credit Score service, and end it with a notification. In this case, the Transactions service is sitting in the middle of everything, playing traffic cop. Instead, it can just create an “account balance changed” event and let other services subscribe to those events and finish their operations independently. The latter is a much more decoupled approach: a notification can still be sent even if the credit score service is down.

A choreographed approach can be the difference between a partial outage and a full outage, making the services themselves more robust [14].
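The partial-outage behavior can be sketched with a tiny in-memory event bus. This is an illustration under stated assumptions, not a real broker: the `EventBus` class and the handler names are hypothetical, and a real transport would give each subscriber its own queue and retry failed deliveries.

```python
class EventBus:
    """In-memory publish-subscribe stand-in for a message broker (hypothetical)."""

    def __init__(self):
        self.subscribers = {}   # event type -> list of (name, handler)

    def subscribe(self, event_type, name, handler):
        self.subscribers.setdefault(event_type, []).append((name, handler))

    def publish(self, event_type, payload):
        failed = []
        for name, handler in self.subscribers.get(event_type, []):
            try:
                handler(payload)
            except Exception:
                # One subscriber's outage does not block the others.
                failed.append(name)
        return failed


notifications = []


def credit_score_handler(event):
    raise ConnectionError("credit score service is down")


bus = EventBus()
bus.subscribe("AccountBalanceChanged", "rewards", lambda event: None)
bus.subscribe("AccountBalanceChanged", "credit_score", credit_score_handler)
bus.subscribe("AccountBalanceChanged", "notification",
              lambda event: notifications.append(event["account_id"]))

failed = bus.publish("AccountBalanceChanged", {"account_id": "A-1", "delta": -40})
```

The Transactions service only publishes the event. The credit score subscriber fails on its own, and the notification still goes out.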

Tying it all together

Considering your services structure and the complex web of interactions they inherit is the first step to building a robust microservices architecture. There are no silver bullets in software engineering, but each of these principles is a building block in the construction of a full understanding of the interactions between services. In the next part of this series, we will take a look at the types of integrations we can implement to create cohesion within your systems, and for your end users.


[1] More on bounded contexts
[2] Service definition from SOA patterns book
[3] This article does an excellent job explaining different levels of coupling
[4] More on the nanoservices antipattern
[5] Read about the fallacies of network computing
[6] The "Tell, don’t ask principle" explained with C# example
[7] Details on PACT
[8] CAP Theorem explained
[9] Consistency levels supported by Azure Cosmos DB
[10] Google Cloud Spanner
[11] Distributed transactions two-phase commit protocol
[12] Learn about Outbox Pattern
[13] Hear Eric Evans’ argument for physical separation of services in his GOTO Conference talk
[14] The routing slip pattern for messaging can also lead to more choreographed solutions

And to learn more about dealing with the fallacies of network computing in distributed systems, check out this video from Headspring's Chief Architect Jimmy Bogard: Building Distributed Systems.
(This is a cross post from my post on Headspring's blog)
