Kafka is an increasingly popular tool for building real-time data pipelines and streaming applications that are horizontally scalable, fault-tolerant, and extremely fast. To understand why Kafka achieves that fault tolerance and speed, you need to understand the protocols that drive it. For more about the decisions behind the open source tool, we recommend reading "Some Common Philosophical Questions" at the bottom of the Kafka protocol page, which provides some interesting backstory on the choices behind the very popular platform.
Some people have asked why we don't use HTTP. There are a number of reasons, the best being that client implementors can make use of some of the more advanced TCP features: the ability to multiplex requests, the ability to simultaneously poll many connections, and so on. We have also found HTTP libraries in many languages to be surprisingly shabby.
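The request multiplexing mentioned above can be sketched in a few lines. The framing below is illustrative, not the actual Kafka wire format: each request on a shared TCP connection is length-prefixed and carries a client-chosen correlation id, so responses can be matched to requests even when they come back out of order. The `encode_frame`/`decode_frame` helpers and the field layout are our own assumptions for the sketch.

```python
import struct

# Hypothetical framing sketch (not Kafka's real format): a 4-byte
# length prefix, a 4-byte correlation id, then an opaque payload.

def encode_frame(correlation_id: int, payload: bytes) -> bytes:
    body = struct.pack(">i", correlation_id) + payload
    return struct.pack(">i", len(body)) + body

def decode_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    # Returns (correlation_id, payload, remaining bytes in the buffer).
    (size,) = struct.unpack(">i", buf[:4])
    body = buf[4:4 + size]
    (cid,) = struct.unpack(">i", body[:4])
    return cid, body[4:], buf[4 + size:]

# The client pipelines two requests on one connection without waiting
# for replies, remembering each one by its correlation id...
pending = {}
wire = b""
for cid, req in [(1, b"fetch"), (2, b"produce")]:
    pending[cid] = req
    wire += encode_frame(cid, req)

# ...and even if the responses arrive in a different order, the
# correlation ids let the client pair each reply with its request.
responses = encode_frame(2, b"ack-produce") + encode_frame(1, b"ack-fetch")
buf = responses
while buf:
    cid, payload, buf = decode_frame(buf)
    pending.pop(cid)  # matched: request cid is now answered
```

This is exactly the kind of control that is hard to get from an off-the-shelf HTTP/1.x library, where each in-flight request typically occupies its own connection.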
Others have asked whether we should support many different protocols. Prior experience taught us that it becomes very hard to add and test new features if they have to be ported across many protocol implementations. Our feeling is that most users don't really see multiple protocols as a feature; they just want a good, reliable client in the language of their choice.
Another question is why we don't adopt XMPP, STOMP, AMQP, or another existing protocol. The answer varies by protocol, but in general the problem is that the protocol determines large parts of the implementation, and we couldn't do what we are doing without control over the protocol. Our belief is that it is possible to do better than existing messaging systems have at providing a truly distributed messaging system, and to do that we need to build something that works differently.
A final question is why we don't use a system like Protocol Buffers or Thrift to define our request messages. These packages excel at helping you manage large numbers of serialized messages; however, we have only a few. Support across languages is also somewhat spotty, depending on the package. More importantly, the mapping between the binary log format and the wire protocol is something we manage quite carefully, and this would not be possible with those systems. Finally, we prefer versioning APIs explicitly and checking versions over inferring new values as nulls, as it allows more nuanced control of compatibility.
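The explicit-versioning preference described above can be sketched as a request header that names its API and version up front, so a server can reject what it doesn't support instead of silently treating missing fields as nulls. The header layout, `SUPPORTED` table, and helper names below are assumptions for illustration, not the real Kafka header format.

```python
import struct

# Hypothetical version table: api_key -> set of supported versions.
SUPPORTED = {0: {0, 1}, 1: {0}}

def encode_header(api_key: int, api_version: int) -> bytes:
    # Sketch header: two big-endian 16-bit ints, key then version.
    return struct.pack(">hh", api_key, api_version)

def check_header(raw: bytes) -> tuple[int, int]:
    # Explicit check: an unknown (key, version) pair is an error,
    # rather than being parsed leniently with inferred null fields.
    api_key, api_version = struct.unpack(">hh", raw[:4])
    if api_version not in SUPPORTED.get(api_key, set()):
        raise ValueError(f"unsupported version {api_version} for api {api_key}")
    return api_key, api_version
```

Failing fast like this gives both sides a precise, testable notion of compatibility: a client knows exactly which versions a broker accepts, rather than discovering mismatches through half-parsed messages.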
The Kafka team's protocol choices tell an interesting story about the team, the technology, and the direction the wider API sector is headed when it comes to which protocols are being put to work. We recently published our modern API toolbox, which includes Kafka, as part of our effort to bring the full stack of protocols used to deliver APIs into focus. It isn't always clear what API providers are using, and the reasons for choosing different protocols, like HTTP, WebSockets, and TCP in Kafka's case, aren't always evident. Our goal is to keep tracking why API providers, API service providers, and open source tooling providers make the protocol decisions they do, and to share those stories here on the blog for everyone to learn from.