Hermes – A message broker built on top of Kafka (allegro.tech)
125 points by qluml on May 16, 2019 | 54 comments



Can someone help me understand the value proposition for Hermes? The only thing I can see is that it abstracts away producing to and consuming from Kafka. The use cases provided answer why you'd use a message broker system, but not why you'd want to do it over HTTP.

Edit: I understand HTTP is easier than Kafka, but is this something developers really struggle with when adopting Kafka? My experience is that they struggle with the nuances, behavior, and maintenance of Kafka/ZooKeeper more than anything.

I also didn't see how it dealt with concepts like exactly once delivery - any experiences in that area?


Exactly-once delivery is not a thing, and Confluent needs to stop openly lying to people about it. At-least-once delivery plus idempotence is not the same thing, and it has existed forever.

Calling it “exactly once” is marketing BS. It’s the same as Oracle claiming for years to support serializable transactions when they didn’t, except that that one is at least technically possible - they just didn’t actually support it.


They do not use the phrase exactly-once delivery as far as I’ve seen, they say “exactly-once semantics”. And this is referring to a specific set of Kafka features that you previously had to build yourself if you used Kafka. I have not seen any reference to them claiming their solution is somehow novel. In fact, it just wraps up patterns people were doing anyway.

Oracle claiming to support serializable transactions is also false (as you say), but calling it a lie is not the whole story. “ANSI serializable” and actually serializable are not the same thing. Oracle is “ANSI SQL-92 serializable”, as in no named anomalies from the spec. There just happen to be more anomalies like write skew which are not in the spec.


Then they should call it "idempotent processing". They know that when people, especially middle or upper management, ask about queues, they insist on needing "exactly once" delivery, and will never understand the FLP impossibility and its implications on messaging systems.

They are intentionally using this phrasing to obscure the fact that nothing novel is happening to push sales, and it is absolutely infuriating how well it works.


No, they do - I completely agree with the original poster on this. Confluent over-eggs the marketing. I think they may have changed to the phrase “exactly-once semantics” because Apache Pulsar labelled it that from the start and people were calling out “exactly once”.


Even in their announcement blog post, they're intentionally mixing their words - https://www.confluent.io/blog/exactly-once-semantics-are-pos...

They may never describe Kafka itself without the word "semantics", but here are some other snippets:

- "I know what some of you are thinking. Exactly once delivery is impossible" - "While some have outright said that exactly once delivery is probably impossible!"

They mix their phrasing depending on what they're talking about, and whether they are referring to Kafka directly or indirectly.


You’d be served well by reading confluent/Kafka documentation a bit more closely. They do not claim exactly once delivery, they claim (and achieve) exactly once processing (“semantics”) which is still very useful.


re HTTP specifically: one major benefit I've seen to even "just" HTTP wrappers for systems is that the HTTP ecosystem is extremely mature, even on relatively exotic languages / platforms / coding patterns / design constraints / etc.

You want load balancing, context propagation, multiplexing, proxying, authentication, request tracing, [anything from a truly gigantic list, both in-code and around-your-system]? HTTP has it. Probably several. And they probably already work with everything you already have, and happily run unattended for years.

Kafka... might? Kafka for language X.... might? But probably not.

You want to extend Kafka to add X between Y and Z? Does the protocol even allow it? HTTP does, choose your flavor. Odds are even decent that a fair number of your engineers have already heard of or used it.

---

There are benefits to specialized protocols, absolutely. But there are also benefits to letting everything just use the same robust HTTP client as everything else.
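
To make that concrete, here is roughly what "just use HTTP" looks like from the producer's side - a minimal Python sketch assuming a Hermes-style publish endpoint of the form POST /topics/{topicName} (host, topic name and payload here are illustrative, not from the article):

    import json
    import requests

    event = {"orderId": "1234", "status": "CREATED"}

    # Any off-the-shelf HTTP machinery (load balancers, retries, tracing,
    # auth proxies) applies to this call with zero broker-specific code.
    resp = requests.post(
        "http://hermes-frontend.example/topics/orders",  # hypothetical host/topic
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
        timeout=2,
    )
    resp.raise_for_status()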


Exactly what I thought. It can alleviate the need to use flaky Kafka clients in some languages, but kind of disappointing in that it doesn't soften the main pain point of Kafka: operational and cognitive load.


I am not too familiar with Hermes, but there is a lot of power in exposing an HTTP endpoint. I think you are thinking too inside the box here. The benefit is not adoption by developers or an inability to understand how Kafka works with more native libs.

The value here is 1) publishing messages from more unique sources - perhaps allowing your clients to publish messages - and 2) being able to enforce additional guarantees: does the message conform to a defined schema?
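
As a sketch of point 2, the gateway can validate each payload against a schema before anything reaches Kafka. The schema, field names and the jsonschema library choice below are illustrative:

    from jsonschema import ValidationError, validate

    # Hypothetical schema for an "order" event
    ORDER_SCHEMA = {
        "type": "object",
        "required": ["orderId", "status"],
        "properties": {
            "orderId": {"type": "string"},
            "status": {"type": "string", "enum": ["CREATED", "PAID", "SHIPPED"]},
        },
    }

    def accept(payload: dict) -> bool:
        try:
            validate(instance=payload, schema=ORDER_SCHEMA)
            return True   # safe to hand off to the broker
        except ValidationError:
            return False  # reject with a 4xx instead of poisoning the topic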


Developers can do both pretty easily today without too much effort. The TCO of Kafka clusters is considerably higher than the cost of developing an API that marries a POST to a Producer. It feels like you're saving time on one of the easier parts of the solution.

Totally open to hearing more out of the box ideas. I don't think Hermes is bad for existing, I'm just not envisioning why I'd recommend it to someone.


It's a scale thing when you want a ton of different endpoints.

This is one of the things Google Pub/Sub and SNS can do for you (and if you use e.g. App Engine, this is also the queueing model). It can sometimes be an easier model for bootstrapping message passing.


As someone evaluating Kafka for the first time, it would be useful to know what Hermes provides beyond what Kafka already does. After glancing at the homepage I see the REST API and the fact that it is push based. Honestly I don't see how it would fit my use case, but it's an interesting project nonetheless.

> exactly once delivery

Kafka is known to provide exactly-once semantics - given that your producers and consumers follow some rules, notably being idempotent. When ingesting via the Kafka Streams API, it is actually exactly-once delivery. There are a couple of posts on confluent.io explaining how they achieve this (sorry, currently on mobile, can't copy-paste without having an outburst about how unusable touch devices are for me).
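
For reference, this is roughly what turning those features on looks like with confluent-kafka-python - broker address, topic and transactional.id are placeholders:

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "enable.idempotence": True,         # retries can't create duplicates
        "transactional.id": "orders-svc-1", # allows atomic, committed writes
    })

    producer.init_transactions()
    producer.begin_transaction()
    producer.produce("orders", key="1234", value=b'{"status": "CREATED"}')
    producer.commit_transaction()

    # Consumers configured with isolation.level=read_committed only see
    # committed writes - exactly-once *processing*, not exactly-once delivery.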


Sure, I'm aware of how Kafka works - this is an abstraction over all that, so I'm curious if and how Hermes can provide the same behaviors. I didn't see any relevant documentation at a quick glance.


I see a bit of their reasoning. Taking from the article:

"When you have an environment with 20+ services, code sharing, maintenance and following updates become problematic. At Allegro we had the chance to find it out. It’s better to take out dependencies from business services as much as possible."

By adopting something like Hermes you are not really "taking out" a dependency, but abstracting it - as you said. Yet, by talking HTTP to the message broker you are abstracting away Kafka from your developers and your code. One less lib to depend on for each language you have in your architecture. One less version to control among your services, etc.


If you already have HTTP clients on all your components.

And it's not like Kafka bindings are that much less common than a decent HTTP client.


You can use HTTP client to contact any HTTP compatible service. You can only use Kafka client to connect to Kafka.

That sounds like a tautology, but it matters in practice. There is a much higher chance that you can make any of your future services speak HTTP than that all of them will speak Kafka. If you add HTTP client libs today, there's a higher chance of reuse than if you add Kafka libs.


I only skimmed the write-up on it, but knowing Kafka fairly well, the following could potentially work better than what is included with Kafka:

- push model (dependent on use case)

- filtering

- throttling/rate negotiation

- exactly once (Kafka out of the box does not dedupe on the broker)


Having just completed a project using Kafka as an event pipeline and a data store, one of the issues we found was that consumer polling takes a large chunk of resources.

Having a push model for consumption would certainly remove some of the complexity we had to deal with for scaling out consumption.
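
For comparison, a push-model subscriber can be as small as an HTTP handler that the broker POSTs into - a rough Python/Flask sketch, with the route and the process() helper being hypothetical:

    from flask import Flask, request

    app = Flask(__name__)

    def process(event):
        print("got event", event)   # your business logic goes here

    @app.route("/events", methods=["POST"])
    def handle_event():
        process(request.get_json(force=True))
        # a 2xx acks the delivery; a broker like Hermes will typically
        # retry the push on non-2xx responses
        return "", 200

    if __name__ == "__main__":
        app.run(port=8080)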


Hi there, I was the technical team lead of the Hermes team for ~four years, before Łukasz (the author of the blog post) took over. Thanks for taking the time to read, think and write about our product :)

Our value proposition is built on four main aspects:

* ease of integration

* easier Kafka management

* centralised management and validation

* increased stability / reliability

Mind that some of the points don't make much sense unless you have a lot of services managed by a lot of independent teams - thus Łukasz's remark about "20+ microservices" in the original post. We run 700 microservices on prod, managed by something close to 70 teams.

Ease of integration has been nicely summed up by others in this thread. HTTP tends to be the simplest way to integrate anything nowadays, at least in our case. While this comes at a cost, being able to get projects started very quickly, without getting into the details of properly handling Kafka producer/consumer clients, provided great value for us. Also, while this history might not matter so much when considering Hermes in 2019 - Kafka has matured and gained traction and recognisability among software engineers - it wasn't so easy to handle Kafka back in the 0.7/0.8 days when we started.

Of course switching to HTTP comes at a cost. I think the biggest one is using pure HTTP in the push model. This makes it impossible to take advantage of Kafka's data model, which guarantees event ordering at the partition level. Zalando took a different approach with Nakadi (https://github.com/zalando/nakadi). I would say that at some point Hermes should consider following this path for more advanced users.

Easier Kafka management. Since we abstract away Kafka and hide it behind an HTTP/REST API, we can easily introduce many changes to Kafka clusters. One of them was splitting a Kafka cluster into two separate ones (one per DC) without clients noticing. They were still publishing to the same old Hermes instances, discovered via Consul. While doing this with clients might seem trivial when you have just a few services that use Kafka, with a few hundred clients it generates a lot of unnecessary work for developers.

Now whenever we need to do some maintenance on the Kafka clusters (rebalance partitions, change cluster/hosts etc.) we just route the traffic at the Hermes level, and no interaction with clients/developers is necessary.

Centralised management and validation. We started with publishing JSON. Along the way, as more and more people started consuming data offline (from Hadoop), it turned out that moving to a structured/schema-based format was necessary, thus Avro. Hermes helped us a lot with this. It enables us to fail fast when someone starts publishing malformed requests for whatever reason, instead of relying on consumers (online and offline) to be hit by bad data and having to chase down the producer. Secondly, support for Avro on the JVM (our main microservice platform) is not that great and we put a lot of effort into making it better (including publishing https://github.com/allegro/json-avro-converter). By having Hermes do on-the-fly conversion for both publishers and subscribers, teams only have to define a schema and can deal with Avro as little as possible in simple cases where it might not be beneficial for them.
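
For readers outside the JVM, the conversion idea itself is simple - here is a hedged Python sketch using fastavro and a made-up schema, rather than the json-avro-converter library mentioned above:

    import io
    import fastavro

    schema = fastavro.parse_schema({
        "type": "record",
        "name": "Order",
        "fields": [
            {"name": "orderId", "type": "string"},
            {"name": "status", "type": "string"},
        ],
    })

    json_payload = {"orderId": "1234", "status": "CREATED"}

    buf = io.BytesIO()
    # Fails fast if the payload does not match the schema - the same
    # "reject at the edge" behaviour described above.
    fastavro.schemaless_writer(buf, schema, json_payload)
    avro_bytes = buf.getvalue()   # this is what would land in Kafka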

We also have Hermes integrated with our Service Catalog, so we can easily track ownership of topics and subscriptions. Publishers have easy access to information not only about who subscribes to the online data, but also about who accesses the data offline (via Hadoop) using our offline clients feature. This way Hermes provides a central place to manage our data streams.

Increased stability/reliability. This last one might be controversial, but in practice it did save us a few times. Mind that I mean increased (more nines), not totally bulletproof. Kafka is a great, resilient piece of software. It is also complex, and incidents happen. It might not even be that the cluster is down - response times increasing from a few ms to 1 second can be just as deadly. Hermes Frontend, on the other hand, is really simple. By putting it in front of Kafka, together with built-in buffering support, we added a layer which increased our reliability. Now even if a Kafka cluster has huge problems, we can accept incoming events for 2-3 hours, giving us time to either resolve the issue or reroute traffic to another cluster. This means that microservices don't have to deal with data buffering on their own. Of course Hermes is still pretty much stateless by itself, so when traffic to Kafka flows normally, we can restart, spin up and spin down instances at will.

Entering the danger zone: if Kafka goes down and the Hermes hosts blow up at the same time, the data is lost. This is a trade-off, and we are happy to say that in the years of running Hermes + Kafka in production this has never happened, and the buffering has saved us a few times.

I hope I managed to clarify why we are using Hermes as the main message bus powering our microservice architecture. We open sourced it because we wanted to do our work in the open, sharing it with anyone who finds it useful and beneficial :)


This sounds very interesting. Did anyone get from the homepage what kinds of guarantees this offers? What if the HTTP endpoint that Hermes should push the data to is down? Does it retry? If so, for how long?


Answering my own question (RTFM): https://hermes-pubsub.readthedocs.io/en/latest/user/subscrib...

You can configure how long it retries and with what strategy.


If any of you are looking for messaging and streaming under a single system, pulsar.apache.org supports both, and it's a lot more reliable and scales better than Kafka.


Very nice - a simple wrapper. That said, I’m wondering if 9 out of 10 use cases could do with something simpler, e.g. ZeroMQ, which scales really well.


Can someone provide actual high-level use cases for using Kafka? Preferably use cases not handled by RabbitMQ.

I've seen a few talks about Kafka but they focused on the internals. My guess is that Kafka is for large systems for which managing a multi-node RabbitMQ cluster is too much trouble.


I’ve long had the inverse view - I’m not sure what good use cases there are for Rabbitmq that couldn’t be handled better by a Kafka cluster.

One company I worked with used Kafka as their central source of truth across the organisation. All events generated by users were thrown into a massive Kafka cluster. Each team in the organisation cared about a different view into that data (financials, marketing, fraud, what we display to that user on the website, etc). Each team would ingest the same kafka queue and do different things with it - often consuming certain events into their own Postgres instance, or other things like that.

I used Kafka when I made my reddit r/place clone a few years ago because it gives great read and write amplification. With Postgres as a central source of truth, you can only handle thousands of writes per second. And reads will slow down the instance. With Kafka you can handle about 2M/sec. And reads can really easily be serviced from other machines - you can just have a bunch of downstream Kafka instances consuming from the root, and serving your readers in turn.

It may be that you can also solve all these problems with a well configured rabbitmq cluster. But coming from a database world I find it more comfortable to reason about architecture, performance and correctness with Kafka.


Size? If you’re getting less than a few hundred events a minute is it worth setting up Kafka?


This is the main reason I don’t use much Kafka in my own projects. I hope at some point someone makes a redis equivalent of Kafka for small projects.

Is Rabbit much easier to set up for small projects? I haven’t used it much.


You might be interested in Redis Streams[1], it's basically Kafka in Redis.

[1] https://redis.io/topics/streams-intro
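
A small sketch of what that looks like with redis-py - the stream, group and field names are made up:

    import redis

    r = redis.Redis()

    # Producer side: append to the stream (roughly "produce to a topic")
    r.xadd("orders", {"orderId": "1234", "status": "CREATED"})

    # Consumer side: a consumer group with explicit acks, similar in
    # spirit to a Kafka consumer group
    try:
        r.xgroup_create("orders", "billing", id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists

    entries = r.xreadgroup("billing", "worker-1", {"orders": ">"}, count=10, block=1000)
    for _stream, messages in entries:
        for msg_id, fields in messages:
            # ... handle fields ...
            r.xack("orders", "billing", msg_id)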


If you're in AWS, you can use Kinesis, which is similar to Kafka. It also ties into a lot of their other offerings, such as:

* S3 - use Kinesis Firehose to take the contents of your Kinesis stream and time-partition it into files, either for ingestion into Redshift, Elasticsearch, etc., or for later batch analysis for ML, or just to treat as cold, searchable storage with something like Athena

* DynamoDB - spit data from DynamoDB into Kinesis as it changes to create a change stream used elsewhere in your platform (DynamoDB Streams)

* real-time analysis - perform real-time SQL analysis (Kinesis Analytics) on what's in your stream over a given window of time or data, and react as events/situations occur.

Looking at all the services that Amazon has built around Kinesis might help you understand some of the differences between Kafka and something like RMQ.
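
At its core the producer side looks a lot like Kafka's keyed publishing - a boto3 sketch with a placeholder stream name (assumes AWS credentials are configured):

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    kinesis.put_record(
        StreamName="orders",   # hypothetical stream
        Data=json.dumps({"orderId": "1234", "status": "CREATED"}).encode(),
        PartitionKey="1234",   # same key -> same shard, preserving per-key order
    )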


Sounds like your org used Kafka for event sourcing. This is almost always a bad idea; event sourcing and aggregate reconstruction is a nightmare IMO.

Kafka used as a pure FIFO cache for regular CRUD endpoints works fine


Event sourcing was one of LinkedIn's use cases when they created it; Kafka is fine for all logging needs.


Yes; they did. It worked pretty well actually.

Why do you think it’s a bad idea? Most of the arguments against event sourcing that I’ve read seem to be “yes but the tooling isn’t very good”. That might be true, but maybe we solve that problem with more investment into event sourcing; not less of it.


TLDR: the tooling is so bad it's basically impossible to run at scale. I worked for a company that tried. Maybe on a small scale it's fine, but replays and storage of past events take insane amounts of space at high event rates - to the point that storage costs and replay times became a real problem (many terabytes and days).

I also don't think it's a great idea in general. The event stream directly replicates a DB commit log, and the aggregates replicate your tables. It's building your own database.

We had to throw a year's worth of work away at the end so I'm fairly biased against trying it until the ecosystem is better.


Kafka is a high-throughput, horizontally-scalable blob data store for data streams. The data store part of that is my favorite part.

You can use it as a simple message broker, but since it keeps the message history as a timeseries, you can also do things like run batch analysis jobs on the day's messages or replay the last X hours of messages because your DB died and your backup is old.
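
The replay part is mostly a matter of seeking by timestamp - a rough confluent-kafka-python sketch, with the broker, topic name and partition count as placeholders:

    import time
    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "replay-" + str(int(time.time())),  # fresh group, no stored offsets
        "auto.offset.reset": "earliest",
    })

    # Ask the broker which offsets correspond to "6 hours ago" per partition
    six_hours_ago_ms = int((time.time() - 6 * 3600) * 1000)
    wanted = [TopicPartition("events", p, six_hours_ago_ms) for p in range(3)]
    offsets = consumer.offsets_for_times(wanted, timeout=10)

    consumer.assign(offsets)  # start reading from roughly 6 hours back
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            break
        # ... rebuild downstream state from msg.value() ...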

It is a good way to decouple data producers and data consumers, particularly in an enterprise context - producers push to Kafka and anyone can consume that data, whether they are an operations team that wants a realtime data stream, a BI team that needs periodic data dumps, or a team that wants a long-term audit trail (the duration of the history is going to depend on your scale, but for many users a long history is realistic).

Kafka also has a nice ecosystem including streaming analytics (KSQL), clients that make reading from Kafka easily horizontally scalable (have many machines acting as a single client, automatically rebalancing if one of those machines dies), exactly once processing and probably more since I last worked with it.

I'm not familiar enough with RabbitMQ to say how it compares to Kafka, but I haven't found a use case yet where Kafka isn't a good choice (except for the 'I need to set up a message broker quickly and painlessly' use case, because it is not a particularly fun technology to manage yourself).



I skimmed over it, but it's again mostly about internal design. The high-level use case I see is "publish-subscribe", which is handled by RabbitMQ and a dozen other solutions.

One use case I see is that the events published into Kafka are persisted, so e.g. some component can see a history of certain events (so this is something not handled by RabbitMQ). Is that right?


Event streams between decoupled systems is kind of the sweet spot for Kafka. It's extremely easy to scale horizontally, and handles distributed work and network partitions in an easy to reason about way. I've also seen Rabbit be the bottleneck before where I've never really seen Kafka be the bottleneck in an architecture - it's very analogous to a firehose. For organizations shuttling messages and events between teams, it makes a very convenient lingua franca.


I often do the same thing... skim through articles and papers to get the gist. Trust me, this is one to read all the way through.


That's correct - being able to "rewind" a history of a topic (queue) is a powerful concept. But, Kafka is a bit harder to operate than RabbitMQ in my experience. (Somewhat related, one of our subsystems originally was built around Kafka, but later was migrated to RabbitMQ and Postgres)


A financial exchange - order messages are routed to Kafka and partitioned by the instrument's symbol; match engines associated with a given set of symbols consume from their assigned partitions. When a match engine goes down it can reconstruct the order book by replaying from a given offset.
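
A minimal sketch of the producer side of that pattern (confluent-kafka-python, placeholder broker/topic): keying each order by its symbol pins all orders for that instrument to one partition, which is what gives the match engine an ordered, replayable log:

    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def submit_order(symbol: str, order: bytes):
        # same key -> same partition -> per-symbol ordering for the match engine
        producer.produce("orders", key=symbol, value=order)

    submit_order("AAPL", b'{"side": "BUY", "qty": 100, "px": 187.5}')
    producer.flush()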


It's basically a high-speed transaction log - persistent, distributed, easily scalable - that happens to store messages, so it does message brokering very well.


Increasingly, all technology news sounds like:

"X: a blazingly fast X built on top of {something I vaguely thought did X}"


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


Hermes seems to wrap Kafka with HTTP to make integrating it simpler (and I think less robust).


Don't forget "modern".


Ironically: Hermes was the name of a JMS user interface/testing tool.


It's also the name of a hilariously bad courier in the UK - so bad in fact that most people call it Herpes instead.


It’s also the name of a bureaucrat on a certain tv show.


curiously, nobody mentions the clothing designer :)


thought it was a handbag designer.


I think it's more a case of the same inspiration (Hermes, the Greek messenger of the gods) than irony.


also the god of thievery and tricksters.


My thoughts exactly. (I immediately remembered the old-time JMS UI. I think it was on SourceForge.)




