Ask HN: Has anyone fully embraced an event-driven architecture?
284 points by sideway on Aug 3, 2021 | 168 comments
After reading quite a few books and blog posts on event-driven architectures and comparing the suggested patterns with what I've seen myself in action, I keep wondering:

Is there any company out there that has fully embraced this type of architecture when it comes to microservice communication, handling breaking schema changes or failures in an elegant way, and keeping engineers and other data consumers happy enough?

Every event-driven architectural pattern I've read about can quite easily fall apart and I have yet to find satisfying answers on what to do when things go south. As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one.

Is there any non-sales community of professionals discussing this topic?

Any help would be much appreciated.




You're not going to like the answer, but I think it captures some of what you're getting at.

Windows 95. Old-style GUI programming meant sitting in a loop, waiting for the next event, then handling it. You type a letter, there's a switch statement, and the next character is rendered on the screen. Being able to copy a file and type at the same time was a big deal. You'd experience the dead-letter queue when you moved a window while the OS was handling a device, and the window would sort of smear across the screen when the repaint events were dropped.

Concurrent programming is hard. State isolation from microservices helps a lot, but eventually you'll need to share state, and people try stuff like `add 1 to x`, but that has bugs, so they say `if x == 7, add 1 to x`, but that has bugs, so they say `my vector clock looks like foo; if your vector clock matches, add 1 to x, and give me back your updated vector clock`, but now you've imposed a total order and have given up a lot of performance.
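
To make that middle step concrete, here's a rough sketch (names invented, a single version counter standing in for the vector clock) of "if your clock matches, add 1 to x":

    import threading

    class VersionedCounter:
        """Shared state guarded by optimistic concurrency: an update only
        applies if the caller saw the latest version."""
        def __init__(self):
            self._lock = threading.Lock()
            self.value = 0
            self.version = 0

        def read(self):
            with self._lock:
                return self.value, self.version

        def increment_if(self, expected_version):
            # "my clock looks like foo; if yours matches, add 1 to x"
            with self._lock:
                if self.version != expected_version:
                    return False, self.version   # stale view: re-read and retry
                self.value += 1
                self.version += 1
                return True, self.version

    c = VersionedCounter()
    _, v = c.read()
    print(c.increment_if(v))   # (True, 1)
    print(c.increment_if(v))   # (False, 1) -- someone got there first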

I'm blind to the actual problem you're facing. My default recommendation is to have a monorepo, and break out helpers for expensive tasks, no different than spinning up a thread or process on a big host. Have a plan for building pipelines a->b->c->d. Also have a plan for fan out a->b & a->c & a->d.

It has been widely observed that there are no silver bullets. But there are regular bullets. Careful and thoughtful use can be a huge win. If you're in that exponential growth phase, it's duct tape and baling wire all the way; get used to everything being broken all the time. If you're not, take your time and plan out a few steps ahead. Operationalize a few microservices. Get comfortable with the coordination and monitoring. Learn to recover gracefully, and hopefully find a way to dodge that problem next time around.

Sorry this is hand-wavy. I don't think you're missing anything. It's just hard. If you're stuck because it won't fit on 1X anymore, you've got to find a way to spread out the load.


I fully agree with this; it's also still quite common in the embedded world.

The user presses a button which sets a hardware event flag. CPU wakes up from sleep, checks all event flags, handles them, clears the interrupt bits and goes back to sleep.
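
Not real firmware, but the shape of that loop sketched in Python (threading.Event standing in for hardware flags set by ISRs; names are made up):

    import threading, time

    # stand-ins for hardware event flags set by interrupt handlers
    button_pressed = threading.Event()
    timer_elapsed = threading.Event()

    def handle_button(): print("button handled")
    def handle_timer(): print("timer handled")

    def main_loop():
        # runs forever; ISRs (other threads here) call button_pressed.set()
        while True:
            # wake up, check all event flags, handle them, clear them...
            if button_pressed.is_set():
                handle_button()
                button_pressed.clear()
            if timer_elapsed.is_set():
                handle_timer()
                timer_elapsed.clear()
            time.sleep(0.01)   # ...and go back to sleep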

But using events like this requires a very tight integration between event producer and consumer, so I don't think this will translate well to distributed systems or microservices.


There is an example of highly distributed, event-driven systems that we use every day: cars.

In a car there are many distributed ECUs (1) that communicate in an event-driven system. Cyclic communication has been tried, but all those attempts failed in the long run, because cyclic communication would require all the independent ECUs to be synced to each other, which is a very hard problem. That is the reason why everybody moved away from cyclic communication.

That said, to help development in automotive, middleware has been developed and used to support such event-driven systems. You develop your functions and define how the signals are routed between those functions. Then someone else, later and independently, decides how the functions are distributed onto the different ECUs depending on the available resources. The middleware then takes care of the correct routing of the signals. Everything is event driven.

(1) Electronic Control Units


Automotive ECUs use cyclic communication and are event driven. ECUs send their signals over CAN at a defined message rate, regardless of whether the state has changed. Other ECUs set up their hardware to monitor the signals they care about, and trigger a hardware interrupt when the signal is received.


ECUs do both cyclic and event-driven, but typically the software logic is event driven, the communication is cyclic.

There's cyclic communication by sending messages over CAN (if reliability doesn't matter that much) or FlexRay (if reliability matters), so if you examine the network, you'll typically see the same PDUs repeat in a cyclic manner. But an individual ECU will handle the bulk of its logic in an event-driven way. An interrupt will trigger for the incoming CAN frame (and that's an event-driven thing!) but a couple of layers up it will most likely just set a flag for what changed, and it will actually be taken care of on a logical level when the relevant task's loop runs next time.


> […] and trigger a hardware interrupt when the signal is received.

That is the definition of event driven.

The same is true not only for CAN-based and CAN FD-based communication, but also for Automotive Ethernet-based communication.

There were attempts with time-triggered protocols like TTP/C (used by Daimler for exactly one model) and FlexRay, which had cyclic communication. The communication cycle required that the ECUs be synced to it, because they needed to have the correct data available at exactly their time slot. If they missed the time slot, the data got marked as stale. The same problem existed on the receiving side.


Electronics can be more trouble than they're worth in cars. If they made modern cars with no electronics in the engine or controls, maybe I'd buy new cars. It's frustrating when my old Cherokee won't start because the antitheft system is having some kind of conniption for example, and I have to go through this ritual of locking and unlocking the doors, reconnecting the battery with the key in the ignition, flipping this mystery switch the previous owner has no clue about but seems to contribute somehow. There's a process to disable this system by buying a new ECU or something but I haven't gotten around to it.


I exclusively drive modern cars, and have never experienced anything along the lines of what you describe. So maybe the problem isn't with modern cars, but old Cherokees ;)


Or you could just get your car fixed


> But using events like this requires a very tight integration between event producer and consumer, so I don't think this will translate well to distributed systems or microservices.

Though I don't think tight integration between event producer and consumer is necessarily antithetical to microservice design. See the consumer-driven contract pattern, for example: https://www.martinfowler.com/articles/consumerDrivenContract...


It's good to hear from an embedded dev. Embedded is so under-represented that people love to argue about these things as if low-level devs didn't use them every day.


>> I fully agree with this, also that's still quite common in the embedded world.

Is this why so many gas pumps and ATMs have such horrible delays in their UI?


Partially. Also, the hardware is generally old. Really old. Imagine hardware from the '00s, and there's a good chance that's what's inside the last ATM you used.


> Windows 95. Old-style GUI programming meant sitting in a loop, waiting for the next event, then handling it.

The GUI worked in a single thread, but your whole program didn't need to do that.

> Being able to copy a file and type at the same time was a big deal.

That's not correct. You just needed to create a separate thread for the file operation. For some programmers that was a big deal indeed, and the same could be said for some programming tools. But that wasn't the general case at all.

There were some ugly things, like the way file operations were treated down at the OS level, but it wasn't impossible to make your application responsive. It wasn't even difficult... if you knew how to do it.


I am not sure about Windows 95, but the MS-DOS legacy is single-threaded, so you did not have the luxury of having that abstraction provided for you by the operating system.

Some people wrote very inventive software to get around it.

In Windows there is a fundamental concept of the message loop.

The message loop is an obligatory section of code in every program that uses a graphical user interface under Microsoft Windows.

Windows programs that have a GUI are event-driven.

Note that Windows NT is fully multithreaded, but there is still a thread sitting there listening to the message loop (of course other threads can do other things).

https://en.wikipedia.org/wiki/Message_loop_in_Microsoft_Wind...


> I am not sure about Windows 95...

I am because I did a few multithreaded programs with it, including what the OP called "a big deal".

Not sure what your point is, though.


OP is talking about what you could do using the built-in Windows utilities, not what is possible if you write your own file copy-and-note-taking utility (which some people did, like Borland Sidekick).


I guess what you call "built-in Windows utilities" is making a call to the GUI shell. It's actually a very complex inter-process communication that happened to be a one-liner from VB and showed that fancy flying-folders animation. Not very flexible though.

Writing "my own" code for file copy was not really so low-level, it's what most people understood as programming at the time. Locating the file, opening it, using a descriptor and a loop to copy blocks through a buffer, closing the file, managing errors... that kind of thing.

If you wanted to do the file operation in the background and keep using the GUI for input, you did need to create a separate thread. But that wasn't some black art feat, you just read a book or searched Altavista for a snippet, if in a hurry.


> Have a plan for building pipelines a->b->c->d. also have a plan for fan out a->b & a->c & a->d

What does that look like? Do you mean to learn how this is done with the CI of choice, create helper functions, or are there concrete steps that you would recommend? I'm new to this and would appreciate any feedback.


Let's assume a has a time constraint, and needs to start shoveling bytes back to the client in 30ms.

In the pipeline case, a probably only has visibility to b, each service in the chain has ~7ms to provide its response.

In the fan out case, a is aware of b c and d, they can each take 25ms or more and a can still meet its deadline.

The way a sets its timeouts will vary depending on what's happening downstream.

This is a pedantic and fussy point of view. Nobody really does this; some pretend they do. But it can be important for overall performance.
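
A toy asyncio illustration of the budgeting difference (service names and latencies made up): in the pipeline each hop gets a slice of a's 30ms budget, while in the fan-out each branch can spend nearly the whole budget.

    import asyncio

    async def call(service, ms):
        await asyncio.sleep(ms / 1000)   # stand-in for an RPC to a downstream service
        return f"{service} ok"

    async def pipeline(budget_ms=30):
        # a -> b -> c -> d: every hop eats into the same budget, ~7ms each
        per_hop = budget_ms / 4 / 1000
        results = []
        for svc in ["b", "c", "d"]:
            results.append(await asyncio.wait_for(call(svc, 5), timeout=per_hop))
        return results

    async def fan_out(budget_ms=30):
        # a -> {b, c, d} in parallel: each branch can take ~25ms and a still answers in time
        calls = [call(svc, 25) for svc in ["b", "c", "d"]]
        return await asyncio.wait_for(asyncio.gather(*calls), timeout=budget_ms / 1000)

    print(asyncio.run(pipeline()))
    print(asyncio.run(fan_out()))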


Ohh, you're talking about call pipelines; I totally missed the point. You are talking about time-constrained systems where an answer within the timeframe is required, right? Otherwise the timeout does not make sense, or would you use such precise timeouts even in non-time-constrained systems?

Anyway, seems to make sense to plan for this timing-wise. Allows addressing and seeing performance bottlenecks. Thank you for spelling it out for me!


Failure too. Recovery in a pipeline is different than fan out.


Aren't most "events" just an abstraction that's actually implemented by polling and/or an event queue at some level anyway, be it a library or the kernel?


I mostly agree with this answer. I also want to point out that we use sharding as a technique.

Micro-services is just sharding across a different axis (TM)


React also works quite similarly, with the added benefits of async and a single thread.


As someone who worked at the company that taught Pete Hunt and other early contributors to React this trick, I wouldn't call it a benefit, but the only workable solution to acceptable single-page web application performance in an otherwise terrible development environment (browser JavaScript).

Proper concurrency support and real mutexes would be way better than having a single execution thread for all your computations that is shared with all the visual and human-interaction computations. Plus, the data-sharing models between the main thread and web workers are pretty crap for anything serious, so it's not even easy to get computations out of the main thread that don't need to be there.


I don’t think I agree. I can’t think of many tasks I’ve ever tried to solve in programming where moving the work to another thread was the right answer. Usually things are too entangled, or too fast to bother. And there’s so much overhead in threading, from thread creation and destruction to message passing with mutexes or whatever, and you need to figure out how to share data in a way that doesn’t accidentally give you lower overall performance for your trouble. I’d rather spend the same time just optimising the system so it doesn’t need its own thread in the first place.

The mostly single threaded / async model is a hassle sometimes. But I find it much easier to reason about because I know the rest of my program has essentially paused during the current callback’s execution.


I've also found few use cases for workers in the browser, but as the parent pointed out, that's partly because of the limited and sluggish communication between threads in javascript. Just to offer an opposite perspective, server-side, almost any modestly heavy processing I do with nodejs happens in worker threads. These are pooled so they don't need to spin up. Blocking node's event loop is never an option. Fine to access a big dataset with an async call, but if you need to crunch that data before passing it to the client you pretty much have to use workers. [edit] Obviously assuming you're not calling some command-line process, and you're trying to chew through some large rendering of data using JS.


I find Clojurescript to have superb async support and ecosystem.


Like everyone getting frustrated with the buggy core.async and just using JS promises directly?

In Clojure/JVM core.async has fewer problems though.


Never ran into problems when I was using it ~5 years ago.


Unless something has changed with web assembly since I last did browser development, doesn't it still compile down to javascript in the same execution environment with those limitations?


I'm not very familiar with concurrency at a lower level, so I can't comment on your complaints. What Clojurescript provides is considerably better abstractions that in turn get you correct implementations with much less incidental complexity or hassle.


Not sure why my sincere question about the state of browser execution environment got downvoted, but the general concerns I raised in my first comment weren't about incidental complexity or hassle. They matter and it's great that clojurescript helps with those two. The concern I voiced was about maintaining realtime performance of a responsive user interface. The key issue is that you have a single event loop that all visual and human interactions need to be handled by because that's the only thread that can interact with the DOM. It's just way too easy to block that thread and cause the app to become unresponsive.


> The key issue is that you have a single event loop that all visual and human interactions need to be handled by because that's the only thread that can interact with the DOM.

Isn't that normal? A single event/drawing loop. Would there be an advantage to two threads drawing? Sounds pretty novel.


Like others have said, it is just one tool in the tool box.

We used Kafka for event-driven microservices quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within the topic. Same as you would expect from any other public-facing API. We also didn't allow multiplexing the topic with multiple schemas. This wasn’t just because it made my life easier. A large portion of the topics we had went on to become analytical tables in Hive. Breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old. This puts a lot of onus on the producer, so we tried to make tools to help. We had a central schema registry, with the topics the schemas paired to, that showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice though, we never got much pushback on the no-breaking-changes rule.
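
Our registry was internal, but the same no-breaking-changes gate can be sketched against a Confluent-style schema registry's REST compatibility endpoint; the registry URL, subject name and fields below are made up:

    import json, requests

    REGISTRY = "http://schema-registry.internal:8081"   # made-up URL
    SUBJECT = "driver_status-value"                      # made-up subject

    new_schema = {
        "type": "record",
        "name": "DriverStatus",
        "fields": [
            {"name": "driver_uuid", "type": "string"},
            {"name": "new_status", "type": "string"},
            {"name": "old_status", "type": "string"},
            # adding a field with a default is a compatible change; dropping one is not
            {"name": "city_id", "type": ["null", "long"], "default": None},
        ],
    }

    resp = requests.post(
        f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        json={"schema": json.dumps(new_schema)},
    )
    resp.raise_for_status()
    if not resp.json().get("is_compatible"):
        raise SystemExit("breaking change: make a new topic instead of mutating this one")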

DLQ practices were decided by teams based on need; there are too many things to consider to make blanket rules. Where in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have overwritten this event? Are you paying for some API that your auto-retry churning away in your DLQ is going to cost you a ton of money? Sometimes you may not even want a DLQ, you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.

I hope one of the books you are talking about is Designing Data Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.


Thanks for your detailed answer, really appreciate it.

Two follow up questions if you don't mind me asking, even though I understand you were not on the publishing side:

1. Do you know if changes in the org structure (e.g. when Uber was growing fast and - I guess - new teams/products were created and existing teams/products were split) had a significant effect on the schemas that had already been published? For example, when a service is split into two and the dataset of the original service is now distributed, what pattern have you seen working sufficiently well for not breaking everyone downstream?

2. Did you have strong guidelines on how to structure events? Were they entity-based with each message carrying a snapshot of the state of the entities or action-based describing the business logic that occurred? Maybe both?

And yes, one of the books I'm talking about is indeed Designing Data Intensive Applications and I fully agree with you that it's a fantastic piece of work.


For 1, no example really comes to mind, but I guess there could be cases where a service went from publishing an event with all of its related data, then split into a service where that becomes more expensive to do (like that data is no longer in memory, it's behind the API of the old service). In some cases you can have very simple services that consume a message, make a few calls to services or databases to hydrate it with more information, then produce that message to another topic that the original consumers could switch to. More commonly though, if the data model is making a drastic change where the database is being split and owned by two new services, you will have to get consumers in on the change to make sure everyone knows the semantics of the new changes.

For 2, it completely depends on the source of the trigger. The first event in a chain probably only has enough information to know that it should produce an event, usually as quickly as possible, so no additional db or api fetches. So you might get something in the driver status topic that contains {driver_uuid, new_status, old_status}, then based on what downstream consumers may want to do in response to that event, you may need more info, so you may get more entity information in derived topics. Even pure entity-based messages would have needed a trigger, so in our topics that tail databases, you may have the full row as a message along with the action that occurred, like {op: insert, msg: {entity data… }}.


Thank you so much for your input on this topic, very informative answers!


I am not the author of the original message; however, I also recommend "Building Microservices, 2nd edition" if you are trying to answer such questions.


Thanks for your recommendation, I pre-ordered it =)


When you say you used a central schema registry, did you have a single repo containing all topics and schemas?


Kafka has a 'schema registry' service. Typically they use Avro/JSON to define the schema of each message. I think they added a couple of new types in recent versions. When you define your consumer/producer you also tell Avro what the schema registry service URL is. Basically it helps you keep your messages in spec. It also has the ability for versioning, so you can have different versions of messages in flight on the same topic. So if you add a new field it is fairly trivial for both sides to know what is going on. If you remove a field then it becomes more tricky.

Now what goes into that registry is typically in some way version controlled. That is for 'rebuilding' a clean environment if needed. Or part of a CI/CD system.


Thanks for the reply. We're using a schema registry but we generally have our schemas spread across multiple repos depending on who owns the topic. I was wondering if they centralized all their schemas under a single repo. I'd like to do this, but I wanted to get some other opinions on the subject.


I have seen several approaches.

1) the producer repo owns it. This has the benefit of only the producer makes those things and it is side by side. Downside if you have more than one producer repo with the same schema. When the producer code starts up it sets up the schema.

2) Central repo. This is nice; however, if you are using a microservice style it tends to make deployment more of a pain, as they all sorta need to move together. But it works very nicely with monorepos. You can make it work with micro but it takes a bit more thinking. This turns your deploy into a two-step process, 'schema first' then code. That can work but again, 'more thinking'.

3) consumer owned. The consumer basically says 'i will only grab these anything else I will consider error'. Works ok if you have 1 producer group and one consumer group and is effectively #1. But with several consumers it becomes a 'cut and paste job'. Or something may get out of sync, etc. When the consumer starts up it pushes the schema.

What I found best to get someone to decide is to figure out what your version upgrade cycle is. Is it one thing at a time? Or is it 'everything goes every time'? Who 'owns the schemas'? Different styles will match your deployment process more than anything.


Thanks for the feedback!


Evan can correct me if I'm wrong, but I believe it was one centralized service with all the schemas. You could view older versions and make new versions via the UI.

It was pretty user-friendly and managing schemas was straightforward. I haven't done anything similar since, so no comparison point, but I thought it was fantastic and improved data quality a ton.


This is correct yeah


I've been working on a contract for 6 months where the architecture is microservices and queues.

IMO: It's over-complicated. They can ship a change to a microservice while insulating the other services from risk, but that's just kicking the can for technical debt.

What happens is that, if a service hasn't shipped in a few release cycles, when an update is made to that service, we often find latent bugs. Typically they are the kind of bugs that could be found with simple regression testing; but the company put too much effort into dividing its code into silos. (Basically, they spent a lot of time dealing with the boundaries between their microservices instead of just writing clean, testable code with decent regression suites.)

---

IMO: Don't get too hung up on microservices and events. Focus on writing simple, straightforward code that's easily testable. Make sure you have high unit test coverage and a useful regression suite. Only introduce "microservice" boundaries when there's natural divisions. (IE, one service makes more sense to write in Node.js, another makes more sense to write in C#; or one service should run in Azure and another should run in AWS.)

This, BTW, helped immensely in a previous job. When I worked for Syncplicity, a major Dropbox competitor, we started with a monolithic C# server for most server-side logic, but we had a microservice in AWS to handle uploads and downloads. This helped immensely, because we ended up allowing customers to host their own version of the upload / download server. It was a critical differentiator for us in the marketplace.


We made the microservices mistake circa 2015-2016. Took us 3 years to recover and we lost some customers. Extremely valuable lessons were learned by all involved.


There's no such architecture, much like there's no "MVC architecture" or "CQRS architecture". These are patterns that should be used specifically in time and space where and when pros outweigh cons.

Anyone calling themselves an architect, or an engineer, or even just a "good developer" would acknowledge that interaction patterns and concepts are contextual, not general idioms at the project or system level.

Speaking of them as "architectures" or embracing them, as in, doing everything the one holy way is only a crutch for people who are confused by what it means to define your system's architecture. And a silver bullet for consultants to sell you books and training.

There is a lot of empty hype and misconceptions around EDA, for example "it helps decouple services" is thrown around, which is nonsense to anyone who can analyze a system and knows what a dependency is (moving from "I tell you" to "you tell me" is not more decoupled, you just moved the coupling; likewise moving from "A tells B" to "everyone tells B and B tells everyone" as in event hubs is much more coupling, it plays the role of a global system variable basically).

Regarding dead letters, the most trivial answer is to log and notify the stakeholders about unconsumed messages. That's the most general approach. Think about dead-letter messages the same way as exceptions that bubbled to the top of the stack. And when you can handle them more specifically, you do.


I suppose it all depends on your definition of the word architecture. Erlang and Elixir are instantiations of the actor model, which is fundamentally an event driven architecture. It is universally the way that state and side effects are handled in those languages.

I touch on it a little bit in another comment: https://news.ycombinator.com/item?id=28047306


Actors are built on async message passing; "events" have a more specific meaning ("X did happen"), therefore it's not correct to say actors are fundamentally EDA.

Aside from the fact many messages in the actor model are not events, but can be commands ("perform X") or queries ("tell me about X") and the difference matters, messages are also not just spit out to no-one-in-particular-and-yet-everyone-who-listens, like it's typical in EDA, but from a specific actor to another specific actor.

To send a message to another actor you need to have their address (or in OOP-ese, their object reference). Sure, you can take one actor and designate it "event hub" and have everyone subscribe to it and then send everything there. But that is NOT a part of the actor model. It's you, writing your own application with your custom logic within it.


Not for a company, but I've embraced it pretty hard for my home automation. It's sort of the hammer I hit everything hard enough with until it looks like a nail by making everything go through the MQTT broker. The website? A static json blob describes interesting MQTT topics, and opens a MQTT over websocket connection to read/write any state. Zigbee, et al.? Translate to MQTT. Reporting? Daemon that listens to all topics and dumps it in a Sqlite database to be queried at my leisure. Events like sprinklers on/off? Python scripts in cron jobs that talk to everything via MQTT.
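
For anyone curious, the "daemon that listens to all topics and dumps it in a Sqlite database" part is only a handful of lines with paho-mqtt (1.x-style API; broker name and table schema here are made up):

    import sqlite3, time
    import paho.mqtt.client as mqtt

    db = sqlite3.connect("home_events.db")
    db.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, topic TEXT, payload TEXT)")

    def on_message(client, userdata, msg):
        # every state change and event in the house lands here, queryable at leisure
        db.execute("INSERT INTO events VALUES (?, ?, ?)",
                   (time.time(), msg.topic, msg.payload.decode(errors="replace")))
        db.commit()

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("broker.local", 1883)   # hypothetical broker
    client.subscribe("#")                  # everything
    client.loop_forever()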

Basically everything that makes fully event-driven architectures difficult is ameliorated because the only consumers are myself and my wife, and we literally built up the whole system. Something appears to be locked up? There's a system of watchdogs to kill stuff, all hardware has been designed to fail off into manual control, and we can pick the pieces up at our leisure like when anything else in the house breaks. The last will and testament messages in MQTT are really nice for at least reporting hard failure conditions.

I'll be the first to admit that I would not look forward to productizing it and supporting someone else's house (to the point that I'll probably never do that). It's so easy for messages to make their way into the bit bucket when setting up a new subsystem, and everything is so loosely coupled because of the event system it's almost like it's all "stringly typed". And being both software engineers, we sort of relish in how awful the UI is, even using 98.css.


+1 to MQTT for home automation.

I spent years fiddling with zigbee and z-wave and propietary other things from well-known brands as well as unknown brands from amazon. Nothing ever worked correctly for long periods of time.

Now I have binned all that other crap and migrated to esp8266-based devices that can be flashed to Tasmota, and have them all talking via a Raspberry Pi running an MQTT server and OpenHab via docker containers. It is now rock solid in terms of reliability - the only failures come when OpenHab's cloud integration dies, and then I have a backup local http server with a javascript-based client that just sends commands directly over MQTT. MQTT really was the missing link for me.

OpenHab has been a pain to use, but that is a different story. I'd not recommend it and I'd ditch it in a heartbeat, but so far I am in an "ain't broke, don't fix it" position.


Would you mind describing the parts that became automated? I can think of a number of appliances that could potentially be api-driven but haven’t researched them yet....


Sprinkler system, water pressure and leak reporting, lights, mains reporting that the meter happens to broadcast, host of sensors for the house plants. Right now working on a control plane for AVB (audio video bridge) devices to give myself a sonos++ that includes video too without using sonos (I'm still salty about their "recycle mode" BS).

After that either: A) hooking up some cameras to the AVB matrix to give myself a Ring replacement too, or B) hooking all the phone lines in my old house to the AVB bridge and creating a kitsch intercom system with an old phone chosen to match each room.

After that, maybe voice assistant, but that seems like a lot. Or maybe hooking a bunch of bluetooth MCUs together on a low latency network to make the house look like a giant bluetooth device so I don't lose my audio when I go outside to do yard work.

The end goal is all the smart home stuff but I don't have to choose between Daddy Bezos and Daddy Jinping.


Are there communities on Discord/Reddit for what you're doing? I'm interested in dabbling with home automation during my free time.


MQTT at home is a real boon. I have mine chugging away in a raspi in a docker container and that drives everything like the plant watering. I also use it to collect metrics and do alerting with influx+grafana.




Where I currently work we are all in on event-driven architecture. For our DLQs, we have alerts for when the queue is growing in size or when messages are in the queue too long. When those alerts come in, we manually move the messages back to the normal queue for reprocessing, and if they get DLQed again after that we will look into the reason it is failing.
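
The manual move itself is nothing fancy; with SQS it's roughly a drain-the-DLQ-back-into-the-main-queue loop like this (queue URLs are placeholders):

    import boto3

    sqs = boto3.client("sqs")
    DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"   # placeholder
    MAIN_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"      # placeholder

    while True:
        resp = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            # push it back onto the main queue for reprocessing...
            sqs.send_message(QueueUrl=MAIN_URL, MessageBody=m["Body"])
            # ...and only then remove it from the DLQ
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=m["ReceiptHandle"])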

One of the benefits of this architecture for us is the ability to easily share information between services. We utilize SNS and SQS for a pub/sub architecture so if we need to expose more information we can just publish another type of message to the topic or if we need to consume some information then we can just listen to the relevant topic.

There are two big issues that I've run into while at this company. One is that tracking down where events are coming from can be a big pain, especially as we are replacing services but keeping message formats the same. The other big issue is that setting up lower environments (dev, QA, etc.) can be difficult, because you pretty much need the entire ecosystem in order for the environment to be usable, which requires buy-in from all teams in the organization.


Do you have any control over the individual services?

One way to ease some of that pain is a standard library to obtain your keys and topic and publish metrics when things are published and consumed, or at least logs on startup.

It's a pain to get buy-in, and 10x harder to keep it updated. But if you can solve the problems around getting names and secrets and stuff, folks are usually open to the conversation at least.


I guess it's still harder to track down event emitters, but have you tried using bitbucket or GitHub code search to search all of your repos at once?


Yea, I have used GitHub search in a pinch and sometimes it is helpful enough to show me exactly where to look. Unfortunately, though, there are several events we emit that are many layers of string concatenation, so GitHub search may narrow it down to 4 or so places and I have to manually go from there.


A solution for that is using the equivalent of a user-agent: write it to the message headers. Not sure if SQS supports them like Kafka does.

I'm really surprised that not a lot of companies use this.


Thanks for your answer, it really helps.

Does moving the DLQ messages back to the normal queue mean that all consumers can deal with out-of-order scenarios?


If you're using a queue like SQS and expect it to be ordered and exactly-once, you're in for a lot of surprises. If you need ordering, use a stream/log like Kinesis or Kafka.


Exactly-once is not a requirement but ordering is. I had Kafka and Kinesis in mind when writing this question but just in case you haven't seen it, there is a way to get ordering guarantees using SNS and SQS:

https://aws.amazon.com/about-aws/whats-new/2020/10/amazon-sn...


Does that really guarantee in-order processing, or just that messages can be picked up in order? If you have multiple consumers on your queue and consumer A picks up message x from the head of the queue and then consumer B picks up message y next, it is possible for y to get processed before x. Maybe consumer A is slow (GC pause?) for some reason, and now we are processing out of order, even though the queuing infrastructure does not see it. If you truly need to guarantee strict processing order (x must complete before you start processing y), I think you may need to build that into your app. Or I misread, which is very possible.


"Everything looks like a red thumb when you're holding a golden hammer."

Events are a part of a greater whole. It's a tool that you can use to solve certain data flows, but not all data flows. When you start taking more liberty with the word "eventually," you are almost certainly in a realm where event-driven makes the most sense. CQRS is a pretty good example of using many architectures (including event-driven) under a single greater architectural umbrella, and the thought patterns it introduces you to are incredibly useful. But no architecture is gospel, not even close.

Any "pure" architecture is the tail wagging the dog. The problem comes first, the solution comes second, the architecture comes third.


I have worked in places where event-driven architectures are a necessity (we're talking thousands of real-time systems being integrated together).

If you want to use event driven + microservices, first make sure microservices make sense. Event driven is just a cool way to tie monstrous collections of services together if you have to go down that road.

If you can build a monolith and satisfy the business objectives, you should almost certainly do it. With a monolith, entire classes of things you would want to discuss on hackernews (such as event-driven architectures) evaporate into nothingness.


This. I have built many monoliths and a few event-driven systems out of necessity.

In my opinion there's also a time factor involved; for example, you think there's huge potential but zero actual clients today. In that case, if you think a monolith will do the job for the first 5 years of its life, build a monolith.


I did at an old company.

It was great for certain use cases, bad for others. The architecture made it so it took days to do simple features like adding a sortable column.

Having to deal with that made it the worst job I've ever had. It would take 700 lines of code involving two separate systems and 70 hours to do tasks that would normally take two hours. I felt a lot of pressure because previously simple tasks would take so long.


The closest thing I know of is the Erlang/Elixir approach to program development. The BEAM VM that they're built on is basically an instantiation of the actor programming model: a series of logical processes (services) that only communicate with each other through messages. Any state is held in an actor and you work with that state in an event-driven way based on the messages you receive. I'll give a little peek at this below, but really you'd need to work with it to see how well it works at application scale.

In well architected Erlang/Elixir, most of your business logic will be written as pure functional code (which is gloriously easy to test), but then it is glued together at the boundary by GenServers (usually). GenServers are an abstraction over the BEAM primitives that makes the 'receive a message, update my state' thing very easy. The simplest handler might look like this:

    def handle_call(:increment, _from, state) do
      {:reply, state, state + 1}
    end
Here state is a simple integer. When we receive the :increment message, we send back the current value and increment our local state. The way all this is wrapped up, the caller has an API that just looks like a function call which returns the value but the underlying architecture that you're working with is all event driven.


Events are a part of any good service-oriented architecture. They can replace patterns that involve batch-ETLing large amounts of data from system to system--events are usually a smoother way of doing the same. They're usually more resource-efficient and responsive than a poll & cache approach. They can also create a more consistent way to broadcast data, avoiding CAP problems from trying to do multiple writes to different systems, and preventing systems from devolving into anti-patterns where a system from a business domain gets misused as a message bus for another system.

Feeding events into a processing queue is also a good way to make systems more responsive for end-users when compared to making every operation blocking.

Events are not a good replacement for transactional request/response models (i.e., making an API call). Some people advocate for an "event sourcing" system to create its own internal domain model using events. I don't think this is a good default, but it really comes down to what tools you're using and how you're used to using them. Namely, you can't have a web service that writes to an RDBMS and then immediately writes to RabbitMQ and call it a consistent system, because the write to RabbitMQ could fail and the systems downstream would be permanently wrong. So event sourcing is used to resolve this into a single write into a queue system, which then forks out into the system's own RDBMS and also other systems. However, the more "normal" way would be to just write this atomically into the RDBMS and have a second process poll it back out into the queue for downstream systems.
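
That "normal" way is usually called the transactional outbox pattern. Roughly (sqlite standing in for the RDBMS, and a stubbed publish call instead of a real broker client):

    import json, sqlite3

    db = sqlite3.connect("app.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                                           payload TEXT, published INTEGER DEFAULT 0);
    """)

    def place_order(order_id):
        # single atomic write: domain change and event land in the same transaction
        with db:
            db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "placed"))
            db.execute("INSERT INTO outbox (payload) VALUES (?)",
                       (json.dumps({"event": "order_placed", "order_id": order_id}),))

    def relay_outbox(publish):
        # second process polls the outbox and forwards to the queue for downstream systems
        rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
        for row_id, payload in rows:
            publish(payload)   # e.g. send to RabbitMQ/Kafka
            with db:
                db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

    place_order(1)
    relay_outbox(print)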


By and large, all of industry and academia working on modern robotics systems have converged on using event-driven publish/subscribe message buses for basically everything. For example, a camera driver will produce a stream of "image" events that then trigger other code across the system, all the way until a stream of "motor command" events comes out the other side. This model is really valuable because it works so well with logging and replay workflows, and because it makes mocking and replacing different parts of the system really easy (up to and including mocking reality with a simulator). ROS is the major open source framework used in academia, and industry is split between using ROS and building proprietary internal protocols with similar functionality.

It's not an exact match for the scalable-microservices world you're thinking of (for example, typically robots don't need to deal with runtime schema version skew), but could be interesting to learn about anyway.


Monolith is easier to handle. With microservices, any network connection could break, you need a lot more code to handle all that complexity and orchestration.


This depends upon the size & complexity of your monolith & how your development teams work. Essentially it's a trade off about which becomes most work and causes most issues.


Monoliths vs. services is like biologists arguing about cells vs. organs.


Not sure if there are any communities. My general advice is to invest as much as possible in a good logging solution, traceability, and just general things to make debugging easier. Come up with a way to replay events easily. You'll thank yourself everyday a bug or issue pops up.


I absolutely agree with you, we are currently implementing a fully serverless and event-driven infrastructure and thinking about logging, traceability and debugging across 200+ lambda functions has been quite a pain. This is even more important as we manage financial flows.

We have written an article about how we try to fix that, if you are considering such an infrastructure on AWS feel free to read it:

https://medium.com/ekonoo-tech-finance/centralizing-log-mana...

I believe that this is one of the big tradeoffs that you make when choosing to go for a microservices and event driven infrastructure.

Regarding the original post question, we are indeed going all in on an event-driven infra, and so far it has been going not too bad, happy to answer any extra questions.


This 100x. Tracing, replay, etc are all invaluable even in non event driven systems


I'm not sure you've understood EDA if you're suggesting that good logging is an alternative.


It isn't an alternative; it is a necessary part of understanding the system. You get an event and you have no idea where it came from. That is great from a composability standpoint: anything can become an event emitter. On the other hand, if that event comes in with bad data, you have no way to correct the issue because you have no idea where the event came from. The more event sources become disassociated from the event processors, the more important logging and traceability become. Perhaps your component ships an event and the downstream event processor doesn't do what you expect. Without good logging you lose the ability to treat the downstream as a black box; you will have to dive into the code to figure out why it is rejecting your events. Whereas a simple "Dropping event x for reason y" in the logs is incredibly useful.


Thank you for saying the things I didn't think I needed to elaborate on :-)


Your phrasing indicates you positioned it as a replacement/alternative.


I work for a big AI consultancy. Most of the time we build ETLs for the data-engineering side, in a client-driven capacity-building effort. We do this because our focus is on data science, not data engineering, and we often work in situations where the client doesn't have an existing data science platform. It's simpler to build, to hand over and later to maintain.

In projects where the client already has a mature engineering and data science department, we bring the big guns! The scope is usually much larger, with several workstreams, and involves production-ready deployments. In this situation we might build upon what the client already has (ETLs), or initiate a full event-driven transformation with a "backbone" team responsible for creating a platform, and several use cases building upon it. In the usual scenario, a team would want to start large computations or simulations upon receiving a trigger event from a monitoring system (model drift) or a human operator ("what would be the impact in € of a small decrease in parameter X over the next 7 days of forecasted sales?").

Event-driven systems are much more robust than traditional ETLs with a central data warehouse, but they are also much more complex to understand and operate. In the end, we rarely deploy them because they cost us way too much engineering time compared to the benefits. That's mostly because we spend >70% of our time dealing with "security teams" and "access issues". Seriously.


Yes. It's my preferred architecture for any non trivial system. The single biggest downside to it is it's really hard to find people with experience building event driven systems.

There's a bit of a training curve but it's honestly not that hard if people are willing and wanting to learn. You could level up a moderately experienced team in a matter of weeks to be able to work within a well defined event driven microservice architecture. The part that gets tricky and requires experience is carving out the boundaries and messages.

To answer the question about DLQs I think this is a valid critique. I've seen many places just set and forget DLQs and they might as well not have them. For me, I like to start each DLQ with an alert on every message published. Then manually inspect the message, trace the logs and figure out what to do from there. Once I have enough data on failure modes and paths to rectify them, you can start automating DLQ processing. In general though DLQs should not see much traffic outside of a system going down or poison messages hitting your services (broken schema changes from another service)


I had experience with a product using an event based architecture at large scale, and to be honest, it was a pain to work with. For example, traceability, or troubleshooting in general, was very hard since events would spawn more events etc. making things much harder to track than expected.

Unless scale is an issue, nowadays I always prefer a more stateful approach when possible.


(Kind of related)

I was the lead developer for Syncplicity's desktop client. It was a file synchronization product very similar to Dropbox.

When I joined, the desktop client was 100% event-driven. The problem is that some kinds of operations need to be performed synchronously, so "event-driven" tended to obfuscate what needed to happen when. Translation: For your primary use cases, it's much easier to follow code that calls functions in a well-defined order, instead of figuring out who's subscribed to what. Events are great for secondary use cases.

To translate to microservices, for primary use cases, I'd have the service that generates a result call directly into the next service. For secondary use cases, I'd rely on events. Of course, there's tradeoffs everywhere, but you'll find that newcomers are able to more easily navigate your codebase when it's unambiguous what happens next.


The strength and weakness of event-driven apps is the low coupling. A little bit goes a long way, as you said. Retro-fitting event-driven pubsub in is a fantastic architecture evolution. Running everything that way? Not much fun to read/maintain.


"As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one."

For us, it's either a function that will retry the messages after some time, or manual intervention.

Our department recently said we need to move to event-driven architecture for one of our processes that currently runs in a batch. They want us to load data into EMR from an S3 bucket populated by Kinesis. Their suggested implementation is simply to run the batch job more frequently instead of once a day... sorry guys, but that's not event driven...

I suggested maybe just setting a trigger on the S3 bucket and hooking it up to Glue, since that would actually be event driven. They said 'no' because they don't want the load to EMR or Glue to run too frequently. I guess that makes sense (I'm not that familiar with the ETL tech), but it sure doesn't make sense to call it event driven.


> For us, it's either a function that will retry the messages after some time, or manual intervention.

What does manual intervention look like? Are all downstream consumers blocked until the DLQ is emptied or do all consumers know how to deal with late-arriving events?


> Are all downstream consumers blocked until the DLQ is emptied

No other service should know about the state of another service's DLQ.

> do all consumers know how to deal with late-arriving events

In my approach to architecting event-driven systems, this question is meaningless. For me, some of the core tenets of architecting these systems are:

1. Message ordering should not matter

2. Message delays should not matter (no temporal coupling)

3. Message replays should have no side effects

This means that Service B should have no knowledge of what happens in Service A. If something in Service A fails and gets sent to Service A's DLQ, it should have no impact on Service B. This often involves rethinking business requirements and processes to account for workflows that get interrupted - I've yet to find an insurmountable issue.

If however, we are talking about a message broker trying to deliver a message from Service A to Service B, failing and forwarding to Service B's DLQ, then there is a very simple solution. Have your message broker deliver messages to queues. Service B then processes and handles messages from the queue rather than having to immediately handle all incoming messages from the broker. This protects against a lot of failure modes. The only thing(s) that arrive at Service B's DLQ are exceptions thrown internally in Service B
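
To make tenets 1-3 concrete, a minimal consumer-side sketch (field names invented) keys on the event's identity plus a monotonic field, so duplicates, replays and stale out-of-order updates are all no-ops:

    processed_ids = set()   # in practice a table keyed by event id
    latest_seen = {}        # entity_id -> timestamp of the last event applied

    def apply_state_change(event):
        print("applying", event)

    def handle(event):
        # tenet 3: replays/duplicates have no side effects
        if event["event_id"] in processed_ids:
            return
        # tenets 1 & 2: a late or out-of-order update never clobbers newer state
        entity, ts = event["entity_id"], event["occurred_at"]
        if ts >= latest_seen.get(entity, 0):
            apply_state_change(event)
            latest_seen[entity] = ts
        processed_ids.add(event["event_id"])

    handle({"event_id": "e1", "entity_id": "order-7", "occurred_at": 10})
    handle({"event_id": "e1", "entity_id": "order-7", "occurred_at": 10})  # duplicate, ignored
    handle({"event_id": "e0", "entity_id": "order-7", "occurred_at": 5})   # stale, not applied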


>What does manual intervention look like?

A ticket or an ops alert depending on severity. The cause gets fixed and messages are retried or dropped depending on your business case.

>Are all downstream consumers blocked until the DLQ is emptied

No, it's a different queue. The point is that you eventually pull them into the DLQ so you aren't wasting all your resources on failing messages.

>do all consumers know how to deal with late-arriving events?

Hopefully, but an outage is an outage. You're way past happy path at that point.


Manual intervention could be moving them back to the main queue to try again, or looking up the data that failed like if the customer that is being transacted on exists. Most systems don't have an issue processing late messages. The ones that do would have to have some sort of retry/ingress process.


A lot of great comments here.

The closest community to what you are asking for here is probably the DDD/CQRS/ES[1] Slack[2]. Google groups are pretty much dead at this point.

[1]: https://github.com/ddd-cqrs-es [2]: https://ddd-cqrs-es.slack.com/


No. In my programs I try to follow John Carmack's advice that there is no more understandable structure than a large program that you can read from start to end. He is talking about a main game loop, but I found that this advice holds. Nothing beats being able to step through the function line by line and see all the variables.


I can say we're one of the companies that have successfully embraced event-driven design. We're Vaticle, and we're not a microservice shop - rather, we're building database software called TypeDB. The internals are quite event-driven, mainly realised with the actor model and event-loop concurrency.

It has allowed us to scale mainly in two ways: maximising parallelism with respect to CPU, and doing other work while waiting for an RPC call to return.

Event-driven architecture by nature is more parallel and efficient, but comes with a weaker consistency guarantee when it comes to the ordering of events coming from multiple parallel sources.

In my experience, people tend to fall prey to these pitfalls and end up resorting to inappropriate workarounds such as global locks and ad-hoc retry mechanisms. These are most common when trying to aggregate work coming from concurrent producers or when needing to handle communication failures.

In fact, communication failures and downtime are the most prominent problems in microservices, particularly when you need your data to be inserted into multiple data sources in an atomic way.

This is an inherent issue in distributed systems, and you have to think about what the atomic unit of data you wish to insert is, and design your system based on this hard constraint. Making the operations atomic, idempotent or revertible are some of the solutions you may want to investigate, but the moral of the story is that you need to make sure these additional complexities are justified.

For us as a company, we decided on the event-driven architecture after knowing not just the benefit, but also the cost that I've outlined above.

For simpler applications that don't need to a) be real-time or b) handle crazy amounts of load (think small internal applications, a small-business ecommerce website), I would go back to a good old non-event-driven system since it's the more pragmatic option.

I've seen several companies build an event-driven architecture even when they know there's no way they would need to scale beyond serving several thousand requests per hour in the next two years. I think they would've been better off with a simpler, synchronous model.


I work in semiconductor manufacturing, it's a very common model. I have been in about eight fabs around the world that use it quite successfully.


I have at a few companies now.

It's great for processing data that goes beyond a single database call, data formatting and presenting something on a page.

If you're chaining multiple microservices together you've made a very sloppy/poor man's version of this. (People tend not to account for downed services, maintenance updates, client durability, etc.) When you bring in technologies like Kafka to orchestrate this, you'll end up with a more reliable system that you can fix if something goes wrong. This changes the way you think about data and how you present it more than anything. Also, it'll increase your uptime because your service's SLA is isolated from what you're processing. (Your service and the persistent storage that is storing state is what people see.. the data being out of date is something you should account for)

Schema changes: Generally you don't have that big of a deal because multiple applications get started at once. You should have system level tests to catch that before you go out. Also, application smoke tests help as well. As long as you picked a durable message queue with a framework that'll crash on error, you can fix that, bring up the fix and continue processing through.

Dead letter queues: It's more about how you architect more than anything. This is something you should plan for.


> When you bring in technologies like Kafka to orchestrate this, you'll end up with a more reliable system that you can fix if something goes wrong.

It depends...

If your services are servicing typical user requests, and expect responses in O(<1s), then eventing architectures are exactly the opposite of what you want. Using message brokers (Kafka) as an ersatz network layer is the path to sadness -- you want backpressure, RPC semantics, etc.

https://programmingisterrible.com/post/162346490883/how-do-y...

If you're doing ETC/event semantics, the game is different. But very few orgs actually work that way.


> If your services are servicing typical user requests, and expect responses in O(<1s), then eventing architectures are exactly the opposite of what you want. Using message brokers (Kafka) as an ersatz network layer is the path to sadness -- you want backpressure, RPC semantics, etc.

Correct. Most cases you need to see what the current stored state is. Processing can happen in the background and update that state without being requested.

I'm assuming that you're architecting your system as having an ingest service that brings everything into the topic, multiple apps that refine/enhance the data, and then an app that sends it out to more permanent storage. Your service endpoint should look at the permanent storage to build a response. (Or read and push out updates via a websocket if you want frequent updates to an existing page.)

Kafka is not a network technology.


I saw a number of investment houses (mostly sell-side) do this. This was in the age of ESB (Enterprise Service Bus.)

The architecture made sense since events (new trades or quotes) dictate a host of downstream activities, which often need to be near-real-time to reduce divergence.


It's kinda like using Lisp. It may be great, but it's harder to find friends :-)


Event-driven architecture honestly seems like a different flavour of all the things we hate about spaghetti code and gotos.


Well, every game company :)

Makes sense robotics would do it too. AAA games and robotics are basically the same field after all


We definitely haven’t fully embraced an event driven architecture, but we have gone all in on it where it makes sense for us. We’re processing in the order of a billion messages a day from hardware in customer’s homes, which is probably the ideal use case for event based comms. Handling devices being offline for a while becomes much simpler when the response to that can just be queuing up events and transmitting when available.

One of the key lessons we learned was that your event ingest needs to be rock solid. Put up a service, and then make it do one thing only, receive an event and throw it onto the message bus - we do this for messages from devices, but also for 3rd party services which send us webhook notifications. If ingest fails so does everything else, don’t let that happen.
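To make the do-one-thing idea concrete, a thin ingest endpoint really can be this small. This is a sketch only (not the poster's actual code), assuming Flask and the confluent-kafka producer, with a hypothetical topic and route:

    from flask import Flask, request
    from confluent_kafka import Producer

    app = Flask(__name__)
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    @app.post("/events")                  # devices and third-party webhooks both land here
    def ingest():
        # do as little as possible: accept the payload and hand it straight to the bus
        producer.produce("device-events", request.get_data())
        producer.poll(0)                  # service delivery callbacks without blocking the request
        return "", 202                    # accepted; downstream consumers do the real work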

The other thing I’d say is to make sure there’s a central source of truth for what services are consuming which messages. We made the mistake early on of saying services should be responsible for setting up their own subscriptions and it’s made it much more difficult to answer questions around who’s going to be impacted by changes or outages. At a minimum have a wiki page on it. Ideally manage subscriptions in Terraform or similar.

Finally, DLQs. Typically we don’t do much with them, we have logging of messages that get pushed to the DLQ and usually it’s a non-recoverable error, often around validation or accounts being disabled. They are handy in the case of an outage though as you can just push all the messages back into the queue when things recover.
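Redriving a DLQ after an outage can be as simple as copying messages back onto the main topic. A rough sketch, assuming a Kafka-style DLQ topic and hypothetical names (not the poster's actual tooling):

    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "dlq-redrive",             # one-off group for the replay job
        "auto.offset.reset": "earliest",
    })
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    consumer.subscribe(["device-events.dlq"])  # hypothetical DLQ topic

    while True:
        msg = consumer.poll(5.0)
        if msg is None:
            break                              # nothing left to replay
        if msg.error():
            continue
        # push the original payload (and headers) back onto the main topic
        producer.produce("device-events", msg.value(), headers=msg.headers())

    producer.flush()
    consumer.close()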


Yes, I made my own open-source event driven platform: http://github.com/tinspin (rupy is the foundation and fuse is an example implementation tested with 350.000 users and 5 years uptime)

The learnings were two-fold:

1) You need async-to-async capable db clients so that you use 4 threads (potentially on separate cores) for each browser <-> server <-> database roundtrip.

Since most databases don't have async capable clients I wrote my own database too: http://root.rupy.se

2) You should use a VM + GC language so that you can use atomic shared memory between cores, that way any core can handle any request efficiently (and access other users memory).

This part is very hard to prove in theory, but in practice I'm baffled by how well Java performs; you can find three quotes that I managed to find here: https://github.com/tinspin/rupy/wiki

Finally, getting threads to cooperate on things is hard and you cannot debug it with any tools; instead you have to use "trial and error" until it sort of works all the time.


> 2) You should use a VM + GC language so that you can use atomic shared memory between cores, that way any core can handle any request efficiently (and access other users memory).

Where did you get the idea that you need a VM and GC to do this? You literally only need a pointer to memory. These boil down almost directly to instructions on a CPU. You literally just atomically load and store to a pointer.

https://en.cppreference.com/w/cpp/atomic/atomic

Java isn't fast, your CPU is fast and the best thing any VM or GC can do is get out of the way.


I would have written rupy in C with atomic pointers and .so/.dll hot-deploy if I had done it now, but:

All concurrent complex datatypes leak memory with C/C++, I don't know why; but if you read the quotes you'll see others try to explain it: https://github.com/tinspin/rupy/wiki

The VM is good for more things than concurrency, like no seg. fault crashes, which is very good to have on a server.

Have you built anything like a 5 year uptime 1000+ concurrent users server with hot-patching every day?

Have you even tried Java?


None of this reply even has anything to do with what I asked, which was why you think you need a VM and garbage collection to run specific CPU instructions.

> All concurrent complex datatypes leak memory with C/C++

That's completely ridiculous and shows that you have a very fundamental misunderstanding of how these things work. They are wrappers around functions that do the atomic operations.

https://en.cppreference.com/w/cpp/atomic/atomic_store

Feel free to show me the memory leak you were talking about.

https://godbolt.org/


All concurrent maps f.ex.

It's like you're arguing in a void; have you ever made anything?

Please show some code you have written or stop replying to my comments.


That makes absolutely no sense in any way.

You were implying that atomic operations, which don't allocate any memory and are just a pass through to CPU instructions operating on a pointer, somehow leak memory.

Now you are saying "all concurrent maps" in C++ leak memory, which not only has nothing to do with what you said earlier, but is an insane claim to make without linking to the specific maps you are talking about, not to mention that there aren't any in the C++ standard library.

> Please show some code you have written or stop replying to my comments.

What would it take for you to realize that saying operations fundamental to a CPU "leak memory" is like saying 'loops make programs crash'? It is only something someone would say if they not only don't understand the fundamentals of what they are talking about, but don't even realize they don't understand it and are in fact absolutely sure that they do.


Operational support is more interesting with this kind of an architecture. Dealing with message queues and all that can be challenging for a traditional organization.


Some teams parallel to mine have an event based contract with their upstream, vs my team has a service contract.

We've been doing some refactors to combine common systems, so now the same team is upstream for both.

Talking to the sister teams, they're pretty unhappy about their relationship with the upstream and are trying to avoid and replace them internally, whereas we quite enjoy ours.

I think the big difference is in mental model. When you're passed an event stream, the producer doesn't care about the events going out; it's on you to handle all of them, and for failures you have to reach out to whoever caused them in the upstream system, rather than the upstream team doing it. Otoh, for a service call, you only need to throw an exception for a bad event, and the upstream team is responsible for communicating the failure.

The more event based interfaces you have between your team and the folks making the change, the harder it gets to tell them that they're doing something wrong, and the less you even know about what they're doing or how to find them.

Mind you, immediately after the service call, we put messages on a queue. Distributing the message between systems we own works just fine.


I'm guessing the OP meant event driven with persistent events. If the events exist only at runtime then a lot of burden related to schema evolution disappears.

We tried this approach a couple of times with mixed results. In one case, for a new component with strict DDD modeling, it was a boon to productivity. In other, preexisting ones, we never got to realize the gains; we invested quite a bit and I'm not sure we'll ever get the upside.


You are right, that's what I was referring to.

I have seen DDD working sufficiently well for some use-cases but issues started appearing when the business redefined what the domains are and how the org is structured. I guess there is not a single straight-forward solution to this.


I'm not sure what properties of the system you assign to the DDD label, but in my experience DDD just means you put the domain first. It's all the Event Sourcing and CQRS baggage which gets lumped together with DDD and adds heaps of non-essential complexity.

If the understanding of the domain changes, you change the code - there's no way around that.


I maintain a legacy application which was written in a fully event-driven way. As another commenter mentioned, native Windows programs work this way, but it is not just because this is a Win32 program. The original author(s) of this application also embraced a multi-threaded paradigm and use their own homegrown asynchronous serial message and event system (built on top of Windows messages).

It's terrible. The reason it's terrible is that you can't use function call graphs, single stepping, or call stacks to debug this application. Everything happens indirectly by one part of the application throwing a message in a bottle into the ocean, and another part of the application (running on another thread) finding the bottle at a later time. And every message is written in a different binary format (different memcpy'd structs) which is mutually intelligible to each sender-receiver pair and no one else.

Troubleshooting and understanding this system is more akin to endocrinology or ecology than math or engineering.


Much of our codebase is python microservices communicating via Kafka. Once you get past the hurdle of getting kafka connected it's pretty reliable. We have a shared library for producing and consuming so we don't need to reinvent the wheel for new services. We also dump the resulting messages as rows into a database. It works very well.


What do you use to encode the messages? Avro, Proto or maybe just JSON? Most of the books and articles encourage the use of Avro, but I feel like the support in python isn't nearly as good as in other languages. Almost every solution operates on dictionaries instead of classes with annotations which doesn't seem like something you would want to use for a Py project in 2021...


We use JSON in our Python projects and Avro in our Java projects. We have considered Avro, CSV, Protobuf or Cap'n Proto for the Python projects but it's never been enough of a win for us to prioritize. We switched to Pydantic from dictionaries.


Thanks for the response! That's basically what I ended up with: Avro whenever possible, and for Python stick with pydantic. Except that at the same time I want to keep an Avro schema for those pydantic models and just convert them with fastavro, because then I don't have to rely on the quality of code/schema generators. A little skeptical about that though, as keeping the same schemas compatible across two different technologies plus double (de)serialization might be troublesome.
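Roughly what I have in mind, as a sketch only: pydantic (v1-style .dict()) plus fastavro, with a hypothetical OrderCreated event, where the Avro schema is maintained by hand to mirror the model.

    import io
    from pydantic import BaseModel
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    class OrderCreated(BaseModel):        # hypothetical event
        order_id: str
        amount: int

    # the Avro schema mirrors the pydantic model; keeping them in sync is the manual part
    ORDER_CREATED_SCHEMA = parse_schema({
        "type": "record",
        "name": "OrderCreated",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "amount", "type": "int"},
        ],
    })

    def to_avro(event: OrderCreated) -> bytes:
        buf = io.BytesIO()
        schemaless_writer(buf, ORDER_CREATED_SCHEMA, event.dict())
        return buf.getvalue()

    def from_avro(payload: bytes) -> OrderCreated:
        record = schemaless_reader(io.BytesIO(payload), ORDER_CREATED_SCHEMA)
        return OrderCreated(**record)     # pydantic re-validates on the way back in

    roundtripped = from_avro(to_avro(OrderCreated(order_id="o-1", amount=3)))

The double (de)serialization concern is visible here: pydantic validates on construction and fastavro enforces the wire schema, and there is no tool keeping the two definitions aligned for you.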


>"fully embraced"

"fully embraces" / commits - not a very wise thing to do. Event driven arch is one of many tools at your disposal. It is awesome for some things and not so much for others. You can't use single tool / approach for everything and expect best results.


I disagree. There are incompatible design disciplines and architectures.

100% functional code is a lot better than 99% functional. 99% functional is a mess. 100% functional gives strong guarantees.

You can split up code horizontally (MVC) or vertically ("apps"), but if you mix the two, you get a mess.

I am working on a system which is 100% event-driven, and it works well in the context of this system, which is primarily about realtime data analytics. There are systems I'd build without any event-driven code either. There are hybrid systems too.

There are architectures which are compatible (e.g. structured and OO), but many just aren't. Event-driven is one of those things which adds a lot of complexity to any system. Fully-embraced, that can pay dividends. A little here a little there and you have most of the cost and little upside.


There are many examples of a "hello world" for event-driven architecture, but there doesn't seem to be the equivalent of the "sinatra/express" of event-driven architecture, a minimalistic foundation to build a customized platform using the approach. Things like a schema migration layer would fit nicely onto something like this.

Most event-driven systems are big company projects with a lot of legacy requirements and integration complexity, or they are narrowly tailored and hard to generalize.

I think that a simple template event-driven system that includes a small number of libraries and does something simple but interesting would be a big help.

I've been wanting to create an open source event-driven e-commerce system but haven't had the time.


"everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one"

It varies, but one common example is that the queue is worked by humans. This is pretty common, for example, in travel, for things like accidental overbookings.


The overhead with this architecture can be cumbersome, which is why most successful deployments of it tend to be with teams that have embraced full-stack serverless. I recommend exploring that community; there are plenty of event-driven systems there.


100% this. Embracing Serverless for me meant embracing distributed, event driven systems. Serverless and Event Driven architecture go hand in hand. I actually find my productivity is higher than in days of MVC monoliths


Interesting, does serverless help with data ownership?


Yes - I work in an investment bank; we target millisecond-level latencies for our order management system that sends client orders to the various exchanges (for sub-millisecond we use FPGAs, but they're very expensive and only for some clients).

It works alright (like you, I hated it all before, coming from more amateurish implementations). It's slow to change (adding a new event type can take years before it works everywhere), and failures can happen - for instance, if the message parsing library has a crashing bug and an extendable attribute zone has a poison pill that never appeared before, there's nothing you can do but manually edit the event source.

What it brings us, I suppose, is that every microservice is single-threaded and all events are well-recorded. We use multicast to transfer them from one sequencer to all consumers, so we just need good routers and TCP-level message building; it's very barebones to keep it fast. Extensibility for us is mainly about adding more services around a core stream that we don't really need to change all that much (we do a lot of regulatory validations, data analysis, and the odd scale-out for a round-robin-compatible process - not all of them are, since some need to see all events, for instance for cumulative exposure calculation: client stock exposure on several markets for risk-based decisions).

It also avoids the latencies of state-request-based systems, since each service builds its own state machine. We make a lot of money on this system, so we can hire hundreds of people around the world to maintain it.

At this point I don't see how to do it better (5ms round trip to the client if low validation, 100ms if crossing seawater to an exchange with short selling validations) without events, but I know very well that doing it for simpler flows that are not latency-constrained will probably result in heavy cost and low gain. I would never recommend it for people who aren't already struggling with a fully-fledged business implementation that makes money they want to accelerate.

The problems we face are:

- it's slow to evolve the very core of the system

- we need perfect ordering, there will always be time wasted at the sequencer to transform unordered "commands" for the various services into perfectly ordered events

- testing and debugging is an art that takes time to acquire: I can now, but the task seemed daunting when I started - how to spin up the minimal surface of services to make a valuable replay, how to make static configuration reproduce production's behavior exactly so that all services behave the way you want to reproduce if you're investigating a sequence-based issue (rather than a function-based issue)

- it takes up to a year for a new Java dev to get productive on such an exotic mindset, but it's also because we do no intraday malloc since we cannot afford a GC in the middle of a client order

- management cannot understand why they can't cut cost using the cloud, virtual machines, vendor databases etc. Even in a company that makes billions over 20 years on this system, we still can't explain it in a way that sounds valuable vs its cost. Because its cost probably is extremely high, and can't be outsourced by hit and run consulting managers before being brought in-house again. So we're not like the most popular dev, we're the slow and expensive ones :(


Hey, thanks for your answer, this is pretty informative.

I am basically where you began, i.e. I don't really know what an actual event-driven system looks like (as in something implemented properly).

1. Could you please tell me what, according to you, was the major difference between the more amateurish implementations and the proper implementation (I know this is quite a general question to ask, sorry about that).

2. Since you've seen what proper looks like when it comes to event-driven architecture, can you perhaps suggest some resource (could be a book, could be a code base) which according to you is the closest someone can get to understanding how a proper event-driven system would be implemented.


My most recent (embedded) programs are almost-entirely interrupt driven, with a main loop containing only a wait-for-interrupt. So, all actions are driven by a GPIO event, RTC or timer wakeup, USB request etc.


I've worked on a project that was fully event-driven realized as Microservices. (I think a few of the externally connected Microservices were event-driven as well but not all.) That was all roll-your-own without framework. So things could break. But the philosophy was more like: everything is written in a very lightweight and simple way. So if it breaks, it can be fixed swiftly. I've also seen a similar approach at another place. FWIW both places had unusually high availability requirements. ("Fail fast"...)


Used it for fully transient services. If search died, it could be rebuilt from scratch. If chat died, data about active chats would be lost, but no one really cared.

The main issue with using it with other services was answering questions like "how do we restore a database backup?" or "what do we do if Kafka blows up, and we lose messages?". It's possible to create a design that won't get into an inconsistent state when bad things happen, but people tend to greatly underestimate how hard it is.


nginx is fully event driven (https://github.com/nginx/nginx). Here is a list of what nginx uses for event handling on the different operating systems: https://nginx.org/en/docs/events.html

I don't know of any big project using the newer io_uring; does anyone know of some big examples of io_uring usage?


https://github.com/CarterLi/nginx-io_uring here is a project that tries to use io_uring with nginx; you learn something every day...


Yes, most MVC platforms are event driven.

Since you mentioned schema breakage, what I imagine you're running into is the inner-platform effect: your API/database already supports everything you are trying to reinvent. Just use PostgREST and views.

When you break a view you know you’re making incompatible changes. Stop right there and either version the view, or figure out how to add your feature without breaking the view.

It’s pretty easy to avoid making breaking API changes.


I recently got recommended this video, haven't watched it yet but given that the speaker is Kleppmann I suspect it will be very helpful

Thinking in Events: From Databases to Distributed Collaboration Software

https://www.youtube.com/watch?v=ePHpAPacOdI&list=WL&index=1&...


In finance, specifically in trading systems, event-driven is a natural fit and, in my experience, the default to which systems converge.


I've worked at a company that has launched at least one product where back-end was entirely event-sourced.

Apart from using event sourcing and CQRS, they take DDD very seriously.

They use a self-made open source framework, which has very good JavaDocs. https://github.com/SpineEventEngine/


Event sourcing and event driven are two different things.


What is their relationship? I've been thinking of event-sourcing as a special case of event-driven: all event-sourced systems are event-driven, but the reverse is not true. Is this the generally accepted view or do folks view these as more orthogonal?


In my opinion, event sourcing is a data persistence strategy. I personally never heard that it's considered part of event-driven architecture, but given speed of change in that space, I wouldn't be surprised.


event sourcing is indeed a special case of event-driven.


Yes, have a look at https://github.com/purpleidea/mgmt/ Not at a 1.0 release yet, but there's enough for you to have fun with. LMK


I think the best model for describing the event-driven approach is Petri nets. The theory is quite simple, yet powerful.

Though it is difficult to implement in a straightforward way, it can take your mind in the right direction.
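As a toy illustration (a minimal marking/transition model with hypothetical place names, not a full Petri net library): places hold tokens, and a transition fires only when all of its input places have tokens, consuming them and producing tokens downstream.

    from dataclasses import dataclass

    @dataclass
    class Transition:
        name: str
        inputs: list    # place names consumed
        outputs: list   # place names produced

    def enabled(marking: dict, t: Transition) -> bool:
        return all(marking.get(p, 0) > 0 for p in t.inputs)

    def fire(marking: dict, t: Transition) -> dict:
        if not enabled(marking, t):
            raise ValueError(f"{t.name} is not enabled")
        m = dict(marking)
        for p in t.inputs:
            m[p] -= 1
        for p in t.outputs:
            m[p] = m.get(p, 0) + 1
        return m

    # hypothetical order flow: an order and an idle worker are both needed to start processing
    receive = Transition("receive", inputs=["order_queue", "idle_worker"], outputs=["in_progress"])
    finish  = Transition("finish",  inputs=["in_progress"], outputs=["done", "idle_worker"])

    marking = {"order_queue": 2, "idle_worker": 1}
    marking = fire(marking, receive)
    marking = fire(marking, finish)  # -> {'order_queue': 1, 'idle_worker': 1, 'in_progress': 0, 'done': 1}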


Are there any rules of thumb where such an architecture should be considered? >X TPS? >Y milliseconds per txn? >Z milliseconds between write and subsequent read? Eventual consistency OK?


It depends! FWIW, the spots where I've used event patterns most often are high write loads that benefit from fan-in micro-batching before hitting the data stores, or polyglot systems where a client/system event needs to fan out to multiple systems and we want to insulate the producer from downstream latency/outages.
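The fan-in micro-batching idea in miniature (a sketch only, all names hypothetical): buffer events and flush them to the store every N messages or T seconds, whichever comes first, so the database sees one bulk write instead of one write per event.

    import time

    class MicroBatcher:
        """Accumulate events and flush them in bulk, bounding both batch size and latency."""

        def __init__(self, flush_fn, max_items: int = 500, max_wait_s: float = 2.0):
            self.flush_fn = flush_fn          # e.g. a bulk INSERT into the data store
            self.max_items = max_items
            self.max_wait_s = max_wait_s
            self.buffer = []
            self.last_flush = time.monotonic()

        def add(self, event) -> None:
            self.buffer.append(event)
            too_full = len(self.buffer) >= self.max_items
            too_old = time.monotonic() - self.last_flush >= self.max_wait_s
            if too_full or too_old:
                self.flush()

        def flush(self) -> None:
            if self.buffer:
                self.flush_fn(self.buffer)    # one write instead of len(buffer) writes
                self.buffer = []
            self.last_flush = time.monotonic()

    batches = []
    batcher = MicroBatcher(flush_fn=batches.append, max_items=3)
    for i in range(7):
        batcher.add({"event": i})
    batcher.flush()                            # flush the tail on shutdown
    # batches now holds three lists: 3 events, 3 events, then the final 1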


Idea: have event streams between microservices always be bilateral (two party) contracts. When you want to make a change, you can pick up the phone to the other end and get it done quickly.

Multilateral contracts easily lead to email chains or meetings over each change, and compromise data structures, similar to the “add a column on the end” culture often used in shared reldbs.

What is your motivation for wanting microservices? As an alternative, what about events between the processes of single codebase? In that case, when you want a schema change - change it, run your tests to see that all modules comply, redeploy everything.
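For illustration, events between modules of a single codebase can be as small as the sketch below (a toy in-process bus with hypothetical event names). The point is that a schema change then gets caught by your own test suite and fixed in one commit, rather than negotiated with another team.

    from collections import defaultdict
    from typing import Callable

    class EventBus:
        """Tiny in-process pub/sub: publishers and subscribers live in one codebase and one deploy."""

        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
            self.subscribers[event_type].append(handler)

        def publish(self, event_type: str, payload: dict) -> None:
            for handler in self.subscribers[event_type]:
                handler(payload)               # synchronous here; could be a queue or thread pool

    bus = EventBus()
    bus.subscribe("order_placed", lambda e: print("reserve stock for", e["order_id"]))
    bus.subscribe("order_placed", lambda e: print("email receipt for", e["order_id"]))
    bus.publish("order_placed", {"order_id": "o-1"})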


If you think about it, modern front end web development (with React + Redux) is event-driven! In a way, there are tons of people embracing it


Window systems are classically event-driven. Especially earlier single-thread ones from Microsoft.


them kids and their microservices...


Got a list of books for reference?



* Designing Data-Intensive Applications

* Monolith to Microservices

* Streaming Systems

* Designing Distributed Systems

* Building Event-Driven Microservices

* Kafka-Streams in Action

A comment in this post also suggests "Building Microservices" but I haven't read it.


A fully event driven service based application I wrote that matches file system interaction to peer to peer networking:

https://github.com/prettydiff/share-file-systems


You mean Erlang?


erlang is great (and has events) but it is not an "event-driven" system. It is very easy to write an "event-driven" system in erlang, though. (I have done CQRS, from 'scratch' in elixir + psql, which I believe is a flavor of event-driven system). Was very straightforward.


Would be very interested to know!


Yes. I am interested in knowing what was done before event-driven architecture to make robust asynchronous systems.


It depends on your definition of fully embraced. If you mean that there is no synchronous communication between services, then no, and neither does it make sense in the real-world scenarios I am aware of.

However, I am an advocate of the pattern and have seen it used successfully repeatedly. The largest scale was as the data lead for a product maintained by 100-200 developers and handling several thousand transactions per second.

To answer your specific questions

>handling breaking schema changes or failures in an elegant way, and keeping engineers and other data consumers happy enough?

We did not allow for breaking schema changes. If there is a breaking change, it's a new event/topic. We used Kafka and every topic needed to have a compatibility scheme defined (see https://docs.confluent.io/platform/current/schema-registry/a...) to clarify what constitutes a breaking change. Even though some claim that producers and consumers can be fully decoupled, you will need to have a good idea who your consumers are and the time horizon of the data they consume. Application engineers are usually easier to keep happy than machine learning practitioners and other data consumers that want to consume events emitted over a long time period, potentially years.
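For illustration only (this is a sketch against Confluent Schema Registry's REST API, with a hypothetical subject name, not our exact tooling), pinning the compatibility level makes the registry reject breaking schema changes at registration time rather than at consumption time:

    import json
    import requests

    REGISTRY = "http://localhost:8081"
    SUBJECT = "payments-value"        # hypothetical subject (topic name + "-value")

    # Pin the compatibility level so the registry refuses breaking changes for this subject.
    requests.put(
        f"{REGISTRY}/config/{SUBJECT}",
        json={"compatibility": "BACKWARD"},
    ).raise_for_status()

    # Registering an incompatible schema (e.g. removing a field without a default) now fails (HTTP 409).
    schema = {
        "type": "record",
        "name": "Payment",
        "fields": [{"name": "amount", "type": "int"}],
    }
    resp = requests.post(
        f"{REGISTRY}/subjects/{SUBJECT}/versions",
        json={"schema": json.dumps(schema)},
    )
    resp.raise_for_status()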

> As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one.

Dead letter queues are a tool you can use when the context demands it; applying them wholesale is likely to create too much overhead. But to provide you with a specific example: some emitted events will be revenue-impacting, and depending on your setup you may actually want to use the events for financial reporting (careful! some more info later). In this specific use-case, if you can't process a record, the last thing you want to do is throw the message away. Somebody will need to have a look at these records, fix the cause and then either re-emit the records based on what you know about them from the header or fix the records in the DLQ. So think about the guarantees you need to provide and decide whether a DLQ makes sense for your use-case.

Some other thoughts and considerations.

- Topics more or less directly become analytics tables, almost creating a unified view of your application's data that would otherwise be difficult to build.

- How are the messages emitted? Are they emitted from the application logic? If so, what guarantees do you need? What happens if the app crashes (e.g. after a DB transaction commits but before the event was emitted)? Depending on what you need, have a look at the transactional outbox pattern.
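A minimal sketch of the transactional outbox (sqlite standing in for your DB, table and topic names hypothetical): the business write and the outbox row commit in one transaction, and a separate relay publishes rows and marks them as sent, so a crash between "DB commit" and "event emitted" can no longer lose the event.

    import json
    import sqlite3

    def place_order(conn: sqlite3.Connection, order_id: str, amount: int) -> None:
        # The business write and the event land in the same transaction:
        # if the app crashes here, either both exist or neither does.
        with conn:
            conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount))
            conn.execute(
                "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                ("orders", json.dumps({"type": "order_placed", "order_id": order_id, "amount": amount})),
            )

    def relay_outbox(conn: sqlite3.Connection, publish) -> None:
        # Run on a schedule (or tail the log with something like Debezium); the relay
        # may deliver a row more than once, so consumers still need to be idempotent.
        rows = conn.execute("SELECT rowid, topic, payload FROM outbox WHERE sent = 0").fetchall()
        for rowid, topic, payload in rows:
            publish(topic, payload)
            with conn:
                conn.execute("UPDATE outbox SET sent = 1 WHERE rowid = ?", (rowid,))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, amount INTEGER)")
    conn.execute("CREATE TABLE outbox (topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)")
    place_order(conn, "o-1", 42)
    relay_outbox(conn, publish=lambda topic, payload: print("->", topic, payload))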


Here is an alternate point of view: Everything is event-based. Fundamentally, our universe is event based; things interact via events mediated by the "force-carrying particles". It's down there at the bottom. Even up here at massively higher levels, everything is fundamentally event-based.

If that's the case, then why don't we write entirely in terms of events as our base architecture? It isn't because we are ignorant of event-based processing, it is because we want the other types of systems in our lives. Transactions don't really exist; they are an abstraction we add to certain elements of the world that are behaviors they perform in response to certain types of incoming events. API calls don't really exist, they are a stereotyped pattern of an event making a call and event sent back, tied to the first via some ID (which may be a TCP socket), with a response, and no further activity on that ID for that event, which is an abstraction we added on top of events for a certain very common type of call.

Working directly with event-based systems is the software architectural equivalent of writing in assembler. Sometimes you have to do it because nothing else will do. However, you are dropping to a lower level, with all that implies, particularly the fact that you are now responsible for any of those nice properties that you want to enforce. Very similar to how if you want to use UDP, but you want some of the guarantees of TCP, you are now responsible for those guarantees. There's nothing wrong with that. It's just something you need to be aware of in making your decisions.

Being the foundational architecture everything else is based on, event-based systems can do anything that is possible to do, again, quite similar to how assembler is what can do anything the CPU can do. The other abstractions function by providing limits on the event flows in the system for their power, again just as a higher-level language like C, or perhaps even more clearly Rust, simply can not be used to generate all possible assembler instruction sequences, because the way they fundamentally work is to exclude sequences from the set of all possible sequences to only contain sequences maintaining certain properties. Event-based systems can implement API-call-type sequences internally. Event-based systems can implement transactions by manually implementing all the requisite limitations of events and event ordering and what things do in response to events. Etc.

But, if you have a system that needs any of those guarantees, it's kind of silly to start writing at an event-based level, only to have to painfully reconstruct the guarantees already available to you. As the Ancient Wisdom goes, "Any sufficiently complicated event-based system contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of TCP." Most of these abstractions are around for a reason.

Where things go wrong is when the abstractions become detached from their underlying implementation in people's minds and become the base abstraction. Probably the single biggest instance of that problem in this space is treating "the API call" as "the fundamental abstraction". API calls are an incredibly useful abstraction but, at the same time, a really terrible primitive to be the bottom of your system. Building the API abstraction out of event flows involves throwing away a lot of the capabilities of events. If an API call is what you need, and it is a very common need, that's a virtuous simplification, but when your needs exceed what an API call can do, you can really wreck a design by trying to implement an event system back on top of API calls. I've seen it at least twice in my career; one of the products I'm responsible for can almost literally be seen as a rewrite of a previous version that made that mistake in a way that basically killed it architecturally. We replaced it with an event-based system at the core... on top of which I immediately implemented an API call layer that ran most of the system... but right at the critical place it didn't, and I could reach back down the stack and use the raw event-based system in the core for a few critical bits of functionality.

This is my "alternate point of view". Everything is already event-based, even when you can't see it. However, that doesn't mean it's a good idea to work at that level of abstraction all the time any more than the fact CPUs run assembler means we should always be working in raw assembly code. It is not necessary to use raw events everywhere. It is not a betrayal of good design to have some API calls in your system, or a centralized transactional database, or even TCP (which most notably adds "ordering" to events). It is, however, necessary for software architects to understand that the event-based system underneath is fundamental, and that they view the other additional abstractions as islands of functionality based on the event-based core of the world underneath, and not confuse those islands with the bedrock. Many systems may not even expose event-based functionality at a raw level anywhere, but if you keep this principle in mind, if a raw-event use case ever pops up, your system will quite likely be ready to handle it.


You seem to have conflated two things here - do event-driven architectures work at all, and has anyone "fully embraced" it. It doesn't have to be "fully embraced" to be successful.

I started going down this rabbit hole a year ago (see the many good replies to this https://twitter.com/swyx/status/1241482183472295939?s=20) and most people feel it is "hard to reason about", which often seems code for unfamiliar.

What I and other people were lacking is a good framework to think about it. Unfortunately this has compromised my credibility to you as I left Amazon to go work on this very problem at https://temporal.io this year. I'll try to give some thoughts for how we tackle this but wanted to give that disclaimer upfront - not trying to sell you anything other than "i think this architecture could work use whatever you want"

1. DLQs - the AWS answer would be to wire up Lambda and SQS to build your own DLQ retry system (https://aws.amazon.com/blogs/compute/using-amazon-sqs-dead-l...). This is a bunch of extra provisioning and coding. So you may want to use the retries built into Step Functions (https://aws.amazon.com/blogs/developer/handling-errors-retri...). But instead of learning a bespoke States Language and debugging-by-redeploying-cloudformation (sooo slow lol), you may wish to work in a proper programming language SDK you can run and test locally instead (this is Temporal.io's approach)

2. Failures - any decent workflow engine will log and retry your failures for you, i wouldn't write my own logic for that these days

3. Microservice communication - what problems do you foresee? need more here. We simply call them Signals (send data in) and Queries (get data out) and it works well.

4. Breaking schema changes (versioning/migration) - yes this is really fragile unless you have a proper framework to bring this all together. We just build in versioning into our SDKs and give you a replay tools to verify you've handled still-running workflows (https://www.youtube.com/watch?v=kkP899WxgzY)

5. Keeping engineers happy - this one REALLY depends on what you're talking about, but being able to write tests for your asynchronous/distributed system is important for increasing confidence, as is being able to work in your preferred language (polyglot microservices), making every part of the system horizontally scalable so you don't have random bottlenecks, having everything logged and persisted so you are resistant to network/machine failures and can figure out exactly what went wrong when it goes wrong... I could go on.

Of course I'd love for more neutral users of workflow engines to chime in if I got anything wrong here. Just trying to offer what I've learned so far working in this area.


> do event-driven architectures work at all, and has anyone "fully embraced" it. It doesn't have to be "fully embraced" to be successful.

For context, I've seen successful event-driven architecture implementations when it comes to data ingestion for analytics/ML purposes.

What I have yet to see is a successful implementation of microservices using local state stores built by joining streams owned by multiple domains. That's where the (arbitrary) "fully embraced" term came from.

Does everyone implement the outbox pattern themselves? Or treat their streams as the only source of truth, materialising the state that is already embedded in them?


"Event-driven architecture" only considers a

small subset of the computational events fundamental to

understanding computation.

Actor Theory is based on automatizing the "precedes" partial

order for all computational events.

Proving properties of computation al systgems can be

accomplished using Actors Event Induction for computational events.

For more information see the following video:

https://www.youtube.com/watch?v=AJP1VL7shiI


> Every event-driven architectural pattern I've read about can quite easily fall apart and I have yet to find satisfying answers on what to do when things go south.

Just a kind request from someone living in the southern hemisphere not to use "south" as a synonym for "bad" or "fail".



