I was responsible for Stripe's API abstractions, including webhooks and /events, for a number of years. Some interesting tidbits:
Many large customers eventually had some issue with webhooks that required intervention. Stripe retries webhooks that fail for up to 3 days: I remember $large_customer coming back from a 3 day weekend and discovering that they had pushed bad code and failed to process some webhooks. We'd often get requests to retry all failed webhooks in a time period. The best customers would have infrastructure to do this themselves off of /v1/events, though this was unfortunately rare.
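For the curious, that replay infrastructure can be a very small loop. A minimal sketch, assuming the official stripe-python client; handle_event here is a hypothetical dispatcher of your own, and it should be idempotent since some events in the window may already have been processed:

    # Re-drive missed events from /v1/events for a time window after a bad deploy.
    import stripe

    stripe.api_key = "sk_live_..."  # placeholder

    def replay_events(start_ts: int, end_ts: int) -> None:
        events = stripe.Event.list(
            created={"gte": start_ts, "lte": end_ts},
            limit=100,
        )
        for event in events.auto_paging_iter():
            handle_event(event)  # ideally the same code path your webhook endpoint uses

    def handle_event(event) -> None:
        # Hypothetical dispatcher; route on event.type to your business logic.
        if event.type == "invoice.payment_failed":
            ...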
The biggest challenges with webhooks:
- Delivery: some customers timing out connections at 30s, causing the queues to get backed up (Stripe was much smaller back then).
- Versioning: synchronous API requests can use a version specified in the request, but webhooks, by virtue of rendering the object and showing its changed values (there was a `previous_attributes` hash), need to be rendered to a specific version. This made upgrading API versions hard for customers.
There was constant discussion about building some non-webhook pathway for events, but they all have challenges and webhooks + /v1/events were both simple enough for smaller customers and workable for larger customers.
Shameless plug but I've built https://hookdeck.com precisely to tackle some of these problems. It generally falls onto the consumer to build the necessary tools to process webhooks reliably. I'm trying to give everyone the opportunity to be the "best customers" as you are describing them. Stripe is big inspiration for the work.
Do you provide the ability to consume, translate, then forward? I am after a ubiquitous endpoint I can point webhooks at and then translate to the schema of another service and send on. You could then share these 'recipes' and allow customers to reuse well-known transforms.
You can do this with BenkoBot, we just launched custom webhooks (although it's not in the interface yet). So you can receive a webhook and run some arbitrary javascript to transform it then send it on somewhere else:
Our main focus is on handling Trello notifications, and the Trellinator library I wrote is built in. Our objective is to create more API wrappers over time to make it as simple as possible to deal with as many APIs as possible. You can see some example code here:
You currently require a Trello account API key/token to sign up, but you can use it as you described to be a generic endpoint, transform however you want with JS then post the data onto another endpoint.
Transformations are something we haven't built yet, but we have our eyes on it as you are not the first one to bring that up. You can use Hookdeck in front of a Lambda and do the transformation there; you'd still get the benefit of async processing, retries, etc.
Do you have an idea of when Hookdeck will have transformations? It's not something we need immediately, but it would be the win over something like https://webhookrelay.com/ if it's something you have on your roadmap for sometime soon.
> We'd often get requests to retry all failed webhooks in a time period.
(I worked on the same team as bkraus, non-concurrently).
For teams that are building webhooks into your API, I'd recommend including UI to view webhook attempts and resend them individually or in bulk by date range. Your customers are guaranteed to have a bad deploy at some point.
At Lawn Love, we naively coupled our listening code directly to the Stripe webhook... but it worked flawlessly for years. I wasn't a big fan of the product changes necessitating us switching from the Transfer API for sending money to the complicated--and very confusing for the lawn pros--Connect product, but its webhooks also ran without issue from the moment we first implemented them. So thanks for making my life somewhat easier, Mr. Krausz.
Like many others, I now pattern my own APIs after Stripe's.
Don’t fully thank me, I was also the architect of the Transfers API to Connect transition :). There’s a lot I would have done differently there were I doing it again, though much of the complexity (e.g. the async verification webhooks) was to satisfy compliance needs. Hard to say how much easier the v1 could’ve been given the constraints at the time, though I’m very impressed with the work Stripe has done since to make paying people easier (particularly Express).
Thanks for the input. We're currently working on a similar solution, so I was really curious to learn more.
One thing I really admire is how Stripe makes it transparent which events were fired, both in general through the Developer area and on specific objects like customers, subscriptions, etc.
Pretty easy for a customer to set up an SQS queue and a lambda for receiving them rather than rely on their infrastructure to do all the actual receiving. Way more reliable than coupling your code directly to the callback.
This is precisely what we do where I work. We have a service which has just one responsibility: receive webhooks, do very basic validation that they're legitimate, then ship the payload off to an SQS queue for processing. Doing it this way means that whatever’s going on in the service that wants the data, the webhooks get delivered, and we don’t have to worry about how 3rd party X has configured retries.
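Roughly, that thin receiver can be a one-screen Lambda behind an API Gateway proxy integration. A sketch only — the queue URL, secret env vars, and "x-signature" header name are hypothetical placeholders, and each provider has its own signing scheme:

    # Validate the webhook signature, then hand the raw payload to SQS.
    import hashlib
    import hmac
    import os

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = os.environ["WEBHOOK_QUEUE_URL"]
    SECRET = os.environ["WEBHOOK_SECRET"].encode()

    def lambda_handler(event, context):
        body = event["body"]
        signature = event["headers"].get("x-signature", "")

        # Basic legitimacy check: compare our HMAC of the body to the sender's signature.
        expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, signature):
            return {"statusCode": 400, "body": "bad signature"}

        # Ship it off; actual processing happens in whatever consumes the queue.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
        return {"statusCode": 200, "body": "queued"}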
These reasons are exactly why we started Svix[1] (we do webhooks as a service). I wish we existed to serve you guys back when you started working on it. :)
I always laugh when people end up with designs like this. They could have just used SMTP! It's designed to reliably deliver messages to distributed queues using a loosely-coupled interface while still being extensible. It scales to massive amounts of traffic. It's highly failure-resistant and will retry operations in various scenarios. And it's bi-directional. But it's not "cool" technology or "web-based" so developers won't consider it.
Watch me get downvoted like crazy by all Nodejs developers. Even though they could accomplish exactly what they want with much less code and far less complex systems to maintain.
I pitched an idea like this years ago to essentially backfill one ticketing system into a shiny new system that could read an email inbox. The idea was that if we dropped an email in that inbox with its desired format for each old ticket's updates, the new system would do all the necessary inserts and voila. They told me no -- not because of any technical reason, but because their email infrastructure was required to be audited by the SEC, and they would have opened themselves up to significantly more auditing. Instead, I ended up having to do it through painful, painful SQL.
Lesson being, that sometimes there are unexpected reasons why a specific piece of technology shouldn't be used.
You're not allowed to use SMTP without calling it email?
It sounds like not being allowed to use HTTP for RESTful APIs without calling it a website. (And that org requiring websites to be audited to fulfill accessibility requirements for physical disabilities.)
For the record, I disagreed with them also and pushed back pretty aggressively and found workarounds to the audit problem. But CTO and CIO basically took it as a challenge against their authority and denied me at all points.
It's not clear from the message whether the software is setting up its own email system; if so, it will need to be audited and certified, which is a major hassle.
Either way, the auditors and the infrastructure might not want to handle an order of magnitude more traffic (API usage is really in a different league than occasional human email). Expect all emails to have to be stored for around a decade.
I agree that people don't react well to negativity, but sometimes you have to say it. Node has a lot of very stupid (i.e. ignoring reality) decisions, and by extension, being exposed to this for a long enough period of time tends to affect the developer as well.
I say this from experience, as someone who's used a few stupid technologies over time.
This was not a time when you had to say it. None of the existing conversation was about languages or ecosystems, and taking cheap shots was wholly unnecessary to the suggestion of using SMTP.
SMTP itself is interesting, although it comes with fun new footguns like STARTTLS.
"STARTTLS is an email protocol command that tells an email server that an email client, including an email client running in a web browser, wants to turn an existing insecure connection into a secure one."
I'm sure absolutely nothing bad will come from that last bit. Oh look:
"And yes, STARTTLS is definitely less secure. Not only can it failback to plaintext without notification, but because it's subject to man-in-the middle attacks. Since the connection starts out in the clear, a MitM can strip out the STARTTLS command, and prevent the encryption from ever occurring."
DANE is meant to fix that. If someone asserts, via DNS records (signed by DNSSEC), that their SMTP server is able to use TLS, then you should only accept connections using TLS to that SMTP server.
And DANE is never going to happen; DANE advocates have been saying this for over a decade, and the only change has been that the IETF and all the major email providers moved forward on a new protocol, MTA-STS, specifically to avoid needing DNSSEC (which nobody uses) to solve this problem.
Almost every time anyone mentions DNSSEC here on HN, you pop up like a jack-in-the-box to claim that nobody is using it and that it is dead. And it’s always you, nobody else. Whereas, from where I sit, I work at a registry and DNS server host (among other things) where about 40% of all our domains have DNSSEC (and that number is constantly climbing). Every conference I go to, and in every webinar, people seemingly always talk about DNSSEC and how usage is increasing.
You might have some valid criticism about the cryptography; I would not be able to judge that (except when you are basing it on wildly outdated information). I’m not an expert on the details; you could most assuredly argue circles around me when it comes to the cryptography, and possibly about the DNSSEC protocol details as well. But, from my perspective, your continuous claim that “nobody uses” DNSSEC is simply false. DNSSEC works, usage of DNSSEC is steadily increasing, and new protocols (like DANE) are starting to make use of DNSSEC for its features. Conversely, I only relatively rarely hear anything about MTA-STS.
Take any list of the top domains on the Internet --- any of them at all --- and run them through a trivial script, like:
    #!/bin/sh
    # For each domain on stdin, look up its DS record; a DS record in the
    # parent zone is what indicates the domain is DNSSEC-signed.
    while read domain
    do
      ds=$(dig ds "$domain" +short)
      echo "$domain $ds"
    done
... and note that virtually none of the domains, in any sane list of top domains, are signed. That was true several years ago and remains true today, despite the supposed "increase in usage" of DNSSEC.
What's actually changed is that registrars, especially in Europe, now apparently auto-sign domain names. That creates a constant stream of new, more-or-less ephemeral signed zones that gives the appearance of increasing DNSSEC adoption. Of course, this is also security theater (the owners of the zones don't own their keys!). The real figure of merit for DNSSEC adoption is adoption by sites of significance, and that has been static, and practically nonexistent, for a decade.
It is no surprise to me that people working on the DNS talk quite a bit about DNSSEC. People who worked on SNMP talked quite a bit about SNMPv3, and IPSEC people probably really believed there would be Internet-wide IKE. None of those things happened, because what matters in the real world is what the market decides. Most especially at the companies with serious security teams, DNSSEC is a dead letter standard.
Registrars can’t “auto-sign” domains. Only DNS server operators can do that, if they have the cooperation of the registrar. And the DNS server operators is the only workable definition of “owners of the zones”, so they do own their keys. It can’t work any other way.
In fact, the new CDS and CDNSKEY DNS records allow it to work the other way around; DNS server operators can auto-sign domains, and the registrars need not be involved at all.
> The real figure of merit for DNSSEC adoption is adoption by sites of significance
People said the same about IPv6. Or maybe you do, too?
> People who worked on SNMP talked quite a bit about SNMPv3
I seem to recall you mentioning quite often how WHOIS was dead and would be replaced by RDAP. That didn’t happen either.
> IPSEC people probably really believed there would be Internet-wide IKE
Interestingly, that problem could in theory be solved by DNSSEC. We’ll see what happens.
I don't think you ever saw me mention that WHOIS is dead, not least because that's not a thing I believe. What a random thing to say; you can just use the search bar to immediately see the (very few) things I've had to say about RDAP here.
> If the client is configured to require TLS, the two approaches are more-or-less equally safe. But there are some subtleties about how STARTTLS must be used to make it safe, and it's a bit harder for the STARTTLS implementation to get those details right.
I previously thought that was the default, good to know it isn't / might not be
2. Not wanting to use two different TCP port numbers to send and transfer mail.
To solve these problems they created STARTTLS. But obviously, STARTTLS isn't actually secure (even though that was the point of supporting TLS). So to make it secure, it's suggested to use DANE - a standard built on a different protocol, requiring a feature that is controversial, potentially dangerous, and not widely implemented. So you can use a kludge (STARTTLS) with a kludge (DANE) to send and transfer mail securely. But should you?
Since 2018, RFC8314 says that e-mail submission should use implicit TLS, not STARTTLS (https://datatracker.ietf.org/doc/html/rfc8314#section-3). Therefore the use of STARTTLS, and the use of DANE to make it secure, are deprecated. So while you shouldn't use DANE for anything seriously, you really shouldn't use it for SMTP.
Even if implicit TLS is used instead of STARTTLS, DANE is still necessary to avoid forcing backwards-compatible agents to fall back to unencrypted traditional communication.
DANE is necessary as long as there are still some agents using backwards-compatible behavior; i.e. falling back to unencrypted communication if TLS is in some way blocked.
Those agents should not be falling back to unencrypted anyway! The whole ecosystem just needs to get onboard with implicit TLS and deprecate the old agents. It's not acceptable to make the whole ecosystem dependent on two completely different security mechanisms. Every client/server in the world would have to support both indefinitely, which would be a totally unnecessary cost and complexity burden.
I mean, if we accept completely deprecating non-TLS connections, then there still would be no problem with STARTTLS! Servers would just need to only allow the STARTTLS command, and refuse any other commands until after the TLS handshake. I believe that many server programs allow this configuration today.
It is only when we allow backwards compatibility that something is needed to differentiate to the clients whether the server is new enough to allow TLS or not.
That's only a footgun if your system is set up to allow an insecure connection to continue. Just because the protocol allows it does not mean you can't add additional requirements.
> Node has a lot of very stupid (i.e. ignoring reality) decisions, and by extension, being exposed to this for a long enough period of time, tends to affect the developer as well.
As a developer who used Salesforce for nearly a year once upon a time, I can confirm that exposure to stupid decisions in a platform can affect the developer.
Node, though? Could you expand on the stupid decisions in Node? And does Deno address those?
I use and love nodejs daily, and I think I can speak to some of the stupid. A lot (but not all) has to do with ecosystem.
Some of the stupid in node just comes from the fact that there's still a lot of reinventing the wheel, and doing a less good job of it. Like, we've got all these backend frameworks, but still nothing at all that compares to eg Spring. Can you even find a nodejs lib that does HATEOAS properly and completely? How often do you find yourself doing string parsing, or handling a JSON object, when you know it would be more efficient to be handling a stream, or that really the kind of work you're doing ought to be handled by your framework but isn't?
As for nodejs itself, it's much better in 2021 than it was in the past. But it's still a massive runtime. And I have mixed feelings about eg Worker threads. As for node_modules, I get the sense that we're just replaying the history of Microsoft's dll story, needing to relearn all the lessons that should have been learned already.
As for Deno, I think it comes with great ideas. In many respects, I like it better than Nodejs. Most of its good ideas, Nodejs is flexible enough to accommodate. One of Deno's main advantages is that it doesn't have any legacy to support, so it can embrace things like ECMAScript modules more easily. Its library system is closer to Go's, although I think the end result is that a lot of folks end up doing one-off systems that end up looking a lot like the nodejs module resolution system in the end. Deno's main disadvantage is that it is not compatible with nodejs libraries. That's also an advantage insomuch as you have a clear module import spec from the get-go.
In short, the stuff that Deno can do, Nodejs can do, and I'm not sure that its cleaner system can overcome the fact that the same is accomplishable in Nodejs. I'd be more than willing to use Deno in a greenfield project because I like all the technology choices it makes, but fundamentally, the technology choice you're making is whether or not to use V8, and adopting Deno is almost just a way of pressing the reset button on the ecosystem, which may or may not be a good thing depending on your needs.
I actually did use SMTP as queuing middleware for a registrar platform years ago.
It worked very well.
EDIT: To add some context, my team had come off building a webmail platform, and so we'd done lots of interesting stuff to qmail and knew it inside out. We then launched the .name tld and built a model registrar platform that on registration would bring up web and mail forwarding for users that wanted it. We used SMTP to handle the provisioning of those while keeping the registration part decoupled from the servers handling the forwarding. We also used it to live-update a custom DNS server I wrote.
I remember interviewing someone who worked on a DNS platform where IIRC the DNS zone files were propagated by SMTP to DNS servers. The details on this were that there was a 5-minute SLA (I believe) on the loading of zone records, essentially that the DNS servers were polling the mailbox and parsing new records since some last loaded time stamp.
For a second there I wondered if you'd ever interviewed me (but having looked at your profile: no; I don't think we've met, though I'm in London too).
We had similar-ish constraints. SLA was internal, not imposed (the .name registry had externally imposed SLA's, but the registrar platform did not), but the zones were very simple - either NS records pointing elsewhere, or identical CNAME/MX records, so we needed only a short string per address.
I don't remember if we used CDB files or if we stored individual records directly in ReiserFS filesystems (our mail platform had relied heavily on the ability of ReiserFS to handle vast quantities of tiny files, so were comfortable with that), but it was definitively something simple.
Similar for the web forwarding, which just required a url to redirect to.
If a node should ever need to be replaced, all we'd need to do would be to start a queue on a new box but not process it, then rsync over the dataset from another server, and start processing the queue, and add it into rotation when up to date. If we'd needed stricter consistency guarantees it'd have been a different consideration.
For many types of workloads I'd pick another queuing system today, but the amount of readily available tooling for e-mail, especially once you need federation, reflection/amplification etc. does make it an interesting choice for some things.
It also made debugging the message flow trivial: just add a real mailbox to the cc:....
So someone told you he was prejudiced against, you respond with more prejudice, and then you pretend it's his attitude that colored your perception. Are you for real?
Honestly this is so stupid brilliant I love it (stupid as in I can’t believe I hadn’t considered this). Honestly it really is about storing, sending, and checking messages so SMTP makes so much sense!
I’ve been building for the web for 15 years, and it shows how far I can hyper-focus on certain communications implementations while not looking at pre-existing options that really meet a large number of use cases. I suppose it also means making sure your data consumers are comfortable working with the protocol, but it’s a really top-notch idea.
SMTP used to be a lot more reliable than it is now. Now, with all the changes to help with blocking spam, you have to be very careful or have a lot of control over the receiving server to ensure you actually get delivery. Some anti-spam systems will just discard if the matching rules indicate the spam likelihood score is above a certain threshold, and mistakes in rules at system levels can and do happen.
But here's another way you could (ab)use the mail system for delivery, provide a mailbox for the client and just allow IMAP or POP access and throw the messages into that. The client can log in to access and process them (which they would likely be automating on their own mailboxes anyway). It does mean it's housed at the provider, but it's also pretty easy to scale. There's lots of info on how to set up load balanced dovecot clusters out there, and even specialized middleware modes (dovecot director) to make it work better so you can scale it to very very large systems.
I don't think mr throwaway was advocating to use email to send the events, only to use SMTP. Email is an entire ecosystem, SMTP is only a protocol.
If the distinction is too hard to make: think of it as using the 'Simple Event Transfer Protocol' that just happens to use exactly the same protocol as SMTP.
> Email is an entire ecosystem, SMTP is only a protocol.
Yeah, but it's a protocol for transferring email. As I noted with "you have to be very careful or have a lot of control over the receiving server to ensure you actually get delivery", you can abstract most of the mail system out as long as you ensure you are running the server they deliver to, but you would also need to rely on them making sure their outgoing server is good for this, which probably means dedicating it to this and not running any real mail through it (so you avoid outgoing company email filters, etc). At that point, both sides are running specific bespoke mail servers, which cuts down on the usefulness of the solution because of how much setup and administration it requires.
It used to be that nobody ran incoming and outgoing filtering on email, so it was a robust channel for communication with retries, notifications for failed delivery, etc. These days it's not exactly that, because of all the spam mitigations and company compliance and risk mitigations that might be in place. In fact, just setting up a new mail server and attempting to send to Microsoft (live/hotmail), Yahoo or Gmail is extremely hard, because they have a high bar for acceptance, and large swaths of the easily obtained IP space have already been blacklisted from prior spam use, so you start with a bad reputation and have to work to get it to a level where you're even allowed to talk to others by working with all the third-party (and first-party) blacklist-maintaining entities.
Yeah it does make more sense to have the IMAP/POP setup rather than actively sending out emails through consumer level email services like gmail etc where deliverability might become a concern.
The difference is that using a mail subsystem to handle this handles a lot more of the implementation than "use an atom/rss feed".
Notably, in choosing to use an atom/rss feed, you need to determine what the webserver serving it is, how to implement authentication on top of it (is it a token/oauth, HTTP auth, param auth, etc), what is the underlying data store (SQL/NoSQL, some message system), how to scale that system if you expect it to be large and span multiple servers and/or datastores (mail systems right now deal with hundreds of thousands of users and gigabyte plus mailboxes of millions of messages).
Choosing IMAP to deliver this info means there are well-worn solutions for all the decisions you need to make (including howtos to implement oauth at the server level), as well as client-level libraries in almost every language. Basically, you could decide to use it and not have to worry about forging a new path on that system basically ever, because there's plenty of people that have already implemented it at a larger level and with the same features (even if you would be using them to slightly different effect), and they've contributed the info on how to do it and what the performance ramifications are to the public domain.
I'm not seriously advocating for it, but that's more because clients will look at you funny than for any technical reason. Technically, it actually has a lot going for it. Unfortunately as an industry we fetishize the new and bespoke because obviously our own unicorn projects are so new and special and will serve so many people that some off the shelf solution could never be as good....
SMTP would raise too many questions, from how both datacenters tolerate it (spam), to who will manage the receiving server itself and certificates on your side, and overall security of this setup. For a nodejs developer it’s really easier to spin up a separate handmade queue process rather than managing SMTP-related things. Webhook (for runtime) and long-polled /events?since= (for startup) have all upsides with little downsides.
When designing something like this as a service, the biggest question is what other developers will find easy to use. Every cheap host supports inbound HTTP requests, and most web developers know how to receive them.
Stripe needs to be usable by both the developers building intense, scalable, reliable systems and the people teaching themselves to code in a limited context on a limited platform.
>And it's bi-directional. But it's not "cool" technology or "web-based" so developers won't consider it.
I might be missing a point or two here, but I don't see how SMTP can work for this case at all. You would require every API consumer to set up an SMTP server (which is another piece of infrastructure to maintain), and then somehow have a layer of authentication so the recipient can control who posts messages on that server (overhead for the publisher per new customer). Then we still haven't resolved the issues on the customer side (bad code could pop all the messages, and now we might require the publisher to replay them again).
I haven't even started to think about security and network hardening challenges yet. Again, I might be missing the point but this is not a case of cool tech overuse to me.
SMTP servers support SSL. Using client certificates and/or HMAC-signed messages takes care of the security. You have the same security considerations for HTTP.
As for "setting up an SMTP server", the point is that compared to the current requirement of a webhook, you're going to need a queuing mechanism or a pull mechanism or both anyway. So you can build a custom solution, or you can pick an existing queuing mechanism that people have spent literally decades providing a vast array of software options for.
And yes, you're right, you can always end up needing a way to trigger a replay because no matter what you do the customer might do something stupid. Nothing you do will get you away from that. So either you require them to always pull, or you provide an option to push and an API to trigger redelivery for when they've done something stupid. If you opt for push, SMTP is an option worth considering, because no other queuing mechanism has as many available ready-made and battle-hardened queuing options.
There are many cases where it'd not be suitable, but in the situations where SMTP is a bad choice, webhooks are likely to be an absolutely awful choice.
I speak from actually having run messaging on SMTP both as an e-mail provider with a couple of million users and having used it as messaging middleware in production.
I'm confused, "use SMTP" doesn't even type-check for me. Isn't SMTP just a transfer protocol? Meaning it defines a bunch of commands and gives them meanings (like EHLO and DATA and such), just like how HTTP defines commands like GET and POST and all that? Isn't the problem here about e.g. the storage & retry logic rather than about the data transfer itself? Can't you retry transmission as frequently as you like using whatever protocol you like? How does transferring the data over SMTP gain you anything compared to HTTP?
"use SMTP" here is a short way of saying "send mail to a mail server that will store the requests indefinitely" instead of webhooks that are constantly retrying on a protocol that was hand-written instead of being baked into the whole internet already.
> "use SMTP" here is a short way of saying "send mail to a mail server that will store the requests indefinitely"
So the suggestion is to use email? That's not how others are interpreting it. [1] And it doesn't make sense to me either. Emails as they are "baked into the whole internet already" are unencrypted with tons of middlemen, and even their transport isn't guaranteed to be encrypted. Email is also munged and messed with in weird ways, with fun stuff like each middleman tacking on their own headers and filtering it out based on unknown rules. It also introduces a ton of latency and severely prioritizes "eventually reaching the destination" over timeliness. And more downsides I can't think of off the top of my head. That seems like a really poor choice for an event delivery mechanism.
There are a billion lego pieces out there in the e-mail ecosystem. We can combine them any way we want, if the alternative is a totally custom solution anyway (webhooks, custom API endpoints, cron jobs, queues, etc). There are so many options; where to begin!
First, you don't have to use the rest of the internet's e-mail system. Stripe can run their own mail servers that deliver straight to clients on non-standard ports using implicit TLS, ensuring security and no middle-men. This also ensures delivery is as timely as possible (sub-second typically, as mail software has to be fast to handle its volume).
Let's say you want to poll (ex. "/events"). The client uses IMAP to poll the Stripe server with a particular username/password. Check a folder, read a message, delete it on connection close. There are of course ready-made solutions for this, but you can also write simple IMAP clients really easily using libraries.
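A rough sketch of that polling client with Python's stdlib imaplib — the host, credentials, and "Events" folder here are made-up, and a real client would also want IMAP IDLE or a sleep between polls:

    import email
    import imaplib

    def poll_events(host: str, user: str, password: str) -> None:
        with imaplib.IMAP4_SSL(host, 993) as imap:
            imap.login(user, password)
            imap.select("Events")
            _, data = imap.search(None, "UNSEEN")
            for num in data[0].split():
                _, msg_data = imap.fetch(num, "(RFC822)")
                message = email.message_from_bytes(msg_data[0][1])
                process_event(message)                   # your business logic
                imap.store(num, "+FLAGS", "\\Deleted")   # "ack" the event
            imap.expunge()

    def process_event(message) -> None:
        print(message["Subject"], message.get_payload())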
Let's say you want pushes (ex. webhooks). The client sets up the alternative to the webhook-server they'd have to set up anyway: an SMTP server. Use a custom domain, one that has nothing to do with the customer's main business, so nobody ever gets confused. Configure it to only accept mail from a "secret mail sender" (aka webhook secret). Part of the "SMTP webhook URI" would be what mailbox to deliver the webhooks to. The client then configures an MDA on their mail server to immediately deliver new messages to some business logic code. If the MDA or business-logic code has a bug, the messages will stay in the client's mailbox until they are "delivered" successfully. If the client's SMTP server is down, Stripe keeps retrying for at least 3 days, more if Stripe wants.
Stripe could actually implement both by keeping messages in an IMAP folder on Stripe's servers, and deleting the messages once the SMTP server confirms delivery to the client. Of course all messages already have unique IDs so removing dupes is easy.
You could implement all of this in a week, write almost no code, and still handle all the weird edge cases. Virtually all of that time is just reading manual pages and editing config files. The end result is a battle-hardened fault-tolerant event-driven standards-based distributed message processing system. The maintenance cost will be "apt-get update && apt-get upgrade -y", and anyone who can configure Postfix and Maildrop can fix it.
Hey throwaway! I think this could work, but it might not be the highest priority. Think about this perspective:
The API is entirely HTTP, and tries to meet users where they are by providing tools that they are most comfortable with. Frequently, these users are familiar with websites or mobile apps. As such, webhooks are implemented over HTTP.
If there was an alternate way to integrate with events, it'd be something that's either:
1. accessible to novice users, or
2. delivers on high throughput/latency needs of the largest users, or
3. resolves a storage/latency/compute cost incurred behind the scenes
Thinking about these:
For #1, websockets would score better than SMTP
For #2, kafka (or a managed queue like SQS) would score better (many support dead letter queues and avoid the latency at the mail layer)
For #3, it isn't clear that SMTP reduces the latency, compute, or storage costs
SMTP might be familiar -- and it's possible for you to build your own webhook → SMTP bridge if you wanted it -- but doesn't score well enough on any of these metrics to be built in-house.
[Disclaimer: I work at Stripe, and these opinions are about how I'd approach this decision. They're not the opinions of my employer.]
Yeah, I totally agree it's really out of left field compared to what users are comfortable with. Like another commenter said, clients would probably laugh you out of the room for proposing it. (though that's half my point! why are we only accepting these half-baked custom solutions on janky platforms? fear of criticism? is it really saving anyone any time or money compared to the "weird solution"?)
But I'm not convinced on the latency/compute/storage comparison with Kafka or other solutions. I think a POC would need to be built and perf tested, and then tweaked for higher performance and lower cost, like most software. Considering the volume of traffic that mail software is designed for, I can't see how even a large provider like Stripe would have difficulty scaling a mail system to match Kafka. It's not like mail software is written in Java or something ;-)
What about events that need faster than 1-minute response times? Any push-notification-like system is going to be just as error prone. And what about multiple message handlers? And what happens when the send fails? Did someone write the code to check the inbox for them and handle them? When a send fails multiple times, is that logged, and is there a system for clients to check that log? Message transfer isn't the hard problem in this domain.
There's nothing about SMTP that dictates response times or in any way makes it much slower than HTTP. A non-pipelining client will require a few more network roundtrips if it connects and disconnects for every message, that's all.
> Any push notification like system is going to be just as error prone.
E-mail servers are built with retry logic and queuing logic already. The point is if you need queuing anyway, it offers a tried and tested mechanism with a multi-decade history and a vast number of interoperable software options. While there is now a relatively decent number of queuing middleware options, none of them have as many server and client options as SMTP.
SMTP isn't the best choice for everything, but it works (I've used it that way), it's reliable, and it scales with relative ease.
> And what happens when the send fails?
It gets retried. Retries are built in to mail servers. That's part of the point.
> And what about multiple message handlers?
What about it? Most SMTP servers provide mechanisms for plugging in message delivery agents rather than delivering to a mailbox, or you let it deliver to a mailbox and pick it up from there. Or you plug in whatever routing mechanism you want to distribute the messages further. The sheer amount of ready-built options here is massive.
> Did someone write the code to check the inbox for them and handle them? When a send fails multiple times, is that logged and is there a system for clients to check that log?
Pretty much every e-mail server ever written provides a mechanism for handling persistent failures, and many of them offer heavily configurable ways of doing it. But yes, you'd need to decide on what to do about persistent failures. But you need to do that with whatever queuing system you use.
> Message transfer isnt the hard problem in this domain.
The point of the article is exactly that reliable message transfer is the hard problem in this domain.
I think it depends on the developer. There's developers hammering out boring business logic as fast as possible, and there's developers with a deep understanding of machine internals, protocols, and infrastructure. For the former, SMTP is black magic they'd probably never think of, and involves engaging the one infra person that's always busy.
It also means standing up and managing "infrastructure"
I sort of agree, but somebody already has to manage the "infrastructure" of their web apps and DNS. They never mind adding more of their own home-grown services. If they used Kinesis instead, that's another piece of infra to maintain. But you would never hear them say "what about Postfix instead". Regardless of infra, if it's new, they want to use it, even if something older and more boring would work better.
If I ever heard a dev at work say "No I won't use that new tech, it's too untested/I'll have to spend more time figuring out how to make it work well", I would shit my pants. Whereas if it's old tech, "it's not modern/I'll have to spend more time figuring out how to make it work well". It's practically software ageism...
You’re likely blinded by a “nodejs monkey developer” stereotype which prevents you from seeing that node is what everyone wanted back then. It’s very, very easy to create an http-based analog of any “traditional” service in node and to free yourself from learning all the shady details (of which there are a lot) of configuring it and keeping it alive at all levels, were it based on traditional software. Node is extensible, configurable networking itself, and http(s) is a quintessence of all text protocols. All that we wanted back then is available now in node at much finer granularity and much less configuration or headache. “They” spin up home-grown services because it is a natural one-page-boilerplate straightforward thing to do in node, not because of ageism or something similar.
I tell you that as someone who fiddled with sendmail.cf’s and other .conf’s way too much long before nodejs became a thing. Now it’s a relief.
> There's developers hammering out boring business logic as fast as possible and there's developers with a deep understanding of machine internals, protocols, and infrastructure.
Purely anecdotal of course, but I follow a number of the latter, and they're either sparsely employed or often employed in a capacity where it doesn't matter. There was a comment in this thread where one person had such an idea, and it was rejected for what were essentially business reasons.
Developers won't be able to use the existing email systems of the company, too critical and managed by another team. They will never be able to reconfigure it and get API access to read emails. Note that it may or may not be reliable at all (depends on the company and the IT who manages it).
Developers won't be able to set up new email servers for that use case. Security will never open the firewall for email ports. If they do, the servers will be hammered by vulnerability scanners and spam as soon as they're running. Note that large companies like banks run port scanners, and they will detect your rogue email servers and shut them down (speaking from experience).
Nothing prevents offering delivery on alternative ports for people with incompetent security teams that think port numbers are sufficient to determine whether something is a threat.
As for "being hammered", rejection of invalid recipients before even getting to the DATA verb is cheap.
Having actually run both an e-mail service and SMTP used as messaging middleware, I have dealt with these issues.
The security team is not incompetent. Large companies do not permit developers to spin up their own email systems without audit and regulatory retention. The port number is sufficient to determine that the request should be rejected.
You could work around it but should you? You're exposing the company to fines and risking your job.
Better think of another way to integrate with the vendor, or find another vendor.
P.S. SMTP is easy to identify on ANY port; it replies with a distinctive line of text when the TCP connection is opened.
> do not permit developers to spin up their own email systems without audit and regulatory retention
If they freak out over an SMTP server but don't freak out over a web server, then they are indeed absolutely utterly incompetent fools that should never work in this space.
In both cases code written by the company developers will eventually process untrusted textual input, and you need to deal with that with the same level of caution, and the protocol does nothing to change that.
> You could work around it but should you? You're exposing the company to fines and committing a fireable offense. Better find another product that's easier to deploy.
I would not work around it - I would make the case that there's no difference in exposing a carefully chosen SMTP server than exposing a web server, and if the security team fail to understand that, I'd resign, because it'd be a massive red flag, and I've been successful enough to be in a position to not need to work for companies like that.
For that matter, in 25 years in this business I've yet to run into your hypothetical scenario, including at large companies, so I'm not at all convinced it'd be a genuine problem. Yes, I've been at companies where I'd need to provide a justification for getting a port opened. But never once had an issue getting it approved - including SMTP.
> P.S. SMTP is trivially identifiable on ANY port, it's giving a line of text when the TCP connection is opened.
I was responding to "Security will never open the firewall for email ports.". Point being that if they care about the specific port numbers, it doesn't matter.
[And I'll again point out I've actually run infrastructure like this].
I'm speaking from real experience too. It takes a while to open a firewall in some environments, if you ever can.
One bank was the worst. There was a super stringent process to expose things externally. Opening the firewall port was just the beginning and that'd take 2-4 weeks if all goes well.
You'd struggle like hell to expose an SMTP server though, because it would immediately be rejected and flagged based on the port. Banks have to store, monitor and ensure the origin of all emails; they don't allow shadow email servers. And it's plain text, so more reasons to ban (also a problem with HTTP, you should do HTTPS if anything).
Defense was simpler, mainly because there was no external connectivity in many cases. You don't need to worry about how to open a firewall when there's none :D
At this point in my career (10 years in the game), let me simply defend node as the tool that got me here. Using it then to bootstrap my career was just as practical as using SMTP as you describe now.
I absolutely love your perspective. I feel the same way. s/Ruby+Rails/node for my situation. I believe there needs to be more respect paid to "bad" technologies. The measuring stick should include things outside pure benchmarks. Low barrier to entry technologies provide broad access and real life changes to folks that are able to pick them up and get hacking.
The article is almost entirely the answer to your question.
> there are risks when you go down.
Solved by SMTP at the protocol level. With HTTP, it must be solved at the application level on both the client and the server.
> webhooks are ephemeral. They are too easy to mishandle or lose.
SMTP has this baked into its heart. Losing messages is possible, certainly, but rather hard to do. With HTTP, it's really simple.
> In the lost art of long-polling, the client makes a standard HTTP request. If there is nothing new for the server to deliver to the client, the server holds the request open until there is new information to deliver.
SMTP is push, not polling. So all those issues are solved for you.
There are better options than SMTP. Basically any message-oriented middleware / message queuing service can provide this. It's great for both sides, maintenance/outages can happen independently, as long as the queue stays online and has space everything is fine.
E-Mail isn't trustworthy. You may get a confirmation that an initial SMTP server accepted a mail, but that's it. There's also no good way to detect that an endpoint (receiver address) is gone for good to stop sending messages.
You will probably point me to SMTP success messages, but a removed mailbox might only be known by a backend server.
Also mail infrastructure will potentially include heavy spam filters etc. making it quite inconvenient. Not even mentioning security aspects with limited availability of transport layer encryption with proper signatures.
What you're saying is true of public e-mail infrastructure, but that's beside the point. As a queuing solution internally in a system you can make it as resilient as you like with ease, because there's a huge ecosystem of resilient software you can use for it.
Same goes for security - your objection is true for public e-mail delivery without additional requirements on the servers or clients, but that is not relevant for a private infrastructure.
Running over the public internet does not mean you rely on unknown third party mail servers. If I address a message to foo@apiendpoint.mycustomer.com, only the servers configured to handle mail for apiendpoint.mycustomer.com and my sending server are involved in the exchange. And that is if you trust MX records for this exchange rather than have the customer input the address of the receiving SMTP server directly.
I think that would be a great solution for these types of scenarios.
In an enterprise setting it becomes more complex if a 365 subscription is required, or Active Directory authentication is needed to receive emails. Does someone need to monitor the inbox to confirm it's working, etc.?
But after you mentioned it, I do wish that this was an alternative to webhooks that more service providers offered.
We used to do this for domain name registrations and it worked fairly well for years. However once you've been added to a spam blacklist it quickly breaks down, especially for time critical operations such as domain name renewals when you're scrabbling around trying to appease the Spamhaus gods.
SMTP doesn't reliably deliver messages, implementations of it do. A webshit could easily create an SMTP server (with the help of a library written by someone with actual programming skills) that silently drops messages when any error occurs instead of implementing all that robustness.
That's not an applicable criticism for SMTP running on a private network and/or dedicated set of "mail" submitting servers, as in the specific model outlined in the grandparent comment.
This is a hill I find myself frequently fighting (and losing) on: webhooks are terrible to maintain, because they start from the premise "this never breaks", and that's about where development in an organization stops.
The only event API I ever want is notifications there's new data, and then an interface by which I can query all new data which has arrived by some sort of index marker - because this is fundamentally reliable. It means whatever happens to my system, I can reliably recover missed events, skipped events, or rebuild from previous events.
And this is in fact exactly how something like Kafka actually works! Complete with first-class support for compacting queues to produce valid "summarized" starting points.
Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather then what happens way too often which is just "oh yeah, we'll develop that when it breaks".
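A minimal sketch of that slow-path pull loop, assuming a hypothetical /events?after=<id> endpoint and user-supplied handle/cursor_store objects (the push layer then just becomes a hint to run an iteration immediately instead of waiting for the next tick):

    import json
    import time
    import urllib.parse
    import urllib.request

    API_BASE = "https://api.example.com"  # hypothetical

    def fetch_events(after_id: str) -> list:
        query = urllib.parse.urlencode({"after": after_id, "limit": 100})
        with urllib.request.urlopen(f"{API_BASE}/events?{query}") as resp:
            return json.load(resp)["data"]

    def run(cursor_store, handle) -> None:
        while True:
            events = fetch_events(cursor_store.load())  # cursor e.g. a row in your DB
            for event in events:
                handle(event)                    # idempotent business logic
                cursor_store.save(event["id"])   # advance only after success
            if not events:
                time.sleep(5)                    # nothing new; back off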
I agree. A simple ping with the latest ID (which is optional for you to use to get events from the last ID to the newest ID). Then go get the events, which likely reuses code. Polling is crap.
Extra points for being able to set something like 1s between pings (now you see why I like the optional ID for a range).
> Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather then what happens way too often which is just "oh yeah, we'll develop that when it breaks".
I think this is a quite interesting and important point. When we talk about "doing the simple thing first", too often we end up building something that is technically simple but fickle. The trick to making the simple thing reliable is to figure out which part is the slow path (or failure mode), and then only building that. Unfortunately, it often means our result ends up technically "boring" since all the interesting optimizations are what we cut out, but I think that's worth it if the end result is a more useful product.
It's something I've been working with and thinking about for a while. I think it applies to a way broader scope than this discussion.
(I worked on the same team as bkrausz, elsewhere on this thread, albeit not concurrently).
Yes, this is pretty much the right thing to do. It can be a bit more work for the API consumer, partly because they need to track the state of their last-read ID, and there are more moving parts.
If you're building a webhook+events system like Stripe's, you might consider adding an option for a mostly-empty webhook body, which can speed things up in this use case, but still allows "the easy way" of just processing the event from within the webhook body.
(For readers thinking of implementing this, note that "query for new data" means hitting a dedicated /events api, not individual tables, which might have unpleasant load/performance consequences).
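A sketch of that mostly-empty-body option from the consumer side: treat the webhook purely as a wake-up signal carrying an event ID, and fetch the authoritative data yourself. Flask and the /v1/events/<id> URL here are illustrative assumptions, not any particular provider's API:

    from flask import Flask, request
    import requests

    app = Flask(__name__)
    API_BASE = "https://api.example.com"  # hypothetical

    @app.post("/webhook")
    def webhook():
        event_id = request.get_json()["id"]
        # Don't trust the thin payload; fetch the full event by ID.
        event = requests.get(f"{API_BASE}/v1/events/{event_id}", timeout=10).json()
        process(event)
        return "", 200

    def process(event: dict) -> None:
        ...  # idempotent business logic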
My company has recently switched to Microsoft Teams, where unsupported integrations happen via webhooks. For example, if we wanted to be able to trigger builds in Jenkins or Gitlab, or acknowledge alerts via AlertManager, we'd have to set them up as webhooks to the appropriate service.
The problem is that all of those services are internal to our network, and aren't accessible from the outside world. We cannot set up a webhook to Jenkins because Jenkins does not have a publicly accessible URL. We cannot set up a webhook to Gitlab, or to Prometheus, or to Sentry, or anything else, because those are all internal services.
The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.
Alternately, we have that new, public-facing server buffer those requests and have other services poll them, somehow, so that it cannot connect in, but now we're getting into the same situation as described in the article.
If there were an API, I could easily create a small daemon that would watch for events and dispatch them accordingly, and then respond to them as needed; instead, my only option is to build some kind of Frankenstein - or to give up entirely, which is the more reasonable solution.
Then again, this is Microsoft Teams, where creating an application requires an Azure account and jumping through a ton of hoops, so they're no stranger to stupid ideas that no one wants to deal with.
My company's internal apps use a mix of VPNs and IP fenced load balancers. We are migrating to app proxy.
No inbound connections + access based on Azure AD identity with conditional access (restrict apps to Intune enabled corporate devices) and MFA is an absolute killer.
My only complaint is that connectors are not very DevOps friendly. Cloudflare Tunnel is much better in this area.
You might look into Cloudflare Tunnel (formerly Argo). It is free and allows you to poke a hole in your firewall to a specific service. If that meets your security requirements.
I don't believe Cloudflare Tunnel is free, the free tier pricing page [1] lists Argo Smart Routing at "Starting at $5 per month" ("Argo includes: Smart Routing, Tunnel, and Tiered Caching")
> The only option there would be to create a new, public-facing server
This is a problem with receiving any inbound data from a third party. At least with HTTP, it's pretty trivial to set up a robust reverse proxy with nginx.
I'm finishing a browser-based application platform where the installed applications expose an RPC API, so in the end all applications can call others in the same local (or remote) node(s).
The beauty of this is that you also can compose with other nodes and form a distributed service by calling the local service as a proxy and routing the requests to the other nodes of the same API.
It took more time than I'd predicted because it's also expected to deliver UI and most of the 'HTML5' API to native applications (instead of Javascript), which is a massive platform by now (and the #1 reason why newcomers to browser technology can't compete, given the feature-creep tax imposed on them).
The idea is also to distribute over a DHT so you can just serve your application over torrent without needing to register anything..
The only way to get there is by empowering users and developers and taking some of the control from the cloud platform giants.
In my point of view, the only way to break the browser monopoly now is to create a new path forward, a branch. It's not the time to follow the rules, it's time to break them, or else the future doesn't look so bright in my opinion.
> The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.
That's not the only solution -- you could also develop a bot that will do those specific things.
In the days of yore I know of at least three companies that were using IRC bots to similar effect long before webhooks ever existed.
Because of that prior experience, this is how I currently manage a similar set of problems, albeit not on Teams in my current role.
Really good point that corporate firewalls can trip you up. With Slack it was so much easier to call into their events API than receive an outgoing webhook for precisely this reason.
The downside was that the events API required a huge amount of scope, so if you weren’t careful and were compromised, someone could use that token to scrape all messages in the system.
We use this same long-polling based /events API interface for all official clients (web, mobile, terminal), our interactive bots ecosystem (https://zulip.com/api/running-bots), and many integrations (E.g. bridges with IRC/Matrix/etc.).
We also offer webhooks, because some platforms like Heroku or AWS Lambda make it much easier to accept incoming HTTP requests than to do long-polling, but the events system has always felt like a nicer programming model.
(Zulip's events system was inspired by separate ~2012 conversations I had with the Meteor and Quora founders about the best way to do live updates in a web application).
There are lots of reasons to want to immediately respond to an external event besides building an eventually consistent data syncing system. Polling an API endpoint works fine for the latter case, but not much else.
A good platform should offer both of these and more (for example Slack does webhooks, REST endpoint, websocket-based streaming and bulk exports), and let the client pick what they want based on their use case.
Long-polling is the way to immediately retrieve events. It's more efficient and lower latency than waiting for a sender to initiate a TCP and TLS handshake.
A persistent connection has a cost. Your statement may be true in some circumstances but definitely not all. Namely, for infrequent events it is much more efficient to be notified than to be asking nonstop. Sure, the latency is lowest if the connection is already established, but for efficiency the answer is not cut and dry but is rather a tradeoff decision based on the expected patterns.
Long-polling is usually configured to reset on both sides after a timeout preferred by the client side (/events?t=30), long before any network effects kick in, e.g. 10-30 seconds. A client then simply spams requests in a loop, backing off only on HTTP errors. If you have some crazy firewall in between, just set “t” appropriately.
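Roughly, the whole client fits in a loop like this (a minimal sketch assuming a `cursor` and `t` query parameter as above; the endpoint and field names are made up):

```python
import time
import requests  # third-party HTTP client, assumed available

EVENTS_URL = "https://api.example.com/events"  # hypothetical endpoint
cursor = None  # in practice, load the last-seen event id from durable storage


def handle(event):
    print("got event", event["id"])  # real processing goes here


while True:
    try:
        # Hold the request open for up to 30s; the server returns early if events exist.
        resp = requests.get(EVENTS_URL, params={"cursor": cursor, "t": 30}, timeout=35)
        resp.raise_for_status()
        for event in resp.json().get("events", []):
            handle(event)
            cursor = event["id"]  # persist this so a restart resumes where it left off
    except requests.RequestException:
        time.sleep(5)  # back off only on HTTP/network errors, then re-open the poll
```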
What’s the issue with that? This will be discovered as soon as the endpoint tries to send an event, right? At which point the client will see that the connection has been closed, reconnect, and receive the event.
No, the server will try to send an event, and the server will notice the connection has dropped. The client will still have no idea until some sort of timeout is reached, as the client will usually not be sending any data over the connection, as the connection's sole purpose is for the server to send events to the client.
A way to fix this is to use an application-level keepalive (TCP keepalives are generally useless), but then that increases the load on the server and adds a scaling burden.
Meanwhile, unless the event stream is stateful (more overhead!), the client has lost all events since the connection has dropped, and the client can't even be sure when the connection actually dropped.
With webhooks, assuming the callback sending service has a generous retry policy, and the customer's receiving service does not return 200 unless the webhook has been completely processed, or persisted to storage, you won't lose events.
I've been at Twilio for the past 10 years. We recently started offering an event stream service (that customers had been requesting for some time), but it's complicated to get right (on both the server and client side) and difficult to scale, and, frankly, webhooks have worked fine for most customers for a very long time.
> No, the server will try to send an event, and the server will notice the connection has dropped. The client will still have no idea until some sort of timeout is reached, as the client will usually not be sending any data over the connection, as the connection's sole purpose is for the server to send events to the client.
Exactly why mqtt has the ping packet for the client.
Yeah. A(n improperly configured) firewall is going to start dropping packets if it thinks a connection has been idle for too long, so the system never sees an RST and never learns that the connection's been terminated.
Because what usually happens is the connection is just forgotten from the NAT table. Both sides still see it as connected but the middle box will no longer forward any packets.
It doesn’t just “fall off” the NAT table. Some process in the firewall chose that entry in the NAT table to drop at that moment. It could use the entries from that NAT table to construct RST packets to both sides of the connection. This should be easy and obvious.
Are you sure? Specifically, are you sure a persistent connection has _more_ of a cost than repeatedly re-establishing a connection & TLS, etc.?
In terms of energy costs alone (DNS resolution, establishing routes, generating cryptographic session keys, etc.), repeated re-establishment is definitely not as cheap.
In terms of today's computation power, the "memory" costs of maintaining a connection are minuscule, and the performance "penalties" are negligible.
Example: let's say you have 50k event subscribers. If nothing happens, then, aside from a few TCP keepalives (which are not strictly speaking required, and can happen very infrequently), no traffic moves. If instead each subscriber polls once an hour, that's still ~13-14 connections a second, each one with at least 4 round trips of traffic. That's a measurable amount of load.
One nice benefit of long polling is the built in catch-up-after-a-break functionality: When the client initiates the poll, it tells the server the state it knows about (timestamp, sequence number, hash, whatever), and the server either replies right away if it's different, or waits and replies once it's different.
With webhooks, as in the article, you only get state changes; you need some separate mechanism to achieve (or recover) the initial state.
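And the server half of that ("reply right away if different, otherwise wait") stays small with an async framework. A rough sketch in Python/aiohttp, where the `since` cursor and the 30-second hold are my own assumptions rather than any particular vendor's API:

```python
import asyncio
from aiohttp import web  # assumed available; any async framework works similarly

state = {"version": 0, "data": {}}
changed = asyncio.Condition()


async def events(request):
    # Client tells us what it already has; answer immediately if we're ahead,
    # otherwise park the request until the state changes (or a 30s timeout).
    known = int(request.query.get("since", -1))
    async with changed:
        if state["version"] <= known:
            try:
                await asyncio.wait_for(changed.wait(), timeout=30)
            except asyncio.TimeoutError:
                pass  # client simply re-polls with the same cursor
    return web.json_response(state)


async def publish(new_data):
    async with changed:
        state["version"] += 1
        state["data"] = new_data
        changed.notify_all()  # wake every parked long-poll request


app = web.Application()
app.router.add_get("/events", events)
# web.run_app(app)
```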
That's true, although it's also true of any `/events` endpoint that doesn't go back to the beginning of time. Stripe's endpoint only goes back 30 days, so you still need to solve for the initial state unless you launched all of your desired functionality at the very beginning of your Stripe account!
Hopefully if it's a system like payments where you not only need to know state, you also need to know the time and nature of all transitions, there's a way to query all of that information.
I'm thinking of simpler situations like my source host's CI spinner that seems to get stuck all the time due to missing the ping back from Jenkins about build statuses. In that case it really would be fine to always just say "I think the state is X, please answer me now or in the future whenever the state is other than X." I don't care about anything other than an up to date sync.
Someone has to maintain an always-running listener for `/events`. If a server does that, and triggers client calls, we call that webhooks. If a client does that, and triggers internal functions, it's what the op describes. I think that for APIs, `/events` should indeed be the fundamental feature, and "webhooks" should be a nice-to-have service on top of `/events`, for those who don't want to maintain a local subscriber.
If the webhook events are coming at some sort of a brisk pace, the sender well may be able to reuse an already-open connection. And if they're rather infrequent, is the efficiency or latency likely to be a significant concern?
My understanding is that long polling is the thing that will reliably work at scale. Perhaps this changed in the past few years, but I’ve asked various companies like PubNub why they only use long polling and the answer was that there are too many incompatibilities out there in the wild for anything but that.
Server-Sent Events are very reliable. What you might be thinking of is the fact that you probably shouldn't rely just on server push. But that doesn't mean you should use long polling.
You should use normal short polling and Server-Sent Events.
Also it makes no sense to say long polling is more reliable than SSE, because SSE is essentially a non-hacky implementation of long polling.
WebSockets can cause issues, especially if you're not closing sockets properly or have too much activity on a small server. Livewire, for instance, accounts for this by just polling every 2 seconds for changes; this is much more performant than keeping 10,000 sockets open when people leave the page/app open but don't actually do anything.
Straight long-polling should be avoided, but intermittent polling is a good solution for performance when you don't want to use all your socket bandwidth.
> To mitigate both of these issues, many developers end up buffering webhooks onto a message bus system like Kafka, which feels like a cumbersome compromise.
Kafka solves exactly the issue that the author is complaining about. This is a safeguard to ensure that data isn't dropped in the event of an issue, and provides mechanisms to replay events.
The tradeoff between pushing and polling have been argued since forever.
In other news, mechanics who work with bolts often do so with ratchets. This is a cumbersome compromise, just give me Torx fasteners!
It would if the source was pushing into the Kafka stream directly. It doesn't solve the problem of going out of sync if my code to push to the Kafka stream is entirely down and I miss POSTs.
(And, of course, I don't want Kafka. I want Google PubSub. No, wait, I mean SQS. No, wait, I mean I want zeroMQ. No, I mean....)
The question is: who maintains the queue of events, and pays for it?
Certainly the event producer is in a better position to maintain a queue without missing events, but it also means they need to buffer more data in their queue system to accommodate for your receiver's downtime
Not disagreeing with your point, and I'm sure you already know this, I just wanted to point out (for the benefit of people that don't have other options) that it is possible to build "webhooks" in such a way that you're confident nothing is dropped and nothing goes (permanently) out of sync. (At least, AFAIK -- correct me if this sounds wrong!)
Conceptually, the important thing is each stage waits to "ACK" the message until it's durably persisted. And when the message is sent to the next stage, the previous stage _waits for an ACK_ before assuming the handoff was successful.
In the case that your application code is down, the other party should detect that ("Oh, my webhook request returned a 502") and handle it appropriately -- e.g. by pausing their webhook queue and retrying the message until it succeeds, or putting it on a dead-letter queue, etc. Your app will be "out of sync" until it comes back online and the retries succeed, but it will eventually end up "in sync."
Of course, the issue with this approach is most webhook providers... don't do that (IME). It seems like webhooks are often viewed as a "best-effort" thing, where they send the HTTP request and if it doesn't work, then whatever. I'd be inclined to agree that kind of "throw it over the fence" webhook is not great and risks permanent desync. But there are situations where an async messaging flow is the right decision and believe it or not, it can work! :)
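For what it's worth, the "durably persist, then ACK" half on the receiving end is tiny. A sketch with Flask and sqlite, where the `id` field stands in for whatever idempotency key the sender includes:

```python
import sqlite3
from flask import Flask, request  # assumed available; any framework works

app = Flask(__name__)
db = sqlite3.connect("webhooks.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS inbox"
           " (id TEXT PRIMARY KEY, body TEXT, processed INTEGER DEFAULT 0)")
db.commit()


@app.route("/webhook", methods=["POST"])
def receive():
    event = request.get_json()
    # Durably persist first, only then ACK. If the insert fails the sender sees
    # a 5xx and retries, so nothing is silently dropped; duplicates are ignored.
    db.execute("INSERT OR IGNORE INTO inbox (id, body) VALUES (?, ?)",
               (event["id"], request.get_data(as_text=True)))
    db.commit()
    return "", 200  # a separate worker drains the inbox and marks rows processed
```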
This misses the problem explained in the article, which is that there are scenarios where events are "acked" but things still go wrong because of bugs.
For example, you rolled out code on the receiver side that did the wrong thing with each message. Now there's no way to replay the old webhook events in order to reinstate the right behaviour; there's no way to ask the producer to send them again.
The only way around this is to store a record of every received message on the receiver side, too, which the article author thinks is an unnecessary burden compared to polling.
Personally, I think push is an antipattern in situations where data needs to be kept in sync. The state about where the consumer is in the stream should be kept at the consumer side precisely so it can go back and forth.
If you want to be 100% sure that you get all the webhooks, the sender could implement an incrementing "webhook ID". If the receiver knows the last webhook ID was 53 and the sender sends one for 55, you can tell one has been dropped. There are some other concerns around that like if 54 has been sent but they arrived out of order, or if they arrive almost simultaneously. Nothing that isn't solvable afaict though.
Of course, then you need a way for the receiver to retrigger or view the webhook if one gets missed, which starts to look like you have to have a polling endpoint anyways, though.
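On the receiving end that looks roughly like this (a sketch; the `seq` field and the backfill endpoint are assumptions about what the sender would need to provide):

```python
import requests

EVENTS_URL = "https://api.example.com/events"  # hypothetical backfill endpoint
last_seen = 53  # persisted on the receiver


def process(event):
    ...  # idempotent handling, so a late-arriving duplicate is harmless


def on_webhook(event):
    global last_seen
    if event["seq"] > last_seen + 1:
        # Gap detected: pull whatever we missed before handling the new event.
        missing = requests.get(EVENTS_URL,
                               params={"after": last_seen, "before": event["seq"]}
                               ).json()["events"]
        for m in sorted(missing, key=lambda e: e["seq"]):
            process(m)
    process(event)
    last_seen = max(last_seen, event["seq"])
```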
We have a system that pushes loads of messages (as in thousands a minute), and one consumer insists on us pushing the messages to their HTTP backend.
Their system is down every once in a while for quite some time.
We're using an async queueing solution, but you can't keep those messages forever.
We sometimes have millions of messages for them in their queues, which take up space...
If all of our consumers had those problems we would have to buy loads of storage..
We're simply dropping messages older than X, and have an endpoint that they can call to retrieve the 'latest state of things'.
This way when they come back from a failure, they simply get the latest state, and then continue with updates from our end..
It's far from perfect, but it works really well.
I know the goal for most systems is just to be 'up to date', not to get the entire history.
So in most cases you don't need to stash all the messages; you just need to be able to retrieve the latest state of stuff...
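The consumer-side recovery is then roughly (a sketch; the endpoint is a placeholder for whatever 'latest state of things' URL we actually expose):

```python
import requests

STATE_URL = "https://producer.example.com/state"  # placeholder for the 'latest state' endpoint


def recover_after_outage(apply_snapshot):
    # Don't replay the (already truncated) backlog: grab the current state,
    # overwrite the local copy, then go back to consuming new messages as usual.
    snapshot = requests.get(STATE_URL).json()
    apply_snapshot(snapshot)
```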
> "Of course, the issue with this approach is most webhook providers... don't do that "
Embedded systems don't do that for webhooks because they can't (very little RAM or non-volatile storage) but customers clamor for webhooks anyway because it's what their web developers know how to use. So inevitably they're going to lose data but they're only getting what they asked for.
As long as you guarantee delivery to your message queue before acknowledging receipt, you should be golden.
Also, swapping out one messaging system for another is trivial. Pick the one best suited to the environment you're working in, and if that environment changes, changing messaging queues is going to be one of the easiest transitions you'll make.
Yeah I was scratching my head reading this article; they're bending so far backwards to avoid the obvious solution that I thought they were gearing up to pitch some competing tech.
> If the sender's queue starts to experience back-pressure, webhook events will be delayed, and it may be very difficult for you to know that this slippage is occurring
I've never before seen anyone try to argue that properly dealing with backpressure is a bad thing. The author's proposed model makes this situation even worse. With kafka, consumers can continue processing the event stream and you can continue to serve reads from your primary datastore. With the author's model the event stream lives in your primary datastore, so if that starts to lock up the blast radius is much larger.
Are you going to expose your Kafka brokers directly to your integration partners? Are they going to use the Kafka client library and wire protocol to send you data? That’s the thing about webhooks, HTTP is universal and if you’re comfortable exposing anything externally, it’s going to be a web service.
That's a pretty straight-forward design that's widely used, robust, and easy to put together. I've probably done that same workflow 100s of times without issue.
As long as you guarantee the message was pushed to the queue before acknowledging, that will be fabulously reliable. You need to make contingencies for duplicate messages, but that's not usually difficult.
It's a common writing style as of late: set down a premise and solve that premise decisively.
Now, if that premise isn't based in reality, or if it's already been solved some other way, discredit it without giving it too much air time.
A one-liner about Kafka being cumbersome and then building your own solution, warts and all, doesn't need to exist in the same thought if you've made the reader mentally disregard Kafka as a possible solution.
Totally, things can get very reliable if you start processing webhooks asynchronously. Personally I've found it pretty cumbersome and complicated to build the necessary infrastructure in the past. I've been building https://hookdeck.com as a simpler alternative specifically for ingesting incoming webhooks.
Are events and webhooks mutually exclusive? How about a combination of both: events for consuming at leisure, webhooks for notification of new events. This allows instant notification of new events but allows for the benefits outlined in the article.
What about supporting fast lookup of the event endpoint, so it can be queried more frequently?
I think that a combo of webhooks / events is nice, but "what scope do we cut?" is an important question. Unfortunately, it feels like the events part is cut, when I'd argue that events is significantly more important.
Webhooks are flashier from a PM perspective because they are perceived as more real-time, but polling is just as good in practice.
Polling is also completely in your control, you will get an event within X seconds of it going live. That isn't true for webhooks, where a vendor may have delays on their outbound pipeline.
Yea, you're right. I am reading the advocacy as "if you need real-time, then support long-polling."
I see the value in this, but I actually disagree with the article in terms of that being the best solution. Long-polling is significantly different than polling with a cursor offset and returning data, so you wouldn't shoe-horn that into an existing endpoint.
Couldn't keeping a request open indefinitely open the system up to the potential of DoS attacks though? Correct me if I'm wrong, but isn't it kind of expensive to keep HTTP requests open for an indeterminate amount of time, especially if the system in question is servicing many of these requests concurrently?
I think that's what the author was getting at, after reading through the whole article. The idea isn't to get rid of webhooks, but provide an endpoint that can be used when webhooks won't necessarily work.
Very similar to how I built my previous application.
1) /events for the source of truth (I.e. cursor-based logs)
2) websockets for "nice to have" real-time updates as a way to hint the clients to refetch what's new
Yeah... I'd go so far as to argue that this is the only architecture that should even ever be considered, as only having one half of the solution is clearly wrong.
This is the way to go and I'd love to see more APIs with a robust events endpoint for polling & reconciliation. Deletes are especially hard to reconcile with many APIs since they aren't queryable and you need to check, instance by instance, whether every ID still exists. Shopify I'm looking at you.
Yes to the combination of both. I worked on architecture and was responsible for large-scale systems at Google. Reliable giant-scale systems do both event subscription and polling, often at the same time, with idempotency guarantees.
Sorry if I'm daft, could you/someone explain why one would want to use both at the same time for the same system?
One thing that makes sense: if you go down use polling so you can work at your own pace. But this isn't really at the same time. When/why does it make sense to do both simultaneously?
There is an inherent speed / reliability tradeoff that is extremely difficult to solve inside one message bus. When you get to truly large systems with a lot of nines of reliability, it starts to make sense to use two systems:
1. Fast system that delivers messages very quickly but is not always partition-tolerant or available
2. Slower, partition tolerant system with high availability but also higher latency (i.e. a database)
The author goes through this in the very first section. Webhook events will eventually start getting lost often enough for the developer to think about a backup mechanism.
Long-polling works if you have a lot of memory on your database frontend. Most shared databases want none of your long-running requests to occupy their memory which is better used for caches.
Even if your message bus has the ability to store and re-deliver events, you might want to limit this ability (by assigning a low TTL). Consider that the consumer microservice enters and recovers from an outage. In the meantime, the producer's events will accumulate in the message service. At the same time, the consumer often doesn't need to consume each individual event but rather some "end state" of some entity or a document. If all lost events were to get re-delivered, the consumers wouldn't be able to handle them, and would enter an outage again. This is where deliberately decreasing the reliability of the message bus and relying on polling would automatically recover the service.
There are other reasons, of course. The author is absolutely correct in their statement, though: whenever a system is implemented using hooks / messages, its developers always end up supplementing it with polling.
> Webhooks allows for zero resource usage until a message needs to be delivered.
Doesn't that only work in the case where the server treats each webhook delivery as ephemeral? If you're keeping a queue to allow reliable / repeatable delivery, that's definitely not "zero resource usage", right?
I don't think the original comment meant long polling (i.e. keeping the connection alive), they meant periodically call the endpoint to check for events.
There's a much better approach than /events or webhooks: add synchronization directly into HTTP itself.
The underlying problem is that HTTP is a state transfer protocol, not a state synchronization protocol. HTTP knows how to transfer state between client and server once, but doesn't know how to update the client when the state changes.
When you add a /events resource, or a webhooks system, you're trying to bolt state synchronization onto a state transfer protocol, and you get a network-layer mismatch. You end up with the equivalent of HTTP request/response objects inside of existing HTTP request/responses, like you see in /events! You end up sending "DELETE" messages within a GET to an /events resource. This breaks REST.
A much better approach is to just fix HTTP, and teach it how to synchronize! We're doing that in the Braid project (https://braid.org) and I encourage anyone in this space to consider this approach. It ends up being much simpler to implement, more general, and more powerful.
You may send POST /events instead. It also breaks “REST”, which is just a sort of obsession rather than a requirement here, but more importantly it wouldn’t break idempotence and proxy caching that GET implies.
Edit: from the network point of view, it’s either call-back or a persistent call-wait/socket, or polling. The exact protocol is irrelevant, because it’s networking limits and efficiency that prevent everyone from having a persistent connection to everyone. A persistent connection can’t be much better than any other persistent connection in that regard, and what happens inside is unrelated story. Or am I missing something?
...and you can get these features for free using off-the-shelf polyfill libraries. If you're in Javascript, try braidify: https://www.npmjs.com/package/braidify
For developers/engineers who have never seen it, SSE is a nice way to get going with streaming (slightly different from long polling, it's server-push) easily -- there's no need to jump to tools like websocket/gRPC streaming if you don't need to:
> One idea for Stripe and other API platforms: support long-polling!
It’s great that we’ve gone full circle. But make no mistake that this only means one thing: that servers are cheaper than ever. We can now afford to entertain previously extravagant ideas.
Long polling is a lot easier to support now than it was a few years ago, thanks to the wide availability of async server frameworks - Node.js, Python ASGI etc - which make supporting thousands of simultaneous long-polling connections with a single server much less expensive.
But long-polling doesn't scale in certain situations where you can have an effectively unlimited number of webhook endpoints.
I'm sure many people reading this have many idle GitHub repos set up with webhooks into some kind of build server. The repo might see no more than one commit a week.
It makes absolutely no sense for GitHub to long-poll (or websocket) all of these build servers.
(Now what would make sense is for /events to support a way to flip over to a webhook when it's idle. IE, long-poll for a minute, then the next request sends a URL for a 1-time webhook called on the next event.)
I've had a couple of issues implementing long polling in the past. At times I've had the firewall or reverse proxy drop client connections if it detects no data transfer, which meant I had to time out all the requests every 30 seconds or so, make a new request before the other one ends, and it all becomes messy.
At this point, it's honestly just easier to have a websocket + events endpoint, both with a cursor.
I wouldn't recommend long polling for this reason. There are also some security products that have trouble with what appears to be a long file download.
Websockets or server sent events at least signal to intermediaries that a longer term connection will be open.
> It’s great that we’ve gone full circle. But make no mistake that this only means one thing: that servers are cheaper than ever. We can now afford to entertain previously extravagant ideas.
We've not come full circle. This is just one blog saying "how about long-polling", while also ignoring that since then we've gained WebSockets, HTTP/2, and HTTP/3, each of which makes long-polling pointless in its own way.
Also, the OP seems to ignore all the issues with webhooks that long-polling suffers from too, or is worse at. E.g.: you'll lose all those TCP connections if your service is down -- one of the main complaints about webhooks...
Which, if the parts discussed before long-polling was mentioned were implemented, wouldn't be an issue, since you'd be starting from the first event following your last-received cursor anyway.
I agree that webhooks have their problems, but, as somebody maintaining software that uses long-polling heavily, I will say that long-polling is quite difficult to handle reliably "over the internet".
ISPs have very different views on how long a connection is allowed to be kept open, and will absolutely kill long-polling connections without remorse. This extends beyond ISPs, too. Practically all network infrastructure between API consumer and API will have to accommodate long TCP connections, which is unfortunately not as trivial as it sounds.
The next cool feature of long-polling is that the server might not know that the connection is broken until it actually tries writing to it, so if the back-end relies on having or not having a connection, this will make for some interesting edge-cases.
Load-testing products using long-polling also has interesting implications, such as hitting the "65k" problem on test clients and network infrastructure.
Add another layer on top (or below, like Kubernetes, OpenShift, whatever) and you should be getting a prescription for anti-depressants from the very beginning, because you WILL need them.
> Add another layer on top (or below, like Kubernetes, OpenShift, whatever) and you should be getting a prescription for anti-depressants from the very beginning, because you WILL need them.
I guess this is the current zeitgeist's version of "stock up on $PAINKILLER" or even "keep a bottle of $LIQUOR hidden in the bottom drawer of your desk".
This article is premised on the incorrect strawman that webhooks are complicated because the consumer has an extra persisted message bus.
But if the producer retries and the consumer does not respond with 200 until it has processed the message, no consumer side message queue is needed, the consumer can rely on the producer to reach at-least-once delivery.
In both cases (webhooks and /events) storage is needed on the producer side, so nothing consequential has changed, only with /events you need long-lived TCP connections, which tie up resources (e.g. you can't do this on FaaS endpoints).
Functions as a service are absolutely ideal for low frequency webhook receivers. SO SO SO cheap.
> But if the producer retries and the consumer does not respond with 200 until it has processed the message, no consumer side message queue is needed, the consumer can rely on the producer to reach at-least-once delivery.
Part of the point of the article was that you may deploy bad code which returns 200, but doesn't actually take the correct action with the events, and then you have lost all that data and have no way to get it back, which is why you have a consumer-side message bus to hold the webhook history, so that you can replay the webhooks if you made a mistake. Your comment does not address this at all.
If the service exposes a /events page, and especially one that supports long polling (or SSE), then you no longer need a consumer-side message bus, and you might not even need webhooks at all.
I definitely think webhooks should be offered, but I agree with the article that webhooks shouldn't be the only thing.
The article specifically exemplifies the case of the web hook having faulty code (introducing nulls) while still not failing with errors. In this case if you don’t have storage on the consumer side, you have to ask for the producers to be merciful gods.
Under the assumption of no storage on the consumption side given by parent commenter, there's no way to replay. Anyway, I was not making the case for long-polling.
I partially agree, but the issue with relying on producer delivery is that you effectively give up control over the retry logic. If it doesn't fit your use case, too bad for you. While ideally every platform would provide that configuration, I think it's unreasonable to expect all platforms to offer excellent webhook tools & configuration. You're better off taking things into your own hands.
We have (almost) the opposite problem, webhooks are too synchronous for what we need to ship to people. We're experimenting with giving people a NATS endpoint to listen for logs and other events: https://community.fly.io/t/fly-logs-over-nats/1540
Having an ephemeral messaging system and a ledger to reconcile against is a nice, simple way to provide immediacy and eventual consistency (where eventual could be days). It's a pattern we're using all throughout our infra.
FWIW, this is what CouchDB has done from Day 1 and it _always_ seemed like one of the most magical and surprising things tool or platform providers would suddenly realize and then LOVE about using the DB.
There was nothing fancy about it, you could just listen to an endpoint and it was a stream of the append-only log of events occurring in the DB, to the point that you could literally use it to feed a replicated master or slave (or backup).
I imagine your use-case is a bit more nuanced, but I sure do love that model.
We have this problem too. When we send a request to create an entity in an external system, before we get the response back with the entity Id, we already have received a webhook saying said entity was created. Makes a basically trivial workflow quite confusing.
> If you need strict consistency guarantees, sure. Otherwise, don't piss off your API consumers, webhooks work just fine.
Ahh, but all problems are queues. What happens if your webhook destination is down and your source system sending queue backs up? What happens if your destination endpoint is up, providing 200s to requests, but throwing away the data quietly due to a mistake during a rollout? Or your webhook source quickly ramps to a volume that is effectively a DoS attack? (These are all problems I encountered in a role at a low code/no code product).
I strongly endorse others in this thread who indicate that long-polling events is a suitable pattern. You want the best of both worlds: data durability and consistency through polling, but also as close to real-time event firing as possible.
Of course, some APIs you have no control over, and are stuck building robust, chatty polling infra to support because your (paying) users demand access to those APIs. Such is the schlep.
Why do I need two different ways of retrieving events? Wouldn't the data also be available via the usual rest api? If events fail on the consumer end it's their responsibility to resync. Presumably code was written for an initial sync anyway no?
> Building polling infrastructure is substantially more complex
That really depends on the setup you're using. There are plenty of server platforms out there where setting up a cron is no more complicated than making a GET route handler.
I'm looking at it from the perspective of a web dev. If you're already building a web application, it is almost always more work to do the polling setup.
In Python, it's either a cron script or Celery beat. In PHP, usually a cron script.
That all means more processes running outside of your web framework, more stuff to deal with, more complexity. Now you have to manage e.g. running a celery process, managing crontabs wherever you deploy...
Even in languages like Elixir where long-running scheduled processes are cheap and native, you still have to write a GenServer in comparison to just using your web framework.
I'm fine with the approach in the article. I actually like that they're giving the option for multiple ways of getting those events. Just please, please don't make /events the only option...
I actually have the opposite opinion. I'm an experienced web developer. Setting up polling is easier for me than setting up an HTTP endpoint.
I typically use a language with an event loop (async/await), so there is always something like `setInterval(poll, 500)`. All that code needs is a connection to the internet. If the server is down for 24 hours I can start it up and it'll read the missed events. I can set up 10 dev environments with just an API key difference in the config. I can batch apply events in a single database transaction, ensuring consistency.
But with a webhook, I need to ensure that my server can accept incoming connections, the external API knows that location. Each dev env needs to replicate this public HTTP server set up. I need to monitor the uptime of the server closely as missing events or erroring on a subset of events could leave my database in an inconsistent state.
I like having a one-off command-line app triggered by a webhook. If you miss an event, invoke the app manually. If you pass it the same event twice it won't matter (idempotency!)
A pattern that I like to use in toy projects is to have an events endpoint and just send the newest event cursor through a notification channel (e.g. webhooks) when a new event occurs. You get new event quickly, you don't have to keep connections open for long-polling, and you can keep polling with a low period in case the event channel stops working for some reason.
Webhooks are great for producers of events, and I'd argue that it's too cumbersome for them to provide an '/events' endpoint primarily because of scaling. With webhooks, they can offload events at their own pace.
For consumers, I agree with most here that Kafka is certainly overkill. We've gotten away with a very simple architecture to have reliable event consumption. We point all webhooks to an (AWS) API Gateway backed by Lambdas. The Lambdas push the events to an SQS queue (FIFO-queue, if it needs some sort of sequence), and we take our time consuming the events through a very generic poll.
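For anyone curious, that Lambda is roughly the following (a sketch; the queue URL and field names are placeholders, and it assumes an API Gateway proxy integration in front of a FIFO queue):

```python
import json
import boto3  # AWS SDK, assumed available in the Lambda runtime

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/webhooks.fifo"  # placeholder


def handler(event, context):
    # API Gateway proxy integration: the raw webhook body arrives in event["body"].
    body = event["body"]
    payload = json.loads(body)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=body,
        MessageGroupId=payload.get("object_id", "default"),  # preserves per-object ordering
        MessageDeduplicationId=payload["id"],                 # sender's event id, for dedup
    )
    # ACK immediately; a separate consumer drains the queue at its own pace.
    return {"statusCode": 200, "body": ""}
```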
> Webhooks are great for producers of events, and I'd argue that it's too cumbersome for them to provide an '/events' endpoint primarily because of scaling. With webhooks, they can offload events at their own pace.
TBF they could do something similar with `/events`: instead of pushing events to a webhook-sending queue, just push them to the events buffer, which could even be a circular buffer, just to point out that the essay is completely wrong. TFA is not asking for /events, they're asking for a very specific kind of /events with a large non-drained buffer, something which would only ever work for a low number of events: $dayjob's GitHub integration takes in several events per second.
A proper event stream would be nice though; GitHub's webhook delivery system is not exactly reliable.
> With webhooks, they can offload events at their own pace.
They can't offload webhooks at their own pace if the two parties want reliable delivery. The server providing the webhook might be experiencing a prolonged outage, in which case the sender needs the ability to buffer the events anyway.
YES PLEASE. Give us a GET on /events with support for hanging GETs.
A hanging GET is where you get chunked encoding (which is the only option in HTTP/2 anyways) and a possibly never-ending stream. I've implemented a (proprietary, for now) "tail -f" over HTTP that does this when Range: bytes=0- is requested (i.e., end offset not specified), completing the transfer (i.e., final, empty chunk sent) only whenever the file is removed or renamed away.
You'll want to add a heartbeat to any hanging GET /events.
My little tail-f-over-http server is for plain files, but this scheme works for anything. Of course, if you can output an indefinite stream (e.g., PG NOTIFYs, build logs, etc.) you can redirect that to a file and then serve up that file.
Why wouldn't service providers want to offer long polling? It seems a lot easier to build (no webhook registration backend) and no retry logic/SLAs outside of your control. SSE seems so much simpler.
Quite the opposite. HTTP servers and clients are essentially a solved problem. Massive scale-out, load balancing, retries, authentication, authorization, rolling deploys etc. can all be done out of the box by a hundred different providers. Anything to do with maintaining a large number of open TCP connections is still a massive pain on the server side.
Yep. Consider: you can build a damn reliable & resilient webhook handler out of a few lines of PHP or Lua, ready to accept a fairly heavy load, with zero dependencies and only default-available packages for most any distro or BSD (anything where nginx or Apache2 with standard modules is available by default), and without tweaking the config at all. You can be live in hours, or even inside a single hour if you want to cowboy it up pretty hard, and despite not taking a lot of care, the webhook-handling part of your service probably won't get you woken up at night with an everything's-on-fire support call (what it does with the data might, of course). Logging? Trivial and standard. Service management? It's the OS' default service definition for a web server daemon, and that's it. Config and deployment? So tiny it'd be nearly no work to document it in a run-once shell script, if you don't have anything fancier at hand. Operationally, it doesn't get much simpler. "Is it working?" checks? You can test it with curl, from any address that's able & allowed to talk to it.
With long-polling, now you're managing a custom daemon, basically. That's a big step down in reliability-by-default, and a bunch more work to do it right.
In either case, you'll be looking at more work if you want to check any kind of log on the other end for missed messages, but that looks pretty similar for either, and not all systems need that level of accuracy (and if they do, they probably need even more and this whole thing is Doing It Wrong)
Keeping those connections open probably isn’t super cheap and complicates deploying rolling updates - you’d need to kill all connections when you update and that would require some sort of RPC (so that you only kill those existing connections after they’re done sending in-flight data).
With polling you have to make sure you finish sending data for the current in-flight event. If you don’t then the client doesn’t receive that event, unless you also send them events from the past 30 seconds on first poll.
One of the nice things about a pull-based model (polling) is that you are in control of throughput. Need to process events more slowly? Increase the polling interval. It’s impossible to achieve the same thing with push-based (webhooks), you are at the mercy of the producer’s rate of webhook delivery. I had this issue a few years ago with a queue-as-a-service that sent jobs via webhook - the queue would intermittently drain extremely quickly, sending thousands of requests per second, which totally overwhelmed our poor single Heroku dyno.
One big issue with the pull-based model though is with concurrency. If you have multiple workers polling an API endpoint for new data, you need to synchronise the ‘last seen’ ID or timestamp across all workers. Otherwise, worker A and worker B might pull the same data and you could end up with duplicates. There’s no silver bullet here, either model requires work to harden against edge cases.
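One way to handle the 'last seen' synchronisation is to let a shared datastore arbitrate it: a single row that a worker locks for the duration of its poll. A sketch with sqlite standing in for whatever shared store you actually run:

```python
import sqlite3

db = sqlite3.connect("cursor.db", isolation_level=None)  # autocommit; we manage transactions
db.execute("CREATE TABLE IF NOT EXISTS sync (name TEXT PRIMARY KEY, last_seen INTEGER)")
db.execute("INSERT OR IGNORE INTO sync VALUES ('events', 0)")


def claim_and_poll(fetch_page):
    # Take an exclusive lock so two workers can never poll from the same cursor.
    db.execute("BEGIN IMMEDIATE")
    try:
        (last_seen,) = db.execute("SELECT last_seen FROM sync WHERE name='events'").fetchone()
        events = fetch_page(after=last_seen)  # the caller's actual API request
        if events:
            db.execute("UPDATE sync SET last_seen=? WHERE name='events'",
                       (max(e["id"] for e in events),))
        db.execute("COMMIT")
        return events
    except Exception:
        db.execute("ROLLBACK")
        raise
```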
I'm building integrations with various marketplaces at a company I work at (fulfilled by seller kind of deal), and I can confirm, it's much easier for us to schedule an HTTP request once in 15 minutes than to create a custom HTTP service responding to specific requests from 3rd parties.
We're in the business of selling things, we're not in the business of building HTTP services. Our ERP-like thing is down for maintenance from 10pm to 11pm, so we can't use it as a platform for responding to webhooks.
I'll hack something together in a pinch when it's the only way to get orders from a marketplace service, but then eventually I'll have to explain to my colleagues how to linux, how to HTTPS, how to python, how to WSGI, and all other stuff our company typically doesn't do, but has to do now, because this particular marketplace wants to POST orders to us.
You might want to check out https://hookdeck.com (I work on it). We built it precisely for this use case, you shouldn't have to spend a bunch of time building webhook ingestion infrastructure.
By polling an endpoint provided by the marketplace with parameters "since" and "to" to filter events by the time they happened. We typically set "since" to two days ago and "to" to tomorrow. We're not in a hurry, we have an hour or two of leeway between receiving a message and having to act on it. I certainly prefer the polling solution for that use case. Easier to set up, easier to debug, easier to notice that something is wrong.
Maybe there's an opportunity here for some kind of buffering service that would receive webhooks and present them as a stream of events. Or maybe something like this already exists?
That's more or less what I implemented with a bit of python and sqlite. It works, but it's another piece of infrastructure to care about in a shop full of people who never had to care about that kind of infrastructure. For example we (well, I, really) forgot to configure certbot to restart nginx after renewing a cert, and only noticed that after a marketplace notified us that they're temporarily pulling off our SKUs due to our HTTP service being misconfigured.
Can this be a 3rd party service? It certainly can be, but it's hard to make a generic one for any kind of webhook. Some marketplaces expect a dynamic response, like replying with the order number we assigned internally (I typically just echo back the number they gave us with some prefix, but it's still more smarts than just replying with an empty 200 OK).
And I've seen services which aggregate popular local marketplaces into an API which is easier to work with, but they require us to concede some other parts of the business we'd rather keep in-house, like assortment and inventory management.
I recently had to work with an API to integrate a card reader terminal. The system was intended to work with an on-premise POS, so it was oriented around events sent back and forth over a websocket. Only one was allowed per restaurant.
In the end, I had to build a distributed cluster that managed a pool of processes that each spun up a websocket connection and relayed it to our pubsub system. Luckily I built our system in Elixir, so it was pretty easy to spin up Horde, but this would have been extremely difficult in any other language.
By contrast, a webhook is easy to scale from ANY language. Just make an HTTP endpoint. Those are way easier to scale than a stateful process.
Not being able to replay a webhook event is an issue with the publisher of the webhook, not a consumption issue.
The Stripe `/events` endpoint is kind of like a Stripe-internal webhook with a cache of 30 days.
One issue is that webhooks and HTTP clients can be pinned to a version, but the events listed at `/events` are whatever the Stripe account default version was at the time of the event creation.
So any of your code that polls `/events` needs to ensure it can handle many different versions.
Another issue is that child lists are limited to 10 items, which means that you need to do a direct download to get list items > 10. This means the event list is lossy as items > 10 are never contained in the event stream.
Stripe feature request:
- An option to include all list items.
- `/events` that can be version-pinned / contains events with the same API version.
Makes me think of how the OpenStack project watches for events from Gerrit. They open an SSH connection to the Gerrit server with the stream-events command. It then stays connected to the Gerrit server, and all the events that occur show up in the stream so the program can take actions based on the events it sees.
Gerrit stream-events doesn't solve the issue if the connection is dropped and events occur while disconnected.
I personally (for my hobbyist use case) would prefer the article's /events system over webhooks, as webhooks require you to have a system that is available on the Internet to receive the webhook, whereas this /events system would not require that.
I think with the right retry strategy, webhooks are great. On my email forwarding app https://hanami.run we offer both methods to give users access to their email:
1. webhook with retry up to 7 days.
2. a REST api to fetch all data
Sending webhooks out properly requires lots of effort, especially idempotency keys to avoid duplicate data, and concurrency control to avoid swarming the webhook endpoint.
So at the end of the day, both require the same amount of resources, either on the sender side or the receiver side.
One of the complexities of the polling approach on the consumer side is having a long-running poller. This is trivial to do in Java apps - start a polling thread - but not so straightforward in the case of PHP apps, for example. In that case you'd have to set up a cron job or a separate polling script under some sort of process supervisor like systemd to poll periodically/continuously.
I wonder if the two approaches could be combined to simplify things for consumer apps at the cost of slightly more complexity on the producer side? Instead of POSTing the actual event data to the webhook, the producer just uses the consumer's webhook to "poke" it - to tell the consumer app "hey, you have new events waiting for you". On receiving the poke, the consumer endpoint handler/PHP script can just turn around and do a GET to "/events" with an "anything > last downloaded event id" query. That way you don't have to support long polling on the producer's servers, and it's not a big problem if the consumer misses a couple of webhook "pokes". The next time it does receive a webhook "poke" successfully, it will download all the events and be all caught up. If real-time notifications are not strictly required, the producer side can even run the webhook dispatching code on a scheduled basis to coalesce multiple events into a single "poke" to a consumer, to be more efficient if desired.
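The consumer half of that "poke" idea is pleasantly small. A sketch (endpoint names and the `after` parameter are made up; the poke body is deliberately ignored):

```python
import sqlite3
import requests
from flask import Flask

app = Flask(__name__)
db = sqlite3.connect("sync.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS sync (k TEXT PRIMARY KEY, last_id INTEGER)")
db.execute("INSERT OR IGNORE INTO sync VALUES ('events', 0)")

EVENTS_URL = "https://producer.example.com/events"  # hypothetical


def handle(event):
    ...  # real processing


@app.route("/poke", methods=["POST"])
def poke():
    # The webhook body doesn't matter; it's only a hint that new events exist.
    (last_id,) = db.execute("SELECT last_id FROM sync WHERE k='events'").fetchone()
    for e in requests.get(EVENTS_URL, params={"after": last_id}).json()["events"]:
        handle(e)
        db.execute("UPDATE sync SET last_id=? WHERE k='events'", (e["id"],))
        db.commit()
    return "", 200  # a missed poke is fine; the next one catches us up
```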
This is just my opinion of course, but I've consumed a great number of APIs and I love it when there is one. You don't even need /events as long as there's a good index endpoint for the object in question.
For a great number of applications it's not that big of a deal to miss a webhook, and the extreme simplicity that it gives the developer is worth a great deal. With how enormously complex a lot of systems have gotten, I really favor simplicity whenever possible.
Long polling is not good for serverless consumers. Webhooks are great for compute on demand, so we should work towards that (it's cheaper because it's more efficient).
You do not need storage in the producer AND in the consumer; you just need a queue in the producer. Yes, even that is annoying, but the suggested architecture will still lose data if the long poller is down unless there is storage in the producer... so nothing significant has really been solved.
The article is interesting but misleading. What the article really says is that if on average you receive a webhook every second or faster, it is better to poll an /events endpoint every second.
Well, the best advantage of webhooks over polling is that you receive events straightaway no matter the volume of events. If you already know the volume and the volume is high, of course polling is going to be better for everyone.
It seems to me that you could build a protocol where normal syncs happen through webhooks where each webhook event refers to the event id of an immediately previous event. If the system receiving notifications doesn’t have that event, it makes an API request for all events between the latest event it has and the one it just received.
That is pretty much what TCP does, except that it doesn't "request" missing packets: the client just acknowledges the ID of the last packet it was happy with.
It requires a sequence though, so that the client can know the packet it just received isn't the one it expected, but you could build that with a chain of IDs like you're proposing.
If the system is moving fast, this is still somewhat complicated to implement robustly, because by the time the "catchup" request to t0 returns, more time has passed and more events have happened, so you still can't resume consuming the webhooks.
To be correct with such a system you have to be prepared to queue the incoming webhook events, do the catchup query, then replay the queued events.
Yeah, if the events can be handled idempotently and this handling properly accounts for time, you don’t have to stop processing webhooks while filling in the gaps.
OK but, it usually matters. If the events represent changes to some object, those changes almost always have to be handled in order if you are to arrive at the same end state as the source.
edit: Also, idempotent is not the right term here. Idempotent just would mean the event could be handled by the receiver multiple times w/o changing the meaning. If you need the events to be applicable out of order, then you need them to be commutative. This is a much more difficult property to ensure and in practice, I am guessing, almost non-existent in deployed webhook APIs.
I think that’s right re. idempotency. However, I think if each webhook is a statement of the state of an object at time t (using some monotonic definition of time, like a Lamport clock), commutativity is trivial: compare the time in the webhook event to the time in your local database, and only update your database if the webhook event is newer.
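Concretely, the consumer side becomes something like this (a sketch; `version` is whatever monotonic clock the sender stamps on each event):

```python
import sqlite3

db = sqlite3.connect("objects.db")
db.execute("CREATE TABLE IF NOT EXISTS objects (id TEXT PRIMARY KEY, state TEXT, version INTEGER)")


def apply_event(event):
    # Last-writer-wins on a monotonic version: stale, duplicated, or out-of-order
    # deliveries simply don't match the WHERE clause, so applying them is harmless.
    db.execute("INSERT OR IGNORE INTO objects VALUES (?, ?, ?)",
               (event["object_id"], event["state"], event["version"]))
    db.execute("UPDATE objects SET state=?, version=? WHERE id=? AND version<?",
               (event["state"], event["version"], event["object_id"], event["version"]))
    db.commit()
```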
IMO the real solution is to give me a better transport than RESTful HTTP! As many others have pointed out, things like Kafka are built for these kinds of use cases. So often I see people trying to design around the flaws in REST while ignoring that we've had some pretty good progress in the ensuing 20 years.
> If the follower goes down, when it comes back it can page through the history at its leisure. There is no queue, nor workers on each end trying to pass events along as a bucket brigade.
Sure, there is no queue. There is an append only log. Kafka is not a queue but an append only log.
I do not like the proposed solution. I do not like it because it assumes that I have to maintain the infrastructure to do all the distributed logging on my end. As in, most likely I have to maintain a Kafka cluster.
There’s also one thing glossed over in this article. What if your consumer went past certain messages but mishandled them? You can't go back to the past either.
If your consumer needs ordered delivery, webhooks might not be the best solution, indeed. But it might cost you more, because I need additional infra to provide you with that.
I think this is also getting at the difference between pub/sub and state synchronization. While one might think they want the former, what they really want is the latter: get some state and receive updates continuously, rather than deal with an unreliable stream of updates.
On HTTP, pub/sub is eventually guaranteed to drop some messages because TCP/IP itself is not a guaranteed networking protocol (it just makes some promises about failures being uncommon and probably detectable).
If what you want is guaranteed state synchronization, pub/sub alone can't give it to you.
I definitely don't want to see long polling. In that case I'd prefer a combination of /events and websockets, where websockets can push (or pass via a GET param) the last read event from /events to notify the server which is the last known event.
TFA addresses this by suggesting long-polling as an option rather than the only way to request `/events`
> In our integration with Stripe, it would be neat if we could request /events with a parameter indicating we wanted to long-poll. Given the cursor we send, if there were new events Stripe would return those immediately. But if there wasn't, Stripe could hold the request open until new events were created. When the request completes, we simply re-open it and repeat the cycle. This would not only mean we could get events as fast as possible, but would also reduce overall network traffic.
I think the author is fighting the wrong problem. Webhooks are a notification mechanism first of all, not a data transfer protocol. You can view it as a control plane which can be mixed or not with a data plane.
What they offer is a data plane, and it makes sense. It doesn't contradict the idea of webhooks, though; rather, it complements it. A consumer can get notified via a webhook when new data is available. Whether the data itself comes with the webhook, or is available via an additional API request, is a matter of design. Personally, I like the idea of separating the control plane from the data plane. However, in some cases it can be overkill.
I find myself thinking about this often for any kind of distributed system, and here's the rule of thumb: pulling events / polling works 100% of the time but is inefficient, and webhooks work 99% of the time but are more efficient.
Make the event ingestion job idempotent, so it can handle multiple receives, then hook up a pull / poll every 5 minutes or hour or whatever that starts at the last pulled date (very important, don't start at 5 minutes ago or you'll miss downtimes). Then optionally set up webhooks or a push model.
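In code, the reconciliation half is basically just the following (a sketch; the endpoint, field names, and five-minute interval are placeholders):

```python
import time
import requests

EVENTS_URL = "https://api.example.com/events"  # hypothetical
checkpoint = "1970-01-01T00:00:00Z"            # persist this somewhere durable


def ingest(event):
    ...  # must be idempotent: webhooks may already have delivered some of these


while True:
    # Always resume from the last *successful* pull, never from "5 minutes ago",
    # so downtime on our side just means a bigger (idempotent) batch next time.
    for e in requests.get(EVENTS_URL, params={"since": checkpoint}).json()["events"]:
        ingest(e)
        checkpoint = e["created_at"]
    time.sleep(300)
```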
One pattern I like to use is to have a regular /events endpoint, which works for polling. The endpoint also checks the accept header though, and if it sees `text/event-stream` it'll act as an EventSource endpoint.
EventSource is also nice because you can add cursors, so if the connection drops the source naturally catches up when the connection is re-established. While the interface is browser oriented, there's no reason not to use it in other contexts.
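A sketch of that dual-mode endpoint with Flask, where the 1-second sleep is a crude stand-in for a proper notify and the field names are made up:

```python
import json
import time
from flask import Flask, Response, jsonify, request

app = Flask(__name__)
EVENTS = []  # append-only log of {"id": int, ...}; a real system would use a database


@app.route("/events")
def events():
    cursor = int(request.headers.get("Last-Event-ID", request.args.get("cursor", 0)))
    if "text/event-stream" not in request.headers.get("Accept", ""):
        # Plain polling: return everything after the cursor and finish the request.
        return jsonify(events=[e for e in EVENTS if e["id"] > cursor])

    def stream(cursor=cursor):
        while True:
            for e in [e for e in EVENTS if e["id"] > cursor]:
                cursor = e["id"]
                # The "id:" field is what lets EventSource resume after a dropped connection.
                yield f"id: {e['id']}\ndata: {json.dumps(e)}\n\n"
            time.sleep(1)  # crude; a real server would block on a condition/notify

    return Response(stream(), mimetype="text/event-stream")
```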
I am in favor of doing both: provide Webhooks (Callbacks) and Feeds. Webhooks are great as triggers. The payload can be also minimal, especially if data is sensitive and authentication / authorization is an issue. And Feeds provide data (can be static / cached, served by CDNs, optimized for batch processing in different variants) at the consumers' pace. The combination of both are ideal.
History does repeat itself. All of those issues, and the solution, are the reason CouchDB is modeled the way it is: there's a single endpoint that gives you _all_ events happening in the database, in chronological order, with both document ids and "feed" ids, reachable with long-polling. All of this more than a decade ago already.
It says "This mode protects a table against concurrent data changes" but it does not elaborate how. Is it similar in consequences to what MariaDB describes?
Basically, a Postgres SHARE lock will ensure that there is no transaction concurrently modifying this table while the query over the table executes. This is a table-wide lock and, as I said, it's not the best solution, as it will slow down other event producers/consumers a bit.
In our case - an active data export process - it will slow down the main application a bit while reading new events.
The best solution would be using an external tool to handle streams, like Event Store, Kafka or the new RabbitMQ Streams. But we prefer to stick with Postgres.
Something that is underestimated in messaging systems is incremental numbers, e.g. add 1 for each message, so the receiving end gets 1, 2, 3, 4, 5, 6, etc., and if it then gets 8 it knows it missed the 7th message. It will also know if the messages are out of order, and it can catch up after going down by requesting the missing messages.
No... time is relative! (No pun intended.) Sending messages to the space station or a satellite in orbit would be off, or to a moving train, or a tall building... Did you know there are leap seconds, just like there are leap years? And atomic clocks are not yet standard on servers/PCs; what is standard, though, is a time client that will change the computer's clock at some random interval depending on how much drift it has.
A sequence number is also easier to parse.
I have the same thing in my backend. The backend has a main business server that, among other things, has an API endpoint one can query for a list of events in date-from - date-to fashion. I dismissed the idea of webhooks right at the design stage, as to me it looked like a minefield chock-full of potential problems.
What I don't quite get: This is all about server-to-server communication anyway, so no browsers or web platform limitations are involved. So why bother with HTTP at all and then add complexity to implement real-time event streaming on top of it?
Why not simply offer e.g. a public STOMP endpoint?
It's an incentives issue. Webhooks are for when the service provider is aiming for the bare-minimum functionality with the least overhead. If they have an incentive to maximize your consumption of the data, then they should offer both, and might even overnight you a physical copy.
The Telegram Bot API supports both webhooks and long-running polling calls (getUpdates).
Because Telegram bots often have no web UI, polling for events also has the plus of letting you block every incoming connection with a firewall. No DNS name or static IP needed!
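A rough long-polling loop against the Bot API's `getUpdates` method looks like this (error handling and types omitted):

```typescript
// Long-polling sketch against the Telegram Bot API getUpdates method (Node 18+).
// Only outbound HTTPS is needed -- no inbound port, DNS name, or static IP.
const TOKEN = process.env.BOT_TOKEN!; // your bot token from @BotFather

async function pollUpdates() {
  let offset = 0; // last seen update_id + 1
  while (true) {
    const url = `https://api.telegram.org/bot${TOKEN}/getUpdates?timeout=30&offset=${offset}`;
    const { ok, result } = await (await fetch(url)).json();
    if (!ok) continue;
    for (const update of result) {
      offset = update.update_id + 1; // confirming an update this way also clears it server-side
      console.log("got update", update.update_id, update.message?.text);
    }
  }
}

pollUpdates().catch(console.error);
```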
Ah, I hear that a lot from customers I work with. A true event-driven system needs both events and webhooks. You will always have apps that only interact via REST, so you can't really use a streaming architecture there, but you can make them more real-time via webhooks.
The article talks about issues with webhooks, such as messages being lost if the receiving service goes down, and about developers daisy-chaining multiple services together into a solution that isn't robust.
That's why you need a broker that does event distribution, supports multiple protocols (REST, AMQP, MQTT, WebSockets...) natively without any proxies, and supports webhooks. You can push messages to your REST clients, and if they disconnect, the messages pile up in a queue, ready to be consumed when the client reconnects.
Solace PubSub+ Broker does all of this. Disclaimer: I work at Solace.
I think eventually systems of record should be able to push directly to well-authenticated queues.
We are a data- and CPU-intensive API, and we're moving from a client-request flow to one where clients push to an API, get a pending response, and then we hit their queue/webhook once the data is ready. Eventually I think the exactly-once semantics found in Kafka and other stream processing will become the norm and a better alternative to webhooks, given the right security constructs.
Giving partners write access to a queue, or even a database connection to push data directly, may sound suspect today, but it could be common practice in a few years.
Holding requests open for long polling is probably difficult to scale. How do you persist connections when application servers drop in/out? How do you load-balance? What about socket descriptor limits? etc...
I tried skimming the article to try to derive the missing elevator pitch, but I saw them reimplement precisely the system they maligned at the start (push notifications with polling).
Did anyone here understand what their value proposition even is?
Did you check the website? The value prop is incredibly clear: they give you a Postgres database you can query with SQL rather than having to either (a) learn and code against a custom API or (b) set up your own sync system to get that Postgres db.
I'm talking about the elevator pitch of their blog post, not of their company. Which are actually two distinct things (they're not describing their service in this blog post).
The premise of this post is wacky. They're trying to argue for how web applications should provide consistency to operations on remote systems. You know what that's called? Distributed Computing. I don't know if you know this, but Distributed Computing Is Hard. You can't solve it with a new interface or polling really fast.
Webhooks are perfectly fine for what they're intended for, which is inconsistent push-based notifications to loosely coupled web apps. If you require "consistency", you supplement with polling and queues and other junk. If you require real consistency, you must use a distributed consensus algorithm.
Distributed consensus/distributed computing applies to mutual/full-duplex/two-way systems. Syncing changes one way from a third party is not distributed computing, and not something you would want to throw a distributed consensus algorithm at.