Give me /events, not webhooks

bkrausz · on July 13, 2021

I was responsible for Stripe's API abstractions, including webhooks and /events, for a number of years. Some interesting tidbits:

Many large customers eventually had some issue with webhooks that required intervention. Stripe retries webhooks that fail for up to 3 days: I remember $large_customer coming back from a 3 day weekend and discovering that they had pushed bad code and failed to process some webhooks. We'd often get requests to retry all failed webhooks in a time period. The best customers would have infrastructure to do this themselves off of /v1/events, though this was unfortunately rare.

The biggest challenges with webhooks:

- Delivery: some customer timing out connections for 30s causing the queues to get backed up (Stripe was much smaller back then).

- Versioning: synchronous API requests can use a version specified in the request, but webhooks, by virtue of rendering the object and showing its changed values (there was a `previous_attributes` hash), need to be rendered to a specific version. This made upgrading API versions hard for customers.

There was constant discussion about building some non-webhook pathway for events, but they all have challenges and webhooks + /v1/events were both simple enough for smaller customers and workable for larger customers.

alexbouchard · on July 13, 2021

Shameless plug but I've built https://hookdeck.com precisely to tackle some of these problems. It generally falls onto the consumer to build the necessary tools to process webhooks reliably. I'm trying to give everyone the opportunity to be the "best customers" as you are describing them. Stripe is big inspiration for the work.

Redsquare · on July 13, 2021

Do you provide the ability to consume, translate then forward? I am after a ubiquitous endpoint i can point webhooks at and then translate to the schema of another service and send on. You could then share these 'recipes' and allow customers to reuse well known transforms.

dools · on July 14, 2021

You can do this with BenkoBot, we just launched custom webhooks (although it's not in the interface yet). So you can receive a webhook and run some arbitrary javascript to transform it then send it on somewhere else:

http://www.benkobot.com/

Our main focus is on handling Trello notifications, and the Trellinator library I wrote is built in, our objective is to create more API wrappers over time to make it as simple as possible to deal with as many APIs as possible. You can see some example code here:

https://trello.com/b/IoHmhz5c/benkobot-community-board

You currently require a Trello account API key/token to sign up, but you can use it as you described to be a generic endpoint, transform however you want with JS then post the data onto another endpoint.

JimDabell · on July 14, 2021

This is a fairly common use of no-code glue services like Zapier, IFTTT, Cyclr, etc.

alexbouchard · on July 14, 2021

Transformations is something we haven't built yet but we have our eyes on it as your are not the first one to bring that up. You can use Hookdeck in front of lamda and to the transformation there, you'd still get the benefit of async processing, retries, etc

rkazokas · on July 14, 2021

Do you have an idea of when hookdeck will have transformations. It's not something we need immediately but would be the win over something like: https://webhookrelay.com/ if it's something you have on your roadmap for sometime soon.

alexbouchard · on July 14, 2021

Can you reach out to me, I'd love to talk about your use case and prioritize accordingly. Email is alex at hookdeck dot com

stympy · on July 16, 2021

My app, hookrelay.dev, has transformations today. :)

rattray · on July 14, 2021

> We'd often get requests to retry all failed webhooks in a time period.

(I worked on the same team as bkraus, non-concurrently).

For teams that are building webhooks into your API, I'd recommend including UI to view webhook attempts and resend them individually or in bulk by date range. Your customers are guaranteed to have a bad deploy at some point.

Eiriksmal · on July 14, 2021

At Lawn Love, we naively coupled our listening code directly to the Stripe webhook... but it worked flawlessly for years. I wasn't a big fan of the product changes necessitating us switching from the Transfer API for sending money to the complicated--and very confusing for the lawn pros--Connect product, but its webhooks also ran without issue from the moment we first implemented them. So thanks for making my life somewhat easier, Mr. Krausz.

Like many others, I now pattern my own APIs after Stripe's.

bkrausz · on July 14, 2021

Don’t fully thank me, I was also the architect of the Transfers API to Connect transition :). There’s a lot I would have done differently there were I doing it again, though much of the complexity (e.g. the async verification webhooks) were to satisfy compliance needs. Hard to say how much easier the v1 could’ve been given the constraints at the time, though I’m very impressed with the work Stripe has done since to make paying people easier (particularly Express).

alexgartrell · on July 13, 2021

I think the Stripe API stuff you did was fine, but you really did your best work as a concepts of mathematics TA.

ctas · on July 13, 2021

Can you share a bit about how these events are stored on Stripes backend e.g. Kafka, Postgres?

bastawhiz · on July 13, 2021

It's all just kafka and mongo. The event can be stored in any simple k/v storage. There's no magic.

Edit: not sure why I'm being downvoted. I work at stripe and this is literally how it works.

bkrausz · on July 14, 2021

Hi Basta! Can confirm both that he works at Stripe and is right.

Years ago there wasn't even a Kafka portion, that's newer.

ctas · on July 14, 2021

Thanks for the input. We're currently working on a similar solution, so I was really curious to learn more.

One thing I really admire is how Stripe makes it transparent which events were fired both in general through the Developer area, and on specific objects like customers, subscriptions, etc..

spullara · on July 13, 2021

Pretty easy for a customer to setup an SQS queue and a lambda for receiving them rather than rely on their infrastructure to do all the actual receiving. Way more reliable than coupling your code directly to the callback.

jon-wood · on July 13, 2021

This is precisely what we do where I work. We have a service which has just one responsibility - receive webhooks, do very basic validation that their legitimate, then ship the payload off to an SQS queue for processing. Doing it this way means that whatever’s going on in the service that wants the data, the webhooks get delivered, and we don’t have to worry about how 3rd party X have configured retries.

tasn · on July 14, 2021

These reasons are exactly why we started Svix[1] (we do webhooks as a service). I wish we existed to serve you guys back when you started working on it. :)

[1] https://www.svix.com

throwaway290232 · on July 13, 2021

I always laugh when people end up with designs like this. They could have just used SMTP! It's designed to reliably deliver messages to distributed queues using a loosely-coupled interface while still being extensible. It scales to massive amounts of traffic. It's highly failure-resistant and will retry operations in various scenarios. And it's bi-directional. But it's not "cool" technology or "web-based" so developers won't consider it.

Watch me get downvoted like crazy by all Nodejs developers. Even though they could accomplish exactly what they want with much less code and far less complex systems to maintain.

chaps · on July 14, 2021

I pitched an idea like this years ago to essentially backfill one ticketing system to shiny new system that could read an email inbox. The idea was that if we dropped an email in that inbox with its desired format for each old ticket's updates, the new system would do all the necessary inserts and voila. They told me no -- not because of any technical reason, but because their email infrastructure was required to be audited by the SEC, they would have opened themselves up to significantly more auditing. Instead, I ended up having to do it through painful, painful SQL.

Lesson being, that sometimes there are unexpected reasons why a specific piece of technology shouldn't be used.

aabbcc1241 · on July 15, 2021

You're not allowed to use SMTP without calling it email ?

It sounds like one not allowed to use http for restful APIs without calling it a website. (And that org require website to be audit to fulfill accessibility measurement for physical disability)

chaps · on July 16, 2021

For the record, I disagreed with them also and pushed back pretty aggressively and found workarounds to the audit problem. But CTO and CIO basically took it as a challenge against their authority and denied me at all points.

darkerside · on July 14, 2021

Weren't these all emails already? Weren't you required to retain them for the SEC even before falling into this specific (hypothetical) inbox?

user5994461 · on July 14, 2021

It's not clear from the message whether the software is setting up their own email system, if so it will need to be audited and certified, which is a major hassle.

Either way, the auditors and the infrastructure might not want to handle an order of magnitude more traffic (API usage is really in a different league than occasional human email). Expect all emails to have to be stored for around a decade.

chaps · on July 15, 2021

For the original system, no, they were not emails.

rendall · on July 13, 2021

The suggestion to use SMTP is interesting.

I didn't downvote you but I bet they come from this part. People don't like this kind of negativity.

> But it's not "cool" technology or "web-based" so developers won't consider it. Watch me get downvoted like crazy by all Nodejs developers.

forgotmypw17 · on July 13, 2021

I agree that people don't react well to negativity, but sometimes you have to say it. Node has a lot of very stupid (i.e. ignoring reality) decisions, and by extension, being exposed to this for a long enough period of time, tends to affect the developer as well.

I say this from experience, as someone who's used a few stupid technologies over time.

strken · on July 13, 2021

This was not a time when you had to say it. None of the existing conversation was about languages or ecosystems, and taking cheap shots was wholly unnecessary to the suggestion of using SMTP.

SMTP itself is interesting, although it comes with fun new footguns like STARTTLS.

cutemonster · on July 14, 2021

What makes STARTTLS a footgun? (I'm curious since I use it sometimes)

virtue3 · on July 14, 2021

"STARTTLS is an email protocol command that tells an email server that an email client, including an email client running in a web browser, wants to turn an existing insecure connection into a secure one."

I'm sure absolutely nothing bad will come from that last bit. Oh look:

"And yes, STARTTLS is definitely less secure. Not only can it failback to plaintext without notification, but because it's subject to man-in-the middle attacks. Since the connection starts out in the clear, a MitM can strip out the STARTTLS command, and prevent the encryption from ever occurring."

teddyh · on July 14, 2021

DANE is meant to fix that. If someone asserts, via DNS records (signed by DNSSEC), that their SMTP server is able to use TLS, then you should only accept connections using TLS to that SMTP server.

tptacek · on July 14, 2021

And DANE is never going to happen; DANE advocates have been saying this for over a decade, and the only change has been that the IETF and all the major email providers moved forward on a new protocol, MTA-STS, specifically to avoid needing DNSSEC (which nobody uses) to solve this problem.

teddyh · on July 14, 2021

Almost every time anyone mentions DNSSEC here on HN, you pop up like a jack-in-the-box to claim that nobody is using it and that it is dead. And it’s always you, nobody else. Whereas, from where I sit, I work at a registry and DNS server host (among other things) where about 40% of all our domains have DNSSEC (and that number is constantly climbing). Every conference I go to, and in every webinar, people seemingly always talk about DNSSEC and how usage is increasing.

You might have some valid criticism about the cryptography; I would not be able to judge that (except when you are basing it on wildly outdated information). I’m not an expert on the details; you could most assuredly argue circles around me when it comes to the cryptography, and possibly about the DNSSEC protocol details as well. But, from my perspective, your continuous claim that “nobody uses” DNSSEC is simply false. DNSSEC works, usage of DNSSEC is steadily increasing, and new protocols (like DANE) are starting to make use of DNSSEC for its features. Conversely, I only relatively rarely hear anything about MTA-STS.

tptacek · on July 14, 2021

Take any list of the top domains on the Internet --- any of them at all --- and run them through a trivial script, like:

    #!/bin/sh
    while read domain
    do
    ds=$(dig ds $domain +short)
    echo "$domain $ds"
    done

... and note that virtually none of the domains, in any sane list of top domains, are signed. That was true several years ago and remains true today, despite the supposed "increase in usage" of DNSSEC.

What's actually changed is that registrars, especially in Europe, now apparently auto-sign domain names. That creates a constant stream of new, more-or-less ephemeral signed zones that gives the appearance of increasing DNSSEC adoption. Of course, this is also security theater (the owners of the zones don't own their keys!). The real figure of merit for DNSSEC adoption is adoption by sites of significance, and that has been static, and practically nonexistent, for a decade.

It is no surprise to me that people working on the DNS talk quite a bit about DNSSEC. People who worked on SNMP talked quite a bit about SNMPv3, and IPSEC people probably really believed there would be Internet-wide IKE. None of those things happened, because what matters in the real world is what the market decides. Most especially at the companies with serious security teams, DNSSEC is a dead letter standard.

teddyh · on July 15, 2021

Registrars can’t “auto-sign” domains. Only DNS server operators can do that, if they have the cooperation of the registrar. And the DNS server operators is the only workable definition of “owners of the zones”, so they do own their keys. It can’t work any other way.

In fact, the new CDS and CDNSKEY DNS records allow it to work the other way around; DNS server operators can auto-sign domains, and the registrars need not be involved at all.

> The real figure of merit for DNSSEC adoption is adoption by sites of significance

People said the same about IPv6. Or maybe you do, too?

> People who worked on SNMP talked quite a bit about SNMPv3

I seem to recall you mentioning quite often how WHOIS was dead and would be replaced by RDAP. That didn’t happen either.

> IPSEC people probably really believed there would be Internet-wide IKE

Interestingly, that problem could in theory be solved by DNSSEC. We’ll see what happens.

tptacek · on July 15, 2021

I don't think you ever saw me mention that WHOIS is dead, not least because that's not a thing I believe. What a random thing to say; you can just use the search bar to immediately see the (very few) things I've had to say about RDAP here.

cutemonster · on July 14, 2021

And, reading more at StackOverflow, from where Virtue3's quotes are?, This: https://serverfault.com/questions/523804/is-starttls-less-sa...

I find:

> If the client is configured to require TLS, the two approaches are more-or-less equally safe. But there are some subtleties about how STARTTLS must be used to make it safe, and it's a bit harder for the STARTTLS implementation to get those details right.

I previously thought that was the default, good to know it isn't / might not be

Thanks everyone :-)

throwaway290232 · on July 15, 2021

DANE is a kludge that should be put to bed, not promoted as a solution to a problem which shouldn't exist.

STARTTLS exists for two reasons (https://www.fastmail.com/help/technical/ssltlsstarttls.html):

1. Wanting to accept mail insecurely.

2. Not wanting to use two different TCP port numbers to send and transfer mail.

To solve these problems they created STARTTLS. But obviously, STARTTLS isn't actually secure (even though that was the point of supporting TLS). So to make it secure, it's suggested to use DANE - a standard built on a different procotol, requiring a feature that is controversial, potentially dangerous, and not widely implemented. So you can use a kludge (STARTTLS) with a kludge (DANE) to send and transfer mail securely. But should you?

Since 2018, RFC8314 says that e-mail submission should use implicit TLS, not STARTTLS (https://datatracker.ietf.org/doc/html/rfc8314#section-3). Therefore the use of STARTTLS, and the use of DANE to make it secure, are deprecated. So while you shouldn't use DANE for anything seriously, you really shouldn't use it for SMTP.

teddyh · on July 15, 2021

Even if implicit TLS is used instead of STARTTLS, DANE is still necessary to avoid forcing backwards-compatible agents to fall back to unencrypted traditional communication.

DANE is necessary as long as there are still some agents using backwards-compatible behavior; i.e. falling back to unencrypted communication if TLS is in some way blocked.

throwaway290232 · on July 19, 2021

Those agents should not be falling back to unencrypted anyway! The whole ecosystem just needs to get onboard with implicit TLS and deprecate the old agents. It's not acceptable to make the whole ecosystem dependent on two completely different security mechanisms. Every client/server in the world would have to support both indefinitely, which would be a totally unnecessary cost and complexity burden.

teddyh · on July 30, 2021

I mean, if we accept completely deprecating non-TLS connections, then there still would be no problem with STARTTLS! Servers would just need to only allow the STARTTLS command, and refuse any commands until after the TLS handshake. I believe that many server programs allows this configuration today.

It is only when we allow backwards compatibility that something is needed to differentiate to the clients whether the server is new enough to allow TLS or not.

vidarh · on July 14, 2021

That's only a footgun if your system is set up to allow an insecure connection to continue. Just because the protocol allows it does not mean you can't add additional requirements.

rendall · on July 14, 2021

> Node has a lot of very stupid (i.e. ignoring reality) decisions, and by extension, being exposed to this for a long enough period of time, tends to affect the developer as well.

As a developer who used Salesforce for nearly a year once upon a time, I can confirm that exposure to stupid decisions in a platform can affect the developer.

Node, though? Could you expand on the stupid decisions in Node? And does Deno address those?

IggleSniggle · on July 14, 2021

I use and love nodejs daily, and I think I can speak to some of the stupid. A lot (but not all) has to do with ecosystem.

Some of the stupid in node just comes from the fact that there's still a lot of reinventing the wheel, and doing a less good job of it. Like, we've got all these backend frameworks, but still nothing at all that compares to eg Spring. Can you even find a nodejs lib that does HATEOAS properly and completely? How often do you find yourself doing string parsing, or handling a JSON object, when you know it would be more efficient to be handling a stream, or that really the kind of work you're doing ought to be handled by your framework but isn't?

As for nodejs itself, it's much better in 2021 than it was in the past. But it's still a massive runtime. And I have mixed feelings about eg Worker threads. As for node_modules, I get the sense that we're just replaying the history of Microsoft's dll story, needing to relearn all the lessons that should have been learned already.

As for Deno, I think it comes with great ideas. In many respects, I like it better than Nodejs. Most of its good ideas, Nodejs is flexible enough to accomodate. One of Deno's main advantages is that it doesn't have any legacy to support, so it can embrace things like ECMAScript modules more easily. Its library system is closer to Go, although I think the end result is that a lot of folks end up doing one-off systems that end up looking a lot like the nodejs module resolution system in the end. Deno's main disadvantage is that it is not compatible with nodejs libraries. That's also an advantage insomuch as you have a clear module import spec from the get-go.

In short, the stuff that Deno can do, Nodejs can do, and I'm not sure that it's cleaner system can overcome the fact that the same is accomplishable in Nodejs. I'd be more than willing to use Deno in a greenfield project because I like all the technology choices it makes, but fundamentally, the technology choice you're making is whether or not to use V8, and adopting Deno is almost just a way of pressing the reset button the ecosystem, which may or may not be a good thing depending on your needs.

kreetx · on July 14, 2021

Since it doesn't relate at all, why even add the negativity?

"Here's a crazy idea ... This are the properties for why it works ..." Would have sounded much better.

Aeolun · on July 13, 2021

Does Node have that? Or do node libraries have that?

I find node to be surprisingly well rounded.

vidarh · on July 13, 2021

I actually did use SMTP as queuing middleware for a registrar platform years ago.

It worked very well.

EDIT: To add some context, my team had come off building a webmail platform, and so we'd done lots of interesting stuff to qmail and knew it inside out. We then launched the .name tld and built a model registrar platform that on registration would bring up web and mail forwarding for users that wanted it. We used SMTP to handle the provisioning of those while keeping the registration part decoupled from the servers handling the forwarding. We also used it to live-update a custom DNS server I wrote.

buro9 · on July 14, 2021

I remember interviewing someone who worked on a DNS platform where IIRC the DNS zone files were propagated by SMTP to DNS servers. The details on this were that there was a 5-minute SLA (I believe) on the loading of zone records, essentially that the DNS servers were polling the mailbox and parsing new records since some last loaded time stamp.

vidarh · on July 14, 2021

For a second there I wondered if you'd ever interviewed me (but having looked at your profile: no; I don't think we've met, though I'm in London too).

We had similar-ish constraints. SLA was internal, not imposed (the .name registry had externally imposed SLA's, but the registrar platform did not), but the zones were very simple - either NS records pointing elsewhere, or identical CNAME/MX records, so we needed only a short string per address.

I don't remember if we used CDB files or if we stored individual records directly in ReiserFS filesystems (our mail platform had relied heavily on the ability of ReiserFS to handle vast quantities of tiny files, so were comfortable with that), but it was definitively something simple.

Similar for the web forwarding, which just required a url to redirect to.

If a node should ever need to be replaced, all we'd need to do would be to start a queue on a new box but not process it, then rsync over the dataset from another server, and start processing the queue, and add it into rotation when up to date. If we'd needed stricter consistency guarantees it'd have been a different consideration.

For many types of workloads I'd pick another queuing system today, but the amount of readily available tooling for e-mail, especially once you need federation, reflection/amplification etc. does make it an interesting choice for some things.

It also made debugging the message flow trivial: just add a real mailbox to the cc:....

erikpukinskis · on July 14, 2021

Your suggestion about SMTP is a good one. Disappointing that I had to downvote your comment for the ad hominem on us old Node developers.

Why you need to insult a whole body of people, rather than just make a claim about the technology, I don’t know.

tremon · on July 14, 2021

Node developers are old now?

erikpukinskis · on July 23, 2021

I'm 39. Not sure what counts as old for you.

eyelovewe · on July 14, 2021

[flagged]

staz · on July 14, 2021

so someone told you he was prejudiced against, you respond with more prejudice and then you pretend it's his attitude that colored your perception. Are you for real?

deleted-account · on July 14, 2021

[flagged]

Jonnax · on July 14, 2021

Prejudice is also an English word that has a meaning unrelated to law.

Calling JS developers most dumb is prejudice.

The word's meaning is literally spelled out "pre" + "judge".

mrzimmerman · on July 14, 2021

Honestly this is so stupid brilliant I love it (stupid as in I can’t believe I hadn’t considered this). Honestly it really is about storing, sending, and checking messages so SMTP makes so much sense!

I’ve been building for the web for 15 years and it shows how far I can hyper focus on certain communications implementations that I’m not looking at pre-existing options that really meet a large number of use cases. I suppose it also means making sure your data consumers are comfortable working with the protocol but it’s a really top notch idea.

kbenson · on July 14, 2021

SMTP used to be a lot more reliable than it is now. Now, with all the changes to help with blocking spam, you have to be very careful or have a lot of control over the receiving server to ensure you actually get delivery. Some anti-spam systems will just discard if the matching rules indicate the spam likelihood score is above a certain threshold, and mistakes in rules at system levels can and do happen.

But here's another way you could (ab)use the mail system for delivery, provide a mailbox for the client and just allow IMAP or POP access and throw the messages into that. The client can log in to access and process them (which they would likely be automating on their own mailboxes anyway). It does mean it's housed at the provider, but it's also pretty easy to scale. There's lots of info on how to set up load balanced dovecot clusters out there, and even specialized middleware modes (dovecot director) to make it work better so you can scale it to very very large systems.

sobani · on July 14, 2021

I don't think mr throwaway was advocating to use email to send the events, only to use SMTP. Email is an entire ecosystem, SMTP is only a protocol.

If the distinction is too hard to make: think of it as using the 'Simple Event Transfer Protocol' that just happens to use exactly the same protocol as SMTP.

kbenson · on July 14, 2021

> Email is an entire ecosystem, SMTP is only a protocol.

Yeah, but it's a protocol for transferring email. As I noted with "you have to be very careful or have a lot of control over the receiving server to ensure you actually get delivery", you can amstract most of the mail system out as long as you ensure you are running the server they deliver to, but you would also need to rely on them making sure their outgoing server is good for this, which probably means dedicating it to this and not running any real mail through it (so you avoid outgoing company email filters, etc). At that point, both sides are running specific bespoke mail servers, which cuts down on the usefulness of the solution because of how much setup and administration it requires.

It used to be nobody ran incoming and outgoing filtering on email, so it was a robust channel for communication with retries, and notifications for failed delivery, etc. These days it's not exactly that because of all the spam mitigations and company compliance and risk mitigations that might be in place, etc. In fact, just setting up a new mail server and attempting to send to microsoft (live/hotmail), yahoo or gmail is extremely hard, because they have a high bar for acceptance, and large swaths of the easily obtained IP space have already been blacklisted from prior spam use so you start at a bad reputation and have to work to get it to a level you've even be allowed to talk to other by working with all the third party (and first party) blacklist maintaining entities.

R0b0t1 · on July 14, 2021

It's not uncommon to set up daemons that only talk to each other for infrastructure monitoring and reporting.

fouc · on July 14, 2021

Yeah it does make more sense to have the IMAP/POP setup rather than actively sending out emails through consumer level email services like gmail etc where deliverability might become a concern.

mlk · on July 14, 2021

at that point you'd better off using an atom/rss feed

kbenson · on July 14, 2021

The difference is that using a mail subsystem to handle this handles a lot more of the implementation than "use an atom/rss feed".

Notably, in choosing to use an atom/rss feed, you need to determine what the webserver serving it is, how to implement authentication on top of it (is it a token/oauth, HTTP auth, param auth, etc), what is the underlying data store (SQL/NoSQL, some message system), how to scale that system if you expect it to be large and span multiple servers and/or datastores (mail systems right now deal with hundreds of thousands of users and gigabyte plus mailboxes of millions of messages).

Choosing IMAP to deliver this info means there there are well worn solutions for all the decisions you need to make (including howtos to implement oauth at the server level), as well as client level libraries in almost every language. Basically, you could decide to use it and not have to worry about forging a new path on that system basically ever, because there's plenty of people that have already implemented it at a larger level and with the same features (even if you would be using them to slightly different effect), and they've contributed the info on how to do it and what the performance ramifications are to the public domain.

I'm not seriously advocating for it, but that's more because clients will look at you funny than for any technical reason. Technically, it actually has a lot going for it. Unfortunately as an industry we fetishize the new and bespoke because obviously our own unicorn projects are so new and special and will serve so many people that some off the shelf solution could never be as good....

wruza · on July 13, 2021

SMTP would raise too many questions, from how both datacenters tolerate it (spam), to who will manage the receiving server itself and certificates on your side, and overall security of this setup. For a nodejs developer it’s really easier to spin up a separate handmade queue process rather than managing SMTP-related things. Webhook (for runtime) and long-polled /events?since= (for startup) have all upsides with little downsides.

bkrausz · on July 13, 2021

When designing something like this as a service, the biggest question is what other developers will find easy to use. Every cheap host supports inbound HTTP requests, and most web developers know how to receive them.

Stripe needs to be usable by both the developers building intense, scalable, reliable systems and the people teaching themselves to code in a limited context on a limited platform.

pmelendez · on July 14, 2021

>And it's bi-directional. But it's not "cool" technology or "web-based" so developers won't consider it.

I might be missing a point or two here, but I don't see how SMTP can work for this case at all. You would require every API consumer to setup a SMTP server (which is another piece of infrastructure to maintain), and then somehow have a layer of authentication so the recipient can control who post messages on that server (overhead for the publisher per new customer). Then we still haven't resolved the issues on the customer side (a bad code that could pop all messages and now we might require the publisher to replay them again).

I haven't even started to think about security and network hardening challenges yet. Again, I might be missing the point but this is not a case of cool tech overuse to me.

vidarh · on July 14, 2021

SMTP servers supports SSL. Using client certificates and/or HMAC signed messages takes care of the security. You have the same security consideration for HTTP.

As for "setting up an SMTP server", the point is that compared to the current requirement of a webhook, you're going to need a queuing mechanism or a pull mechanism or both anyway. So you can build a custom solution, or you can pick an existing queuing mechanism that people have spent literally decades providing a vast array of software options for.

And yes, you're right, you can always ending up needing a way to trigger a replay because no matter what you do the customer might do something stupid. Nothing you do will get you away from that. So either you require them to always pull, or you provide an option to push and an API to trigger redelivery for when they've done something stupid. If you opt for push, SMTP is an option worth considering, because no other queuing mechanism has as many available ready-made and battle hardened queuing options.

There are many cases where it'd not be suitable, but in the situations where SMTP is a bad choice, webhooks are likely to be an absolutely awful choice.

I speak from actually having run messaging on SMTP both as an e-mail provider with a couple of million users and having used it as messaging middleware in production.

dataflow · on July 14, 2021

I'm confused, "use SMTP" doesn't even type-check for me. Isn't SMTP just a transfer protocol? Meaning it defines a bunch of commands and gives them meanings (like EHLO and DATA and such), just like how HTTP defines commands like GET and POST and all that? Isn't the problem here about e.g. the storage & retry logic rather than about the data transfer itself? Can't you retry transmission as frequently as you like using whatever protocol you like? How does transferring the data over SMTP gain you anything compared to HTTP?

wccrawford · on July 14, 2021

"use SMTP" here is a short way of saying "send mail to a mail server that will store the requests indefinitely" instead of webhooks that are constantly retrying on a protocol that was hand-written instead of being baked into the whole internet already.

dataflow · on July 14, 2021

> "use SMTP" here is a short way of saying "send mail to a mail server that will store the requests indefinitely"

So the suggestion is to use email? That's not how others are interpreting it. [1] And it doesn't make sense to me either. Emails as they are "baked into the whole internet already" are unencrypted with tons of middlemen, and even their transport isn't guaranteed to be encrypted. Email is also munged and messed with in weird ways, with fun stuff like each middleman tacking on their own headers and filtering it out based on unknown rules. It also introduces a ton of latency and severely prioritizes "eventually reaching the destination" over timeliness. And more downsides I can't think of off the top of my head. That seems like a really poor choice for an event delivery mechanism.

[1] https://news.ycombinator.com/item?id=27830705

throwaway290232 · on July 15, 2021

There are a billion lego pieces out there in the e-mail ecosystem. We can combine them any way we want, if the alternative is a totally custom solution anyway (webhooks, custom API endpoints, cron jobs, queues, etc). There are so many options; where to begin!

First, you don't have to use the rest of the internet's e-mail system. Stripe can run their own mail servers that deliver straight to clients on non-standard ports using implicit TLS, ensuring security and no middle-men. This also ensures delivery is as timely as possible (sub-second typically, as mail software has to be fast to handle its volume).

Let's say you want to poll (ex. "/events"). The client uses IMAP to poll the Stripe server with a particular username/password. Check a folder, read a message, delete it on connection close. There are of course ready-made solutions for this, but you can also write simple IMAP clients really easily using libraries.

Let's say you want pushes (ex. webhooks). The client sets up the alternative to the webhook-server they'd have to set up anyway: an SMTP server. Use a custom domain, one that has nothing to do with the customer's main business, so nobody ever gets confused. Configure it to only accept mail from a "secret mail sender" (aka webhook secret). Part of the "SMTP webhook URI" would be what mailbox to deliver the webhooks to. The client then configures an MDA on their mail server to immediately deliver new messages to some business logic code. If the MDA or business-logic code has a bug, the messages will stay in the client's mailbox until they are "delivered" successfully. If the client's SMTP server is down, Stripe keeps retrying for at least 3 days, more if Stripe wants.

Stripe could actually implement both by keeping messages in an IMAP folder on Stripe's servers, and deleting the messages once the SMTP server confirms delivery to the client. Of course all messages already have unique IDs so removing dupes is easy.

You could implement all of this in a week, write almost no code, and still handle all the weird edge cases. Virtually all of that time is just reading manual pages and editing config files. The end result is a battle-hardened fault-tolerant event-driven standards-based distributed message processing system. The maintenance cost will be "apt-get update && apt-get upgrade -y", and anyone who can configure Postfix and Maildrop can fix it.

pokoleo · on July 15, 2021

Hey throwaway! I think this could work, but it might not be the highest priority. Think about this perspective:

The API is entirely HTTP, and tries to meet users where they are by providing tools that they are most comfortable with. Frequently, these users are familiar websites or mobile apps. As such, webhooks are implemented over HTTP.

If there was an alternate way to integrate with events, it'd be something that's either:

1. accessible to novice users, or

2. delivers on high throughput/latency needs of the largest users, or

3. resolves a storage/latency/compute cost incurred behind the scenes

Thinking about these:

For #1, websockets would score better than SMTP

For #2, kafka (or a managed queue like SQS) would score better (many support dead letter queues and avoid the latency at the mail layer)

For #3, it isn't clear that SMTP reduces the latency, compute, or storage costs

SMTP might be familiar -- and it's possible for you to build your own webhook → SMTP bridge if you wanted it -- but doesn't score well enough on any of these metrics to be built in-house.

[Disclaimer: I work at Stripe, and these opinions are about how I'd approach this decision. They're not the opinions of my employer.]

throwaway290232 · on July 15, 2021

Yeah, I totally agree it's really out of left field compared to what users are comfortable with. Like another commenter said, clients would probably laugh you out of the room for proposing it. (though that's half my point! why are we only accepting these half-baked custom solutions on janky platforms? fear of criticism? is it really saving anyone any time or money compared to the "weird solution"?)

But I'm not convinced on the latency/compute/storage comparison with Kafka or other solutions. I think a POC would need to be built and perf tested, and then tweaked for higher performance and lower cost, like most software. Considering the volume of traffic that mail software is designed for, I can't see how even a large provider like Stripe would have difficulty scaling a mail system to match Kafka. It's not like mail software is written in Java or something ;-)

kgwxd · on July 14, 2021

What about events that need faster than 1 minute response times? Any push notification like system is going to be just as error prone. And what about multiple message handlers? And what happens when the send fails? Did someone write the code to check the inbox for them and handle them? When a send fails multiple times, is that logged and is there a system for clients to check that log? Message transfer isnt the hard problem in this domain.

vidarh · on July 14, 2021

There's nothing about SMTP that dictates response times or in any way makes it much slower than HTTP. A non-pipelining clients will require a few more network roundtrips if it connects and disconnects for every message, that's all.

> Any push notification like system is going to be just as error prone.

E-mail servers are built with retry logic and queuing logic already. The point is if you need queuing anyway, it offers a tried and tested mechanism with a multi-decade history and a vast number of interoperable software options. While there is now a relatively decent number of queuing middleware options, none of them have as many server and client options as SMTP.

SMTP isn't the best choice for everything, but it works (I've used it that way), it's reliable, and it scales with relative ease.

> And what happens when the send fails?

It gets retried. Retries are built in to mail servers. That's part of the point.

> And what about multiple message handlers?

What about it? Most SMTP servers provides mechanism for plugging in message delivery agents rather than delivering to a mailbox, or you let it deliver to a mailbox and pick it up from there. Or you plug in whatever routing mechanism you want to distribute the messages further. The sheer amount of ready-built options here is massive.

> Did someone write the code to check the inbox for them and handle them? When a send fails multiple times, is that logged and is there a system for clients to check that log?

Pretty much every e-mail server ever written provides a mechanism for handling persistent failures, and many of them offers heavily configurable ways of doing it. But yes, you'd need to decide on what to do about persistent failures. But you need to do that whatever queuing system you use.

> Message transfer isnt the hard problem in this domain.

The point of the article is exactly that reliable message transfer is the hard problem in this domain.

nijave · on July 13, 2021

>so developers won't consider it

I think it depends on the developer. There's developers hammering out boring business logic as fast as possible and there's developers with a deep understanding of machine internals, protocols, and infrastructure. For the former, SMTP is black magic they'd probably never think of and involves engaging the one infra person that's always busy

It also means standing up and managing "infrastructure"

throwaway290232 · on July 13, 2021

I sort of agree, but somebody already has to manage the "infrastructure" of their web apps, dns. They never mind adding more of their own home-grown services. If they used Kinesis instead that's another piece of infra to maintain. But you would never hear them say "what about Postfix instead". Regardless of infra, if it's new, they want to use it, even if something older and more boring would work better.

If I ever heard a dev at work say "No I won't use that new tech, it's too untested/I'll have to spend more time figuring out how to make it work well", I would shit my pants. Whereas if it's old tech, "it's not modern/I'll have to spend more time figuring out how to make it work well". It's practically software ageism...

wruza · on July 14, 2021

You’re likely blinded by a “nodejs monkey developer” stereotype which prevents you to see that node is what everyone wanted back then. It’s very, very easy to create an http-based analog of any “traditional” service in node and to free yourself from learning all the shady details (which there is a lot) of configuring it and keeping it alive at all levels, were it based on traditional software. Node is extensible configurable networking itself, and http(s) is a quintessence of all text protocols. All that we wanted back then is available now in node at much finer granularity and much less configuration or headache. “They” spin up home-grown services because it is a natural one-page-boilerplate straightforward thing to do in node, not because of ageism or something similar.

I tell you that as someone who fiddled with sendmail.cf’s and other .conf’s way too much long before nodejs became a thing. Now it’s a relief.

the_only_law · on July 15, 2021

> There's developers hammering out boring business logic as fast as possible and there's developers with a deep understanding of machine internals, protocols, and infrastructure.

Purely anecdotal of course, but I follow a number of the latter, and they're either sparsely employed or often employed in a capacity where it doesn't matter There was a comment in this thread where one person had such an idea, and it was rejected for what were essentially business reasons.

user5994461 · on July 14, 2021

SMTP won't work for the customers.

Developers won't be able to use the existing email systems of the company, too critical and managed by another team. They will never be able to reconfigure it and get API access to read emails. Note that it may or may not be reliable at all (depends on the company and the IT who manages it).

Developers won't be able to setup new email servers for that use case. Security will never open the firewall for email ports. If they do, the servers will be hammered by vulnerability scanners and spam as soon as it's running. Note that large companies like banks run port scanners and they will detect your rogue email servers and shut it down (speaking from experience).

vidarh · on July 14, 2021

Nothing preventing offering delivery on alternative ports for people with incompetent security teams that thing port numbers is sufficient to determine if it's a threat.

As for "being hammered", rejection of invalid recipients before even getting to the DATA verb is cheap.

Having actually run both an e-mail service and SMTP used as messaging middleware, I have dealt with these issues.

user5994461 · on July 14, 2021

The security team is not incompetent. Large companies do not permit developers to spin up their own email systems without audit and regulatory retention. The port number is sufficient to determine that the request should be rejected.

You could work around it but should you? You're exposing the company to fines and risking your job.

Better think of another way to integrate with the vendor, or find another vendor.

P.S. SMTP is easy to identify on ANY port, it's replying a distinctive line of text when TCP connection is opened.

vidarh · on July 14, 2021

> do not permit developers to spin up their own email systems without audit and regulatory retention

If they freak out over an SMTP server but don't freak out over a web server, then the are indeed absolutely utterly incompetent fools that should never work in this space.

In both cases code written by the company developers will eventually process untrusted textual input, and you need to deal with that with the same level of caution, and the protocol does nothing to change that.

> You could work around it but should you? You're exposing the company to fines and commiting a fireable offense. Better find another product that's easier to deploy.

I would not work around it - I would make the case that there's no difference in exposing a carefully chosen SMTP server than exposing a web server, and if the security team fail to understand that, I'd resign, because it'd be a massive red flag, and I've been successful enough to be in a position to not need to work for companies like that.

For that matter, in 25 years in this business I've yet to run into your hypothetical scenario, including at large companies, so I'm not at all convinced it'd be a genuine problem. Yes, I've been at companies where I'd need to provide a justification for getting a port opened. But never once had an issue getting it approved - including SMTP.

> P.S. SMTP is trivially identifiable on ANY port, it's giving a line of text when the TCP connection is opened.

I was responding to "Security will never open the firewall for email ports.". Point being that if they care about the specific port numbers, it doesn't matter.

[And I'll again point out I've actually run infrastructure like this].

user5994461 · on July 14, 2021

Never worked in a bank? Never worked in defense?

I'm speaking from real experience too. It takes a while to open firewall in some environments, if you ever can.

One bank was the worst. There was a super stringent process to expose things externally. Opening the firewall port was just the beginning and that'd take 2-4 weeks if all goes well.

You'd struggle like hell to expose a SMTP server though because it would be immediately be rejected and flagged based on the port. Banks have to store, monitor and ensure the origin of all emails, they don't allow shadow email servers. And it's plain text so more reasons to ban (also a problem with HTTP, you should do HTTPS if anything).

Defense was simpler, mainly because there was no external connectivity in many cases. You don't need to worry about how to open a firewall when there's none :D

thakoppno · on July 14, 2021

You have this Nodejs developer’s upvote.

At this point in my career (10 years in the game), let me simply defend node as the tool that got me here. Using it then to bootstrap my career was just as practical as using SMTP as you describe now.

tylerscott · on July 14, 2021

I absolutely love your perspective. I feel the same way. s/Ruby+Rails/node for my situation. I believe there needs to be more respect paid to "bad" technologies. The measuring stick should include things outside pure benchmarks. Low barrier to entry technologies provide broad access and real life changes to folks that are able to pick them up and get hacking.

paulddraper · on July 14, 2021

> SMTP

But... Why?

The HTTP protocol is so much easier to manage, load balance, use, etc.

berkes · on July 14, 2021

The article is almost entirely the answer to your question.

> there are risks when you go down.

Solved by SMTP on protocol level. With HTTP, must be solved on both client and servers application level.

> webhooks are ephemeral. They are too easy to mishandle or lose.

SMTP has this baked into its heart. Loosing messages is possible, certainly, but rather hard to do. With HTTP, it's really simple.

> In the lost art of long-polling, the client makes a standard HTTP request. If there is nothing new for the server to deliver to the client, the server holds the request open until there is new information to deliver.

SMTP is push, not polling. So all those issues are solved for you.

teddyh · on July 14, 2021

> SMTP is push, not polling.

Yes. And if you want to poll, POP is polling, and IMAP has both polling and immediate notifications.

salamander014 · on July 14, 2021

Yes, but SMTP is a protocol wherein a system opens a connection to another system and says, hey here's a message.

And if that doesn't succeed for some reason, it reliably queues and retries.

That's a push.

teddyh · on July 14, 2021

I was not disagreeing, merely providing additional information. I have edited to clarify.

fny · on July 14, 2021

So how are people supposed to consume this? With an SMTP client?

I think the bigger issue is that consumption isn't particularly friendly. Also, you still haven't solved the versioning issues.

jval43 · on July 14, 2021

There are better options than SMTP. Basically any message-oriented middleware / message queuing service can provide this. It's great for both sides, maintenance/outages can happen independently, as long as the queue stays online and has space everything is fine.

johannes1234321 · on July 14, 2021

E-Mail isn't trustworthy. You may get a confirmation that an initial SMTP server accepted a mail, but that's it. There's also no good way to detect that an endpoint (receiver address) is gone for good to stop sending messages.

You will probably point me to SMTP success messages, but a removed mailbox might only be known by a backend server.

Also mail infrastructure will potentially include heavy spam filters etc. making it quite inconvenient. Not even mentioning security aspects with limited availability of transport layer encryption with proper signatures.

vidarh · on July 14, 2021

What you're saying is true of public e-mail infrastructure, but that's besides the point. As a queuing solution internally in a system you can make it as resilient as you like with ease because there's a huge ecosystem of resilient software you can use for it.

Same goes for security - your objection is true for public e-mail delivery without additional requirements on the servers or clients, but that is not relevant for a private infrastructure.

johannes1234321 · on July 14, 2021

In a private environment you have tons of options. The post however refers to notification between independent entities on the public network.

vidarh · on July 14, 2021

Running over the public internet does not mean you rely on unknown third party mail servers. If I address a message to foo@apiendpoint.mycustomer.com, only the servers configured to handle mail for apiendpoint.mycustomer.com and my sending server is involved in the exchange. And that is if you trust MX records for this exchange rather than have the customer input the address of the receiving SMTP server directly.

go_prodev · on July 14, 2021

I think that would be a great solution for these types of scenarios.

In an enterprise setting it becomes more complex if a 365 subscription is required, or active directory authentication is needed to receive emails. Does someone need to monitor the inbox to confirm it's working etc.

But after you mentioned it, I do wish that this was an alternative to webhooks that more service providers offered.

teh_klev · on July 14, 2021

We used to do this for domain name registrations and it worked fairly well for years. However once you've been added to a spam blacklist it quickly breaks down, especially for time critical operations such as domain name renewals when you're scrabbling around trying to appease the Spamhaus gods.

oalae5niMiel7qu · on July 15, 2021

SMTP doesn't reliably deliver messages, implementations of it do. A webshit could easily create an SMTP server (with the help of a library written by someone with actual programming skills) that silently drops messages when any error occurs instead of implementing all that robustness.

daniellarusso · on July 13, 2021

The very first startup I worked at used this for a sweepstakes leadgen form to send to MySQL via a Perl script running from cron.

alephu5 · on July 14, 2021

Another option would be to publish an AMQP endpoint, I'm not sure what the security implications of this are though.

NicoJuicy · on July 14, 2021

And far too slow for a lot of use-cases

vidarh · on July 14, 2021

There's nothing about SMTP that makes it slow. There's lots about public e-mail infrastructure that sometimes makes it slow.

andyxor · on July 14, 2021

going down this non-traditional path you might also consider using XMPP and ejabberd for machine-to-machine messaging

aidenn0 · on July 13, 2021

SMTP no longer reliably delivers messages. Try setting up an MTA on a Hetzner VPS and see how many messages get through

vidarh · on July 13, 2021

That is only relevant if you require delivery to arbitrary endpoints rather than to endpoints explicitly set up to process your messages.

mrtesthah · on July 13, 2021

That's not an applicable criticism for SMTP running on a private network and/or dedicated set of "mail" submitting servers, as in the specific model outlined in the grandparent comment.

XorNot · on July 14, 2021

This is a hill I find myself frequently fighting (and losing on): webhooks are terrible to maintain, because they start from the premise "this never breaks" and thats about where development in an organization stops.

The only event API I ever want is notifications there's new data, and then an interface by which I can query all new data which has arrived by some sort of index marker - because this is fundamentally reliable. It means whatever happens to my system, I can reliably recover missed events, skipped events, or rebuild from previous events.

And this is in fact exactly how something like Kafka actually works! Complete with first-class support for compacting queues to produce valid "summarized" starting points.

Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather then what happens way too often which is just "oh yeah, we'll develop that when it breaks".

sroussey · on July 14, 2021

I agree. A simple ping with the latest ID (which is option for you to use to get events from last ID to newest ID). The go get the events, which is a likely reusing code. Polling is crap.

Extra points for being able to set something like 1s between pings (now you see why I like the option ID for a range).

delusional · on July 14, 2021

> Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather then what happens way too often which is just "oh yeah, we'll develop that when it breaks".

I think this is a quite interesting and important point. When we talk about "doing the simple thing first" too often we end building something that is technically simple but fickle. The trick to making the simple thing reliable is to figure out which part is the slow-path (or failure mode), and then only building that. Unfortunately, it often means out result ends up technically "boring" since all the interesting optimizations are what we cut out, but I think that's worth it if the end result is a more useful product.

It's something I've been working with and thinking about for a while. I think it applies to a way broader scope than this discussion.

rattray · on July 14, 2021

(I worked on the same team as bkrausz, elsewhere on this thread, albeit not concurrently).

Yes, this is pretty much the right thing to do. It can be a bit more work for the API consumer, partly because they need to track state of their last-read ID, and there's more moving parts.

If you're building a webhhook+events system like Stripe's, you might consider adding an option for a mostly-empty webhook body, which can speed things up in this use-case, but still allows "the easy way" of just processing the event from within the webhook body.

(For readers thinking of implementing this, note that "query for new data" means hitting a dedicated /events api, not individual tables, which might have unpleasant load/performance consequences).

danudey · on July 13, 2021

My company has recently switched to Microsoft Teams, where unsupported integrations happen via webhooks. For example, if we wanted to be able to trigger builds in Jenkins or Gitlab, or acknowledge alerts via AlertManager, we'd have to set them up as webhooks to the appropriate service.

The problem is that all of those services are internal to our network, and aren't accessible from the outside world. We cannot set up a webhook to Jenkins because Jenkins does not have a publicly accessible URL. We cannot set up a webhook to Gitlab, or to Prometheus, or to Sentry, or anything else, because those are all internal services.

The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.

Alternately, we have that new, public-facing server buffer those requests and have other services poll them, somehow, so that it cannot connect in, but now we're getting into the same situation as described in the article.

If there were an API, I could easily create a small daemon that would watch for events and dispatch them accordingly, and then respond to them as needed; instead, my only option is to build some kind of Frankenstein - or to give up entirely, which is the more reasonable solution.

Then again, this is Microsoft Teams, where creating an application requires an Azure account and jumping through a ton of hoops, so they're no stranger to stupid ideas that no one wants to deal with.

pjgalbraith · on July 13, 2021

If you are using Teams and use Azure AD then something like Azure AD Application Proxy might be a good option https://docs.microsoft.com/en-us/azure/active-directory/app-...

carlosf · on July 14, 2021

+1

My company's internal apps use a mix of VPNs and IP fenced load balancers. We are migrating to app proxy.

No inbound connections + access based on Azure AD identity with conditional access (restrict apps to Intune enabled corporate devices) and MFA is an absolute killer.

My only complain is that connectors are not very DevOps friendly. Cloudflare Tunnel is much better in this area.

graton · on July 13, 2021

You have the same sort of issue that I do.

You might look into Cloudflare Tunnel (formerly Argo). It is free and allows you to poke a hole in your firewall to a specific service. If that meets your security requirements.

https://www.cloudflare.com/products/tunnel/

jffry · on July 13, 2021

I don't believe Cloudflare Tunnel is free, the free tier pricing page [1] lists Argo Smart Routing at "Starting at $5 per month" ("Argo includes: Smart Routing, Tunnel, and Tiered Caching")

[1] https://www.cloudflare.com/plans/

aidenn0 · on July 14, 2021

Argo Smart Routing is not free, but the tunnel is. The tunnel used to be only available under the Argo umbrella, but they changed that at some point.

graton · on July 13, 2021

Well they said free in their blog:

https://blog.cloudflare.com/tunnel-for-everyone/

Also in my personal testing I didn't need to pay.

bastawhiz · on July 13, 2021

> The only option there would be to create a new, public-facing server

This is a problem with receiving any inbound data from a third party. At least with HTTP, it's pretty trivial to set up a robust reverse proxy with nginx.

oscargrouch · on July 13, 2021

I'm finishing a browser based application platform where the applications installed expose a RPC api, so in the end all applications can call others in the same local(or remote) node/s.

The beauty of this is that you also can compose with other nodes and for a distributed service by calling the local service as a proxy and routing the requests to the other nodes of the same api.

It took more time than i've predicted because its also expected to deliver UI and most of the 'HTML5' api to native applications (instead of Javascript), which is a massive platform by now (and the #1 reason why newcomers to browser technology cant compete, giving the feature creep tax imposed to them).

The idea is also to distribute over a DHT so you can just serve your application over torrent without needing to register anything..

The only way to get there is by empowering users and developers and taking some of the control from the cloud platform giants.

In my point of view the only way to break the browser monopoly now is to create a new path forward, a branch.. its not the time to follow the rules, its time to break them or else the future doesn't look so bright in my opinion..

BeefWellington · on July 14, 2021

> The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.

That's not the only solution -- you could also develop a bot that will do those specific things.

In the days of yore I know of at least three companies that were using IRC bots to similar effect long before webhooks ever existed.

Because of that prior experience, this is how I currently manage a similar set of problems, albeit not on Teams in my current role.

ec109685 · on July 14, 2021

Really good point that corporate firewalls can trip you up. With slack it was so much easier to call into their events API than receive an outgoing webhook for precisely this reason.

The downside was that the event api required a huge amount of scope, so if you weren’t careful and were compromised, someone could use that token to scrape all messages in the system.

Slack recently added socket mode for precisely this reason: https://api.slack.com/apis/connections/socket

iamtheworstdev · on July 13, 2021

check out https://smee.io/

orf · on July 13, 2021

A small Lambda (or your cloud equivalent) is perfect for this

tabbott · on July 13, 2021

Zulip's API is built on roughly this design pattern:

* https://zulip.com/api/real-time-events

* https://zulip.com/api/register-queue

* https://zulip.readthedocs.io/en/latest/subsystems/events-sys...

We use this same long-polling based /events API interface for all official clients (web, mobile, terminal), our interactive bots ecosystem (https://zulip.com/api/running-bots), and many integrations (E.g. bridges with IRC/Matrix/etc.).

We also offer webhooks, because some platforms like Heroku or AWS Lambda make it much easier to accept incoming HTTP requests than do longpolling, but the events system has always felt like a nicer programming model.

(Zulip's events system was inspired by separate ~2012 conversations I had with the Meteor and Quora founders about the best way to do live updates in a web application).

Arathorn · on July 14, 2021

Matrix has the same, except we call it /sync these days rather than /events, and it long-polls :)

paxys · on July 13, 2021

There are lots of reasons to want to immediately respond to an external event besides building an eventually consistent data syncing system. Polling an API endpoint works fine for the latter case, but not much else.

A good platform should offer both of these and more (for example Slack does webhooks, REST endpoint, websocket-based streaming and bulk exports), and let the client pick what they want based on their use case.

benlivengood · on July 13, 2021

Long-polling is the way to immediately retrieve events. It's more efficient and lower latency than waiting for a sender to initiate a TCP and TLS handshake.

andrewstuart2 · on July 13, 2021

A persistent connection has a cost. Your statement may be true in some circumstances but definitely not all. Namely, for infrequent events it is much more efficient to be notified than to be asking nonstop. Sure, the latency is lowest if the connection is already established, but for efficiency the answer is not cut and dry but is rather a tradeoff decision based on the expected patterns.

eptcyka · on July 13, 2021

Also, there's the case of ISP's just dropping idle TCP connections. It can also take a while to determine that a TCP connection is broken.

wruza · on July 13, 2021

Long-polling is usually configured to reset at the both sides after a timeout preferred by the client-side (/events?t=30), long before any network effects kick in, e.g. 10-30 seconds. A client then simply spams requests in a loop, backing off only at http errors. If you have some crazy firewall in between, just set “t” appropriately.

runeks · on July 13, 2021

What’s the issue with that? This will be discovered as soon as the endpoint tries to send an event, right? At which point the client will see that the connection has been closed, reconnect, and receive the event.

kelnos · on July 13, 2021

No, the server will try to send an event, and the server will notice the connection has dropped. The client will still have no idea until some sort of timeout is reached, as the client will usually not be sending any data over the connection, as the connection's sole purpose is for the server to send events to the client.

A way to fix this is to use an application-level keepalive (TCP keepalives are generally useless), but then that increases the load on the server and adds a scaling burden.

Meanwhile, unless the event stream is stateful (more overhead!), the client has lost all events since the connection has dropped, and the client can't even be sure when the connection actually dropped.

With webhooks, assuming the callback sending service has a generous retry policy, and the customer's receiving service does not return 200 unless the webhook has been completely processed, or persisted to storage, you won't lose events.

I've been at Twilio for the past 10 years. We recently started offering an event stream service (that customers had been requesting for some time), but it's complicated to get right (on both the server and client side) and difficult to scale, and, frankly, webhooks have worked fine for most customers for a very long time.

rad_gruchalski · on July 13, 2021

> No, the server will try to send an event, and the server will notice the connection has dropped. The client will still have no idea until some sort of timeout is reached, as the client will usually not be sending any data over the connection, as the connection's sole purpose is for the server to send events to the client.

Exactly why mqtt has the ping packet for the client.

fragmede · on July 13, 2021

yeah. a(n improperly configured) firewall is going to start dropping packets if it thinks a connection is idle for too long, so the system never sees an RST and think the connection’s been terminated.

jandrese · on July 13, 2021

Why for the love of God does the firewall not send the RST when it drops the connection?

lmz · on July 14, 2021

Because what usually happens is the connection is just forgotten from the NAT table. Both sides still see it as connected but the middle box will no longer forward any packets.

jandrese · on July 14, 2021

It doesn’t just “fall off” the NAT table. Some process in the firewall chose that entry in the NAT table to drop at that moment. It could use the entries from that NAT table to construct RST packets to both sides of the connection. This should be easy and obvious.

dnautics · on July 13, 2021

Exactly. Perhaps an event happens once or twice-ish a day per customer, and never on the weekends.

nmcfarl · on July 14, 2021

I've got an API where an event happens once a month (+/- 2 weeks) for a large percentage of our customers.

foxhill · on July 14, 2021

> A persistent connection has a cost.

are you sure? specifically, are you sure a persistent connection has _more_ of a cost than repeatedly re-establishing a connection & TLS, etc.?

in terms of energy costs alone, DNS resolution, establishing routes, generating cryptographic session keys, etc. it's definitely not as cheap.

in terms of today's computation power, the "memory" costs of maintaining a connection are minuscule, and the performance "penalties" are negligible.

example: lets say you have 50k event subscribers. if nothing happens, then, aside from a few TCP keepalives (which are not strictly speaking required, and can happen very infrequently), no traffic moves. if instead you have polling once every second, then that's nearly ~13-14 connections a second, each one with at least 4 round trips of traffic. that's a measurable amount of load.

mikepurvis · on July 13, 2021

One nice benefit of long polling is the built in catch-up-after-a-break functionality: When the client initiates the poll, it tells the server the state it knows about (timestamp, sequence number, hash, whatever), and the server either replies right away if it's different, or waits and replies once it's different.

With webhooks, as in the article, you only get state changes; you need some separate mechanism to achieve (or recover) the initial state.

tshaddox · on July 13, 2021

That's true, although it's also true of any `/events` endpoint that doesn't go back to the beginning of time. Stripe's endpoint only goes back 30 days, so you still need to solve for the initial state unless you have launch all of your desired functionality at the very beginning of your Stripe account!

mikepurvis · on July 13, 2021

Hopefully if it's a system like payments where you not only need to know state, you also need to know the time and nature of all transitions, there's a way to query all of that information.

I'm thinking of simpler situations like my source host's CI spinner that seems to get stuck all the time due to missing the ping back from Jenkins about build statuses. In that case it really would be fine to always just say "I think the state is X, please answer me now or in the future whenever the state is other than X." I don't care about anything other than an up to date sync.

hakunin · on July 13, 2021

Someone has to maintain an always-running listener for `/events`. If a server does that, and triggers client calls, we call that webhooks. If a client does that, and triggers internal functions, it's what the op describes. I think that for APIs, `/events` should indeed be the fundamental feature, and "webhooks" should be a nice-to-have service on top of `/events`, for those who don't want to maintain a local subscriber.

sk5t · on July 13, 2021

If the webhook events are coming at some sort of a brisk pace, the sender well may be able to reuse an already-open connection. And if they're rather infrequent, is the efficiency or latency likely to be a significant concern?

lmm · on July 14, 2021

Yes, it is - latency-sensitive but infrequent events are an extremely common use case.

sk5t · on July 14, 2021

In the general sense, yes, but your assertion rings false in my opinion when the situation presents only a choice between webhook or long polling.

lmm · on July 14, 2021

I don't understand your statement? Rare but latency-sensitive events are a very common use case for webhooks or long polling.

IshKebab · on July 13, 2021

If you're using HTTP use websockets or server-sent events, not long polling. Long polling is obsolete.

azinman2 · on July 13, 2021

My understanding is that long polling is the thing that will reliably work at scale. Perhaps this changed in the past few years, but I’ve asked various companies like PubNub why they only use long polling and the answer was that there are too many incompatibilities out there in the wild for anything but that.

IshKebab · on July 14, 2021

Server-Sent Events are very reliable. What you might be thinking of is the fact that you probably shouldn't rely just on server push. But that doesn't mean you should use long polling.

You should use normal short polling and Server-Sent Events.

Also it makes no sense to say long polling is more reliable than SSE, because SSE is essentially a non-hacky implementation of long polling.

gremlinsinc · on July 13, 2021

Websockets can cause issues especially if you're not closing sockets properly, or have too much activity on a small server etc... Livewire for instance accounts for this by just polling every 2 seconds for changes, this is much more performant than keeping 10000 sockets open if people leave open the page/app but don't actually do anything...

Straight long-polling should be avoided, but intermittent polling is a good solution for performance when you don't want to use all your socket bandwidth.

mywittyname · on July 13, 2021

> To mitigate both of these issues, many developers end up buffering webhooks onto a message bus system like Kafka, which feels like a cumbersome compromise.

Kafka solves exactly the issue that the author is complaining about. This is a safeguard to ensure that data isn't dropped in the event of an issue, and provides mechanisms to replay events.

The tradeoff between pushing and polling have been argued since forever.

In other news, mechanics who work with bolts often do so with ratchets. This is a cumbersome compromise, just give me Torx fasteners!

jerf · on July 13, 2021

It would if the source was pushing into the Kafka stream directly. It doesn't solve the problem of going out of sync if my code to push to the Kafka stream is entirely down and I miss POSTs.

(And, of course, I don't want Kafka. I want Google PubSub. No, wait, I mean SQS. No, wait, I mean I want zeroMQ. No, I mean....)

nine_k · on July 13, 2021

The question is: who maintains the queue of events, and pays for it?

Certainly the event producer is in a better position to maintain a queue without missing events, but it also means they need to buffer more data in their queue system to accommodate for your receiver's downtime

BasieP · on July 13, 2021

this!

l_t · on July 13, 2021

Not disagreeing with your point, and I'm sure you already know this, I just wanted to point out (for the benefit of people that don't have other options) that it is possible to build "webhooks" in such a way that you're confident nothing is dropped and nothing goes (permanently) out of sync. (At least, AFAIK -- correct me if this sounds wrong!)

Conceptually, the important thing is each stage waits to "ACK" the message until it's durably persisted. And when the message is sent to the next stage, the previous stage _waits for an ACK_ before assuming the handoff was successful.

In the case that your application code is down, the other party should detect that ("Oh, my webhook request returned a 502") and handle it appropriately -- e.g. by pausing their webhook queue and retrying the message until it succeeds, or putting it on a dead-letter queue, etc. Your app will be "out of sync" until it comes back online and the retries succeed, but it will eventually end up "in sync."

Of course, the issue with this approach is most webhook providers... don't do that (IME). It seems like webhooks are often viewed as a "best-effort" thing, where they send the HTTP request and if it doesn't work, then whatever. I'd be inclined to agree that kind of "throw it over the fence" webhook is not great and risks permanent desync. But there are situations where an async messaging flow is the right decision and believe it or not, it can work! :)

atombender · on July 13, 2021

This misses the problem explained in the article, which is that there are scenarios where events are "acked" but things still go wrong because of bugs.

For example, you rolled out code on the receiver side that did the wrong thing with each message. Now there's no way to replay the old webhooks events in order to reinstate the right behaviour; there's no way to ask the producer to send them again.

The only way around this is to store a record of every received message on the receiver side, too, which the article author thinks is an unnecessary burden compared to polling.

Personally, I think push is an antipattern in situations where data needs to be kept in sync. The state about where the consumer is in the stream should be kept at the consumer side precisely so it can go back and forth.

curryst · on July 14, 2021

If you want to be 100% sure that you get all the webhooks, the sender could implement an incrementing "webhook ID". If the receiver knows the last webhook ID was 53 and the sender sends one for 55, you can tell one has been dropped. There are some other concerns around that like if 54 has been sent but they arrived out of order, or if they arrive almost simultaneously. Nothing that isn't solvable afaict though.

Of course, then you need a way for the receiver to retrigger or view the webhook if one gets missed, which starts to look like you have to have a polling endpoint anyways, though.

BasieP · on July 13, 2021

We have a system that pushes loads of messages (as in thousands a minute) and some consumer insists on using there http backend to push the messages to. There system is down every once in a while for quite some time. We're using an async queueing solution, but you can't keep those messages forever. We sometimes have milions of messages for them in there queue's, which take up space... If all of our consumers had those problems we would have to buy loads of storage.. We're simply dropping messages older than x, and have an endpoint that they can call to retreive the 'latest state of things'. This way when they come back from a failure, they simply get the latest state, and then continue with updates from our end.. It's far from perfect, but it works really well.

I know the goal for most systems is just to be 'up to date' Not to get the entire history. So in most cases you don't need to stash all the messages, you just need to be able to retreive the latest state of stuff...

ThrowawayR2 · on July 13, 2021

> "Of course, the issue with this approach is most webhook providers... don't do that "

Embedded systems don't do that for webhooks because they can't (very little RAM or non-volatile storage) but customers clamor for webhooks anyway because it's what their web developers know how to use. So inevitably they're going to lose data but they're only getting what they asked for.

mywittyname · on July 13, 2021

As long as you guarantee delivery to your message queue before acknowledging receipt, you should be golden.

Also, swapping out one messaging system for another is trivial. Pick the one best suited to the environment you're working in, and if that environment changes, changing messaging queues is going to be one the easiest transitions you'll make.

toomuchtodo · on July 13, 2021

You meant Apache Pulsar! :)

jerf · on July 13, 2021

I'm feeling cosmopolitan today. I mean all the things!

danudey · on July 13, 2021

Having helped manage a Kafka cluster, I do not want to run a Kafka cluster just so that Microsoft Teams can webhook me events now and then.

Floegipoky · on July 13, 2021

Yeah I was scratching my head reading this article; they're bending so far backwards to avoid the obvious solution that I thought they were gearing up to pitch some competing tech.

> If the sender's queue starts to experience back-pressure, webhook events will be delayed, and it may be very difficult for you to know that this slippage is occurring

I've never before seen anyone try to argue that properly dealing with backpressure is a bad thing. The author's proposed model makes this situation even worse. With kafka, consumers can continue processing the event stream and you can continue to serve reads from your primary datastore. With the author's model the event stream lives in your primary datastore, so if that starts to lock up the blast radius is much larger.

closeparen · on July 13, 2021

Are you going to expose your Kafka brokers directly to your integration partners? Are they going to use the Kafka client library and wire protocol to send you data? That’s the thing about webhooks, HTTP is universal and if you’re comfortable exposing anything externally, it’s going to be a web service.

mywittyname · on July 13, 2021

I would not expose kafka directly. I would implement this as:

HTTP Endpoint -> Push to message queue (kafka, SQS, etc) -> Acknowledge receipt

That's a pretty straight-forward design that's widely used, robust, and easy to put together. I've probably done that same workflow 100s of times without issue.

As long as you guarantee the message was pushed to the queue before acknowledging, that will be fabulously reliable. You need to make contingencies for duplicate messages, but that's not usually difficult.

mrkurt · on July 13, 2021

We expose the NATs protocol to our users. Exposing non-http protocols is fun, sometimes.

aunty_helen · on July 13, 2021

It's a common writing style as of late, set down a premise and solve that premise decisively.

Now, if that premise isn't based in reality, or if it's already been solved some other way, discredit it without giving it too much air time.

A one liner about kafka being cumbersome and then building your own solution, warts and all, doesn't need to exist in the same thought if you've made the reader mentally disregard it as a possible solution.

alexbouchard · on July 13, 2021

Totally, things can get very reliable if you start processing webhooks asynchronously. Personally I've found it pretty cumbersome and complicated to build the necessary infrastructure in the past. I've been building https://hookdeck.com as a simpler alternative specifically to ingesting incoming webhooks.

mbrevda1 · on July 13, 2021

Are events and webhooks mutually exclusive? How about a combination of both: events for consuming at leisure, webhooks for notification of new events. This allows instant notification of new events but allows for the benefits outlined in the article.

sb8244 · on July 13, 2021

What about supporting fast lookup of the event endpoint, so it can be queried more frequently?

I think that a combo of webhooks / events is nice, but "what scope do we cut?" is an important question. Unfortunately, it feels like the events part is cut, when I'd argue that events is significantly more important.

Webhooks are flashier from a PM perspective because they are perceived as more real-time, but polling is just as good in practice.

Polling is also completely in your control, you will get an event within X seconds of it going live. That isn't true for webhooks, where a vendor may have delays on their outbound pipeline.

jacobr1 · on July 13, 2021

The article advocate for long-polling

sb8244 · on July 13, 2021

Yea, you're right. I am reading the advocacy as "if you need real-time, then support long-polling."

I see the value in this, but I actually disagree with the article in terms of that being the best solution. Long-polling is significantly different than polling with a cursor offset and returning data, so you wouldn't shoe-horn that into an existing endpoint.

pak9rabid · on July 14, 2021

Couldn't keeping a request open indefinitely open the system up to the potential of DoS attacks though? Correct me if I'm wrong, but isn't it kind of expensive to keep HTTP requests open for an indeterminate amount of time, especially if the system in question is servicing many of these requests concurrently?

coldacid · on July 13, 2021

I think that's what the author was getting at, after reading through the whole article. The idea isn't to get rid of webhooks, but provide an endpoint that can be used when webhooks won't necessarily work.

snarkypixel · on July 13, 2021

Very similar to how I built my previous application.

1) /events for the source of truth (I.e. cursor-based logs) 2) websockets for "nice to have" real-time updates as a way to hint the clients to refetch what's new

saurik · on July 13, 2021

Yeah... I'd go so far as to argue that this is the only architecture that should even ever be considered, as only having one half of the solution is clearly wrong.

alexbouchard · on July 13, 2021

This is the way to go and I'd love to see more API's with robust events endpoint for polling & reconciliation. Deletes are especially hard to reconcile with many APIs since they aren't queryable and you need to instance check if every ID still exist. Shopify I'm looking at you.

shvedsky · on July 13, 2021

Yes to the combination of both. I worked on architecture and was responsible for large-scale systems at Google. Reliable giant-scale systems do both event subscription and polling, often at the same time, with idempotency guarantees.

j_san · on July 13, 2021

Sorry if I'm daft, could you/someone explain why one would want to use both at the same time for the same system?

One thing that makes sense: if you go down use polling so you can work at your own pace. But this isn't really at the same time. When/why does it make sense to do both simultaneously?

shvedsky · on July 13, 2021

There is an inherent speed / reliability tradeoff that is extremely difficult to solve inside one message bus. When you get to truly large systems with a lot of nines of reliability, it starts to make sense to use two systems:

1. Fast system that delivers messages very quickly but is not always partition-tolerant or available 2. Slower, partition tolerant system with high availability but also higher latency (i.e. a database)

The author goes through this in the very first section. Webhook events will eventually start getting lost often enough for the developer to think about a backup mechanism.

Long-polling works if you have a lot of memory on your database frontend. Most shared databases want none of your long-running requests to occupy their memory which is better used for caches.

Even if your message bus has the ability to store and re-deliver events, you might want to limit this ability (by assigning a low TTL). Consider that the consumer microservice enters and recovers from an outage. In the meantime, the producer's events will accummulate in the message service. At the same time, the consumer often doesn't need to consume each individual event but rather some "end state" of some entity or a document. If all lost events were to get re-delivered, the consumers wouldn't be able to handle them, and would enter an outage again. This is where deliberately decreasing the reliability of the message bus and rely on polling would automatically recover the service.

There are other reasons, of course. The author is absolutely correct in their statement, though: whenever a system is implemented using hooks / messages, its developers always end up supplementing it with polling.

kissgyorgy · on July 13, 2021

What's the point of implementing webhooks once you implemented long polling for the /events endpoint?

mbrevda1 · on July 13, 2021

I'd argue against long/persistent polling. Webhooks allows for zero resource usage until a message needs to be delivered.

zimpenfish · on July 14, 2021

> Webhooks allows for zero resource usage until a message needs to be delivered.

Doesn't that only work in the case where the server treats each webhook delivery as ephemeral? If you're keeping a queue to allow reliable / repeatable delivery, that's definitely not "zero resource usage", right?

kissgyorgy · on July 14, 2021

On the sender side, sure. On the receiver side? You have to have a service listening 0-24.

luuio · on July 13, 2021

I don't think the original comment meant long polling (i.e. keeping the connection alive), they meant periodically call the endpoint to check for events.

cwyers · on July 13, 2021

The article advocates for long polling of endpoints.

toomim · on July 13, 2021

There's a much better approach than /events or webhooks: add synchronization directly into HTTP itself.

The underlying problem is that HTTP is a state transfer protocol, not a state synchronization protocol. HTTP knows how to transfer state between client and server once, but doesn't know how to update the client when the state changes.

When you add a /events resource, or a webhooks system, you're trying to bolt state synchronization onto a state transfer protocol, and you get a network-layer mismatch. You end up with the equivalent of HTTP request/response objects inside of existing HTTP request/responses, like you see in /events! You end up sending "DELETE" messages within a GET to an /events resource. This breaks REST.

A much better approach is to just fix HTTP, and teach it how to synchronize! We're doing that in the Braid project (https://braid.org) and I encourage anyone in this space to consider this approach. It ends up being much simpler to implement, more general, and more powerful.

Here's a talk that explains the relationship between synchronization and HTTP in more detail: https://youtu.be/L3eYmVKTmWM?t=235

wruza · on July 13, 2021

You may send POST /events instead. It also breaks “REST”, which is just a sort of obsession rather than a requirement here, but more importantly it wouldn’t break idempotence and proxy caching that GET implies.

Edit: from the network point of view, it’s either call-back or a persistent call-wait/socket, or polling. The exact protocol is irrelevant, because it’s networking limits and efficiency that prevent everyone from having a persistent connection to everyone. A persistent connection can’t be much better than any other persistent connection in that regard, and what happens inside is unrelated story. Or am I missing something?

jonny_eh · on July 13, 2021

> just fix HTTP

Oh yes, changing HTTP is so easy.

toomim · on July 14, 2021

HTTP is actually quite malleable, and adding synchronization is easy.

You can add it to your own website with a few simple headers, and a response that stays open (like SSE) to send multiple updates when state changes: https://datatracker.ietf.org/doc/html/draft-toomim-httpbis-b...

...and you can get these features for free using off-the-shelf polyfill libraries. If you're in Javascript, try braidify: https://www.npmjs.com/package/braidify

derptron · on July 13, 2021

The website seems to be crammed into the left side of my screen unnecessarily.

top_kekeroni_m8 · on July 14, 2021

Centering a div is hard!