Launch HN: Svix (YC W21) – Webhooks as a Service
117 points by tasn on June 16, 2021 | hide | past | favorite | 63 comments
Hey everyone, my name is Tom, and I'm the founder of Svix (https://www.svix.com) - previously known as Diahook. Svix makes it easy for developers to send webhooks from their service using a simple API. Think Twilio or SendGrid but for webhooks.

Webhooks are how servers notify each other of events, so they are a key component of many APIs such as Stripe, Shopify, Slack, Dropbox and Github. They look easy to implement (just a POST request), but they come with a variety of challenges. For example, customer endpoints fail or hang much more often than you would think, so you would need to implement retries. You need to make sure such failures don't clog your send queue or the rest of your system. The webhook system is an additional system separate from your normal web server that needs to be scaled and monitored separately. There are also a variety of security implications such as SSRF, replay attacks, and attackers sending fake webhooks to your customers (so make sure to sign the payload and make it easy to verify!). You also want to avoid overloading your users' endpoints, so you want to automatically rate-limit webhook sending, as well as disabling failing ones, and notifying your users when you do.
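To make the signing and replay-protection points concrete, here is a minimal Python sketch. It is illustrative only, not Svix's exact format: the `msg_id.timestamp.payload` layout and the 5-minute tolerance are assumptions. Signing lets receivers verify the webhook really came from you, and rejecting stale timestamps blocks replay attacks.

```python
import hashlib
import hmac
import time

def sign(secret: bytes, msg_id: str, timestamp: int, payload: bytes) -> str:
    """Sign msg_id.timestamp.payload so receivers can authenticate the sender."""
    signed_content = b".".join([msg_id.encode(), str(timestamp).encode(), payload])
    return hmac.new(secret, signed_content, hashlib.sha256).hexdigest()

def verify(secret: bytes, msg_id: str, timestamp: int, payload: bytes,
           signature: str, tolerance_s: int = 300) -> bool:
    """Reject stale timestamps (replay protection) and compare in constant time."""
    if abs(time.time() - timestamp) > tolerance_s:
        return False
    expected = sign(secret, msg_id, timestamp, payload)
    return hmac.compare_digest(expected, signature)
```

Including the message ID and timestamp in the signed content (not just the payload) is what makes replayed or re-targeted deliveries detectable on the receiving side.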

I encountered these challenges at my previous company. Our users were constantly asking us for webhooks, but we kept deferring building them because we weren't willing to commit the engineering time, resources, and ongoing maintenance required of a webhook delivery system. This was the seed for Svix, but it's only after a friend of mine asked about adding webhooks to her own product that I realized "Oh, there's maybe a business here".

The idea behind Svix is to make it very easy for everyone to send webhooks. Developers make one API call and we take care of deliverability, monitoring, and retries. We also have a pre-built management UI that our customers can offer their users to manage their webhook endpoints, as well as inspect, debug, and replay failures. This is in addition to a variety of tools, libraries, and tutorials to make both sending and consuming webhooks easy.

We have previously done a Show HN (https://news.ycombinator.com/item?id=26399672) and got a lot of great feedback from the community. A lot has changed since then, for example, we now have libraries for Python, JavaScript (TypeScript), Java and Go; a first version of the Ruby and PHP libraries, and a CLI for interacting with the service. We have improved the management UI, made it easy to embed it in an iframe, and improved the onboarding and documentation to make it even easier to get started. And finally, we have scaled the backend to keep up with the growing needs of our customers. We have a lot more planned for the coming months, and we've grown the team so improvements are going to come at an even higher pace.

One of the common questions from our Show HN was: "Don't developers need to handle deliverability and retries to Svix?" Deliverability to user endpoints (servers) is very different to deliverability to Svix. User endpoints fail all the time and for various reasons, and each of them can fail independently. This means developers need a robust and scalable delivery system that can deal with failures on an ongoing basis. With Svix, outages are rare, and are dealt with as incidents, the same way you would with SendGrid, Twilio and other API providers.

Our goal with Svix is to make it easier for developers to add webhooks to their service. Webhooks make APIs that much more useful and enable a lot of automations and integrations which benefit both the products offering them, and the communities around them. Just think of all the cool Slack bots made possible thanks to webhooks. I'd really love to see every service out there offering a great API!

I'd love to hear about your experience building (or using) webhooks systems. What's a must have? Any war stories to share? Got any questions? Suggestions? Please let me know!

Docs: https://docs.svix.com/

Docs for consuming webhooks: https://docs.svix.com/receiving/introduction

API viewer (and OpenAPI specs): https://api.svix.com/docs/




We had a similar issue. We built out a small tool in Go we call WHOMP (WebHOok Management Proxy). Our app code pumps to that and it handles all the rest of the delivery and security parts (eg: malicious hooks). This single go-binary can easily handle 1000+ RPS.

IMHO you're competing with some fairly simple code built on fairly simple queues that can handle loads of traffic at $5/mo after the build (it took us like 24 hours) and it's tightly integrated to our own app - so those retries/errors are visible.

I'm not clear on where the value is here. I drop my home brew then rework systems to your APIs to get back to baseline. Then what? I hit monthly cost parity at 15k messages a month. But at 20k messages it's $10/mo and goes up - it looks like I'd be over $50/mo on the platform quickly. Maybe I just don't see the Killer Feature here.


You already answered your question.

1. You spent hours building a small Go tool; assuming your hour is worth at least $100 (certainly more) and say you spent 5 hours, that's $500.

2. You deployed it on a VM, that's $5 a month.

3. You need to keep it updated and manage it, so let's say you spend 1 hour a month. That's $100 a month.

So right now you spent $500 upfront, and $105 a month to keep it going, and that's assuming 1 hour a month which is crazy low.

You can replace that entire cost with a static Y x 0.001 + whatever upfront cost it takes to convert the system to use Svix, which can be factored into the system cost. Say I have 50,000 users and each user has 1 webhook; I need 50,000 x the average number of messages a month. Just add that on top of the system monthly cost.

It's a lot easier, from a business/accounting point of view, to manage a SaaS than your own homebrew.


So instead you should spend 5 hours reading about some service that might not exist in a year and doing integration work with it?

It's not like you're going to integrate some service into your app in 0 hours.

It's also not like you won't have to spend some time maintaining your integration either.

IMO, a "killer feature" for a business is not /maybe/ saving $100 a month under low usage, and /maybe/ costing you $1k+ under high usage...


Five hours is a gross underestimate of the time it takes to build a robust and scalable webhooks delivery system, and even that is just the delivery system itself. You also need to build the UI for your users, libraries for verifying signatures, etc.

All of this assumes you know what you are doing and can just write the code. It doesn't even take into account the research you need to do to make sure you cover all of the security (and other) considerations that GP seems to have already known about and taken care of.


Right! Vendor driven churn is a costly burden. I'm going to post more about the tool later.


> It's a lot easier, from a business/accounting point of view, to manage a SaaS than your own homebrew.

Hmm I would have to disagree here. This is something that most platform teams could build, deploy and forget about. Setting up an account, getting procurement to pay for it, not being able to see why it’s down etc just seems like a lot more hassle.


For a lot of services like this I think the value prop is not monetary cost but more "this is no longer your responsibility". Since you already have a custom-built solution that is stable & operating efficiently, switching doesn't sound worth it to you. For a team that's looking to add webhook functionality for the first time, not needing to go through the implementation work themselves or need to deal with any maintenance longer-term could sound very appealing. I think it's akin to the rise of interest in user-management-as-a-service products, etc.

I'm not, to be clear, expressing an opinion that the business will be successful with that strategy or that I think it's a good trend. On the contrary, this overall trend of feature-x-as-a-service products depresses me a bit. We've gone from the old days when you had to write pretty much everything yourself (which sucked), to the days when there were a ton of feature-x-as-library choices (which was way better but led to complaints of web programming becoming a chore of just bolting libraries together), to the current trend of feature-x-as-service products.

It's a logical evolution in some ways - similar to libraries over roll-your-own, it reduces your own set of responsibilities. But it also reduces your ability to learn (e.g. from reading source) or grow beyond the provided service (much harder to swap out providers or roll-your-own when the current solution is a third party black box). It also feels depressing in that we've gone from a thriving ecosystem of software based on open source to a much more capitalism-first ecosystem of "this could be a library but then how would I make money off of it". (I realize open source funding & maintainer compensation/sanity is its own set of problems. I just wish we could work on those issues without turning everything into a product.)


Yup. My teammates joke that the best problem to have is “not my problem”.

This might be a simple service, but this is one less thing to worry about


Yes, but there is no word about any compliance with data protection regulations that would cover the webhooks. Also, the company is based in the US, so that raises Privacy Shield issues.

It's not easy to use this service when you need to follow all kinds of regulations, including the GDPR.


All of our servers are in Europe, and we are soon going to tackle getting compliance certifications (as users have been asking for this).


Do you find that unnecessarily adds latency for webhook event ingestion from the US? Seems like that would add a point of failure that makes me even more nervous here -- instead of a quick request to us-east-2, I have to send data across the pond.


Yeah, this is less than optimal, and that's why we are working on adding zones. It just made sense to start with Europe while we are in just one (due to compliance). We also plan on having our API endpoints in many different zones so that API calls are immediate for our customers.


That's what I was thinking. Having ingress/egress processes in specific zones will definitely help here, while keeping your datastores in EU without any issues of data loss at the edge.

For example, ingress for receiving events in a US zone, those are asynchronously pushed to the EU datastore, and then egress for delivering the events are again in the US zone, transparently pulling data from the EU.

Not sure about your architecture, but just spitballing how you could keep data stored in the EU while temporarily "processing" that data in the US to keep latency low where it matters.


Yeah, thanks a lot for the feedback!

We are going to prioritize this task based on your feedback. It's definitely a concern. We don't even need to store the data in the EU if our users don't want/need it. We can even let them choose zones themselves for ingress - as in many (most?) cases, they would only be using one zone themselves.


Doesn't matter; the terms suggest the company is based in Delaware:

This agreement will be governed by the laws of the Delaware, USA. The courts of Delaware have exclusive jurisdiction to settle any dispute arising out of or in connection with this agreement


IANAL, though this is for disputes, not for compliance with international laws, which we can do anyway. The question is: can a US company with servers solely in Europe comply with European legislation? Based on my understanding the answer is yes, though I'll double check with the experts. :)


I'd say that the main complexity comes from exposing the logs/analytics to your end-users. I am hoping Svix will help companies implement Webhooks as well as companies like Stripe do.


The thing is, if a developer whose core task is not to build these services can build it (and not just build it, but build it robustly), then I'm guessing the threshold for people to start replicating your service is not that high. In that case, I would question whether this is a product or a feature. Because the threshold to build something like this is low, your focus would be more on customer acquisition. You might even acquire a lot of customers and tell me you have traction, but I would question what true value it really adds, and then I would question the long-term viability of this company.

On a side note, I have lately started seeing various companies graduating from YC that make me question whether they are a product or a feature. Not sure what the strategy behind them is. Either they end up going on Product Hunt, get decent upvotes, and then kind of disappear, or they are in a space that slowly starts seeing other people replicate it.

In the devtool space, I have seen YC invest in companies that pretty much do the same thing. In my opinion, it is not fair to the operators of a startup if your backer also backs your competitor, as it dilutes your org's value.


Developers can build things; this is what we all do. The question is not whether we can do it, but whether we should.

Developer time and focus are scarce resources, and you want to invest them in moving the needle for your business, not in reinventing the wheel, especially when the wheel is actually harder than it first seems.

I can't attest to the YC investment strategy or the whole "product vs feature" trend you are witnessing. Though I think what you are seeing is more of a trend in the industry towards not reinventing the wheel and using external services for many things (e.g. Auth0) rather than a trend led by YC/VCs.


For startups and smaller orgs whose real goal is to survive, yes, you are right: resources are scarce and they shouldn't invest their time and effort on that front.

But beyond that, if you go to larger orgs and you have engineers who can't write services/client implementations like this, then you have made bad hires. Don't get me wrong; it is my opinion that you need more defensibility in the product.

I do not think it is reinventing the wheel; you are abstracting the wheel for me, because now I have to implement your client. If you take a step back and look at it objectively, your product is a service-of-a-service, which I think is a problematic approach.

On your comparison with Auth0, it is very naive to generalize that argument. You have to look at them in isolation. When Auth0 came into existence, the state of security was bad. In general, multi-factor authentication, machine-to-machine authentication, and implementing OAuth/OAuth2 correctly were really big pains in the industry. I can actually bet you that only a handful of engineers even today can do it right, and the cost of doing it wrong is much higher.


The story behind Svix and how the idea developed following Tom's work on the end-to-end encrypted backend product EteSync and Etebase is interesting, and something I covered with him in an interview last month:

https://console.dev/interviews/svix-tom-hacohen/


Amazing idea. Building a truly robust and reliable webhooks ‘service’ is harder than it seems. We just finished rebuilding ours and I looked for a service like this before starting but couldn’t find any. Sure, maybe some tech companies will prefer to do an in-house thing, but I can definitely see the value.

Our POV from when we built ours, if it helps: delivery must be reliable, we need visibility on every delivery attempt (status code, response body and headers), some control in retry logic, must be able to add headers (in our case we add a signature), must be replayable if needed, every attempt must have a unique ID, and must be able to send callbacks as a webhook back to our API. Good luck and congrats for being accepted at YC.


Thank you very much for the kind words. We do everything you mentioned except adding headers (because we handle the signing for you too), and callbacks, which we may already do; I'm just not sure I fully understand what you mean in particular.


Once again, I'm a big fan of this idea. I wish this service existed 4 years ago when I built my own. Lots of time has been spent there. Probably bad timing, but I actually *just* finished a blog post on how I built my webhook system and posted it to HN [0]. I gave Svix a shout out. :)

Anyways -- I think really nailing API uptime is going to be critical for a service like this, to reduce the chance of having to do queuing on the customer's end to replay failed requests to Svix. That's one of my big concerns at first glance. Some of the webhooks I send are table-stakes for many customers.

[0]: https://news.ycombinator.com/item?id=27528212


Hehe, perfect timing with your post, and thanks for the shout-out! :)

Yeah, API uptime is crucial (and we spend a lot of time on just that). Though I think it's not just for us, but rather for almost all of the external APIs and services out there!


I think the issue here is that skilled developers can build this themselves and the product doesn't appeal to non-technical folks, so it's kind of in no man's land.


Right, but what about the legions of mediocre developers? Or just developers with other stuff to do?


Skilled developers can build all the services on AWS, yet AWS is worth a trillion (or more...) dollars.


Love this idea. Had this on my list of side projects to build for a while - definitely see the use for this. It's one less thing for a dev team to maintain in-house.

What does latency look like on delivering webhooks? From the time your service is hit, to the time when the webhook is sent?


Median is 55ms, though it can probably be improved as we haven't spent any time optimizing this...


I thought about it a bit more, and I have a few ideas on how to reduce it by a lot.


Great idea. Multiple teams of mine have experienced this exact pain point with webhook retries, monitoring, caching, idempotent commands, etc. If you consider adding Elixir and/or Rust to your library roadmap, please let me know.


Thanks! I personally love Rust, though the only reason we haven't done the libraries yet is that I feel not that many people use Rust in production web services just yet. I may be wrong though...


This is awesome, I'll probably try it out soon.

My primary use case is that I don't want my end user(s) to know the IP address of my core application.

But, the option to pay someone other than AWS/GoogleCloud and then have to write my own lambdas is also a plus.

Something similar, but for "fetch metadata on this URL" (tags, description, title, etc), version 2 of that being support for when the end target is an SPA, or also a "take a screenshot of this URL" would also be nice.


This is serendipity, but I just published a Medium article on how we handle metadata for an SPA. Take a read at https://link.medium.com/FecZ64OJ9gb

Happy to discuss more if you want.


We have an internet gateway in front of our push infrastructure; this way we always end up with the same IP addresses even if the pool of workers is dynamic.


This is a very good idea. The UI for end customers to customize the payload is innovative. Things like rate limiting, error handling, exponential backoff, short circuiting, batching, and replaying requests are all harder than one would think (not sure if Svix supports all of those features yet but surely they will).


Thank you very much! By short circuiting you mean stopping sending to obviously failing endpoints?


This sounds really interesting. I'd be interested in having a chat around a partnership with Budibase: https://github.com/Budibase/budibase


Oh, very interesting, I'd love to chat! My email is tom @ the domain, what's the best way to reach you?


Svix the name.


Congrats on the launch! The new site looks good. Looking forward to seeing where this goes.


Thank you!


Nice going, this looks great.


What’s the argument for using Svix over Google Pub/Sub?


Pub/Sub is oriented towards sending messages to endpoints you control rather than a full solution to send webhooks to arbitrary users.


Ah fair enough, that’s a significant differentiator.


You've basically built a message queue as a service using HTTP. I personally don't see the innovation here or the kind of moat you expect to build, but I wish you the best of luck in your endeavor.


This comment reminds me of the classic "Dropbox is basically a hosted FTP service" comment :).


>>> basically

The devil is in the details


Not really; I don't see how this would be more difficult than any other kind of queue most programs already have, email notification retries especially. Unlike email, though, this is HTTP and has better status codes. Also, you'd technically still have to implement a queue to send to Svix, no? Otherwise, if they go down you lose critical messages.


Webhooks are easy-ish to send and retry. Building the UX to help users successfully use webhooks is not simple. You need debugging tools, retry handling, notifications when they break (but not the first time they break, when they break repeatedly), etc.

You're conflating low level plumbing with a ready-to-go, multi tenant feature.


Agree


It's potentially a bit different from normal queues in that while you can scale up your own queue processing, you can't scale up the webhook receiver. And unlike something like newsletter emailing, you probably care very much about latency.

This means that in a naive implementation, unless you run as many parallel workers as there are messages in the queue, someone will block someone else from delivering. Depending on your latency requirements, this might not be acceptable.

Making delivery truly parallel — that is, each distinct receiver should not block anyone else, no matter how slow or failure-prone they are — and low latency is a bit more tricky, essentially requiring one logical queue per webhook.

You can solve it in various ways, depending on what solution (Kafka, NATS/JetStream, Pulsar, Google Pub/Sub, etc.) you choose, but as far as I know, nobody provides this out of the box. In particular, one-queue-per-webhook requires worker coordination in a way that classical pub/sub doesn't — after all, you don't want to run one worker per webhook if they're not all full of pending messages — and some systems don't scale to many queues very well (e.g. Google Pub/Sub has a hard limit of 10,000 topics per project).

Retrying can also pose some challenges. What if the webhook has been down for days? Do you still keep messages in the queue, or do you throw them away? If the webhook comes back up, do you prioritize new deliveries or mix in the old ones? How do you keep track of this so that you can alert the webhook owner about the flakiness?

As the other poster says, the devil is in the details. It's all solvable, but nine times out of ten, I personally prefer having something off-the-shelf that's been built once, rather than building it from scratch every time.
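To make the head-of-line-blocking point concrete, here's a toy asyncio sketch (endpoint names and delays are invented) where each receiver gets its own logical queue and worker, so a slow endpoint never delays a fast one:

```python
import asyncio

async def endpoint_worker(name: str, queue: asyncio.Queue, delay: float,
                          delivered: list) -> None:
    # One worker per endpoint: a slow receiver only blocks its own queue.
    while True:
        msg = await queue.get()
        await asyncio.sleep(delay)  # stand-in for the HTTP POST to the endpoint
        delivered.append((name, msg))
        queue.task_done()

async def fan_out() -> list:
    delays = {"fast": 0.001, "slow": 0.05}  # invented endpoint latencies
    delivered: list = []
    queues = {name: asyncio.Queue() for name in delays}
    workers = [asyncio.create_task(endpoint_worker(n, q, delays[n], delivered))
               for n, q in queues.items()]
    for i in range(3):  # fan the same three events out to both endpoints
        for q in queues.values():
            q.put_nowait(f"event-{i}")
    for q in queues.values():
        await q.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return delivered

order = asyncio.run(fan_out())
```

With a single shared queue and one worker, the fast endpoint's events would sit behind the slow endpoint's 50ms deliveries; here they don't. The coordination cost the comment describes is exactly what's hidden in "one worker per endpoint" once you have thousands of mostly-idle endpoints.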


I might be missing something but it seems like all of your details are either things you would need to configure anyways in Svix (not all services should have the same retry/expiry) or things that are not solved by this service. This service takes HTTP as input and output, so you wouldn't need a worker per topic anyway, right? The workload is http-in, http-out, with a failure condition for retry.

If I already have a queue of http messages (which I need to have to protect from Svix downtime) configured with their policies for retry/expiry (which I need to configure since it's not the same for all) then what does this service do that is not basically a curl loop with an error check?


But a queue to protect against Svix downtime is fundamentally different from delivering webhooks.

I already outlined some challenges with implementing webhooks. I think you're missing my point about parallel delivery. If the workload is "HTTP-in, HTTP-out", you need to make sure that a single slow "out" does not cause head-of-line blocking that would prevent other, fast workloads from being executed. One way to accomplish that is to scale up to have N_workers >= N_pending, which is typically a terrible solution. So a mature webhook solution needs to be more clever about this.

Queues are great for situations where either the latency doesn't matter, or where you can scale up your resources to decrease latency; but in the case of webhooks, the latency of the webhook receiver is outside your control — you can't scale them up.

Here's another detail where devils are hiding: delivering webhooks to arbitrary URLs is a security concern. To mitigate this, the delivery agent should run in an isolated environment so that it cannot possibly interfere with private hostnames/IPs in your cluster.
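For illustration, a naive pre-delivery check along those lines (function name invented; note this is incomplete on its own, since a real mitigation also has to handle DNS rebinding and redirects, which is why isolating the delivery agent at the network level is the stronger fix):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_destination(url: str) -> bool:
    """Resolve the hook's hostname and refuse private, loopback, link-local,
    and reserved targets, so a malicious URL can't reach internal services."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True
```

Checking every resolved address (not just the first) matters because a hostname can resolve to both a public and a private IP.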


You don't need a queue to protect against Svix downtime. It can be as simple as logging failures to Svix (when they happen) and replaying those events. Though as I said elsewhere, this scenario is something you'd need to deal with for Twilio and SendGrid too.

As for what this service does that is not basically a curl loop with an error check: see the rest of the comments. People chimed in with their own experience better than I could have said it myself. Or just look at https://svix.com and see what we offer; you'll see that there's much more nuance. :)

We know that people underestimate webhooks, it's a challenge we need to overcome, but there really is more to it than just a POST request.


> It can be as simple as logging failures to svix (when they happen), and replaying these events

That's a manually implemented queue, right?

I looked at the site and this thread and I still don't get it, I don't think I underestimate webhooks, but rather that I don't see why adding another webhook inbetween will help.


It's more of an append-only failure log than a queue, which is a whole different beast...

Though as I said elsewhere in the thread, the actual delivery is just part of what we do.


This [1] is a good read about design decisions and potential problems you face with a service like this at scale.

[1] https://segment.com/blog/introducing-centrifuge/


We commented about the "queue to send to svix" in the post above.

> Deliverability to user endpoints (servers) is very different to deliverability to Svix. User endpoints fail all the time and for various reasons, and each of them can fail independently. This means developers need a robust and scalable delivery system that can deal with failures on an ongoing basis. While with Svix, outages are rare, and are dealt with as incidents. The same way you would with SendGrid, Twilio and other API providers.


It is not very difficult to have nearly no outages when the service has nearly no traffic.


What makes you think we have nearly no traffic?

Anyhow, this was not a comment about our uptime against any particular service, but rather about how our uptime compares to the collective (i.e. how often any of those endpoints fail), because that's what matters here.

Though there's definitely a big difference in uptime between a service that has SLAs and random user-endpoints that don't necessarily promise the same.



