Hacker News

There's a reason pretty much everything that does not require low-latency replies avoids stateful networking: everything from RSS to video streaming prefers stateless polling designs, because they are vastly easier for both parties to implement and scale. Meanwhile, I couldn't name a single system in widespread use built around an MQ paradigm in its public interface, except for actual MQ APIs, and many of those (e.g. from AWS) are still built on polling for the reasons just described.



Behind each request made to OpenAI is a staggering amount of GPU computation. If the price of the queue request is even a hundred-thousandth of the overall price of a single request, I'd be stunned. There is no message queue scaling issue here. Message queue scaling issues arise when you are blasting around a lot of messages, each of which takes minimal resources to service individually, so that the queue itself can feasibly become the bottleneck. I wouldn't be surprised if a single Raspberry Pi could handle the entire queuing load here, and if it couldn't, it's not off by a very large factor, because the GPU resources needed to service a full RPi's worth of queuing capacity would be staggeringly enormous; I think well beyond what OpenAI actually has.


Isn't the same true then of an HTTP server? Handling the polling requests is a minute amount of work compared to running the real jobs. And you've addressed the scalability problem, but not the connectivity issues that generally plague long-lived connections on the Internet at large.


Not always. There are HTTP servers where you are making an HTTP request for an in-memory value, and the real work is less than the cost of parsing the HTTP request itself. There are many HTTP services where the time to fulfill the request is much longer than the parse cost, but that time is not 100% CPU on either the server or any given backend service, because there's a lot of back-and-forth delay and latency. And there are many HTTP services that are 100% CPU and orders of magnitude more expensive than an HTTP parse, but still on the order of <1ms; if such a service were actually a message queue, you might still be able to clog the queue at least somewhat.

This is a very pessimal case, though. You make a tiny HTTP request, which is parsed in microseconds at most, and it invokes somewhere between one and five million microseconds of 100% utilization of an expensive resource. A thousand queuings per second would be fairly easy for an RPi; it could handle that no problem (at least assuming you use a decent language to manage it; a super-fancy Python framework that also does a lot of metaprogramming might choke under the load, sure, though some half-carefully written Python still ought to be able to handle this). Meanwhile, those 1000 requests/second would require around 2000 GPUs to dispatch in real time if we want to sustain that rate. I'm pretty sure you can add an order of magnitude before I'd really start worrying about the RPi as a queuing mechanism, at which point the RPi is queuing for on the order of 100,000 GPUs without too much strain. I don't know how many GPUs OpenAI has, but that's got to be getting pretty close to their order of magnitude: they may have a million; I doubt ten million.

(Of course, I wouldn't actually do this on an RPi; I'm just using that as a performance benchmark.)
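To sanity-check the arithmetic above (the per-request GPU time and request rates are my illustrative assumptions, not OpenAI's numbers):

```python
# Back-of-envelope check of the queue-vs-GPU arithmetic above.
# The 1-5 GPU-seconds per request figure is assumed, not measured.

def gpus_needed(requests_per_second: float, gpu_seconds_per_request: float) -> float:
    """GPUs that must be busy to keep up with the arrival rate (Little's law)."""
    return requests_per_second * gpu_seconds_per_request

print(gpus_needed(1_000, 2.0))    # 1000 rps at ~2 GPU-s each -> 2000.0 GPUs busy
print(gpus_needed(10_000, 5.0))   # an order of magnitude more load -> 50000.0
```

So even an order of magnitude more queue traffic than a single modest box can absorb already corresponds to tens of thousands of permanently busy GPUs.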


Inevitably some users will decide to poll every 60 seconds or whatever, because they have no idea when the work will be completed and because what they really want is "results ASAP but willing to tolerate latency to pay less". And then your servers are doing a ton of TLS negotiation, user authentication, request serving and database lookups, just to answer "not yet".

I think people are getting distracted by the idea of connections being somehow expensive. They aren't, really, compared to polling (unless the poll is genuinely very rare). A stateless request is expensive because you have to go back to your source of truth on every request (probably an expensive and hard-to-scale RDBMS), and you don't control how often the user makes such requests. CPU load is potentially unbounded, and users don't pay for it unless you introduce pay-per-poll micropayments.

Compare that to an MQ design: the overhead is a single TCP connection and a bit of memory to map that connection to an internal queue. Whilst the work sits in the queue or is being processed, nothing is happening and there's no DB load. The overhead is a matter of bytes, and if you run out of RAM you can always kick users off at random and let them exponentially back off and retry (automatically, because the libraries handle this and make it transparent). Or just use swap; after all, latency is not that important.
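A minimal sketch of that design in process terms (asyncio used purely for illustration; all the names here are mine, not from any real MQ):

```python
import asyncio

# Sketch of the MQ-style design described above: each client connection
# is mapped to an in-process queue, and the handler simply awaits a
# result. While the job is pending, the connection costs a small amount
# of memory and generates zero database load.

queues: dict[str, asyncio.Queue] = {}   # connection id -> its result queue

async def handle_client(client_id: str) -> str:
    q = queues.setdefault(client_id, asyncio.Queue())
    return await q.get()                 # idle await: no polling, no DB hits

async def worker(client_id: str, result: str) -> None:
    await asyncio.sleep(0.01)            # stand-in for the expensive GPU job
    await queues[client_id].put(result)  # wake exactly the waiting connection

async def main() -> str:
    consumer = asyncio.create_task(handle_client("conn-1"))
    await worker("conn-1", "done")
    return await consumer

result = asyncio.run(main())
print(result)   # -> done
```

The point is that nothing runs between enqueue and completion: the waiting connection is just a parked coroutine and a dictionary entry.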


Nothing prevents, in principle, a long lived HTTP connection where the server only replies once the response is available (long polling). However, on the real internet, such long lived connections just don't work, for a large minority of users. There are numerous devices, typically close to the client, which kill "idle" connections. NAT gateways and stateful firewalls are some of the most common culprits.

So, you just can't rely on your customers being able to keep around a long connection.

Not to mention the numerous corporate environments in which it is hard to even open an outgoing connection which is not HTTPS or a handful of other known protocols.


Well, as I've said several times in this thread, good MQ libraries know how to reopen connections automatically if they break, back off, retry, connect to several endpoints and load-balance between them, and so on. All this is an abstraction layer above what HTTP provides, so the problems HTTP long polling can have in consumer/mobile use cases aren't necessarily relevant. It's like files vs SQLite.

As for the general issue of connections, that's true for consumer use cases. B2B workloads have far fewer problems with that especially when running in the cloud. If your cloud gives you mobile-quality internet then you have a problem, but again, it's a problem a good MQ implementation will fix for you. Consider the "lessons from 500 million tokens" blog post the other day, in which the author mentioned repeatedly that they had to write their own try/catch/retry loops around every OpenAI call because their HTTP API was so flaky.

And again, if you are behind a nasty firewall then you might find your connection dying at any moment because OpenAI got classified as a hate speech site or something. The fix is to file the right tickets to get your environment set up correctly.


Dropbox's approach here in the early 2010s (not sure if still used) was actually quite clever. Both official clients and third parties could open up a long-timeout HTTP connection that wouldn't be responded to until the web server was delivered a message in their internal message queue. Avoided the overhead of polling and allowed for extremely low latency, while letting clients still use their favorite HTTP library.

https://dropbox.tech/developers/low-latency-notification-of-...


> long-timeout HTTP connection that wouldn't be responded to

This open connection IS the overhead. Some approaches have more overhead, some less. Long-polling these days is not needed and can be replaced by SSE [1] (if you are OK with unidirectional communication), websockets (bidirectional), or callback URIs in the request. The latter is a per-request webhook, essentially, and would have the lowest overhead at the cost of ceremony (the client now needs to have a running web server).
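For reference, SSE is just a line-oriented format streamed over a plain HTTP response; a simplified parser (ignoring the spec's `id:`, `event:`, `retry:` and comment fields for brevity) looks like:

```python
# Simplified parser for the SSE wire format: events are blocks of
# "field: value" lines separated by blank lines; multiple data lines
# within one event are joined with newlines.

def parse_sse(stream: str) -> list[str]:
    events, data_lines = [], []
    for line in stream.splitlines():
        if line == "":
            if data_lines:                      # blank line dispatches the event
                events.append("\n".join(data_lines))
                data_lines = []
        elif line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
    if data_lines:                              # flush a trailing partial event
        events.append("\n".join(data_lines))
    return events

raw = "data: not yet\n\ndata: job finished\ndata: result=42\n\n"
print(parse_sse(raw))   # -> ['not yet', 'job finished\nresult=42']
```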

By the way, long polling was popularized by Comet [2] around 2006.

[1]: https://en.wikipedia.org/wiki/Server-sent_events

[2]: https://en.wikipedia.org/wiki/Comet_(programming)


Wait, how is having the client run its own http(s) server and accept regular HTTP(S) requests the lowest overhead option? Sounds like the highest overhead one, with all the protocol statelessness.


That's typically called HTTP long polling, and it's a commonly used alternative to things like WebSockets.


This is just long-polling, right?


Think about every messenger out there, any Slack or Slack-like app that uses WebSockets, email clients that use IMAP etc. They aren't polling once a minute like it's 1995.

It's not really easier to implement or scale, in my view. It may seem that way if you've never worked on large-scale stateful connection serving. If you want users to get the answer as soon as it's ready, and you do, or at least your users do, then you need them to hold open connections, and at that point it's really a question of how much your protocols and libraries do for you. If you use HTTP the answer is "not much", because it was designed to download hypertext. The actual task of managing lots of connections server-side isn't especially hard. There are MQ engines that support sharding and can have a regular TCP LB like a NetScaler stuck in front of them, or of course you can just implement client-side LB as well.


I'm not promoting one way or the other, just pointing out why things are the way they are. Restarting a service with stateful networking is reason enough to avoid it where possible; watching entire buildings melt for 15 minutes because a single binary SEGV'd is a real outcome. For an extra helping of pain, add a herd of third-party clients of random versions to a system that never needed the comms capabilities on offer, and you have a problem to solve that never needed to exist in the first place.


Good MQ libraries know how to do backoff and retry transparently, so if you aren't provisioned for peak load (which is what melting implies) you can just reject connections for a while until everyone is back.

The alternative with polling rapidly turns into being melted all the time - it's literally handling constant reconnection load - especially as you don't control the polling code, and since the user is expected to write it themselves you can safely bet it will be MVP-quality. Retries, if they occur at all, will be in a tight loop.
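The backoff behaviour a good MQ client library gives you transparently looks roughly like this (parameters are illustrative defaults, not taken from any specific library):

```python
import random
import time

# Sketch of transparent exponential backoff with full jitter, as an MQ
# client library might implement it around its connect path.

def with_backoff(connect, max_attempts=6, base=0.5, cap=30.0):
    """Retry `connect` on ConnectionError, sleeping a random amount
    bounded by an exponentially growing (and capped) window."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                      # give up after the final attempt
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)

# Demo: a fake endpoint that rejects the first two connection attempts,
# as a server shedding load would.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("server shedding load")
    return "connected"

print(with_backoff(flaky_connect, base=0.01))   # -> connected
```

The jitter matters: it spreads reconnections out so a restarted server isn't hit by every client simultaneously, which is exactly the hard-down failure mode described below.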

I worked on Gmail for a few years and we had a ton of clients permanently connected for IMAP, web long polling and Talk. Those clients tended to be well written and it wasn't the big problem people seem to think it is, especially as some could do client side LB which reduced resource requirements server side. I also experienced (luckily from a distance) fully HTTP polling based systems where backoff/retry wasn't implemented properly and caused services to go hard down because they couldn't handle being hit with all the polling simultaneously, and the clients just broke or did something stupid like immediately retry if they got server errors.

Fundamentally, the state can't be removed, so it's going to be sitting somewhere. If it isn't in the protocol stack then it's in your database or session cache instead. Sometimes that's better, sometimes it's worse.


Webhooks are a little closer, in that they remove the need for constant polling, at least for longer-running processes.




