50% discount on OpenAI pricing if you submit a batch and give them up to 24h (twitter.com/simonw)
92 points by tosh 4 months ago | 65 comments



This is probably using their excess capacity, but that doesn't necessarily mean their GPUs are idle.

For LLMs/large models the huge cost is the memory ops to load each layer's weights during the forward pass. This is why doing inference at batch size 1 is extremely wasteful: you pay all the mem ops cost and don't use enough compute FLOPs to justify it.

You want a high enough batch size that the workload's compute:mem-ops ratio is close to the GPU's own ratio. This is usually done by batching together multiple user requests.

At times of low usage there is excess capacity because the batch size is below this “optimal” ratio. So they can slot in these “relaxed SLA” requests for little marginal increase in resource usage on their end.

Basically have a queue of these requests that you use to “top off” your batch size when you can.

Edit: also, you may not be able to reach the optimal batch size depending on when requests arrive, e.g. you don’t want to wait forever to fill up a batch. So again, having a queue of outstanding/delayed requests to serve allows for smoothing things out and increasing compute utilization.
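
A rough sketch of the “top off the batch” idea (purely illustrative; the target batch size and queue structure are made up, and this says nothing about OpenAI's actual scheduler):

    from collections import deque

    TARGET_BATCH = 64          # assumed compute:mem-ops sweet spot for this GPU/model
    interactive = deque()      # latency-sensitive requests (ChatGPT, realtime API)
    relaxed = deque()          # "up to 24h" batch requests

    def next_batch():
        batch = []
        # Interactive traffic always goes first.
        while interactive and len(batch) < TARGET_BATCH:
            batch.append(interactive.popleft())
        # Top off with relaxed-SLA work so the forward pass isn't
        # memory-bound at a tiny batch size.
        while relaxed and len(batch) < TARGET_BATCH:
            batch.append(relaxed.popleft())
        return batch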


This awesome talk [1] from OpenAI covers this topic quite a bit. One useful takeaway is that GPU compute is basically static: gone are the days of autoscaling, as there is nothing to autoscale to.

I think that beyond optimizing batch size, massive training clusters tend to benefit from scheduled maintenance periods where everything gets fixed at once rather than rolling fixes (as you either need everything to be working or you need to restart the training window). If OpenAI could interleave batch inference with training-specific HW downtime like interconnect maintenance, it would be another basically free source of GPU FLOPS.

[1] https://www.youtube.com/watch?v=PeKMEXUrlq4


Hey there, this comment is super insightful. I'd love to talk to you about this some more, but you don't have contact info. If you're interested my contact info is in my profile.


Thanks - I added my contact info (I don’t comment a lot on HN, mostly just read) but will drop you a line


I wonder if they have a clear hardware separation between each of the API, ChatGPT, their lower-scale experiments and their large scale (e.g. GPT5) training hardware. Or is everything just a big blob of hardware, that dynamically gets allocated to jobs depending on demand?

Hardware demand is so high that having GPUs idle is a massive waste, but you also want separation between dev, test and prod environments, so it's not obvious what to do.


Yeah this makes sense. I do wonder though how it changes the dynamics around provisioned capacity, if at all.


It reduces the need. If they can get non-latency-sensitive users onto this API then they only need to be provisioned to support their max interactive query load (ChatGPT) rather than peak API load, which can be arbitrarily high (however fast the program generating the load can run). The lower pricing should move users across quite fast, and the higher efficiency will free up hardware and reduce the rate at which they need to grow it.


That's the way it seems to me as well. Curious too about the business implications. My guess is that they wanted to bite the bullet and commit to provisioned capacity, but in a way that didn't require massive overprovisioning for API requests.


They're well beyond that point now I guess. MS has been building whole datacenters just for OpenAI.


It's a pity that this sort of feature is done with a custom ad-hoc HTTP protocol instead of a message queuing protocol like AMQP or STOMP. Those are designed for this kind of thing and already have libraries available; they avoid the need for busywork polling and the need to break work up into batches client-side. You should just be able to submit messages containing work to a queue and then let the broker inform you when new responses arrive on the response queue, if you're connected. If you're not connected, they buffer up for a while until you reconnect and drain the queue.

The advantage of that approach would be that latency could be minimized or tiered. Instead of a single "one batch with maybe up to 24hr latency" the offer could be a series of tiered queues with different SLAs and costs, and you can pick which you submit your work to depending on how important it is.

I wonder why they went for this custom JSON/HTTP approach instead. Maybe they feel that they can't expect developers to understand anything other than POSTing JSON.
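
For illustration, the submit/receive side with an AMQP broker is roughly this (a sketch using the pika client; the broker host, queue names and message format are all made up, this is not anything OpenAI offers):

    import json
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("broker.example.com"))
    ch = conn.channel()
    ch.queue_declare(queue="inference-requests", durable=True)
    ch.queue_declare(queue="inference-responses", durable=True)

    # Submit work: one message per prompt, no client-side batching needed.
    ch.basic_publish(
        exchange="",
        routing_key="inference-requests",
        body=json.dumps({"id": "job-1", "prompt": "Summarize ..."}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist until consumed
    )

    # Drain whatever responses have arrived; the broker buffers while we're away.
    for method, props, body in ch.consume("inference-responses", inactivity_timeout=5):
        if method is None:
            break  # nothing waiting right now
        print(json.loads(body))
        ch.basic_ack(method.delivery_tag)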


There's a reason pretty much everything that does not require low-latency replies avoids stateful networking: everything from RSS to video streaming prefers stateless polling designs because they are vastly easier for both parties to implement and scale. Meanwhile, I couldn't name a single system in widespread use built around an MQ paradigm in its public interface, except for actual MQ APIs, and many of those (e.g. from AWS) are still built on polling for the reasons just described.


Behind each request made to OpenAI is a staggering amount of GPU computation. If the price of the queue request is even a hundred-thousandth of the overall price of a single request I'd be stunned. There is no message queue scaling issue here. Message queue scaling issues arise when you are blasting around a lot of messages, but each of them takes minimal resources on an individual basis to service, so it's feasible for the queue itself to be the bottleneck. I wouldn't be surprised if a single Raspberry Pi could handle the entire queuing load here, and if it couldn't, it's not off by a very large factor, because the GPU resources behind what it would take to service a full RPi's queuing capacity would be staggeringly enormous, I think well beyond what OpenAI actually has.


Isn't the same true then of an HTTP server? Handling the polling requests is a minute amount of work compared to running the real jobs. And you've addressed the scalability problem, but not the connectivity issues that generally plague long-lived connections on the Internet at large.


Not always. There are HTTP servers where you are making an HTTP request for an in-memory value, where the work is less than the parsing cost of the HTTP request. There are many HTTP services where the time to fulfill the request is much longer than the parse cost, but that time is not 100% CPU on either the server or any given service, because there's a lot of back-and-forth delay and latency. And there are many HTTP services that are 100% CPU and orders of magnitude more expensive than an HTTP request parse, but still on the order of <1ms; if such a service were actually a message queue, you might still be able to clog it at least somewhat.

This is a very pessimal case, though. You make a tiny HTTP request which is parsed in microseconds at the most, and it invokes somewhere between one and five million microseconds of 100% utilization of an expensive resource. A thousand queuings per second would be fairly easy for an RPi; it could handle that no problem (at least assuming you use a decent language to manage it; a super fancy Python framework that also does a lot of metaprogramming might choke under the load, sure, though some half-carefully written Python still ought to be able to handle this), while those 1000 requests/second would require around 2000 GPUs to dispatch them in real time so we can maintain that 1000 rps. I'm pretty sure you can add an order of magnitude before I'd really start worrying about the RPi as a queuing mechanism, and you're getting to the RPi being able to queue for ~100,000 GPUs without too much strain. I don't know how many GPUs OpenAI has but that's got to be getting pretty close to their order of magnitude. They may have a million, I doubt ten million.

(Of course, I wouldn't actually do this on an RPi; I'm just using that as a performance benchmark.)
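
To make the arithmetic explicit, using the numbers above and treating ~2 GPU-seconds as a midpoint of the 1-5s range (illustrative only):

    queue_rps = 1_000            # requests the queue front-end accepts per second
    gpu_seconds_per_request = 2  # assumed average inference cost per request
    gpus_needed = queue_rps * gpu_seconds_per_request
    print(gpus_needed)           # 2000 GPUs just to keep up with that modest queue rate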


Inevitably some users will decide to poll every 60 seconds or whatever, because they have no idea when the work will be completed and because what they really want is "results ASAP but willing to tolerate latency to pay less". And then your servers are doing a ton of TLS negotiation, user authentication, request serving and database lookups, just to answer "not yet".

I think people are getting distracted by the idea of connections being somehow expensive. They aren't, really, compared to polling (unless the poll is genuinely very rare). A stateless request is expensive because you have to go back to your source of truth on every request (probably an expensive and hard-to-scale RDBMS), and you don't control how often the user makes such requests. CPU load is potentially unbounded, and users don't pay unless you introduce pay-per-poll micropayments.

Compare that to an MQ design: the overhead is a single TCP connection and a bit of memory to map that connection to an internal queue. Whilst the work sits in the queue or is being processed, nothing is happening and there's no DB load. The overhead is a matter of bytes, and in the event that you run out of RAM you can always kick users off at random and let them exponentially back off and retry (automatically, because the libraries handle this and make it transparent). Or just use swap; after all, latency is not that important.


Nothing prevents, in principle, a long-lived HTTP connection where the server only replies once the response is available (long polling). However, on the real internet, such long-lived connections just don't work for a large minority of users. There are numerous devices, typically close to the client, which kill "idle" connections; NAT gateways and stateful firewalls are some of the most common culprits.

So you just can't rely on your customers being able to keep a long connection around.

Not to mention the numerous corporate environments in which it is hard to even open an outgoing connection that is not HTTPS or a handful of other known protocols.


Well, as I've said several times in this thread, good MQ libraries know how to reopen connections automatically if they break, back off, retry, connect to several endpoints and load-balance between them, and so on. All this is an abstraction layer higher than what HTTP provides, so the problems HTTP long polling can have in consumer/mobile use cases aren't necessarily relevant. It's like files vs SQLite.

As for the general issue of connections, that's true for consumer use cases. B2B workloads have far fewer problems with that especially when running in the cloud. If your cloud gives you mobile-quality internet then you have a problem, but again, it's a problem a good MQ implementation will fix for you. Consider the "lessons from 500 million tokens" blog post the other day, in which the author mentioned repeatedly that they had to write their own try/catch/retry loops around every OpenAI call because their HTTP API was so flaky.

And again, if you are behind a nasty firewall then you might find your connection dying at any moment because OpenAI got classified as a hate speech site or something. The fix is to file the right tickets to get your environment set up correctly.


Dropbox's approach here in the early 2010s (not sure if still used) was actually quite clever. Both official clients and third parties could open up a long-timeout HTTP connection that wouldn't be responded to until the web server was delivered a message in their internal message queue. Avoided the overhead of polling and allowed for extremely low latency, while letting clients still use their favorite HTTP library.

https://dropbox.tech/developers/low-latency-notification-of-...
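
The client side of that pattern is just a loop around a blocking GET. A minimal sketch with the requests library (the notification URL and cursor parameter are invented for illustration):

    import requests

    cursor = None
    while True:
        try:
            # Server holds the request open until something changes or ~60s pass.
            r = requests.get(
                "https://notify.example.com/longpoll",
                params={"cursor": cursor},
                timeout=90,
            )
            if r.status_code == 200:
                event = r.json()
                cursor = event.get("cursor", cursor)
                print("change notification:", event)
            # Anything else (e.g. 204) just means "nothing new yet"; reconnect.
        except requests.exceptions.RequestException:
            continue  # dropped connection or timeout: reconnect and keep waiting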


> long-timeout HTTP connection that wouldn't be responded to

This open connection IS the overhead. Some approaches have more overhead, some less. Long-polling these days is not needed and can be replaced by SSE [1] (if you are OK with unidirectional communication), websockets (bidirectional), or callback URIs in the request. The latter is a per-request webhook, essentially, and would have the lowest overhead at the cost of ceremony (the client now needs to have a running web server).

By the way, long polling was popularized by Comet [2] around 2006.

[1]: https://en.wikipedia.org/wiki/Server-sent_events

[2]: https://en.wikipedia.org/wiki/Comet_(programming)
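
For the callback-URI option, the "ceremony" amounts to the client standing up a small HTTPS endpoint and handing its URL over with the job. A minimal sketch using Flask (the payload fields are assumptions, not any real provider's format):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/batch-callback", methods=["POST"])
    def batch_callback():
        # The provider POSTs here once the work is done; no polling required.
        result = request.get_json()
        print("batch", result.get("batch_id"), "status", result.get("status"))
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)  # in practice: TLS, auth on the callback, a public URL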


Wait, how is having the client run its own http(s) server and accept regular HTTP(S) requests the lowest overhead option? Sounds like the highest overhead one, with all the protocol statelessness.


That's typically called HTTP long polling, and it's a commonly used alternative to things like WebSockets.


This is just long-polling, right?


Think about every messenger out there, any Slack or Slack-like app that uses WebSockets, email clients that use IMAP etc. They aren't polling once a minute like it's 1995.

It's not really easier to implement or scale, in my view. It may seem that way if you've never worked on large scale stateful connection serving. If you want users to get the answer as soon as it's ready, and you do or at least your users do, then you need users to hold open connections and at that point it's really a question of how much your protocols and libraries do for you. If you use HTTP the answer is "not much" because it was designed to download hypertext. The actual task of managing lots of connections server side isn't especially hard. There are MQ engines that support sharding and can have a regular TCP LB like a NetScaler stuck in front of them, or of course, you can just implement client-side LB as well.


I'm not promoting one way or the other, just pointing out why things are the way they are. Restarting a service with stateful networking is reason enough to avoid it where possible; watching entire buildings melt for 15 minutes because a single binary SEGV'd is a real outcome. For an extra helping of pain, add a herd of third-party clients of random versions to a system that never needed the comms capabilities on offer, and you have a problem to solve that never needed to exist in the first place.


Good MQ libraries know how to do backoff and retry transparently, so if you aren't provisioned for peak load (which is what melting implies) you can just reject connections for a while until everyone is back.

The alternative with polling rapidly turns into being melted all the time - it's literally handling constant reconnection load - especially as you don't control the polling code and as the user is expected to write it themselves you can safely bet it will be MVP. Retries if they occur at all will be in a tight loop.
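
For reference, the backoff/retry that MQ client libraries give you for free is only a few lines, but it's exactly the part that home-grown polling code tends to skip (illustrative sketch):

    import random
    import time

    def retry_with_backoff(fn, max_attempts=8, base=1.0, cap=60.0):
        """Call fn(), retrying on failure with exponential backoff plus jitter."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the capped exponential
                # delay, so thousands of clients don't all retry in lockstep.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))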

I worked on Gmail for a few years and we had a ton of clients permanently connected for IMAP, web long polling and Talk. Those clients tended to be well written and it wasn't the big problem people seem to think it is, especially as some could do client side LB which reduced resource requirements server side. I also experienced (luckily from a distance) fully HTTP polling based systems where backoff/retry wasn't implemented properly and caused services to go hard down because they couldn't handle being hit with all the polling simultaneously, and the clients just broke or did something stupid like immediately retry if they got server errors.

Fundamentally, the state can't be removed, so it's going to be sitting somewhere. If it isn't in the protocol stack then it's in your database or session cache instead. Sometimes that's better, sometimes it's worse.


Webhooks are a little closer in that they remove the need for constant polling, at least for longer-running processes.


That sounds really difficult to use to me. I'd need to figure out how to connect to a custom messaging protocol, and how to keep that connection online (maybe routing through firewalls etc), and handle retries and connection failures and suchlike.

An HTTP polling API may be less efficient but I know how to use that from all of the programming languages I work with already.


AMQP isn't a custom messaging protocol, and MQ libraries do all of that for you already (handling reconnects, retries, etc). Connection interruptions aren't an issue. You're actually worrying about low-level details of the sort MQ systems are specifically designed to solve for you, and that HTTP stacks do not.

For example, someone below asks "how do I see my submitted jobs". MQ protocols solve that for you out of the box. You can browse the queues that you own. It's all already standardized and implemented. Because they rolled their own protocol, OpenAI lacks this function and will have to develop it themselves. Then because it's all oriented around batches (when the underlying problem does not actually need batching), the users will have to run a local DB to keep track of what prompts map to what batches or they'll find themselves redownloading the whole batch to find out what's in it.

As for firewalls, you open the port. The only issue there is if you're behind a firewall you don't have any control over and where the people who do don't want you using AI, but that's not an issue in the datacenter deployments that are presumably generating inferencing batch workloads, and such firewalls or proxies can break HTTP APIs too anyway.


FWIW, AMQP is the protocol RabbitMQ implements; there's a library for pretty much every language/ecosystem.


Yeah, I've used that in the past and found it significantly harder than HTTP.


> they avoid the need for busywork polling

The fact that you know that you can only expect an answer after 24h also avoids the need for busywork polling.

> Instead of a single "one batch with maybe up to 24hr latency" the offer could be a series of tiered queues with different SLAs and costs, and you can pick which you submit your work to depending on how important it is.

But that is what they have. They have one "ASAP" API and one "within 24h" one. These are two big distinct use cases. You use the fast one when there is a user waiting for an answer realtime, and you use the slow one when nobody is waiting for it realtime.

These are two distinct use cases from the user's perspective.

You are correct in identifying that there could be an API which offers 2h returns, or one which offer 14day returns, but why would you complicate the offering with that? It adds a lot of complication to the documentation, a lot of complication on the scheduling side and a lot of complication on the pricing side, and for what upside?


There's almost no complexity on the scheduling side if it's done properly. Just pop work off the queues in priority order.
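
"Pop in priority order" really is about this much code (an illustrative sketch; the tier names are invented):

    from collections import deque

    tiers = {  # tightest SLA first; the labels are just examples
        "20min": deque(),
        "24h": deque(),
        "14d": deque(),
    }

    def next_job():
        # Always drain the tightest-SLA queue first; fall through otherwise.
        for name in ("20min", "24h", "14d"):
            if tiers[name]:
                return name, tiers[name].popleft()
        return None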

There are use cases for multiple latency tiers:

1. At the lower end you have realtime-ish responses for when the chat is via SMS/WhatsApp/tickets/email, i.e. the social convention is that responses may be delayed by a few minutes or even 10-20 minutes and that's reasonable, but not by 24 hours.

2. At the high end you have a database with 200 million documents, a prompt that works, and you just want to submit all of them in one go as quickly as your client can work, then get the results as they get processed. There is no latency requirement whatsoever, but you'd prefer not to have to chunk things up and track batch IDs and such yourself. What you DO really want though is to have exactly-once submission, so you don't accidentally lose your connection half way through a batch upload and then end up paying for the same batch twice.

These are all basic problems of the type that occur all the time in the IT industry. The solutions are well known and mature. Your bank runs on message queues of various kinds. The tech really isn't that fancy.


> I wonder why they went for this custom JSON/HTTP approach instead. Maybe they feel that they can't expect developers to understand anything other than POSTing JSON.

It's simpler for them to build and operate since it fits their existing API (and user validation, quota tracking, etc.) infrastructure, it's simpler for them to update their SDK to handle another HTTP endpoint than to build an AMQP or STOMP client into the SDK for this feature, it's easier for most consumers who aren't using the SDK, and a message queue interface for this one bit of functionality would offer minimal benefits. Easy choice.

(If you are consuming from an environment where you are already using AMQP or STOMP, sure, build a quick adapter for the HTTP interface. But I don't see any good reason OpenAI should have done one of those instead of HTTP.)


> Instead of a single "one batch with maybe up to 24hr latency" the offer could be a series of tiered queues with different SLAs and costs

The API includes a `completion_window` parameter ("The time frame within which the batch should be processed. Currently only 24h is supported.").

https://platform.openai.com/docs/api-reference/batch/create


You'd have to hold a ton of connections open for 24 hours+


That isn't hard. They aren't going to serve a billion connections, are they, and WhatsApp scaled a full NxN low-latency messaging system with a handful of people.

At any rate, that's only the case if every request actually sits in the queue for 24 hours. The 24 hour period is clearly chosen to let them absorb free capacity overnight (all such services have diurnal cycles), but unless their implementation is really shoddy, at any given moment their workers will be popping work off an internal queue anyway to take advantage of moment-sized lulls in traffic. Done properly you can saturate your hardware 100% of the time... there's certainly enough inferencing demand out there.

So at that point the question is just one of external API.


Message queues are difficult to maintain.

Webhooks are the best solution IMO.



We've gone full circle to an era of Mainframe timeshare scheduling tasks.


Give it 10 years and your mid-price android phone will be running models that would make ChatGPT look like baby talk.


Maybe even talking birthday cards with LLM ASICs in a blob


Suggests to me that they have a LOT of compute that they can't, or have decided not to, scale down. Interesting way for them to keep the infra up and cover costs.


It mostly suggests large spikes in usage, no? Which does not seem very surprising.


Totally. There are other explanations (like wanting to offer a discount), but my mind initially went to "they have a lot of provisioned compute that they pre-paid for, and they want higher utilization when API request rates are low."

Just hypothesizing.


Or they don't have enough provisioned compute to cover peak usage, and they therefore want to reduce peak usage.


I don't think that's right. There is an optimal batch size, and this allows them to just keep everything optimal or close to it for as much time as possible. I'd bet every single minute they miss chances to run things optimally because they need to hit some latency number, and this enables them to not have to make that trade off.


Yeah, that's another good explanation.


For anyone trying to figure out these sorts of pricing questions when running a service in the cloud: I wrote up a blog post about how (some of) the math works out.

https://alexsci.com/blog/modeling-on-demand-pricing/


Or simply that traffic is uneven. Like the Spot instances of EC2.


Maybe this gives them more control and improves ability to handle latency-sensitive traffic.

Or maybe they’ve found a way to batch process many requests together in a way that’s more efficient.


Not really. It could be they have massive traffic during US daytime hours, which is not evenly matched by traffic from the rest of the world during US night times, so they'll just run this batch during US night time and hence the 24 hour period.


Yah, that's what I'm saying. In those low periods they could just scale down the infra, but perhaps there's a constraint that doesn't allow for that. This lets them not waste it.


Depends on what you mean by scale down infra.

If they're running their own DC (or their own hardware in a colo) then I believe the only way to scale down is by shutting the systems down at night, which, while a non-trivial amount in electricity savings, is not really much when compared to the total hardware CapEx or even the total OpEx.

And, even then, you have to consider the cost of shutting down and bringing them up and the complexity it brings.

Now, if they're using a cloud provider they can theoretically terminate the instances. But even in that case, to do that, they'll have to use on-demand instances rather than reserved instances, which will probably wipe out all the savings.


Don't spill the card stack!


50 years of Moore's law improvements and we're back to batch processing. Nothing illustrates better to me the regression that using massive amounts of GPU to do inference represents as an approach to computation.


> 50 years of Moore's law improvements and we're back to batch processing.

It never went away.

I did some consulting years ago for a large telco. They had an overnight billing batch process that was obviously critical to the company.

It took about 22 hours to run. They were terrified it was going to start taking longer; if it ever reached 24 hours, that would be the death of the company.


Wish this was done over Bundle Protocol, would love to code that for them!


Probably a response to provisioned-capacity inference from AWS and GCP.


Could def be useful for many use cases.


How does this work, do I submit a rest call and get an IOU response to look up the results later?


Copied from OpenAI's Jeff Harris on Twitter:

1) Upload your batch file

curl https://api.openai.com/v1/files -H "Authorization: Bearer $OPENAI_API_KEY" -F purpose="batch" -F file="@batch_example.jsonl"

2) Create the batch job

curl --request POST --url https://api.openai.com/v1/batches --header "Authorization: Bearer $OPENAI_API_KEY" --header 'Content-Type: application/json' --data '{ "input_file_id": "file-DYJImtYAQ2y3j25b5F27Eefp", "endpoint": "/v1/chat/completions", "completion_window": "24h" }'

3) Check the status of your job

curl https://api.openai.com/v1/batches/batch_f5Hh3MXaM0NNuTGOmutl... -H "Authorization: Bearer $OPENAI_API_KEY"

4) Download the completed file!

curl https://api.openai.com/v1/files/file-Og2t4LcJOCqOFq6rmp22Uzk... -H "Authorization: Bearer $OPENAI_API_KEY" > batch_output.jsonl
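
For anyone who'd rather stay in Python, the same four steps with the official openai client should look roughly like this (method names per the v1 SDK as I understand it; treat it as an untested sketch, and note the file/batch IDs come from the previous calls rather than being hard-coded):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # 1) Upload the .jsonl file of requests
    batch_file = client.files.create(
        file=open("batch_example.jsonl", "rb"), purpose="batch"
    )

    # 2) Create the batch job against the chat completions endpoint
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

    # 3) Check the status of the job (poll infrequently)
    batch = client.batches.retrieve(batch.id)
    print(batch.status)

    # 4) Once completed, download the results file
    if batch.status == "completed":
        result = client.files.content(batch.output_file_id)
        with open("batch_output.jsonl", "wb") as f:
            f.write(result.content)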


Cool, not really looking to do this just yet, but it's definitely on my radar the next time I have a project.

Would love an option to look up all my open jobs. I imagine doing this in a CI/CD pipeline or GitHub Action. But I don't want to have to store state.


Presumably that would be a

curl --request GET --url https://api.openai.com/v1/batches

but that doesn't seem to exist yet: https://platform.openai.com/docs/api-reference/batch


But it looks like you can upload files and list them.

I'd just upload a file with the name Timestamp_batchID, and get the list of files later.


this is crazy cool, will try it out shortly.

i need to batch analyze 100K Whisper transcriptions. 8-)




