We're Cutting L40S Prices in Half (fly.io)
59 points by LukeLambert 29 days ago | 30 comments



The L40S has 48GB of VRAM, so I'm curious how they're able to run Llama 3.1 70B on it. The fp16 weights alone would exceed that. Maybe they mean quantized/fp8?

I just had to implement GPU clustering in my inference stack to support Llama 3.1 70B, and even then I needed 2x A100 80GB SXMs.
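
A rough back-of-the-envelope (my numbers, not the article's): weight memory is roughly parameter count times bytes per parameter, so a 70B model only fits in 48GB once you get down to around 4-bit quantization.

    # Back-of-the-envelope weight-memory estimate for Llama 3.1 70B.
    # Ignores KV cache and activation overhead, so real usage is higher.
    params = 70e9
    for name, bytes_per_param in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
        gb = params * bytes_per_param / 1024**3
        print(f"{name}: ~{gb:.0f} GB")  # fp16 ~130 GB, fp8 ~65 GB, int4 ~33 GB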

I was initially running my inference servers on fly.io because they were so easy to get started with. But I eventually moved elsewhere because the prices were so high. I pointed out to someone there who emailed me that it was really expensive compared to others, and they basically just waved me away.

For reference, you can get an A100 SXM 80GB spot instance on google cloud right now for $2.04/hr ($5.07 regular).


Our standard A100 SXM 80GB price is $3.50/hr, for what it's worth.


For reference, that's at least 40% more than what an H100 SXM would cost if you are willing to reserve for a month (so not apples to apples).

The H100 will also be much faster, especially if you are willing to use fp8. Maybe 3-4x.


> You can run DOOM Eternal, building the Stadia that Google couldn’t pull off, because the L40S hasn’t forgotten that it’s a graphics GPU.

Savage.

I wonder if we’ll see a resurgence of cloud game streaming


I feel like GeForce Now has been growing quite steadily. They onboard new titles every week, including some big ones like World of Warcraft recently, and they've stood up streaming data centers in quite a lot of places now.


GeForce Now is pretty great.


Is it responsive enough to play action games? Noticeable lag is what kept me off of Stadia.


I expected lag, but I haven't experienced it beyond an occasional moment, maybe once a month. I just attribute that to local internet issues; it stops and I keep playing.

It is a surprisingly fluid experience.


Is the infrastructure that PlayStation Now uses publicly known? That's the only streaming service I've used so far.


As someone who used to work on that platform back when it was still called PlayStation Now, no it is not. There are a few rumors and alleged leaks, though.


MS seems to be continuing to push xCloud with Game Pass. It has been useful when playing games with my friends a few times.


I hadn't even heard of the L40S until I started renting to get more memory for small training jobs. I didn't benchmark it, but it seemed to be pretty fast for a PCIe card.

Amazon's g6 instances are L4-based with 24GB of VRAM, half the capacity of the L40S, with SageMaker on-demand prices at around this rate. Vast.ai is cheaper, though it's a bit more like bidding and availability varies.


> You can run Llama 3.1 70B — the big Llama — for LLM jobs.

That's the medium Llama. Does anyone know if an L40S would run the 405B version?


Hi, I'm the person that wrote that sizing comment in the draft for this article. I have been trying for a while and have been unsuccessful at getting 405B running on any of the GPU machines. I suspect I'd need a raw 8xA100 node to do it at Q4. I doubt there is any reasonable combination of L40S cards that can do it on fly.io. It's just too big. I suspect that in time the 70B model will be brought up to be roughly equivalent, but realistically it's already at the GPT-4 threshold as is. I've found that 70B is more than sufficient in practice.
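
For a rough sense of the sizing (my arithmetic, just an estimate): at Q4 the 405B weights alone are close to 190GB, which already needs about four 48GB L40S cards before you account for KV cache, while an 8xA100 80GB node has 640GB to spare.

    # Rough sizing for Llama 3.1 405B at ~4-bit quantization (weights only).
    params = 405e9
    q4_gb = params * 0.5 / 1024**3                 # ~189 GB of weights
    print(f"405B @ Q4: ~{q4_gb:.0f} GB of weights")
    print(f"8x A100 80GB: {8 * 80} GB total")      # 640 GB, leaves room for KV cache
    print(f"48GB L40S cards for weights alone: {q4_gb / 48:.1f}")  # ~3.9, before overhead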


Be that as it may, Llama 3.1 70B is not the big Llama.


I fixed it.


Prices lowered to $1.25/hr... still 2X vast.ai prices.


There are definitely GPU providers where you can buy cheaper L40S hours than us. I'm not entirely sure what their system architectures are, or whether they're just buying in absolutely spectacular volume, because we are cutting pretty close to the bone with our pricing.

One cost factor we have that other providers might not have (I'd love to know): we have to dedicate individual racked physical hosts to each group of GPUs we deploy, because we don't (/can't, depending on how you think about systems security) allow GPU-enabled workloads to share hardware with non-GPU-enabled workloads, and we don't allow anyone to share kernels.

But like we said in the post: we're still figuring this stuff out. What we know is: at the same price level, we're consistently sold out of A10 inventory.


Hadn't heard of vast.ai before and looked into it. The prices seem really good. Then saw "Our software allows anyone to easily become a host by renting out their hardware."

Ya, that's a no from me.


It is a little squicky. Kind of love the idea, though, even if I'd be worried about using rando compute someone installed an agent on.


Not happening here, Kurtrand.


Also, vast.ai and fly.io are just not apples to apples in general. Sure, go to vast, get yourself a VM or VPS or Docker container or whatever instance they're giving you, and do your stuff. But that is not even close to the same set of features/infra/platform that fly.io offers, is it? I'm not sure why people keep thinking GPU pricing on fly should be the same as an instance on some generic GPU farm, where with vast you could even be getting a slice of some random gamer dude's actual computer. Am I wrong here?


I don't know what platform vast.ai uses, but what I have noticed is that CPU compute is pretty slow on those instances. Specifically, the tokenization stage was unusually slow for no apparent reason. I had to give that up and use Google Cloud for my research project.


Sometimes vast.ai is running GPUs on Fly.io that people with YC credits have spun up and added to their marketplace. Those would have been fast though.

They run on literally anything someone installs their agent on.


Not as fast as the L40S, but Runpod.io has the A40 48GB at a $0.28/hr spot price, so if it's mainly VRAM you need, this is a much cheaper option. Vast.ai has it for the same price as well.


Runpod is definitely cheaper than we are! We are not the cheapest GPU/hour you can get on any hardware iteration. That's not what we're about, and it is 100% legit to point out that there are workloads that make more sense on other platforms. It would be very weird if that wasn't the case.


Suddenly cutting prices in half shows that the business model is in dire straits.


What it shows is that we're sold out of one part but not the next part up. We're not cutting all our prices in half. We'd just rather source more L40S's than A10's, for what I think are pretty obvious reasons.

This all happened because we were having internal meetings about trying to find A10s to rack, and Kurt stopped and said "wtf are we doing".

If it'll make you feel better, we'll continue to charge you the previous list price for L40S GPU hours.


They buy them at ~$12K, so they pay them off in about a year.

Nice business to be in, I guess.
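
Hedged back-of-the-envelope on that claim (the ~$12K card price is the parent's figure, and this assumes 100% utilization): at $1.25/hr, continuous rental covers the card in a bit over a year, before power, hosting, and idle time.

    # Simple payback estimate at the new list price, assuming the parent's
    # ~$12K per-card cost and 100% utilization (both optimistic assumptions).
    card_cost = 12_000           # USD, claimed purchase price per L40S
    rate = 1.25                  # USD per GPU-hour at the new price
    hours = card_cost / rate     # 9,600 hours
    print(f"{hours:.0f} hours ≈ {hours / 24 / 365:.1f} years of continuous rental")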


LOL.



