The L40S has 48GB of VRAM; I'm curious how they're able to run Llama 3.1 70B on it. The weights alone would exceed that. Maybe they mean quantized/fp8?
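The claim that the weights alone overflow 48GB is easy to sanity-check with back-of-the-envelope math (a rough sketch; this counts weights only and ignores KV cache and runtime overhead):

```python
# Rough weight-memory math for Llama 3.1 70B at common precisions.
# 70e9 parameters; bytes per parameter depends on the data type.
PARAMS = 70e9

for dtype, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1), ("int4 (Q4)", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{dtype:>10}: ~{gb:.0f} GB of weights")

# fp16/bf16: ~140 GB -> far beyond one 48 GB L40S
# fp8:       ~70 GB  -> still too big for one L40S
# int4:      ~35 GB  -> fits, with room left for KV cache
```

So fp16 needs roughly 140GB just for weights, which is also why the 2xA100 80GB setup mentioned below works out.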
I just had to implement GPU clustering in my inference stack to support Llama 3.1 70b, and even then I needed 2xA100 80GB SXMs.
I was initially running my inference servers on fly.io because they were so easy to get started with, but I eventually moved elsewhere because the prices were so high. I pointed out to someone there who e-mailed me that it was really expensive vs. others, and they basically just waved me away.
For reference, you can get an A100 SXM 80GB spot instance on google cloud right now for $2.04/hr ($5.07 regular).
I feel like GeForce Now has been growing quite steadily. They onboard new titles every week, including some big ones like World of Warcraft recently, and they've stood up streaming DCs in quite a lot of places now.
I expected lag, but I haven't experienced it beyond an occasional moment, maybe once a month. I just attribute that to local internet issues; it passes and I keep playing.
As someone who used to work on that platform back when it was still called PlayStation Now: no, it is not. There are a few rumors and alleged leaks, though.
I hadn't even heard of the L40S until I started renting one to get more memory for small training jobs. I didn't benchmark it, but it seemed to be pretty fast for a PCIe card.
Amazon's g6 instances are L4-based with 24GB of VRAM, half the capacity of the L40S, and SageMaker on-demand prices are at about this rate. Vast.ai is cheaper, though it's a little more like bidding and availability varies.
Hi, I'm the person who wrote that sizing comment in the draft of this article. I've been trying for a while and have been unsuccessful at getting 405B running on any of the GPU machines. I suspect I'd need a raw 8xA100 node to do it at Q4. I doubt there is any reasonable combination of L40S cards that can do it on fly.io; it's just too big. I suspect that in time the 70B model will be brought up to be roughly equivalent, but realistically it's already at the GPT-4 threshold as is. I've found that 70B is more than sufficient in practice.
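The 8xA100 guess lines up with rough Q4 sizing (an illustrative sketch; the node shapes are my assumptions, and it ignores interconnect differences between SXM and PCIe cards, which matter a lot at this scale):

```python
# Back-of-envelope check: Llama 3.1 405B at Q4 vs. available VRAM.
PARAMS = 405e9
Q4_BYTES = 0.5                           # ~4 bits per weight
weights_gb = PARAMS * Q4_BYTES / 1e9     # ~202.5 GB of weights alone

a100_node_gb = 8 * 80                    # 640 GB on an 8xA100 80GB node
l40s_8x_gb = 8 * 48                      # 384 GB across eight L40S cards

print(f"Q4 weights: ~{weights_gb:.1f} GB")
print(f"8xA100: {a100_node_gb} GB, 8xL40S: {l40s_8x_gb} GB")

# The weights nominally fit either, but KV cache, activations, and
# quantization overhead at useful context lengths, plus PCIe (vs. NVLink)
# bandwidth between cards, push the practical requirement toward the
# 8xA100 SXM node.
```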
There are definitely GPU providers where you can buy cheaper L40S hours than us. I'm not entirely sure what their system architectures are, or whether they're just buying in absolutely spectacular volume, because we are cutting pretty close to the bone with our pricing.
One cost factor we have that other providers might not have (I'd love to know): we have to dedicate individual racked physical hosts to each group of GPUs we deploy, because we don't (/can't, depending on how you think about systems security) allow GPU-enabled workloads to share hardware with non-GPU-enabled workloads, and we don't allow anyone to share kernels.
But like we said in the post: we're still figuring this stuff out. What we know is: at the same price level, we're consistently sold out of A10 inventory.
Hadn't heard of vast.ai before and looked into it. The prices seem really good. Then saw "Our software allows anyone to easily become a host by renting out their hardware."
Also, vast.ai and fly.io are just not apples to apples in general. Sure, go to vast, get yourself a VM or VPS or Docker container or whatever instance they're giving you, and do your stuff. But that is not even close to the same set of features/infra/platform that fly.io offers, is it? I'm not sure why people keep thinking GPU pricing on fly should match an instance on some generic GPU farm; with vast you could even be getting a slice of some random gamer dude's actual computer. Am I wrong here?
I don't know what hardware vast.ai runs on, but what I noticed is that CPU compute is pretty slow on those instances. Specifically, the tokenization stage was unusually slow for no apparent reason. I had to give that up and use Google Cloud for my research project.
Sometimes vast.ai is running GPUs on Fly.io that people with YC credits have spun up and added to their marketplace. Those would have been fast though.
They run on literally anything someone installs their agent on.
Not as fast as the L40S, but Runpod.io has the A40 48GB for $0.28/hr spot price, so if it's mainly VRAM you need, this is a lot cheaper option. Vast.ai has it for the same price as well.
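One crude way to compare these offers is dollars per GB of VRAM per hour (using the spot prices quoted in this thread; they change constantly, so treat this as illustrative only):

```python
# $/GB-of-VRAM-hour for the spot prices mentioned in the thread.
# Ignores compute speed, interconnect, and reliability differences.
offers = {
    "A40 48GB (Runpod/Vast spot)": (0.28, 48),
    "A100 80GB SXM (GCP spot)":    (2.04, 80),
}

for name, (price_per_hr, vram_gb) in offers.items():
    print(f"{name}: ${price_per_hr / vram_gb:.4f} per GB-hour")

# The A40 is roughly 4x cheaper per GB of VRAM, which is why it's
# attractive for memory-bound (rather than compute-bound) workloads.
```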
Runpod is definitely cheaper than we are! We are not the cheapest GPU/hour you can get on any hardware iteration. That's not what we're about, and it is 100% legit to point out that there are workloads that make more sense on other platforms. It would be very weird if that wasn't the case.
What it shows is that we're sold out of one part but not the next part up. We're not cutting all our prices in half. We'd just rather source more L40S's than A10's, for what I think are pretty obvious reasons.
This all happened because we were having internal meetings about trying to find A10s to rack, and Kurt stopped and said "wtf are we doing".
If it'll make you feel better, we'll continue to charge you the previous list price for L40S GPU hours.