steeve · 2024-10-12T20:41:21 1728765681

I mean, we (zml) clocked MI300X ($20k) at 30% faster than H100 ($30k).

So…

wmf · 2024-10-12T21:12:44 1728767564

That was then. Now it's about MI325 vs. B100.

peterhhchan · 2024-10-12T20:48:27 1728766107

What about power consumption? edit: My understanding from about a year ago is that AMD and NVDA's chips were priced similarly in terms of performance per watt.

steeve · 2024-10-02T21:47:13 1727905633

You can look us up at https://github.com/zml/zml, we fix that.

andyferris · 2024-10-02T22:33:01 1727908381

Wait, looking at that link I don't see how it avoids downloading CUDA or ROCm. Do you use MLIR to compile to GPU without using the vendor-provided tooling at all?

steeve · 2024-10-03T18:24:00 1727979840

We do use ROCm and CUDA. We just sandbox them with the model and download only the needed parts, which are about 1/10th of the size.

steeve · 2024-10-02T20:40:57 1727901657

Hi, we (ZML) fix that: https://github.com/zml/zml

latchkey · 2024-10-04T17:09:55 1728061795

Works out of the box on our MI300x. Fantastic work steeve!

https://x.com/HotAisle/status/1842245896085356949

fazkan · 2024-10-02T20:45:51 1727901951

This is pretty cool. Is there a document that shows which AMD drivers are supported out of the box?

steeve · 2024-10-02T21:36:08 1727904968

We are in line with ROCm 6.2 support. We actually just opened a PR to bump to 6.2.2: https://github.com/zml/zml/pull/39

steeve · 2024-09-24T08:02:44 1727164964

We (ZML) measured MI300X at 30% faster than H100. These are great chips!

steeve · 2024-09-17T15:43:30 1726587810

pretty easy, usually the hardest part is figuring out what the python code is doing

steeve · 2024-08-10T13:37:40 1723297060

The last one is so very true

steeve · 2024-08-10T09:06:54 1723280814

Bazel is amazing and doing C++ with anything else is like going back to the stone age.

The Bazel team has done an amazing job, the VM is embedded and trimmed. It’s as easy as download and run.

And worst case you can invest in Buck2.

steeve · 2024-07-16T07:00:41 1721113241

Yes, that’s how it works (pipeline parallelism)

mg · 2024-07-16T07:03:12 1721113392

Interesting. Let's do the math ...

Let's say the model has 50B parameters and 50 layers. That would mean about one billion values have to travel through the wifi for every generated token?

I wonder how much data that is in bytes and how long it takes to transfer them.

blackbear_ · 2024-07-16T07:28:35 1721114915

It's not the parameters that are sent, it's the layer outputs. That makes for a few thousand floats per token.

mg · 2024-07-16T07:37:52 1721115472

Woops! I would have thought the number of neurons roughly equals the number of parameters, but you are right. The number of parameters is much higher.

tama_sala · 2024-07-16T17:46:24 1721151984

The embedding size is only 8k, while the parameters are 70B. So it's a huge difference.
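
To put rough numbers on the point above, here is a back-of-the-envelope sketch. It assumes a hidden size of 8192 (the "8k" mentioned above), 2 bytes per value (fp16, an assumption) and a 100 Mbit/s link (also an assumption). The function names are made up for illustration, and latency and protocol overhead are ignored.

```python
def activation_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes sent across one pipeline-stage boundary for a single token."""
    return hidden_size * bytes_per_value


def transfer_ms(num_bytes: int, link_mbit_per_s: float) -> float:
    """Idealized transfer time in milliseconds (bandwidth only, no latency)."""
    bits = num_bytes * 8
    return bits / (link_mbit_per_s * 1e6) * 1e3


hidden_size = 8192          # assumed from the "8k" embedding size in the thread
link_mbit_per_s = 100.0     # assumed wifi throughput, purely illustrative

per_hop = activation_bytes_per_token(hidden_size)
print(f"activations per token per hop: {per_hop} bytes")
print(f"idealized transfer time: {transfer_ms(per_hop, link_mbit_per_s):.2f} ms")

# For contrast: the weights themselves, which are NOT sent per token.
weights_bytes = 70_000_000_000 * 2  # 70B parameters at 2 bytes each
print(f"weights: {weights_bytes / 1e9:.0f} GB")
```

So each hop between machines carries about 16 KB per token under these assumptions, versus roughly 140 GB for the full set of weights.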

steeve · 2024-07-15T20:46:07 1721076367

It’s outrage and bots.

Fuck Musk.

steeve · 2024-07-01T18:09:55 1719857395

1. John Oliver’s piece on science reporting

2. Didier Raoult