
Sigh.

It's not like the zero-copy buzzword is going to help you during training: all your weights have to stay on the GPU, you're going to sample your training data randomly, and your data sits on networked storage anyway, so mmap HURTS. You're better off just using O_DIRECT.
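
For context, a minimal sketch of what "just use O_DIRECT" means here: reading one randomly chosen chunk of a training shard while bypassing the page cache. The file name, chunk size, and 4096-byte alignment are assumptions for illustration, not anyone's actual pipeline (Linux-only; O_DIRECT needs _GNU_SOURCE):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 4096          /* assumed block/alignment size */
    #define CHUNK (1 << 20)     /* assumed 1 MiB read per sample */

    int main(void) {
        /* hypothetical training shard */
        int fd = open("shard-000.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, ALIGN, CHUNK) != 0) return 1;

        /* Random-access read of one aligned chunk, bypassing the page
           cache. Offset, length, and buffer must all be aligned, or
           the read fails with EINVAL. */
        off_t offset = 42 * (off_t)CHUNK;   /* index picked by the sampler */
        ssize_t n = pread(fd, buf, CHUNK, offset);
        if (n < 0) perror("pread");
        else printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }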

Similarly, as long as you run your inference on a GPU, it's not like you can mmap... And I have indeed worked on inference runtimes for mobile devices: in the rare cases where we needed to run CPU-only (hey, your phone has had a GPU since forever too), at $PREVIOUS_JOB we did have a mmap-able model format, which also helps in TEE/SGX/whatever enclave tech. Oh, and there is no Python at all.
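
To illustrate the mmap-able format point, a minimal sketch, assuming a single flat weights file called "model.bin" rather than any particular runtime's real layout: the weights are read straight out of the OS page cache, shared across processes, with no copy into a private heap buffer.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* hypothetical model file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Pages fault in lazily and stay in the shared page cache, so
           startup is cheap and several processes can share one copy. */
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* assumes the hypothetical format is just raw float32 weights */
        const float *weights = p;
        printf("first weight: %f\n", weights[0]);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }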

The recent development of ggml is interesting because it catches a use case the "big ML shop infra" guys don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them in production instead of those 1000000000 bizarre accelerators, yet everyone on HN seems to have one, and hey, it's fast enough for LMs. They're rather unique: CPU + high-bandwidth RAM + accelerators, with the RAM fully shared with the CPU, instead of a discrete GPU with its own memory.

tldr: it's not like the "big ML shop infra" guys are stupid and leave performance on the table. They just don't run their production workloads on MacBooks. That's where the community shines, right?




On a Mac, mmap definitely works for the GPU since it’s all the same unified memory.


In llama.cpp, inference runs on the CPU, using AVX2 optimizations. You don't need a GPU at all.
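
For anyone curious, a minimal sketch of the kind of AVX2 kernel such a runtime leans on: a fused multiply-add dot product over float32 vectors. Compile with -mavx2 -mfma; the names are illustrative, not llama.cpp's actual code, and n is assumed to be a multiple of 8 to keep it short:

    #include <immintrin.h>
    #include <stdio.h>

    static float dot_avx2(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb */
        }
        /* Horizontal sum of the 8 accumulator lanes. */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        printf("%f\n", dot_avx2(a, b, 8));   /* prints 36.000000 */
    }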

It runs on my 2015 ThinkPad!


This! Thank you.



