dhruvdh's comments

Why would you use this over vLLM?

We run vLLM in certain production instances; it is a pain for most non-NVIDIA architectures. A bit of digging around and we realized that most of it is just a wrapper on top of PyTorch function calls. If we can do without the batch processing that vLLM supports, we are good, and that is what we did here.

Batching is how you get ~350 tokens/sec on Qwen 14B with vLLM (7900 XTX): by running 15 requests at once.

Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?
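
For what it's worth, here is a minimal sketch of how that kind of throughput is reached: fire concurrent requests at a vLLM OpenAI-compatible server and let its continuous batching share the forward passes. The model id, port, and prompt below are placeholders, not the exact setup behind the numbers above.

  # Sketch: 15 concurrent requests against a local vLLM server so that
  # continuous batching can overlap them. Model id and port are placeholders.
  import requests
  from concurrent.futures import ThreadPoolExecutor

  URL = "http://localhost:8000/v1/completions"

  def generate(i):
      resp = requests.post(URL, json={
          "model": "Qwen/Qwen1.5-14B-Chat",
          "prompt": f"Request {i}: explain continuous batching in one paragraph.",
          "max_tokens": 128,
      })
      return resp.json()["choices"][0]["text"]

  # 15 workers roughly matches the "15 requests at once" figure above.
  with ThreadPoolExecutor(max_workers=15) as pool:
      for text in pool.map(generate, range(15)):
          print(len(text.split()), "words")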


Driver mismatch issues. We mostly use publicly available instances, so the drivers change as the instances change, according to their base image. Not saying it won't work, but it was more painful to figure out vLLM than to write a simple inference script and do it ourselves.

What's the point of the 8000 LOC limit? Has anyone worked on a project with a LOC limit? Why was the limit put in place?


It's just a way to keep the code size in check and make sure it can be read and understood relatively easily. Don't overthink it. I doubt much, if any, research went into picking the limit. The line width is over 120 characters in many places, and the code inevitably ends up looking like

  cache_key = (device, st, dtype, op, arg, tuple(ref(x) for x in srcs)) if base is None else (st, ref(base))


Seems the code sample contradicts your first statement


I think this might be an example of https://en.m.wikipedia.org/wiki/Goodhart's_law

The line count probably does still act as a limit on complexity overall but perhaps less than hoped for.


Indeed, I was making a point.


This is truly depressing because the aspirations of tinygrad are so appealing in terms of being concise, effective and maintainable. Then, instead, they throw comprehensibility entirely out of the window.


For comparison, the PyTorch repo has ~400k lines of C, ~850k lines of C++, and more than 1.5 million lines of Python code.

PyTorch does more than tinygrad, but does it really do 343x more things?


If PyTorch does the 1-2 things you need and Tinygrad doesn't, then what are you going to use?

The Python source distribution has long maintained the philosophy of “batteries included” – having a rich and versatile standard library which is immediately available, without making the user download separate packages.

https://peps.python.org/pep-0206/

OTOH:

  Simple is better than complex.
  Complex is better than complicated.
https://peps.python.org/pep-0020/


PyTorch of course. Or alternatively a lib or custom code on top of TinyGrad. Is that a problem?


geohot explained on one of his streams, and per my terrible memory: “tiny” is a way of expressing the architecture constraint that the system should not attempt to target [(many hardware architectures and their optimizations) * (many model, training, etc. variants)] like PyTorch does - which requires maintaining a shit ton of code and a staff/community behind Meta. Instead, tinygrad should provide core abstractions that can be composed to accomplish a similar set of targets, but for only one hardware architecture (for now, I guess). He is releasing a companion hardware item which would fund the development, I believe.


I think you massively underestimate the complexity of PyTorch. Even if we exclude all GPUs except for AMD, and exclude clang (required for the AOT engine), PyTorch depends on almost every ROCm library. And internally it depends on the original Triton library, and on a forked Triton, and on aotriton, which depends on a forked MLIR (because AMD doesn't contribute these changes to upstream MLIR), which depends on another forked LLVM/Clang (because the LLVM API is not stable enough for them, I guess). And then there is MIOpen/rocBLAS/hipBLASlt/hipSOLVER/rocFFT/etc. - libraries with gigabytes (!) of autogenerated code. Additionally, there are dozens of smaller linked libraries like oneDNN, LIBXSMM, magma, numpy, openBLAS, all needed for running "things". So even with the autogenerated code excluded, consider multiplying the 1.5 million LOC by 100.


Probably.


Easily


uh, ya? lol


Right now there doesn't seem to be much point. IIRC they had a 1000 LOC limit on the core part of the code early in the project.

The README no longer mentions the limit and it looks like they just raise it whenever needed. Three months ago it was bumped to 6500 LOC. One month ago it was bumped to 8000 lines.


A tech-debt ceiling, so to speak. There might be some use to it. It's still inevitably increased, but only after debate, discussion, and a lot of time in between spent really considering the form and impact of the code being added to fit within the constraint.


To keep it "tiny". (IIRC geohot started it because he thought pytorch and others were bloated and a simple ml framework would be inherently better)


It used to be 1,000. I guess it’s just a reminder to be succinct.


Looking at the code base right now, apparently to produce some of the most unreadable code possible (https://github.com/tinygrad/tinygrad/blob/master/tinygrad/re...)

LOC limits have to be one of the worst incentives you can give programmers.


The only one I can think of is the dwm window manager (https://dwm.suckless.org/), which used to prominently mention a SLOC limit of 2000. It doesn't seem to be mentioned on the landing page anymore; not sure if it's still in effect.


There are benefits to having a low number of lines of code, e.g. if you want to print the program out on paper (and reduce the number of pages), store it on a disk with limited storage (although the number of bytes is a more useful measure then), or read it and understand it in less time than a longer program, etc. Of course, a limit on the number of characters per line is also necessary, then.

However, that doesn't solve everything. There are many things it does not accurately measure, e.g. complexity, the amount of stuff on one line, program speed, memory usage, etc. Those are other things to measure, and it can be helpful to reduce memory usage and so on, but that is not measured by the number of lines of code.


Cyclomatic complexity would be a better measurement.
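
As a rough illustration (my own sketch, not something from the thread), the metric boils down to counting decision points per function; dedicated tools like radon implement it properly, but a crude approximation fits in a dozen lines:

  # Crude cyclomatic-complexity estimate: 1 + decision points per function.
  # A sketch only; tools such as radon compute the metric properly.
  import ast, sys

  BRANCHES = (ast.If, ast.IfExp, ast.For, ast.While,
              ast.ExceptHandler, ast.BoolOp, ast.comprehension)

  def complexity(fn):
      return 1 + sum(isinstance(n, BRANCHES) for n in ast.walk(fn))

  tree = ast.parse(open(sys.argv[1]).read())
  for node in ast.walk(tree):
      if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
          print(f"{node.name}: {complexity(node)}")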


To stay Tiny


No new features.


Yeah, and any updates to the model that make Recall attractive cannot go back in time and reanalyze the past to be more useful.

At least they will learn a lot from this; in 6-12 months, and perhaps with a new generation of NPUs, we might have something attractive.


The NPU runs this Silica model at 1.5 watts. MacBooks cannot even drive multiple monitors in this price range.


The MacBooks have an NPU too. Just nobody has done anything with them.


The MacBook NPU falls roughly 3x short of the 45 TOPS threshold required for Copilot+ PC branding.


The FPGA being used is, I believe, one of the lowest-specced SKUs.

AWS instance prices are more of a supply/demand/availability thing; it would be more interesting to compare from a total cost of ownership / perf-power-area perspective.


I don't know what you are trying to say here. If one system doesn't need to move as much data because it is more flexible, that is a good thing. What do we gain by making it "fair"?


If you're limiting the size of the model to 110 million parameters (105 MiB assuming int8) because that's what will fit onto your FPGA, then of course it's going to be more energy efficient than a Broadwell-era Xeon with a 24GB RTX 3090. It's like concluding that a rickshaw is more efficient than a train, something that will absolutely be true in a technical sense if you're only transporting a single passenger, but makes no sense if you're transporting hundreds if not thousands of passengers.

A more apt comparison would have been with a phone made in the past 5 years; even without an AI accelerator chip, I'm sure you could manage 20-30+ t/s from a 110M model, though this depends entirely on the memory bandwidth of the phone.
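
Some back-of-the-envelope numbers for that claim (the bandwidth figure is an assumption for illustration, not a measurement): if token generation is memory-bound, the weights have to be streamed once per token, so a phone's LPDDR bandwidth alone puts the ceiling well above 20-30 t/s for a 110M int8 model.

  # Rough bandwidth-bound ceiling for a 110M-parameter int8 model on a phone.
  # The ~40 GB/s LPDDR bandwidth is an assumed, illustrative figure.
  params = 110e6
  model_bytes = params * 1           # int8: 1 byte per weight (~105 MiB)
  phone_bw = 40e9                    # bytes/s

  ceiling = phone_bw / model_bytes   # tokens/s if purely memory-bound
  print(f"{model_bytes / 2**20:.0f} MiB, ~{ceiling:.0f} tok/s ceiling")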


I would imagine the importance of weights depends on the prompt. How do you decide which weights are important?


Yeah, that is the point more or less - it dynamically chooses the weights layer by layer depending on the internal state.

A somewhat technical explanation here: https://kolinko.github.io/effort/equations.html
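
As a toy illustration of the general idea only (not the actual effort algorithm, which the link above describes properly): spend compute only on the weight columns matched to the largest-magnitude activations, chosen per input.

  # Toy sketch: approximate a matrix-vector product by keeping only the
  # columns matched to the largest |activations|. Not the real method.
  import numpy as np

  def approx_matvec(W, x, effort=0.3):
      k = max(1, int(len(x) * effort))     # fraction of the work to spend
      idx = np.argsort(-np.abs(x))[:k]     # most "important" inputs
      return W[:, idx] @ x[idx]            # skip the rest entirely

  rng = np.random.default_rng(0)
  W, x = rng.normal(size=(64, 256)), rng.normal(size=256)
  full, cheap = W @ x, approx_matvec(W, x)
  print(np.corrcoef(full, cheap)[0, 1])    # how close the cheap pass gets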


There is a VitisAI execution provider for ONNX, and you can use ONNX backends for inference frameworks that support it. More info here - https://ryzenai.docs.amd.com/en/latest/

But regardless, 16 TOPs is no good for LLMs. Though there is a Ryzen AI demo that shows Llama 7B running on these at 8 tokens/sec. A sub-par experience for a sub-par LLM.
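
If you do want to try the ONNX route, a minimal sketch with ONNX Runtime looks like the following; the model path is a placeholder, and the VitisAI provider is only used when the Ryzen AI stack is installed, otherwise this falls back to the CPU provider.

  # Sketch: prefer the VitisAI execution provider when it is available,
  # otherwise fall back to CPU. "model.onnx" is a placeholder path.
  import onnxruntime as ort

  wanted = ["VitisAIExecutionProvider", "CPUExecutionProvider"]
  available = [p for p in wanted if p in ort.get_available_providers()]

  sess = ort.InferenceSession("model.onnx", providers=available)
  print("active providers:", sess.get_providers())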


In the benchmark you have linked, you can clearly see that the performance of the CPU-only implementation and the NPU implementation is identical.

https://github.com/amd/RyzenAI-SW/blob/main/example/transfor...

What this should tell you is that "15 TOPs" is an irrelevant number in this benchmark. There are exactly two FLOPs per parameter per generated token. Loading the parameters takes more time than processing them.

There are people with less than 8GB of VRAM who can't load these models into their GPU and end up with the exact same performance as on the CPU. The 12 TFLOPs of the 3060 Ti 8GB are "no good" for LLMs, because the bottleneck for token generation is memory bandwidth.

My Ryzen 2700 gets 7 tokens per second at 50 GFLOPs. What does this tell you? The NPU can saturate the memory bandwidth of the system.

Now here is the gotcha: Have you tried inputting very large prompts? Because that is where the speedup is going to be extremely noticeable. Instead of waiting minutes on a 2000 token prompt, it will be just as fast as on GPUs, because the initial prompt processing is compute bound.

Also, before calling something subpar, you're going to have to tell me how you are going to put larger models like Goliath 70b or 120b onto your GPU.
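
To put rough numbers on the bandwidth argument (model size, quantization, and memory bandwidth below are assumptions, not measurements from this thread): at the memory-bandwidth ceiling, the compute a 7B model actually needs is a tiny fraction of 15 TOPS.

  # Back-of-envelope: why token generation is bandwidth-bound, not FLOP-bound.
  # All three figures below are assumptions for illustration.
  params = 7e9          # 7B-parameter model
  bytes_per_w = 0.5     # 4-bit quantization
  mem_bw = 40e9         # ~dual-channel DDR4, bytes/s

  model_bytes = params * bytes_per_w      # ~3.5 GB streamed per token
  tok_per_s = mem_bw / model_bytes        # ~11 tok/s ceiling
  flops_needed = 2 * params * tok_per_s   # two FLOPs per weight per token

  print(f"ceiling: {tok_per_s:.1f} tok/s, "
        f"compute at that rate: {flops_needed / 1e12:.2f} TFLOP/s")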


Thanks, I was looking for information on this. It seems to be slower than pure-CPU inference on an M2, and probably much worse than a ROCm GPU-based solution?


Because the NPU isn't for high-end inferencing. It's a relatively small coprocessor that is supposed to do a bunch of tasks with high TOPS/watt without engaging the way more power-hungry GPU.

At release time, the Windows driver, for example, included a few video-processing offloads used by Windows frameworks (for example, by MS Teams for background removal) - so that such tasks use less battery on laptops and free up the CPU/GPU for other tasks on desktops.

For higher-end processing you can use the same AIE-ML coprocessors in various chips previously available from Xilinx and now under the AMD brand.


> the same AIE-ML coprocessors

They're not the same - Versal ACAPs (whatever you want to call them) have the AIE1 arch while Phoenix has the AIE2 arch. There are significant differences between the two arches (local memory, bfloat16, etc.).


Phoenix has AIE-ML (what you call AIE2); Versal has a choice of AIE (AIE1) and AIE-ML (AIE2) depending on the chip you buy.

Essentially, AMD is making two tile designs optimized for slightly different computations and claims that they are going to offer both in Versal, but NPUs use exclusively the ML-optimized ones.


Wow, that's simply embarrassing.


You can't buy these Pro variants from Microcenter, for example, but you can get them in pre-built OEM desktops. They are mostly meant for enterprise customers who buy in bulk, I think.


It's not like Microsoft is working on "Windows AI Studio" [1], or released Orca, or Phi. It's not like there's any talk of AI PCs with mandatory TOPs requirements for Windows 12. Big bad Microsoft coming for your local AI, beware.

[1] https://github.com/microsoft/windows-ai-studio


The whole embrace, extend...?

> Mistral Remove "Committing to open models" from their website

That was 5 hours ago.

Without having insider details it is hard to know why, but the coincidence of timing with the Microsoft deal is not lost on me. It could have even been a stipulation.


I have no other explanation for why Microsoft has started aggressively innovating again (with the arrival of Satya) than my theory that the US DoD realized the country's tool of dominance in the future will be predominantly tech superiority instead of military power. Microsoft's new strategy of running everything on the cloud aligns with this, even if it may also have been motivated by the fact that most people now only own a battery-constrained mobile device, with laptops getting smaller and thinner.


You’re downvoted for the snarky tone, I guess, but you’re absolutely right.


Even easier to rug-pull your own team's project than someone else's.

