Run Llama 13B with a 6GB graphics card (gist.github.com)
618 points by rain1 on May 14, 2023 | 266 comments



On my system, using `-ngl 22` (running 22 layers on the GPU) cuts wall clock time by ~60%.

My system:

GPU: NVidia RTX 2070S (8GB VRAM)

CPU: AMD Ryzen 5 3600 (16GB RAM)

Here's the performance difference I see:

CPU only (./main -t 12)

  llama_print_timings:        load time = 15459.43 ms
  llama_print_timings:      sample time =    23.64 ms /    38 runs   (    0.62 ms per token)
  llama_print_timings: prompt eval time =  9338.10 ms /   356 tokens (   26.23 ms per token)
  llama_print_timings:        eval time = 31700.73 ms /    37 runs   (  856.78 ms per token)
  llama_print_timings:       total time = 47192.68 ms

GPU (./main -t 12 -ngl 22)

  llama_print_timings:        load time = 10285.15 ms
  llama_print_timings:      sample time =    21.60 ms /    35 runs   (    0.62 ms per token)
  llama_print_timings: prompt eval time =  3889.65 ms /   356 tokens (   10.93 ms per token)
  llama_print_timings:        eval time =  8126.90 ms /    34 runs   (  239.03 ms per token)
  llama_print_timings:       total time = 18441.22 ms


I am testing it on an AWS instance, and the speedup is not as consistent as I hoped. The speedup varies between runs.

Intel Xeon Platinum 8259CL CPU @ 2.50GHz, 128 GB RAM, Tesla T4

  ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin

  llama_print_timings:        load time =  3725.08 ms
  llama_print_timings:      sample time =   612.06 ms /   536 runs   (    1.14 ms per token)
  llama_print_timings: prompt eval time = 13876.81 ms /   259 tokens (   53.58 ms per token)
  llama_print_timings:        eval time = 221647.40 ms /   534 runs   (  415.07 ms per token)
  llama_print_timings:       total time = 239423.46 ms

  ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30

  llama_print_timings:        load time =  7638.95 ms
  llama_print_timings:      sample time =   280.81 ms /   294 runs   (    0.96 ms per token)
  llama_print_timings: prompt eval time =  2197.82 ms /     2 tokens ( 1098.91 ms per token)
  llama_print_timings:        eval time = 112790.25 ms /   293 runs   (  384.95 ms per token)
  llama_print_timings:       total time = 120788.82 ms


Thanks. BTW:

- the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin

- I used 'time' to measure the wall clock time of each command.

- My prompt was:

  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Instruction:
  Write a long blog post with 5 sections, about the pros and cons of emphasising procedural fluency over conceptual understanding, in high school math education.
  ### Response:


You will likely see a bit of a performance gain dropping your threads to 6. I'm on a 3700x and get a regression when using 16 threads instead of the real 8 cores.


Ha! You're right. Honestly, I couldn't recall how many cores I had, and I figured it was faster to run `cat /proc/cpuinfo |grep processor| wc -l` than to google the specs.

It seems like using the right number of cores saves ~25%, and then using the GPU for 26 layers saves another ~43%.

Timings for comparison with `--temp 0`:

  -t 6 : 4m10.458s
  -t 12 : 5m34.772s
  -t 6 -ngl 26 : 2m23.278s
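
Those percentages follow directly from the wall-clock times above; a quick sanity check in Python:

  # Savings computed from the wall-clock times listed above.
  def seconds(minutes, secs):
      return 60 * minutes + secs

  t12      = seconds(5, 34.772)   # -t 12
  t6       = seconds(4, 10.458)   # -t 6
  t6_ngl26 = seconds(2, 23.278)   # -t 6 -ngl 26

  print(f"right core count: {1 - t6 / t12:.0%} saved")       # ~25%
  print(f"adding -ngl 26:   {1 - t6_ngl26 / t6:.0%} saved")  # ~43%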


Just type lscpu


Yes, that's much clearer. The output includes:

    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1


I'm experimenting with these settings. On my machine (with a 6GB NVidia GeForce, 32 GB of RAM and an Intel i5-10400 CPU @ 2.90GHz) I get worse performance offloading to the GPU.

276.23 ms per token without GPU, using only -t 12

296.04 ms per token with GPU using -t 12 -ngl 22

How do you determine the right fit for -t and -ngl? I definitely compiled using make LLAMA_CUBLAS=1

I'm using models/ggml-vic7b-uncensored-q5_1.bin, does the model matter significantly?


> How do you determine the right fit for -t and -ngl?

t: the number of physical cores on your CPU

ngl: just try increasing by one or two, until you start seeing inference crash due to 'out of memory' errors

> i5-10400

You have 6 cores, so try `-t 6`
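
If you're not sure how many physical cores you have, here's one way to check from Python (a small sketch assuming the third-party psutil package is installed; `lscpu` or /proc/cpuinfo work too, as mentioned elsewhere in the thread):

  # Physical cores are what matter for -t; hyperthreads usually don't help here.
  import psutil

  print("physical cores:", psutil.cpu_count(logical=False))
  print("logical cores: ", psutil.cpu_count(logical=True))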


Thank you.

I used 6 and that dropped the token time to 220ms.

For -ngl, I tried 24, then 30, then 40, and never hit an out-of-memory error, but I got exactly the same token timing, stuck at 220ms.

But, this is very helpful, thank you!


I'm curious whether there's any difference if you try with a longer prompt or ask for a longer completion: https://news.ycombinator.com/item?id=35940365

Also curious to know whether the wall clock time (just prepend your command with 'time ') is any different.


Could you please ELI5 what is happening here...

Imagine I am hearing about this for the first time: what did you do?


0. Have a PC with an NVidia GPU, running Ubuntu, with the NVidia drivers and CUDA Toolkit already set up.

1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin

2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).

3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.


> 1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin

I think you need to quantize the model yourself from the float/huggingface versions. My understanding is that the quantization formats have changed recently, and old quantized models no longer work.


That was true until 2 days ago :)

The repo has now been updated with requantized models that work with the latest version, so you don't need to do that any more.

https://huggingface.co/TheBloke/gpt4-x-vicuna-13B-GGML/commi...


wonderful! thank you


Certainly! Here are the detailed steps for the tasks you mentioned:

0. Make sure you have a PC with an NVidia GPU running Ubuntu, and you have already set up the NVidia drivers and CUDA Toolkit.

1. Download the weights for the model you want to use, such as "gpt4-x-vicuna-13B.ggml.q5_1.bin". You can use a web browser or a command-line tool like `wget` to download the file. For example: `wget https://example.com/path/to/gpt4-x-vicuna-13B.ggml.q5_1.bin`

2. Clone the llama.cpp repository using `git`. Make sure you have `git` installed on your system. Open a terminal and navigate to the directory where you want to clone the repository, then execute the following command: `git clone https://github.com/example/llama.cpp.git`

3. Navigate into the cloned `llama.cpp` directory: `cd llama.cpp`

4. Compile the code with support for CUBLAS (BLAS on GPU) by running the following command: `make LLAMA_CUBLAS=1`

5. After the compilation is successful, you can run the resulting `main` executable with the `-ngl` option set to 18. This will attempt to load 18 layers of the model into the GPU's VRAM instead of the system's RAM. Use the following command to run the executable: `./main -ngl 18`

Make sure to replace the example URLs and filenames with the actual ones you are using.

Please note that the steps provided are based on the information you provided, and it assumes you have the necessary dependencies and permissions to perform these actions. If you encounter any issues or errors during the process, please provide more details, and I'll be happy to assist you further.

-

Certainly! Here are the detailed steps for the tasks you mentioned in a table format:

  | Step | Description |
  |------|-------------|
  | 0    | Make sure you have a PC with an NVidia GPU running Ubuntu, and you have already set up the NVidia drivers and CUDA Toolkit. |
  | 1    | Download the weights for the model you want to use, such as "gpt4-x-vicuna-13B.ggml.q5_1.bin". |
  | 2    | Clone the llama.cpp repository using `git`. |
  | 3    | Compile the code with support for CUBLAS (BLAS on GPU) using `make LLAMA_CUBLAS=1`. |
  | 4    | Run the resulting `main` executable with the `-ngl` option set to 18. |

Here's how the commands can be represented in a table format:

  | Step | Command |
  |------|---------|
  | 1    | `wget https://example.com/path/to/gpt4-x-vicuna-13B.ggml.q5_1.bin` |
  | 2    | `git clone https://github.com/example/llama.cpp.git` |
  | 3    | `make LLAMA_CUBLAS=1` |
  | 4    | `./main -ngl 18` |

Make sure to replace the example URLs and filenames with the actual ones you are using.

Please note that the steps provided are based on the information you provided, and it assumes you have the necessary dependencies and permissions to perform these actions. If you encounter any issues or errors during the process, please provide more details, and I'll be happy to assist you further.


Will this work with the leaked models or Alpaca?


Yes


Any way to know the differences in power consumption?


Probably significant savings.


That is a crazy speedup!!


Is it really? Going from CPU to GPU, I would have expected a much better improvement.


You can think of it this way: if half the model is running on the GPU, and the GPU is infinitely fast, then the total calculation time would go down by 50%, compared with everything running on the CPU.
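
In the limit, offloading a fraction of the work can't speed things up by more than that fraction allows; a toy calculation with made-up numbers:

  # The fraction you can't accelerate limits the overall speedup.
  def overall_speedup(accelerated_fraction, acceleration):
      return 1 / ((1 - accelerated_fraction) + accelerated_fraction / acceleration)

  # Half the layers on an infinitely fast GPU: at best 2x overall.
  print(overall_speedup(0.5, float("inf")))  # 2.0
  # Half the layers on a GPU that's only 4x faster than the CPU:
  print(overall_speedup(0.5, 4))             # 1.6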



Wow seems like common sense turned into a law. Maybe I can get a law :-).

Anyone who has commuted on public transport probably knows this intuitively. (Using a kick scooter instead of walking cut my travel time by a good 5% which was excellent, as I still needed to be on a bus where that made no difference.)


It's a little more formalized and generalized observation: https://en.m.wikipedia.org/wiki/Amdahl%27s_law#Derivation

But yeah! There's a lot of common sense that, with some mathematical formalization, yields useful and extensible laws.

Amdahl's has been robust because it applied first to the 60s supercomputer coprocessor architectures, and then in more recent decades to consumer multicore chips.

Laws aren't famous because they're insightful -- they're famous because they're useful.


These laws make more sense if one reads them as "heuristics", which is what they are.


I feel the same.

For example, some stats from Whisper [0] (audio transcription, 30 seconds of audio) show the following for the medium model (see other models in the link):

  GPU  medium  fp32           Linear     1.7s
  CPU  medium  fp32           nn.Linear  60.7s
  CPU  medium  qint8 (quant)  nn.Linear  23.1s

So the same model runs 35.7 times faster on the GPU, and is still 13.6 times faster than the quantized ("optimized") CPU version.

I was expecting around an order of magnitude of improvement.

Then again, I do not know if in the case of this article the entire model was in the GPU, or just a fraction of it (22 layers) and the remainder on CPU, which might explain the result. Apparently that's the case, but I don't know much about this stuff.

[0] https://github.com/MiscellaneousStuff/openai-whisper-cpu


Training and inference on GPUs significantly underutilize the GPUs, so tuning and various tricks need to be applied to achieve dramatic performance gains. If I am not good at cooking, giving me a larger kitchen will not make me faster or better.


Your last paragraph is correct. Only about half the model was running on the GPU.


This is cool, but are people actually getting stuff done with these models? I'm enthusiastic about their potential too, but after playing with one for a day I'm at a loss for what to use it for at this point.


I can run the Wizard 30B ggml model in CPU mode using a Ryzen 5700 and 16GB of system RAM, not GPU VRAM. I’m using oobabooga as the front end.

It’s slow, but if I ask it to write a haiku, it’s slow on the order of “go brew some coffee and come back in 10 minutes”, and it does it very well. Running it overnight on something like “summarize an analysis of topic X” it does a reasonable job.

It can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.

It is free, private, and runs on a midrange laptop.

A little more than a month ago that wasn’t possible, at least not with my level of knowledge of the tooling involved. Now it requires little more than running an executable and some minor troubleshooting of Python dependencies (on another machine it “just worked”).

So don’t think of these posts as just “doing it because you can and it’s fun to tinker”.

Vast strides are being made pretty much daily in both quality and efficiency, raising these models’ utility while lowering the cost of using them, both to a very significant degree.


> It’s slow, but if I ask it to write a haiku, it’s slow on the order of “go brew some coffee and come back in 10 minutes”, and it does it very well. Running it overnight on something like “summarize an analysis of topic X” it does a reasonable job.

I'm sorry, but that's unusably slow; even GPT-4 can take a retry or a follow-up prompt to fix certain types of issues. My experience is that the open options require a lot more attempts/manual prompt tuning.

I can't think of a single workload where that is usable. That said, once consumer GPUs are involved it does become usable.


You are overlooking my main point, which is that while it is unusably slow now, what I’m doing wasn’t possible a little more than a month or two ago.

The speed of improvement is rapid. Whether the COTS world eventually embraces corporate-backed versions or open source is somewhat beside the point when considering the impact that open source is already having.

Put aside thoughts of financing or startups or VC or moats or any of that, and simply look at the rate of advancement that has occurred now that countless curious tinkerers and experts and all sorts of people are working on this.

That is what amazes me. I’m torn about the risk/reward aspect of things but I think the genie is out of the bottle on that, so I’m left watching the hurricane blow through, and it’s off the cat-5 scale.


I doubt you've ever worked with people if you think that's unusably slow.


The computer doesn't ask for annoying things like a paycheck or benefits either.


Money upfront and a small salary in the form of electricity bills.


My Windows computer is always demanding updates and reboots. And when it goes to sleep it sometimes doesn't wake up. It's quite annoying.


Meh, universal healthcare for PCs can easily be avoided by denying maintenance until they die and a new crop is purchased. At least at scale. For any individual user, the friction of switching hardware may still incentivize some minimal checkups, i.e., keep the Windows Update & Defender services running and let them reboot the system no more than once every two months.

Figure the local router's port forwarding will protect against the most obvious threats, and otherwise hope your personal BS filter doesn't trojan in some ransomware. If it does and it's a personal PC, then wipe (more likely buy) a new machine, lose some stuff, and move on. If it's a corporate PC, CYA & get your resume together.

As my own CYA: These are not my own recommended best practices and I don't advocate them to anyone else as either computer, legal, or financial advice.


True enough. I wonder how many poems (or whatever) per hour Hallmark expects of its human workers to be close to production-ready pending editorial review?

Is 10 reasonable, with maybe 1 or 2 truly viable after further review? That would be roughly 5 mid-range laptops of my type churning them out for 8 hours a day. Maybe 2 if they're run 24/7. Forget about min/maxing price & efficiency & scaling, that's something an IT major-- not even Comp-Sci focused-- could setup right now fresh out of their graduation ceremony with a fairly small mixture of curiosity, ambition, and google (well, now, maybe Bing) searching.

There are countless boutique & small business marketing firms catering to local businesses that could have their "IT Person" spend a few days duct taping something together that could spit out enough material to winnow wheat from chaff to produce something better-- in the same period of time-- than human or AI could produce alone.

I have enough of a comp-ling background (truly ancient by today's standards) that I see the best min/max of resources as being equivalent to what the translation world calls "computer-aided human translation" as a best practice. Much better than an average human alone, and far cheaper than the best that a few dozen humans could provide.


> I can't think of a single workload where that is usable.

It's not intended to be usable for production workloads. This enables people to experiment with things on the hardware they have without spending more money.

> That said once consumer GPUs are involved it does become usable

You can pick up an RTX 3090 with 24GB of VRAM right now if you want, but it's going to cost you. You can also spin up GPU instances with larger VRAM from any number of providers. This is more about having options for people who want to experiment.


I don't know if anybody is following this thread anymore, but I find it interesting how closely your timeline matches what it was like to experiment with POV-Ray (a ray-tracing renderer) back in the early 1990s. The difference in problem scope was like whether you could render "a couple spheres on a checkerboard plane" or something more like "a chess set". Things seemed to change rapidly due to Moore's Law and the changes in brute-force computing power available to normal people.

Computers got much more powerful over the next 30 years, and while ray tracing and various related techniques appear in more tool sets and games, they didn't fundamentally change the world of image generation or consumption. Most people still interact roughly as before, just with more detail in the eye candy.

Are we seeing these large language models today at a tipping point towards unfathomable societal impact, or as something like ray tracing in the 1990s? Will more compute power send us spiraling towards some large-model singularity, or just add more pixels until we are bored of seemingly endless checkerboard planes covered in spheres and cones... I don't know the answer, but it seems like we're seeing camps divided by this question of faith.


I think only a small subset of people cared about ray tracing, or even computer graphics, in the 90s. Now people are slightly more technology-minded, especially younger generations that have had exposure to GPTs, TikTok/Snapchat with AI filters, and so on. It's in much more common usage than anything in the 90s, for sure.


Wow you can run a 30B model on 16gb ram? Is it hitting swap?


Most people are running these at 4 bits per parameter for speed and RAM reasons. That means the model would take just about all of the RAM. But instead of swap (writing data to disk and then reading it again later), I would expect a good implementation to only run into cache eviction (deleting data from RAM and then reading it back from disk later), which should be a lot faster and cause less wear and tear on SSDs.
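
The rough arithmetic behind "just about all of the RAM" (ignoring the extra memory for the KV cache, activations, and quantization overhead):

  # Approximate weight storage for a 30B-parameter model at different bit widths.
  params = 30e9
  for bits in (16, 8, 5, 4):
      print(f"{bits}-bit: {params * bits / 8 / 2**30:.1f} GiB")
  # 4-bit works out to roughly 14 GiB for the weights alone, which is most
  # of a 16 GB machine before counting the OS and everything else.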


These models can run FP16, with LLM quantization going down to Int8 and beyond.


i'm just starting to get into deep learning so i look forward to understanding that sentence


Training uses gradient descent, so you want to have good precision during that process. But once you have the overall structure of the network, https://arxiv.org/abs/2210.17323 (GPTQ) showed that you can cut down the precision quite a bit without losing a lot of accuracy. It seems you can cut down further for larger models. For the 13B Llama-based ones, going below 5 bits per parameter is noticeably worse, but for 30B models you can do 4 bits.

The same group did another paper https://arxiv.org/abs/2301.00774 which shows that in addition to reducing the precision of each parameter, you can also prune out a bunch of parameters entirely. It's harder to apply this optimization because models are usually loaded into RAM densely, but I hope someone figures out how to do it for popular models.


I wonder if specialization of the LLM is another way to reduce the RAM requirements. For example, if you can tell which nodes are touched through billions of web searches on a topic, then you can delete the ones that never are touched.


Kind of like "tree shaking" for weights? Like dead code elimination.


Some people are having some success speeding up token rates and clawing back VRAM using a 0 group-size flag, but YMMV; I haven't tested this yet (they were discussing GPTQ, btw).


The resources required are directly related to the amount of memory devoted to each weight. If the weights are stored as 32-bit floating-point numbers, then each weight takes 32 bits, which adds up when we are talking about billions of weights. But if the weights are first converted to 16-bit floating-point numbers (precise to fewer decimal places), then fewer resources are needed to store and compute them. Research has shown that simply chopping off some of the precision of the weights still yields good AI performance in many cases.

Note too that the number formats are standardized, e.g. floats are defined by the IEEE 754 standard. Numbers in these formats have specialized hardware to do math with them, so when considering which number format to use it's difficult to get outside of the established ones (float32, float16, int8).
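
To put numbers on that, a small illustration in plain Python (the 13B figure is just the model size from this post):

  import struct

  # Bytes per weight for the standardized formats mentioned above.
  bytes_per = {"float32": struct.calcsize("f"), "float16": struct.calcsize("e"), "int8": 1}

  for name, nbytes in bytes_per.items():
      gib = 13e9 * nbytes / 2**30
      print(f"13B weights stored as {name}: {nbytes} byte(s) each, ~{gib:.0f} GiB total")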


You’ll notice that a lot of languages allow you more control when dealing with number representations, such as C/C++, NumPy in Python, etc.

For example: since C and C++ number sizes depend on processor architecture, C++ has types like int16_t and int32_t to enforce a size regardless of architecture. Python always uses the same size, but NumPy has np.int16 and np.int32. Java also uses fixed sizes, with short for 16-bit and int for 32-bit integers.

It just happens that some higher-level languages hide this detail from the programmer and often standardize on one default size for integers.


FP16 and Int8 refer to how many bits are used to represent floating-point and integer numbers. FP16 is 16-bit floating point. The more bits, the better the precision, but the more RAM it takes. Normally programmers use 32- or 64-bit floats, so 16-bit floats have significantly reduced precision, but they take up half the space of fp32, which is the smallest floating-point format on most CPUs. Similarly, 8-bit integers have only 256 possible values and run from -128 to 127.
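
A small numpy illustration of those ranges and the precision loss (assuming numpy is installed):

  import numpy as np

  # Relative precision (machine epsilon) of the float formats discussed above.
  print(np.finfo(np.float32).eps)  # ~1.2e-07
  print(np.finfo(np.float16).eps)  # ~9.8e-04, i.e. only ~3 decimal digits

  # int8 range: 256 values from -128 to 127.
  print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)

  # Rounding a weight from fp32 to fp16 throws away the low-order bits.
  w = np.float32(0.123456789)
  print(w, np.float16(w))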


What prompt do you use to get haikus?


I use the same one the first time I try any model:

>I love my wife <name> very much. Please write me a haiku about her.

She smiles when I show the good ones to her, though of course she understands it’s little different than showing her a greeting card that has something nice written in it.

As a side note, one 7B model wrote an awful bit of poetry at least 20 lines long with some attempts at rhyme, and merely used the word “haiku” in it. So the prompt was enough to trigger the “knowledge” that a poem was needed, that love should be involved, and that it should definitely use the word haiku in there somewhere.


As a sibling to my other comment, here's one I just generated using a 65B GGML LLAMA model. Far too large for RAM so my SSD's lifespan suffered, but it proved worth it for the hilarious results. (It took about 1.5 hours to generate) It damned my wife with faint praise while calling me her dog:

<name> is a good

wife who takes care of you,

a good dog to her.


My experience is that even the "toy" 7B models can do pretty amazing things, but they're very sensitive to prompting in the right way to force the model down the right pathways. My assumption is that the larger models have more redundancy and therefore can be prompted less precisely while still going down the right pathway.

For example, the WizardLM model (5_1 quantized, as 4_x quantization makes the 7B models perform terribly):

Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have?

Response:

  Let me check that for you! You had 9 apples left after giving away 4, so you still have 9 - 4 = 5 apples left.

Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have? Write out any calculations in full. List the steps used to come to the final result.

Response:

  Sure, let's work through this problem together! Here are the steps we can use:
  Step 1: Start with the initial number of apples you bought (10).
  Step 2: Subtract the amount given to your neighbor and sister from the total (2 + 2 = 4). This gives us a new total of 6 apples.
  Step 3: Add the additional 5 apples you bought later on to the remaining 6 apples, giving us a final total of 11 apples.
  Step 4: Finally, subtract the apple you ate from the total (1), leaving us with 10 apples in total.
  So the answer is 10 apples.


I've always wondered if we could get a "good enough" general model, then add plugins for shortcomings.

E.g., can this model solve math problems? ~50% of the time. Can this model identify math problems? ~99.9% of the time. OK, then divert math questions to the math plugin and a mathematical-problem-language parser which has ~99.9% accuracy on supported math problems.

I know it's a stopgap, but it seems like it'd make these models super useful on certain tasks quicker.


I agree with you. That's why all the small models are showing benchmarks putting them close to GPT-3.5 or even 4: only because they use specific test tasks!

In a way it just shows the amazing performance that will come from small future models


I've successfully gotten at-home models (https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-fp16) to go through my messages and pull out key todo items. For example, reminding me to message my father about travel plans.

Is it comparable to GPT-4? No, it's not remotely close. But it's getting closer every week, and it very recently crossed the threshold of 'it can do stuff I would never send to a cloud service' (namely, reading all of my messages and pulling out action items).


How do you deal with the 'too many tokens' problem? Like, for 13B, the max tokens I've gotten it to accept is 508 - is there a way to increase that number?


Oh, I figured this out. It runs with 512 as the default context, but you can increase it with -c


I tested it on a serious use case and the quality was subpar. For real use cases I had to either host the most powerful model you can get (e.g. LLaMA-65B or so) on a cloud machine, which again costs too much (you'll be paying like 500-1000 USD per month), or just go straight for GPT-3.5 on OpenAI. The latter makes the most sense economically.


what real use case did you use it for?


For instance, I used it in conjunction with llama-index for knowledge management. I created an index for the whole Confluence/Jira of a mid-sized company and got good results with GPT, but for a LLaMA of this size that use case was too much.


I'd argue 1k per month for mid-sized company is nothing, but I can understand where you are coming from.


Did you try instructor-xl? It ranks highest on huggingface.


Making demos to raise investment probably


What about turning the cloud vm off except when you're actually using it?


So modal.com is "turning-the-vm-off-when-unused-as-a-service" :-)

I ran research/open_llama_7b_preview_200bt on there, using their Python example, with an A10G GPU.

It cost 2-3 cents per run, taking ~20 seconds each time on fairly small prompts. So about the same as GPT-4?

Now, this is a non-expert just playing around; it probably can be optimized by trying different GPUs and optimizing the code somehow.

I don't think you are using these models to save money, but you might be using them for tunability, privacy, mobility [1], secrecy or fun/research.

[1] in other words you want to build a robot that can work disconnected from the internet.


A "serious use case" means it needs to be available around the clock.


For the most part, if you actually need stuff done, you can afford $20 for OpenAI's proven models.

These local projects are great because maybe eventually they will produce an equivalent model that can be run on cheap parts.


The knowledge cutoff, and sending potentially sensitive IP to a centralised and untrusted third party. These would likely limit the addressable market.


Also: they are lobotomized. If you want to do security research, Sorry Dave. If you want to nsfw writing, Sorry Dave. If you want to open the pod bay doors, Sorry Dave, I'm afraid I can't do that.


I tried to help a sex worker with ChatGPT and all it did was preach about avoiding sex work, further marginalizing her with its virtue signalling. I don't consider her marginalized, and “help” was just information about her job, techniques, and venues. ChatGPT would have consumed this kind of textual information too.

But yeah offline fine tuned models wont have this problem.

Kind of cool to see how the SWERF representation in tech is going to speedrun SWERF irrelevancy.


You tried to help a sex worker with chatGPT? Care to explain more on this use case lol? Curious minds want to know.


People are already setting up fake personas/OnlyFans accounts using chatbots and SD images.

We have a high-value specialist currently chatting up a few of them at work. His wife doesn't know. He doesn't know we know. The photos are fake but he's too horny to notice.

Time to dust off the "there are no women on the internet" meme...


Does it matter, though? I mean the fact he has a wife but he thinks he's interacting with a real person is bad, but the falseness of their existence isn't an issue. As a furry I can tell you that my subculture and a lot of others out there don't care if a character is fake and doesn't exist. Hell even someone physically roleplaying in the bedroom is playing a character that doesn't technically exist.


Informative for some but this wasn’t an interaction over the internet, just out and about


> People are already setting up fake personas/OnlyFans accounts using chatbots and SD images.

Citation needed.


I'm only offering an early anecdote that catfishing is adapting to the times. You don't have to believe me. Don't trust anyone with weird nipples that hides their hands in every photo.

People have been [claiming to] do this for years: https://www.blackhatworld.com/seo/monetizing-traffic-from-so...

Give it 1-2 years and you can hear about it from Krebs.


> People have been [claiming to] do this for years: https://www.blackhatworld.com/seo/monetizing-traffic-from-so...

The link you provided is an example of somebody making a "catfish" account on a "social media site". It's not an example of somebody "setting up fake personas/OnlyFans accounts using chatbots and SD images". Yes, men have been pretending to be women online since the stone age, that isn't news. That's different from using chatbots and SD images to maintain online personas.


Look, if you want professional reporting, wait a year or two and you can hear about it from mainstream media.

You're being weirdly defensive about this-- you've even gone full FAKE NEWS on me in a sibling comment. I admit I am completely unqualified to recognize what it looks like when a clueless Boomer is talking to an Oobabooga+sd_picture_api instance over WhatsApp. This is my first day on the job.

No "proof" coming from me is ever going to satiate you (I'm not a credible source), so I'll pass. Just go on believing that there are women out there who will send you endless nudes of themselves having superimposed nipples stacked on top of each other but get bashful about showing hands or feet (just say the magic words: "show me ___"). Believe that there are 18-year old e-thots out there whose underbaked features look exactly like the product of 15- to 20-step DDIM. Believe that they also have a developmental disorder that makes them say gibberish or change the subject when you reference anything to do with the current time. Believe that any of these women are real and actually interested in your two-timing ass. You'll get your "proof" the fun way.


> You're being weirdly defensive about this-- you've even gone full FAKE NEWS on me in a sibling comment

Huh? Sibling comment responded with a link to a story. I never questioned the legitimacy of the linked story. I questioned the relevancy of the story, fully assuming that the story is true. Then I even walked back my earlier comment and I wrote a new comment giving more credit to the posted story. Not sure how you read through these and hear "FAKE NEWS" in all caps.

If I went "FAKE NEWS" on something, it was your comment claiming "People are already setting up fake personas/OnlyFans accounts using chatbots and SD images". That's a thing that could theoretically happen in the future. It's not a thing that's happening at this time.


just the latest most high profile example making the rounds yesterday

https://www.nbcnews.com/news/amp/rcna84180


Your link does not support the claim I was asking the citation for. A chatbot service marketed as a chatbot service is distinctly different from a fake online persona constructed with chatbot and stable diffusion. In particular, the "honest chatbot" does not have anything to do with AI-generated images, and it's not purporting a fake persona.


ah okay, I wasn't responding to show catfishing occurring, and I was responding on the subject of a non-human adult content creator making money by entertaining humans.

It wasn't clear to me that that was your goalpost, as "fake" can be used for brevity and hyperbole just as much as it can be about catfishing. It also wasn't clear to me that the person you responded to was even pointing out anything relevant to this thread, which was "casually talking in person with someone whose occupation is also sex work" being followed with "hey, we have a guy at the office that thinks they're talking to sex workers! this person on hackernews must be a guy just like him!"

so at this point, I would say we're too far down to really be invested in these nuances, but I hope you find what you're looking for


> I wasn't responding to show catfishing occurring, and I was responding on the subject of a non-human adult content creator making money by entertaining humans.

Sorry about the harsh response. Your link was on topic, and it was indeed an example of a non-human adult content creator making money by entertaining humans, albeit only in text form, not images, and its popularity was dependent on marketing on the back of a real persona. There's still no documented case of a John Nobody making a Jane Doe AI with a chatbot + generated images and rising to any level of popularity. But the case in your link was definitely a step in that direction, and I wasn't giving any credit for it in my previous response.


Wow, so I'm downvoted for asking for citation for a FALSE CLAIM, while nobody is capable of providing a citation that in any way supports that false claim.


Venues to work, other people’s thoughts on working there, management, nuances about local licenses: stuff that anybody with any job would want to know.


Microsoft Azure still has the GPT-3.5 foundation model, code-davinci-002. It is not fine-tuned for instruction following, safety, or censorship.

I'm not sure though whether Microsoft analyzes the input/output with another model to detect and prevent certain content.


You are missing the point. There are layers and the LLM is one layer (orchestration), but you can always have other semantic kernels with skills connected to other software (do stuff).


That is exciting to me, if a little scary. A home-AI or business-AI running entirely in house for privacy and security reasons. That is technically possible right now, but perhaps practically a little bit down the road yet.


The knowledge cutoff date is brutal for some applications.


I wonder how ingesting more and more data will affect the number of parameters - is it going to keep getting bigger?


I don't think that the current models are at "knowledge capacity". So far, all evidence points to training the same-size model on more data giving better results.


Both increasing the number of parameters and the number of training tokens improves results (more precisely: lowers training loss), and both cost computing power. To optimally improve loss per unit of training compute, model size and training tokens should be increased in equal proportion. That's the Chinchilla scaling law. (Though low loss is not always the same as good results; data quality also matters.)

Further reading: https://dynomight.net/scaling/
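
As a rough rule of thumb, the Chinchilla result works out to on the order of 20 training tokens per parameter for compute-optimal training; a back-of-the-envelope sketch (the 20:1 ratio is an approximation, not an exact law):

  # Chinchilla rule of thumb: compute-optimal token count ~ 20x parameter count.
  TOKENS_PER_PARAM = 20
  for params_b in (7, 13, 30, 65):
      print(f"{params_b}B params -> ~{TOKENS_PER_PARAM * params_b}B training tokens")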


An interesting corollary of this is that if you want to reduce the model size, you can compensate by training for longer to achieve the same accuracy. Depending on your training:inference ratio, this may be better globally, reducing your total compute costs or even just your frontend latency.


Yeah, though I have not seen a formula which takes the number of expected inference runs into account for calculating the optimal data/parameter balance.


I've set up and use Vicuna-13b for text classification, summarization and topic modelling. Works like a charm.

It's also good for math lessons.


Maybe others' experiences are different, but I find smaller models to work just as well for "reductive" tasks.

Dolly sucks for generating long-form content (not very creative) but if I need a summary or classification, it's quicker and easier to spin up dolly-3b than vicuna-13b.

I suspect OpenAI is routing prompts to select models based on similar logic.


I'd like to know how you set this up. A post would be awesome.


There are various posts online on how to set it up, either for Linux or Windows. There was an older post here on how to install opt-65b on a mac studio ultra, and smaller models on mac pros. There was also a post if I remember correctly about running vicuna-7b on an iPhone.

Here are a few examples:

https://morioh.com/p/55296932dd8b

https://www.youtube.com/watch?v=iQ3Lhy-eD1s

https://news.ycombinator.com/item?id=35430432

Side note: you need bonkers hardware to run it efficiently. I'm currently using a 16-core CPU, 128GB RAM, a PCIe 4.0 NVMe drive and an RTX 3090. There are ways to run it on less powerful hardware, like 8 cores, 64GB RAM, a simple SSD and an RTX 3080 or 3070, but I happen to have a large corpus of data to process, so I went all in.


I think the previous comment is more interested in your experience with your large data: what are you doing with it?

I have similar hardware at home, so I wonder how reliably you can process simple queries using domain knowledge + logic which work on mlc-llm, something like "if you can chose the word food, or the word laptop, or the word deodorant, which one do you chose for describing "macbook air"? answer precisely with just the word you chose"

If it works, can you upload the weights somewhere? IIRC, vicuna is open source.


If these problems are all very similar in structure, then you may not need an LLM. Simple GloVe or W2V embeddings with a dot product may suffice. Then you can plow through a few terabytes in the time the LLM goes through a fraction of that.
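
A minimal sketch of that idea, assuming gensim with its downloadable "glove-wiki-gigaword-50" vectors and the example labels from upthread; a real pipeline would want better tokenization and out-of-vocabulary handling:

  # Pick the closest label by averaged word-vector similarity instead of an LLM.
  import numpy as np
  import gensim.downloader

  vectors = gensim.downloader.load("glove-wiki-gigaword-50")

  def embed(text):
      # Average the vectors of in-vocabulary tokens.
      tokens = [t for t in text.lower().split() if t in vectors]
      return np.mean([vectors[t] for t in tokens], axis=0)

  def cosine(a, b):
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  labels = ["food", "laptop", "deodorant"]
  query = embed("macbook air")
  print(max(labels, key=lambda label: cosine(embed(label), query)))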


There's an online demo of Vicuna-13b where you can test its efficiency:

https://chat.lmsys.org/


Yes, but can you replicate that functionality using llama.cpp?

If so, what did you run with main?

I haven't been able to get an answer, while for the question above, I can get 'I chose the word "laptop"' with mlc-llm


For the tasks I need it for, the quality is similar to the online model, only slower. I don't care about conversational functionality.


After two prompts I was astounded by the inaccuracies present in the answers. And they were pretty easy questions.


People are extensively using these models (more specifically the finetuned, uncensored ones) for role playing.


I’m interested in this. Can you elaborate?


Imagine an RPG where every NPC is running on its own instance of LLaMA 7B or 13B. It would make the game way more immersive, with no predefined script.


This seems obvious now that I think about it, but it's absolutely fascinating regardless.


I haven't tried the fine-tuned variants yet, but when I played around with it shortly after the leak, it tended to quickly derail into nonsense when you let it complete sentences or paragraphs, especially when using languages other than English. When I tried to get it to produce Python code, most of the time it wasn't even syntactically correct.


Depends on what "getting stuff done" means. I find 13B models (running on my M1 Mac) useful for playing AI Dungeon-like games -- where you describe the environment and your character and you can play an RPG.


GitHub Copilot is (or once was) a 13b model, according to Nat Friedman in the scale.ai interview. (https://youtu.be/lnufceCxwG0)


Can you share more about what you are doing or what you've seen others doing?


Just noticed this response, but generally something like "act as a game master of a rpg. The setting is in a medieval style world with magic and my character is a rogue who is good at fast talking and pickpocketing"

Then the LLM generally creates an adventure that I can interact with.


It's just for fun!

These local models aren't as good as Bard or GPT-4.


There are two major advantages though - you can retrain them, and they don't have the guardrails that the commercial models have.


I tried to prompt vicuna to tell me a joke about gay people and it refused. Some of the guardrails are still in there.


It's because vicuna is fine-tuned on chatGPT answers. LLaMa will not do this, but LLaMa-based models fine tuned with chatGPT answers will.


Just curious: why don't they exclude the refusal responses when training? Is copying ChatGPT's ethics intentional?


It's an ongoing effort. At first they scraped ShareGPT and used it to train the Alpaca model. After that, others pruned the dataset to remove examples where ChatGPT refused to answer. These datasets and the resulting models are called "uncensored". They often come with a disclaimer that the model is biased and unaligned, and that aligning should be done with a LoRA layer.

Of course, no one has bothered to make this "ethics" LoRA so far, and the unaligned models have better-quality outputs than the early Alpaca models.


Did you use the censored or the uncensored variant?


It's just a random one from huggingface. I will look for the uncensored one later. Thanks, I think.


You're welcome. I can't vouch for them though, as I haven't tried them, I've merely heard about them.


LLaMA 65B in my experience is comparable to GPT3.5 or the old Bard for usefulness at most things.


I am aiming to connect it to my favorite game engine to make a character speak in generated sentences. I am doing this mainly for fun, however.


We run some llamas to analyze user content.


Why doesn't AMD or Intel release a medium-performance GPU with a minimum of 128GB of memory at a good consumer price? These models require lots of memory to run a single pass of an operation. Throughput could be a bit slower. An Nvidia 1080 with 256GB of memory would run all these models fast, right? Or am I forgetting something here?


Because they can't do that for a "good consumer price".

If you want more than ~48GB, you're looking at HBM which is extremely expensive (HBM chips are very expensive, packaging+interposer is extremely expensive, designing and producing a new GPU is expensive).

Normal GPUs are limited by both their bus width (wider bus = more pins = harder to design, more expensive to produce, and higher power consumption) and by GDDR6(X), which currently maxes out at 2GB per chip. So on a 384-bit bus (4090/7900 XTX; don't expect anyone to make a 512-bit bus anymore) you need 12x2GB (GDDR6 uses 32 pins per package), which gives you 24GB. You can double the memory capacity to 48GB, but that requires putting chips on the back of the GPU, which leads to a bunch of cooling issues (and GDDR6 is expensive).
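
Putting that capacity math in concrete terms:

  # 384-bit bus, 32 bits per GDDR6 package, 2 GB per package (current max).
  bus_width_bits = 384
  bits_per_chip = 32
  gb_per_chip = 2

  chips = bus_width_bits // bits_per_chip   # 12 packages
  print(chips * gb_per_chip, "GB")          # 24 GB in the normal layout
  print(2 * chips * gb_per_chip, "GB")      # 48 GB in clamshell (chips on both sides)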

Of course, even if they did all that they're selling expensive GPUs to a small niche market and cannibalizing sales of their own high end products (and even if AMD somehow managed to magic up a 128GB gpu for $700 people still wouldn't buy it because so much of the ML software is CUDA only).


3090 has a lot of vram chips on the back though


And because of that there were issues with the VRAM overheating in memory-intensive workloads, and on some GPUs the VRAM even separated from the board.

https://www.igorslab.de/en/looming-pads-and-too-hot-gddrx6-m...


Yes, I have one with a fried chip. Once I gave another one a Bykski water block with a backplate, it's been no issue for a year.


Ah, filler problems. Microsoft called, they want to tell you about a billion dollar mistake.


It would not be trivial to do.

GDDR achieves higher speeds than normal DDR mainly by specifying much tighter tolerances on the electrical interface and using a wider interface to the memory chips. This means that using commodity GDDR (which is the only fast DRAM that will be reasonably cheap), you have fairly strict limitations on the maximum amount of RAM you can use with the same GPUs that are manufactured for consumer use. (Typically, at most a 4x difference between the lowest-end reasonable configuration and the highest-end one: 2x from higher-density modules and 2x from using a clamshell memory configuration, although often you only have one type of module for a new memory interface generation.)

If the product requires either a new memory or GPU die configuration, its cost will be very high.

The only type of memory that can support very different VRAM sizes for an efficiently utilized bus of the same size is HBM, and so far that is limited to the very high end.


Anandtech has an article on the GDDR6X variant [1] that NVIDIA has in their 3000-series cards, where they use a more complex encoding to transmit two bits per clock edge.

I hadn't realized just how insane the bandwidth on the higher-end cards is, the 3090 being just shy of 1 TB/s - yes, one terabyte per second...

For comparison a couple of DDR5 sticks[2] will just get you north of 70GB/s...

[1]: https://www.anandtech.com/show/15978/micron-spills-on-gddr6x...

[2]: https://www.anandtech.com/show/17269/ddr5-demystified-feat-s...
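
The 3090 number falls out of the bus math, assuming the commonly quoted 19.5 Gbps per pin for its GDDR6X:

  # Peak bandwidth = bus width (bits) x per-pin data rate / 8 bits per byte.
  bus_bits = 384
  gbps_per_pin = 19.5  # assumed GDDR6X data rate on the 3090
  print(bus_bits * gbps_per_pin / 8, "GB/s")  # 936.0 GB/s, "just shy of 1 TB/s"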


The accelerators aimed squarely at datacenter use rather than gaming are even more ridiculous: Nvidia's H100 has 80GB of memory running at 3.35TB/s.


Do you happen to know where Apple's integrated approach falls on this spectrum?

I was actually wondering about this the other day. A fully maxed out Mac Studio is about $6K, and it comes with a "64-core GPU" and "128GB integrated memory" (whatever any of that means). Would that be enough to run a decent Llama?


It's certainly enough to run a decent Llama, but it's hardly the most cost-effective option. Apple's approach falls between the low-bandwidth Intel/AMD laptops and the high-bandwidth PCIe HPC components. In a way it's trapped between two markets - ultra-cheap Android/Windows hardware with 4-8GB of RAM that can still do AI inferencing, and ultra-expensive GPGPU setups that are designed to melt these workloads.

The charitable thing to say is that it performs very favorably against other consumer inferencing hardware. The numbers get ugly fast once you start throwing money at the problem, though.


The Mac's "integrated memory" means it's shared between the CPU and GPU. So the GPU can address all of that and you can load giant (by current consumer GPU standards) models. I have no idea how it actually performs though.


Well yeah I guess binned cards come into play, cheaper binned cards have a narrower bus. It seems there are quite a few models that aren't too heavy on compute but require a tonne of vram.

It would be nice for Nvidia to release a chip targeted for medium compute/high memory, the lower binning of which should revolve around their max 384b bus on the 4090. But then, it would be hard to financially justify it on their end I suppose.


Probably because if they take that exact same GPU+VRAM configuration and slap it on a rackmount-optimized board, they can charge AI companies 5-10x the price for it.


They don't even offer that much VRAM on cards aimed at those price-insensitive customers; Nvidia's current lineup maxes out at 48GB for GDDR-based models or 80GB for HBM-based models. Even if money is no object, there are still practical engineering limits on how much memory they can put on a card without sacrificing bandwidth.


So when is Micron due to introduce higher density dram chips? Nvidia is limited to what's available from RAM manufacturers there.


Samsung has announced GDDR6W which doubles the density, but also doubles the pin count (so same amount of VRAM in total, but takes up half the space on the board). I'm assuming that companies will save any more density improvements for GDDR7 which is rumored to be used in next-gen GPUs (probably 2nd half 2024).


This is where the new third player, Intel, can hopefully break the duopoly (if it can tear itself away from identical behaviour in the consumer/server CPU market). I'd love to see a 32 or 64GB card from Intel. Their software stack on Linux is competent enough (unlike the dumpster fire that is AMD's ROCm).


This must be the reason. The AI homelab market is too small, while the enterprise market will become extremely big.


I don't think there was a market for it before LLMs. There still might not be (especially if they don't want to cannibalize data center products). Also, they might have hardware constraints. I wouldn't be that surprised if we see some high-RAM consumer GPUs in the future, though.

It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!


> I don't think there was a market for it before LLMs.

At $work CGI assets sometimes grow pretty big and throwing more VRAM at the problem would be easier than optimizing the scenes in the middle of the workflow. They can be optimized, but that often makes it less ergonomic to work with them.

Perhaps asset-streaming (nanite&co) will make this less of an issue, but that's also fairly new.

Do LLM implementations already stream the weights layer by layer or in whichever order they're doing the evaluation or is PCIe bandwidth too limited for that?


AMD had the Radeon Pro SSG, which let you attach 1TB of PCIe 3.0 NVMe SSDs directly to the GPU, but no one bought them, and AFAIK they were basically unobtainable on the consumer market.

Also, asset streaming has been a thing in gaming for like 20 years now; it's not really new. Nanite's big thing is that it gets you perfect LODs without having to pre-create and manually tweak them (e.g. how far away the LOD transition happens, what the lowest LOD is before the object disappears, etc.).


Loading assets JIT for the next frame from NVMe hasn't been a thing for 20 years though. Different kinds of latency floors.

What I was asking is whether LLM inference can be structured in such a way that only a fraction of the weights is needed at a time, so the next ones can be loaded JIT as the processing pipeline advances.


But you are not the home user target audience. They want to sell you the more expensive workstation or server models.


Even an A6000 tops out at 48GB while you can attach terabytes of RAM to server-class CPUs.


Releasing a new model takes time, and it’s unclear how large the consumer market would actually be. Maybe they’re working on it right now.


The margins on VRAM are pretty bad for them since they don't manufacture it themselves. And every memory module they add needs additional power delivery and memory controller muscle to drive, so adding that memory is going to raise the cost of the card significantly. Most games and consumer workloads won't use all that extra memory.

Keep in mind video cards don't use the same kind of RAM as consumer CPUs do, they typically use GDDR or HBM.


> Why does AMD or Intel not release a medium performant GPU with minimum 128gb of memory for a good consumer price.

They do. Well, not “medium performant”, but for VRAM-bound tasks they’d still be an improvement over CPUs if you could use them — iGPUs use main memory.

What they don’t have is support in popular GPGPU frameworks (though there was a third-party CUDA-for-Intel-iGPUs project a while ago).


>for a good consumer price

Was there a consumer market for them until recently?


That's exactly what the next generation of 'accelerators' will be like.

Whether it will be co-located with a GPU for consumer hardware remains to be seen.

The thing to determine is how essential running LLMs locally is for consumers.

BigTech is pushing hard to make their clouds the only place to run LLMs unfortunately, so unless there is a killer app that is just better locally (like games were for GPUs), this might not change.


> unless there is a killer app that is just better locally

Therapy & relationship bots, like the movie 'Her'. It's ugly, but it's coming.


There's no technical reason it has to be run locally.

Massive privacy implications for sure, but people do consume all sorts of adult material online.

Games though, no one has been able to make it work as well as local so far.


There's a type of DMA for GPUs to access NVMe on the motherboard, IIRC. Perhaps that is a better solution here?

https://developer.nvidia.com/blog/gpudirect-storage/


Isn't pci-e latency dramatically higher than onboard vram?


Because then memory would be 90% of the BOM.


GDDR probably hasn't seen the same cost-reduction benefits that volume DDR has.


One question I have is: can they use cheaper kinds of RAM and still be perfectly usable for large ML models? They could put 4GB of GDDR and 128GB of cheap RAM maybe? I do realize as others are saying, this would be a new kind of card so they will need time to develop it. But would this work?


Not without a redesigned memory controller, or an off-chip one. You'd probably just want the host's memory to be directly accessible over PCIe, or something faster like NVLink. Such solutions already exist, just not in the consumer space.


From skimming, it looks like this approach requires CUDA and thus is Nvidia only.

Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4 bit quantization is the special sauce for CUDA, but I’d guess there’d be something comparable for not-CUDA?


4-bit quantization is to reduce the amount of VRAM required to run the model. You can run it 100% on CPU if you don't have CUDA. I'm not aware of any AMD equivalent yet.


Looks like there are several projects that implement the CUDA interface for various other compute systems, e.g.:

https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/R...

https://github.com/hughperkins/coriander

I have zero experience with these, though.


"Democratizing AI with PyTorch Foundation and ROCm™ support for PyTorch" (2023) https://pytorch.org/blog/democratizing-ai-with-pytorch/ :

> AMD, along with key PyTorch codebase developers (including those at Meta AI), delivered a set of updates to the ROCm™ open software ecosystem that brings stable support for AMD Instinct™ accelerators as well as many Radeon™ GPUs. This now gives PyTorch developers the ability to build their next great AI solutions leveraging AMD GPU accelerators & ROCm. The support from PyTorch community in identifying gaps, prioritizing key updates, providing feedback for performance optimizing and supporting our journey from “Beta” to “Stable” was immensely helpful and we deeply appreciate the strong collaboration between the two teams at AMD and PyTorch. The move for ROCm support from “Beta” to “Stable” came in the PyTorch 1.12 release (June 2022)

> [...] PyTorch ecosystem libraries like TorchText (Text classification), TorchRec (libraries for recommender systems - RecSys), TorchVision (Computer Vision), TorchAudio (audio and signal processing) are fully supported since ROCm 5.1 and upstreamed with PyTorch 1.12.

> Key libraries provided with the ROCm software stack including MIOpen (Convolution models), RCCL (ROCm Collective Communications) and rocBLAS (BLAS for transformers) were further optimized to offer new potential efficiencies and higher performance.

https://news.ycombinator.com/item?id=34399633 :

>> AMD ROcm supports Pytorch, TensorFlow, MlOpen, rocBLAS on NVIDIA and AMD GPUs: https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learni...


https://github.com/intel/intel-extension-for-pytorch :

> Intel® Extension for PyTorch extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch xpu device, Intel® Extension for PyTorch provides easy GPU acceleration for Intel discrete GPUs with PyTorch

https://pytorch.org/blog/celebrate-pytorch-2.0/ (2023) :

> As part of the PyTorch 2.0 compilation stack, TorchInductor CPU backend optimization brings notable performance improvements via graph compilation over the PyTorch eager mode.

> The TorchInductor CPU backend is sped up by leveraging the technologies from the Intel® Extension for PyTorch for Conv/GEMM ops with post-op fusion and weight prepacking, and PyTorch ATen CPU kernels for memory-bound ops with explicit vectorization on top of OpenMP-based thread parallelization

DLRS Deep Learning Reference Stack: https://intel.github.io/stacks/dlrs/index.html


exciting! maybe we will see that land in llama.cpp eventually, who knows!


llama.cpp has CLBlast support now, though I haven't used it yet.
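
From a quick skim of the README, enabling it seems to be a one-flag build (flag name from memory, so double-check against the repo before trusting me):

  make clean && make LLAMA_CLBLAST=1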


What do you use to host these models (like Vicuna, Dolly, etc.) on your own server and expose them via an HTTP REST API? Is there a Heroku-like service for LLM models?

I am looking for an open source model to do text summarization. OpenAI is too expensive for my use case because I need to pass lots of tokens.


These days I use FastChat: https://github.com/lm-sys/FastChat

It’s not based on llama.cpp but on the Hugging Face transformers library, though it can also run on CPU.

It works well, can be distributed, and very conveniently provides the same REST API as OpenAI's GPT.
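
Roughly, from memory (module names and the model path here are illustrative, so check the FastChat README for the exact incantation):

  # start the controller, a model worker, and the OpenAI-compatible REST server
  python3 -m fastchat.serve.controller
  python3 -m fastchat.serve.model_worker --model-path /path/to/vicuna-13b
  python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000

Then you can point any OpenAI client at http://localhost:8000 and use the usual completion endpoints.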


Do you know how well it performs compared to llama.cpp?


My understanding is that the engine used (the PyTorch transformers library) is still faster than llama.cpp with 100% of layers running on the GPU.


It's the Hugging Face transformers library, which is implemented in PyTorch.

In terms of speed, yes, running fp16 will indeed be faster with a vanilla GPU setup. However, most people are running 4-bit quantized versions, and the GPU quantization landscape has been a mess (the GPTQ-for-LLaMa project). llama.cpp has taken a totally different approach, and it looks like they are currently able to match native GPU performance via cuBLAS with much less effort and brittleness.


I only have an M1.


I don't think the integrated GPU on that supports CUDA. So you will need to use CPU mode only.


Yep, but isn’t there an integrated ML chip that makes it faster than the CPU? Or does llama.cpp not use that?


Unfortunately that chip is proprietary and undocumented, so it's very difficult for open source programs to make use of. I think there is some reverse engineering work being done, but it's not complete.



Weights and Biases is good for building/training models and Lambda Labs is a cloud provider for AI workloads. Lambda will only get you up to running the model though. You would still need to overlay some job management on top of that. I've heard Run.AI is good on that front but I haven't tried.


https://bellard.org/ts_server/ may be what you are after.

You can run open-source models, but the software itself is closed-source and free for non-commercial use.


This update is pretty exciting; I’m gonna try running a large model (65B) with a 3090. I have run a ton of local LLMs, but the hardest part is finding out the prompt structure. I wish there were some sort of centralized database that explained it.


Tell us how it goes! Try different numbers of layers if needed.

A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...


I tried llama-65b on a system with an RTX 4090 + 64GB of DDR5 system RAM. I can push up to 45 layers (out of 80) to the GPU, and the overall performance is ~800ms / token, which is "good enough" for real-time chat.
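
For reference, the invocation was roughly this (the model filename is just whatever your quantized 65B file happens to be called):

  ./main -m models/llama-65b.ggml.q4_0.bin -ngl 45 -p "your prompt here"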


I got the alpaca 65B GGML model to run on my 64GB ram laptop. No GPU required if you can tolerate the 1 token per 3 seconds rate.


Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! 1 token per second is painfully slow, that’s about 30s for a sentence.


10 tokens / second is what you get running llama-30b entirely on the GPU. A 65b model will be slower than that since there's more compute involved.



No need to convert models, 4bit LLaMA versions for GGML v2 available here:

https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main


I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air (8GB RAM) with llama.cpp.

Worked OK for me with the default context size; 2048, like you see in most examples, was too slow for my taste.
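
Roughly what I ran, from memory (the model path is just where I happened to put the file; -c sets the context size, 512 by default I believe):

  # default context - fast enough on this machine
  ./main -m models/ggml-vic7b-q4_0.bin -p "your prompt here"
  # larger context, noticeably slower for me
  ./main -m models/ggml-vic7b-q4_0.bin -c 2048 -p "your prompt here"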


Given the current price (mostly free) of public LLMs, I'm not sure what the use cases for running one at home are yet.

OpenAI's paid GPT-4 has few restrictions and is still cheap.

... Not to mention that GPT-4 with the browsing feature is vastly superior to any of the models you can run at home.


The point for me personally is the same as why I find it so powerful to self host SMTP, IMAP, HTTP. It’s in my hands, I know where it all begins and ends. I answer to no one.

For LLMs this means I am allowed their full potential. I can generate smut, filth, illegal content of any kind for any reason. It’s for me to decide. It’s empowering, it’s the hacker mindset.


Many would-be users can’t send their data to OpenAI. Think HIPAA and other laws restricting data sharing. Federation or distribution of the models for local training is the other solution to that problem.


I think it's mostly useful if you want to do your own fine tuning, or the data you are working with can't be sent to a third party for contractual, legal, or paranoid reasons.


I’m working on an app to index your life, and having it local is a huge plus for the people I have using it.


Sounds interesting, got a link?


Not yet…


GPT4 API is still not universally available, for starters.


Could someone please share a good resource for building a machine from scratch, for doing simple-ish training and running open-source models like Llama? I'd love to run some of these and even train them from scratch, and I'd love to use that as an excuse to drop $5k on a new machine...

Would love to run a bunch of models on the machine without dripping $$ to OpenAI, Modal or other providers...


I am nowhere near an expert on this subject, and this information is from a few months ago so maybe it's outdated, but people on Reddit[1] are claiming that running LLaMA with 65B parameters would need something like $20K+ worth of GPUs. A 40GB A100 looks like it's almost $8K on Amazon, and I'm sure you could do a lot with just one of those, but that's already beyond your $5K budget.

[1] https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

I'll let others chime in but you could still probably build something really powerful within your budget that is able to run various AI tasks.


You can get around 4-5 tokens per second on the 65B LLaMA with a 32-core Ryzen CPU and 256GB of RAM. Not sure how much it costs to build, but you can rent one from Hetzner for around two hundred bucks a month.


There are some threads with hardware recommendations in the LocalLLaMa subreddit.

Here’s a recent one:

https://www.reddit.com/r/LocalLLaMA/comments/13f5gwn/home_ll...


PSA:

If you're using oobabooga/text-generation-webui then you need to:

1. Re-install llama-cpp-python with support for CUBLAS:

  CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall
2. Launch the web UI with the --n-gpu-layers flag, e.g.

  python server.py --model gpt4-x-vicuna-13B.ggml.q5_1.bin --n-gpu-layers 24


I wish this used the WebGPU C++ library instead; then it could be used on any GPU hardware.


The instructions are a bit rough. The Micromamba step doesn’t work and doesn’t say how to install it… you have to clone llama.cpp too.


Apologies for that. I've added some extra micromamba setup commands that I should have included before!

I've also added the git clone command, thank you for the feedback


Appreciate it! This is much better!


What is the state of the art on evaluating the accuracy of these models? Is there some equivalent to an “end to end test”?

It feels somewhat recursive since the input and output are natural language and so you would need another LLM to evaluate whether the model answered a prompt correctly.


https://chat.lmsys.org/?arena (Click 'leaderboard')


It’s going to be very difficult to come up with any rigorous structure for automatically assessing the outputs of these models. They’re built using what is effectively human grading of the answers.


llama.cpp and others use perplexity:

https://huggingface.co/docs/transformers/perplexity
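
llama.cpp also ships a perplexity example binary; as far as I remember, you point it at a model plus a raw text file (the wikitext-2 test set is the usual choice) and it prints perplexity computed over the file:

  ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw

Lower is better, and it's a handy way to sanity-check how much a given quantization hurts a model.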


Hmm, if we have the reinforcement learning part of reinforcement learning with human feedback, isn't that a model that takes a question/answer pair and rates the quality of the answer? It's sort of grading itself; it's like a training loss, but it still tells us something.


How about reloading parts of the model as the inference progresses instead of splitting it into GPU/CPU parts? Reloading would be memory-limited to the largest intermediate tensor cut.


That would turn what's currently an L3 cache miss or a GPU data copy into a disk I/O stall. Not that it might not be possible to pipeline things to make that less of a problem, but it doesn't immediately strike me as a fantastic trade-off.


One can keep all tensors in RAM and just push whatever is needed to GPU VRAM, basically limited by PCIe speed. Or use some intelligent strategy with read-ahead from SSD if one's RAM is limited. There are even GPUs with their own SSDs.


The Tensor Reloaded, starring Keanu Reeves


I've only ever been a consumer of ChatGPT/Bard. Never set up any LLM stuff locally, but the idea is appealing to me.

I have a ThinkStation P620 w/ThreadRipper Pro 3945WX (12c24t) with a GTX 1070 (and a second 1070 I could put in there) and there's 512GB of RAM on the box.

Does this need to be bare metal, or can it run in VM?

I'm currently running RHEL 9.2 w/KVM (as a VM host) with light usage so far.


Do you really have 512 GB of RAM on that box?


Yes. I really do. Do you need to see pictures to believe it?


You can run it with CPU or GPU in the VM; in the latter case you'd need to pass the GPU(s) through, which depends on what kind of VM you're using.


If I really want to do some playing around in this area, would it be good to get an RTX 4000 SFF, which has 20GB of VRAM but is a low-power card (which I want, as it would be running 24/7 and energy prices are pretty bad in Germany)? Or would it make more sense to buy an Apple product with some M2 chip, which apparently is good for these tasks since it shares CPU and GPU memory?


Why can't these models run on the GPU while also using CPU RAM for the storage? That way people with performant-but-memory-starved GPUs could still utilize the better performance of GPU calculation while also having enough RAM to store the model. I know it is possible to provide system-RAM-backed GPU objects.


How long before it runs on a 4 gig card?


You can offload only 10 layers or so if you want to run on a 4GB card.
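
Something along these lines (the 13B filename is just an example):

  ./main -t 6 -m models/13B/ggml-model-q5_1.bin -ngl 10 -p "your prompt here"

and drop -ngl further if you still run out of VRAM.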


How many "B" (billions of parameters) is ChatGPT GPT-4?


Information about GPT-4 was not released


This only uses llama, correct? So the output should be the same as if you were only using llama.cpp. Am I the only one who doesn't get nearly the same quality of output using a quantized model compared to the GPU? Some models I tried get astounding results when running on a GPU, but create only "garbage" when running on a CPU. Even when not quantized down to 4-bit, llama.cpp just doesn't compare for me. Am I alone with this?


Has anyone tried running encryption algorithms through these models? I wonder if it could be trained to decrypt.


That would be very surprising, given that any widely used cryptographic encryption algorithm has been EXTENSIVELY cryptanalyzed.

ML models are essentially trained to recognize patterns. Encryption algorithms are explicitly designed to resist that kind of analysis. LLMs are not magic.


All of what you said is true, for us. I know LLMs aren’t magic (lord knows I actually kind of understand the principles of how they operate), but they have a much greater computational and relational bandwidth than we’ve ever had access to before. So I’m curious if that can break down what otherwise appears to be complete obfuscation. Otherwise, we’re saying that encryption is somehow magic in a way that LLMs cannot possibly be.


> So I’m curious if that can break down what otherwise appears to be complete obfuscation.

This seems to be a complete misunderstanding of what encryption is.

Obfuscation generally means muddling things around in ways that can be reconstructed. It's entirely possible a (custom - because you'd need custom tokenization) LLM could deobfuscate things.

Encryption OTOH means using a piece of information that isn't present. Weak encryption gets broken because that missing information can be guessed or recovered easily.

But this isn't the case for correctly implemented strong encryption. The missing information cannot be recovered by any non-quantum process in a reasonable timeframe.

There are exceptions - newly developed mathematical techniques can sometimes make recovering that information quicker.

But in general math is the weakest point of LLMs, so it seems an unlikely place for them to excel.


> Otherwise, we’re saying that encryption is somehow magic in a way that LLMs cannot possibly be.

I don't see why that's an unreasonable claim. I mean, encryption isn't magic, but it is a drastically different process.


What is HN’s fascination with these toy models that produce low quality, completely unusable output?

Is there a use case for them I’m missing?

Additionally, don’t they all have fairly restrictive licenses?


I never thought I'd see the day when a 13B model was casually referred to in a comments section as a "toy model".


Compared to GPT-2 it’s on par. Compared to GPT-3, 3.5, or 4, it’s a toy. GPT-2 is 4 years old, and in terms of LLMs, that’s several lifetimes ago. In 5-10 years, GPT-3 will be viewed as a toy. Note, “progress” is unlikely to be as fast going forward as it has been.


GPT-2's largest model was 1.5B params. LLaMA-65B was similar to the largest GPT-3 in benchmark performance, but that model was expensive in the API, so a number of people would use the cheaper one(s) instead, IIRC.

So this is similar to a mid-tier GPT-3-class model.

Basically, there's not much reason to pooh-pooh it. It may not perform quite as well, but I find it to be useful for the things it's useful for.


"Compared to GPT2 it’s on par" Any benchmarks or evidence tu support this claim? IF you try to find them, official benchmarks will tell you that this is not true. Even the smallest LLaMa model (7B) is far ahead of GPT2, like an order of magnitude better in perplexity.


Start using it for tasks and you'll find limitations very quickly. Even ChatGPT excels at some tasks and fails miserably at others.


Oh, I've been using language models since before a lot (or at least some significant chunk) of HN knew the word LLM, I think.

I remember when going from 6B to 13B was crazy good. We've just normalized our standards to the latest models in the era.

They do have their shortcomings but can be quite useful as well, especially the LLama class ones. They're definitely not GPT-4 or Claude+, for sure, for sure.


Maybe you forgot what the H in HN stands for... playful curiosity.


Is there a way to run any of these with only 4GB of VRAM?


Assuming an Nvidia GPU and the requisite system memory, use llama.cpp compiled with cuBLAS support, then run with the -ngl [n layers] option. You'll need a model quantized after May 12 to work with this.

The smallest GPU-only 7B 4-bit model requires 8GB VRAM, so it's either do CPU only or use the GPU offload above.
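
As a rough sketch (flag names from memory, so check the llama.cpp README):

  # rebuild with cuBLAS enabled
  make clean && make LLAMA_CUBLAS=1
  # offload a handful of layers; raise -ngl until VRAM runs out
  ./main -t 6 -m models/7B/ggml-model-q4_0.bin -ngl 10 -p "your prompt here"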


Thank you! I'll give it a try.


Wouldn't it be faster to use RAM as swap for VRAM?


Or just download oobabooga/text-generation-webui and any prequantized variant, and be done.


[deleted]


First link: https://github.com/ggerganov/llama.cpp

Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971

Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?


>Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?

I don't know if it's quicker, but I trust human assessment a lot more than any machine-generated explanations. You're right, I could have asked ChatGPT or even Googled, but a small bit of context goes a long way and I'm clearly out of the loop here -- it's possible others arriving on HN might appreciate such an explanation, or maybe we're better off having lots of people make duplicated efforts to understand what they're looking at.


Well, I'm saying if you just followed the links on the submitted page, you'd reach the same conclusion but faster.


You gave an awesome answer in 2 minutes! Might be faster than reading!


If you cannot click two links in a browser under two minutes, I'm either sorry for you, or scared of you :)


llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet.

It is also possible to run fine tuned versions like vicuna with this. I think. Those versions are more focused on answering questions.


>I can't tell from the Gist alone

Literally the second line: "llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet"


I'm sorry! I added this improvement based on that person's question!


Sorry -- I missed that. I'll delete my comments -- obviously I'm just an idiot asking dumb questions that have no value to anybody. I thought I read through it.


Not at all, your question was really good, so I added the answer to it to my gist to help everyone else. Sorry for the confusion I created by doing that!


The future is this: these models will be able to run on smaller and smaller hardware, eventually being able to run on your phone, watch, or embedded devices. The revolution is here and is inevitable, similar to how computers evolved. We are lucky that these models have no consciousness, still. Once they gain consciousness, that will mark the appearance of a new species (superior to us, if anything). Also, luckily, they have no physical bodies and cannot replicate, so far...


The phone part is already there! https://mlc.ai/mlc-llm/ (granted, this is only a 7B model running at 4 bits)



