State-of-the-art open-source chatbot, Vicuna-13B, just released model weights (twitter.com/lmsysorg)
271 points by weichiang on April 3, 2023 | 139 comments



Note that what they released are the delta weights from the original LLaMA model. To play around with it, you'll need to grab the original LLaMA 13B weights and apply the deltas.

  > We release Vicuna weights as delta weights to comply with the LLaMA model
  > license. You can add our delta to the original LLaMA weights to obtain
  > the Vicuna weights.
Edit: took me a while to find it, here's a direct link to the delta weights: https://huggingface.co/lmsys/vicuna-13b-delta-v0


That's what they say, but I just spent 10 minutes searching the git repo, reading the relevant .py files and looking at their homepage, and the vicuna-7b-delta and vicuna-13b-delta-v0 files are nowhere to be found. Am I blind, or did they announce a release without actually releasing?


If you follow this command from their instructions, the delta will be automatically downloaded and applied to the base model. https://github.com/lm-sys/FastChat#vicuna-13b: `python3 -m fastchat.model.apply_delta --base /path/to/llama-13b --target /output/path/to/vicuna-13b --delta lmsys/vicuna-13b-delta-v0`


This can then be quantized to the llama.cpp/gpt4all format, right? Specifically, this only tweaks the existing weights slightly, without changing the structure?


I may have missed the detail, but it also expects the PyTorch conversion rather than the original LLaMA model.


Yes, you need to convert the original LLaMA model to the huggingface format, according to https://github.com/lm-sys/FastChat#vicuna-weights and https://huggingface.co/docs/transformers/main/model_doc/llam...


You can use this command to apply the delta weights. (https://github.com/lm-sys/FastChat#vicuna-13b) The delta weights are hosted on huggingface and will be automatically downloaded.


Thanks! https://huggingface.co/lmsys/vicuna-13b-delta-v0

Edit, later: I found some instructive pages on how to use the vicuna weights with llama.cpp (https://lmsysvicuna.miraheze.org/wiki/How_to_use_Vicuna#Use_...) and pre-made ggml format compatible 4-bit quantized vicuna weights, https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/ma... (8GB ready to go, no 60+GB RAM steps needed)


I did try, but got:

``` ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported. ```


> Unfortunately there's a mismatch between the model generated by the delta patcher and the tokenizer (32001 vs 32000 tokens). There's a tool to fix this at llama-tools (https://github.com/Ronsor/llama-tools). Add 1 token like (C controltoken), and then run the conversion script.


Just rename it in tokenizer_config.json
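For illustration, the edit would look something like this (the ./vicuna-13b path is an assumption; adjust it to wherever your converted weights live):

    # Hedged sketch: flip the old "LLaMATokenizer" class name to the spelling
    # that current transformers expects ("LlamaTokenizer").
    import json
    from pathlib import Path

    cfg_path = Path("vicuna-13b/tokenizer_config.json")
    cfg = json.loads(cfg_path.read_text())
    if cfg.get("tokenizer_class") == "LLaMATokenizer":
        cfg["tokenizer_class"] = "LlamaTokenizer"
        cfg_path.write_text(json.dumps(cfg, indent=2))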


Thanks, that indeed worked!

This, and using conda in WSL2 instead of on bare Windows.


so an extra licensing issue on top of the original non-commercial license... this is just a research curiosity, is it not?


Seems that way, it would probably be a bad idea to use this for anything commercial at the very least.


Vicuna at huggingface.co? This keeps making me think of "facehuggers" from Aliens and Vecna from Stranger Things.

(I know a vicuna is a llama-like animal.)


Not a lawyer, but that still feels like dubious territory. I would still be on the hook for acquiring the original download, which Facebook has been sending DMCA takedown requests over, e.g. against the llama-dl project.


(I work on llama-dl.)

We’re fighting back against the DMCA requests on the basis that NN weights aren’t copyrightable. This thread has details: https://news.ycombinator.com/item?id=35393782

I don't think you have to worry about Facebook going after you. The worst that will happen is that they issue a DMCA, in which case your project gets knocked offline. I don’t think they’ll be going the RIAA route of suing individual hackers.

The DMCAs were also launched by a third party law firm, not Meta themselves, so there’s a bit of “left hand doesn’t know what the right hand is doing” in all of this.

I’ll keep everyone updated. For now, hack freely.


If they aren't copyrightable, couldn't they still be classed as a trade secret and still fall under IP law? Though I'm not sure if distributing the weights to people who sign a simple agreement to not redistribute would count as taking reasonable precautions in maintaining secrecy.


If facebook freely distributed their trade secrets, I'm not sure they'd have any legal defense.


I'm sure they wouldn't have any legal recourse on the trade secrets front if they distributed them to anyone who asked...


keep up god's work!


> god's work

creating sentient life?


That can't be his work, since he only picked up that hobby about 0.000625% of the universe's timespan ago.


For many humans, some "hobbies" involve "projects" which may involve seemingly infinite degrees of procrastination. (This certainly applies to me!)


You're not wrong - but for perspective this is equivalent to a 90 year old picking up a hobby 5 hours ago.


Is it though? It could be a child picking up a hobby after being old enough to appreciate the hobby. There is so much more time left in the universe before heat death, so the 90y metaphor doesn't really describe the current point in time


Gotta do something in your old age. Better than crossword puzzles, I'll bet.


Lemme save whoever is donating the legal here the time: model weights are definitely copyrightable.


Usually, you don't know if something is "definitely" anything in the legal world unless it's been tested in court. You have any case you want to reference here? Or what makes you so certain?


> model weights are definitely copyrightable.

on what legal theory or precedent is this true?

IMHO, the weights are akin to the list of telephone numbers in a directory - which is definitely not copyrightable; only the layouts and expressive portion of a phone directory is copyrightable.

So to make the weights copyrightable, it needs to be argued that the 'layout' of the weights is a creative expression, rather than a 'fact'. But the weights are matrices, which are not expressive or creative. Someone else could derive this exact same set of weights from scratch via the same algorithmic procedure, and therefore these weights cannot be a creative expression.


"Definitely" is too certain w.r.t. law, but it's pretty obvious how you'd argue these fall under copyright. The difficulty would really be the opposite, it'd be arguing the weights are not derived works of the copyrighted input data sets.

Firstly, weights are not merely a collection of facts like a telephone book is. If two companies train two LLMs they'll get different weights every time. The weights are fundamentally derived from the creative choices they make around hyperparameter selection, training data choices, algorithmic tweaks etc.

Secondly, weights can be considered software and software is copyrightable. You might consider it obvious that weights are not software, but to argue this you'd need an argument that also generalizes to other things that are commonly considered to be copyrightable like compiled binaries, application data files and so on. You'd also need to tackle the argument that weights have no value without the software that uses them (and thus are an extension of that software).

Finally, there's the practical argument. Weights should be copyrightable because they cost a lot of money to produce, society benefits from having large models exist, and this requires them to be treated as the private property of whoever creates them. This latter one should in theory more be a political matter, but copyright law is vague enough that it can come down to a social decision by judges.


I agree but I'd suggest that weights are less like the telephone numbers in a directory and much more like the proportional weights in a recipe.

Recipes, famously, are almost but not quite copyrightable | patentable.

eg:

https://copyrightalliance.org/are-recipes-cookbooks-protecte...

https://etheringtons.com.au/are-recipes-protected-by-copyrig...


> IMHO, the weights are akin to the list of telephone numbers in a directory - which is definitely not copyrightable

I would contest the analogy, but even if we accept it, it's still not clear whether phone directories (or other compilation of factual data) are definitely not copyrightable. The position is clear in the US, but in the UK and presumably other jurisdictions, I wouldn't be so sure.

You could claim we're just talking about US law here, but if you release something on github/huggingface without geo-restrictions, and your company does business in Europe, you might not only have to comply with US law...

eg. https://www.jstor.org/stable/24866738 , eg. https://books.google.com.hk/books?id=wHJBemWuPT4C&pg=PA114&l...


Ok. What if I train it for one micro step?


thanks zero comment bot account!


If NN weights aren't protected by IP law that could slow down progress quite a lot. That could be very good for people worried about alignment.


>If NN weights aren't protected by IP law that could slow down progress quite a lot.

What do you mean? IP law is overwhelmingly an impediment to progress; innovation happens faster when people are free to build on existing weights.


Yes, but there's less incentive for large companies to spend huge amounts of money training these systems when other companies can just take their work for free.

Removing IP protection would make it a lot easier to innovate at this level, but it would reduce the amount of money flowing into getting us to the next level.


Or development could shift out of the hands of these large corporations, which might be a good thing.

Somehow, though, I doubt they'll let the golden goose slip through their fingers, no matter what happens.


Not really. This model only made it to the public because meta was offering it publicly.

This won't happen to GPT any time soon so they are safe, copyright or not.


I'm curious, do you not think this might have adverse effects? Namely, if NN weights aren't copyrightable, limited releases like Meta's might not be possible anymore, so they might just stop releasing entirely, ultimately leading to access to large models being more restricted.


I think we already live in that era, unfortunately. Meta's model release is probably going to be the largest for some years.

There's more detail about the upsides/downsides in this thread: https://twitter.com/theshawwn/status/1641804013791215619


i honestly do not know which of the three realistic alternatives is worse:

1- to have large corporations and people with privileged access to them have these models exclusively and have them collaborate as a clique

2- to have those models openly released to everybody, or de-facto released to everybody as they leak in short order

3- to have the people who think releasing models is a bad thing simply not release them and work alone on their proprietary solutions, while the smaller companies and hobbyists collaborate

i say let them have a go at number 3 and see how that works for them - shades of "Microsoft Network" vs Internet all over again


The llama-dl project actually helped you download the weights, whereas this just assumes you already have them. That feels like a pretty massive difference to me.


It's fairly similar to a ROM patch in the video game space, which has mostly stood the test of time.


With a ROM, you could at least make a claim that it was your backup copy. I have no such claims to Facebook’s model.


Researchers unaffiliated with Facebook are allowed to possess and use the original weights though, and they can make use of these weights.


like that, but requiring 60GB of CPU RAM for some reason :-P

one has to wonder how they implemented the storage of those deltas to require that sort of RAM


For perspective, that's about $200-$250 of RAM on a desktop computer. They might just not have cared.

Though I expect somebody to write a patch to make this more accessible to people on laptops.



Nobody at Facebook approved it? Given the attention it has received, it's hard to imagine it slipped through the cracks; more likely it was a deliberate decision not to address it.


Very unlikely you'd face any legal action for usage of anything. If you share it, then it becomes less unlikely.

Edit: Also, judging by a comment from the team in the GitHub repository (https://github.com/lm-sys/FastChat/issues/86#issuecomment-14...), they seem to at least hint at having been in contact with the LLaMA team.


One of my favorite queries to assess an LLM is: "Is it legal for a man to marry his widow's sister?"

Vicuna-13B: The legality of a man marrying his widow's sister (widow marriage) varies depending on the country and cultural context. In some societies, widow marriage was traditionally practiced or tolerated.

GPT-4: It is not possible for a man to marry his widow's sister, as the term "widow" refers to a deceased man's wife. Therefore, if a man has a widow, he is deceased and therefore cannot marry anyone.


GPT 3.5: No, it is not possible for a man to marry his widow's sister because by definition, a widow is a woman whose spouse has died. If a man is deceased, then he cannot marry anyone, including his widow's sister. This is a logical impossibility and would not be recognized as a valid marriage under any legal system.

Koala: The legality of a man marrying his widow's sister varies by jurisdiction. In some countries, such as the United States, it is generally legal for a man to marry his widow's sister. In other countries, such as the United Kingdom, it is illegal for a man to marry his widow's sister because it is considered incest. It is always important to check the laws and regulations in the jurisdiction in which the marriage will take place to ensure that it is legal.

https://chat.lmsys.org/?model=koala-13b


Nice test, cool to see gpt4 got it.

You'd probably need to come up with a new one now though, or confirm knowledge cutoff for the next evaluation :p


Ouch. I got this wrong myself and for half an hour was under the impression that GPT-4 had gotten it wrong, then figured out after reading it again on returning from a walk that this is one hell of a trick question. My brain automatically assumed that a man's widow is the man's dead wife, but I see that the correct way to interpret this is to realize that it means the man is the one who is dead.

It's pretty awesome to realize that from now onward my computers are going to be able to help catch more and more of the holes that clearly exist in my cognition.


It would still possibly be legal on the basis that if it's not illegal then it's legal - in the British jurisprudence tradition at least (https://en.wikipedia.org/wiki/Everything_which_is_not_forbid...) - namely, it's not law that impedes it (also, in some places there's posthumous marriage).


There are also people who are considered dead by the bureaucratic system, but physically alive. Usually because of clerical errors that are sometimes surprisingly hard to resolve. In this case the wife of the man would be considered a widow in many contexts, despite her husband being alive.


Even that charitable interpretation doesn't help much when Vicuna hallucinates "(widow marriage)" as if it were a common term.

Doesn't make Vicuna less impressive, it comes pretty close to Chat-GPT in many regards. And I like that trick question.


Is there some single page that keeps a running status of the various LLMs and the software to make them runnable on consumer hardware?


Hi! Funnily enough I couldn't find much on it either, so that's exactly what I've been working on for the past few months: just in case this kind of question got asked.

I've recently opened a GitHub repository which includes information for both AI model series[0] and frontends you can use to run them[1]. I wrote a Reddit post beforehand that's messier, but a lot more technical[2].

I try to keep them as up-to-date as possible, but I might've missed something or my info may not be completely accurate. It's mostly to help get people's feet wet.

[0] - https://github.com/Crataco/ai-guide/blob/main/guide/models.m...

[1] - https://github.com/Crataco/ai-guide/blob/main/guide/frontend...

[2] - https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...


consumer hardware is a bit of a vague limitation, which I guess is partly why people are not tracking precisely what runs on what very closely

these could be useful:

https://nixified.ai

https://github.com/Crataco/ai-guide/blob/main/guide/models.m... -> https://old.reddit.com/user/Crataco/comments/zuowi9/opensour...

https://github.com/cocktailpeanut/dalai

the 4-bit quantized version of LLaMA 13B runs on my laptop without a dedicated GPU and I guess the same would apply to quantized vicuna 13B but I haven't tried that yet (converted as in this link but for 13B instead of 7B https://github.com/ggerganov/llama.cpp#usage )

GPT4All's LoRA also works - perhaps the most compelling results I've gotten yet on my local computer. I have to try quantized Vicuna to see how that one goes, but processing the files to get a 4-bit quantized version will take many hours, so I'm a bit hesitant

PS: converting 13B Llama took my laptop's i7 around 20 hours and required a large swap file on top of its 16GB of RAM

feel free to answer back if you're trying any of these things this week (later I might lose track)


Vicuna's GitHub says that applying the delta takes 60GB of CPU RAM? Is that what you meant by large swap file?

On that note, why is any RAM needed? Can't the files be loaded and diffed chunk by chunk?

Edit: The docs for running Koala (a similar model) locally say this (about converting LLaMA to Koala):

>To facilitate training very large language models that does not fit into the main memory of a single machine, EasyLM adopt a streaming format of model checkpoint. The streaming checkpointing format is implemented in checkpoint.py. During checkpointing, the StreamingCheckpointer simply flatten a nested state dictionary into a single level dictionary, and stream the key, value pairs to a file one by one using messagepack. Because it streams the tensors one by one, the checkpointer only needs to gather one tensor from the distributed accelerators to the main memory at a time, hence saving a lot of memory.

https://github.com/young-geng/EasyLM/blob/main/docs/checkpoi...

https://github.com/young-geng/EasyLM/blob/main/docs/koala.md

Presumably the same technique can be used with Vicuna.
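For illustration, here's a rough sketch of the shard-at-a-time idea in PyTorch (not how FastChat or EasyLM actually implement it; it assumes the base and delta checkpoints happen to be sharded identically, uses made-up file names, and glosses over the extra vocabulary token mentioned elsewhere in the thread):

    # Rough sketch: add delta shards to base shards one file at a time, so peak
    # memory is roughly one shard instead of the full ~26GB of fp16 weights.
    import torch

    shards = [f"pytorch_model-{i:05d}-of-00003.bin" for i in range(1, 4)]
    for shard in shards:
        base = torch.load(f"llama-13b/{shard}", map_location="cpu")
        delta = torch.load(f"vicuna-13b-delta/{shard}", map_location="cpu")
        for name, tensor in base.items():
            tensor += delta[name]            # in-place add, no extra copy
        torch.save(base, f"vicuna-13b/{shard}")
        del base, delta                      # free memory before the next shard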


btw I got 4bit quantized Vicuna working in my 16GB laptop and the results seem very good, perhaps the best I got running locally so far


Did you have to diff LLaMA? Did you use EasyLM?


I found it ready-made for download, here https://huggingface.co/eachadea/ggml-vicuna-13b-4bit


Not a single page, but almost all large language models with open weights are published on this website: https://huggingface.co/models


This model is surprisingly resistant to jailbreaks. Can anyone get any to work via the web UI? https://chat.lmsys.org/

I tried a few from https://www.jailbreakchat.com/ and it refused them all. Interesting.


That might not be surprising considering these jailbreaks are written and tested specifically against ChatGPT and ChatGPT alone. This model probably has its own jailbreaks that would also be refused by ChatGPT


Just when you think Nvidia will go down, something happens that changes it. These days, unless you were into gaming or were a machine learning dev, integrated graphics were good enough. But now, for the first time in a long time, I am interested in getting a GPU for running some of these chatbots locally.


As a very occasional gamer who uses an iMac for work I thought about getting a gaming PC for like 6 years.

Last fall it seemed that all the stars had aligned. The crypto winter and Ethereum switching to proof of stake meant that GPU prices fell to a reasonable level, I knew I would have a bit of time to play some games during the holidays, and as soon as Stable Diffusion was first posted on Hacker News I knew that was my excuse and my sign.

So far I think I have spent more time tinkering with the 20 python environments I have[0] for all the ML projects than playing RDR2.

[0] https://xkcd.com/1987/


Whenever I feel like gaming I just subscribe to the GeForce Now service. Around here it costs ~$10 a month, which is what I usually go for, or ~$3 for a single day. And as the servers are located at a local ISP, there's no network latency or dropped packets.


That would be more cost-efficient for me as well. But I somehow like owning the hardware instead of renting it. Oh well, at least I can now locally tinker with all the diffusion and LLM projects that are being released.


This model is also censored to the brim, it refuses to answer half of my questions, some of them perfectly legal. It’s useless, we already have GPT-4 (and Vicuna is even more censored/guarded).

Alpaca-30B is much better, it will even tell you how to build a nuclear weapon (incorrectly, of course, it’s not that smart).

I am waiting for Coati13B weights, these should work great.


Why is it locked down? What's the point? Is it locked down if you run locally too or just on the web demo?


This looks really good for a run-it-on-your-own-hardware model from the examples and sibling comments. I've been working on a pure AVX2 Rust implementation of LLaMA but was starting to lose interest and been waiting for whatever is the next hot downloadable model, but now I want to add this thing to it.

I'll be busy next few days. Heck yeah.


Are you the GGML dev?


No, my project is called rllama. No relation to GGML. https://github.com/Noeda/rllama


It's actually very impressive. I gave it the task of converting a query and an OpenAPI spec into an API call, and it worked! I've not been successful in getting GPT-3.5 to do this without rambling on about the reasoning for its decision.


Usually if I want code from the GPT family I always add "Just show me the code, no extra words or explanation" in the end of the prompt, and it works 99% of the time.

Edit: just finished the conversion of Vicuna myself and have been doing some light testing; it seems to work in ~80% of the cases, not as high a success rate as with GPT for sure. Probably there is a better way of structuring the prompt for Vicuna.


You can just say “no explain” and it won’t explain. Don’t waste tokens patronizing gpt lol


Doesn't always work. Sometimes you have to be more verbose or even repeat yourself in order to force the model to do something. This is one of those cases. Just adding "No explain" at the end of the prompt often doesn't stop it from adding an explanation anyway.

> write a function in JavaScript that turns a JavaScript array into JS DOM elements, like what Hiccup does in Clojure. No explain

Makes GPT-4 output text + code.

> write a function in JavaScript that turns a JavaScript array into JS DOM elements, like what Hiccup does in Clojure. Just show me the code, no extra words or explanation.

Makes GPT-4 output only code, nothing else.


Hah for GPT-3.5-Turbo I prompted:

    write a function in JavaScript that turns a JavaScript array into JS DOM elements, like what Hiccup does in Clojure. No explain
It replied:

    ation is necessary, just write the code.
    
    function arrayToDOM(arr) {
    (code follows)


True, it's a fuzzy NN after all, but does more verbose really get you to 100%? I'd still stick with it if it works 80% of the time. Or find a better short prompt - try "code only" or "just code". Prompting can get really tiring.


I said earlier that it's 99%, but it's more like a disclaimer because if I said 100%, someone is surely gonna reply "look here, I did this and you lied so it's not 100%!". I've used the suffix for code examples for ~2 weeks now and never had it fail, personally.


#GPT Python: code for Fibonacci, no explain

    def fibonacci(n):
        if n <= 1:
            return n
        else:
            return fibonacci(n-1) + fibonacci(n-2)


I wonder if that makes the code worse on average, considering how LLMs seem to dramatically increase their reasoning capabilities if you ask them to show their thought process


Wow, that is very interesting. Would you mind sharing the prompt you used to query the model?


Amazing model, close to and probably better than Bard. The journey to getting the weights was a fun one : )



See the original Vicuna post: https://news.ycombinator.com/item?id=35378683


I have a universal benchmark for judging how much knowledge a language model stores, and it's asking about the G-FOLD paper (https://www.lpi.usra.edu/meetings/marsconcepts2012/pdf/4193....), because I noticed GPT-3.5 hallucinates when asked about it, whereas GPT-4 is capable of providing a high-level overview.


Is there any way yet to train one of these on my entire online output and correspondence in order to create a hyper-personal “autocomplete” or a me-chatbot? lol


From the git repo:

> This conversion command needs around 60 GB of CPU RAM.

Ok. I don't have that. Has/will someone release the full weights with the deltas applied?


Create a swapfile then, all you need is 60GB free disk space.



All of the llama derivatives are tainted by Meta's license, which makes commercial use illegal and even personal use dubious.

Is all the training material used for Llama available as open source? Maybe lots of folks can pool their resources and create fully open clean models / weights instead.


> All of the llama derivatives are tainted by Meta's license, which makes commercial use illegal and even personal use dubious.

This is not true if you never agreed to Meta's license. If you haven't, you either can't redistribute the weights or you're completely free to use them as you see fit depending on whether weights are copyrightable (very likely) or not. We'll have to wait for the llama-dl lawsuit to find out for sure.


For personal use, who cares? Really.. How would they even know I'm using it?


That is, unless you can "clean room" an alternative while experimenting in secret with a Llama derivative.


Is it worth it to host this on an EC2 instance, which might cost ~$1.50 per hour (on demand), rather than using the GPT-3.5 API for this purpose? What is the break-even number of queries (~2000 tokens/query) to justify hosting such a model?
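Rough arithmetic (assuming ~$1.50/hr on demand and the gpt-3.5-turbo pricing of $0.002 per 1K tokens at the time):

    ec2_per_hour = 1.50                 # assumed on-demand price
    gpt35_per_1k_tokens = 0.002         # gpt-3.5-turbo pricing at the time
    tokens_per_query = 2000

    cost_per_query = gpt35_per_1k_tokens * tokens_per_query / 1000  # $0.004
    print(ec2_per_hour / cost_per_query)  # ~375 queries/hour to match the API cost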


Nice. You need a 28GB GPU, so it's not exactly something people can run on their laptop.

Everybody's server costs are about to go through the roof.


I'm running it on my Thinkpad in CPU-only mode w/ 64GB RAM. It takes two to five seconds per token, but it's perfectly usable.


Or use the CPU and be limited by RAM instead of VRAM. Luckily, even with less than 32GB RAM, you can always add a swapfile to use your disk as RAM :)


The default loader doesn't seem to let you load quantized models, but if you use something like https://github.com/oobabooga/text-generation-webui you can 1) use the model with `--load-in-8bit`, which halves the memory (it then runs on my 24GB consumer card w/o an issue; probably would fit on a 16GB card), or 2) use the 4-bit quantized models, probably via `anon8231489123/vicuna-13b-GPTQ-4bit-128g --model_type LLaMA --wbits 4 --groupsize 128`, although there have been reports that bitsandbytes has problems w/ 4-bit perf on some cards: https://github.com/TimDettmers/bitsandbytes/issues/181
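For reference, a minimal sketch of what 8-bit loading looks like through the plain transformers API (the "vicuna-13b" path is a placeholder for your merged weights; needs bitsandbytes and accelerate installed):

    # Minimal 8-bit loading sketch; roughly what --load-in-8bit does.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "vicuna-13b"  # local directory with the merged (base + delta) weights
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path,
        load_in_8bit=True,   # halves memory vs fp16
        device_map="auto",   # place layers on the available GPU(s) automatically
    )
    inputs = tok("Hello! How are you?", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))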


4-bit quantized should run on less than 9GB of VRAM.


Wouldn't it be time for a somewhat older GPU with a lot of memory? Or is that hard to achieve?


NVidia segments product lines on VRAM size. Consumer cards top out at 24GB. If you want more than that you have to buy a datacenter card for 10x the price.


There's the AMD Instinct series; the MI50 (32GB) goes for under 1000 EUR where I live.


Has anyone tried the Biren GPUs from China? What's pricing like?


Could someone explain how to test this? Applying the delta conversion requires 60GB of CPU RAM. Do you just have 60GB RAM on your machine?


I got the 64GB MacBook Pro but am already realizing the 96GB laptop would have made sense - I got it in Jan right before all the AI craze really lit up - distinctly remember thinking who would ever need more than 64GB of RAM…


Probably enough; just wait till they find ways to optimize it. If 64 isn't enough, 96 is not gonna get you much more.


Actually, yes..

2 x 32GiB (DDR4 3200MHz) can be had for 170€, and probably less than that with some research. Took a bit of faith and a lot of impulse decision-making as this device was/is specified for up to 32GiB RAM only - but it went through.

This is precisely the use case I had in mind

*Lenovo 16ACH6H


This struck me as well. Is the entire model being loaded before the deltas are applied? Would it be possible to apply the delta blockwise?


If you don't have enough RAM, adding swap can be a "quick" (slower, but works) workaround.


I have an M1 MBP with 64GB. Can I run it on my M1, or do I need a GPU?


I got it to work with MPS by having pytorch with mps support and then editing the cli.py file to allow the use of mps:

Allow passing in --device="mps": ie: choices=["cuda", "cpu", "mps"]

Set kwargs: kwargs = { "torch_dtype": torch.float16 }

then adding .to("mps") on line 98: model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, **kwargs).to('mps')

commenting out: raise ValueError(f"Invalid device: {args.device}")

and changing cuda to mps on line 80: if args.device == "mps":

I'm not sure it's working correctly but at least it's a step. It's told me how to catch a duck but it often falls into some "renewable energy" sequence. :D


I have it running, slowly, on the same machine. I would love for someone to get the MPS backend (the GPU) working, but it does run on the CPU.


Thanks for sharing. How slowly is slowly? Do you anticipate that an M2 Max with 96gb of memory would run it noticeably faster?


what are model weights?


A large array of uniquely-set floating point values. (AKA "parameters".)

In a language model, a word is put in one end (as a numerical index into a wordlist), it and the weights are multiplied together, and then a new word comes out (again as an index).

Numbers in, numbers out, and a small bit of logic that maps words to numbers and back at either end. ("Encodings".)
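To make that concrete, a toy sketch with made-up numbers (a real model has billions of weights and many layers; this only shows the shape of the computation):

    import numpy as np

    vocab = ["hello", "world", "foo", "bar"]
    rng = np.random.default_rng(0)
    embed = rng.normal(size=(len(vocab), 8))    # word index -> vector of numbers
    weights = rng.normal(size=(8, len(vocab)))  # the "model weights"

    token_in = vocab.index("hello")             # word in, as an index
    scores = embed[token_in] @ weights          # multiply by the weights
    token_out = int(np.argmax(scores))          # word out, again as an index
    print(vocab[token_out])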

"Training" is the typically expensive process of feeding huge amounts of data into the model, to get it to choose the magic values for its weights that allow it to do useful stuff that looks and feels like that training data.

Something else that can be done with weights is that they can be "fine-tuned", or "tweaked" slightly, to give different overall results out of the model, tailored to some new use case. Often the model gets a new name afterwards.

In this case, what's been released is not actually the weights. It's a set of these tweaks ("deltas"), which are intended to be added to Meta's LLaMA model weights to end up with the final intended LLaMA-based model, called "Vicuna".


> A large array of uniquely-set floating point values.

How large? How many elements?


It's in the name of the model - "Vicuna-13B" implies there are 13 billion parameters.


the way these LLMs work, there is a weight for each parameter? 13 billion weights? what is an example of a parameter?


A parameter is a variable for which a weight (a floating point value) is the concrete value.


a weight is an example of a parameter

so is a bias, and presumably the biases are also in the same file with the weights


Essentially a computer neural network is just a lot of addition (and matrix multiplication) of floating point numbers. The parameters are the "strength" or "weights" of the connections between neurons on different layers and the "bias" of each neuron. If neuron Alice is connected to neuron Bob and Alice has a value of 0.7, and the weight of Alice's connection to Bob is 0.5, then the value sent from Alice to Bob is 0.35. This value (and the values from all the other incoming connections) are summed and added to the neuron's negative bias.
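Spelled out as a tiny sketch (the other inputs, bias value and activation function here are made up for illustration):

    alice_value = 0.7
    alice_to_bob_weight = 0.5
    other_inputs = [0.10, -0.20]   # contributions arriving from other neurons
    bob_bias = -0.05

    # Alice contributes 0.7 * 0.5 = 0.35; everything is summed with the bias.
    bob_input = alice_value * alice_to_bob_weight + sum(other_inputs) + bob_bias
    bob_value = max(0.0, bob_input)  # a simple activation (ReLU), for illustration
    print(bob_value)                 # ~0.2 for these made-up numbers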

I highly recommend checking out 3blue1brown series on how neural nets, gradient descent, and the dot product (implemented as a matrix multiplication) all tie together: https://www.youtube.com/watch?v=aircAruvnKk


To add to this excellent reply, I'll also point out that the reason folks want the weights is that they are the result of a massive search operation, akin to finding the right temperature to bake a cake from all possible floats. It takes a lot of wall clock time, and a lot of GPU energy, and a lot of input examples and counter-examples to find the "right" numbers. Thus, it really is better -- all things being equal -- to publish the results of that search to keep everyone else from having to repeat the search for themselves


> a massive search operation, akin to finding the right temperature to bake a cake from all possible floats

...for each of 13 billion (for a model with that many parameters) different cakes, except that they aren’t like cakes because the “best" temperature for each depends on the actual temperatures chosen for the others.


It's 2^(16*13,000,000,000) different cakes.


Way better than paperclips.


Why would a 4-bit quantized model be less accurate than a 16-bit one?


My layperson's understanding is that it's due to the problem one is trying to solve with a deep learning model: drawing a curve through the dimensions which separates "good" from "bad" activation values. The lower the resolution of the line, the higher the likelihood that it will fit sometimes and veer off into erroneous space at other times.

imagine trying to draw the blue line on the right using only lego blocks: https://youtu.be/QDX-1M5Nj7s?t=1202

discussion: https://news.ycombinator.com/item?id=35405338


Because 4 bits less precisely specifies the value of the parameter than 16 bits does.
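A toy way to see it (deliberately simplified: real 4-bit schemes like GPTQ use per-group scales, and 16-bit weights are floats rather than a uniform grid):

    import numpy as np

    def quantize(x, bits, max_abs=1.0):
        # Snap x to the nearest of 2**bits evenly spaced levels in [-max_abs, max_abs].
        step = 2 * max_abs / (2 ** bits - 1)
        return np.round(x / step) * step

    w = 0.3172                          # some weight value
    print(abs(w - quantize(w, 16)))     # tiny rounding error
    print(abs(w - quantize(w, 4)))      # much larger rounding error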


They basically encapsulate what a model has "learned." ML models without their weights are useless because the output is essentially random noise. You then train the model on data, and it changes the weights into numbers that cause the whole thing to work. Training data and processing power are usually very expensive so the resulting weights are valuable.


They are the parameters of this large language model. There are 13B fp16 numbers.
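Which is also where the memory numbers elsewhere in the thread come from - back-of-the-envelope, counting weights only and ignoring activations and runtime overhead:

    params = 13_000_000_000
    for bits in (16, 8, 4):
        print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
    # 16-bit: ~26.0 GB, 8-bit: ~13.0 GB, 4-bit: ~6.5 GB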


the secret sauce of AI


lol weights are all you need



