
Hi Everyone,

I'm a regular involved with the RWKV community.

Ask me anything about RWKV, and I will do my best to answer questions here for the next hour (one at a time).

PS: you can find our discord here : https://discord.gg/qt9egFA7ve




Do you think there could be a kind of universality principle at work here, where once you make a good enough architecture then the details don't matter so much compared to the model size and training flops and dataset size? In other words, maybe it wasn't a coincidence that your architecture worked about as well as the transformer architecture?


There is a reasonable argument for that (I've heard the idea go around among multiple AI engineers: once you go past a certain scale, the architecture does not matter much for the evals).

One of the biggest issues with testing all of this is that it takes a crap ton of GPUs to prove out the alternatives to transformers beyond 1B params.

For example, I'm waiting for someone to do a 1B-14B text-based diffusion network.

Finally, if this is truly the case (and all that really matters is size + dataset)

We really should use an architecture that is cheaper to train and run. And that’s what RWKV represents here

You can even run the 7B quantized model reasonably on most laptops (try the rwkv-cpp / rwkv-cpp-node project)
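If you want to poke at it from Python instead, here is a minimal sketch using the `rwkv` pip package from the ChatRWKV project (the rwkv-cpp / rwkv-cpp-node bindings follow a similar load-then-generate flow). The checkpoint path, tokenizer file name, and strategy string below are placeholders / assumptions, so check the project README for the exact values your release expects.

    # Hedged sketch, not an official quick-start: load an RWKV checkpoint with
    # int8 weights on CPU and generate a few tokens. Paths are placeholders.
    from rwkv.model import RWKV
    from rwkv.utils import PIPELINE

    model = RWKV(
        model="/path/to/RWKV-4-Raven-7B-checkpoint",  # placeholder: downloaded model file
        strategy="cpu fp32i8",                        # int8 weights on CPU; adjust to your hardware
    )
    pipeline = PIPELINE(model, "20B_tokenizer.json")  # tokenizer file shipped with ChatRWKV

    print(pipeline.generate("Here is a short poem about the sea:", token_count=64))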


The paper says it's comparable to transformers right now but that means that it might be better later. Do you guys have concrete plans to make it better? Are they secret? Also, what's the deal with that foundation? Is it a cult or like the new OpenAI that will turn closed or maybe it's to reap the value of random contributors to the project?


Completely the opposite.

- it is NOT backed directly or owned by any VC funded company

- it is 100% OSS driven by the community (Apache 2 license)

- it’s currently the top OSS chat model that can be used commercially on the chatbot arena score board

- IMO it is undertrained, so expanding the training data alone will make it much better (however for the sake of this paper, we wanted to focus on architecture not training data, so we compared similarly trained models)

And yes, we do have multiple experiments and plans to make it better. It's a long list, and we will not know which ideas make the cut until we try them. Individual members can go into great depth on what they are working on.

For better or worse, being truly OSS means our initiatives are more disorganized than a centrally planned org's.


> it’s currently the top OSS chat model that can be used commercially on the chatbot arena score board

To be fair, that filters out the majority of models on the scoreboard.


Don't we all wish this wasn't the case?

Where we have more OSS models to choose from, without weird rule-lawyering gotchas, or needing to be from a research institute / get a license to download the weights.


Can beginners without money or PhDs contribute? If so, what would be the best way to start?


There are lots of really low-hanging fruit:

- integrating this with AI platform X/Y/Z

- setting up evals

- improving the code quality

- making a how to guide (it’s stuck on my todo list)

- helping with dataset

- doing silly experiments on how the architecture works (and whether the changes give good results)

- etc etc

One of the community goals is to make this a model for EVERYONE on earth. That means we need quality datasets for all the non-English languages.

So even on that level there are things to do

(Find something that interests you in the community.)


I would guess: go to their Discord. They also have their GitHub up, so you could fix a bug there.


(Note: My comments do not represent or reflect those of my collaborators.) I remember talking to Blink DL about this; I think the plan is just to build an ecosystem and provide more diversity in the DL space. There are plans to make an RWKV5, and they are in the open in the RWKV5 channel. From an engineering standpoint, I don't really see the "reaping" of value from random contributors to the project. Most of us, I believe ... are hackers and tinkerers that just want to learn and contribute and be a part of something that can change the current


Currently, what I'm seeing with RWKV is that attention fades off quickly. The model will start to produce output, but very quickly (a few dozen tokens in), its own output tokens suddenly take 'precedence' over the input question and it starts to simply repeat itself.

For example, I'm currently attempting to use RWKV for named entity extraction. I ask it to analyze a piece of text and provide output in JSON format. It starts off great. However, eventually, it seems like the beginning of the JSON list 'overtakes' the question I asked, and it starts to just produce random data that would seem plausible based on the set of things in the list. I realize this is due perhaps to the precision losses of the RNN as weights decay.

However, I feel there ought to be some way we can prevent that. Any thoughts?


Rearrange the query.

Ask the question / explain the task first. Then give it the data you want to extract from.

Also, you may want to give a one-shot example for best results (the instruct training of the Raven model is very limited).
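To make that concrete, here is a rough sketch of the prompt layout being suggested; the wording, the JSON schema, and the file name are purely illustrative, not a format the model was specifically trained on.

    # Illustrative prompt layout only: task and one-shot example first, document last.
    task = (
        "Extract every person name from the document below and answer "
        "with a JSON list of strings.\n"
    )
    one_shot = (
        "Example document: Alice met Bob in Paris.\n"
        'Example answer: ["Alice", "Bob"]\n'
    )
    document = open("my_document.txt").read()  # placeholder input file

    prompt = task + one_shot + "Document: " + document + "\nAnswer:"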


Yeah... So I did that which is how I got it to begin correctly. This is what I mean though.

I'll say "get a list of Blah from the following document in Json format like this:

Example"

Then I feed the document and add a spot for the answer.

The model begins correctly. But usually in the middle of the Json list generation, it will veer off, and start hallucinating as if it forgot the document and the task. I'm happy to share specifics and datasets but this is a cross cutting problem.

RWKV is able to answer my questions when I ask for simple yes/no or classification. It's the listing that throws it for a loop. Transformers do not have the same problem. Both llama and gpt are able to maintain focus.

Also, do you know where I'd find information on how the current weights were trained?


Hmm, we might need to look into the instruct training data, which is mostly based on gpt4all, filtered and mixed with others.

(You are using Raven, right? That's the instruct-trained variant)

Btw, ping the Discord if you're looking into fine-tuning for your use case.


Yeah I'm using raven. Raven does work better. And I'm on the discord.

Unfortunately I really would like machine readable responses and raven is a bit too verbose.

Looking at fine-tuning right now.


Why would asking the question first improve quality? Is it because the model will be better aware of what info it can and can't throw away at each step? This seems like the opposite of transformers.


RWKV does not work like transformers. The "transformer" part here is the training step. RWKV is an RNN with fixed-size state, so old information slightly decays each time it reads a new token. Hence the freshest memory is of the most recent tokens.
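Here is a toy sketch of that idea, heavily simplified: a single channel, no numerical stabilisation, and the "bonus" weight RWKV gives the current token is left out, so treat it as an illustration of the decay behaviour rather than the actual RWKV kernel.

    import math

    # Toy, single-channel version of an RWKV-style weighted-average recurrence.
    # The state (num, den) stays the same size no matter how long the sequence is,
    # and both parts shrink by exp(-w) every step, so older tokens gradually fade.
    def toy_wkv(keys, values, w=0.5):
        num, den = 0.0, 0.0
        outputs = []
        for k, v in zip(keys, values):
            num = math.exp(-w) * num + math.exp(k) * v  # decay old info, add new token
            den = math.exp(-w) * den + math.exp(k)
            outputs.append(num / den)                   # average dominated by recent tokens
        return outputs

    print(toy_wkv(keys=[0.1, 0.5, 0.2], values=[1.0, 2.0, 3.0]))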


Are there any ways to train it to maintain attention on the original prompt no matter the distance from it, and selectively pay attention to its own output where relevant?


Instruction training. This is a WIP


Are there currently any plans to create a RWKV 30B or 65B? That seems to be the size at which the LLaMA transformer models become genuinely competitive with GPT3.5 for many tasks.


TLDR: please donate A100s to make this happen

Most of the focus is in the 1-14B range, due to constraints on dataset size (Chinchilla law) and the GPUs available.

Community demand is also mostly in this range, as there is a strong desire to optimise for and run on local GPUs.

Not representing Blink directly here - but if anyone wants to see a 30B / 65B model, reach out to contribute the GPUs required to make it happen.

The code is already there; we just need someone to run it.

PS: I too am personally interested in how it will perform at ~60B, which I believe to be the optimal model size for higher levels of thought (this number is based on intuition, not research).


https://twitter.com/boborado/status/1659608452849897472

You might find that thread interesting: they're taking submissions for potential partnerships with LambdaLabs, a cloud compute company that has a few hundred H100s lying around. They have an open form, their cofounder is currently doing the rounds having meetings, and this may be a good candidate.

I'm not associated with them at all, just interested in the space and things going on.


Weirdly, their form requires a company rep (which RWKV does not have, as it's not a company) - let's see how it goes ...


Are there any estimates anywhere of how many A100s would be needed to e.g. train a 30B model in 6 months?


That's a loaded question without first deciding the dataset size.


Would it be possible to just use the exact same dataset as LLaMA? (There's an open source project currently training a transformer on exactly that).


You mean red pajama? I believe that has already started for 1-14B (need to double check)


Yep that's the one. Curious roughly how many A100s it'd take to train a 65B RWKV on that.


Really bad napkin math, as no one has attempted 65B (so +/- 50%):

8 x 8 x 8 A100s should be able to do 100k++ tokens/s at that size.

With a dataset of 1.2 trillion tokens, that's 12 million seconds, or about 140 days.

(PS: this is why everyone is training <60B - the cost is crazy. Even if my math estimate is off by 300%, it's still a crazy number.)
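Spelling out that napkin math (every input below is the same guess as above, so treat the result as an order-of-magnitude estimate at best):

    # Reproducing the back-of-envelope estimate above; all inputs are assumptions.
    gpus = 8 * 8 * 8                 # 512 A100s
    tokens_per_second = 100_000      # assumed cluster-wide training throughput at 65B
    dataset_tokens = 1.2e12          # ~1.2 trillion tokens

    seconds = dataset_tokens / tokens_per_second
    print(f"{seconds:,.0f} s ~= {seconds / 86_400:.0f} days on {gpus} A100s")
    # -> roughly 12 million seconds, i.e. ~140 days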


Thank you! 8 x 8 x 8 is 512 A100s; that is indeed pretty expensive.


Can you elaborate on the Chinchilla law / dataset problem a bit? (Perhaps by editing your previous comment?)

What datasets are available to the community, how big are they, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o

Thank you!


The Chinchilla law is a rule of thumb that you should have 11++ training tokens for every param.

If not, you are getting diminishing benefits for each param you add

In extreme cases your model can even perform worse with more params, due to the lack of training data.

More complicated: the quality of the data matters as well

So there are 2 major directions: build efficient models with a good dataset and an optimal param count for the task,

or go big on everything (aka OpenAI), which requires monster GPU time for every reply token.

There are obviously in-betweens as well, hence why the question is so loaded.

Ballpark: if you're not setting aside $100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
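As a quick worked example of that rule of thumb (using the rough 11x floor quoted above; the Chinchilla paper itself lands closer to ~20 tokens per param):

    # Rough "how many training tokens do I need" check for a few model sizes.
    for params in (1e9, 14e9, 60e9):
        floor = 11 * params        # the 11x rule of thumb quoted above
        chinchilla = 20 * params   # closer to the original Chinchilla ratio
        print(f"{params / 1e9:.0f}B params -> {floor / 1e9:,.0f}B to {chinchilla / 1e9:,.0f}B tokens")
    # e.g. a 60B model wants very roughly 0.66 to 1.2 trillion tokens of (good) data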


30B would be interesting because that's the practical ceiling for local GPUs assuming 4-bit quantization.

Is there some kind of dedicated fund for training hardware? Donating an A100 sounds unlikely, but surely they could be crowdfunded?


Weirdly enough, organisations are more willing to provide rented GPUs than money.

If you want to help fund RWKV, the ko-fi link is - https://ko-fi.com/rwkv_lm

IMO: this needs way more funding, just to sustain blink leading this project, let alone GPUs for training.

(Also - current tests show this model doing really badly when 4-bit quantized, but alright at Q5 and Q8)
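For a feel of why ~30B at 4-bit is about the local-GPU ceiling, and why Q5/Q8 cost more memory, here is the usual back-of-envelope weight-memory estimate (weights only; it ignores the recurrent state, activations, and runtime overhead):

    # Approximate weight memory at different quantisation levels (weights only).
    def weight_gib(params, bits_per_weight):
        return params * bits_per_weight / 8 / 2**30

    for params, label in [(7e9, "7B"), (14e9, "14B"), (30e9, "30B")]:
        for bits in (4, 5, 8):
            print(f"{label} @ Q{bits}: ~{weight_gib(params, bits):.1f} GiB")
    # 30B @ Q4 is ~14 GiB of weights, which is why it (just) fits on a 16-24 GB card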


Are there any potential improvements over transformers for interpretability or alignment?


For anything past 8k context size, we are talking about over a 10x reduction in GPU time for inferencing tokens, and for training too.

Aka it’s cheaper and faster

Alignment is frankly IMO purely a dataset design and training issue. And has nothing to do with the model
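To see where the "past 8k context" framing comes from, here is a crude scaling illustration: full self-attention cost grows with the square of the context length, while an RNN-style pass grows linearly. Constants and implementation details are ignored, so the numbers are only indicative, not a benchmark of RWKV itself.

    # Crude asymptotic comparison, relative to an 8k-token baseline.
    base = 8_192
    for n in (8_192, 16_384, 32_768, 65_536):
        attn_growth = (n / base) ** 2   # quadratic in context length
        rnn_growth = n / base           # linear in context length
        print(f"ctx {n:>6}: attention cost x{attn_growth:.0f}, RNN-style cost x{rnn_growth:.0f}")
    # Doubling the context doubles the RNN-style cost but quadruples the attention
    # cost; actual speedups depend on constants, not just asymptotics.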


Is everyone still carefully not mentioning that it looks like it’s pronounced “Roku”? ;)



