Do you think there could be a kind of universality principle at work here, where once you make a good enough architecture then the details don't matter so much compared to the model size and training flops and dataset size? In other words, maybe it wasn't a coincidence that your architecture worked about as well as the transformer architecture?
There is a reasonable argument for that (I've heard the idea go around among multiple AI engineers that once you go past a certain scale, the architecture does not matter much for evals)
One of the biggest issues with testing all of this is that it takes a crap ton of GPUs to validate the alternatives to transformers beyond 1B params.
For example, I'm waiting for someone to do a 1B-14B text-based diffusion network
Finally, if this is truly the case (and all that really matters is size + dataset), we really should use an architecture that is cheaper to train and run. And that's what RWKV represents here
You can even run the 7B quantized model reasonably on most laptops (try the rwkv-cpp / rwkv-cpp-node project)
The paper says it's comparable to transformers right now, but that means it might be better later. Do you guys have concrete plans to make it better? Are they secret? Also, what's the deal with that foundation? Is it a cult, or like the new OpenAI that will turn closed, or maybe it's there to reap the value of random contributors to the project?
- it is NOT directly backed or owned by any VC-funded company
- it is 100% OSS driven by the community (Apache 2 license)
- it's currently the top OSS chat model that can be used commercially on the Chatbot Arena leaderboard
- IMO it is undertrained, so expanding the training data alone will make it much better (however, for the sake of this paper, we wanted to focus on architecture, not training data, so we compared similarly trained models)
And yes, we do have multiple experiments and plans to make it better. It's a long list, and we will not know which is final until we try. Individual members can speak at great length about what they are working on
For better or worse, being truly OSS means our initiatives are more disorganized than those of a centrally planned org. But I'd rather have that, in a world where we have more OSS models to choose from without weird rule-lawyering gotchas, or needing to be from a research institute / needing a license to download the weights
(Note: My comments do not represent or reflect those of my collaborators)
I remember talking to BlinkDL about this; I think the plan is just to build an ecosystem and provide more diversity in the DL space. There are plans to make an RWKV5; they are in the open in the RWKV5 channel. From an engineering standpoint, I don't really see the "reaping" of the value of random contributors to the project. Most of us, I believe, are hackers and tinkerers that just want to learn, contribute, and be a part of something that can change the current status quo.
Currently, what I'm seeing with RWKV is that attention fades quickly. The model will start to produce output, but very quickly (within a few dozen tokens), its own output tokens suddenly take 'precedence' over the input question, and it starts to simply repeat itself.
For example, I'm currently attempting to use RWKV for named-entity extraction. I ask it to analyze a piece of text and provide output in JSON format. It starts off great. However, eventually, the beginning of the JSON list seems to 'overtake' the question I asked, and it starts to just produce random data that would seem plausible given the items already in the list. I realize this is perhaps due to the precision losses of the RNN as its state decays.
However, I feel there ought to be some way we can prevent that. Any thoughts?
Yeah... so I did that, which is how I got it to begin correctly. This is what I mean, though.
I'll say "get a list of Blah from the following document in Json format like this:
Example"
Then I feed the document and add a spot for the answer.
The model begins correctly, but usually in the middle of the JSON list generation it will veer off and start hallucinating, as if it forgot the document and the task. I'm happy to share specifics and datasets, but this is a cross-cutting problem.
RWKV is able to answer my questions when I ask simple yes/no or classification questions. It's the listing that throws it for a loop. Transformers do not have the same problem; both LLaMA and GPT are able to maintain focus.
Also, do you know where I'd find information on how the current weights were trained?
Why would asking the question first improve quality? Is it because the model will be better aware of what info it can and can't throw away at each step? This seems like the opposite of transformers.
RWKV does not work like transformers. The "transformer" part here is the training step: it can be trained in parallel like a transformer. At inference, RWKV is an RNN with a fixed-size state, so old information slightly decays each time it reads a new token. Hence the freshest memory is of the most recent tokens.
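To make the decay concrete, here's a minimal numpy sketch of a simplified WKV-style recurrence (my own illustration, not the actual RWKV code; it omits the current-token bonus term and the numerical-stability tricks the real implementation uses):

```python
import numpy as np

def wkv_step(a, b, k, v, w):
    """One recurrent step: the old state decays by exp(-w), the new token is mixed in."""
    decay = np.exp(-w)              # w > 0 is a learned per-channel decay rate
    a = decay * a + np.exp(k) * v   # running weighted sum of values (numerator)
    b = decay * b + np.exp(k)       # running sum of weights (denominator)
    return a, b, a / b              # a/b is the attention-like output

d = 4                                 # channel dimension (toy size)
w = np.full(d, 0.5)                   # hypothetical decay rates
a, b = np.zeros(d), np.full(d, 1e-9)  # fixed-size state, regardless of context length
for t in range(100):
    k, v = np.random.randn(d), np.random.randn(d)  # stand-ins for projected inputs
    a, b, out = wkv_step(a, b, k, v, w)

# After 100 steps, token 0's contribution is scaled by exp(-0.5 * 99) ~ 3e-22:
# the freshest tokens dominate the state, and early instructions fade.
```

This is also why restating the question right before the answer helps: the instruction is then among the freshest tokens in the state.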
Are there any ways to train it to maintain attention on the original prompt no matter the distance from it, and selectively pay attention to its own output where relevant?
Are there currently any plans to create an RWKV 30B or 65B? That seems to be the size at which the LLaMA transformer models become genuinely competitive with GPT-3.5 for many tasks.
Most of the focus is in the 1-14B range, due to constraints on dataset sizes (the Chinchilla scaling laws) and the GPUs available.
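For a rough sense of what the Chinchilla constraint means here, a back-of-the-envelope sketch using the common ~20-tokens-per-parameter heuristic and the ~6ND training-FLOPs rule of thumb (approximations, not the exact fitted constants from the paper):

```python
# Back-of-the-envelope Chinchilla-style budgets (rounded heuristics).
for params_b in (1, 14, 30, 65):
    tokens_b = 20 * params_b                         # ~20 training tokens per parameter
    flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)  # ~6 * N * D training FLOPs
    print(f"{params_b:>2}B params -> ~{tokens_b}B tokens, ~{flops:.1e} FLOPs")
```

At 30B-65B that's roughly 0.6T-1.3T tokens of good data, which is where both the dataset and GPU constraints bite.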
Community demand is also mostly in this range, as there is a strong desire to optimise for and run on local GPUs.
Not representing Blink directly here, but if anyone wants to see a 30B / 65B model, reach out to contribute the GPUs required to make it happen.
The code is already there; someone just needs to run it.
PS: I too am personally interested in how it will perform at ~60B, which I believe will be the optimal model size for higher-level thought (this number is based on intuition, not research)
You might find that thread interesting: they're taking submissions for potential partnership with LambdaLabs, a cloud compute company that has a few hundred H100s lying around. They have an open form, and their cofounder is currently doing the rounds having meetings, so this may be a good candidate.
I'm not associated with them at all, just interested in the space and things going on.
Can you elaborate on the Chinchilla law / dataset problem a bit? (perhaps by editing your previous comment?)
What datasets are available to the community, how big are they, do they need to be updated from time to time, where are they stored, what are the usual cost ranges involved, ...? :o
I'm a regular involved with the RWKV community.
AMA on RWKV, and I will do my best to answer questions here for the next hour (one at a time)
PS: you can find our discord here : https://discord.gg/qt9egFA7ve