
> they just come from training bigger models on the same data

Are you arguing that all AI models are using the same network structure?

This is only true in the narrowest sense, looking at models that are strictly improvements over previous-generation models. It ignores the entire line of research that develops new models with new structures, or combines ideas from multiple previous works.




I sure am ignoring that, because the bitter lesson of AI is usually applicable and implies that all such research will be replaced by larger generic transformer networks as time goes on.

The exception is when you care about efficiency (in training or inference costs), but at the limit, or if all you care about is "better", you don't.


This is kind of an odd statement, because the transformer is not the most generic neural net. It's the result of many layers of architectural improvements over older designs. The bitter lesson is that methods that scale well with compute win (alpha/beta search beats hand-written heuristics alone, neural networks beat alpha/beta), not that the most obvious and generic approach eventually wins. Given the context-length problems with transformers, I think it's fair to say they have scaling problems of their own.
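
To make the context-length point concrete, here's a minimal sketch of single-head self-attention in plain NumPy (no learned projections, no batching; everything here is illustrative rather than taken from any particular implementation). The score matrix is n x n, so doubling the context length quadruples memory and compute:

    import numpy as np

    def attention(x):                           # x: (n, d) token embeddings
        # (n, n) score matrix -- this is the quadratic part
        scores = x @ x.T / np.sqrt(x.shape[1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ x                      # (n, d) mixed representations

    x = np.random.randn(1_000, 64)
    out = attention(x)                          # the score matrix alone holds 10^6 floats

    for n in (1_000, 2_000, 4_000):
        print(f"{n} tokens -> {n*n:,} score entries")  # doubling context quadruples cost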


There's a principle more powerful than the bitter lesson: GIGO.

Training to predict an internet dump can only get you so far.

There's a paper called something like "learning from textbooks" where they show that a small model trained on a high-quality, no-nonsense dataset can beat a much bigger model at a task like Python coding.
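
A rough sketch of that data-quality idea (all names here are hypothetical; quality_score just stands in for whatever classifier or heuristic a real "textbook-quality" pipeline would use): filter the raw dump first, then train a smaller model on what survives, rather than scaling the model up to absorb the noise.

    def quality_score(doc: str) -> float:
        # placeholder heuristic: reward docs that look like worked examples
        return min(1.0, doc.count("def ") / 5 + doc.count("Example:") / 5)

    def filter_corpus(docs, threshold=0.4):
        # keep only documents the (hypothetical) quality filter accepts
        return [d for d in docs if quality_score(d) >= threshold]

    web_dump = [
        "click here to win a free prize!!!",
        "Example: def add(a, b):\n    return a + b  # returns the sum",
    ]
    print(filter_corpus(web_dump))  # only the textbook-style doc survives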



