
> they just come from training bigger models on the same data

Are you arguing that all AI models are using the same network structure?

This is only true in the narrowest sense, looking at models that are strictly improvements over previous-generation models. It ignores the entire line of research that develops new models with new structures, or combines ideas from multiple previous works.




I sure am ignoring that, because the bitter lesson of AI is usually applicable and implies that all such research will be replaced by larger generic transformer networks as time goes on.

The exception is when you care about efficiency (in training or inference costs), but at the limit, or if all you care about is "better", you don't.


This is kind of an odd statement, because the transformer is not the most generic neural net. It's the result of many layers of architectural improvements over older designs. The bitter lesson is that methods that scale well with compute win (alpha/beta search beats hand-written heuristics alone, neural networks beat alpha/beta), not that the most obvious and generic approach eventually wins. Given the context-length problems with transformers, I think it's fair to say they have scaling problems of their own.
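
To make the context-length point concrete, here's a minimal sketch of single-head self-attention in plain NumPy (no learned projections, no batching; everything here is illustrative rather than taken from any particular implementation). The score matrix is n x n, so doubling the context length quadruples memory and compute:

    import numpy as np

    def attention(x):                           # x: (n, d) token embeddings
        # (n, n) score matrix -- this is the quadratic part
        scores = x @ x.T / np.sqrt(x.shape[1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ x                      # (n, d) mixed representations

    x = np.random.randn(1_000, 64)
    out = attention(x)                          # the score matrix alone holds 10^6 floats

    for n in (1_000, 2_000, 4_000):
        print(f"{n} tokens -> {n*n:,} score entries")  # doubling context quadruples cost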


There's a principle more powerful than the bitter lesson: GIGO.

Training to predict an internet dump can only get you so far.

There's a paper called something like "learning from textbooks" where they show that a small model trained on a high-quality, no-nonsense dataset can beat a much bigger model at a task like Python coding.
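
A rough sketch of that data-quality idea (all names here are hypothetical; quality_score just stands in for whatever classifier or heuristic a real "textbook-quality" pipeline would use): filter the raw dump first, then train a smaller model on what survives, rather than scaling the model up to absorb the noise.

    def quality_score(doc: str) -> float:
        # placeholder heuristic: reward docs that look like worked examples
        return min(1.0, doc.count("def ") / 5 + doc.count("Example:") / 5)

    def filter_corpus(docs, threshold=0.4):
        # keep only documents the (hypothetical) quality filter accepts
        return [d for d in docs if quality_score(d) >= threshold]

    web_dump = [
        "click here to win a free prize!!!",
        "Example: def add(a, b):\n    return a + b  # returns the sum",
    ]
    print(filter_corpus(web_dump))  # only the textbook-style doc survives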



