So it will now be cost-effective to connect the exhaust of ChatGPT to its inlet and watch as the quality of output deteriorates over time while making money off ads. Whatever floats your boat, I guess. How long before the answer to every prompt is "baaa baaa baaa"?
You’re sadly misinformed if you think training an LLM consists of dumping unfiltered sewage straight from the web into a training run. Sure, it’s been done in early experiments, but once you see the results you learn the value of data curation.
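To make "curation" concrete, here's a toy quality filter in Python. Every threshold is made up for illustration; real pipelines add deduplication, model-based classifiers, and a lot more:

    # Toy web-text filter: every threshold here is invented for illustration.
    def keep(doc: str) -> bool:
        words = doc.split()
        if len(words) < 50:                      # drop short fragments
            return False
        if len(set(words)) / len(words) < 0.3:   # drop repetitive spam
            return False
        alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
        return alpha >= 0.6                      # drop markup-heavy pages

    corpus = ["baaa " * 100, " ".join(f"token{i}" for i in range(80))]
    print(f"kept {sum(keep(d) for d in corpus)} of {len(corpus)} docs")  # kept 1 of 2

Note that the "baaa baaa baaa" document from upthread is exactly the kind of thing the repetition check throws away.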
That article itself might be part of the degradation. It mentions at least four times that the contract was canceled, as if it were news each time. I wonder if someone just dumped a bunch of facts into an AI and ran them through a spin cycle a few times to get a long-form article they didn't expect anyone to read.
It's clearly working: the models are only getting better. Believing that their performance will fall off at some point in the future is just delusional.
Weren't they getting better mostly because they were being scaled up? There's no way to do that once you've exhausted all of the data. Besides, progress has slowed down at this point anyway.
Not only from scaling. Look at the subject of this thread, GPT-4o mini.
I'm optimistic about synthetic data giving us another big unlock anyway. The text on the internet is not that reasoning-dense. And they have a snapshot of the pre-2023 web that is fixed and guaranteed not to decay. I don't think one extra year of good-quality internet text is what will make or break AGI efforts.
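As a toy illustration of why synthetic data can be more reasoning-dense than scraped text: you can generate examples whose answers are verifiable by construction, so no amount of recycling corrupts them. (Everything below is invented for the sketch.)

    import random

    # Toy generator: every target is computed programmatically, so the
    # answer is correct by construction, unlike scraped web text.
    def make_example(rng: random.Random) -> dict:
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        return {"prompt": f"What is {a} * {b}?",
                "target": f"{a} * {b} = {a * b}"}

    rng = random.Random(0)
    dataset = [make_example(rng) for _ in range(1000)]
    print(dataset[0])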
The harder bottleneck will be energy. It's relatively doable to go from 1 GW to 10 GW, but the next jump, to 100 GW, becomes insanely difficult.
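Back-of-envelope, assuming ~700 W per accelerator (roughly an H100 SXM's TDP) and a datacenter PUE of ~1.2, both ballpark guesses:

    # Back-of-envelope: how many accelerators each power tier could feed.
    GPU_WATTS = 700   # assumption: roughly an H100 SXM's TDP
    PUE = 1.2         # assumption: datacenter power overhead

    for gw in (1, 10, 100):
        gpus = gw * 1e9 / (GPU_WATTS * PUE)
        print(f"{gw:>3} GW ~ {gpus / 1e6:.1f}M GPUs")

100 GW is on the order of a hundred million GPUs' worth of power, which is why that last jump looks so hard.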
GPT-3 was 175B parameters and it's very bad compared to the much smaller models we have nowadays; the data and the compute play a giant role. Also, I doubt you'd need to keep training a model after you've trained it on absolutely everything (but we are very far from that).
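On the data/compute point, the Chinchilla result (Hoffmann et al., 2022) is the usual reference: compute-optimal training wants roughly 20 tokens per parameter, and GPT-3's ~300B training tokens were far short of that for 175B parameters:

    # Chinchilla rule of thumb (Hoffmann et al., 2022): ~20 tokens per
    # parameter for compute-optimal training.
    def optimal_tokens(params: float) -> float:
        return 20 * params

    for name, params in [("GPT-3 (175B)", 175e9), ("7B model", 7e9)]:
        print(f"{name}: ~{optimal_tokens(params) / 1e9:.0f}B tokens")
    # GPT-3 was reportedly trained on ~300B tokens, ~12x below the heuristic.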