>Yes, we don't really understand where emergent capabilities are coming from, at least not to the extent of being able to predict them ahead of time ("if we feed it this amount of data, of this type, it'll learn to do X"). New emergent capabilities arise, from time to time, as models are scaled up, but no one can predict exactly what their next-gen model is going to be capable of.
While finite-precision, finite-width transformers aren't Turing-complete, I don't see why the same property as the Game of Life, where one cannot predict the end state from the starting state without actually running it, wouldn't hold.
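To make the Game of Life point concrete, here is a minimal sketch (plain Python, toroidal grid, the names are my own): as far as anyone knows, the only general way to learn the grid's state after n steps is to actually run all n steps.

    def life_step(grid):
        """One Game of Life update on a toroidal (wrap-around) grid."""
        rows, cols = len(grid), len(grid[0])
        nxt = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # Count the eight neighbours, wrapping around the edges.
                live = sum(grid[(r + dr) % rows][(c + dc) % cols]
                           for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                           if (dr, dc) != (0, 0))
                # Live cell survives with 2 or 3 neighbours; dead cell is born with 3.
                nxt[r][c] = 1 if live == 3 or (grid[r][c] == 1 and live == 2) else 0
        return nxt

    def life_run(grid, n):
        # No known shortcut: to know the state after n steps, iterate n times.
        for _ in range(n):
            grid = life_step(grid)
        return grid

    glider = [[0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]]
    print(life_run(glider, 20))

(Transformers are obviously not cellular automata; the point is only that simple enough local rules can already make long-range prediction as hard as simulation.)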
As we know, transformers are at least as powerful as TC^0, which contains AC^0, and AC^0 is as powerful as first-order logic, whose validity problem is undecidable. So this may be similar to HALT, where we will never be able to accurately predict when emergence happens; approximation may be the best we can do, unless there are constraints, through something like the parallelism tradeoff, that allow for it.
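For what the HALT comparison amounts to, here is the standard diagonalization argument written out as a sketch (purely illustrative; perfect_predictor is a hypothetical oracle, not a real function):

    # Purely illustrative: the usual proof that no perfect halting predictor
    # can exist. Any system expressive enough to run this construction defeats
    # exact prediction of its own long-run behaviour.

    def perfect_predictor(program_source, input_data):
        """Hypothetical oracle: True iff the program halts on the input."""
        raise NotImplementedError  # cannot exist in general (Turing, 1936)

    def diagonal(program_source):
        # Do the opposite of whatever the oracle claims about the program
        # run on its own source code.
        if perfect_predictor(program_source, program_source):
            while True:      # oracle said "halts", so loop forever
                pass
        return "halted"      # oracle said "loops", so halt immediately

    # Asking the oracle about `diagonal` run on its own source yields a
    # contradiction either way, so no such oracle can exist.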
If you consider that PCP[O(log n), O(1)] = NP, i.e. that only O(log n) random bits (plus a constant number of queries) are needed to probabilistically verify an NP proof, the results of this paper seem more plausible.
I don't see that the difficulty of predicting/anticipating emergent capabilities is really related to undecidability, although there is perhaps a useful computer analogy... We could think of the trained LLM as a computer, and the prompt as the program, and certainly it would be difficult/impossible to predict the output without just running the program.
The problem with trying to anticipate the capabilities of a new model/training-set is that we don't even know what the new computer itself will be capable of, or how it will now interpret the program.
The way I'd tend to view it is that an existing trained model has some set of capabilities which reflect what can be done by combining the set of data-patterns/data-manipulations ("thought patterns" ?) that it has learnt. If we scale up the model and add more training data (perhaps some of a different type than has been used before), then there are two unknowns:
1) What new data-patterns/data-manipulations will it be able to learn ?
2) What new capabilities will become possible by using these new patterns/manipulations in combination with what it had before ?
Maybe it's a bit like having a construction set of various parts, and considering what new types of things could be built with it if we added some new parts (e.g. a beam, or gear, or wheel), except we are trying to predict this without even knowing what those new parts will be.
https://arxiv.org/abs/2304.15004 ("Are Emergent Abilities of Large Language Models a Mirage?")
I have yet to see any peer review that makes that continuous view invalid.
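For anyone who hasn't read it, the linked paper's "continuous view" is roughly that the jump lives in the metric, not the model. A toy sketch of that argument (the numbers are made up, not from the paper):

    # Per-token accuracy improves smoothly with scale, but an all-or-nothing
    # metric such as exact match on a 10-token answer looks like a sudden,
    # "emergent" jump.

    per_token_acc = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    answer_length = 10

    for p in per_token_acc:
        exact_match = p ** answer_length   # every token must be correct
        print(f"per-token {p:.2f} -> exact-match {exact_match:.4f}")

    # 0.50 -> 0.0010, 0.90 -> 0.3487, 0.99 -> 0.9044: the discontinuity comes
    # from the choice of metric, not necessarily from the underlying model.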
As you pointed out, we understand the underlying systems, but I think we should be surprised if someone does find a good approximation reduction.
But in my experience that also indicates an extreme limit in what can be modeled.
Then again, all FFNs are effectively DAGs, and i.i.d. does force a Gaussian distribution of inputs.
But unless you are learning something that is Markovian and ergodic, undecidability seems like a high probability.
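To illustrate the contrast I mean: for a finite, ergodic Markov chain the long-run behaviour is predictable, since it converges to a unique stationary distribution regardless of the starting state. A small sketch with made-up numbers:

    # Made-up two-state chain: iterating the transition matrix gives the
    # stationary distribution directly, which is exactly the kind of shortcut
    # that is missing in the general (non-Markovian, non-ergodic) case.

    transition = [[0.9, 0.1],
                  [0.4, 0.6]]

    def step_distribution(dist, T):
        return [sum(dist[i] * T[i][j] for i in range(len(T)))
                for j in range(len(T[0]))]

    dist = [1.0, 0.0]           # start entirely in state 0
    for _ in range(100):
        dist = step_distribution(dist, transition)
    print(dist)                 # ~[0.8, 0.2], the stationary distribution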