> That tells me that there's something still structurally missing about the training efficiency.
Imagine Alice is abducted by aliens and given reams and reams of unfamiliar symbols and trained to predict which one came next given a long, long prefix. Alice is held in a cell alone with just symbol sequences for 15 years, and by the end of that period she's gotten pretty good at predicting which symbol comes next. Bob's experience is exactly the same. Neither has any way to understand what any of the symbols mean. Finally, Alice and Bob are let out of their cells for a break, and meet Krang. Krang explains that Alice has been doing a sometimes acceptable job of producing computer code for a kind of computer she's never been able to directly interact with! She might have gotten really good by the end of year 1 if anyone had explained that she was writing programs, or given her access to a REPL, or a debugger, or a manual. But she's been trained with exactly the same procedure as Bob, who has been pumping out advertising copy.
Current code LLMs are only doing next-token prediction, and critically they don't have access to a model of the formal semantics of each language, or to an interpreter, debugger, or compiler. This is a shame, because program generation is arguably one of relatively few areas in which we could give our models a "complete" view of the domain. An appropriately structured model could generate the program, predict and observe the AST, predict and observe the IR graph, predict and observe the generated bytecode, predict and observe program traces from execution, and so on. But it doesn't do any of that. It doesn't have an explicit model of what the program will do during execution. It has no way to check that an invariant is maintained at each iteration of a loop. It doesn't get to check that what it wrote behaves as intended.
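To make that concrete, here's a toy sketch (in Python, purely as an illustration) of the extra views a model could be asked to predict and then check against reality: the parsed AST, the compiled bytecode, and a crude execution trace. The `total` function is a hypothetical stand-in for model-generated code, not anything a real system produced.

```python
import ast
import dis
import sys

# Hypothetical model-generated code (a stand-in for illustration only).
generated_source = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

# 1. Syntactic view: the AST the model could predict and compare against.
tree = ast.parse(generated_source)
print(ast.dump(tree, indent=2))

# 2. Compiled view: the bytecode actually produced for that source.
code_obj = compile(tree, filename="<generated>", mode="exec")
dis.dis(code_obj)

# 3. Dynamic view: a crude line-by-line trace of the program running.
def tracer(frame, event, arg):
    if event == "line":
        print(f"executing line {frame.f_lineno} in {frame.f_code.co_name}")
    return tracer

namespace = {}
exec(code_obj, namespace)          # define total()
sys.settrace(tracer)
result = namespace["total"]([1, 2, 3])
sys.settrace(None)

# 4. Behavioural check: does the output match the stated intent?
assert result == 6
```

None of this is exotic; it's the kind of feedback any human programmer gets for free, and the point is simply that a training or generation loop could observe all of these views rather than tokens alone.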
Yesterday, one of the chat models that can also generate code gave me a Kotlin example which used a language feature Kotlin doesn't actually have (basically Scala-style pattern matching), and of course it was totally unaware that the generated code wasn't even valid Kotlin, because it never attempted to call any part of the toolchain.
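Even the crudest toolchain check would have caught it. A minimal sketch of what I mean, assuming kotlinc is on the PATH; the Kotlin snippet here is a made-up reconstruction of that kind of mistake, not the model's actual output:

```python
import pathlib
import subprocess
import tempfile

# Hypothetical model output that wrongly assumes a Scala-style `match` expression.
generated_kotlin = """
fun describe(x: Any): String = x match {   // not valid Kotlin; `when` is the real construct
    case i: Int => "int"
    case _ => "other"
}
"""

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "Generated.kt"
    src.write_text(generated_kotlin)

    # Ask the real compiler whether the generated code is even valid Kotlin.
    proc = subprocess.run(
        ["kotlinc", str(src), "-d", tmp],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        print("toolchain rejected the generated code:")
        print(proc.stderr)
```

That's all it would take for the system to notice that the feature it invented doesn't exist, instead of handing the error over to me.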