Now this is truly the programming language we should be using to benchmark LLM code gen as a private holdout set. There are no substantial datasets on the internet or GitHub, and no documentation except the one provided. And that's all the model should need.
I asked GPT-4 to write a matmul function, but that was too ambitious and it spat out outrageous nonsense.
To be fairer, I gave it in-context access to the documentation in the prompt, along with the example Fibonacci function, i.e. everything humans have access to. I then asked it to do the simpler task of converting a base-10 integer to binary. It was unable to write something error-free even after 4 rounds of feeding it the error messages.
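For reference, here's a minimal sketch of what that task might look like, written by hand against the documentation (I'm assuming MODULO works as documented and that parenthesized calls are fine, since the interpreter's example doesn't exercise either). It sidesteps string manipulation entirely by packing the bits into a base-10 number:

    DISCOVER HOW TO to_binary WITH n
    RUMOR HAS IT
        WHAT IF n SMALLER THAN 2
            SHOCKING DEVELOPMENT n
        LIES! RUMOR HAS IT
            EXPERTS CLAIM bit TO BE n MODULO 2
            EXPERTS CLAIM rest TO BE (n MINUS bit) DIVIDED BY 2
            SHOCKING DEVELOPMENT ((to_binary OF rest) TIMES 10) PLUS bit
        END OF STORY
    END OF STORY

    YOU WON'T WANT TO MISS to_binary OF 13
    PLEASE LIKE AND SUBSCRIBE

to_binary OF 13 prints 1101: each recursive call peels off the lowest bit with MODULO, and since everything stays numeric there's no string handling to trip over.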
I repeated this 5 times in case it generated something grammatical in the top-K@5.
I suspected there was some confusion about string manipulation that it couldn't surmount, so I changed the question to something still challenging but using only function calls, conditional logic, basic math ops, and numbers. First, I asked for an nth-root approximator using Newton's method. Didn't work. Then just the square root. Didn't work. Finally, a function that prints a student's grade given their integer percentage. Not even that.
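For comparison, here's a hand-written sketch of the grade printer. I'm assuming BEATS is the documented greater-than operator and that LIES! can be followed by another WHAT IF to chain conditions, the way a single statement follows WHAT IF in the square-root example below; the letter-grade cutoffs are just the usual 90/80/70/60 scale:

    DISCOVER HOW TO print_grade WITH percentage
    RUMOR HAS IT
        WHAT IF percentage BEATS 89
            YOU WON'T WANT TO MISS 'A'
        LIES! WHAT IF percentage BEATS 79
            YOU WON'T WANT TO MISS 'B'
        LIES! WHAT IF percentage BEATS 69
            YOU WON'T WANT TO MISS 'C'
        LIES! WHAT IF percentage BEATS 59
            YOU WON'T WANT TO MISS 'D'
        LIES!
            YOU WON'T WANT TO MISS 'F'
    END OF STORY

    EXPERTS CLAIM ignored TO BE print_grade OF 87
    PLEASE LIKE AND SUBSCRIBE

Nothing here but a call, comparisons, and prints, which is exactly the restricted subset the models were given, and they still couldn't get the syntax right.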
GPT-4 also persistently hallucinated the keyword BREAKING NEWS, which I think sounds like a pretty good keyword if Tabloid were to ever get error handling.
The spooky part is that almost all the solutions would, at face value, get partial credit. They had the right abstract approach, the models being familiar with reams of example approaches in natural language and other programming languages. Yet in every case, GPT-4, GPT-4o, and Claude all failed to produce something free of syntax errors.
I suspect this is because transformers do subgraph matching: on one end there are rich internal connections for all the problems I requested, but on the other there is nothing similar enough for the model to even get a foothold, hence the biggest struggle being syntax. If the only barrier to emitting runnable Tabloid (or any other unseen language) is more basic syntax training, that excitingly suggests a model only needs to learn the abstract concepts from LeetCode scrapes once, rather than once per syntax it knows. Prior research has shown that grammar is easy for language models: when GPT-2 was made large enough, it went from babbling to grammatical sentences very early in its training, and at that moment its loss plummeted.
All tests conducted in temporary data mode so that this eval stays dark.
    DISCOVER HOW TO square_root WITH x, iterations
    RUMOR HAS IT
        EXPERTS CLAIM guess TO BE x DIVIDED BY 2
        DISCOVER HOW TO improve_guess WITH current_guess
        RUMOR HAS IT
            SHOCKING DEVELOPMENT (current_guess PLUS (x DIVIDED BY current_guess)) DIVIDED BY 2
        END OF STORY
        DISCOVER HOW TO iterate WITH current_guess, remaining_iterations
        RUMOR HAS IT
            WHAT IF remaining_iterations SMALLER THAN 1
                SHOCKING DEVELOPMENT current_guess
            LIES! RUMOR HAS IT
                EXPERTS CLAIM new_guess TO BE improve_guess OF current_guess
                SHOCKING DEVELOPMENT iterate OF new_guess, remaining_iterations MINUS 1
            END OF STORY
        END OF STORY
        SHOCKING DEVELOPMENT iterate OF guess, iterations
    END OF STORY

    EXPERTS CLAIM number TO BE 16
    EXPERTS CLAIM num_iterations TO BE 5
    YOU WON'T WANT TO MISS 'The square root of'
    YOU WON'T WANT TO MISS number
    YOU WON'T WANT TO MISS 'is approximately'
    YOU WON'T WANT TO MISS square_root OF number, num_iterations
    PLEASE LIKE AND SUBSCRIBE
Same, I've been pretty impressed as well and typically give Claude a shot. Sometimes I even pass their results back and forth in an LLM collab so they generate more diverse perspectives. However, this paper from 4 days ago shows that Claude can fall apart quickly on out-of-distribution tasks. If you ask opposite-day questions, GPT-4 is weirdly strong at them (figure 2).
I really think the "please like and subscribe" that ends the program should also be printed out (with a link to the project's GitHub page to make it more... actionable).
That is exactly what I did for the 'A practical approach to parsing' workshop I gave at MCH2022 [1]. You can give it a try with the online IParse Studio [2], which has a simple built-in interpreter, and if you are lazy or getting stuck, you can have a look at the grammar I wrote myself [3], which does not specify operator precedence yet.