Now this is truly the programming language we should be using to benchmark LLM code gen as a private holdout set. There are no substantial datasets on the internet or GitHub, and no documentation except the one provided. And that's all the model should need.
I asked GPT-4 to write a matmul function, but that was too ambitious and it spat out outrageous nonsense.
To be fairer, I gave it in-context access to the documentation in the prompt, along with the example Fibonacci function, i.e. everything humans have access to. I then asked it to do the simpler task of converting a base-10 integer to binary. It was unable to write something error-free even after 4 rounds of feeding it the error messages.
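For reference, here's a minimal sketch of what that task might look like, written by hand against the documentation (I'm assuming MODULO works as documented and that parenthesized calls are fine, since the interpreter's example doesn't exercise either). It sidesteps string manipulation entirely by packing the bits into a base-10 number:

    DISCOVER HOW TO to_binary WITH n
    RUMOR HAS IT
        WHAT IF n SMALLER THAN 2
            SHOCKING DEVELOPMENT n
        LIES! RUMOR HAS IT
            EXPERTS CLAIM bit TO BE n MODULO 2
            EXPERTS CLAIM rest TO BE (n MINUS bit) DIVIDED BY 2
            SHOCKING DEVELOPMENT ((to_binary OF rest) TIMES 10) PLUS bit
        END OF STORY
    END OF STORY

    YOU WON'T WANT TO MISS to_binary OF 13
    PLEASE LIKE AND SUBSCRIBE

to_binary OF 13 prints 1101: each recursive call peels off the lowest bit with MODULO, and since everything stays numeric there's no string handling to trip over.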
I repeated this 5 times in case it generated something grammatical in the top-K@5.
I suspected there was some confusion about string manipulation that it couldn't surmount, so I changed the question to something still challenging but using only function calls, conditional logic, basic math ops, and numbers. First, I asked for an nth-root approximator using Newton's method. Didn't work. Then just the square root. Didn't work. Finally, a function that prints a student's grade given their integer percentage. Not even that.
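For comparison, here's a hand-written sketch of the grade printer. I'm assuming BEATS is the documented greater-than operator and that LIES! can be followed by another WHAT IF to chain conditions, the way a single statement follows WHAT IF in the square-root example below; the letter-grade cutoffs are just the usual 90/80/70/60 scale:

    DISCOVER HOW TO print_grade WITH percentage
    RUMOR HAS IT
        WHAT IF percentage BEATS 89
            YOU WON'T WANT TO MISS 'A'
        LIES! WHAT IF percentage BEATS 79
            YOU WON'T WANT TO MISS 'B'
        LIES! WHAT IF percentage BEATS 69
            YOU WON'T WANT TO MISS 'C'
        LIES! WHAT IF percentage BEATS 59
            YOU WON'T WANT TO MISS 'D'
        LIES!
            YOU WON'T WANT TO MISS 'F'
    END OF STORY

    EXPERTS CLAIM ignored TO BE print_grade OF 87
    PLEASE LIKE AND SUBSCRIBE

Nothing here but a call, comparisons, and prints, which is exactly the restricted subset the models were given, and they still couldn't get the syntax right.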
GPT-4 also persistently hallucinated the keyword BREAKING NEWS, which I think sounds like a pretty good keyword if Tabloid were to ever get error handling.
The spooky part is that almost all the solutions would, at face value, get partial credit. They had the right abstract approach, the models being familiar with reams of example approaches in natural language and other programming languages. Yet in every case, GPT-4, GPT-4o, and Claude all failed to produce something free of syntax errors.
I suspect this is because transformers do subgraph matching: on one end there are rich internal connections for all the problems I requested, but on the other there is nothing similar enough for the model to even get a foothold, hence the biggest struggle being syntax. If the only barrier to emitting runnable Tabloid (or any other unseen language) is more basic syntax training, that excitingly suggests a model only needs to learn the abstract concepts from LeetCode scrapes once, rather than once per syntax it knows. Prior research has shown that grammar is easy for language models: when GPT-2 was made large enough, it went from babbling to grammatical sentences very early in its training, and at that moment its loss plummeted.
All tests conducted in temporary data mode so that this eval stays dark.
    DISCOVER HOW TO square_root WITH x, iterations
    RUMOR HAS IT
        EXPERTS CLAIM guess TO BE x DIVIDED BY 2
        DISCOVER HOW TO improve_guess WITH current_guess
        RUMOR HAS IT
            SHOCKING DEVELOPMENT (current_guess PLUS (x DIVIDED BY current_guess)) DIVIDED BY 2
        END OF STORY
        DISCOVER HOW TO iterate WITH current_guess, remaining_iterations
        RUMOR HAS IT
            WHAT IF remaining_iterations SMALLER THAN 1
                SHOCKING DEVELOPMENT current_guess
            LIES! RUMOR HAS IT
                EXPERTS CLAIM new_guess TO BE improve_guess OF current_guess
                SHOCKING DEVELOPMENT iterate OF new_guess, remaining_iterations MINUS 1
            END OF STORY
        END OF STORY
        SHOCKING DEVELOPMENT iterate OF guess, iterations
    END OF STORY

    EXPERTS CLAIM number TO BE 16
    EXPERTS CLAIM num_iterations TO BE 5
    YOU WON'T WANT TO MISS 'The square root of'
    YOU WON'T WANT TO MISS number
    YOU WON'T WANT TO MISS 'is approximately'
    YOU WON'T WANT TO MISS square_root OF number, num_iterations
    PLEASE LIKE AND SUBSCRIBE
Same, I've been pretty impressed as well and typically give Claude a shot. Sometimes I even pass their results back and forth in an LLM collab so they generate more diverse perspectives. However, this paper from 4 days ago shows that Claude can fall apart quickly on out-of-distribution tasks. If you ask opposite-day questions, GPT-4 is weirdly strong at them (figure 2).
I really think the "please like and subscribe" that ends the program should also be printed out (with a link to the project's GitHub page to make it more... actionable).
That is exactly what I did for the 'A practical approach to parsing' workshop I gave at MCH2022 [1]. You can give it a try with the online IParse Studio [2], which has a simple built-in interpreter, and if you are lazy or getting stuck, you can have a look at the grammar I wrote myself [3], which does not specify operator precedence yet.