Hacker News

Hi there, I have two questions:

1 - Why did you choose Markdown? It seems an odd choice for training a model like this.

2 - Have you tried training on only a single programming language and benchmarking that against this more general version?




1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.

2- I like how portable it is: a single small model handling many languages. Single-language code models are the approach taken by models like Salesforce/CodeGen, but I believe we beat (or come very close to) their mono models on benchmarks.


Have you thought of finding or creating something like this [0]?

I created this as the basis for my origami folding descriptive language. I tried to find something similar, the requirements being that it be both well structured and English-like, but couldn't find anything, so I created it.

The origami folding app will hopefully be out in 2 weeks, so you can see how it's used.

[0] https://github.com/fuzzthink/mation-spec


They trained on https://huggingface.co/datasets/bigcode/the-stack-dedup which is a massive curated dataset accumulated from GitHub. Details are here: https://www.bigcode-project.org/docs/about/the-stack/

Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

More details from them here: https://blog.replit.com/llm-training
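The point about markup and config formats dominating GitHub corpora can be illustrated with a small tally. This is a minimal sketch: the sample records below are hypothetical, merely mimicking the kind of per-file `lang`/`content` fields a dataset like the-stack-dedup exposes, not actual rows from it.

```python
from collections import Counter

# Hypothetical records shaped like rows of a code corpus (assumed fields
# "lang" and "content" for illustration; not real dataset rows).
sample = [
    {"lang": "Markdown", "content": "# Example\nProse describing the code."},
    {"lang": "Python", "content": "def add(a, b):\n    return a + b\n"},
    {"lang": "JSON", "content": '{"key": "value"}'},
    {"lang": "Markdown", "content": "Usage notes in natural language."},
]

# Tally files per language -- the kind of count that shows markup/config
# formats (Markdown, JSON, YAML, ...) among the most-represented "languages".
lang_counts = Counter(row["lang"] for row in sample)
print(lang_counts.most_common())
```

On a real corpus you would stream the rows instead of holding them in memory, but the counting step is the same.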





