Hacker News

Hi there, I have two questions:

1 - Why did you choose Markdown? It seems an odd choice for training a model like this.

2 - Have you tried training on only a single programming language and benchmarking that against this more general version?




1- We trained on languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it will act as a sort of "natural language label" for code.

2- I like how portable it is: a single small model handling many languages. Single-language code models are the approach taken by models like Salesforce/CodeGen, but I believe we beat (or come very close to) their mono models on benchmarks.


Have you thought of finding or creating something like this [0]?

I created this as the basis for my origami folding descriptive language. I tried to find something similar, the requirements being that it be both well structured and English-like, but couldn't find anything, so I created it.

The origami folding app will hopefully be out in 2 weeks, so you can see how it's used.

[0] https://github.com/fuzzthink/mation-spec


They trained on https://huggingface.co/datasets/bigcode/the-stack-dedup which is a massive curated dataset accumulated from GitHub. Details are here: https://www.bigcode-project.org/docs/about/the-stack/

Many of the most-represented "languages" on GitHub are actually things like JSON, XML, HTML, CSV, text, markdown, YAML, and SVG.

More details from them here: https://blog.replit.com/llm-training
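The point about markup and config formats dominating GitHub corpora can be illustrated with a small tally. This is a minimal sketch: the sample records below are hypothetical, merely mimicking the kind of per-file `lang`/`content` fields a dataset like the-stack-dedup exposes, not actual rows from it.

```python
from collections import Counter

# Hypothetical records shaped like rows of a code corpus (assumed fields
# "lang" and "content" for illustration; not real dataset rows).
sample = [
    {"lang": "Markdown", "content": "# Example\nProse describing the code."},
    {"lang": "Python", "content": "def add(a, b):\n    return a + b\n"},
    {"lang": "JSON", "content": '{"key": "value"}'},
    {"lang": "Markdown", "content": "Usage notes in natural language."},
]

# Tally files per language -- the kind of count that shows markup/config
# formats (Markdown, JSON, YAML, ...) among the most-represented "languages".
lang_counts = Counter(row["lang"] for row in sample)
print(lang_counts.most_common())
```

On a real corpus you would stream the rows instead of holding them in memory, but the counting step is the same.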





