1- We trained on the languages that are most popular on Replit. Markdown is important because you need some amount of natural language in the data, and it acts as a sort of "natural language label" for the code.
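For illustration, here is the kind of Markdown document (a hypothetical example, not one drawn from Replit's actual training corpus) where the heading and prose act as a natural-language label for the code they describe:

    # Quicksort in Python

    A recursive implementation with O(n log n) average-case runtime:

    ```python
    def quicksort(xs):
        # Base case: lists of length 0 or 1 are already sorted
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        # Partition around the pivot and recurse on each half
        return (quicksort([x for x in rest if x < pivot])
                + [pivot]
                + quicksort([x for x in rest if x >= pivot]))
    ```

A model trained on many such files can learn to associate phrases like "recursive implementation" and "quicksort" with the code that follows them.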
2- I like how portable it is: a single small model covering a lot of languages. Single-language code models are the approach Salesforce/CodeGen took with their mono models, but I believe we beat them (or come very close) on benchmarks.
Have you thought of finding or creating something like this [0]?
I created it as the basis for my origami-folding descriptive language. I looked for something similar, my requirements being something both well structured and English-like, but couldn't find anything, so I built my own.
The origami folding app will hopefully be out in 2 weeks, so you can see how it's used.
1 - Why did you choose Markdown? It seems an odd choice for training a model like this.
2 - Have you tried training on only a single PL and then benchmarking it against this more general version?