"Every day, millions of files are uploaded onto Github, a coding repository where users share code and collaborate on projects. But how do you tell what language those files are written in?
Identifying programming languages is surprisingly hard. Symbols used in one language often have different meanings in another. For example, ‘#’ in Python indicates a comment, while in C it indicates a preprocessor command. Even worse, code in one language can actually contain code in different languages, such as HTML, which can contain CSS and/or Javascript.
At the moment, Github uses a giant checklist to identify unique quirks in the language. For example, if the code contains “:- module”, then it’s probably Mercury. However, most languages simply don’t have enough unique quirks for this method to be accurate enough.
The current solution the Github team at ML@B came up with is to use a machine learning algorithm called a Naïve Bayes Classifier. To optimize the program, they used Github’s checklist as a guideline for choosing the correct language rather than a hard-and-fast rule. Currently, the team is working on scraping Rosetta Code, a large repository of code in hundreds of different programming languages, for data that they can use for training the program and testing its accuracy. Eventually, they will want to see if implementing other models, such as neural networks, can improve accuracy."
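For anyone wondering what "Naïve Bayes with the checklist as a guideline" could look like, here is a minimal sketch of the idea, not the ML@B team's actual code (which I haven't seen). It assumes a crude regex tokenizer, Laplace smoothing, and a checklist match that adds a fixed bonus to a language's log-score instead of deciding the answer outright. The names `checklist_hints`, `HINT_BONUS`, `NaiveBayesLanguageGuesser` and `build_toy_guesser` are made up for illustration, and the training snippets are toy data, nowhere near a Rosetta Code scrape.

```python
import math
import re
from collections import Counter, defaultdict

# Crude tokenizer (an assumption): identifiers, or single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")

# Hypothetical weight: how much a checklist match boosts a language's log-score.
HINT_BONUS = 5.0


def tokenize(source):
    return TOKEN_RE.findall(source)


def checklist_hints(source):
    """Soft version of the 'checklist': returns log-score bonuses, not verdicts."""
    hints = {}
    if ":- module" in source:  # the Mercury quirk mentioned in the quoted article
        hints["Mercury"] = HINT_BONUS
    return hints


class NaiveBayesLanguageGuesser:
    """Multinomial Naive Bayes over tokens, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()               # language -> number of samples
        self.vocab = set()

    def train(self, language, source):
        tokens = tokenize(source)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocab.update(tokens)

    def scores(self, source, hints=None):
        """Log-score per trained language; hints are added as a guideline."""
        hints = hints or {}
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        result = {}
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # log prior + optional checklist bonus
            log_score = math.log(self.doc_counts[language] / total_docs)
            log_score += hints.get(language, 0.0)
            # "naive" part: tokens are treated as independent given the language
            denominator = total_tokens + self.alpha * vocab_size
            for token in tokens:
                log_score += math.log((counts[token] + self.alpha) / denominator)
            result[language] = log_score
        return result

    def guess(self, source, hints=None):
        scored = self.scores(source, hints)
        return max(scored, key=scored.get)


def build_toy_guesser():
    """Trains on a handful of made-up snippets, purely for demonstration."""
    guesser = NaiveBayesLanguageGuesser()
    guesser.train("Python", "def add(a, b):\n    return a + b\n")
    guesser.train("Python", "import sys\nfor line in sys.stdin:\n    print(line)\n")
    guesser.train("C", '#include <stdio.h>\nint main(void) {\n    printf("hi");\n    return 0;\n}\n')
    guesser.train("C", "int add(int a, int b) {\n    return a + b;\n}\n")
    guesser.train("Mercury", ":- module hello.\n:- interface.\n:- import_module io.\n")
    return guesser


if __name__ == "__main__":
    guesser = build_toy_guesser()
    snippet = "def square(x):\n    return x * x\n"
    print(guesser.guess(snippet, hints=checklist_hints(snippet)))
```

The point of adding the hint in log space is that a strong token-level signal can still outvote a checklist match, which is one way to read "guideline rather than a hard-and-fast rule". How the team actually combined the two is not stated in the article.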
I'm certain I saw more advanced claims of success on this project and that github knows about it, since last time I asked they were 'still talking about it internally', but every link I had is dead...
WELL, now that I've googled seriously, I think I was too pessimistic: there seems to be some movement on this front at github: https://github.blog/2019-07-02-c-or-java-typescript-or-javas...
Excited to see what's coming out of this new endeavour... Just sent an email to github support, fingers crossed.