Structured Generative Models of Natural Source Code [pdf] (jmlr.org)
40 points by dennybritz on June 19, 2014 | hide | past | favorite | 4 comments



This is a new and very interesting area at the intersection of machine learning and software engineering. Anyone interested might also find the following papers useful:

Hindle, Abram, et al. "On the naturalness of software." Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012.

Tu, Zhaopeng, Zhendong Su, and Prem Devanbu. "On the Localness of Software."

Nguyen, Tung Thanh, et al. "A statistical semantic language model for source code." Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 2013.

Campbell, Joshua Charles, Abram Hindle, and José Nelson Amaral. "Syntax errors just aren't natural: improving error reporting with language models." Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014.

Allamanis, Miltiadis, and Charles Sutton. "Mining source code repositories at massive scale using language modeling." Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on. IEEE, 2013.

Movshovitz-Attias, Dana, and William W. Cohen. "Natural Language Models for Predicting Programming Comments." ACL (2). 2013.

Allamanis, Miltiadis, Earl T. Barr, and Charles Sutton. "Learning Natural Coding Conventions." arXiv preprint arXiv:1402.4182 (2014).

Allamanis, Miltiadis, and Charles Sutton. "Mining Idioms from Source Code." arXiv preprint arXiv:1404.0417 (2014).


Assuming you're more familiar with this than I am, can you outline why this is useful (either practically, or why it's theoretically more useful than e.g. modelling natural language)?


From a machine learning perspective (i.e., theoretically, in my view) this is useful because source code is highly structured, has very complex constraints, and comes with tons of data (e.g., every project on GitHub). This means machine-learning methods capable of handling such problems need to be developed, and such methods may eventually be useful in other applications.

Now, on the applied side (software engineering, programming languages), such methods (probabilistic machine learning and probabilistic/statistical models) can handle uncertainty in a principled way and provide software engineers with tools that exploit the large amounts of data available in both internal and external codebases. This is not fully possible with formal tools, which usually require some form of human knowledge to be embedded. For example, on the list above you will see tools that do autocompletion, others that suggest "reasonable" renamings, and others that help migrate source code between languages — all thanks to data.
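To make the autocompletion idea concrete, here is a minimal sketch of the n-gram approach from the naturalness-of-software line of work: count which token tends to follow which in a training corpus, then suggest the most frequent continuations. The token streams below are hypothetical toy examples; real systems train smoothed n-gram (or richer) models on millions of files.

```python
from collections import Counter, defaultdict

# Toy corpus of tokenized code lines; a real tool would lex a large codebase.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "x", "in", "range", "(", "10", ")", ":"],
    ["for", "k", "in", "items", ":"],
]

# Bigram counts: previous token -> Counter of observed next tokens.
model = defaultdict(Counter)
for tokens in corpus:
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1

def suggest(prev_token, k=3):
    """Return up to k most likely next tokens after prev_token."""
    return [tok for tok, _ in model[prev_token].most_common(k)]

print(suggest("in"))  # -> ['range', 'items']
```

A bigram model is far too weak for a practical completer, but the same count-normalize-rank pipeline underlies the higher-order models used in the papers above.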

Hopefully, at some point these methods will be advanced enough to learn (i.e., be trained) from every piece of code available online and, for example, spot bugs in your code, semi-automatically refactor it, etc.
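The bug-spotting idea can be sketched the same way: a language model assigns each token a surprisal (negative log probability given its context), and unusually improbable tokens are flagged for review — roughly the intuition behind the Campbell et al. syntax-error paper above. This is a toy illustration with an assumed add-one-smoothed bigram model, not any of the papers' actual implementations.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tokenized training corpus.
corpus = [
    ["if", "(", "x", "==", "0", ")"],
    ["if", "(", "y", "==", "1", ")"],
    ["while", "(", "i", "<", "n", ")"],
]

counts = defaultdict(Counter)
for toks in corpus:
    for prev, nxt in zip(toks, toks[1:]):
        counts[prev][nxt] += 1

vocab = {t for toks in corpus for t in toks}

def surprisal(prev, tok):
    """Bits of surprise for tok given prev, with add-one smoothing."""
    total = sum(counts[prev].values()) + len(vocab)
    return -math.log2((counts[prev][tok] + 1) / total)

# A common continuation scores low; an unlikely one scores high and
# would be flagged as a potential anomaly.
print(surprisal("if", "("), surprisal("if", "while"))
```

Real tools score whole token windows and rank them, but the principle is the same: code that the model finds "unnatural" is worth a second look.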






