[I'm one of the Microsoft Research people who worked on this]
Thanks for your questions! We considered many heuristics, but we didn't want to constrain the dataset release with some heuristic we picked, possibly ruining the dataset. Participants in the challenge should feel free to apply additional filters as they see fit; for example, this [1] work could be useful as a filtering method.
Unfortunately, we do not have the budget to provide any compute resources to help with running the models at this time. Note that any techniques developed with this dataset will be owned by those who develop them, and it is up to them how (or whether) they make them available or open-source.
[I'm one of the Microsoft Research people who worked on this]
That's certainly true for simple use cases. Our goal here is to eventually also capture the long-tail of queries about a codebase. Often, within the domain of a project there is a set of natural language terms/jargon that describe complex tasks specific to the domain. Imagine for example a developer joining a mid-sized project and trying to find how to achieve some simple but project/domain-specific task.
[I'm one of the Microsoft Research people who worked on this]
We did consider adding StackOverflow questions. Some of our queries in the CodeSearchNet challenge do actually come from StackOverflow (via StaQC [1]). It's certainly interesting to see how all other SO data can be useful for this task. Thanks for the suggestion!
The reason we didn't try this at this point:
Many people in research have tried working with SO data. In my experience, I have observed an interesting problem with the data: it's deduplicated! This is great for users but bad for machine learning, since the data looks "sparse" (roughly, each concept appears only once). Sparsity is an obstacle because it's hard for most existing machine learning methods to generalize from sparse data. In contrast, natural language corpora usually contain multiple articles describing more or less the same event.
Moritz Hardt's work is really awesome. Hardt is a co-author of two of the newer approaches to re-using test data, which I think is going to be a very big thing in machine learning and data science.
[1] https://arxiv.org/abs/1806.04616