[I'm one of the Microsoft Research people who worked on this]
Thanks for your questions! We considered many heuristics, but we didn't want to constrain the dataset release with some heuristic we picked, possibly ruining the dataset. Participants in the challenge should feel free to apply additional filters as they see fit; for example, this [1] work could be useful as a filtering method.
Unfortunately, we do not have the budget to provide any compute resources to help with running the models at this time. Note that any techniques developed with this dataset will be owned by those who develop them, and it is up to them how (or whether) they make them available or open-source.
[I'm one of the Microsoft Research people who worked on this]
That's certainly true for simple use cases. Our goal here is to eventually also capture the long-tail of queries about a codebase. Often, within the domain of a project there is a set of natural language terms/jargon that describe complex tasks specific to the domain. Imagine for example a developer joining a mid-sized project and trying to find how to achieve some simple but project/domain-specific task.
[I'm one of the Microsoft Research people who worked on this]
We did consider adding StackOverflow questions. Some of our queries in the CodeSearchNet challenge do actually come from StackOverflow (via StaQC [1]). It's certainly interesting to see how all other SO data can be useful for this task. Thanks for the suggestion!
The reason we didn't try this at this point:
Many people in research have tried working with SO data. In my experience, I have observed an interesting problem with the data: it's deduplicated! This is great for users but bad for machine learning, since the data looks "sparse" (roughly, each concept appears only once). Sparsity is an obstacle because it's hard for most existing machine learning methods to generalize from sparse data. In contrast, natural language corpora usually contain multiple articles describing more or less the same event.
Moritz Hardt's work is really awesome. Hardt is a co-author of two of the newer approaches to re-using test data, which I think is going to be a very big thing in machine learning and data science.
[1] https://arxiv.org/abs/1806.04616