Hacker News | mallamanis's comments

[I'm one of the Microsoft Research people who worked on this]

Thanks for your questions! We considered many heuristics, but we didn't want to tie the dataset release to a heuristic of our own choosing, which could have ruined the dataset. Participants in the challenge should feel free to apply additional filters as they see fit. For example, this work [1] could be useful as a filtering method.

Unfortunately, we do not have the budget to provide any compute resources to help with running the models at this time. Note that any techniques developed with this dataset will be owned by those who develop them and it's up to them how/if they will make them available/open-source.

[1] https://arxiv.org/abs/1806.04616
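
As an illustration of the kind of additional filter a participant might apply, here is a minimal sketch. The record layout (dicts with `docstring` and `code` keys), the function names, and the thresholds are all assumptions made for this example; this is not the method from [1] and not something shipped with the dataset.

```python
from typing import Dict, Iterable, List


def keep_pair(pair: Dict[str, str], min_doc_words: int = 3, min_code_lines: int = 3) -> bool:
    """Hypothetical heuristic: keep a pair only if both sides are non-trivial."""
    doc_words = pair["docstring"].split()
    # Count only non-blank lines of code.
    code_lines = [line for line in pair["code"].splitlines() if line.strip()]
    return len(doc_words) >= min_doc_words and len(code_lines) >= min_code_lines


def filter_pairs(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the heuristic to every (docstring, code) pair."""
    return [p for p in pairs if keep_pair(p)]


# Made-up examples, not taken from the dataset.
examples = [
    {"docstring": "TODO", "code": "def f():\n    pass\n"},
    {
        "docstring": "Return the sum of two numbers.",
        "code": "def add(a, b):\n    total = a + b\n    return total\n",
    },
]
kept = filter_pairs(examples)
```

Any such threshold is a judgment call, which is why a filter like this is better applied by each participant than baked into the release.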


[I'm one of the Microsoft Research people who worked on this]

(thanks Nick. Here are the links)

Generative Code Modeling with Graphs: https://arxiv.org/abs/1805.08490

Learning to Represent Programs with Graphs: https://arxiv.org/abs/1711.00740


[I'm one of the Microsoft Research people who worked on this]

That's certainly true for simple use cases. Our goal here is to eventually also capture the long tail of queries about a codebase. Often, within the domain of a project, there is a set of natural language terms/jargon that describe complex tasks specific to that domain. Imagine, for example, a developer joining a mid-sized project and trying to find out how to achieve some simple but project/domain-specific task.


[I'm one of the Microsoft Research people who worked on this]

We did consider adding StackOverflow questions. Some of our queries in the CodeSearchNet challenge do actually come from StackOverflow (via StaQC [1]). It's certainly interesting to see how all other SO data can be useful for this task. Thanks for the suggestion!

The reason we didn't try this at this point:

Many researchers have tried working with SO data. In my experience, the data has an interesting problem: it's deduplicated! This is great for users but bad for machine learning, since the data looks "sparse" (roughly, each concept appears once). Sparsity is an obstacle because most existing machine learning methods struggle to generalize from sparse data. In contrast, natural language corpora often contain (e.g.) multiple articles describing more or less the same event.

[1] https://ml4code.github.io/publications/yao2018staqc/
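
To make the sparsity point concrete, here is a toy sketch. The concept labels and texts are made up for illustration and are not drawn from StackOverflow or any real dataset. It simply counts how many examples each concept has with and without deduplication.

```python
from collections import Counter

# Hypothetical (concept, text) pairs; several texts describe the same concept.
raw = [
    ("read_file", "how do I read a file line by line"),
    ("read_file", "reading a text file one line at a time"),
    ("read_file", "iterate over the lines of a file"),
    ("sort_dict", "sort a dictionary by value"),
    ("sort_dict", "order dict entries by their values"),
]


def examples_per_concept(pairs):
    """Count how many texts exist for each concept."""
    return Counter(concept for concept, _ in pairs)


# Deduplicate: keep only the first text seen for each concept.
seen = set()
deduplicated = []
for concept, text in raw:
    if concept not in seen:
        seen.add(concept)
        deduplicated.append((concept, text))

raw_counts = examples_per_concept(raw)
dedup_counts = examples_per_concept(deduplicated)
```

After deduplication every concept has exactly one example, so a model never sees two phrasings of the same idea.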


Using machine learning like this https://arxiv.org/pdf/1506.05869.pdf would be fun for this and avoid(?) canned responses...


An upcoming NIPS paper seems to be a follow-up:

"Hidden Technical Debt in Machine Learning Systems" D Sculley*, Google Research; Gary Holt; Daniel Golovin, Google, Inc.; Eugene Davydov, Google, Inc.; Todd Phillips, Google, Inc.; Dietmar Ebner; Vinay Chaudhary, Google, Inc.; Michael Young, Google, Inc.; Jean-Francois Crespo, Google, Inc.; Dan Dennison, Google, Inc.

(see https://nips.cc/Conferences/2015/AcceptedPapers )

Unfortunately, I haven't been able to find the paper yet.


That's because NIPS 2015 will be held in December. Final papers are due Oct 30th: https://nips.cc/Conferences/2015/Dates



Moritz Hardt's work is really awesome. Hardt is a co-author of two of the newer approaches to reusing test data, which I think is going to be a very big thing in machine learning and data science.


http://research.microsoft.com/en-us/um/people/marron/selectp... has some interesting technical information about time-travel debugging.


I use https://github.com/magicmonty/bash-git-prompt which I also like. It seems to present less information than this one, though.


I use this one as well and it's great, mostly for that "did I remember to commit and push before I go home" check.


Unfortunately, it lacks feedback for conflicts.


There is this https://code.google.com/p/gittorrent/ although I've never tried it...

edit: a more recent (?) link https://github.com/cjb/gittorrent


I think they're two different projects, but this is exactly what I thought of. A separate Chinese DDoS was even mentioned by cjb in his announcement of GitTorrent a few months ago: http://blog.printf.net/articles/2015/05/29/announcing-gittor...


It would be really nice to get it to succeed. Decentralisation is almost always preferable.

