For an engineering community manager, that was an awfully rant-y way to make the point: "It's frustrating that researchers don't publish their code."
Not to say I disagree with the frustration... but it's also not something new. It's been this way for decades. I'd much rather hear about who is doing work in this space and what they're working on. Here are the ones I know of:
1. The Center for Open Science (https://cos.io) is one such org trying to fix this with the Open Science Framework [1].
2. GitHub also recognizes the need for citable code and gives special discounts to research groups; in fact, Mozilla is one of the organizations they work with [2].
Two smaller related startups are:
3. Datazar (https://datazar.com) - A way to freely distribute scientific data.
4. Liquid (https://getliquid.io) - A scientific data management platform. Somewhat like "Excel for scientific data as a Service".
I like his rant. But at the same time, as an ex-CS grad student now doing biology, I wish my field was 1/10th as rigorous, tidy and beautiful as computer science.
Code is a joke, most data is processed using "pipelines", which in reality means some irreproducible mess. People don't generally do research trying to understand how cells or tissues work, they generally write papers about "stories" they found. Only a small minority are trying to do some serious modeling using serious math.
> Code is a joke, most data is processed using "pipelines", which in reality means some irreproducible mess.
You're not wrong, and it's not limited to bioinformatics; Reinhart-Rogoff's findings were reversed when an additional 5 rows were included in a spreadsheet they used to calculate their correlation between GDP growth and debt ratios. And of course, they insist that despite the actual outcome being twice as strong and in the opposite direction, they still support their original position.
I wonder if one can get a CS PhD by producing enough retractions. Of course, it won't win you many friends in the academy, and would probably lead to less source code being made available. But given the Perl code I've seen published whose termination condition is a divide-by-zero exception, one can argue that peer review in the information age has to include code review.
Didn't know about the Reinhart-Rogoff controversy [0], interesting! They state that they have been careful not to claim that high debt causes slow growth, but rather that it has an “association” with slow growth.
I read about this quite a bit; the refutation attacks a small portion of the data, because that small portion is trendy and hot in politics.
Judging academically, the original paper and the refuting paper are a healthy debate, but the dynamics of society and politics abuse them to attack a whole school of thought at large (the Austrian school: fewer bailouts, less intervention by government, less control over everything) in favor of the Keynesian school (more bailouts, more government spending, more public debt, especially in recession and crisis).
Anyway, it remains a controversy, because theoretically one can do what one wants, but once it involves policy and real-life matters, it is hard to argue for what method is right and what is wrong, in the presence of so many (ready-to-be-angry) interest groups.
Excuse me? That single attacked paper was the intellectual blanket for an unprecedented victory march of the Austrian school after the financial crisis and the recession.
I agree that from a purely academic point of view this is nothing big to worry about, but this paper played a completely outsized role. And the authors stood by and let things run their course, without any attempt to rein in or moderate the debate.
Fair enough, but I'd add that academic life is sad: one has to pursue one's endeavor at one's own cost. However, politicians and the public want too much from us researchers. So sometimes we do believe that our sweated-over formulas have real-life impact or, to get fancy, save the world.
>>People don't generally do research trying to understand how cells or tissues work, they generally write papers about "stories" they found. Only a small minority are trying to do some serious modeling using serious math.
Speaking as a layman, that seems like a very strong claim for what one hopes is a hard science supposedly applying the very best practices of the scientific method (i.e. falsifiable theories vs. anecdotal stories).
Is this meant to be hyperbole to get your point across, or something that is generally known throughout the bio[med|tech] industry? As a sister comment pointed out, the latter scenario would be quite alarming.
EDIT: I'm aware of the growing sentiment within the scientific community to reconsider using p-values, which John P.A. Ioannidis and his body of work [1] helped to raise awareness of. Was this the "story"-like theme that you're referring to in cell and tissue papers?
His claim is hyperbolic, but only from the perspective of the way that scientific research has been conducted for the last 10-15 years. Prior to this, many researchers were forced to tell "a story" because the sheer volume of data and complexity in fields like cellular biology was/is so massive. This does not mean that their findings were in any way intentionally deceptive, it's simply that it was impossible to cover all the ground necessary to tell the whole story.
However, as data analysis techniques have become more powerful and data more easy to produce thanks to advances in the scientific methodologies we use to access information about the matter we are studying, the datasets are getting bigger but also easier to manage.
Biology is driven by genetics, and it's what grounds the field. I think that, perhaps, your experience is more in bioinformatics, for which I share some of your skepticism. Some good work is being done there, but usually that has a strong genetic component.
We're at an interesting point where many of the useful mutant screens have been done to saturation, at least for the most common species. Expanding our genomic resources to screen more species is certainly helping. The way forward is just the same as it ever was; good mutant phenotypes with strong genetics.
I don't think 'serious math' is at all useful for most biological investigation. Modelling is only a means for generating hypotheses, and in my experience it's a terribly weak tool. A poor substitute for proper genetics.
I mean really, the Churchill quote about democracy applies here as well: the current way of doing research is the worst possible way of doing it, except for all the other ways we've tried before.
VUSes (variants of uncertain significance) that actually could be clinically significant are hoarded, and conversely, some companies claim to know that certain VUSes are not clinically significant, based on their own proprietary data that no one else sees or double-checks. Meanwhile, the actual data comes from patients, who are left in the dark.
That's just cruel. Comparatively, at least CS gets math formulations
I think the reason academics don't publish their code is that they don't want to get stuck into an endless discussion on the code along with people asking for "support" on it.
What I have found is that if you need the code for a specific data analysis, send an email to the authors of the paper. More often than not, you will be surprised: they do share code with fellow researchers.
So - if you do really want code, write the email. Be forewarned though - the quality is often bad from a s/w engineering perspective and suits a very specific purpose. It will also come in a language of the author's choice (Tcl/Tk for ns2 scripts, MATLAB scripts, Octave, etc.)
I've published code for most of my papers, and rarely, if ever, received support requests. When I have, it's usually been from someone else trying to do work with it, who asks good, intelligent questions (i.e., the requests you want to receive).
Maybe that just means I don't write very interesting papers.
How is this different from any other academic research? What he is asking about is neither openness nor reproducibility (which are, indeed, very important). He is asking that researchers produce code that he can put into production. Not only do they have negative incentives to do so (for one, providing such code will surely result in a stream of all kinds of support requests), it would actually work against the reproducibility objective.
The purpose of the code written is usually very simple: to produce results of the paper, not to provide a tool other people can use out of the box. Even when such a tool is nominally provided (for example, when a statistics paper is accompanied by an R package), there are good reasons to be very careful with it: for example, the paper may include assumptions on valid range of inputs, and using the package without actually reading the paper first would lead to absurd results -- which is something that has happened. The way to use academic research results is to (1) read and understand the paper, (2) reproduce the code -- ideally, from scratch, so that his results are (hopefully) unaffected by authors' bugs, (3) verify on a test problem, and (4) apply to his data. Using an out of the box routine skips steps 1-3, which are the whole point of reproducibility.
> He is asking that researchers produce code that he can put into production.
That's not how I read his comments at all.
What he seems to be asking for is the ability to take the code you used to produce the pretty graphs and tables in your paper and re-run it, maybe tweak it himself and use it on a slightly different dataset. He wants to be able to see that your results extend to more than just the toy synthetic dataset you made up, and also be able to verify that some bug in your verification code didn't make the results seem more successful. Finally, he wants to be able to compare apples-to-apples by knowing all the details of your procedure that you didn't bother putting into the paper.
> What he seems to be asking for is the ability to take the code you used to produce the pretty graphs and tables in your paper and re-run it,
You're assuming such code exists. If the graphs were produced by hand (e.g., typing things into MATLAB to create the plot and then saving the figure), then there is no code to hand off. Now the code request has risen to "redo all that work".
And, as an academic myself, I think we should force people to save and publish their MATLAB scripts.
That is not too much to ask, but the academic system is full of perverse incentives. Doing good, robust work loses to good-looking, quick-and-dirty work all the time.
We need funding bodies to require publication of all these details, and we need to structurally combat publish-or-perish. Hire people based on their three best results, not on h-index or other statistical measures.
Some (many?) use MATLAB interactively to produce figures and then save them by hand. Often this involves a mixture of typing commands and clicking on GUI elements like "add colorbar". So a m-file that produces the figure doesn't exist; at best there would be fragments of code buried in MATLAB's command history.
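For contrast, here's a minimal sketch of what a saved, re-runnable figure script could look like (Python/matplotlib rather than MATLAB, with placeholder data, but a .m file would read much the same) - the point being that every colorbar and label lives in the script rather than in a click history:

    # sketch: the kind of figure script that *could* be saved and re-run,
    # instead of clicking "add colorbar" in the GUI
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)      # fixed seed so the figure is reproducible
    data = rng.normal(size=(50, 50))     # placeholder for the real dataset

    fig, ax = plt.subplots()
    im = ax.imshow(data, cmap="viridis")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.colorbar(im, ax=ax)              # the "add colorbar" click, as code
    fig.savefig("figure1.png", dpi=300)  # the "save figure" click, as code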
> reproduce the code -- ideally, from scratch, so that his results are (hopefully) unaffected by authors' bugs
This rests on a common false assumption that programmers make: they think it's easier to write bug-free code when starting from scratch. The reality is that it's almost always easier to start with something that's nearly working and find and fix the bugs.
What really happens when you do a clean room reproduction is that you end up with two buggy programs that have non-overlapping sets of bugs, and you spend most of the effort trying to figure out why they don't match up. It's a dumb way to write software when it can be otherwise avoided.
I wonder though, maybe non-overlapping sets of bugs are actually better for science? That is, it could avoid systematic errors. Of course, one bug free implementation is clearly better!
True, but this is research, not business. Getting 2 programs that don't agree, even when your post-doc has cleaned up all the bugs, is the point of reproducing the research. Ideally, you want to know if you both ran it the same way so you know it's 'true'.
True, but how would one discover those bugs in the first place? In commercial software a client might run into them and report them. But for science papers? Algorithms that go into production might work the same way, but analysis papers such as the OP is talking about don't.
Worse, with the attitude that the OP has, do you really think they will take extra time to verify the entire code base or look for bugs?
Often the best way to find errors in these sorts of analyses is to do a clean room implementation then figure out why the two programs don't produce the same results.
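A rough sketch of that differential-testing loop, with two toy functions standing in for the published code and the clean-room rewrite (Python; both implementations here are hypothetical):

    # differential testing: run both implementations on the same inputs and
    # flag every case where they disagree, then investigate each mismatch
    import random

    def original_impl(xs):       # hypothetical: the published analysis
        return sum(xs) / len(xs)

    def cleanroom_impl(xs):      # hypothetical: your from-scratch rewrite
        total = 0.0
        for x in xs:
            total += x
        return total / len(xs)

    random.seed(0)
    for trial in range(1000):
        xs = [random.gauss(0, 1) for _ in range(random.randint(1, 100))]
        a, b = original_impl(xs), cleanroom_impl(xs)
        if abs(a - b) > 1e-9:
            print(f"mismatch on trial {trial}: {a} vs {b}")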
But you're not writing software for production, you're writing software for understanding/advancing research. Every thing that doesn't match up is either a mistake on your end, or an undocumented (and possibly dubious) assumption on their end, and it's really valuable to find out either way.
Reimplementation matters hugely (in ML at least). But that doesn't mean having the original implementation available isn't a huge advantage; obviously it is.
I think there are exceptions to this: most modern attempts to verify software have involved design decisions that are only practical when you're starting from scratch. Similarly, rewriting in a language with more static safety guarantees may lead to fewer bugs (or at least different, less critical ones).
It can also be used to flush out unconscious biases. I did this at work and it helped find several issues that might have become bugs in the future.
> What he is asking about is neither openness nor reproducibility (which are, indeed, very important). He is asking that researchers produce code that he can put into production.
That is 100% not what he's asking. I don't know how you could even interpret that as what he's asking.
He wants to be able to take your research and run it over an updating dataset to verify that the conclusions of said research actually still apply to that data.
> "How is this different from any other academic research?"
Well it's not, although CS should be particularly amenable to reproducibility.
> "He is asking that researchers produce code that he can put into production"
No, he asked for code [full stop]. "Because CS researchers don't publish code or data. They publish LaTeX-templated Word docs as paywalled PDFs."
> "The way to use academic research results is to (1) read and understand the paper, (2) reproduce the code -- ideally, from scratch, so that his results are (hopefully) unaffected by authors' bugs, (3) verify on a test problem, and (4) apply to his data."
He wants to re-run the author's analysis with new data. He's not looking to recreate the research from scratch or publish a new paper. Saying that this is the only valid usage of the results is awfully short-sighted. It misses the point that the research has value beyond its use by other researchers.
Imagine if rather than open source software, we published the results of our new modules and told potential collaborators to build it from scratch to verify the implementation first. You'd learn a lot about building that piece of software, but you've missed an enormous opportunity along the way.
"He wants to re-run the authors analysis with new data. He's not looking to recreate the research from scratch or publish a new paper. Saying that this is the only valid usage of the results is awfully short sighted. It misses the point that the research has use beyond usage by other researchers."
That sort of thing has been done for years in the scientific computing world. The end result is that you are making decisions based on code that may have worked once (Definition of 'academic code': it works on the three problems in the author's dissertation. There's production code, beta code, alpha code, proofs of concept, and academic code.) but that you have no reason to trust on other inputs and no reason to believe correct.
Case in point: I had lunch a while back with someone whose job was to run and report the results of a launch vehicle simulation. Ya know, rocket science. It required at least one dedicated human since getting the parameters close to right was an exercise in intuition. Apparently someone at Georgia Tech wanted to compare results so they got a copy. The results, however, turned out to be completely different because they had inadvertently given the researchers a newer version of the code.
Two points to consider here. First, it's not fair to criticize the whole field of academic CS research. Not everyone's work can be accurately represented by a github repo. Second, even when it can, expecting the "Run author's [whatever] against an up-to-date dataset." step to work is asking quite a lot. Typically there are infrastructural assumptions baked in (file paths, legacy code dependencies, etc) and manual steps to get to plotted results. In an ideal world with enough resources, every lab would have technical staff to help with this process, but most researchers unfortunately don't have bandwidth to spend on this problem.
If your paper has results, it has code of some sort which can be put into a github repo.
That's the bare minimum. If you don't know how to make code agnostic to file paths or dependencies, that's too bad, but fortunately a field practitioner picking your code up will know how to work around those issues. At least they're not starting from scratch on trying to rewrite your code.
> In an ideal world with enough resources, every lab would have technical staff to help with this process, but most researchers unfortunately don't have bandwidth to spend on this problem.
And they shoot themselves in the foot constantly by not prioritizing non-crappy software.
They think it's normal that a new student needs months of close support before they can start doing anything interesting. They think it's normal to spend weeks trying to get bitrotted code running again, or to just give up and throw it out, losing months or years of hard-won incremental improvements.
I'm not even talking about supporting reproducibility by outsiders -- they can't even achieve reproducibility within their own labs, because they don't follow practices that are baseline in industry (version control, configuration management, standardized environments, etc).
> I'm not even talking about supporting reproducibility by outsiders -- they can't even achieve reproducibility within their own labs, because they don't follow practices that are baseline in industry (version control, configuration management, standardized environments, etc).
True that. Four years ago, when I was writing my thesis in computational physics, I attended a research group meetup. One session gathered all the students and had them showcase their research. When there was still some time left at the end, I asked the audience who was using version control systems for the programs they write, and only 5% or so raised their hands. I then immediately ran them through a Git tutorial, and people were amazed by what is possible.
Indeed, this is all too common and such a waste. I imagine it would help to have a manual of "modern" best practices for research--maybe this exists? A lot of non-CS researchers could benefit as well.
A lot of researchers do make their results available as code, and a whole lot of those who don't are actually publishing lies -- as many people in academia, from grad students to professors, have discovered after implementing research papers and observing that the algorithms don't really work.
File paths and manual steps can be worked around; nobody said it should build and run painlessly forever after publication, on any future system, with no tweaking, and with the researcher maintaining the program indefinitely. If, however, nothing remotely close to a working state can be published, that's not a great sign.
Yes, many do release code, and it is a great thing! But not everyone's research is represented by code (even within CS). Sometimes research also depends on data that cannot be shared publicly, obscure user interfaces, commercial libraries, yet-to-be-published work, etc. In those cases, pushing it out without the dependencies may also give the impression of poor research, because it appears to be broken.
On this topic I'd like to mention the machine learning papers site GitXiv, which is wonderful. It publishes papers alongside the Github repo containing the code.
Thanks! That's awesome. I have long thought that a platform for scientific publication where code and data are included would be valuable, and I'm glad to learn it exists. Going further, imagine the code is like a container that anyone can run to reproduce the findings, including statistical analysis and summary on raw data sets -- the key findings should "build". Perhaps the platform also provides continuous build during "development" (research) so that researchers can work privately and then publish their "repo" publicly along with their paper. An easy way to clone and reproduce the build after publication: "fork my research"
As an extreme version of the idea, imagine if the actual paper itself (TeX) and all the data within it are also built as part of the repository; any graphs in the paper are rendered from data in the repo, any numbers are data accesses, etc. This probably wouldn't be helpful to researchers, but it would promote scientific reproducibility and aid everyone building on a researcher's work. Tremendous work goes into authoring the papers themselves, sometimes with methods or tricks that are private; laying it all out publicly would greatly help students of science.
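To make this concrete, here's a toy sketch of such a "build" step (Python; the data, numbers, and names like \meanlatency are made up): raw data goes in, and both a figure and a TeX fragment that the paper could \input{} come out, so no number in the PDF is ever typed by hand.

    # toy "build" step: raw data in, figure + TeX fragment out
    import numpy as np
    import matplotlib.pyplot as plt

    raw = np.random.default_rng(7).exponential(scale=2.0, size=1000)  # placeholder raw data

    mean, ci = raw.mean(), 1.96 * raw.std() / np.sqrt(raw.size)

    # every number cited in the paper is written by the build, not typed by hand
    with open("results_summary.tex", "w") as f:
        f.write(f"\\newcommand{{\\meanlatency}}{{{mean:.2f} $\\pm$ {ci:.2f}}}\n")

    fig, ax = plt.subplots()
    ax.hist(raw, bins=40)
    ax.set_xlabel("latency (s)")
    ax.set_ylabel("count")
    fig.savefig("figure_latency.pdf")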
Going even further: to avoid cherry picking of positive results, review boards expect experimental criteria to be published (at least privately to them) in advance, for research that involves capital-E experiments. Perhaps this includes analysis code at least in prototype form; like test driven development, the acceptance criteria are written first. When the paper is ready for review, the reviewers can compare the initial prototype analysis logic to the final form. Perhaps the board also expects all data and trials collected during experiments to be made available in the repository, whether positive or not. All collected data should be in the platform, in the most raw form it was originally recorded, as well as all steps of summary and analysis.
I wonder if a process and platform like this could contribute to the integrity and quality and reproducibility of scientific research. People funding research ought to ask for it, especially public funded research, and the whole repo is made open eventually if not initially.
Perhaps as part of the platform's value prop to researchers (on whom it is imposing probably more burdens than benefit, for sake of public benefit), the hosting is free and funded by a foundation, or steeply discounted. (OK, it won't pay for LHC scale data sets, but otherwise ...) So using it to host your data, code, and paper is free, at least up to a point. I would be interested to contribute time and resources toward building or supporting a platform like this.
I don't think research should be as structured as software development. In CS, many of the most interesting papers come about when the authors discover something unexpected or non-intuitive and choose to explore down that thread. That's why it's research -- sometimes you can't know what you will find until you're there.
> I don't think research should be as structured as software development
In many ways, it already is: good research requires meticulous log keeping in order to reproduce results, and equal effort must be spent on maintaining references to other literature, or you risk missing a citation in a published paper.
You may be interested in looking into "An Effective Git And Org-Mode Based Workflow For Reproducible Research", which describes more or less your idea (https://hal.inria.fr/hal-01112795/document).
I took a graduate course on software engineering just a month ago and read many of those papers that use Mozilla's data. It's a very popular dataset in the field since it is both open and large. I'm sure Mike Hoye meant to criticize a part of that field, not academic CS research as a whole.
My impression of the field was that there was a severe mismatch of skillsets. The set of people with the scientific background to carry out proper experiments, and the funding to do so, is largely disjoint from the set of people who understand the field. That made a lot of the papers feel "off". Almost like reading text generated by a machine: individual sentences make perfect sense, the whole doesn't seem to go in a relevant direction.
As someone who's done a fair bit of practical software engineering, seeing academics study software engineers feels like seeing a WW2 veteran trying to understand how youngsters use snapchat. It feels very awkward for the youngster, just as it does for the software engineer. Which I imagine is one reason why Mike is pissed off.
There is some irony that businesses are much more scientific in this particular subfield than academia, because business incentives require the results to be reproducible and meaningful, over a longer period of time.
The state of machine learning research these days seems pretty good. Essentially all research is published on ArXiv and there is a lot of code released too (though there could certainly be more).
I think openness has been a big contributor to the recent explosion in popularity and success of machine learning. When talking to academics about this, machine learning would be a great field to hold up as an example.
I'd say the opposite, as a member of a group at my university that reviews ML papers. First off right now there seems to be a drive to explain many phenomena in ML in particular why neural networks are good at what they do. A large body of them reaches a point of basically "they are good at modeling functions that they are good at modeling". The other type of paper that you see is researchers drinking the group theory kool-aid and trying to explain everything through that. At one point we got 4 papers from 4 different groups that tried to do exactly that. All of them are flawed, either in their mathematics or assumptions (that will most likely never be true, like assumptions of linearity and your data set being on a manifold). Actually, speaking of math, many papers try to use very high level mathematics (functional analysis with homotopy theory) to essentially hide their errors, as nobody bothers to verify it.
>First off right now there seems to be a drive to explain many phenomena in ML in particular why neural networks are good at what they do. A large body of them reaches a point of basically "they are good at modeling functions that they are good at modeling".
Since this is closely related to my current research, yes, ML research is kind of crappy at this right now, and can scarcely even be considered to be trying to actually explain why certain methods work. Every ML paper or thesis I read nowadays just seems to discard any notion of doing good theory in favor of beefing up their empirical evaluation section and throwing deep convnets at everything.
I'd drone on more, but that would be telling you what's in my research, and it's not done yet!
> All of them are flawed, either in their mathematics or assumptions (that will most likely never be true, like assumptions of linearity and your data set being on a manifold).
Although I can see global linearity being unlikely in most cases, why is local linearity unlikely?
For a computer scientist, reproducibility means more work that they aren't paid to do. If I ask the Mozilla team to implement new feature X, the response will be either (a) point me to a donation link, or (b) We're open source, so why don't you implement the feature yourself? The computer scientist's response is the same.
Reproducibility for the computer scientist means including any code written and data collected or relied on in the scientific publication itself. In practice, getting there from here isn't literally zero work, since some actual human action is needed to bundle the code and data, but that effort ought to be negligible overall, especially if we make it a standard part of the scientific process.
Trust me, it's currently far from zero work to submit code with a research paper. I was recently the corresponding author on a software paper sent to a journal that at least verifies the code compiles and runs and produces expected output. Since the poor person testing the submitted software is permanently in the ninth circle of dependency hell, across all platforms and libraries imaginable, it took about fifteen emails back and forth plus an OS reinstall before everything checked out. And they said that wasn't anything extraordinary.
How about a platform centered around Linux containers (or maybe one of several OS containers or VM images), as the repository image?
I'm not saying the work is zero now, but maybe we can get there. If a researcher is developing on a platform where their repository is expressed as a container-like image, then they should be able to publish it for anyone to run exactly as-is. The container repo includes the data, the operating system, and any languages and libraries, with an init system that optionally builds the results.
Yes, I think we need to go in this direction. The problem is that the container system is yet another tool for researchers to learn. The first step is to get everyone using VCS and nightly testing. Many are still at the point of clumsily written, old Fortran code that gets emailed around and exists in N different variants. (Not that there is anything wrong with Fortran.) Many are at the point where if you email them a link to a git repo to clone, they're clueless about what to do.
It would help if Git didn't have such an awful learning curve (and I say this as a git user that already went through it).
I know researchers that used Subversion when it was on the rise, but they just abandoned version control altogether when Git became the generally preferred option.
There's a difference though between "just published" and "reviewed, verified and published". We'd gain a lot if anyone simply included the code they were using with the papers. It doesn't matter if it runs and how generic it is. If it's important, it can be fixed by the next user. If it's not, it didn't matter in the first place if it's verified.
Of course verification on submission is also a great idea, but we can make it the next step.
This is exactly the reason. Academia is a business just like any other. Academics do what they are incentivized to do. I guarantee that if NSF, DOE, NIH and the other US government funding agencies made publishing your code a requirement for funding, and actually checked that you did, you would see everyone start doing it immediately. Right now NSF requires a "data management plan" as part of the proposal, which includes information about sharing, reusing and archiving data and code, but nobody ever checks that you followed the plan after the money is awarded.
I don't think Mozilla Corp takes donations for software development. The Mozilla Foundation does take donations, but it doesn't develop software. Rather, the foundation works on outreach (those cute videos you sometimes see on the Firefox start page).
Mozilla is a bit of a weird entity, half non-profit, half very much corporate, yet all code is free.
Good point. Research software is always far from production quality and researchers don't get into research to write production software. To publish software, it has to build and run on all reasonable platforms, and users should be able to read, understand and modify the code. All this effort distracts the researcher from doing research. (Unless we redefine research as working on software.)
Nobody is saying your research has to have the quality of industry open source projects. You're just coming up with excuses to continue to hide your code.
Merely publishing it at all would be helpful. That way other people can pick it up and modify it to work for them. That's still better than them flying blind and having to completely rewrite your code from scratch.
It's really infuriating to see researchers creating this false dichotomy between publishing production-quality software and publishing any code at all.
Yes, exactly. All I'm asking for is a literal code and data dump. Take a snapshot of the operating environment and publish it. Ideally as a virtual machine or container image, so people don't have to fiddle with dependencies and versions; but if it's just a file system dump of the code that's better than nothing.
The folks who are asking for this are NOT asking for: (1) production quality code (2) portable code (3) an open source project based on the work (one way code dump is fine) (4) well commented high quality code (5) any support using the code beyond which would normally be afforded fellow researchers in reproducing one's results.
We (me and anyone else who is asking) understand that there may be some circumstances in which not all data can be published due to privacy, or in which code depends on proprietary dependencies that can't be shared. That's fine. Document it if you can as a disclaimer, and people can work on getting access to those private resources if they need to.
Anyone funding research should expect this; and publicly funded research should be required to disclose all technical work product that's materially involved in published results.
That's all great. And when I publish my code and it doesn't work on your machine as it does on mine, what do we do next? When you're saying "should", you should say who's going to pay for this and at whose expense (in money but also in time).
If the virtual machine image you prepared works on your machine and not on mine, I'll troubleshoot it and submit a bug to the VM project. It's pretty unlikely, though - that's why virtual machines and more recently operating system containers are so useful. They are portable. You can even emulate an x86 virtual machine in your browser in JavaScript and boot Linux within it: http://jslinux.org/ - cool, huh? With mature virtualization technology it's possible to take a snapshot of an operating environment + data + program as a portable image, then someone else can boot it into a virtual machine and run it. VMs are mature and run a large portion of the software powering the Internet. Containers are fairly portable as well. See more of my other comments on this topic if you're interested in my ideas about technology specifics and how it would be funded (I'm not saying I have a full plan, just avoiding repeating myself).
> They publish LaTeX-templated Word docs as paywalled PDFs.
Somewhat tangential, but do CS academics actually write papers in Word? During my grad school days I did not encounter a single paper 'typeset' in Word. Writing was usually done with LaTeX and a makefile in a git repo.
Just adding another data point: When I was in CS grad school (late 90s), everyone used LaTeX (including the journals). Then I started hanging out with neuroscience people, and everyone (including the journals) used Word+Endnote, with basically ad-hoc treatment of the figures.
How does Word work out when multiple collaborators are writing at once? One of the cool things about version controlling LaTeX is that conflicts, merges, etc are dealt with using standard tools, which is super helpful when you have collaborators across the globe furiously writing and redoing figures a few days before a deadline.
Word is awkward for this. (At least the offline versions of Word are---I've never tried the online ones.) Word does have a "track changes" feature, which is invaluable when you have multiple authors, at least in the absence of 'traditional' source control tools. I don't think many neuroscientists have tried a LaTeX-git workflow and rejected it---I think the learning curves of LaTeX and git are steep enough that few have tried it. I myself prefer LaTeX, but I think you'll admit it also has its frustrations. And any time I suggested using it to my neuroscience coworkers, they looked at me like I had lobsters coming out of my ears.
Well when you deal with merge conflicts in Word, your standard tool is Word. I don't think you can get much more standard than Office when you consider collaborators outside academia. It's not great for simultaneous editing (although I think this is now possible in 365).
It is, however, very good for tracking changes over versions. Many academics are not familiar with git, diff and so on, and it's nice to easily see historical edits in the document. For simple documents like abstracts, it's much easier to send a Word document than it is to send a tex file and assume that everyone in the consortium is going to be able to compile it (especially if you work with industry).
It had better have improved since I used it a couple of years ago in Word 2013. The main problem was with citation managers - Word would give you a paragraph lock whenever you edited a paragraph, and it would only unlock that paragraph after a save (either auto or manual). Of course, citation managers have the habit of changing all the paragraphs when you insert a new citation that changes the numbering (i.e., [1] becomes [2], etc.).
With Office 365 it's not terrible, but it's worse than Google Docs or Quip IMO. I've run into trouble especially when some people edit from the web and some edit from the desktop.
There are several IEEE Transactions that don't like LaTeX, and even one that uses a submission web form that is "Best viewed in Netscape Navigator". I don't understand why IEEE don't mandate a common, sane submission system.
It depends on the subcommunity. I have seen LaTeX essentially everywhere in CS, but I recall some exceptions. I'm not sure where, but I think it was applied stuff, probably close to some application domain (e.g., bioinformatics).
I work in bioinformatics and I can confirm that most biologists feel very uncomfortable with the idea of having to touch LaTeX. As a result, in our collaboration project, I have to survive using Word+Endnote. It's a fairly painful experience.
There are a lot of bioinformaticians in my office and I also hear that Word is common in that world. One told me that he uses pandoc to convert Word documents from/to something more pleasant, and doesn't even tell his co-authors that he's not using Word.
In bioinformatics this works for me too, up until I get comments and tracked changes back from co-authors: getting Word comments/changes back into a Markdown file isn't worth the headache, so after that step I have to use Word too.
I've had some good experiences with Authorea, a paid online collaborative scientific writing tool. I think that one runs on top of Pandoc+git too, but it doesn't have full Word import support yet AFAIK
Edit: 0 mentions for Authorea in this entire discussion, I guess these guys need to do a bit more advertising :)
I've never used pandoc, but I was told that tracked changes were doable. I see that pandoc converts them from/to special markup which includes the author, etc. It sounds like receiving Word documents with tracked changes is not a problem, but you would need a special tool to turn a diff into a document marked up with your changes when you want to send it back as a Word doc (similar to latexdiff).
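For what it's worth, the conversion seems to be driven by pandoc's --track-changes option when reading .docx (accept, reject, or all); something like

    pandoc --track-changes=all reviewed.docx -o reviewed.md

should carry each insertion and deletion, with its author, through to the Markdown as marked-up spans -- though I'd double-check the behaviour against the pandoc version you have.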
As a recovering CS academic, I have to say - boo-fucking-hoo. Tell me exactly where the funds to do this extra level of support are supposed to come from? Is Mozilla paying for those grad student hours? No? Well then.
> Publishing your code (at all) does not mean you have to make it production-quality or provide support for it.
No. If you publish something that's incomplete or doesn't have all the right dependencies listed, etc, it's not really of any use. Writing up compiling instructions plus dependencies plus how to run it plus input files etc takes time and by the time you've got it to the state that someone else can run it, now it's "production-quality".
> Given that, it seems reasonable that you should make the modicum of effort to publish your code.
There's currently no incentive/requirement to publish code, it uses time and does not increment your publication counter. Find the incentive and you'll start seeing published code.
> I guess researchers really are so incompetent with software that they think code which requires debugging is useless.
So published papers are held to a high standard -- filtered through editors and peer review -- while publishing code can be half-assed at best? I still disagree; if it's worth doing, it's worth doing right.
Exactly. Even if it's just algorithms, I can fit them into another codebase. It's also a lot easier to rewrite code if you have something to base it on.
Well, tuition costs for even public universities have been rising a lot faster than inflation. Many universities these days are basically hedge funds with a school front. And a lot of universities get state funding.
We're pushing a generation of kids without rich uncles to bail them out (myself being one of them; I start university in fall) towards massive, crippling debt. The places they're going aren't hurting for money. Is it that unreasonable to ask for open access?
State funding has been falling dramatically. The actual researchers are hurting for money. Grad students are paid a pittance, so rather understandably, most grad students want to graduate ASAP. Making your code nice and publishing it does not help you graduate early. It also does not help you get grants or funding.
You could pressure universities / departments to make reproducibility a requirement for graduation, but I don't see why they would follow along, because this is not going to help the university get more funding or publish more papers (prestige).
Now, if organizations and companies were willing to put money behind software reproducibility (maybe some sort of fellowships or million-dollar research grants to labs), then the incentives would be aligned.
I would like to make it a simple requirement of receiving any public funding. No one is asking for the code to be polished - I would actually rather it was not modified whatsoever from the state in which the research was conducted. The goal is reproducibility and integrity, so I want to see the actual code that was used, not some polished subsequent version.
Every publication involving data and code or analysis should publish them to a degree that makes validation possible in at least as detailed a way as portrayed in "Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff" (http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...)
From my perspective, Mozilla shared raw data with researchers and was lucky enough to have some very smart people provide potentially extremely valuable interpretations of that data for free.
It might be inconvenient for Mozilla to have to reproduce results by themselves, but let's keep things in perspective here: being "extremely angry" seems almost ludicrous.
Reproducibility is a major issue across science in general, but the difference is there's no reason why one shouldn't be able to easily re-run a defined analysis on a more recently updated data set to ask if conclusions drawn previously still hold. I actually published a side-project paper on this (in the biological sciences) last year [1] - what was scary was that there was such a lack of discussion surrounding this idea, despite the fact that large databases of biological data are CONSTANTLY changing and updating.
The other difference is that as far as I know, computer science is the only discipline for which industry has solved the problem of reproducibility; it's one thing to be asked to design a method to run reproducible studies of humans, it's another to ask researchers to run `git remote add origin https://github.com/user/repo && git push --set-upstream origin master`. That's not asking for any support, or other effort on the researcher's part, and I frankly don't understand how the CS academic community doesn't have this as a standard when it'd be so easy to implement.
It is a real issue. The programming languages and software engineering communities have been working on evaluating software artifacts for a few years now:
@ open access: In my specific subfield of academic CS, a huge success story was ECOOP (one of the biggest conferences in the field) going open access last year. I'm hoping that there'll be other conferences following suit. My money is on open access being commonplace in 10-15 years.
You had me confused for a second, I didn't realise I was talking to jpolitz :) Thanks for bringing that up. AFAICT, he (advisor) has done a lot to make that happen, indeed.
(tl;dr of my linked blog post: Apply a slightly lighter weight form of some industry engineering practices in CS research coding. I think it's feasible. It doesn't solve all of the problems, because as discussed elsewhere in this thread, some of them are incentive-related and I'm not going to claim to have answers to everything. :)
(a) Convince more research groups to do their research on GitHub by default -- ideally, in open repositories. They get good hosted SCM, the world gets a better chance of seeing their code.
(b) Create more incentives, like the USENIX Community Award, for research that puts out its code. I'd say that in the systems community, a pretty decent chunk of the papers at SOSP, OSDI, and NSDI have code releases (of varying degrees of usability) accompanying them, though that's not a scientific count.
Mozilla could throw $1k to help create community-award-style incentives in the conferences they're interested in. Win-win. You get engaged with the community, you create some incentive for people to do the right thing, and you can use it as an on-ramp to deeper engagement with the winning authors (i.e., you can try to bring them in for internships. :).
What it will take is creating a new system of research and development that ignores the traditional academic system. Because this has been a problem for a while and they clearly are not hearing the message.
The reason academic research works is that it takes risks on potential failures, because it's only donated or grant money anyway. But academic institutions fetishize academic papers, which is the problem from the article.
We need to legitimize research outside of the academic institution. Pay people to do research on their own time, if they perform it in an open-source, reproducible fashion. Incentivize it based on the reproducibility factor, but avoid attaching a profit motive.
Look at what Xerox PARC and Bell Labs were able to do before the penny-pinching bean counters took over.
Otherwise, it just sounds like "we want all this risky work done, and we don't want to pay for it."
Maybe I've drunk the kool-aid, but saying "academic institutions fetishize academic papers" seems wrong to me. It's like saying that developers fetishize working applications. It's not a fetish, it's the whole point of the thing. The output of research is research papers. Yes, sharing raw data and code are both good things, and should be promoted. But no one is going to take the time to look at either unless the paper presents good evidence for some novel result.
Papers don't do anything. Papers aren't executable. Papers are largely only useful in regards to writing other papers. To make the whole point of the exercise to be something that is inert and self-referential is the fetishization part.
Papers teach people things. You can't really hand someone some working code and expect them to iterate on it to produce something novel and insightful, because it won't be written in a high-level-enough language to express the core concepts flexibly.
I don't see how that's related. There are loads of open source software projects that are maintained by only a single person precisely because they haven't bothered to properly document why it exists, what came before it, what important things it is demonstrating, or what needs to be done to improve it; the archetypical content of a research paper.
The end result was not papers. The point was products. The papers were a byproduct. For most academic institutions, the entire point is the paper, whether it's useful to anyone or not.
The useful end result was definitely the papers. The products offer a proof of concept (that it works) and a demonstration (how the details add up).
Then, in the future, others throw away the product (or observe it a bit) and make the next nice things (products for industry adoption, or further research), mostly based on the papers.
This is why I believe NixOS is so important. It allows one to completely freeze the entire development environment on the one hand, but also does so in a declarative, well-abstracted manner (vs. a VM image, say), so tweaking/porting is actually feasible.
Until we get to a point where building/installing/administering is not hours of bullshit, research (and free software) will suffer.
As an academic researcher I find it absolutely hilarious that you think the complex social problem of incentive structure and competition will be solved by some Unix OS.
If you are interested just take a look at the complexity of licensing/ownership of code written by a PhD student at a Research University in United States.
If you look at most of my Open Source code, I use AWS AMIs to share data as well as OS + code; however, I can do that only for side projects. The main thesis projects are typically very high value, and the consequences of sharing them are far more complex to understand.
No that is not what I think at all, see my follow-up comments below. I just think the combination of shitty tools + incentive structure is even more insurmountable. This is a tough problem that should be attacked from as many fronts as possible.
> The main thesis projects are typically very high value, and the consequences of sharing them are far more complex to understand.
Commercial value, the university is just more stringent on the licensing/ownership restrictions, or something else?
2. Future grant applications (a competing group not sharing code will have a better chance of winning the grant.)
3. Future of other students and collaborators in the group. If two PhD students write a paper, the junior student might wish to write extension papers without getting scooped.
And many more. Yet if a paper is important enough, independent researchers will often attempt replication; this nowadays routinely happens in Machine Learning and Vision due to the huge amount of interest. Also, in several cases replication is fundamentally impossible, e.g. consider a machine learning paper that uses proprietary data from a hospital attached to the university, etc.
I totally get that researchers' incentives are not aligned toward publishing it, so no need to explain that further. There are costs and downsides and probably not enough benefit to them. That's fine. Everyone works within their system of incentives.
If it's paid for by public dollars, then the code and data belong in the public domain eventually. I understand there are exceptions like hospital data affected by patient confidentiality - that's fine. However the code released by that researcher should be capable of reproducing their results with that data set plugged in (such as by someone else who has access to it).
As a taxpayer, my concern for publicly funded research is maximizing benefit to the public good. I understand your point about follow-on research, and I'm not saying that I'd expect the code and data to be made available immediately with publication, but that deserves to be the case some reasonable time afterward (like a year). I understand that researchers' incentives are not necessarily aligned toward making it public; I am saying that people who fund research (including taxpayers through the political process) should require and expect it. Keeping it private indefinitely is a degree of self-centeredness that does not strike an appropriate balance between benefit to the researcher and to the public in my opinion.
I never understood the meme about "public funding" translating into "public domain". Just because research is "publicly funded" does not mean that the "public" owns it or even has a right to ownership. Public education is publicly funded, but that does not mean the government can ask for every drawing drawn by a 9-year-old in a classroom to be in the public domain. :) In fact it's actually the opposite (https://en.wikipedia.org/wiki/Bayh%E2%80%93Dole_Act), given that universities can and do patent inventions from publicly funded research.
Further, funding arrangements themselves are very complex; a professor typically procures funding from the university, NSF, NIH, private companies, donors, etc. In such cases if NSF adopts a hard line approach that any research touching its dollars ought to release code under say GPL, it would make it impossible to collaborate. Finally, all requirements aside, one can always release intentionally poorly written code in the form of MATLAB .m and compiled MEX files. I have observed several such cases, where the code can demonstrate a concept but is intentionally crippled.
Finally, graduate students graduate, and they are paid for doing research, which means publishing and presenting papers at peer-reviewed conferences and journals. If what funding agencies really seek is ready-made software, they ought to fund/pay at the same level as software developers (as many companies do).
> Just because research is "publicly funded" does not mean that the "public" owns it or even has a right to ownership.
I didn't make the argument that the public owns it or has a right to ownership, though I suppose that some people might and so I can see why you would touch on that point.
I would describe my view as like this: public funding is subject to the political process, and voting by taxpayers (directly or indirectly through voting of politicians or their appointees). As a taxpayer, I prefer to make public domain publication a requirement of publicly funded research, and I think every taxpayer should too. I consider the goal of public funding of science to be the benefit of public good, and believe that public good will best be served by public domain publication of all data, code, and analysis methods. (Whew, there's a lot of "pub" and "public" in there!)
One might reach my position by working backwards from, "Why do we as taxpayers agree to fund science with government money?" It's certainly not to give researchers prestige or jobs! (Those may be necessary parts of achieving public good, but they're not the goal which is the public good, and if they're in tension with public good then the public good probably needs to win.)
I don't seek ready made software; not at all. I only seek adequate disclosure of data and analysis methods sufficient for others to easily verify it and build on it. See for example the attempt at replication in http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...
> In such cases if NSF adopts a hard line approach that any research touching its dollars ought to release code under say GPL, it would make it impossible to collaborate.
I will need to think more about this issue. I might be willing to accept the downside as a taxpayer. I'm not sure I understand it well enough what the friction would be to collaboration at the moment. If you're referring to the GPL specifically, then yes I agree that's probably the wrong license - public domain would be more appropriate.
I would be OK if this was simply an electronic log of the data as well as all machine commands that have been run on it - something that is recorded automatically by the operating environment. I am truly not looking for "working production code". But, those sequence of commands should be reproducible if someone "replays" them; a verifiable digital journal. Publishing an article that's difficult-to-reproduce feels like producing the least possible public good while still getting credit. Publishing an article that's fully and automatically reproducible, because it contains references to all of the data and code that yield the results as an executable virtual machine with source code, provides the maximum public good, and that's what I want science funded with public money (and ultimately all science) to work toward. (I realize that this is just like, my opinion man :)
You are correct in expecting a return on public investment. Actually NIH has a policy that explicitly favors "Basic Scientific Research" over applied research or the application of research. According to NSF and NIH, the primary goal of government-funded research is the advancement of science; this is done via conducting experiments and publishing results at peer-reviewed venues. The peer review, both during the grant application and at the publication stage, factors heavily into assessment by funding agencies. If tomorrow NSF were to give significant weight to the availability of source code (they actually do that to a small extent), it might set up perverse incentives. A small percentage of federally funded research goes into Computer Science, and an even smaller fraction involves results where there is enough demand for software. Another aspect of academic funding that people don't get is that research grants, unlike, say, contracts, have a significantly different set of expectations associated with them. E.g. a student can get an NSF Fellowship claiming he wants to cure diseases using machine learning, only to later spend 3 years working on a music recommendation system (true story!).
Regarding the economics study you linked to, I am very familiar with that study, having seen the interview with the graduate student on The Colbert Report. For non-CS fields the quality of code is generally so bad that reproducibility is much more difficult. Further, many researchers rely on proprietary tools, which only makes the task harder.
In my opinion the correct approach is not to have the NSF impose rules, but rather to have the venues that accept papers (conferences and journals) insist on the software being provided. However, this is easier said than done, since it's a competitive two-sided market.
Regarding actual licensing issues, I can assure you that the GPL is the second-favorite license of university IP departments, the first being "All rights reserved, with modification explicitly forbidden, except for reproduction of experiments."
Ah, so theses are special insofar as they spawn more derivative commercial and paper-writing opportunities, and aren't singled out simply by virtue of being called a "thesis".
The problem here is (EDIT: IMHO) a social one. The question is not "Why does nobody USE NixOS?", it's "Why does nobody WANT NixOS?"
And the answer to that is that the incentives are set up such that reproducibility is a waste of time. As a CS researcher, I want to be idealistic. But the field is competitive and I'm not sure how much idealism I can afford.
There's considerable effort to bring artifact evaluation into the academic mainstream (I'm actually helping out at OOPSLA this year [1]), and I think this is a good way forward.
I'm not trying to argue that the wrong-headed institutional incentives are irrelevant. It is true that once one gets over the learning curve, Nix* gives you artifact reproducibility for free, but that still leaves the problem of the learning curve, and of caring about reproducibility in the first place.
Thank you for yours, and others', work making artifact evaluation a priority. Concurrently, people (including myself) are trying to do something about that learning curve. Hopefully both efforts will massively succeed within a decade :).
Sorry if I've come on too strong. I might've gotten a bit defensive. I've done work in the past that is hard to reproduce and that's not an aspect of my work I'm proud of -- to say the least.
Btw, I'm excited about NixOS. It sounds like you're involved. Thank you. I haven't used it, but I'm hoping to find the time soon.
No you definitely didn't :). I'd say probably a common HN bias is to ignore institutional forces when it is convenient. Glad to hear you are excited about NixOS!
I'm currently in the process of rewriting some of my code for doing certain simulations via probabilistic programming in Haskell. I expected this to be a pain in the ass, but maybe make the code neater. I've actually found the unexpected benefit to be that the code runs deterministically and produces the same results each time, so I know that an apparent result is not going to go away with another run and a fresh PRNG stream.
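For what it's worth, the same benefit can be illustrated in Python/NumPy terms (the commenter's actual code is Haskell; this is only a sketch of the idea): seeding the generator explicitly makes a stochastic simulation return identical results on every run.

    # Sketch only: a seeded PRNG makes the simulation deterministic,
    # so an apparent result can't vanish with a fresh random stream.
    import numpy as np

    def simulate(seed=0, n=100000):
        rng = np.random.default_rng(seed)   # explicit, reproducible stream
        samples = rng.normal(0.0, 1.0, size=n)
        return float(samples.mean())

    assert simulate() == simulate()  # same seed, same answer, every time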
Funny, as students we were expected to solve the exam problem and "show your work"...
I think the root of the problem is that the goals of the researchers are not aligned with the goals of Science. This isn't a criticism of the researchers but instead of the "game" they are forced to play.
For example, the goal of Science is to move the ball (knowledge) down the field for the benefit of mankind. We don't reward researchers for doing that, at least not very well. We reward researchers for writing papers, full stop - not for making their research easy to reproduce or build on.
It's not just the researchers' fault. Maybe industry should help. Mozilla is a major stakeholder in the web platform, which makes distribution easy. Let's make sure the web is the best platform for doing research.
* Provide great scientific and matrix-manipulation libraries within the browser. WebAssembly isn't going to solve this. Why would academia rewrite everything?
* Provide tools that help research be open. Uploading your code to GitHub isn't a solution. The real solution is making it easy to use and link to another person's research. Can we make research as accessible as a JavaScript file that you include in your HTML page to run? And it shouldn't cost the creator anything to host/maintain it. The offline web (still) sucks, and it costs money to run your own servers.
* Provide incentives to use the web for everything. A great one would be an easy-to-use-and-debug toolset, simple ways to get data in and out, and an editing environment that can be set up with one click. The closest thing today is the IPython Notebook, and it still takes work to get there.
Sharing should be the default and easy. If it isn't, we are in no position to complain.
Why not? This sounds like exactly the solution. It's no burden on the researcher - they don't have to alter their research methods to fit a new system, they just dump the code, and leave it up to other users to reimplement.
Building web tools to enable research sounds like precisely the wrong solution, at least in the short to medium term. Research funding doesn't go far enough as it is; you're not going to get researchers changing their processes entirely for no gain. What if they need custom hardware, or access to tools and libraries that haven't been implemented?
Sharing _is_ the default and easy. That's why GitHub has exploded.
You forgot to consider IP issues. Schools vary a lot in their policies, but for many, the code belongs to the school and cannot be open sourced without permission, which requires extra work. Funding sources have their own IP deals to consider too.
It all sounds so easy until you look at the actual constraints. Professors are usually smart and experienced, and they have thought about this stuff a lot. If it were as easy as you think, it wouldn't be a problem.
I publish all code as a matter of lab policy. I chose where to set up my lab partly so that I was able to do this. Not everyone has this luxury or makes this a priority.
You're right, I did, but I wasn't actually addressing ease in some absolute sense; I was talking about the appropriate tools for the job.
If researchers have the legal right to publish their work, I can't see any reason why GitHub wouldn't be exactly the place they'd share it, rather than some custom online research system as proposed by the parent.
That said, I don't have any experience in CS research, it's not my field, so I may be wrong about that, do tell if so.
We share our code on GitHub and publish the commit hash for the code that generated the results in the paper. That way we can continue to develop the code after publication, but readers can retrieve the exact code described in the paper if they wish. Simple and effective.
But again, the IP rules at my university allow this at my sole discretion, which is unusual.
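For readers unfamiliar with the practice, here's a rough sketch of how a results file might be stamped with the commit that produced it (the helper and file names are hypothetical, not the commenter's actual setup):

    # Hypothetical helper: record the exact git commit that produced a result,
    # so readers can later check out that commit and re-run the code.
    import subprocess

    def current_commit():
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    with open("results.txt", "w") as fh:
        fh.write("# generated at commit " + current_commit() + "\n")
        fh.write("accuracy=0.93\n")  # placeholder value, not real data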
We manage our code in a private GitHub repository, but sharing it would be impossible: the research code relies on various pieces developed during 5+ different projects with varying IP policies, and we can't make it available to the public without either getting permission from all of the relevant organizations and companies (not gonna happen) or getting specific targeted funding to replace/rewrite those components so the whole thing can be public - and that's not gonna happen either.
Political influence from the top (the non-scientific top - mostly the funding sources) is the only thing that can improve this. The scientists don't really have any realistic way to do it themselves - i.e. one that doesn't require sacrificing their scientific and personal goals by going significantly against the current incentive system just to slightly improve openness in their fields.
The same goes for datasets - quite a few of the more interesting industry datasets to do science on are available only under very harsh conditions. You can do very interesting and useful analysis that cannot be reproduced by anyone else, because they are unlikely to ever get access to that particular set of data.
An idealistic grad student working on a greenfield project could start and keep that project as open and reproducible as would be ideal - but most grad students build on existing projects with lots of legacy, and there it's different.
Releasing code is a widespread practice in the programming languages and software engineering communities, and one that is getting stronger (see http://artifact-eval.org).
If you are a CS researcher, please fill out this survey of open source and data in computer science:
I guess I can see why many researchers are not inclined to publish their code. I worked as a research assistant at my university's institute of microelectronics for a year, and the code quality was somewhere between mediocre and downright terrible. And that's not to mention the absence of sane software engineering practices like bug tracking or code reviews.
From reading this discussion, it seems that IP rules are inconsistent, so part of the solution may be to have funding agencies enforce a standard license. With that barrier eliminated, scientists wouldn't expend valuable time negotiating with their school's lawyers.
Making research reproducible with publications that actually contain their code (in addition to prose and graphs, all connected together) is one of the goals of Beaker Notebook. It has a publication server where you can share such results and then, with one click, open them in the cloud, modify them as you see fit, and run them again. (If you have seen Beaker before: this last part, the cloud-hosted version, is new as of last week.)
Wow, looks impressive, and that it comes from a hedge fund is cool too.
It's like Jupyter but much more robust; having multiple languages in a single notebook and the ability to share data across languages out of the box is really nice.
I would not be surprised if most academic CS research is not reproducible. This is true for many other fields outside of CS, and I've seen it first hand in machine learning. It's a problem but it's also just how things are.
Maybe it can be thought of as a tooling problem?
Say, a plugin that allows one-click publishing of code + data from Matlab, after which it all goes up on a well-indexed page so that others can download and run it.
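As a rough sketch of what that "one-click publish" could amount to under the hood (hypothetical names throughout, and in Python rather than Matlab): bundle the script, its data, and a small manifest into one archive that a well-indexed page could host.

    # Hypothetical "publish" step: zip code + data + a manifest of hashes.
    import hashlib, json, zipfile
    from pathlib import Path

    def publish_bundle(files, out="artifact.zip"):
        manifest = {f: hashlib.sha256(Path(f).read_bytes()).hexdigest()
                    for f in files}
        with zipfile.ZipFile(out, "w") as zf:
            for f in files:
                zf.write(f)  # the code and data themselves
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))

    # publish_bundle(["analysis.py", "data.csv"])  # hypothetical file names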
I doubt the problem is that academic CS researchers don't know how to publish their code, but rather that the disincentives are usually stronger than the incentives.
Is there even code to publish? I am under the impression a lot of papers from Bugzilla data are of the form "we imported the data into Excel and had a hand-crafted one-off spreadsheet".
In that case, yes: the spreadsheet consists of data plus analysis over that data (aggregations over columns and rows, etc.), so the spreadsheet itself would ideally be version controlled and published.
The idea isn't to ask researchers to formalize what they make more than before, but to include fully reproducible details in the publication. A spreadsheet is totally fine because you can see how it works, reproduce the result, and tweak the inputs/methods to build on it.
It seems like politely writing to the researchers and asking if they still have the code lying around might have good results. (If nothing else, it lets them know someone cares.)
A colleague was working on a replication study. We got the code from the original researcher and another researcher who did a follow-on study. The code barely runs, and the results seem off. I spent days debugging to no avail. Just because researchers provide code does not mean it is well-written code, or that it necessarily works.
Then I'd be skeptical about those results. Releasing the code allows others to judge the likely accuracy and integrity of the results. A lot of things can go wrong in complex, multi-step computational processes. If care and rigor have not been put into them, then I'd have no reason to believe in the integrity of the output. I want the general public to have the right to judge that, as well as to build on it when it's useful and valuable.
Every publication involving data and technical analysis should publish them to a degree that makes validation possible in at least as detailed a way as portrayed in "Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff" (http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...)
I would extrapolate that to all of academia. If you want to work on something that's useful, insightful, and will make the world a better place, the ivory tower of academia is great - provided you're happy to live in denial in an echo chamber.
That's a strong claim. While I'm sure the sentiment is understandable for some fields of academia, others have produced significant results that have been adopted by the industry, sometimes very quickly.
Machine Learning has a lot of theoretical results produced by academia, but the more practical techniques (decision trees, SVMs, neural networks, etc) also all came from academia. The engineers scaled the algorithms to run on bigger datasets, but the initial work is still driving those systems.
Graphics research has seen comparable contributions between industrial research labs and academia. Sure, a higher proportion of papers from academia don't end up practical, but the number of papers that are very practical makes that quite irrelevant. It's to be expected since you can't predict what works in advance, and industrial research labs just don't bother publishing negative results, not that they don't get any.
Many programming languages and compiler techniques came from papers.
I could go on but you see the point - it depends on the field.
But that doesn't say anything about the person in academia who writes a speculative, generic research paper about a way-too-simple implementation of a decision tree. I don't want to demean that accomplishment, but for me it would be hard to sustain excitement from that. Also, I think a counter-argument is that even though the initial work on algorithms X, Y, and Z was sketched out in academia, they still very much would have been scaled in industry once an actual need was found for them.
What percentage of CS research is based on any data at all? It is mostly purely theoretical. Still, it would be nice if they published their proofs as Coq code.
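For illustration of what "publishing proofs as code" means (sketched here in Lean rather than the Coq the comment mentions): a machine-checked proof is just source code that a proof assistant can re-verify, e.g.

    -- A trivial machine-checked statement; re-running the proof assistant
    -- verifies it, which is the kind of artifact the comment asks for.
    theorem add_comm_example (m n : Nat) : m + n = n + m :=
      Nat.add_comm m n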
I think you're conflating a very narrow slice of CS (theoretical computer science) with the larger field. There's a huge amount of CS research that relies on gathering and analyzing data, building systems, etc. Theoretical computer science is actually a very small slice of the research pie.
These are just the areas I work in; it's very hard to make sweeping statements about CS as a whole field, since it varies a lot from sub-community to sub-community.
My perception might be skewed, of course. In the broad range of practical areas I worked in I rarely came across any data-driven papers. Mind giving some examples?
Also, a related HN thread from some years ago: "We need a GitHub of Science" https://news.ycombinator.com/item?id=2425823
---
1: https://osf.io
2: https://github.com/blog/1840-improving-github-for-science