Don't forget code! "We then used our in-house untested code, developed by an undergrad we poached from Biology, to normalize the results. The code is conveniently not attached."



To be fair:

"All computer codes involved in the creation or analysis of data must also be available to any reader of Science."

And

"Upon publication, Nature Journals consider it best practice to release custom computer code in a way that allows readers to repeat the published results."


Could we take it one step further? I'd love to see a publication format that bundles the code, the data, and maybe even the paper itself into a single executable package for distribution.

Imagine that the bundle starts with a manifest file that describes the data and the code, pointing to an authoritative, versioned copy. The user can "run" the manifest, which will download and install the data and any necessary dependencies. (If you're thinking about this from a containerized point of view, the code dependencies and data may be modeled as a container or virtual machine image.)

The manifest is like a build file that executes the code passing the data as input, in a clear and reproducible way. Anyone running the manifest will get the same results as the researcher did. Because the entirety of the execution environment is versioned and captured by the manifest, including all of the system software, the results are reproducible from the code and data phase onward with zero effort.

For bonus points, the manifest would also "build" the paper (e.g. if it's TeX), substituting the results of the code execution, i.e. graphs and numbers, directly into the paper. You could conceptualize the paper as something like a notebook, whose values change dynamically according to the data you provide it.
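
To sketch what I mean (every name, URL, and field below is made up; this isn't an existing standard), the manifest and a bare-bones runner for it might look something like:

    # Hypothetical manifest for an "executable paper"; all values are illustrative.
    import hashlib
    import subprocess
    import urllib.request

    MANIFEST = {
        "image": "registry.example.org/lab/analysis@sha256:...",  # pinned environment image
        "data": {
            "url": "https://archive.example.org/dataset-v1.tar.gz",
            "sha256": "...",  # content hash, so the downloaded bytes are verifiable
        },
        "command": ["python", "analysis.py", "/data/input"],
    }

    def run(manifest):
        # Fetch the versioned data and verify its integrity against the manifest.
        path, _ = urllib.request.urlretrieve(manifest["data"]["url"])
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        assert digest == manifest["data"]["sha256"], "data does not match manifest"
        # Run the pinned image with the data mounted read-only.
        subprocess.run(
            ["docker", "run", "--rm", "-v", f"{path}:/data/input:ro",
             manifest["image"], *manifest["command"]],
            check=True,
        )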


This unfortunately just isn't possible if you want your research to remain readable in the future. "Download?" From where? Links rot. "Run?" In which language, compiler, CPU architecture? With which dependencies? Code bit rots. And data formats become unreadable (if they're self-documenting like plain old CSVs, they're fine).

That's why algorithms are written in pseudocode, and mathematics is written in, well, mathematical notation. It's expected that if you want to use it you'll re-implement it yourself in a software setup convenient to you.

So you can't publish runnable code unless it's severely limited, i.e. written in some standard language that never changes. I'd much prefer someone tell me in words what statistical method they used, so I can type it into Python myself, than be forced to spin up some standardised but old and crufty Fortran 77 or something, which they didn't enjoy writing and I didn't enjoy running (I am aware that Python is often just calling the old Fortran libs, though!). Giving me the Python in the first place also isn't feasible: some scientific analysis code I wrote 6 months ago already doesn't run because of API changes in matplotlib.

A few years ago I saw Python 1.5 code implementing an algorithm I wanted to use. I couldn't run it - instead I read it, and I read the Python 1.5 docs about the bits I was unfamiliar with. If it were in pseudocode it would have been more self-contained.

Code and data formats, other than very crufty ones, are living. They're not suitable as the final form to publish in. If your project and field are ongoing, then by all means try to develop standards and use the same software and formats as each other to help share data. But the actual publications need to still be readable a decade (or much more) from now, so publishing runnable code seems to conflict too much with that.


The English language, and other languages like it, are living. They're not suitable as the final form to publish in. Who knows what language we'll be speaking in a few millennia? It's best to write in a series of emojis, the universal language.

My point is that we've done a pretty good job at archiving and deciphering ancient text so far. There's no reason to think that we won't be able to emulate current CPU architectures (and run virtual machines on them) for a long time. Pseudocode works for algorithms, and it worked for your case, but it doesn't work for the 99% of scientific software with more than a thousand lines. Executable papers are absolutely possible and are currently being created with virtualization/containers. It's just not easy yet.


>>"download?" from where? Links rot.

Best practice here is to self-host + host on a third-party site and run a torrent as well for the archived files. I've done it in published papers and one of the OA journals I publish with strongly recommends this exact path.


Far from being impossible, the challenges that you mention have fairly straightforward solutions. Every cloud hosting provider that executes user-supplied virtual machines is employing practical technology that can solve this problem today.

I'd represent the code as a virtual machine image. Let the researcher work however they want. When they're done, they take a virtual machine snapshot of their execution environment. We provide some tooling so that running the "manifest" in this execution environment (re-)produces all of their computational results. Thus, think of research as a "build" process that runs analysis and produces output. Ideally this build would also compile any necessary source code before running the compiled applications in the context of research data.

Far from being "limited", researchers can run any code that can run on a cloud hosting provider today. To ensure portability, the VM image will include a versioned copy of (or reference to) the kernel (e.g. Linux) and all of the userland libraries that are used by the software (e.g. Python). Think of it like a self-describing Linux container, like a Kubernetes Pod [1] or Docker Container. The image fully describes the software running on the system; it's a precise copy. With this machine image, we can run code in the same execution environment that the researcher used.

Have you ever run a very old NES or SNES game on a Nintendo DS, or on your PC in an emulator? It's the same concept.

The researcher's data is stored directly within the virtual machine image, or is represented as a virtual data drive that's mounted in the virtual machine. When this approach becomes influential and widely adopted, researchers will represent their code as containers throughout the development process. The researcher won't just run Jupyter (or Matlab or whatnot) on a random machine, they'll e.g. run Jupyter in a carefully packaged minimal container that was designed for reproducibility.
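
A rough sketch of that freeze-and-replay flow, assuming Docker, with all container and image names invented for illustration:

    # Freeze the researcher's working environment, then replay the analysis in it.
    import subprocess

    def snapshot(container="my-analysis", image="lab/paper-env:v1"):
        # Commit the running container into an immutable image (the "VM snapshot").
        subprocess.run(["docker", "commit", container, image], check=True)

    def replay(image="lab/paper-env:v1", data_dir="/srv/dataset-v1"):
        # Re-run the analysis inside the frozen environment, with the research
        # data mounted read-only as a virtual data drive.
        subprocess.run(
            ["docker", "run", "--rm", "-v", f"{data_dir}:/data:ro",
             image, "make", "results"],
            check=True,
        )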

> Code bit rots.

"Bit rot" is a phenomenon that references the difficulty of maintaining code as environment and system change around it. It doesn't directly apply to our scenario. A virtual image containing a Python 1.0 program running on version 1.0 of Linux will continue to work indefinitely, just like an old SNES game. New software releases don't affect our ability to run old software -- they just make it harder to mix the two together. Furthermore, we can even run the program while passing different (modern) data as input! We've already made huge progress over where we are today.

Sure, if we want to adapt that code to a newer version of Python and Linux, then we have work to do, but that's a different problem from reproducibility. There's no free lunch; nothing can make that problem go away. But if we do want to port the researcher's algorithm to another language or platform, we have a huge advantage: we can run their actual code with their actual data and get the very same result! That makes it far easier to confidently adapt the algorithm to a new environment.

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod/


data => $0.10 / GB / year in a public cloud, or even $0.01 / GB / year using Glacier / Coldline.

code => a Docker/OCI image.

dependencies => captured in the image.

format => the code understands the format.
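
As a rough sanity check on the storage figure (the dataset size and time horizon here are my own assumptions):

    # Back-of-the-envelope archival cost at the rates quoted above.
    dataset_gb = 100   # assumed dataset size
    years = 50         # assumed archival horizon
    standard = 0.10 * dataset_gb * years   # $0.10 / GB / year
    cold = 0.01 * dataset_gb * years       # $0.01 / GB / year (Glacier/Coldline)
    print(f"standard: ${standard:.0f}, cold: ${cold:.0f}")  # standard: $500, cold: $50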


That is short-term thinking. Who will guarantee that any of that stuff will still be around in 50 years?


Does it cost more to keep code in a standard format, with all the dependencies needed to run it? Yes, especially if you replicate the codebase everywhere the journal article is stored, so it has backups in case of disaster.

However, the costs are fairly comparable to what keeping paper copies of everything once cost, so I think it is a cost that can be absorbed by academia.


It is not only about money: the technology, infrastructure, and people needed to keep it running also play an important role.


You can run Apple II code in a browser, for example at https://www.scullinsteel.com/apple2. That's a code format that has survived for 40 years and counting, and the same holds for plenty of other data and code formats that were once popular.

Docker/OCI is very popular; it's hard to believe it will disappear short of a complete breakdown of the digital era.



Would all this be possible using git? In that case, even normal people (non-professors, non-R&D engineers) who read research papers to quench their curiosity would have to know how to use git.


You could use an open protocol based on git, Docker, &c. internally, and build some interface friendly to non-programmers on top.


You would need more than a non-programmer-friendly interface on top: biologists, astronomers, chemists, et al. usually aren't fully fledged programmers themselves, and aren't likely to understand what Docker is, let alone how to use it. A scientific programming project framework is a great idea (perhaps something built into Julia, or a specialized packaging of Python similar to Anaconda), but it would require a lot of the complex machinery to be concealed somehow. A challenging project, even before the "get editors to demand it" stage.


That’s exactly what I meant—conceal the details so anyone can use it, but base those details on common standards so it’s possible for a savvy individual to work with such packages manually without vendor lock-in. The true challenge definitely isn’t the technical side—it’s getting people to adopt it, and thereby make it standard.
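
For instance, the friendly layer could reduce to a single command, with plain git and Docker underneath (the script and the registry URL are entirely hypothetical):

    # Hypothetical one-command front end that hides git and Docker from the user.
    import subprocess
    import sys

    def reproduce(paper_id):
        # Under the hood it's just open standards: git for code, Docker for the env.
        repo = f"https://papers.example.org/{paper_id}.git"  # made-up registry
        subprocess.run(["git", "clone", repo, paper_id], check=True)
        subprocess.run(["docker-compose", "up"], cwd=paper_id, check=True)

    if __name__ == "__main__":
        reproduce(sys.argv[1])  # e.g. python reproduce.py example-paper-2018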


Sounds like a Jupyter Notebook.


And that's just the CS papers.


Not entirely true: I am an astrophysicist, and I can say that this practice would be extremely useful for my discipline as well. In cosmology (my field) we use codes that are often extremely complex, yet they are not usually released together with the papers based on their outputs. This puzzles me: last year I attended a conference on informatics and astrophysics (ADASS), and one of the talks showed that releasing your code as open source considerably increases the chance of your papers being cited (especially if you advertise your code using services like ASCL [1]).

The only reason I can think of for this unwillingness to publish code is that the code written by physicists is often extremely unreadable: very long Fortran routines with awkward variable names, no comments, no test cases, no documentation… Once you get a result from your own software, you get more satisfaction from publishing a paper with the results than from polishing and publishing the code.

IMHO this is bad science, but it is difficult to change this way of working: cosmology today is done by large collaborations, not by individuals, and if you propose making your team's codebase public, the idea is usually not welcomed by your co-authors.

[1] http://ascl.net/



