
This unfortunately just isn't possible if you want your research to be readable in the future. "Download?" From where? Links rot. "Run?" In which language, with which compiler, on which CPU architecture? With which dependencies? Code bit-rots. And data formats become unreadable, unless they're self-documenting like plain old CSVs.

That's why algorithms are written in pseudocode, and mathematics is written in, well, mathematical notation. It's expected that if you want to use it you'll re-implement it yourself in a software setup convenient to you.

So you can't publish runnable code unless it's severely limited - written in some standard language that never changes. I'd much prefer someone tell me in words what statistical method they used, so I can type it into Python myself, than be forced to spin up some standardised but old and crufty Fortran 77 that they didn't enjoy writing and I wouldn't enjoy running (I am aware that Python is often just calling the old Fortran libs!). Giving me the Python in the first place also isn't feasible - some scientific analysis code I wrote 6 months ago already doesn't run because of API changes in matplotlib.
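To make the "tell me in words" point concrete: a sketch of what re-typing a described method looks like. Suppose a paper says only "we fit a straight line by ordinary least squares" - that sentence alone is enough to re-implement it, with no dependence on the authors' environment (the function name here is my own):

```python
# A paper says: "we fit a straight line by ordinary least squares."
# That one sentence is enough to re-implement the method from scratch.

def ols_fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])  # exact line y = 1 + 2x
```

No pinned interpreter, no dead download link - the description in prose is the durable artifact.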

A few years ago I saw Python 1.5 code implementing an algorithm I wanted to use. I couldn't run it - instead I read it, and I read the Python 1.5 docs about the bits I was unfamiliar with. If it were in pseudocode it would have been more self-contained.

Code and data formats, other than very crufty ones, are living. They're not suitable as the final form to publish in. If your project and field are ongoing, then by all means try to develop standards and use the same software and formats as each other to make sharing data easier. But the actual publications need to still be readable a decade (or much more) from now, so publishing runnable code seems to conflict too much with that.




The English language, and other languages like it, are living. They're not suitable as the final form to publish in. Who knows what language we'll be speaking in a few millennia? It's best to write in a series of emojis, the universal language.

My point is that we've done a pretty good job at archiving and deciphering ancient text so far. There's no reason to think that we won't be able to emulate current CPU architectures (and run virtual machines on them) for a long time. Pseudocode works for algorithms, and it worked for your case, but it doesn't work for the 99% of scientific software with more than a thousand lines. Executable papers are absolutely possible and are currently being created with virtualization/containers. It's just not easy yet.


> "download?" from where? Links rot.

Best practice here is to self-host + host on a third-party site and run a torrent as well for the archived files. I've done it in published papers and one of the OA journals I publish with strongly recommends this exact path.


Far from being impossible, the challenges that you mention have fairly straightforward solutions. Every cloud hosting provider that executes user-supplied virtual machines is employing practical technology that can solve this problem today.

I'd represent the code as a virtual machine image. Let the researcher work however they want. When they're done, they take a virtual machine snapshot of their execution environment. We provide some tooling so that running the "manifest" in this execution environment (re-)produces all of their computational results. Thus, think of research as a "build" process that runs analysis and produces output. Ideally this build would also compile any necessary source code before running the compiled applications in the context of research data.
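A minimal sketch of that "research as a build" idea, with an entirely hypothetical manifest format: a manifest lists the steps that regenerate every computational result, and a tiny runner executes them in order inside the archived execution environment.

```python
# Hypothetical sketch: a "manifest" of build steps plus a runner that
# re-executes them inside the snapshotted VM/container. The step
# commands here are trivial placeholders for real analysis scripts.
import subprocess
import sys

MANIFEST = {
    "steps": [
        [sys.executable, "-c", "print('preprocess data')"],
        [sys.executable, "-c", "print('run analysis')"],
        [sys.executable, "-c", "print('render figures')"],
    ],
}

def reproduce(manifest):
    """Run every step in order; fail loudly if any result can't be regenerated."""
    outputs = []
    for cmd in manifest["steps"]:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs
```

Because the runner lives inside the same image as the code and data, "re-run the paper" reduces to invoking `reproduce` against the archived manifest.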

Far from being "limited", researchers can run any code that can run on a cloud hosting provider today. To ensure portability, the VM image will include a versioned copy of (or reference to) the kernel (e.g. Linux) and all of the userland libraries that are used by the software (e.g. Python). Think of it like a self-describing Linux container, like a Kubernetes Pod [1] or Docker Container. The image fully describes the software running on the system; it's a precise copy. With this machine image, we can run code in the same execution environment that the researcher used.
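A sketch of such a self-describing image, assuming a Debian base and a Python stack; the tag and package versions below are illustrative, not prescriptive:

```dockerfile
# Illustrative only: every layer is pinned, so the image fully
# describes the userland the analysis ran against.
FROM debian:11.7-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip
# requirements.txt pins exact library versions, e.g. numpy==1.21.0
COPY requirements.txt /analysis/requirements.txt
RUN pip3 install --no-cache-dir -r /analysis/requirements.txt
COPY analysis/ /analysis/
CMD ["python3", "/analysis/run.py"]
```

The point is the pinning: a fixed OS release rather than `latest`, and exact library versions, so rebuilding or re-running the image years later yields the same environment.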

Have you ever run a very old NES or SNES game on a Nintendo DS, or on your PC in an emulator? It's the same concept.

The researcher's data is stored directly within the virtual machine image, or is represented as a virtual data drive that's mounted in the virtual machine. When this approach becomes influential and widely adopted, researchers will represent their code as containers throughout the development process. The researcher won't just run Jupyter (or Matlab or whatnot) on a random machine, they'll e.g. run Jupyter in a carefully packaged minimal container that was designed for reproducibility.

> Code bit rots.

"Bit rot" refers to the difficulty of maintaining code as the environment and systems change around it. It doesn't directly apply to our scenario. A virtual machine image containing a Python 1.0 program running on version 1.0 of Linux will continue to work indefinitely, just like an old SNES game. New software releases don't affect our ability to run old software -- they just make it harder to mix the two together. Furthermore, we can even run the program while passing different (modern) data as input! That alone is huge progress over where we are today.

Sure, if we want to adapt that code to a newer version of Python and Linux, then we have work to do, but that's a different problem than reproducibility. There's no free lunch; nothing can make that problem go away. But if we do want to adapt the researcher's algorithm to another language or platform, then we have a huge advantage: we can run the researcher's actual code with their actual data and get the very same result! That's huge! That will make it far easier to confidently adapt their algorithm to a new environment.
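That advantage amounts to a golden-output test: capture the archived code's result on the original data, then check any port against it. A sketch, where both functions are hypothetical stand-ins:

```python
# Sketch of porting against a golden result. Both functions are
# hypothetical stand-ins: in practice original_method would be the
# researcher's archived code run inside its VM image.

def original_method(data):
    """Stand-in for the archived implementation (e.g. old Python in a VM)."""
    return sum(data) / len(data)

def ported_method(data):
    """Stand-in for the re-implementation on a modern platform."""
    total = 0.0
    for x in data:
        total += x
    return total / len(data)

data = [2.0, 4.0, 6.0, 8.0]
golden = original_method(data)  # result from the archived environment
assert abs(ported_method(data) - golden) < 1e-12
```

Without the runnable original, the port can only be checked against prose; with it, agreement on the actual data is a mechanical test.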

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod/


data => $0.10/GB/year in a public cloud, or even $0.01/GB/year using Glacier/Coldline.

code => a Docker/CFI image.

dependencies => captured in the image.

format => the code understands the format.
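Taking those prices at face value, archiving even a sizable dataset for decades is cheap. A quick back-of-the-envelope check (the 100 GB dataset size is my own illustrative figure):

```python
# Back-of-the-envelope check of the storage prices quoted above.
dataset_gb = 100   # illustrative dataset size
years = 50

hot_cost = dataset_gb * 0.10 * years   # ~$0.10/GB/year in a public cloud
cold_cost = dataset_gb * 0.01 * years  # ~$0.01/GB/year Glacier/Coldline

# 100 GB for 50 years: about $500 hot, $50 cold-archived.
```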


That is short-term thinking: who will guarantee any of that stuff will still be around in 50 years?


Does it cost more to keep code in a standard format, with all the dependencies needed to run it? Yes, especially if you replicate the codebase everywhere the journal article is stored so it has backups in case of disaster.

However, the costs are fairly comparable to what was once the cost of keeping paper copies for everything, so I think it is a cost that can be absorbed by academia.


It is not only about money; the technology, infrastructure, and people needed to keep it running also play an important role.


You can run Apple II code in a browser, for example https://www.scullinsteel.com/apple2 - a code format that has survived for 40 years and counting. The same holds for plenty of other data and code formats that were once popular.

Docker/CFI is very popular; it's hard to believe it will disappear short of a complete breakdown of the digital era.



