
This unfortunately just isn't possible if you want your research to be readable in the future. "Download?" From where? Links rot. "Run?" In which language, with which compiler, on which CPU architecture? With which dependencies? Code bit-rots. And data formats become unreadable, unless they're self-documenting like plain old CSVs.

That's why algorithms are written in pseudocode, and mathematics is written in, well, mathematical notation. It's expected that if you want to use it you'll re-implement it yourself in a software setup convenient to you.

So you can't publish runnable code unless it's severely limited - written in some standard language that never changes. I'd much prefer someone tell me in words what statistical method they used, so I can type it into Python myself, than be forced to spin up some standardised but old and crufty Fortran 77 that they didn't enjoy writing and I wouldn't enjoy running (I am aware that Python is often just calling the old Fortran libs!). Giving me the Python in the first place also isn't feasible - some scientific analysis code I wrote 6 months ago already doesn't run because of API changes in matplotlib.
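To make the "tell me in words" point concrete: a sketch of what re-typing a described method looks like. Suppose a paper says only "we fit a straight line by ordinary least squares" - that sentence alone is enough to re-implement it, with no dependence on the authors' environment (the function name here is my own):

```python
# A paper says: "we fit a straight line by ordinary least squares."
# That one sentence is enough to re-implement the method from scratch.

def ols_fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])  # exact line y = 1 + 2x
```

No pinned interpreter, no dead download link - the description in prose is the durable artifact.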

A few years ago I saw Python 1.5 code implementing an algorithm I wanted to use. I couldn't run it - instead I read it, and I read the Python 1.5 docs about the bits I was unfamiliar with. If it were in pseudocode it would have been more self-contained.

Code and data formats, other than very crufty ones, are living. They're not suitable as the final form to publish in. If your project and field are ongoing, then by all means try to develop standards and use the same software and formats as each other to make sharing data easier. But the actual publications need to still be readable a decade (or much more) from now, so publishing runnable code seems to conflict too much with that.




The English language, and other languages like it, are living. They're not suitable as the final form to publish in. Who knows what language we'll be speaking in a few millennia? It's best to write in a series of emojis, the universal language.

My point is that we've done a pretty good job at archiving and deciphering ancient text so far. There's no reason to think that we won't be able to emulate current CPU architectures (and run virtual machines on them) for a long time. Pseudocode works for algorithms, and it worked for your case, but it doesn't work for the 99% of scientific software with more than a thousand lines. Executable papers are absolutely possible and are currently being created with virtualization/containers. It's just not easy yet.


> "download?" from where? Links rot.

Best practice here is to self-host + host on a third-party site and run a torrent as well for the archived files. I've done it in published papers and one of the OA journals I publish with strongly recommends this exact path.


Far from being impossible, the challenges that you mention have fairly straightforward solutions. Every cloud hosting provider that executes user-supplied virtual machines is employing practical technology that can solve this problem today.

I'd represent the code as a virtual machine image. Let the researcher work however they want. When they're done, they take a virtual machine snapshot of their execution environment. We provide some tooling so that running the "manifest" in this execution environment (re-)produces all of their computational results. Thus, think of research as a "build" process that runs analysis and produces output. Ideally this build would also compile any necessary source code before running the compiled applications in the context of research data.
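A minimal sketch of that "research as a build" idea, with an entirely hypothetical manifest format: a manifest lists the steps that regenerate every computational result, and a tiny runner executes them in order inside the archived execution environment.

```python
# Hypothetical sketch: a "manifest" of build steps plus a runner that
# re-executes them inside the snapshotted VM/container. The step
# commands here are trivial placeholders for real analysis scripts.
import subprocess
import sys

MANIFEST = {
    "steps": [
        [sys.executable, "-c", "print('preprocess data')"],
        [sys.executable, "-c", "print('run analysis')"],
        [sys.executable, "-c", "print('render figures')"],
    ],
}

def reproduce(manifest):
    """Run every step in order; fail loudly if any result can't be regenerated."""
    outputs = []
    for cmd in manifest["steps"]:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs
```

Because the runner lives inside the same image as the code and data, "re-run the paper" reduces to invoking `reproduce` against the archived manifest.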

Far from being "limited", researchers can run any code that can run on a cloud hosting provider today. To ensure portability, the VM image will include a versioned copy of (or reference to) the kernel (e.g. Linux) and all of the userland libraries that are used by the software (e.g. Python). Think of it like a self-describing Linux container, like a Kubernetes Pod [1] or Docker Container. The image fully describes the software running on the system; it's a precise copy. With this machine image, we can run code in the same execution environment that the researcher used.
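A sketch of such a self-describing image, assuming a Debian base and a Python stack; the tag and package versions below are illustrative, not prescriptive:

```dockerfile
# Illustrative only: every layer is pinned, so the image fully
# describes the userland the analysis ran against.
FROM debian:11.7-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip
# requirements.txt pins exact library versions, e.g. numpy==1.21.0
COPY requirements.txt /analysis/requirements.txt
RUN pip3 install --no-cache-dir -r /analysis/requirements.txt
COPY analysis/ /analysis/
CMD ["python3", "/analysis/run.py"]
```

The point is the pinning: a fixed OS release rather than `latest`, and exact library versions, so rebuilding or re-running the image years later yields the same environment.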

Have you ever run a very old NES or SNES game on a Nintendo DS, or on your PC in an emulator? It's the same concept.

The researcher's data is stored directly within the virtual machine image, or is represented as a virtual data drive that's mounted in the virtual machine. When this approach becomes influential and widely adopted, researchers will represent their code as containers throughout the development process. The researcher won't just run Jupyter (or Matlab or whatnot) on a random machine, they'll e.g. run Jupyter in a carefully packaged minimal container that was designed for reproducibility.

> Code bit rots.

"Bit rot" refers to the difficulty of maintaining code as the environment and systems change around it. It doesn't directly apply to our scenario. A virtual machine image containing a Python 1.0 program running on version 1.0 of Linux will continue to work indefinitely, just like an old SNES game. New software releases don't affect our ability to run old software -- they just make it harder to mix the two together. Furthermore, we can even run the program while passing different (modern) data as input! That alone is huge progress over where we are today.

Sure, if we want to adapt that code to a newer version of Python and Linux, then we have work to do, but that's a different problem than reproducibility. There's no free lunch; nothing can make that problem go away. But if we do want to adapt the researcher's algorithm to another language or platform, then we have a huge advantage: we can run the researcher's actual code with their actual data and get the very same result! That's huge! That will make it far easier to confidently adapt their algorithm to a new environment.
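That advantage amounts to a golden-output test: capture the archived code's result on the original data, then check any port against it. A sketch, where both functions are hypothetical stand-ins:

```python
# Sketch of porting against a golden result. Both functions are
# hypothetical stand-ins: in practice original_method would be the
# researcher's archived code run inside its VM image.

def original_method(data):
    """Stand-in for the archived implementation (e.g. old Python in a VM)."""
    return sum(data) / len(data)

def ported_method(data):
    """Stand-in for the re-implementation on a modern platform."""
    total = 0.0
    for x in data:
        total += x
    return total / len(data)

data = [2.0, 4.0, 6.0, 8.0]
golden = original_method(data)  # result from the archived environment
assert abs(ported_method(data) - golden) < 1e-12
```

Without the runnable original, the port can only be checked against prose; with it, agreement on the actual data is a mechanical test.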

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod/


data => $0.10/GB/year in a public cloud, or even $0.01/GB/year using Glacier/Coldline.

code => a Docker/CFI image.

dependencies => captured in the image.

format => the code understands the format.
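Taking those prices at face value, archiving even a sizable dataset for decades is cheap. A quick back-of-the-envelope check (the 100 GB dataset size is my own illustrative figure):

```python
# Back-of-the-envelope check of the storage prices quoted above.
dataset_gb = 100   # illustrative dataset size
years = 50

hot_cost = dataset_gb * 0.10 * years   # ~$0.10/GB/year in a public cloud
cold_cost = dataset_gb * 0.01 * years  # ~$0.01/GB/year Glacier/Coldline

# 100 GB for 50 years: about $500 hot, $50 cold-archived.
```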


That is short-term thinking: who will guarantee any of that stuff will still be around in 50 years?


Does it cost more to keep code in a standard format, with all the dependencies needed to run it? Yes, especially if you replicate the codebase everywhere the journal article is stored so it has backups in case of disaster.

However, the costs are fairly comparable to what was once the cost of keeping paper copies for everything, so I think it is a cost that can be absorbed by academia.


It is not only about money; the technology, infrastructure, and people needed to keep it running also play an important role.


You can run Apple II code in a browser, for example https://www.scullinsteel.com/apple2 - a code format that has survived for 40 years and counting. The same holds for plenty of other data and code formats that were once popular.

Docker/CFI is very popular; it's hard to believe it will disappear short of a complete breakdown of the digital era.



