I live in this field now. I refuse to publish in so-called "highly-ranked journals" that take 3-6 months to find peers to review and referee your work, drag their feet, and basically amount to a ridiculous group of cronies that guard the gates of "science."
I've talked about it before on HN but I will only publish to open access and open data (most important) journals that require all data be scrubbed of PII and published alongside the paper for replication purposes. I highly favor journals that not only allow, but encourage replication (yes, many elite journals cite "novelty" as a top factor regarding publication decisions... embarrassing).
Science should be freely available, open access, open data, and replicable. Otherwise it isn't science. It's primarily garbage that exists to advance the careers and egos of a select few.
Don't forget code! "we then used our in-house untested code, developed by an undergrad we poached from Biology, to normalize the results. The code is conveniently not attached"
"All computer codes involved in the creation or analysis of data must also be available to any reader of Science."
And
"Upon publication, Nature Journals consider it best practice to release custom computer code in a way that allows readers to repeat the published results."
Could we take it one step further? I'd love to see a publication format that bundles the code, the data, and maybe even the paper together into a single executable bundle for distribution.
Imagine that the bundle starts with a manifest file that describes the data and the code, pointing to an authoritative, versioned copy. The user can "run" the manifest, which will download and install the data and any necessary dependencies. (If you're thinking about this from a containerized point of view, the code dependencies and data may be modeled as a container or virtual machine image.)
The manifest is like a build file that executes the code passing the data as input, in a clear and reproducible way. Anyone running the manifest will get the same results as the researcher did. Because the entirety of the execution environment is versioned and captured by the manifest, including all of the system software, the results are reproducible from the code and data phase onward with zero effort.
For bonus points, the manifest would also "build" the paper (e.g. if it's TeX), and would substitute the results of the code execution directly into the paper, i.e. graphs and numbers. You could conceptualize the paper as something like a notebook, where its values can change dynamically according to the data you provide it.
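To make that a bit more concrete, here's a rough sketch of what such a manifest and its runner could look like. Everything below (the manifest fields, the image tag, the URLs, the run_manifest.py helper) is hypothetical, just one possible way to express the idea in Python, not an existing tool:

    # run_manifest.py -- hypothetical runner for a hypothetical manifest.json:
    # {
    #   "data":  {"url": "https://example.org/dataset-v1.2.csv", "sha256": "..."},
    #   "image": "registry.example.org/lab/analysis:1.0",
    #   "steps": ["python analysis.py dataset-v1.2.csv",
    #             "latexmk -pdf paper.tex"]
    # }
    import hashlib, json, os, subprocess, urllib.request

    def run_manifest(path="manifest.json"):
        manifest = json.load(open(path))

        # Fetch the versioned dataset and check it is byte-for-byte identical
        # to the copy the authors published.
        data = manifest["data"]
        fname = data["url"].rsplit("/", 1)[-1]
        urllib.request.urlretrieve(data["url"], fname)
        digest = hashlib.sha256(open(fname, "rb").read()).hexdigest()
        assert digest == data["sha256"], "dataset differs from the published copy"

        # Re-run every step (the analysis, then the paper build) inside the
        # pinned container image, so the environment matches the authors'.
        for step in manifest["steps"]:
            subprocess.run(
                ["docker", "run", "--rm",
                 "-v", f"{os.getcwd()}:/work", "-w", "/work",
                 manifest["image"], "sh", "-c", step],
                check=True)

    if __name__ == "__main__":
        run_manifest()

The point being that the graphs and numbers in the final PDF come out of the same run that any reader can repeat.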
This unfortunately just isn't possible if you want your research to remain readable in the future. "Download?" From where? Links rot. "Run?" On which language, compiler, CPU architecture? With which dependencies? Code bit-rots. And data formats become unreadable, unless they're self-documenting like plain old CSVs, which are fine.
That's why algorithms are written in pseudocode, and mathematics is written in, well, mathematical notation. It's expected that if you want to use it you'll re-implement it yourself in a software setup convenient to you.
So you can't publish runnable code unless it's severely limited: written in some standard language that is unchanging. I'd much prefer someone tell me in words what statistical method they used, so I can type it into Python myself, than be forced to spin up some standardised but old and crufty Fortran 77 or something that they didn't enjoy writing and I didn't enjoy running (I am aware that Python is often just calling the old Fortran libs, though!). Giving me the Python in the first place also isn't feasible: some scientific analysis code I wrote 6 months ago already doesn't run because of API changes in matplotlib.
A few years ago I saw Python 1.5 code implementing an algorithm I wanted to use. I couldn't run it - instead I read it, and I read the Python 1.5 docs about the bits I was unfamiliar with. If it were in pseudocode it would have been more self-contained.
Code and data formats, other than very crufty ones, are living. They're not suitable as the final form to publish in. If your project and field are ongoing, then by all means try to develop standards and use the same software and formats as each other to help share data. But the actual publications need to still be readable a decade (or much more) from now, so publishing runnable code seems to conflict too much with that.
The English language, and other languages like it, are living. They're not suitable as the final form to publish in. Who knows what language we'll be speaking in a few millennia? It's best to write in a series of emojis, the universal language.
My point is that we've done a pretty good job at archiving and deciphering ancient text so far. There's no reason to think that we won't be able to emulate current CPU architectures (and run virtual machines on them) for a long time. Pseudocode works for algorithms, and it worked for your case, but it doesn't work for the 99% of scientific software with more than a thousand lines. Executable papers are absolutely possible and are currently being created with virtualization/containers. It's just not easy yet.
Best practice here is to self-host + host on a third-party site and run a torrent as well for the archived files. I've done it in published papers and one of the OA journals I publish with strongly recommends this exact path.
Far from being impossible, the challenges that you mention have fairly straightforward solutions. Every cloud hosting provider that executes user-supplied virtual machines is employing practical technology that can solve this problem today.
I'd represent the code as a virtual machine image. Let the researcher work however they want. When they're done, they take a virtual machine snapshot of their execution environment. We provide some tooling so that running the "manifest" in this execution environment (re-)produces all of their computational results. Thus, think of research as a "build" process that runs analysis and produces output. Ideally this build would also compile any necessary source code before running the compiled applications in the context of research data.
Far from being "limited", researchers can run any code that can run on a cloud hosting provider today. To ensure portability, the VM image will include a versioned copy of (or reference to) the kernel (e.g. Linux) and all of the userland libraries that are used by the software (e.g. Python). Think of it like a self-describing Linux container, like a Kubernetes Pod [1] or Docker Container. The image fully describes the software running on the system; it's a precise copy. With this machine image, we can run code in the same execution environment that the researcher used.
Have you ever run a very old NES or SNES game on a Nintendo DS, or on your PC in an emulator? It's the same concept.
The researcher's data is stored directly within the virtual machine image, or is represented as a virtual data drive that's mounted in the virtual machine. When this approach becomes influential and widely adopted, researchers will represent their code as containers throughout the development process. The researcher won't just run Jupyter (or Matlab or whatnot) on a random machine, they'll e.g. run Jupyter in a carefully packaged minimal container that was designed for reproducibility.
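For instance, the day-to-day workflow could be as simple as always launching Jupyter inside the pinned image, with the raw data mounted read-only (the image name and paths here are hypothetical):

    # work_in_container.py -- hypothetical day-to-day workflow: do all interactive
    # work inside the same pinned image that will later be archived with the paper.
    import os, subprocess

    IMAGE = "registry.example.org/lab/analysis:1.0"    # pinned, versioned image

    subprocess.run(
        ["docker", "run", "--rm", "-p", "8888:8888",
         "-v", f"{os.getcwd()}/data:/data:ro",         # raw data, read-only
         "-v", f"{os.getcwd()}/notebooks:/notebooks",
         IMAGE, "jupyter", "lab", "--ip=0.0.0.0", "--no-browser"],
        check=True)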
> Code bit rots.
"Bit rot" is a phenomenon that references the difficulty of maintaining code as environment and system change around it. It doesn't directly apply to our scenario. A virtual image containing a Python 1.0 program running on version 1.0 of Linux will continue to work indefinitely, just like an old SNES game. New software releases don't affect our ability to run old software -- they just make it harder to mix the two together. Furthermore, we can even run the program while passing different (modern) data as input! We've already made huge progress over where we are today.
Sure, if we want to adapt that code to a newer version of Python and Linux, then we have work to do, but that's a different problem than reproducibility. There's no free lunch; nothing can make that problem go away. But if we do want to adapt the researcher's algorithm to another language or platform, then we have a huge advantage: we can run the researcher's actual code with their actual data and get the very same result! That's huge! That will make it far easier to confidently adapt their algorithm to a new environment.
Does it cost more to keep code in a standard format, with all the dependencies needed to run it? Yes, especially if you replicate the codebase anywhere the journal article is stored so it has backups in case of disaster.
However, the costs are fairly comparable to what was once the cost of keeping paper copies for everything, so I think it is a cost that can be absorbed by academia.
You can run Apple II code in a browser, for example https://www.scullinsteel.com/apple2. A code format that has survived for 40 years and counting. The same holds for plenty of other data and code formats that were once popular.
Docker/OCI is very popular; hard to believe it will disappear short of a complete breakdown of the digital era.
Would all this be possible using git? In that case, even normal people (non-professors, non-R&D engineers) who read research papers to quench their curiosity would have to know how to use git.
You would need more than a non-programmer-friendly interface on top: biologists, astronomers, chemists, et al. usually aren't fully-fledged programmers themselves, and aren't likely to understand what Docker is, let alone how to use it. A scientific programming project framework is a great idea (perhaps something built into Julia, or a specialized packaging of Python similar to Anaconda), but it would require a lot of the complex machinery to be concealed somehow. A challenging project, even before the "get editors to demand it" stage.
That’s exactly what I meant—conceal the details so anyone can use it, but base those details on common standards so it’s possible for a savvy individual to work with such packages manually without vendor lock-in. The true challenge definitely isn’t the technical side—it’s getting people to adopt it, and thereby make it standard.
Not entirely true: I am an astrophysicist, and I can say that this practice would be extremely useful for my discipline as well. In cosmology (my field) we use codes that are often extremely complex, yet they are not usually released together with the papers based on their outputs. This puzzles me, as last year I attended a conference on informatics and astrophysics (ADASS), and one of the talks showed that releasing your code as open source greatly increases the chance of your papers being cited (especially if you advertise your code using services like ASCL [1]).
The only reason I can think of for this unwillingness to publish code is that the code written by physicists is often extremely unreadable: very long Fortran routines with awkward variable names, no comments, no test cases, no documentation… Once you get a result from your own software, you get more satisfaction from publishing a paper with the results than from polishing and publishing the code.
IMHO this is bad science, but it is difficult to change this way of working: cosmology is today done by large collaborations, not by individuals, and if you propose to make the codebase developed by your team public, this idea is usually not welcomed by your co-authors.
I really enjoy PeerJ and their mission. They lack a Sports Science journal which severely limits what I can publish, but I find my way into it via other disciplines when relevant. I help them out as much as possible.
Well then let me bite on the question of the problem with such high payouts.
"The next question is whether there’s anything wrong with this idea, and if so, what. It makes me uncomfortable, but that’s not the appropriate measure. As long as the journals themselves keep up their editorial standards, the main effect would seem to be that their editorial staffs must get an awful lot of China-derived manuscripts whose authors are hoping to get lucky."
Well, I'd offer that this analysis ignores how science has historically worked.
Essentially, science has involved something like a "club" of individuals who can be trusted to make a sincere effort to find the truth. Certainly, the use of exact instruments and the theoretical reproducibility of experiments were important, but dealing with individuals who are reliable, systematic, and trustworthy was also important.
Which is to say that science journals aren't designed to filter out articles which are wholly fraudulent, a tissue of lies cleverly created to mimic an actual scientific breakthrough. With a clever enough person, naturally, there is no way to spot the fraud from the article alone; one would have to go into the laboratory, attempt to reproduce the experiment, and so forth (and reproducibility was the key historically, most experiments weren't actually reproduced, and no journal is going to maintain a lab to reproduce submitted articles - except perhaps in CS, where a program might be submittable).
And there you have it. The lab-coated scientist is a movie character, but to be a scientist, at least once upon a time, was more than to engage in certain activities. It was to be part of the "broad march of progress", where the lesser scientist still helped move things forward by small, honest iterations of basic research (and this "club" arguably was disproportionately and unfairly white, male, and first-world, but every process has its weaknesses).
A move which makes the pursuit of science a purely adversarial affair, by consigning poor performers to absolute poverty and giving fairly vast rewards to good performers, is going to break the club everywhere.
And the implications of the end of science as an idealist pursuit vary from field to field. Fraud in CS or math might be impossible or might be rendered impossible with suitable measures. But in social sciences or other fields, things could get nasty indeed.
In practice, math papers are basically arguments for something. It's possible to make a bad-faith argument by ignoring that A does not imply B and just hoping nobody notices. You can even structure things to be intentionally misleading.
Honestly, I suspect people fudge things in math more than you might think.
How reliable math proofs are is pretty tangential to my argument above - I don't actually have any attachment to there being no fudging in modern math proofs - notice I said fraud "might be" made impossible in these fields.
That said, a tool like Coq will deal with any inaccuracies as things go forward. Some portion of math has supposedly been verified with it.
But, of course, I haven't run the program, just read about it. So still - no dog in that particular race.
Sure, but consider: you've just spent months working on something and you're giving up. You can toss that work away, or gamble that nobody notices the error. Further, if they catch something, you haven't really lost anything at that point anyway.
What's wrong with poor performance leading to poverty? There seem to be too many scientists, at least publishing too many papers. Wouldn't it be better if a subset of them quit and did more productive work instead? That would also ease the pressure on publication and fraud for the remaining ones.
I don't believe there are really hero superstars in science. Sure, there's a gradient of ability, but just because you happened to be the first one to discover something important doesn't mean you'll always be the best at future discoveries. There must be a lot that depends on the environment they work in.
The rate of scientific progress is roughly proportional to the number of working scientists. People might have other value judgements, but I don't think science is progressing nearly fast enough.
We've been trying hard to cure cancer for decades, with significant progress but we haven't cut the death rate even in half yet. We can't get people to Mars. Phones still need batteries charged every day or two. Most cars still burn fossil fuels. We don't have clean, safe, nuclear power. We don't understand nearly everything about how the brain works, and millions of people suffer from poorly controlled mental illnesses. We don't have household robots to cook and clean. It takes a whole day to get to the other side of the world.
These are all bottlenecked by science. 2x more scientists would result in substantial improvements in daily life within a decade. 10x would be better.
Science has both diminishing and accelerating returns. Diminishing because of duplication of effort, accelerating because large projects need lots of people. 1 scientist can't build a LIGO detector in any number of years. Across all the disciplines, linear seems like a good approximation.
"What's wrong with poor performance leading to poverty?"
- It incentivizes fraud.
If the only way to avoid a McDonald-level job is a ground-breaking paper, you'll be willing to fake that ground-breaking paper, especially since if you're caught, you can still get the McDonald-level job sooner or later.
I am referring to a lot of current conditions, where adjunct professorships are paid at close to bare survival wages and "superstars" get substantial compensation.
A scientist once apologized for being a jerk to me at a meeting the previous year, after research that I presented there was published in a highly-ranked journal. Oddly, I didn't even remember the incident in question, but the change in opinion, apparently just because of the publication venue, has stuck with me for a long time.
I published in Science and Nature during my PhD, and other than some prestige and help getting a fellowship I never received any sort of direct payout. I don't think my supervisor got anything directly either, though it definitely helped future grant applications.
When I was in grad school, you were sent to the front of the line when applying for tenure track positions. (If you got a paper or two into those two publications).
However, people still had to like you in the interview and job talk.
I'm assuming these were not 1st author publications. If you have a 1st author Nature and a 1st author Science paper during grad school, Universities should be rolling out the red carpet.
They were both first author; details are on my very out-of-date website, linked in my profile, if you are interested. Definitely no red carpet rolling out; not sure if you are in academia, but those days have passed. Could I possibly have done another postdoc and managed a faculty position somewhere in the USA? Probably. Was it worth it to me? Nope. Felt like I failed for a while; it's hard to escape from the academia cult.
Whoa. Well first congrats, that is a triumphant feat for a grad student.
I am truly surprised your faculty search wasn't more fruitful, given there is hardly much more you could have done other than maybe secure a K99 grant.
Why do you suspect your CV was not top-tier competitive? I ask because, yes, I am in academia, and my experience (including being on faculty search committees) suggests you would have at least been invited to job talks.
Heh .. I don't want this to turn into an AMA. But yes .. I too have questions. How long have you been away from academia, and how do you feel about it? I had success in grad school (papers in top CS conferences). I went the industrial research route, which paid well initially. Now I'm nearly 40, have no tenure, and have a gap of a few years in my publication history. I am very tempted to get out of the game but am afraid of it being a one-way street.
Hah, I don't really enjoy revisiting my decision-making, but I'll try to answer. I've been away for just over a year, and I miss the people and the freedom, but I don't miss the politics and the rat race.
Grad school was successful. However, my postdoc was a different story: I got a relatively prestigious fellowship at a DOE lab, but my supervisor quit before I started, and while I tried to do my own stuff I really didn't get much support, and it wasn't much fun or productive. I quit after 2 years, about 1 year ago, and joined a small biotech company in the Bay Area. I'm pretty sure I already have too large a gap in my publications to go back. I thought about returning to academia for a bit, but that would have entailed doing a 2nd postdoc, and I really wasn't interested in moving to some random place for 3+ years, then dealing with the whole faculty application process again, and then 5+ more years before eventually getting tenure.
If you don't hate writing grants maybe academia can be for you.
Do you have any interest in neuroscience? My lab (malinowlab.com) has two postdoc openings right now at UCSD, and we take smart people from any STEM field.
In my opinion, in retrospect, I should have joined well known groups that had a strong history of placing students and postdocs in faculty positions at top schools.
Certainly not directly, but in most universities publication output is the main driver for salary negotiation (and advancement on the tenure track). Most places will put you on a standard scale which increments each year, and having a top-tier publication may be enough to negotiate a bump of a couple of rungs (worth a thousand or two per annum in the UK).
I would imagine the same applies to most jobs - rather than an on-the-spot bonus, if you published in a prestigious journal as part of your job, it would be strong grounds for some compensation at your next review.
In addition to the prestige and career boost, you get a really swell, awesome feeling and get to be a permanent part of the encyclopedia of knowledge. You're in libraries now!
In chemical physics, many of the most influential papers (tens of thousands of citations) have been published in The Journal of Chemical Physics, an absolutely non-flashy publication with impact factor ~3 and a myriad of uninteresting papers.
That happens if you publish in any indexed journal. Perhaps more people will see your name if you publish in a higher-profile journal, but that is also not a certainty (though highly probable).
In addition to providing more incentive for manipulation of results and flashy research, it also rewards researchers not for the contents of their research, but for where they manage to get it published. The "top tier" journals especially place emphasis on noteworthiness, disincentivising e.g. replication studies, and they often have a higher number of retractions [1].
It also means that the position of the traditional, subscription-based journals is cemented further, even though many funders are aiming to transition to open access publishing.
So overall, I guess I'm not that enthusiastic about this.
How does intellectual property work for published research based on approaches developed while consulting for a company?
For example, let's say you develop a novel modeling methodology for a company who hired you. You'd like to publish the methodology and give conference talks on it. Since you're hired, your work and the methodology would belong to the company as their asset.
Are there certain types of licenses applicable here? How about in the case of the article where researchers are compensated for presenting a talk on that methodology?
>>How does intellectual property work for published research based on approaches developed while consulting for a company?
This is negotiable upfront for industrial research positions. You should always review your IP contract inside your job offer packet and battle HR/management hard on the rights to the IP and compensation surrounding it. Most boilerplate language assigns all IP rights to the company you work for with a maximum of $1 consideration paid for your work (to satisfy contract law if it is ever argued in a court), with acknowledgements or minor credit to you in the official documentation.
Needless to say, you shouldn't blindly agree to that unless your base salary is commensurate with such a loss of control and future rights of your work.
Thanks, that's helpful as an upfront approach. How about if a contract is already in place and the company is okay with the methodology being published, as long as they retain rights to it?
Is there a licensing vehicle that allows for publishing while remaining an asset of the company?
For example, Square built their Dagger library and uses the Apache license. http://square.github.io/dagger/
My impression is that Dagger is still an asset of Square, but you can use it as long as you meet the licensing requirements.
Is there an equivalent to this for a modeling approach or work done by a consultant?
IANAL. However, as an industrial researcher in CS, the standard is this: once you patent your work, the company is okay with you publishing. There are exceptions. There is a story (no idea if true or not) that the folks who published the Dynamo paper almost got fired. Some ideas are so important to the business that they may not be patented and are instead kept as trade secrets. All of this is negotiable and often done on a case-by-case basis. Most employers I've had have a standard review process where the researchers submit the paper (to be submitted) and we get an approval that it is okay to publish. There is typically a checkbox that asks if you sought protection for any IP in the paper, etc.
Thanks, very helpful perspective. Once a paper is reviewed and approved, are there any rights around being able to present that paper at a conference?
I'm trying to understand what kind of legal transformation takes place once a paper on a non-patented methodology is released. I've skimmed the Dynamo paper and it seems a good analogy in that it's about general database architecture concepts, rather than the specific implementation at Amazon. I might be wrong though, as I haven't read it deeply.
For example, let's say you're the primary author of a paper and then later leave the company. Can there be any legal restriction preventing you, or others who have never worked at the company, from giving a conference presentation on that paper once published?
Technical development and design that results in some "method and apparatus" to do a particular thing is patentable, but the research that comes before that R&D, which leads to understanding how stuff works and what might work, is not patentable. And that's not speaking of less technical fields of science where nothing is patentable, or computer science, which is not patentable in much of the world (software patents aren't valid in the EU and other places).
Also, patenting is expensive. Because you need to file a patent in multiple major markets, you need multiple different patent lawyers; if you're outside the USA, filing a patent in the USA is a bit of a hassle; and just covering the USA plus the main EU countries gets quite costly quickly. So that's a reason not to patent things unless you have a clear picture that some product is going to violate the patent and thus make it valuable.
In my experience, the lawyers just chuck 'A System for' in front of the paper's title and open the firehose at the patent office. Lots of ridiculous stuff gets through...
They aren't mutually exclusive. As a researcher/engineer, whenever we publish in a top tier journal the company usually files for patents on the main contribution too.
Not everything can be patented and some folks have ethical concerns about keeping research open. Not to mention, you pretty much have to publish once you get into grad school.
If you're going for your Ph.D. then your goal is to further the art.
It means a lot of status and prestige on Reddit and other 'smart' communities, especially if the paper is in a STEM subject. Outside of academia and offline, very little.