Handful of Biologists Went Rogue and Published Directly to Internet (nytimes.com)
492 points by srikar on March 15, 2016 | 169 comments



When I first found out about the web, in the early 90s, it was "obvious" that the role of the web was to expand scientific publishing. I expected that everybody would publish latex files (raw source, not PDFs) in computationally accessible formats, with raw data in easily parseable, semantic forms.

That didn't really happen as expected. In my own chosen field (biology) it happened much more slowly than I hoped; physics (with the arXiv) was far better. However, just getting PDFs onto bioRxiv is only a small part of the long game. I did not appreciate the huge effect that publishing in high-profile journals has on one's career trajectory, or how large a role that would play in slowing the transition to free publication and post-publication review.

The long game is to enable the vast existing resources, and the new resources, to be parsed semantically by artificial intelligence algorithms. We've already passed the point where individuals can understand the full literature in their chosen field and so eventually we will need AI just to make progress.


> I expected that everybody would publish latex files (raw source, not PDFs) in computationally accessible formats, with raw data in easily parseable, semantic forms.

Do you realize that many researchers would be ashamed if others saw their latex code?

A friend who worked for a journal kept a folder of the worst he had ever seen, and could rant for hours about some gems, including papers that were beautifully typeset, about elegant combinatorics/type theory, and were coded in the latex equivalent of a spaceship cobbled together from recycled plastic bottles, scraps and duct tape.

Can you imagine going through a thousand-line macro tex file that grew over the years, when LaTeX is one of the worst messes ever, with conflicting packages, absurd syntactic 'features', and crippling legacy flaws that make debugging a horrible task?

Please, continue to compile your .tex.


> Do you realize that many researchers would be ashamed if others saw their latex code?

I really don't get this argument; I see it come up regarding open data for reproducibility too. TeX/LaTeX source is just an extra piece of metadata which might help someone out in a few years' time. That's nothing to feel ashamed of, and it's certainly less shameful than not providing the code. If the code allows flaws in the research to be exposed, that's either something to be proud of, or (if it was intentional) it's not the code that's shameful.

My own LaTeX is mostly an inconsistent mixture of copypasta from stackexchange. That doesn't mean I'm ashamed of it, and it's all in public Git repos at http://chriswarbo.net/git if any human or bot actually cares to look.

Speaking of which, the commit histories of those Git repositories aren't very good either. Lots of commit messages like "Fixes", or "Oops, typo in last commit", with multiple changes bundled into single commits, atomic changes spread across multiple commits, etc. But, in the same way as my LaTeX source, it's better than having no commit history, which would be the case if I set myself higher standards.

Of course, I can follow higher standards, but that's when I'm paid to and/or when collaboration is important. For personal stuff, I'd rather invest the time I save interacting with Git on writing more tests, or refactoring, etc.


It would be much easier to get better if you could view others' latex code.


I will never understand why latex devotees think everyone else should be really into typesetting as a hobby. It's like sneering at photographers who don't want to mix darkroom chemicals.


That's not a good comparison, because mixing your own chemicals for photo development adds time, effort, and defects to the process.

Becoming more proficient with TeX saves time and makes for more re-usable typesetting structures, which allow you to accurately do what you want faster and faster.

It's similar to using emacs or vim. You could say, "Why should programming devotees expect others to adopt those editors?"

It's about productivity. I'm a productivity devotee, and when I hear people use short-term thinking to discount the value of paying the up-front cost of learning TeX (or emacs/vim, etc.), it motivates me to help them see why that reasoning doesn't hold.

Of course, I don't want to be dogmatic either. If something works for you and your personal utility function discounts the value of things like TeX or emacs, that's fine.


Becoming more proficient with TeX saves time and makes for more re-usable typesetting structures, which allow you to accurately do what you want faster and faster.

Not necessarily; there's an opportunity cost to all the time you spend becoming proficient with TeX. If there are other products that offer sufficient control over typesetting with less time and effort, and you don't anticipate needing the advanced features that are unavailable in competing products, then becoming a TeX expert is a complete waste of time. I'm happy to concede that it's probably the best tool and certainly the best value for money, but if someone's needs are quite limited then an adequate-but-easy-to-use tool may be preferable to the superior-but-user-unfriendly one.


> It's similar to using emacs or vim

Actually, that's a great comparison. I don't use either of those (yeah, I said it) because of their arcane interfaces. I don't doubt for a second that wizards with those programs, and LaTeX, can do great things, but IDEs exist for the same reason word processors do - to make it easier to focus on creating.


Yes, but if you pay the up-front cost of learning the interface (which, far from being arcane, is actually designed the way it is because it's ergonomic), then you can focus even more on creating than the person using the IDE. While they are stuck debugging some autocomplete error in Eclipse, you've already finished your code project, written a letter to your mom, dabbled in an interactive R session to teach yourself R, and played Tetris, all without touching the mouse.

IDEs exist for the same reason 7/11 exists. Convenience. You pay a higher price (lost productivity) to experience certain comforts (clicking a button instead of issuing a keystroke command).

My experience has universally been that IDEs only get in the way of focusing on creating. They might be necessary when you are literally just beginning with software and you need a lot of scaffolding to help you -- and I don't think anyone would begrudge university students using IDEs.

But not using a power editor once you are an experienced programmer is hard to understand. I mean, it's so important that it was even included as a whole chapter in The Pragmatic Programmer.

Paying the up-front costs of proficient editor usage is like paying the up-front costs of wearing braces so you can have straight teeth, or wearing corrective shoes so you can walk properly. Of course it's more convenient in an absolute sense to simply not wear braces or corrective shoes. But that's not the point. The point is that the value you get from the outcome (e.g. straight teeth, proper walking, or dramatically increased productivity when developing) is far greater than the cost.

In other words, I think you're using a hyperbolic discount function. You are placing such a high value on absolutely immediate "focus on creating" that you incorrectly discount how much more focus on creating you would be able to do with the power of a real editor framework.


It's about productivity for the 1% or 5% of stars who understand it and eat it and breathe it.

For the 90%+, it's an arbitrary checkbox requirement to be checked on the path towards something else.


I don't eat and breathe LaTeX (although I try to use it for all my typesetting needs: after a few years in academia it's much faster to do it this way than to open Word), but the productivity gain of writing while forgetting how it looks, just deciding how you want it to look and letting the system handle it for you, is invaluable.


> It's about productivity.

No, it is about aesthetics, which are mostly irrelevant for 99% of all research papers.


It's not just aesthetics. Proper typesetting makes papers more accessible and easier to read and parse; the standard formatting makes the layout of papers easier to understand; there is a huge library of packages that cleanly abstract the task of rendering obscure symbols and complex mathematical formulas; and it automates the job of tracking figure, equation and reference citations, to name just a few of the features. Fine, TeX is an ugly language that only a programmer could love, but in over twenty years nobody has yet been able to replace its expressive power and flexibility -- doc generators just compile to LaTeX, and more user-friendly GUIs like LyX and TeXmaker have been written to ease the process of writing for end users, but the underlying format has yet to be replaced. Like Markdown, TeX rides the line between being writable by hand for greater flexibility and being automatically generated for newbies.

Unless you think that reading red text on a blue background set in Papyrus would be an equally enjoyable reading experience, it's more than simple aesthetics.


You sound like aesthetics is a bad thing you need to disavow for some reason.

> Fine, TeX is an ugly language that only a programmer could love, but in over twenty years nobody has yet been able to replace its expressive power and flexibility

Which doesn't make it any less ugly. If we can't cure the common cold yet, that is no reason to praise it and make it sound as if it's a great thing. It is a problem, and so is the need to use an ugly language to produce aesthetically pleasing papers. The fact that we don't have a solution for this problem yet does not turn it into a non-problem. It just means nobody has yet been good enough to solve it.


... did I just read a rant about how enjoyability and simplicity are NOT aesthetics?


Are you related to the author of "The Joy of TeX"?


In my field, equations and diagrams are relevant for 99% of research papers. I would poke my eyes out with a grapefruit spoon if I had to spend the rest of my life trying to do this in Word.


It's a trade-off. Getting proficient with a new subject requires time and mental space that could otherwise be used for other things, usually things much more interesting to you, unless you happen to be excited by doing TeX stuff. That's why one would usually opt for "just do whatever is needed to get it over with".

> It's similar to using emacs or vim. You could say, "Why should programming devotees expect others to adopt those editors?"

Exactly. There were times when I wrote emacs lisp and vim macros, and had time to dive into their arcane mechanics. I don't have that luxury anymore. I need the editor to just work and do what I need to be done, without me investing significant time into it. Maybe if I spent a couple of days learning more about vim I could do in one minute what now takes me ten, but I'm not sure it pays off anymore, and frankly I have more interesting things to learn. I do use vim (not emacs anymore), but I came to the conclusion that the added value of diving into the arcana with the goal of being more efficient later, versus brute-forcing through the immediate tasks, is just not worth it. Later you'll need a different thing than you thought you'd need anyway.

And, note, I am a programmer. I can imagine how much less interested a person would be who does not spend their days digging into code anyway.


At the very least it would make sense to provide the source that produces the document. That could be something like OpenOffice or LaTeX. There are probably some WYSIWYG solutions for LaTeX. If you are referencing a table from another paper and you want to show or alter it, it is a lot quicker if you have the code that renders it. If you are in the business of writing papers that you want people to read and quote, it seems like a good idea to master the tools that help you do that.



LyX is salvation. Edit in WYSIWYG, export to LaTeX or PDF.


You have obviously never used troff then.


Good LaTeX is less about aesthetics and more about having boilerplate code and macros which greatly simplify the process of making readable documents with citations in the correct (nitpicky, archaic, utterly idiosyncratic) format and figures correctly included and labeled and captioned and scaled.

The joy of LaTeX is that it handles the aesthetics for you. You can focus entirely on structure. Or that's how it works when you have good boilerplate to start from.
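For what it's worth, a minimal sketch of the kind of boilerplate I mean (the macro name, figure file and bib key are just placeholders, and the package choices are only one option):

    \documentclass{article}
    \usepackage{amsmath}    % equation environments and \eqref
    \usepackage{graphicx}   % \includegraphics
    \usepackage[style=authoryear]{biblatex}
    \addbibresource{refs.bib}  % refs.bib, growth_curve.pdf and the cite key below are placeholders

    % one macro so every figure is included, scaled, captioned and labeled the same way
    \newcommand{\stdfig}[3]{% arguments: file, caption, label
      \begin{figure}
        \centering
        \includegraphics[width=0.8\linewidth]{#1}
        \caption{#2}\label{fig:#3}
      \end{figure}}

    \begin{document}
    \stdfig{growth_curve}{Growth curves for both strains.}{growth}
    As Figure~\ref{fig:growth} and Eq.~\eqref{eq:logistic} show, growth is
    roughly logistic~\autocite{verhulst1845}; every cross-reference and the
    bibliography are renumbered automatically whenever anything moves.
    \begin{equation}\label{eq:logistic}
      \frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)
    \end{equation}
    \printbibliography
    \end{document}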


I think the problem is that many people who don't use LaTeX (and some who do) are so inept at typography that it actually impedes the reader's ability to use the document.

One LaTeX-agnostic example is when people abuse hyphens-like this-which makes it impossible to parse the sentence without reading it several times.


Well you're onto something there. I liked typography for its own sake long before it became affordable to do it on personal computers, so I suppose I take my competence in that area for granted.


You can now view (& compile) a large selection of latex examples, templates and articles directly in the browser on platforms such as Overleaf[1].

I'm one of the founders, and we're also making it possible to directly submit projects from Overleaf to repositories such as the arXiv and bioRxiv, and to traditional journals, to help speed up the submission and publication process.[2]

[1] https://www.overleaf.com

[2] https://www.overleaf.com/tutorial


And if others could contribute improvements on yours.


FYI, the .tex source is available for all papers uploaded to the arXiv.


No, you can upload just the PDF if you want.


OK, correction: the .tex source is available for almost all papers uploaded to the arXiv (because they give a big warning/plea message to do so).


Also, (experimental) biologists tend not to use TeX/LaTeX. Typically they use Microsoft Word and a proprietary citation manager like Endnote. Typesetting is seen as something the journal deals with, not the author.


I was a biologist, but I used LaTeX: it was important to have parseable references in a flat file.


Many biologists stick to Word files, especially for the review/comment function. It is often because their collaborators stick to Word and you have to go along with it.


Back in the day when I did this, I ended up writing papers with biologists. All the biologists would be cc'd on an email with the doc as an attachment, and a "token" was passed between people to determine doc ownership. The current holder would make their changes, reattach the new doc to the email, and pass the token to the next person.

Ultimately, this would fall over when people accidentally made concurrent changes, and it was hard if not impossible to merge two documents. Doc files were never designed to be put into a version control system with merge semantics. Some people use shared folders with locking, but locking doesn't work well over file shares.

I'm not sure but I suspect now people use dropbox instead of email; it's not clear to me how they manage conflicts and merges.

I'd rather my format be a line-based, non-rendered text format that I can render down if need be, while using the full power of patch and diff to manage conflicts.
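As a toy illustration of why line-based source is so much friendlier to diff/patch/merge than a binary .doc, here is a sketch using Python's standard difflib (the draft text is made up):

    import difflib

    # two revisions of the same paragraph, stored as plain text lines (made-up example text)
    draft_v1 = [
        "We measured expression of TERT in both cell lines.\n",
        "Expression was higher in the treated samples.\n",
    ]
    draft_v2 = [
        "We measured expression of TERT in both cell lines.\n",
        "Expression was significantly higher in the treated samples (p < 0.01).\n",
    ]

    # a unified diff pinpoints exactly which lines a co-author changed, which is
    # precisely what merge tools need; a .doc attachment gives you none of this
    for line in difflib.unified_diff(draft_v1, draft_v2,
                                     fromfile="draft_v1.tex", tofile="draft_v2.tex"):
        print(line, end="")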


I wasn't defending LaTeX; I don't use it any more.

I am certainly aware that many scientists don't want to publish their code out of embarrassment. It was a surprise to me, but I've heard that feedback.

I investigated various alternatives after I found LaTeX too arcane, including DocBook. It's still not entirely clear what solution exists to the collection of problems I have (want to semantically mark up my paper, want it to display nicely regardless of screen, etc).


What about something like markdown plus some syntax for formulas? You can compile the markdown (via a detour to LaTeX perhaps) to something that looks nice.
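Something like this, say: a sketch using pypandoc, a thin wrapper around the pandoc binary (pandoc and a LaTeX engine must be installed; the snippet and file name are just examples):

    import pypandoc

    # example markdown with dollar-sign math (filenames and text are arbitrary)
    md = """
    # Results

    The fit assumes logistic growth, $dN/dt = rN(1 - N/K)$,
    with parameters listed in Table 1.
    """

    # markdown -> LaTeX (the "detour"), which can then be compiled as usual
    print(pypandoc.convert_text(md, "latex", format="md"))

    # or go straight to a PDF via pandoc's own LaTeX pipeline
    pypandoc.convert_text(md, "pdf", format="md", outputfile="results.pdf")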


Yeah, some form of enhanced markdown, with self-rendering papers on github (or whatever-hub or whatever-xiv) seems like a very practical solution. It couples the paper's representation to the version control system in a way that could be convenient and expressive.


Markdown gets very limited very quickly.

Are you aware of pandoc?

http://pandoc.org/


I figure that any markdown dialect enhanced far enough to be usable would be as complex as LaTeX, only without the tooling and pre-written macros.


Enhanced versions of markdown are just turning it back into a full wiki language.


Shame in 'science'? I find that unbelievably absurd.

About the second part, may I rephrase it as: PDFs are the Docker of publication ... (Adobe would probably be very proud of that).


Or, if published, we could develop tools to normalize the data and improve the state of the art.


This is the right approach. Like what OCR does for digitizing volumes of literature.


OCR for historical literature is a really hard problem. Beyond the fact that OCR quality is pretty low, the meaning of the papers isn't addressed by that. So it makes it easier to access (can read papers on the net rather than going to a library) but that's about it.


> it makes it easier to access (can read papers on the net rather than going to a library) but that's about it.

I disagree. The text is indexed and searchable.

> the meaning of the papers isn't addressed by that.

I agree. I wonder if starting with research papers that have lots of symbolic logic (e.g. math proofs) would be the easiest starting point for a system like this.


HA! Not to mention unflattering comments in LaTeX or BibTeX files. You thought source code had expletives... yeah.


Some of my .tex files may or may not contain a lot of profanities.


> Do you realize that many researchers would be ashamed if others saw their latex code?

The exact same thing could be said about HTML, but that hasn't stopped people...


Can you unpack what is meant by "to be parsed semantically by artificial intelligence algorithms"?

I understand the words, but in this context it has me a bit confused as to what problem you see AI solving.


I mean, if somebody includes the name of a gene in their text, the goal is to annotate the gene in the text of the paper (and then use the annotation). It's often nontrivial to extract the meaning from the text without an experienced human reading it. A typical case would be a reference to a gene by name, where the paper authors really meant the transcript or the protein product.

The semantic parsing part is an annotation process. AI algorithms are required to make accurate annotations - a huge amount of context is required.

So if authors instead marked up their papers such that each mentioned entity's semantic meaning was obvious, it would make it much easier to build an AI that scans all papers and generates hypotheses.
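To make that concrete, here is a toy sketch of the difference between guessing from plain text and reading explicit markup (the tiny lexicon and the markup schema are invented for illustration):

    import re

    # what an annotation pipeline has to do today: spot the name, then guess from
    # context whether "TERT" means the gene, the transcript or the protein product
    sentence = "TERT was upregulated after treatment."
    lexicon = {"TERT": "ncbigene:7015"}  # toy lexicon; real ones hold millions of synonyms
    mentions = [(m.group(), lexicon.get(m.group()))
                for m in re.finditer(r"\b[A-Z0-9]{2,}\b", sentence)]
    print(mentions)  # [('TERT', 'ncbigene:7015')] -- but gene, mRNA or protein? Unknown.

    # what explicitly marked-up text could give us instead: no guessing required
    marked_up = {
        "text": sentence,
        "entities": [{"span": [0, 4], "id": "ncbigene:7015", "type": "transcript"}],
    }
    entity = marked_up["entities"][0]
    print(sentence[entity["span"][0]:entity["span"][1]], entity["type"])  # TERT transcript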


I'm thinking of a less ambitious angle for the use of AI, or at least parsing algorithms: cross-referencing commonly researched subjects and methods. If, say, a certain method seems to yield less-than-ideal results, it would be nice to know whether someone else figured out the problem well in advance of any laboratory work. Feeding that sort of information into a computer would be dead simple, since I assume most methods are easily categorized as it is.


No, I don't think it's dead simple to take methods and compare them. The problem is that most of the method details are implicit and leave out a lot of the aspects that are required to replicate a study.


Ah, awesome!! Makes complete sense!

Thank you very much for unpacking it and all the best in your career in Biology! I am in tech now, but studied Biology at Texas A&M for undergrad so hearing the words in your response reminded me of the good ol' days!


What he/she is saying is that the pace of research and publishing is such that it's impossible for researchers and academics to stay current the old-fashioned way (reading, writing and attending conferences). It would be beneficial to have an intelligent bot that could be trained to browse for content of interest, as a mechanism to augment an individual human's capacity to do this manually.


I don't just mean that, although that is part of the goal.

I want AIs to automatically find conflicting papers/hypotheses, and propose experiments that resolve the ambiguities.


But wouldn't a different data structure/database be better suited for this approach than LaTeX? I mean, you can still just babble but style it in LaTeX ... and the AI would have to figure out that you're saying nothing. That would require true AI.

I mean, I don't know much about LaTeX, but I doubt there are elements for "hypothesis", "definition", "exact reference", etc. If you had those, described in a structured, simple language, then I guess it would be much easier for an AI to process that information, since the context would be clear.


True. I work at Google now, and my advice would be to just write standard XHTML and let Google's parsers do their best job at inferring the meaning of the text.


But writing XHTML by hand can be quite a pain ... you would at least need good tools to be efficient ...

Or something Python-like (also supported by an IDE):

    hypothesis:
        blablabla link:"link_to_Element_in_paper"
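Or, spelled out a bit more, a purely hypothetical schema just to show how trivially a program could consume such structure:

    # a paper's claims as structured data rather than free prose (all fields invented)
    paper = {
        "title": "An example paper",
        "hypotheses": [
            {"id": "h1",
             "statement": "Treatment X increases expression of gene Y.",
             "supports": ["fig:3", "doi:10.0000/placeholder"]},
        ],
        "definitions": {"expression": "sec:2.1"},
    }

    # downstream tools (search, contradiction-finding, meta-analysis) can now ask
    # precise questions without any natural-language parsing
    for h in paper["hypotheses"]:
        print(h["id"], "->", h["statement"], "| evidence:", h["supports"])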


Someday I'd love to feed scientific papers into AI algorithms. Watson does that, and Google is starting to. But the file formats are such a mess, especially with equations and figures.


There are entire companies with squadrons of contractor scientists who just read papers and convert them to their ontology/database/analysis engine (Ingenuity, for example).

When I spoke to them they said if they had an AI that could do as good a job as people, they wouldn't need contractors. I think Google's approach would be to contract a bunch of scientists, have them read and interpret the papers, then use that data to train a deep net that could do it more accurately (you need some baseline humans to act as golden standards). This worked well for Google in several publicized examples, such as discriminating house numbers from numbers on cards in Street View imagery.


> I did not appreciate the huge effect that publishing in high-profile journals has on one's career trajectory

Who is it primarily that is concerned with such things? (Curious.)

Is it fellow academics? Or university administrators? Or government agencies that control grants?

Whoever it is, these are the people we need to work on, it seems.


"Is it fellow academics? Or university administrators? Or government agencies that control grants?"

Yes.

Fellow Academics: The people who will be discussing your tenure case, writing letters of support, nominating you for prizes, etc.

University Administrators: The people who are the final word in your tenure case, and who do things like evaluate how well your department/college/etc. is performing.

Grants: Be they your fellow academics in the form of reviewers, program officers, etc., the prestige of your publications will likely matter to them.

Publications are the unit of currency in academia at the moment, and the incentive and evaluation structures at almost every step are oriented around that. Going your own way is a laudable step, perhaps, but largely the luxury of established researchers who have already had their prestige publications, or those willing to take the potential hit to their careers for the principled stand.


Yep: every group cares because others care and because they don't want to make things easier for others later (that would not be fair). Academics care because they need to get grants and promotions, granting agencies value the papers because they need to justify their grants to the public/Congress/academics, etc.

If you could get them all to drop caring at once, that would work. However, if a critical mass of the groups sticks with it, those who try to ignore the high-profile publication wheel will just get squeeeeezed out. Simple game theory.


Exactly. Which is why big labs going pre-print/OA is meaningful, but the heads of those big labs saying "As Chair of my Department, OA publications will be positively considered in tenure, hiring and promotion decisions" would probably be a bigger deal.


I should also mention that my graduate program (top-tier in its field) unofficially required all students to publish in a top-tier journal to be allowed to graduate with a PhD. It was unofficial, but enforced by all the professors/advisors. If you didn't publish in a high enough journal, you'd end up in your 7th or 8th year with a long thesis and have to leave with a master's degree.


If you go the principal investigator route (professor or scientist at national lab), then your boss and granting agencies are the ones that care about your pubs. It's critical to obtaining tenure (if you don't, you're kicked out of the university) and grants.


> it was "obvious" that the role of the web was to expand scientific publishing

Yeah, especially considering the roles CERN and NCSA played. Things might be different if scientists had published on the web immediately instead of staying with their journals.


Might be interested to check out https://science.ai


Latex is a terrible format that most biologists will never learn. HTML is the one uber-standard; it was actually created for that.


Obviously you've never had to typeset non-trivial mathematical expressions.


To keep your tone ... obviously you've never tried selling it to biologists.


I was working in academia back in 2002, and I remember talking about this crazy open access thing, and blogs, and wikis, with the folks on the tenure committee then. I remember thinking how fast this tenure/publishing thing would change in the next few years. And here it is 13 years later and there's a headline about a HANDFUL of biologists going rouge and daring to publish a PREPRINT!? I know there's been quite a bit of progress, but I'm still surprised at just how little things have changed.


3000 manuscripts posted to bioRxiv since 2013. More than a handful.


The HANDFUL is referring to the number of biologists, not the number of manuscripts.

However, compared to the number of potential biologists or manuscripts that could be published, it is just a HANDFUL.


"HANDFUL of biologists going rouge"

Rouge? They're positively crimson, darling.


Why is everybody down-voting this? The unintentional double entendre is pretty hilarious :)

But to stay on topic, I'd say that the impact of blogs and wikis has been felt; it is just up to the gatekeepers, the powers-that-be, to accept this as relevant. Like somebody said here, "publications" and Nature/Science papers shouldn't be nearly as influential (wrt career trajectory) as they were in 1986.


Uh ... isn't this exactly what Tim Berners-Lee meant for scientists to do when he created the world wide web? It's like he handed a machine gun to us cavemen scientists 25 years ago and we've been collectively clubbing him in the head with it ever since.


I just uploaded three articles in the last two weeks to bioRxiv. The papers were previously just sitting in review. I have already received several emails thanking me and informing me that my work is influencing the manuscripts the senders are writing - more citations. Overall it has been an extremely positive experience for me. I don't really see any downsides. So excited for the revolution.


The amazing thing is that arXiv has always had a "quantitative biology" section, and I guess it took a sea change for this to become a thing in bio.


It's a social problem, not a technical one. The main new features bioRxiv has are comments and digital object identifiers (DOIs).


At least originally, I believe bioArxiv's raison d'etre was to give biologists more assistance in producing professionally typeset documents (since far fewer of them know LaTeX than physicists do). Not sure that's still true.


While this is nice in that it sets an example (and provides publicity for bioRxiv), a couple of Nobel laureates posting one of the dozen(s) of papers they publish per year is not what will make the real difference.

The system is aged and inefficient (some would even argue it's rotten), and IMO comprehensive changes are needed. Just as racial or gender discrimination can't be addressed without changing the social rules people live by, the current academic system, which is rather elitist, non-inclusive, discriminatory, and often more biased and less fair than many think, needs to change substantially.

Such change will be aided by important people setting examples (even if they often go back to their old ways). However, more substantial change is needed on multiple levels, most importantly among academic leaders and funding agencies (run by the former), who need to stop looking at who's who and how many Nature/Science/insert-your-fancy-journal papers a person has. For instance, the culture of applying for grant money with work that's half done to maximize one's chances needs to stop, and so should the over-emphasis on impressive and positive results.

Additionally, publishers that exploit everyone need to die out, and as long as these researchers "go rogue" with a single paper (rather than, for instance, committing to publishing 100% preprints and >75% open access), not much will change.


I know the "wisdom of crowds" is passe, but the continued success (all things considered) of Wikipedia and open source software really makes me question the value of quality gatekeepers. I know I'm biased because I work in software and the costs of mediocrity in this industry are less than in others, but I think we could speed up innovation and discovery if we opened up science and made it more publicly accessible and collaborative. At some point the gatekeepers are just protecting their turf and hold back progress.


The crucial difference is that an amateur can provide a meaningful contribution to an open-source software project while the same is unlikely in (modern) science.

You need years of intense studying even to understand the current state of the art in a chosen scientific field.

Experiments require more money too.


As someone who has fallen into a transition from software engineering to science I think I can say this is mostly wrong.

It seems to me that any given field of "science" isn't any harder to drop into than a new area of software engineering. There's a lot to learn, sure, but there's a lot to do as well, and there are areas where a complete rank amateur can make significant contributions.

It's still true that every day I discover something I have no idea at all about. But it seems that this is fairly normal in science - people know a lot about a single specialized area and not much outside it.


> I think I can say this is mostly wrong.

There are 3 sentences in my comment. What specific claim do you consider to be wrong and what is your evidence?


All 3

The crucial difference is that an amateur can provide a meaningful contribution to an open-source software project while the same is unlikely in (modern) science.

http://motherboard.vice.com/read/meet-the-amateur-comet-hunt...

http://io9.gizmodo.com/5841287/the-story-of-the-woman-who-di...

You need years of intense studying even to understand the current state of the art in a chosen scientific field.

From personal experience, this isn't the case. But you'll complain about citations, so: https://www.quantamagazine.org/20160313-mathematicians-disco...

Experiments require more money too.

See the examples provided above.


I disagree. There's no reason that properly educated and briefed volunteers can't do a lot of the same work. Just look at citizen scientists. [0] There have been studies that have shown them to be just as effective as experts. [1]

[0]: https://en.wikipedia.org/wiki/Citizen_science

[1]: http://journals.plos.org/plosone/article?id=10.1371/journal....


The truth lies in between. We're not arguing for faster AND lower quality, mostly for less absurd resistance in the flow of knowledge. Granted, in some domains a paper stuck in a waiting queue means death for people. Let's hope this kicks off discussion and adaptation.


I'm for faster and greater quantity, knowing that will reduce median quality and, I believe, increase average quality.


What's the best way to support and reward these researchers? Something we can do in the next five minutes while they have the reader's attention.


Write a paper that cites them and get it published in Nature. Papers and citations in high-impact journals are what gets scientists a job, grants, and tenure.


It's hard to tell whether this is a joke with a lot of truth, a genuine sideways takedown of the effort being reported in the article, a cynic's lament, or just a straightforward answer to the question of what would be the most helpful (if not most accessible) action for these authors. In any case I find it a tremendously compelling comment. It evokes a lot in a very small footprint.

edit: clarification


Papers in prestigious journals matter, but the location of a citation does not really matter. As long as the source is included in whatever you're using to count citations (usually Google Scholar, Scopus, or Web of Science), a citation in Nature is no more valuable than one in PLOS ONE.

The only wrinkle to this is that citations in more popular venues are more likely to be seen and re-cited.


Retweet? Post their papers to discussion forums, such as HN or others? If you read it and understand it, email them or tweet and give feedback? Simply comment somewhere to let them know you applaud this?


This is good advice worth more than us here might expect. A bit of chatter about a paper is worth a whole lot. In a specialized field, getting 2 or 3 comments back after publication is 'a lot'. Doubling that to 5 or 6 interesting comments doubles the feedback and interest.


The thing I'd want to see most of all? An experiment confirming or refuting the result of a posted paper, and the results of that experiment posted as a pre-print within ~3 months.

I.e., actually getting real science done fast, taking advantage of how the long journal process was bypassed.


Nothing of real importance, but I'm sure they would appreciate a nice email.


Nobel-prize-winning scientists can go rogue. Until these same rogues hire incoming professors based on those candidates' bioRxiv papers, this is a small advance.

This whole thing needs to start at the level of the funding agency, namely NIH. Publishing in a good journal is a prerequisite to getting a grant. Try getting an R01 on a Biorxiv paper. Not gonna happen.


Most of the pre-prints posted on bioRxiv are submitted to and later accepted by a traditional journal. The authors get the best of both worlds—early dissemination of their results and the stamp of approval from traditional journals that many other institutions value.


When a mainstream publication like the New York Times has a positive article about Nobel-prize-winning scientists bypassing the choke-hold of established journals by directly publishing preprints online, you know it's the beginning of the end for the old, bureaucratic way of publishing scientific research.

Awesome.


But when the same article is exclusively citing Twitter as a source, it's far closer to the end of the end of their publication's integrity.


This article is a little bit breathless. In the academic circles I run in (genomics, computational biology, cancer), bioArxiv is not "going rogue". It's becoming pretty common, and will continue to increase in popularity as the FUD surrounding preprints and high-impact journals begins to dissipate, i.e. "Nature won't accept my paper if it's on bioArxiv!" (Yes they will.)


It's certainly fear, uncertainty, and doubt, but usually when people use the term "FUD" they mean to imply that the fear, uncertainty, and doubt are unfounded. But in this case, all three are justified. The article mentions that Nature and Science are open to papers that have been pre-published. But I believe that many biology journals still have a blanket policy of not considering papers that have been pre-published. If that's the case, the working scientist is highly motivated not to pre-publish, since it shuts the door to later (peer-reviewed) publication. Unless, of course, the scientist is 100% sure they can get it published in Science or Nature. And in practice one is almost never sure of this.

I'm always curious to know how physics, as a field, got over this and related humps. Was it just easier to get everyone on board because it's a smaller community? Were the journals not as savvy to the fact that pre-prints are not really in their interest?


The journals don't really care about preprints. Most of their revenue comes from university subscriptions, and so long as they can claim that their paywalled versions of the papers are the final and official ones and the preprints are merely unedited drafts, there is no threat to that revenue stream. (Placating the journals is part of the reason why they're called preprints rather than just papers. On Arxiv, authors will often update their preprint even after the paper is accepted in a journal.)


How do you know this? One reason they're called "preprints" is that they haven't (necessarily) been peer-reviewed...


Almost all the publishers I would consider publishing in now allow pre-prints. A partial list: Nature journals (not just Nature), AAAS (not just Science), PNAS, Springer, Cambridge University Press, Cold Spring Harbor Press, many Oxford University Press journals. Cell Press does not forbid it although they reserve right to make a case-by-case decision.


That's great (seriously), but doesn't that undermine the premise of the original article? I just checked, and American Physiological Society journals still expressly forbid prepublication. So it's still not universal...


I was just responding to your comment, and didn't claim it was universal. But it's many journals, not just a couple. For me, it's almost all journals I would have considered publishing in anyway.


I talk regularly with a doctoral candidate in Chemistry about the different paper cultures. It's amazing how different the disciplines are... her account of chemistry (as I synthesize it) is that it is extremely locked down and research is very much aimed towards getting patents granted. Knowledge sharing with the broader chemistry community does not appear to be a key goal.

It was a huge shock to me, coming from CS: knowledge sharing has such a high value in our community.


Consider, though, that that culture is heavily influenced by the pharma industry (where many PhDs will wind up with productive jobs). Over and over again, the mantra (e.g. [0] [1]) is that "pharma is the one place where IP makes sense", even among IP skeptics.

[0] Richard Posner ("pharmaceuticals are the poster child for the patent system.... Most industries could get along fine without patent protection.") http://www.theatlantic.com/business/archive/2012/07/why-ther...

[1] Notch "I would personally prefer it to have those be government funded (like with CERN or NASA) and patent free as opposed to what’s happening with medicine, but I do understand why some people thin[k] patents are good in these areas." http://notch.tumblr.com/post/27751395263/on-patents


I think another issue wrt biology in particular is that there is a great deal of fear about what laymen might do with the info. I have a genetic disorder and have spent 15 years getting myself healthier. Good faith efforts to share what I have learned along the way consistently result in shit shows and ugly personal attacks. I share a lot less these days.

On the one hand, I have sympathy for the very real concerns people have about bad (potentially harmful) health information going out. On the other hand, I think the primary reason it is dangerous to begin with is a lack of a culture of vigorous discussion. It is only dangerous if it cannot be thoroughly hashed out. Unfortunately, it seems there is no place where that can happen.


There's a concern there, but that doesn't promote understanding and trust, unfortunately. Elitism is an unpleasantly common accusation...


A grad student is really exposed to being "scooped," i.e., to having their project taken up, finished, and published by somebody else. There are even professors who are notorious for this.


> A grad student is really exposed to being "scooped," i.e., to having their project taken up, finished, and published by somebody else. There are even professors who are notorious for this.

I have no words for my disgust at the lack of honesty that scooping would demonstrate.


Or more commonly, someone who was already working in the area will rush to publish their own work to avoid being scooped themselves.


Even cooler, I think, are "working papers". In my arguably limited experience, this seems to be popular mostly in economics. As I understand it, authors are soliciting comment from peers, and thinking becomes long-term collaborative. It's a conversation, not a paper. Maybe scientific research can become even more open and collaborative, using the GitHub model or whatever.


I'm a physicist and we routinely publish on the arxiv. I hope the chemists are next (a number of chemistry journals ban preprints)!!!


Is it just me who thinks that this NYT title reflects a negative view of releasing research results as preprints, by using the phrase "went rogue"?

Speeding up knowledge sharing to solve more problems more quickly is a good thing IMO. As the same article points out, physicists have been releasing research results as preprints since the 1990s.


I agree it's a terrible title, but I think the "went rogue" reflects more the NYT's attempt to conjure up a BuzzFeedy title than anything else.


This needs to happen in more scientific fields, with a field-specific consensus publishing platform where everyone agrees to publish their research to benefit everyone else.


Totally off topic, but here is an anecdote I heard from a professor when he was explaining the review process. One reviewer argued to change

    Figure 5. The statistics ....
to

    Figure (5) The statistics ....
because the reviewer liked the () format better, although (IMO) the new format is really ugly.

Another reviewer saw the comment and a really nasty debate ensued. The paper was published with the original format nonetheless.


That is great (or, to be honest, it's way past time for this; it shouldn't be news). But we need to go beyond that. The PDF format is a relic. We need a platform where scientists can directly edit their articles. Figures should be replaced by interactive visualizations where possible. This would solve the problem of data availability and allow other researchers to have direct access to the data shown in a plot.


HTML?


Of course; I mean an easy way for researchers to write them.


Why not just use arxiv.org?


I heard from Paul Ginsparg that bioArxiv's original raison d'etre was to give biologists more assistance in producing professionally typeset documents, since far fewer of them know LaTeX than physicists do. This was going to be funded by a submission fee of ~$50. That idea was apparently nixed, so the differences from the arXiv are now mostly cosmetic (with the notable exception that bioArxiv has commentary).

Fractionalization is generally undesirable, but it's plausible that bioArxiv can make tweaks that accelerate adoption among biologists compared to arXiv.


Discoverability is a big part of it as well, at least at the user end. I find out about new papers from email alerts of subject matter and tables of contents of the new issue of the journals I keep up with. (Obviously when I need to look into a particular issue I use google scholar).

I have never once found a paper of particular relevance to my research (geology/geophysics) on arXiv. I haven't looked very often, but after striking out several times, why keep trying?

If there was a geoRxiv then I would probably browse it more regularly because the chances of me finding something relevant would be much higher.

For that reason, and another, I kind of disagree that fragmentation (or fractionalization) is undesirable. The second reason, which is not unrelated, is that–at least with peer-reviewed journals–the quality of the work in field- or subfield-specific venues is often far higher than in the sort of pan-scientific journals like Science or Nature. I think a lot of it has to do with the quality of the reviewing/editing, but a lot of it is that the papers have to be written for and justified to a wider audience that wants to be wowed and doesn't know the background well enough to evaluate the science for its own sake.

If I write a paper and send it to Tectonophysics, I know that the readership will understand what I'm doing and why, and I will write the paper accordingly. If I write the paper for Nature, then I have to describe and justify the how and why to a wide range of people, from my peers to journalists for phys.org and the NYT. Sometimes that's fine: If I find out that the Seattle fault is loaded and ready to pop, the press, policy makers and citizens need to know. But if I find out that the stress field on the Seattle fault is largely determined by the topography near the fault and that this has some persistent influence on how an earthquake rupture propagates on the fault (but doesn't necessarily change the seismic hazard), then I don't need to go through the rigmarole of explaining and justifying any of that to anyone who isn't intrinsically interested, and maybe more importantly, I don't have to explain (i.e. gloss over) the subtleties, ambiguities and caveats of the work to an audience that lacks the relevant background. This simply allows me to write a clearer and more honest paper.

This brings up a tangent that is relevant to the broader topic of self-publication: You always need to write to a specific audience, and with a journal you know who that audience is. With a blog or a website, you don't necessarily. That may be fine but it can trip a lot of people up, and make the writing much worse.


The arXiv serves to unbundle the dissemination part of journal publishing from the filtering and certifying part. The arXiv is only trying to disseminate, and it is happy (for now) to let traditional journals and other sources do the filtering and certifying.

Let me reply to your points in particular:

> Discoverability is a big part of it as well, at least at the user end...If there was a geoRxiv then I would probably browse it more regularly because the chances of me finding something relevant would be much higher.

Are you just talking about the individual subject areas (ecology, genetics, etc.)? The arXiv has those as well, and you can subscribe to them. Few physicists subscribe to the entire thing. (Incidentally, folks may find that https://scirate.com/ filters a bit better.)

Of course, the arXiv doesn't have a dedicated biology section, much less sub-divisions, but this is because there hasn't been enough interest.

> The second reason, which is not unrelated, is that–at least with peer-reviewed journals–the quality of the work in field- or subfield-specific venues is often far higher than in the sort of pan-scientific journals like Science or Nature.

The main point of the arxiv is to put everything in one place which is permanent, searchable, sortable, freely available etc. Filters generally come from elsewhere, such as the aforementioned sections or SciRate, or by simply looking at arXiv papers published in certain journals (without needing journal access).

Obviously, in the absence of additional filters, the bioArxiv won't be useful as filter either.

> If I write a paper and send it to Tectonophysics,...

You'll find that there are plenty of popular-level papers on the arXiv sharing space with highly technical ones. This is generally noted in the abstract. While the arXiv is not meant for public consumption, there are plenty of filters that try to pluck out accessible papers, e.g., the Physics ArXiv blog (which isn't as official as it sounds) https://medium.com/the-physics-arxiv-blog


> to give biologists more assistance in producing professionally typeset documents

So why does it seem like Greider's paper received none of that assistance? It's typeset in double-spaced Arial. Both double-spacing and Arial are immediate indications of unprofessional typesetting.


> That idea was apparently nixed


Probably because bioRxiv.org, where these folks submitted their work, will be more widely read by biologists than arxiv.org.


biorxiv reeks of NIH (No pun intended)


And that means?


NIH means "not invented here", referring to the tendency of some engineers to build their own tools from scratch rather than learn or adapt an off-the-shelf tool. Can't comment on the context as I'm not experienced with these different *xives.


National Institutes of Health provides a lot of the funding for biology and medical research.


Hence "(no pun intended)".


arxiv only accepts computational biology papers. Compare the subject areas of biorxiv http://biorxiv.org/ and arxiv http://arxiv.org/archive/q-bio


That's not necessarily an immutable fact of arxiv.org though, otherwise there wouldn't even be computational biology on there.

It's probably more likely due to a lack of demand. There are only about 4k articles on all of biorxiv in any case.


Does arXiv accept biology papers? I think they are exclusively math/physics/CS.


It used to be only particle physics, then only physics, but my impression is they expand to fit whatever. I'm sure they wouldn't mind expanding, but probably some of it is they don't have experts/money to help out with some of the organizational things they typically do for a field.


computational biology, sure


So it was put online without peer review? Papers can always be submitted by the author to PMC using NIHMS if the journal doesn't do it. However the paper must go through a journal because they arbitrate the peer review.


There's a massive project going on in math using GitHub to write an open-source algebraic geometry textbook.

http://stacks.math.columbia.edu


can someone explain this bit:

> If university libraries drop their costly journal subscriptions in favor of free preprints, journals may well withdraw permission to use them

withdraw permission to do what exactly, and enforced how?


Withdraw permission to publish an article in their journal that has already been distributed as a preprint. Many journals in biology do that now -- they consider it publishing an article twice, which is a big no no ("self plagiarism"). This is beginning to change and hopefully preprints will be accepted in biology as they are in other fields.


Times have already changed. Many biology journals and publishers allow pre-prints.

https://news.ycombinator.com/item?id=11293619


It's little appreciated, but journals are in competition with each other for good papers, too. Editors at top journals are out at conferences trying to find the next submissions and handing out business cards. Nature editors are pissed when a hot paper goes to Science instead (savvy PIs exploit this dynamic). In this kind of 'market', a publisher who unilaterally rejects preprinted papers is putting themselves out of the running for those papers.


Preprints are free, but when they are added to an issue of a journal, that issue might not be free. So journal X agrees to free preprints (rough drafts) and then charges for the actual issue. If the library drops its subscription to the journal, maybe the journal/society/publisher pulls the library's permission to use the preprints.


This isn't that innovative. Creationists have been doing this for years.


But for different reasons ... nobody (serious) wants them. If they were in high demand, they would paywall themselves, for sure.


Does somebody know why we are still using PDFs for papers? I know a lot of people who are trying to parse PDF files, and it is an awful process.

If somebody is looking for an idea for a new venture, this is a problem, yet to be solved !


We are using PDFs because they have universal adoption and, importantly, they reliably produce the same document everywhere. Most alternatives you might think of will give variable results on different machines.

There definitely needs to be something that makes it easier to parse and otherwise interact with a PDF. But, for network-effect reasons, it's probably easier to introduce a parseable overlay for PDFs than to replace the format wholesale.
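To illustrate what parsers are up against, a minimal sketch (assuming pdfminer.six is installed; the file name is a placeholder):

    from pdfminer.high_level import extract_text

    # all you get back is a flat stream of characters in layout order: two-column
    # text interleaves, equations collapse into symbol soup, and tables/figures
    # lose their structure entirely
    text = extract_text("some_paper.pdf")  # placeholder file name
    print(text[:500])

A parseable overlay (or just shipping the source alongside the PDF) would carry the structure that this kind of extraction throws away.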


There is a tug of war taking place here. TeX is nice because the documents are formatted for whatever your reading situation happens to be. PDF is formatted for exactly one situation, the A4 sheet you targeted.

But on the flipside, the TeX document will often be ugly no matter how you are reading it, and the author can't easily apply tweaks for aesthetics or readability--many try, hence the TeX markup horrorshow.


Honestly, most of these issues are just a product of .tex's long and storied history. If some foundation plunked down $1-$10 million, it could definitely produce an open source successor to .tex (with extensive, maintained, and documented libraries like Mathematica) that avoids most of the badness.


I disagree. I agree that PDFs are here and will stay for a long time, but the problem could easily be solved by journals. They already make you use their own LaTeX templates. It would be very easy to just force submitters to include the LaTeX source along with the PDF. Sometimes the easy solution is too obvious, I guess.


Well, basically everything on the arXiv has the .tex available, but that doesn't seem to make the problem much better. The problem isn't getting raw access to the text. It's possible to copy-paste from PDFs with some labor too, or to examine their internals (PDF is its own typesetting system, like .tex). The problem is that this data is very difficult to parse.



readcube is a spectacular somersault in the wrong direction. It is so much worse than a PDF - it actively fights me when I try to extract information.


Isn't that the idea? My impression is that it's a format introduced to help publishers like Nature introduce frictions to copying their content.


Wasn't very impressed by it either, but clearly there are a few startups in this area.


If I were an evil journal editor, I would use the metrics on bioRxiv to accept or reject papers. This would make it easy to predict papers' future impact and help you game the impact factor of your journal.


This is actually very clever. With this we will just be moving the goal posts but at least it will prove to scientists that any notoriety acquired in the pre-print stage can lead to later career success.


It is actually quite hard to predict which papers will become hits and which will become misses. If I look back over my own papers, there were a few I knew were going to be (minor) hits, but quite a few of the papers I thought would be of interest were ignored, while others I thought were pretty minor attracted a lot of interest.

One thing I have never understood (apart from editor laziness) is letting the authors write the article title and abstract. So many great papers are overlooked because the authors wrote a boring or misleading title or abstract.


The concerns over "peer review" seem ridiculous to me. The peers who would review it would still have access, and it would open it up to exponentially more people.


What about ODF? The Open Document Foundation has been separate from the content creation applications for a couple of years now.


What's stopping them from writing in HTML with some standard CSS?


MathML's lack of support?


Information wants to be free, BITCH... But seriously, given the increasing "media savviness" of subsequent generations (from Baby Boomers who grew up with TV, to Gen Y for whom the Internet is a given), the general ability across the spectrum of humanity to synthesize disparate information sources and filter them, compare and contrast, and decide what is 'truthy' vs actually true... is increasing. Given all the information scientists have to process... what if machine learning were applied to this problem? The role of traditional gatekeepers is breaking down. I see this in the publishing industry - lots of content, most of the self-published books are awful, but books like "Wool" are able to rise to the top.

At least I hope humanity is getting more sophisticated. What is the median age of Trump supporters, and the one-sigma standard deviation? That would be an interesting statistic.



