Library-managed 'arXiv' spreads scientific advances rapidly and worldwide (cornell.edu)
423 points by tosh on Dec 27, 2016 | 132 comments



Can I just make a general plea?

You should upload your paper to arXiv. When you do, please upload your source (tex, or word I imagine), as well as a PDF.

For the blind, PDF is the worst possible format, and tex and word are the best formats. Don't hide, or lose, the blind-accessible version of your paper.


When I worked at the Cornell Theory Center in the late 90s, I didn't know about the arXiv, but I sure as hell knew about screen readers. One of the best FORTRAN programmers (numerical analysis/applied math research faculty I think) was blind as a bat.

I learned a lot about not underestimating people there.

It's a strange university but it always makes me happy to think that it (specifically Ginsparg, and to some degree Strogatz and the law scholars) pushed forward what the Web was supposed to be.

Not a place to buy stuff, but a place to learn stuff, without waiting for it to make its way through a hyper politicized review process so that it could be printed in a 17th century fashion and mailed to some corners of the world, eventually perhaps reaching a fraction of the people who could use it. Rather, everyone everywhere with a connection.

Anyone who doesn't deposit preprints (arXiv, biorxiv, or wherever) or who doesn't agitate for their coauthors to do so is not really in it for the science. It's fine to be competitive -- deposit yours first. Make the fucking discovery instead of parasitically piggybacking on those who do the work.

But that last part, that's hard. Very hard. As long as there is enough money left in academia to encourage lazy shits, those of us who care about scholarship will have to push, hard, to remove the last refuge of these scoundrels.

You're either on the side of justice -- open data, open formats, open scholarship -- or you are tacitly endorsing Elsevier & Springer, who haven't the slightest problem using crap like incremental JavaScript & mangled PDFs to deny access to scholarship even to those who have paid.

David (blocking on his last name) at CTC made that choice for me. He set an example that forced me to admit what was right. I hope others will do the same. It's the right thing to do.


Open Access and self-publication have removed an important check on research papers. If nobody cared about your subject, or if a journal's peer review process resulted in boring articles, it would not be financially viable.

I think advocates of sharing science should also understand that, at present, the majority of the content is asinine or incorrect. Removing checks and balances is not helping.


The majority of content in most journals is asinine or incorrect. About the only thing that can fix this is quicker turnaround (cf STAP, NgAgo, the godawful GWAS paper in Neuron recently, etc).

It's not at all clear to me that the checks or balances were helping beforehand. Hiring a shit ton of MBA types to administer research probably didn't help much either.

Also, out of curiosity, is this open access / fast review problem what led to machine learning falling into irrelevance?

Because we wouldn't want that to happen to other fields. Can you imagine if raw clinical trial data was out there, for example? (The Horror! Best keep these appallingly expensive results locked up!)


>>fix this is quicker turnaround

Why? As the first author of several publications, more than half the reviews I received were not well thought out. This part of the process often doesn't really happen; Open Access and self-publishing websites exploited that fact for greater throughput.

>>machine learning

I would argue that the machine learning boom in recent years is because of an availability of new technology from industry, supported by industry. This is also why they were able to tolerate less conventional venues for publication - as these were not viewed as mile markers for progress. Yet, when I look at more "settled" machine learning fields like speech, I see the same plethora of academic papers filled with useless crap.

Perhaps the growth of ML has nothing to do with Open Access. There is certainly an underlying profit motive for successful technologies, though: Google ain't going to use a technique that doesn't work.

>>raw clinical trial data

It would be great if it were published, and indeed many people would pay for quality trial data. I see OA and self-publishing of research results as having no relation to authors withholding clinical data.


Re: 1: no shit. The turnaround is for rebuttals. I try to do thorough reviews (I have published plenty as well) but realistically I recognize that no one has the sort of free cycles required to catch everything in advance. What's worse is that an awful lot of reviews are pure political bullshit.

In the end it's the people who may or may not try to build on your work (or mine!) that are the best judges of whether it's BS.

Re: 2: probably, but diffusion of ideas benefits massively from readily accessible documentation. One of the things that will cause me to immediately reject a paper is if its method implementation does not work as described. I will always reject in such cases. They are alarmingly common and this is one of the few easy cases in reviewing. Lack of an implementation is also cause to reject.

Re: 3: it has everything to do with it. Unless you're running and publishing trials please don't lecture me on this. See opentrials for a particularly frightening take on why published trials so poorly represent those submitted to the FDA or registered with CTEP. Meanwhile actual RCTs get buried by shitbags demanding more subjects than there are recorded cases in the past 40 years (no joke).

If you're not doing trials, you may have to take my word for it (or get involved in RCTs and find out for yourself). A lot of the enterprise is actively harmful to science by omission.


This argument applies to all speech that has now become easier and cheaper:

"If nobody cared about your opinion, it would not be financially viable to publish it. But now we have blogs/tweets."

Effectively, the cost of publishing was working as a quality filter. Now, we need to actually implement proper filters instead of bad proxies for quality.


Could you try and make that argument without the term "financially viable"? We're talking about science here. Capitalism is a tiny thing that really isn't required to rule everything, and particularly in this case I don't think it has any authority based on which to implement that "check".

If I just leave that out, you seem to imply science publications should be selected by whether enough people care about them, or whether they are boring or not. Both (especially the latter) are rather bad measuring sticks for the quality and importance of research.

Adding "financially viable" to "less boring" produces "EXCITING!", adding it to "enough people care" yields "POPULAR!". These are definitely not forces that should be pulling scientific research.

Not saying there shouldn't be checks. Just that market forces are probably too stupid for it.


CNS (Cell, Nature, Science) == Buzzfeed science

Although that's not giving BuzzFeed enough credit, they tend to check their sources & verify results/reports. Also if a piece of fake news is simply resubmitted, they don't pretend like they never checked on it before.

There are plenty of checks (mostly from NIH, mostly for incremental garbage work on dead models like cell lines) and very few balances against the tyranny of CNS...


arXiv isn't meant to replace peer review. When I submitted to Phys. Rev. D, a submission to arXiv to allow people to see the work ahead of review was expected.


I'm a (PhD student) mathematician and every paper I've uploaded has included the TeX. I looked at about 10 papers in math at random, and 9/10 of them also included the TeX.

But this isn't something I've looked for in the past. So I wonder: in your experience, about what percentage of papers on the arXiv have included the source?


It's far less common in physics :/


Almost nobody in experimental life sciences uses Tex. My advisor refused to read anything that wasn't in MS Word.


We're working on providing a Word-like interface on top of git repositories for researchers to write their manuscripts in, at authorea.com. Our editor allows researchers to write in markdown, LaTeX, or rich text, all within one article.


> For the blind, PDF is the worst possible format

I'm surprised. Nobody has created an accessibility solution for PDFs after all these years of ubiquity? What's the story?


Well, I don't know about PDF, but PostScript, which it's based on, is basically just a programming language for drawing symbols in specific positions on a page. Depending on how it was written (or more likely, generated), this could be readable or incredibly unreadable.

As an example, in PostScript the following snippet from Wikipedia would simply show the text "Hello world!":

     %!PS  
     /Courier             % name the desired font  
     20 selectfont        % choose the size in points and establish   
                          % the font as the current one  
     72 500 moveto        % position the current point at   
                          % coordinates 72, 500 (the origin is at the 
                          % lower-left corner of the page)  
     (Hello world!) show  % stroke the text in parentheses  
     showpage             % print all on the page
Now if you're lucky you can just extract all quoted text and read those in order, but that's unlikely to work for all documents.
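
To make that concrete, here's a rough Python sketch of that naive approach -- just pulling out the parenthesized strings passed to the show operator. The file name is made up, and real PostScript uses plenty of other text operators, so treat this as an illustration rather than a working extractor:

     import re

     # Naive extraction: grab every "(...) show" pair from a PostScript file.
     # Only a sketch -- it ignores nested parentheses, text drawn with other
     # operators (ashow, widthshow, glyphshow, ...), and, crucially, the order
     # in which the strings actually appear on the page.
     def extract_show_strings(path):
         with open(path, "r", errors="replace") as f:
             ps = f.read()
         return re.findall(r"\(((?:[^()\\]|\\.)*)\)\s*show", ps)

     if __name__ == "__main__":
         for s in extract_show_strings("hello.ps"):  # hypothetical file name
             print(s)

Run on the snippet above it would print "Hello world!", but a generated file may emit glyphs one at a time and in any order, which is exactly the problem.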


Almost all formats would similarly require content to be parsed from code; for example, consider HTML, Word, or Excel.

It makes me wonder how screen readers work. Thinking out loud, it seems that the screen readers should let the applications (e.g., Word) handle their own parsing and presentation and obtain the data after that. Otherwise, the screen reader would have to reinvent many wheels, interpreting the code for all applications including all their versions, features, quirks, and platform integration issues - such a daunting and difficult task that it seems unlikely. But where do screen readers hook into the content? After it's output from the application but before it's an image for the screen (which could require OCR)? Ironically, I suppose PostScript or PDF could provide common interfaces.


Sure, all formats require some parsing to get to the content. However, TeX, HTML, Word, Excel and formats like that were designed to allow people to format text and other data into a document. Therefore they generally make text and other information appear sequentially, and separate display logic from content.

PostScript and PDF have no such separation; they are the display logic. They are fully fledged programs that list the position, size, font, and colour of every symbol on every page -- if you're lucky in a vaguely logical order, but there's no reason it has to be. If the files were written by a human you may have some hope of extracting some of the content, but almost nobody writes PDF or PostScript by hand anymore.


It's not that there's code, it's that Word and HTML are document formats while PostScript/PDF is a vector graphics format. Generally, if you remove the formatting tags from HTML or Word (leaving just the text), the characters are extremely likely to be in the same order as the rendered document (sans things like running heads and page numbers). Furthermore, HTML and Word both explicitly delimit things like paragraphs while PostScript just changes the drawing position.


I think the difference with Word, Excel and TeX is that, because they are editable, the content must be stored in a way where the flow is expressed and the document can be clearly broken into its parts.

With PDFs, all you have is the position of lines and characters on the page. There is no flow, ordering, or semantics.


Well, less can extract the text, so I don't see why that's an issue.


It’s only an issue with crap PDFs (it’s possible to omit text information or obfuscate it to the point you copy & paste garbage out of them; pdfTeX-created PDFs are of course fine).


It often doesn't work for multi-column PDFs or tables, ligatures are usually misparsed, and maths is just destroyed.


It's totally possible (and a relatively frequent occurrence) to have pdfs where the order of characters in the code has no relationship at all to how those same characters are laid out visually on the page. Anything marginally more complex than a series of paragraphs with no formatting at all basically requires you to render out the whole pdf and figure out the order that you are actually supposed to read the characters in.


Yup. For instance, Word's PDF output has an absolutely positioned textbox for every word (and sometimes sub-word). This is for kerning purposes. If you want your original text back, you're going to need some OCR-like preprocessing and heuristics to guess which textboxes belong to the same line. If you have multiple columns, good luck distinguishing them from accidental rivers.

It's not impossible, but I wouldn't know immediately what tools get this most right. And it's always a lossy operation going back and forth.
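
To give a flavour of the heuristics involved, here's a rough sketch using PyMuPDF that groups a page's words into lines by their vertical position. The tolerance value and file name are just guesses for illustration; real extractors do far more (column detection, reading order, hyphenation):

     import fitz  # PyMuPDF

     # Rough sketch: group a page's words into lines by vertical position.
     # Words whose top edges land in the same ~3pt band are treated as one
     # line; this already falls apart on multi-column layouts and tables.
     def rough_lines(path, page_no=0, y_tol=3.0):
         doc = fitz.open(path)
         words = doc[page_no].get_text("words")  # (x0, y0, x1, y1, word, ...)
         bands = {}
         for x0, y0, x1, y1, word, *rest in words:
             bands.setdefault(round(y0 / y_tol), []).append((x0, word))
         # Top-to-bottom bands, left-to-right words within each band.
         return [" ".join(w for _, w in sorted(ws)) for _, ws in sorted(bands.items())]

     if __name__ == "__main__":
         for line in rough_lines("paper.pdf"):  # hypothetical input file
             print(line)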


PDF is only accessible if it is specifically crafted to be so (and then again, I don't think that goes very far). To the best of my knowledge only Adobe's tools and Word actually output that (and at least in the latter they tend to look like train-wrecks, which may or may not be PDF's fault more than Word's).


They probably can get at the text, but I have doubts about any formulae...


It would be great if the arXiv upload screen said this. Not sure if it does or not.


That's certainly been my experience.

I've had arXiv automatically block bare uploads of TeX-derived PDFs (presumably identified through PDF metadata). For these, it required that the source be uploaded and compiled on their server.

On balance, it's probably best to require source as arXiv does, but this can create interesting issues from time to time. Since the source is downloadable, researchers can inadvertently end up sharing partial results or snark that was commented out in the TeX source.
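
On the metadata point: I don't know exactly what arXiv checks, but the Producer/Creator fields usually give TeX-generated PDFs away. A minimal sketch with pypdf (the fields are standard PDF document info; the heuristic and file name are my own guesses, not arXiv's actual check):

     from pypdf import PdfReader

     # Heuristic guess at whether a PDF came out of a TeX toolchain, based on
     # the standard document-information fields. Not what arXiv actually runs,
     # just an illustration of how such a check could work.
     def looks_tex_generated(path):
         meta = PdfReader(path).metadata
         if meta is None:
             return False
         fields = " ".join(str(v) for v in (meta.producer, meta.creator) if v)
         return any(tag in fields for tag in ("TeX", "LaTeX", "dvips", "LuaTeX"))

     if __name__ == "__main__":
         print(looks_tex_generated("paper.pdf"))  # hypothetical file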


I haven't uploaded a paper in a while (coauthors have been doing it more recently), but I'm pretty sure it encourages uploading document sources somewhere during the process. The issue that sibrahim mentions, about stuff in comments, is real: on a multi-author paper the tex file tends to end up pretty messy, with lots of informal remarks and old versions in the comments. I always strip out all the comments before submitting.
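
For what it's worth, here's the kind of crude comment-stripping I mean, as a Python sketch. It only handles unescaped % characters and will happily mangle verbatim environments, so it's a starting point, not a safe tool:

     import re
     import sys

     # Strip LaTeX comments: replace everything from an unescaped % to the end
     # of the line with a bare %, which keeps the end-of-line behaviour intact.
     # Deliberately crude -- it doesn't understand \verb, verbatim environments,
     # or a % right after \\, so always check that the result still compiles.
     def strip_comments(tex):
         return "\n".join(re.sub(r"(?<!\\)%.*", "%", line)
                          for line in tex.splitlines()) + "\n"

     if __name__ == "__main__":
         # Hypothetical usage: python strip_comments.py paper.tex > clean.tex
         with open(sys.argv[1]) as f:
             sys.stdout.write(strip_comments(f.read()))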


For the record, if you upload the TeX, arXiv autogenerates the PDF.


The problem I face is that my papers are split across multiple tex files in several folders, connected to a main.tex using \include. Is there a convenient option that will take care of this issue?


The latexpand Perl script, or the flatten and flatex programs. Please upload your papers in the future! :)

https://tex.stackexchange.com/questions/21838/replace-inputf...
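
If you'd rather not install anything, the same flattening idea fits in a few lines of Python. This sketch only handles simple \input{...} and \include{...} commands, with paths relative to the main file (file names below are hypothetical):

     import os
     import re

     # Recursively inline \input{...} and \include{...} so arXiv gets one flat
     # file. Sketch only: it replaces the whole line containing the command,
     # skips commented-out lines, and assumes paths are relative to main.tex.
     INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")

     def flatten(path, base=None):
         base = base or os.path.dirname(os.path.abspath(path))
         out = []
         with open(path) as f:
             for line in f:
                 m = INCLUDE_RE.search(line)
                 if m and not line.lstrip().startswith("%"):
                     name = m.group(1)
                     if not name.endswith(".tex"):
                         name += ".tex"
                     out.append(flatten(os.path.join(base, name), base))
                 else:
                     out.append(line)
         return "".join(out)

     if __name__ == "__main__":
         with open("main_flat.tex", "w") as f:  # hypothetical file names
             f.write(flatten("main.tex"))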


Setting up a web UI for storing those files for every paper, something like GitLab or Gogs, could really help. It could add more transparency, the ability to give feedback in issues, and so on.


I've had success using \import{subdir/}{BasenameInSubdir} from the import package.

The basic (and unfortunate in an xkcd/1479 way) issue appears to be that the arXiv compile server doesn't allow writing to subdirectories. A subdir \include implicitly requires write access for the aux file.


I know that Word has decent accessibility built in (because Microsoft actually cares about this), but I'm surprised that you're getting mileage out of TeX, which is a very visual format. Do you basically screen-read the source? Or is there a non-visual output for it that works?


Yes, you do have to learn Tex to read it, but I don't know of any other format any blind mathematician uses. It is common to teach maths with latex -- while it isn't perfect by any means, it is better than any alternative.


PDF readers manage to parse text back somewhat effectively, though maybe not on formula- or formatting-heavy PDFs. Anyway, good call; accessibility is not only for mainstream websites. I'm sure the blind dude who aced math classes in college would agree.


Formulas and many tables do very badly.

Would be nice if latex embedded the source of equations into pdfs. Wonder how hard that would be to add?


Embedding anything in PDF doesn't seem hard (considering what a few security talks said).

How large are latex files? Probably not much -- a few hundred KB. It's easy to add a compressed stream to a PDF. It's a great idea.
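
As an illustration, attaching the source as an embedded file is about this much work with pypdf (file names made up; this is just an attachment, not something LaTeX does for you, and PDF viewers vary in whether they expose it):

     from pypdf import PdfReader, PdfWriter

     # Attach the LaTeX source to an existing PDF as an embedded file.
     # Sketch only: nothing in the LaTeX toolchain does this for you here,
     # and PDF viewers differ in how (or whether) they show attachments.
     def attach_source(pdf_in, tex_in, pdf_out):
         writer = PdfWriter()
         writer.append_pages_from_reader(PdfReader(pdf_in))
         with open(tex_in, "rb") as f:
             writer.add_attachment(tex_in, f.read())
         with open(pdf_out, "wb") as f:
             writer.write(f)

     if __name__ == "__main__":
         attach_source("paper.pdf", "paper.tex", "paper_with_source.pdf")  # made-up names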


This article misses one of the biggest value-adds of arXiv, at least in my field (Statistics): since almost everyone posts to arXiv, you can almost always find a free version of a published and potentially pay-walled paper. In the past, publishing in a peer-reviewed journal would (1) improve the paper through peer review, (2) signal the quality of the paper based on the prestige of the journal, and (3) distribute the paper. With arXiv, publishing now only does (1) and (2).


Publishing sometimes does 1), and rarely does 2) (as a statistician you surely know that the relationship between impact factor and retraction is nonlinear and rises in strength as you get into CNS, NEJM, and the like).

I review for others because others have done 1) for me. But I'll never review for Elsevier, and lately I've had the luxury of reviewing for the most cited of open journals (by operating bioRxiv, and accepting direct submissions from it, I claim that Genome Research is "close enough").

It makes me very happy that this is possible (my CV has not suffered for only publishing as first author, and whenever possible as senior or co-senior, in fully open journals). I'm pretty sure this wasn't possible for most people a few short years ago. That engenders optimism about the future of scholarship, for me at least.

Hopefully you as well.


> you can almost always find a free version of a published and potentially pay-walled paper.

In my own research I've used it for exactly this, but since what I've seen were only preprints, I've often wondered about the final version. It looks like I'm not alone.[1] Do many or any of the arXiv papers get updated with the improvements that come from peer review? Is there a need for an arXiv for finals, or do publishers demand exclusives on finals?

[1] http://mathoverflow.net/questions/41141/should-i-not-cite-an...


Publishers (in this subfield at least) usually demand ownership only of the final typeset manuscript PDFs. Those cannot be uploaded, but people are usually free to update the arxiv manuscript by uploading their own "final" version files, with content equivalent to the published one. In the corner where I come from, I'd say this is done most of the time, especially if there are major changes. In practice, people often read only the arxiv versions anyway, since publishers' web pages can be crappy.

Also, since you submit manuscripts to most journals in TeX, there's very little extra work involved in uploading the updated files also to arxiv. You maybe miss the copy editor's grammar corrections etc., but those are almost without exception unimportant --- also, more often than not, the copyediting by the publisher introduces errors not present in the original manuscript.


Agree with everything. I also want to point out that the final published version is not always better -- it represents compromises made with reviewers / editors to get papers through. Often these are positive, but not always. Sometimes it's useful to be able to send people the preprint rather than the final version.


The answers to your questions, unfortunately, are no and yes. Many journals, especially the higher-profile ones, make a big show of being "pre-print friendly" but then explicitly bar you from uploading revised versions to (bio)arXiv. This can be very annoying when the manuscript changes a lot between submission and publication.


This is an excellent point. It makes one wonder why academic journal publishers even need to exist anymore. The peer reviewers (who don't get paid anyway) could just as easily do the same job and issue a "stamp of approval".


There is at least some value in filtering low-quality submissions, wrangling reviewers, and making editorial decisions when reviewers disagree or are just being assholes (e.g. attempting to hinder a competitor).


Yes, but this can all be done without any actual journal, i.e., no physical product, no website (other than the submission link), no typesetting, and most importantly nothing bound by a copyright. This substantially lowers costs. They are called "arXiv overlay journals".

http://quantum-journal.org/announcing-quantum/

http://www.nature.com/news/open-journals-that-piggyback-on-a...

https://www.aps.org/publications/apsnews/201602/arxiv.cfm


Same reason CAs exist for signing certificates: trust.


If the peers were properly authenticated as such, then wouldn't that obviate the need for the journals, if trust is their only value add?


You can be properly authenticated but not authorized, whatever that means in this context (something like not competent?). Trusted journals are trusted for their competence in filtering crap out, not for being able to prove that authors are really authors.


Sure but it feels like there is some close relation there: proving an author is genuine and proving the author is producing genuinely valuable work wrt some given publication's specific audience.

If one were to build a system along that line, meant to replace prestigious academic journals of today, of course it would be gamed. But isn't the general consensus that the current system already is being gamed and usually at expense of the researchers doing valuable research and the public at large?


The academic system of universities, degrees and professorships already provides a pretty elaborate system for "authenticating" academic credentials, so I don't think trust is really that big an issue. The journals don't really add any extra layer of effort to find "trustworthy" referees; they just find academics who are already employed in a given subfield and willing to review papers.

(And in any case, most subfields just have a couple thousand individuals involved at the PI level, who frequently interact at conferences, went to the same schools, shared an advisor, etc. So there doesn't really need to be an elaborate scheme to verify someone's credentials. Chances are two individuals are already aware of each other's reputations, or at least know some third party who is.)


Publication in trusted journals is a major component of getting hired and tenured at universities. There should be less emphasis on this, but it's an attractive option for review boards working outside the field of the person under review.


Ah, the original http://xxx.lanl.gov/ that I knew and loved in the 90's, when people thought we were surfing nudies in the Physics department and not papers on differential geometry. I helped establish and run the za.arxiv.org mirror at WITS University, mostly to learn how to configure RedHat, Apache, rsync and other tools. I'm glad it still exists.


I've worked at Los Alamos for 6 years and I didn't know this existed. Pretty cool.


ArXiv is incredibly useful for research, but I think people also use it for a sort of "I posted it to arXiv first, therefore I solved it first" kind of thing, which imo can be misleading at times, if not everyone follows that. Also there is the eprint.iacr.org which seems to do the same thing, except for cryptography (or is it cryptology?), so I'm not sure if every important preprint in that topic gets to arXiv.


> I think people also use it for a sort of "I posted it to arXiv first, therefore I solved it first" kind of thing, which imo can be misleading at times

True, but I don't see that fights over precedence are unique to ArXiv either, or even made worse by it, no? I mean, at least now there is an unambiguous date-stamped public place to cite in this kind of fight. And those fights provide a built-in incentive to put stuff up there, which is good for all of us.

Basically: who cares about spitballs as long as the papers end up on ArXiv? Seems like a cost worth paying to me.


> True, but I don't see that fights over precedence are unique to ArXiv either, or even made worse by it, no?

I'm not familiar enough with other methods of preprint publishing besides arXiv / eprint.iacr, but you may be right that it is not unique to arXiv.

My personal preference would be to have bits of research done through something like git, so that work along the way can be seen; otherwise one may solve a problem and then be 'out-arXived' by someone who spends an all-nighter tex-ing your solution (this is a hyperbolic example, but I think the idea of the potential flaw in the system should be clear).


Besides disputes involving patents, papers that are within a short time frame of each other are usually understood to be cases of parallel invention. It's happening quite frequently in deep learning at the moment, since there are still a lot of relatively low-hanging ideas, to the extent that people comment/joke that they consider the risk of colliding with someone else when deciding what to work on.


> within a short time frame of each other are usually understood to be cases of parallel invention

I see what you are saying, but I don't think it's that cut and dried; otherwise I could just take someone else's work from yesterday (or whatever a short time frame is), re-solve it (easily, since now the tricky parts have been revealed) and post it today - tada, I parallel-invented it!


Typically the work in a paper, if substantial, is done over a long time, so even if the main destination ends up being the same, it's unlikely the route and side stops are the same. So often you can wriggle a little bit and expand the paper sideways, so that it is still publishable work even if the other work is given priority.

It's actually not that rare to have similar papers appear on arXiv one or two weeks after you submit --- it happened to me several times within the last few years. In these cases, it is possible to see that the approach differs enough (and moreover, often you know the people in question, or you know someone who does).


There's also http://eccc.hpi-web.de/reports/menu/ for complexity theory.


I hope it is replaced with something better soon. You cannot see access statistics for the papers you upload, and they provide this absurd reason for not doing it: https://arxiv.org/help/faq/statfaq (it seems they think arxiv users are idiots or something, so they have to take care of us). Also, getting the uploaded latex files to compile without errors is a pain, and they don't let you just upload the pdf (this has pros and cons, but I wish there was the freedom to choose... and I guess that 99.999% of the time people just download the pdf).


After reading your comment, I was inclined to agree with you about the statistics. After reading their FAQ, I was convinced to side with them.

Their point is that the stats are garbage-level useless. And I can imagine people bragging elsewhere that their paper received X,000 hits when in reality it's all spam or bots. It's not arxiv's responsibility to monitor that, but it wouldn't feel good to facilitate that kind of disinformation or invite hit inflation. Especially as scientists, we want to either publish good data or no data, not data that we know to be garbage.


As a scientist, give me the data and I will know what to do with it. AFAIK http://biorxiv.org/ provides some statistics and it does not seem to be a problem.


As a fellow scientist, I'm much more concerned with how others will interpret these access data. I'm not excited about the prospect of yet another unreliable signal for e.g. hiring committees to latch onto, as they often do with journal impact factors and such.

It might be nice if ArXiv would perhaps provide the data to researchers on request. Just curious -- what kinds of questions would you use this data to answer?


I want the data for the same reasons that any content producer on the Internet wants it. Bloggers, YouTubers, any company... everyone. Despite the noise this data might contain, it seems it's useful for everyone except scientists... to whom, I am surprised to hear, it's better not to give the data in case they misinterpret it. Very risky statement and precedent.


I didn't mean to imply that the data wouldn't be useful, I was more asking to see if you had any specific questions in mind that this data could shed some light on. Relating download rates to citation is the first thing that comes to mind for me, though honestly I'd be much more interested in analyzing the full citation graph for my field, which generally doesn't post papers to the ArXiv.

It's not that I am personally concerned with misinterpreting the data. I just think there could be some downsides to releasing the data without limiting access in some way. For one, I think there are already issues with how citation metrics are used and interpreted, for example in tenure evaluations. I don't think it would be a step in the right direction if this data were used towards the same end...


Not providing raw download counts seems like a good thing; it's strongly privacy preserving.

On the other hand, perhaps a way for registered users to star papers that they like (similar to how Github lets you star projects) might be a good thing. It serves much the same purpose as a rough measure of popularity, but is entirely voluntary.


What's the privacy advantage of not providing anonymous download counts?


Requiring error-free latex is almost certainly a reasonable proxy for real curation effort.


The issue is that their LaTeX installation is fairly old, so there's a real chance of running into old bugs that have long since been fixed. It's a bit tiresome to work around those. I've had issues with their pgfplots version and had to resort to compiling the figures to pdf locally and including those.


Nah, I mean that it is a pain to upload error-free files. Due to dependencies on libraries and other reasons, a file that compiles on your computer often fails to compile on the arxiv.


There is one HUGE reason for not using PDFs -- PDFs are very blind-inaccessible, whereas tex is perfect.

For that reason alone, arXiv is really helping the blind community in academia.

EDIT: Add missing 'not' :)


> There is one HUGE reason for using PDFs -- PDFs are very blind-inaccessible

I think you may have mistyped this. ;-)


Needs a [2012]


Interesting to learn how to pronounce it correctly. I've always just said arx-iv like it's spelled.


It is pronounced like it's spelled; the X is a chi.


Is it spelt that way, though?

X is LATIN CAPITAL LETTER X

Χ is GREEK CAPITAL LETTER CHI

Everywhere I have ever seen it spelt, it has been written with the Latin letter. I have never seen it spelt as "arχiv" or "arΧiv", only ever as "arXiv".

I can understand why, for example, having a Greek letter in the URL would be undesirable, but if one is going to consider that letter to be Chi, then that should be the authoritative spelling and it should actually be spelt that way where possible.


Given the "Archive" homophone, it is obviously intended to be pronounced that way. The spelling is secondary to the common pronunciation, as usual in English.


Given that the arXiv predates widespread Unicode adoption, I think we can forgive the faux pas of using a Latin X to represent a Greek χ.


Excellent points. Indeed, when I visit the page, the title is spelled "arΧiv", not "arχiv".


Except the chi-derived 'X' in English is pronounced /ks/. As is true in Latin, from which our word 'archive' comes. If you really wanted to use a Greek chi, it should have been αρχείο.org...


Oh no, we can't have luddites running around saying "ah-pex-io dot org"


Is there any reason why a project like this wouldn't be open sourced?

Follow-up question: how does a site like this have a $500k annual budget? I was napkin-calculating the costs of running this and couldn't get anywhere close to $500k without extensive staff salaries.


Looking at it from a Cornell point-of-view, the most innocuous reason I can think of is that they want a canonical library of papers that others can mirror rather than researchers having to search each individual university's arXiv. If they let others fork and set up their own servers it could lead to interesting modifications/applications but it would no longer be in their control and might make the preprint locations fragmented. (and the other servers might not have the same moderating standards)

The other more greedy explanation is always money. Of course open source isn't antithetical to profit, but as mentioned before you do lose control and maybe Cornell doesn't want competition. Even if the project was started with the best of intentions, they still need to make it self-sufficient and maybe even profitable so they probably decided it's in their best interest. Of course this is all just me speculating.


It must be mostly salary and maybe a small fraction bandwidth. Hardware costs must be in the noise. For the salary, don't forget university overhead. 200K alone might be going to support Ginsparg. A software developer + sysadmin could be at a similar rate, again with overhead included. Praise be the bureaucracy, and give unto it its tithe.


Here's a recent lengthy FAQ Ginsparg did on the arXiv (ironically behind a journal paywall).

http://onlinelibrary.wiley.com/doi/10.15252/embj.201695531/f...

Here's a discussion on HN of a blog post by me sparked by a conversation with Ginsparg.

https://news.ycombinator.com/item?id=9415985


Am I the only one who still uses xxx.lanl.gov?


Probably not. :) (Is it a redirect now, or is it an actual mirror?)

My understanding is that they switched to the new domain after people noticed the original was being blocked as porn by a bunch of automatic content filters.


Yes, they put the new masthead on instead of the skull 'n' crossbones.


There's something I've always wondered about... what do you do if you upload your journal submission to arxiv but it's later rejected? That possibility has always been a deterrent to submitting to arxiv for me. Seems to me this discussion assumes arxiv uploads will be accepted to some journal eventually...


Hooray for arxiv :)

Long live open science!


>Eleven years ago Ginsparg joined the Cornell faculty, bringing what is now known as arXiv.org with him. (Pronounce it "archive." The X represents the Greek letter chi.)

Been pronouncing it "ar ziv" until now. :P


I still like referring to it as triple-x from when it was http://xxx.lanl.gov


It may be interesting to note that the ancient Greek 'X' is supposed to be aspirated, so the English pronunciation of 'chi' is almost certainly incorrect anyway.


But, "archive dot org" is an entirely different and also noteworthy organization!


I think that's "ark ive" whereas this is "ar chive". I'm gonna call it ar chive dot org now and get more stupid looks, aren't I? :/


But I bet not faster than Sci-Hub... har har har. :D


Significantly faster than sci-hub. Sci-hub is, afaik, based on published work. This is preprints, so well in advance of that.


Well, one can argue here that a pre-print is not of the same quality as a published work. Not that published works cannot have errors (and be retracted), but at least they have passed the first-level scrutiny of an editorial board.


Yes, but passing through the editorial board takes time. In some fields research moves extremely fast, and by the time a paper is properly published it's already obsolete.


That's the price of scrutiny, man. Time. And these "extremely fast fields" are the same fields that need to backtrack and retract their claims. The loss of time, effort, and resources for other scientists if they use pre-printed claims and data from faulty publications can be really enormous, and it is probably already wildly underestimated.


The flow is significantly slower than is required for checking work. I'm fairly sure my wife's papers didn't really take nearly two years to review.

Machine learning is a field that's moving quickly and can also largely be checked quickly by other people in the field. The problems that can't be checked easily are also those that won't be checked by the journals anyway (no journal I know of will retrain one of Google's deep nets to see if they get the same result before publishing).


If you think peer-reviewing is slow then it's probably because you don't get the PEER part of the reviewing process. The PEERS are simply colleagues who have their own lives, priorities, and research. They don't stop what they are doing to review your wife's papers.

And please, I see what's happening here: I have been portrayed as the "traditional journal apologist", although that's definitely not what I wanted here. I simply wanted to express my doubts that pre-printing is the answer to it all.

Anyway, is there a study that shows how many pre-printed articles in ML have been retracted or refuted so far? Until this is done, and shown as small as in peer-reviewed journals, keep a small basket.


> If you think peer-reviewing is slow then it's probably because you don't get the PEER part of the reviewing process.

I understand the process fully. It doesn't make it fast, nor does it make it a necessary cost to pay. Delaying access to content for several years does not solve a problem. A not-insignificant time was spent bouncing between people to sort out who was paying for the costs, then there is also a delay between acceptance and publication. This now averages just a month in pubmed, but papers can bounce around this point for a lot longer.

I am not arguing for pre-prints to replace traditional publishing, but the speed of spreading information is undeniably faster, and that's what started this whole chain of comments.

> Anyway, is there a study that shows how many pre-printed articles in ML have been retracted or refuted so far? Until this is done, and shown as small as in peer-reviewed journals, keep a small basket.

I don't get the phrase "keep a small basket", but no I've not seen this. The point is simply that PEERS (if we need to shout the word) can in many cases replicate the work and assess the results much more quickly than the traditional review & edit process. I picked the field because I see people re-implement work described in the papers very quickly, or the original authors share the trained models and code.

I'd also caution against using "retracted" papers as a measure, some journals charge for a retraction.


[flagged]


> You don't understand a thing and that was the proof of it, at least for me. The process of peers evaluating a paper takes a lot of time because it cannot be automated and is serious, especially for the better journals. Of course bouncing people (referees) is part of the process, to find the better and/or most available one.

The time of the actual reviewing does not alter either of the two other time sinks that I posted. The median post-acceptance-to-publication time for the Journal of Clinical Neuroscience is over three months, and for other journals it heads over a year. [0]

> Of course you can cut corners and pre-print,

Preprints are not an alternative to publication. They are something you can do before publication. Hence the name.

> Pre-printing might be the case in fields like CS and its subfields where verification is a very quick thing, but totally unsuitable in fields such as biology or medicine

There is absolutely nothing unsuitable about releasing your work early in any field. There is a problem in assuming un-vetted work is vetted, but preprints don't make any claim to have been vetted.

[0] http://www.nature.com/news/long-wait-for-publication-plagues...


"The time of the actual reviewing does not alter either of the two other time sinks that I posted. The median post acceptance to publication time for the journal of clinical neuroscience is over three months, and other journals head over a year."

Hey, you started by claiming two years, now it's three months. Three months is perfectly acceptable for quality peer-reviewing. For good papers from experienced authors (who know what criticism to expect) this wait might even be shorter. High-quality publishing demands this.

"Preprints are not an alternative to publication. They are something you can do before publication. Hence the name."

Yeah, but the pre-print paper is almost never retracted if it has been completely revamped for the final publication, after corrections through peer-reviewing.

This does a disservice to science, because in the meantime many scientists might have used the wrong data/methods found in the pre-printed version. That's the price of speed publishing. In some fields (biology, medicine) that price is very high.

"There is absolutely nothing unsuitable about releasing your work early in any field. There is a problem in assuming un-vetted work is vetted, but preprints don't make any claim to have been vetted."

Here we go again. I never said it was unsuitable. I, myself, sometimes attempt to pre-print. I just try to explain to you that pre-prints are not the silver bullets you imagine for high-quality spreading of scientific advances.

Yes, they spread fast and with no control, but the quality leaves much to be desired. That's why it is probably prudent, as readers, to accept pre-prints only from established scientists who are known for their work ethos (by past results) and who will most probably submit their work to peer-reviewed journals as well.

Last but not least, everyone should be cautious about what he or she reads in pre-prints, especially in "slow" fields.


> Hey, you started by claiming two years, now it's three months. Three months is perfectly acceptable for quality peer-reviewing. For good papers from experienced authors (who know what criticism to expect) this wait might even be shorter. High-quality publishing demands this.

You're not interpreting this correctly. This is not the time for review, this is once the paper has been reviewed and accepted. The journal has agreed to publish the paper, but there is still a significant delay before anyone actually gets to read it. This is why I'm saying it doesn't take two years to review the papers. It won't have done, but it still took that long between submission and the time others could actually read it.


This happens because you forget there is a queue, a pipeline so to speak. Papers pile up, especially for highly desirable journals that command many eyeballs.

If you do not want that, choose another journal. Most new ones can guarantee a maximum turnaround time. All things considered, things are getting better here, but you still get that fundamental step of peer-reviewing, which is the main difference between traditional journals and pre-print servers. In the latter you are essentially alone.


That's not necessarily true across all "fast fields". There's definitely selection pressure against "moving fast" in fields sensitive to faulty experimentation. Computer Science is one of the bigger subjects within arXiv. Retraction rates within this discipline are extremely low and the field as a whole moves very rapidly. Having access to preprints makes it much more scalable for researchers to stay at the cutting edge.


Look, you are free to abstain from using pre-printed articles for your research if you do have doubts. Why start a crusade though?

ArXiv is working very well for a lot of scientists in Machine Learning, since most results can be double-checked by simply running the code. I won't disregard a proven approach that I've seen working myself just because it hasn't been published yet.


> In some fields research moves extremely fast and by the time papers is properly published it's already obsolete.

Could you provide some examples for a layman? To me it always seems that science moves far slower than one could hope for (e.g. every 'breakthrough' in batteries/graphene of the last ten years, and still no products).


Think of it in terms of the speed of the field relative to the size of the paper - you aren't going to see no-graphene to industrial-graphene in a month, but you might see Johnson's 8% graphene be replaced by Smith's 8.01% graphene.


Hmm, actually in my experience the versions of papers uploaded to preprint servers are the final versions, including reviewers' comments. This is in cryptography, but I think the same can be said for complexity theory.


It is pretty controversial. The value of peer review has never really been demonstrated and despite peer review, the median scientific paper is wrong.


> the median scientific paper is wrong

What does "wrong" mean in this context? Does it mean that most scientific papers will be shown to be inaccurate in the due course of time? That is bound to happen given the nature of scientific progress.

But peer review aims to assess the methodology and rigour of the papers being presented. I agree that it is often debatable whether this happens in reality, but that doesn't explain your statement I quoted above.


How do you want this value to be demonstrated? Peer-reviewing is exactly this: peers evaluating the methodology and results of your experiment or idea. If this is faulty, imagine how much faultier an attempt to pre-print is without even passing that step, which, mind you, is pretty important for major journals such as Nature or Science.

Peer-reviewing has a lot of weak points, but saying that pre-print is the answer to them all is plainly wrong.


It doesn't seem that anyone is saying that we need to get rid of the (admittedly valuable and proven) peer-review model of publishing in journals. Or that this is uniformly better for all use-cases, esp. those that peer-review excels at.

It's merely a supplement to peer-reviewed journals that has some nice characteristics, for some use-cases, which has been beneficial, to some researchers, in some fields.


>the median scientific paper is wrong.

Are you being literal about that or is that just a figure of speech?

Could you please provide a source for those claims, if true?


I assume he's alluding to Ioannidis's 2005 paper "Why Most Published Research Findings Are False"[0] and/or the replication crisis[1] in general.

[0] http://journals.plos.org/plosmedicine/article?id=10.1371/jou...

[1] http://www.nature.com/news/1-500-scientists-lift-the-lid-on-...


Partly that, and also based on my experience getting a PhD in physics, where I found: (i) published physics papers with 30 pages of math in them frequently had errors, (ii) if you have a large number of signs the odds are 50-50 you will get the sign right in a long calculation, and (iii) you can't say high energy physics papers are "right" or "wrong" anyway... I mean, does anybody think we really live in an anti-de Sitter space?

Medicine on the other hand has the problem that you can't really afford large enough sample sizes to have sufficient statistical power.


>"Medicine on the other hand has the problem that you can't really afford large enough sample sizes to have sufficient statistical power."

This really isn't the major problem w/ modern medical research. In fact, if they had properly powered studies there would be far too many "discoveries" and the real problem would become obvious.

The real issue is that the efforts to come up with and study models capable of precise predictions (eg Armitage-Doll, SIR, Frank-Starling, Hodgekin-Huxly) have been all but choked out in favor of people testing the vague hypothesis "there is a correlation". There is always some effect/correlation in systems like the human body, so it is only a matter of sample size. As explained long ago by Paul Meehl, this is a 180 degree about-face from what was previously called the scientific method: http://www.fisme.science.uu.nl/staff/christianb/downloads/me...


I hate arXiv, I can never figure out where the PDF is, if there is a PDF... long live eprint.


I'm not sure whether you're serious, but on any article's page there is a "Download" section with a link to the PDF (labelled "PDF").


Not always.


Can you link three examples?


Here's an example: https://arxiv.org/abs/1611.06999

I believe the PDF link is only removed after a paper has been withdrawn. But if you click on v1, you can still access the original paper (incorrect, in this case).


v2 is a 0kB file. As you note, v1 is still available, just as with any other paper that was updated.

In any case I don't think that withdrawn papers support the OP's claim that PDF links are hard to find for normal papers.


ePrint indeed offers a much nicer interface. However, I wish they didn't discard prior versions when people revise papers.



