This is one of the things most scientists envy physicists for. I would estimate 95% of those papers are copyrighted, and yet, since we have had this structure for so long, no publisher tries to pursue us for sharing our work with the public for free.
I don't know about other fields of physics, but in astro, most of the data is free access as well. I personally work only with public data and I'm paid to do it. A string attached to governmental funding from the Euro or NSF is usually a mandated free access database.
Sometimes I take for granted the fact that my morning ritual involves reading every publication in my field from the day before, without a license. And then I download some free data, program in my free languages, write in my free LaTeX editor, and then publish my work for free in a place where anyone can read it. It's utopian.
Side note: My dad (RIP. Princeton PhD high energy physics working at UCSD as a professor/researcher.) lived in the high energy realm for decades. He worked on every major particle accelerator known and some unknown.
True story. He had a hobby going in a public storage unit with a surplus military linear accelerator. Smallish. About 30 feet long. Of course it required huge amounts of power so he cut a hole in the unit and ran a line to the nearest pole and siphoned 480 mains volts. And the gamma radiation was very dangerous so he hauled in several tons of lead destined for EPA long term sequestering. We worked one summer building shielding walls and measuring the operational radiation. After the unit was 'safely' running, we would take various pieces of thrown away Lucite from the physics machine shop and turn them into polished beam trees (Google it). We then gave them away for Christmas gifts. What fun for a 10 year old kid!
Thanks. An addendum to his life: Apparently my dad led an early experiment at Fermilab that discovered scaling violations, which led to QCD. It was a while until it was officially confirmed and published. He also worked with a physicist named Masek on a Stanford SPEAR experiment which discovered a new quark/anti-quark. Neither got recognition, which is how the good old boys network functions in basic research.
You should really do some digging and do a proper write up, it would make one hell of a story and I'm sure your dad would approve. So many people working hard without any recognition but that's no reason not to illuminate that a bit.
He's gone. And I only have the anecdotal info from colleagues he worked with. MANY scientists are swept under the carpet due to the recognition power play that occurs at the high institutional echelons. I can't prove that either. My dad was a modest man and never extolled his past successes. So, I'll never really know.
Looked up Masek. According to his obituary in Physics Today [1], his team at SPEAR "discovered new bound states of a charm and anti-charm quark." Important work, but not a new quark/anti-quark.
Amazing. This got me thinking about citation counts. The most cited paper in Computer Science of all time is Vapnik's Statistical Learning Theory (1998) with about 10k citations. The most cited paper of any kind of all time is "Protein measurement with the Folin phenol reagent" by Lowry et al. (1951) with > 300k citations. There's a big time gap here, but not big enough to make up for a > 290k difference in citations. I always thought that CS was one of the more prolific paper-writing communities; clearly that's not the case.
PS. I'm not sure which paper on the arXiv has the greatest number of citations. I don't think either of these papers is there.
That line of reasoning isn't convincing to me (though I have no data to confirm or deny it); CS could still be a more prolific paper-writing community, just one where papers compete more for citations. In CS if I want to cite something, I have my choice of 10 papers from the same era from roughly the same group of people saying roughly the same thing (some might prefer to cite earlier papers, as the "original", others might prefer later papers as the ideas are more clarified). In other fields, there might be one "standard" paper to cite for an era/topic/group.
Couldn't you say the same about other fields? Although I understand where you are coming from. If I wanted to make a more accurate comparison it'd only be fair to examine the distributions of different fields as well as their top performers, but I still think that is too huge a gap to make up for anything besides some heavily skewed distributions.
In CS it's customary to stop citing papers at some point. E.g., lots of papers are published about Turing machines without citing Turing.
Also, absolute limitations on page count are really common in CS, and the page counts tend to be pretty low. In other areas, journals might allow for more citations or citations might not count toward page count.
> Have there been any significant CS papers published in the last ~5 years that aren't on Arxiv?
Even if not, there might be insignificant CS papers not indexed by Arxiv which cite significant papers which are indexed ;) This makes the citation counts comparatively lower if most insignificant physics papers are in Arxiv.
That said, it doesn't surprise me much that worldwide there are still more people working in physics, biology or mathematics than in CS.
It is true that CS is more conference-oriented; however, most top conferences require that a paper be submitted, reviewed, and (if accepted) published in the conference proceedings before you can present your work there. This does vary by discipline though: algs/theory is more traditionally journal-oriented, I believe.
My initial estimation of 10k was from a CiteSeer list that I didn't realize was limited to only documents in the CiteSeer database: http://citeseer.ist.psu.edu/stats/articles
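To make the indexing point concrete, here is a minimal sketch of how a citation count shrinks when only citing papers inside one database are counted. The paper IDs, the `references` mapping and the `citation_counts` helper are all made up for illustration; this is not how CiteSeer or any other index actually computes its numbers.

```python
from collections import Counter

def citation_counts(references, indexed=None):
    """Count how often each paper is cited.

    references: dict mapping a citing paper ID to the list of paper IDs it cites.
    indexed: optional set of paper IDs held by some database; when given,
             only citations coming from papers in that set are counted.
    """
    counts = Counter()
    for citing, cited_list in references.items():
        if indexed is not None and citing not in indexed:
            continue  # the database never sees this citing paper
        for cited in set(cited_list):  # count each citing paper once per target
            counts[cited] += 1
    return counts

# Hypothetical toy data: four papers cite X, but only two of them are indexed.
references = {"A": ["X"], "B": ["X"], "C": ["X", "A"], "D": ["X"]}
indexed = {"A", "B", "X"}

full = citation_counts(references)
restricted = citation_counts(references, indexed)
print(full["X"], restricted["X"])
```

The same paper gets a lower count from the restricted view, so counts taken from different databases aren't directly comparable unless their coverage is similar.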
Wow awesome! Only a couple of weeks ago I tweeted about the idea of building a genealogy tree by walking along a graph generated by arXiv. This is a really neat visualization. Is the codebase open-source?
Any ideas how 'content discovery' could work (or be improved) with research papers? What is the current standard, just the keywords/topics/authors, or is there something else?
Content discovery does work using citations. How it can be meaningfully improved, I don't know. Often, the missing piece will come from a completely different discipline. I don't see how this gap could be bridged using only citations, unfortunately.
That's the thing. I understand that citations are good enough when you know what you're looking for (at least from my perspective), but imo there's no good solution to finding seemingly unrelated paper/research that could be 'the missing piece', hence the question about 'content discovery' :).
This seems to be quite hampered by only including references that are found in the arXiv. Two of my papers from grad school are surrounded by papers that have very little to do with them. My three papers that are on the arXiv are very spread out, with the distance between two of them being ~90% of the map height and the third somewhere in the middle. They are all in physics, but very focused on experiment and apparatus. I think that physics theory (vs. experiment) is over-represented on the arXiv and connections to theory papers are much more influential on the map. It would be interesting to redo this with a database like http://adsabs.harvard.edu/, which doesn't depend on author self-selection.
I was thinking while navigating this that, if I were researching something related to physics, etc., this would be much better than using a search engine, because you might not know exactly what you want to look for until you see it.
> In laying out the map, an N-body algorithm is run to determine positions based on references between the papers. There are two “forces” involved in the N-body calculation: each paper is repelled from all other papers using an anti-gravity inverse-distance force, and each paper is attracted to all of its references using a spring modelled by Hooke’s law.
However, it must have taken a while for the layout to converge with 10^6 particles.
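For anyone curious what the two forces in that quote look like in practice, here is a minimal sketch using NumPy. It is only an illustration of the description above, not the site's actual code: the constants (`k_repel`, `k_spring`, `rest_len`, `dt`), the function name `layout_step` and the simple "move each point along its net force" update are all my assumptions.

```python
import numpy as np

def layout_step(pos, edges, k_repel=1.0, k_spring=0.1, rest_len=1.0,
                dt=0.05, eps=1e-9):
    """Advance a 2D layout by one small step.

    pos:   (n, 2) float array of paper positions.
    edges: list of (i, j) pairs meaning paper i references paper j.
    All constants are illustrative assumptions.
    """
    # Pairwise displacements: diff[i, j] = pos[i] - pos[j]
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) + eps
    np.fill_diagonal(dist, np.inf)  # a paper does not repel itself

    # Inverse-distance repulsion: magnitude k_repel / d along the unit
    # vector diff / d, i.e. k_repel * diff / d**2, summed over other papers.
    force = (k_repel * diff / dist[..., None] ** 2).sum(axis=1)

    # Hooke's-law spring along each reference: pulls the pair together
    # when they are farther apart than rest_len.
    for i, j in edges:
        d = pos[j] - pos[i]
        length = np.linalg.norm(d) + eps
        f = k_spring * (length - rest_len) * d / length
        force[i] += f
        force[j] -= f

    # Overdamped update (assumption): move directly along the net force.
    return pos + dt * force

# Two papers far apart, linked by a reference: the spring wins, they approach.
linked = layout_step(np.array([[0.0, 0.0], [10.0, 0.0]]), edges=[(0, 1)])
# Two unlinked papers close together: repulsion pushes them apart.
unlinked = layout_step(np.array([[0.0, 0.0], [1.0, 0.0]]), edges=[])
```

Note that the all-pairs repulsion here is O(n^2) per step, which is why convergence for ~10^6 papers would be slow if done naively. Tree-based approximations such as Barnes-Hut are a common way around that, though I don't know what this site actually uses.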
It's based on citations. If you go to 'about' on their site, there's more information about what x, y, size, color, brightness encode in the visualization.
Papers are the tip of the iceberg if we consider applied science and technology. Patents, actual products / services and, above all, money generated are much more important imho.
CS folks are not really used to uploading their papers to the arXiv. So this is probably not a good indication of the number of papers published in each field.
> CS folks are not really used to uploading their papers to the arXiv.
Maybe not to the same extent as physics people, but there is still a lot of CS on arXiv. More so in some subfields than others, but there's a pretty steady stream of CS papers showing up there. Enough that one person can't keep up with reading and digesting all of them as they appear.
That said, I don't disagree that there's a lot more physics than CS on arXiv. :-) I'm just not sure if that's because CS people don't upload to arXiv, or because CS people publish fewer papers in general, or "other".
Any good resources where CS folks typically upload their papers? I've used the Google/Twitter/etc. "published research" pages, but those are obviously company-specific.
Many CS researchers publish in conferences, not journals, so they tend to be pretty spread around. Each field usually has a major conference whose proceedings are worth looking into when they roll around. Of course, conference papers are behind paywalls, but you can usually find a free version if you search the authors/paper title in Google Scholar. The system could be better.
dl.acm.org won't have the paper uploaded if it wasn't published in an ACM journal or conference, but it has the metadata, including citations, for a huge number of papers published elsewhere.
Interesting. I visit arXiv often and notice that most of the new papers are in the 'astrophysics' and 'high-energy' fields, and the map resembles that exactly.
Can you please enlighten us about the technical details behind the scenes, right from collecting the data to processing it?
I'm also working with a large graph entity and would love to read about your process.
How could we go about making a 3D version of this? I had a distinct feeling of travelling a galaxy using this. It could be awesome to actually be sitting in a 'spaceship' (knowledgeship?) and travelling the paths between these papers.
Yes, both tSNE and force-directed layouts can do 3D as well as 2D. The following link goes to a "spaceship" force-directed visualisation of Python Github projects; the same author has used his engine for other visualisations too.
EDIT: Clicking a paper and then "(citations)" will show you the one-level graph of citations, and under the search bar you can see how many results there were.
Wow very cool!! I was looking at the little dots around the edge of the cluster and thought, "hmm I wonder what these are?". Then I realized I needed to dust my monitor...
Edit: two examples of public archives holding data from many different missions: http://irsa.ipac.caltech.edu/frontpage/ https://archive.stsci.edu