The Future of Science (techcrunch.com)
60 points by RichardPrice on April 29, 2012 | 39 comments



On a related note, some years ago I read an academic paper -- alas, it was printed on a dead tree, and I can't find a link to a digital version of it -- which pointed out that the rate at which papers were cited was driven primarily by the rate at which papers were cited. This is not as much of a tautology as it sounds like.

Think about the process of writing a paper: you do some keyword searches for recent articles on related subjects. You then look at the bibliographies of those articles, pick out whatever looks relevant to your topic, look at the bibliographies of those articles, etc. What this means is that apart from your initial keyword search, the primary criterion for including an article in your research is: "has it been cited already?" Relevance is merely a secondary filter.

This paper pointed out the effects of this phenomenon: the vast majority of published scientific papers are never cited again; a moderate number are cited only a few times, and the remaining few -- having reached a bibliographic critical mass -- are cited thousands of times. The authors of the paper made a strong case that this was not a good reflection of the quality of research. In many cases, reaching bibliographic critical mass came down to the almost random chance of acquiring those first few citations. The authors provided several examples of important scientific ideas which had been lost for decades, arguably because they had not attracted a critical mass of citations in the years immediately after publication.

In other words, humans suck at pagerank.
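To make the dynamic concrete, here's a toy preferential-attachment simulation (my own sketch in Python, not the model from the paper I'm describing). Each new paper mostly cites earlier papers in proportion to how often they've already been cited, with an occasional uniform pick standing in for the initial keyword search, and you get the same skew: most papers never cited, a handful cited enormously.

    import random

    def simulate_citations(n_papers=5000, refs_per_paper=10, seed=0):
        rng = random.Random(seed)
        citations = [0] * n_papers
        # "urn" holds one entry per citation received, plus one baseline entry
        # per paper, so drawing from it is preferential attachment:
        # already-cited papers are proportionally more likely to be drawn again.
        urn = []
        for new_paper in range(n_papers):
            for _ in range(min(refs_per_paper, new_paper)):
                if rng.random() < 0.9 and urn:
                    cited = rng.choice(urn)           # cite because already cited
                else:
                    cited = rng.randrange(new_paper)  # uniform pick: keyword search
                citations[cited] += 1
                urn.append(cited)
            urn.append(new_paper)  # every paper gets one baseline entry
        return citations

    counts = simulate_citations()
    print("never cited:    ", sum(1 for c in counts if c == 0))
    print("cited 1-5 times:", sum(1 for c in counts if 1 <= c <= 5))
    print("cited 50+ times:", sum(1 for c in counts if c >= 50))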

Anyhow, it occurred to me that this is a problem which could be solved with technology. Imagine an online word processor which -- in a sidebar -- suggests potentially related articles from ArXiv and Google Scholar. This would be based not on crawling bibliographies, but rather on semantic analysis of the adjacent paragraphs.
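Here's roughly the shape of the suggestion step I have in mind (a minimal sketch, assuming you've already pulled a local dump of titles and abstracts from arXiv or Google Scholar beforehand; TF-IDF cosine similarity via scikit-learn is a crude stand-in for real semantic analysis):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def suggest_related(paragraph, corpus, top_k=5):
        """corpus: list of (title, abstract) pairs; returns (title, score) pairs."""
        vectorizer = TfidfVectorizer(stop_words="english")
        doc_matrix = vectorizer.fit_transform([abstract for _, abstract in corpus])
        query_vec = vectorizer.transform([paragraph])
        scores = cosine_similarity(query_vec, doc_matrix).ravel()
        top = scores.argsort()[::-1][:top_k]
        return [(corpus[i][0], float(scores[i])) for i in top]

    # Toy corpus; in the imagined editor this would be the arXiv/Scholar index,
    # and the query would be the paragraph currently being edited.
    corpus = [
        ("Citation networks and preferential attachment",
         "We model citation counts as a rich-get-richer process on paper graphs."),
        ("Protein folding via simulated annealing",
         "A stochastic optimization approach to tertiary structure prediction."),
    ]
    print(suggest_related("Our draft discusses citation bias in paper graphs.",
                          corpus, top_k=1))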

I think this would create some real benefits. It would remove many of the problems with citation bias, ensuring that important ideas aren't lost and that prior research isn't unwittingly duplicated. Wish I had time to implement something like this!


I think your impression of how related work gets surveyed is overly simplified. Since you are submitting to a conference/journal in your field, chances are high that the reviewers are knowledgeable in the subject area and will point out errors in attributing due credit to related work.

While I agree that there are systemic problems with peer review and with how the science "enterprise" works, there is a fitting analogy from politics, courtesy of Winston Churchill: democracy is "the worst form of government except for all the others."


I think you misunderstand. I'm not talking about attribution errors. I'm talking about the fact that discoverability and cross-linking are severely hampered by the bias towards previously cited works. This doesn't create errors per se, but it can narrow the scope of inquiry to the point where it becomes detrimental to the institution of science as a whole. It's a naturally emergent silo, but a silo nonetheless.

An NLP-based referencing system would not need to be correct all the time; it would merely need to be helpfully suggestive. As you're writing your paper, it would put tips in the sidebar: "Maybe this is relevant? (Hover to read abstract)". As long as there isn't an intolerable number of false positives, it would be quite a useful tool, I think.


Well, actually, attribution errors weren't the only thing I had in mind. Peer review ensures that colleagues will tell you about related work that you don't know about. Sometimes people will tell you that something is related even though you yourself don't think it is. Only with some time and acceptance will you see that the remarks really are related, probably not directly to your own contribution, but to the bigger field that you orient yourself in.

Come to think of it, this is probably the most important argument against an NLP-based "recommender." Personally, I think something like this might be interesting, probably even a great help, but at the end of the day, people need to really read a lot of papers, follow the proceedings of their target conferences and journals, and ask colleagues for their bibliographies. This has the added benefit of teaching them how to present their own work in contrast to others', do meaningful evaluations (in the best of all worlds, of course!), and figure out who is doing interesting work and might be valuable to get in contact with. Of course, some parts could be automated, but there is currently no incentive for scientists to do so.

IMHO, it would be a much more important step for CS researchers to publish their code, too, because I frequently come across papers that have no implementation or evaluation at all -- and that's really bad, because then the least-publishable unit becomes an idea with nice pictures. Researchers can be very successful using this "publication strategy." Come to think of it, there should be an alternative to ranking scientists by their number of publications or their impact; unfortunately, I have no idea what could work instead.


Peer review ensures that colleagues will tell you about related work that you don't know about.

Not really, because other researchers are advancing their careers based on how often and how much they (a) publish and (b) get cited. So the colleagues most likely to be in a position to review your work are those who got cited a lot, and who would primarily know about work that has itself been cited a lot.

The application of a relevance/NLP/PageRank-like additional-citation recommender program could come as a step in the reviewing process. Rather than having just human reviewers suggest further reading, a "machine reviewer" would as well, placing the query results in front of everyone involved in publishing a paper.


I totally agree about the importance of scientists publishing their code. That is critical. It's one of the many parts of the scientific process where the community would benefit from greater sharing.


I think you may mean the worst form of government except all the others that have been tried.

We need to try more.


I think that is a great sentiment. To find systems that work a lot better than the current one, which is incredibly slow, we are going to need to try a lot of different ideas.

When you are venturing into the unknown, innovation is tough and challenging, but it's also very rewarding.


Unfortunately, trying new ideas when it comes to government sometimes produces millions of victims.

But I agree - we should experiment more. One idea - we should be A/B testing any new law, before introducing it in the whole country.


trying new ideas when it comes to government sometimes produces millions of victims.

Usually only with the forms of government that don't care about millions of victims. It's really a no-brainer that any totalitarian system that intends to make an enemy of a large section of its population is probably going to be pretty awful. So I would strongly suggest weeding those forms out in advance, rather than using the common historical method of allowing them to be implemented through force by the paranoid and insane.

This is less of an issue when it comes to systems of improving feedback in science, unless it somehow enables an aspiring evil genius, or something.


I am always looking for ideas for software that I could work on that seems more useful than cat pictures, and this gets pretty high up in my estimation. For the near future, most of my minimal NLP expertise will be used up developing software for foreign language instruction, but I'm definitely bookmarking this just in case I can ever take it up later.


I am not sure that this will work anytime soon, even if NLP gets substantially better. I think the problem reduces to the examination of patents and their novelty. I think finding related work algorithmically is going to be hard to the point of being unreliable. For example, in CS there is a trend of changing nomenclature every n years, which I think makes it hard to find related work.


Google manages to serve advertisements despite changing nomenclature. We're just talking about "advertising" papers based on various data about the paper under submission.


It would be really great to see somebody take this on! Although if this idea makes you into a millionaire, then you'll owe me a burrito, okay?


Somebody needs to make a relevance algorithm which takes into consideration (1) noise at low levels of citation activity, and (2) the misleading effect of social proof at high levels of citation.
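Purely to illustrate the shape such a score might take (my own sketch, not an established metric): shrink low citation counts toward a field-wide prior so one or two early citations don't count as strong evidence, log-damp the high end so runaway social proof stops dominating, and weight by topical relevance.

    import math

    def adjusted_relevance(citations, relevance, field_mean_citations=5.0,
                           prior_strength=10.0):
        """citations: raw citation count; relevance: 0..1 topical-match score
        (e.g. from text similarity). Returns a combined ranking score."""
        # Shrinkage toward the field mean: a paper with 2 citations shouldn't be
        # treated as twice as much evidence as a paper with 1.
        smoothed = (citations + prior_strength * field_mean_citations) / (1.0 + prior_strength)
        # Log damping: 10,000 citations shouldn't count 100x more than 100.
        popularity = math.log1p(smoothed)
        return relevance * popularity

    # A highly relevant but barely cited paper can outrank a weakly relevant
    # blockbuster:
    print(adjusted_relevance(citations=3, relevance=0.9))     # ~1.58
    print(adjusted_relevance(citations=5000, relevance=0.2))  # ~1.23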


Admirable cause, but the author doesn't do themselves any favors by dramatically overstating the role of publication in knowledge sharing (informal channels & conferences exist, publication serves more of a recognition purpose), and with somewhat offensive, unsupported claims like,

"The stakes are high. If these inefficiencies can be removed, science would accelerate tremendously. A faster science would lead to faster innovation in medicine and technology. Cancer could be cured 2-3 years sooner than it otherwise would be, which would save millions of lives"


I agree strongly, conferences in particular are very important in this regard. Perhaps in some fields people are cautious about sharing full results at conferences, but in most it is very much encouraged and beneficial to get your ideas out there to the community before a paper can come out. The "12 month time-lag" cited in the article usually includes at least one conference presentation, in which the main results can be presented, often receiving useful feedback that can result in a stronger finished paper as well.

Could scientists make better use of modern communication media? Sure. I particularly wish more fields would adopt the arXiv model. But the peer-reviewed journal is not going to be displaced anytime soon; the best we can hope for is to make the process more transparent and open, more reflective of the interests of Science and scientists, and less of the for-profit journal industry. I highly doubt that the problem will be solved by "science startups." There are large non-profit interests as well; I expect the progress to be made by individual scientists, universities, and professional organizations who have an interest in destroying the status quo.


I don't think informal channels and conferences, which are infrequent and really expensive to travel to, are enough. Before the 1600s, science was largely done by wealthy people whose houses were large enough to hold a laboratory. Scientific results weren't publicly shared; at best they were shared between the experimenter and a few of his/her friends, who communicated by private letters.

In the late 1600s, the first scientific journals were founded, which meant that it became the norm for scientific results to be shared publicly. This era coincided with the Scientific Revolution, an incredible flourishing of scientific thinking that formed the basis of modern science.

I think that with a much more connected scientific community, operating more as a global brain than as relatively disconnected nodes, scientific progress could double. So if cancer would normally be cured in x years, I can see that coming down to x/2 years with an accelerated science, and, given the size of x, that shortening is likely to be a matter of years.


Best of luck with academia.edu!


Indeed, the post completely overlooks conferences, which in many fast-moving disciplines (most of CS, e.g.) are really the main venue for presenting new work, superseding journal publications. Think ICML or NIPS in machine learning -- 6 months apart.

I question the need for a for-profit enterprise like the one the author promotes inserting itself into the research enterprise. Publishers are bad enough.


The problem with distributed, grassroots peer review is that you get poor quality reviewers. The current structure is slow and very "old media", but it is this way because it's the only way to guarantee quality peer reviews.

If journals cease to exist, and a new publish-it-anywhere-then-publicize-it paradigm emerges, along with some associated metrics (kinda sorta like Reddit), then I predict that conference presentations will become the new metric of success. They have gatekeepers, and scarcity due to limited bandwidth (i.e. there are a limited number of time slots available). The whole journal publishing infrastructure will just be shifted over to conferences... along with the ecosystem of for-profit vs. trade group, etc., and the Slowness and Single Mode of Publication problems that the OP describes.


"The norms don’t encourage the sharing of an interactive, full-color, 3 dimensional model of the protein, even if that would be a more suitable media format for the kind of knowledge that is being shared."

This is simply wrong - when you solve a protein structure it is mandatory that you submit it to the PDB (e.g. http://www.pdb.org/pdb/101/motm.do?momID=148), and nearly every journal I read has both color figures and extensive online supplementary materials.


You're right, there are some trends in the right direction, which is terrific. But these are early. By and large scientists only get credit for publishing papers, which means they aren't taking advantage of the full interactive power of the web.

It's rare for scientists to share things like data sets, or a video of a physical process they are studying. Most graphs or tables in scientific papers are non-interactive: you can't change the x and y axes, or other properties of the graph, as you can with graphs in Google Analytics or with data displayed for native web consumption generally. Similarly, the code that scientists run on their data sets, which generates the conclusions that end up in their papers, doesn't get shared.

I think the key to opening up richer sharing is to provide credit metrics that incentivize this kind of activity. When scientists can get credit for sharing data-sets, code, videos, and a wider array of rich media, they will start sharing more, and taking greater advantage of the rich media power of the web.


I'm glad that the author is thinking about ways to increase communication between scientific authors, but some of the statements he made, specifically regarding "curing cancer 2-3 years sooner", make him sound ignorant of some of the challenges facing researchers. Not all scientific knowledge is presented only through journal articles. As others have already mentioned, conferences with "poster presentations" are pretty common in medicine for discussing ideas before the paper comes out. In addition, labs across the country working on similar problems often exchange ideas and substrates by email and mail respectively. I agree that it would be great if there were a more centralized online repository of information. If anyone has any experience with blogs, forums or websites specifically addressing oncology (that are not just press releases), I would appreciate learning about them.


No 3d models for new proteins?

The protein databank exists precisely for that reason.

http://www.rcsb.org/pdb/home/home.do


More broadly, lots of data that doesn't lend itself to a single figure has either repositories (like the Gene Expression Omnibus) or supplemental attachments to the paper that can be included.


Imagine if all the stories in your Facebook News Feed were 12 months old. People would be storming the steps of Congress, demanding change.

To play devil's advocate, the time lag forces your conversations to strive for a higher standard of quality, comprehensiveness, correctness, and context than Facebook updates could ever be held to.

Then again, striving for that higher standard also breeds pseudoscience, bad statistics, and outright fraud.

In short, I don't think the solution is to replace the paper with something instantaneous. I agree that instantaneous (public) communication could be better used in the academic community, but there's a trend that way already as blog posts begin to signal a certain kind of good advisor.

I especially don't agree that search engines have any business replacing peer review.


Many of the new approaches to science publishing I’ve seen haven’t done enough to directly address the silo problem or provide significantly improved alternatives. I suspect this is primarily because they’re trying to create viable scientific publishing businesses of their own. I believe taking a different approach around free, distributed, open source publishing and aggregation software would be better suited to transforming scientific communication into a more open, continuous, efficient, and data-driven process.

more here: http://tldr.person.sh/on-the-future-of-science


It always strikes me how much time it takes to finish your paper (e.g. fitting the conference template, correcting spelling mistakes, formatting figures, etc.) when you could outsource this. Student assistants are an option.

Furthermore it would be nice to discuss your ideas without having to spend months writing a paper. Guess there is a difference between alpha and beta sciences.

Finding relevant conferences is a challenge as well (since I'm still a junior researcher).


I recently had a chit-chat with a 5th year PhD friend of mine in front of the Stanford bookstore. We both did academic research, and both know all too well the incredible frustration of tech not having fully penetrated academic research. If you'd ever like to work toward making research faster, reply to this. We could think about a couple of things and start cranking out some solutions.


(since you're in cali, and I'm not) I'd like to draw your attention to the science hack days in SF. Have a look at this wiki[0]. WilliamGunn and cazDev on github are two people I know who've participated in / have some connections that might help a project like this happen.

[0] http://sciencehackday.pbworks.com/w/page/45740104/SFideas#ag...


Yo. My email address is in my HN profile, and above I was thinking about the idea of adding an "automated literature reviewer" to paper submissions which would "advertise" relevant papers the way Google tries to serve relevant ads against search queries.


I did just enough academic research to discover that real progress could be more effectively pursued elsewhere. I would love to explore this; janardan.yri at me.com.


mukaiji - I would love to chat to you. Drop me a line at richard [at] academia.edu.


Yes, there is a time lag problem. However, instant distribution has been around for a long time (in the case of arXiv.org, since 1991). It's widely accepted in the physics community, but it hasn't gained much traction in most other scientific disciplines. I think there are two reasons for this: the chicken and egg problem, and the peer review problem.

The chicken and egg problem is that no one in these disciplines publishes unreviewed manuscripts because no one reads them. The corollary here is that if you do something interesting and someone happens to read it, take your idea, and publish first, as far as credit goes, you're fucked. This happens with any form of public presentation of ideas, not all that often but often enough that every scientist knows someone who it has happened to. If you just sank a year of your life into a project, you want to make damn sure you're going to get credit for it. At present, instant distribution is too risky. If the profile of instant distribution can rise to the point where a manuscript will be sufficiently widely read to be acknowledged as the source of an idea, scientists in less competitive areas may be more open to it.

The bigger issue is, I think, that scientists actually appreciate peer review. Peer review ensures both quality and fairness in research. If I read a paper in a high-impact journal, I generally believe I can trust the results regardless of who wrote it. By contrast, any reputation-based metrics will be strongly colored by the reputation of the lab from which the paper originates. (I have a hunch that this is already true for citation metrics.) Replacing peer review with reputation-based metrics may mean research gets out there faster, but it may also mean that a lot of valuable research gets ignored. This still sucks, and it may suck more. Turning a paper into a startup that may succeed or may fail depending on how well a scientist can market his or her findings would absolutely suck ass. IMHO, scientific funding is already too concentrated in the hands of established labs, and these labs are often too large to make effective use of their personnel. Reputation-based metrics would only contribute to this problem. They would also lead to confusion in the popular press, which is already somewhat incapable of triaging important from unimportant scientific results. This is a much bigger deal in biomedical science than in theoretical physics, because the former has direct bearing on individuals' lives.

On top of this, citation metrics are simply not peer review. In his previous article, Richard Price pointed out that researchers need to spend a lot of time performing peer review. This is absolutely the way it should be. Researchers should spend hours poring over new papers, suggest ways of improving them to the authors, and ultimately ensure that whatever makes it into press is as high quality as possible. IMHO, the easiest way to get quality research out faster is to encourage journals to set shorter peer review deadlines and encourage researchers to meet them, not to throw away the entire system.

OTOH, I think open sharing of data sets among researchers will massively enhance scientific progress, and has a reasonable chance of happening because the push is coming from funding agencies, not startups. As a scientist, the idea of being able to ask my own questions with other people's data gets me far more excited than being able to read their papers before release.


I totally agree with you about data-sharing. I wanted to spend more time on that in the article, but didn't because I didn't want to make the article longer. I think the ability to share and ask questions about data really has enormous potential to drive science forward. The fact that enormous amounts of scientific data remain private to the lab, unshared, is really a big loss to science. It's going to be very exciting as that data starts getting shared more.

The key to making that happen is disrupting the credit system. Right now scientists aren't incentivized to curate and share their data, so they don't put in the work to do it. You can't put data-sets on your resume, much like you can't put blog posts, or anything that is not a paper. As soon as scientists start getting credit for sharing data-sets, I think we'll start to see it happen.

Similar points apply, as you mention, to instant distribution. Instant distribution will happen more as scientists start getting credit for scientific ideas that they distribute instantly. You are already seeing some disruption to the credit system. In the last 5-10 years, since citation counts have been made publicly available by Google Scholar, citation counts have started to play a much larger role in resource allocation decisions, e.g. decisions by hiring committees and grant committees. I did my PhD at Oxford in philosophy from 2001-2007, and remained involved with some of the hiring decisions at the Oxford philosophy department until 2011, and it's been very interesting to watch the increased influence, over those years, of citation counts in hiring decisions.

Citation counts aren't perfect, but they are another signal. Hiring committees, in my experience, are desperate for more signals that they can take into account when comparing candidates. Comparing candidates is a tough job. As with any signal, to wield it properly you need to know its pros and cons. Fundamentally, what the community is looking for here is a variety of signals that show how much a highly respected chunk of the scientific community has interacted with a piece of your content and found it useful.

To get data-sets, and other media, to attract scientific credit, we need to develop metrics that demonstrate the traction that those pieces of media are getting in highly respected parts of the scientific community. I think those metrics will get developed, and that new metrics will play an enormous role in allowing different kinds of media to be shared, and everything to be shared faster.


This is a time for revolution in the methods of science and the funding of science, long overdue and enabled by the internet. It will be a mix of removing barriers to entry, blurring the priesthood at the edges, publishing data openly and iteratively, and drawing crowdfunding directly from interested groups of the public rather than just talking to the traditional funding bodies.

Astronomy has long been heading in this direction, actually - it's a leading indicator for where fields like medicine and biotechnology are going. People can today do useful and novel life science work for a few tens of thousands of dollars, and open biotechnology groups are starting to formalize (such as biocurious in the Bay Area).

There is a lot of good science and good application of science that can be parallelized, broken up into small fragments, distributed amongst collaborative communities. The SENS Foundation's discovery process for finding bacterial species that might help in attacking age-related buildup of lipofuscin, for example: cheap, could be very parallel. In this, these forms of work are much like software development - consider how that has shifted in the past few decades from the varied enclosed towers to the open market squares below.

This greater process is very important to all of us, as it is necessary to speed up progress in fields that have great potential, such as biotechnology. Only a fraction of what could be done will be done within our lifetimes without a great opening of funding and data and the methodologies of getting the work done.


I agree with a lot of what you said, but this:

> People can today do useful and novel life science work for a few tens of thousands of dollars

makes me wonder if you've ever done bench work or furnished a lab. Sure, you can do a few weeks or months of work for tens of thousands of dollars (which doesn't produce a lot of useful results in that time frame, but can produce some), but that's assuming you have a functioning lab. It often takes $100K just to stock one, which is why most new investigators get special startup money just for that. To produce useful results often takes years at a rate of at least $100K a year. Equipment and reagents (especially enzymes) can be expensive: $300 for a gel box here, $1000 for a pipette there, $5000 for a thermocycler, $2000 for an enzyme -- it adds up. And that's not even getting into salaries. Most people don't work alone.

I think it would be incredibly difficult to crowdfund science a la something like Kickstarter, especially given the amount of money currently spent on science (about $50 billion annually in the US alone). But maybe someone on HN will be the person who proves me wrong.


That's a cool point about parallelization. There was a fascinating experiment done a few years ago by the mathematician Tim Gowers, called the 'Polymath Project', where he took a problem in math and asked the mathematicians who read his blog to solve parts of it. 40 people took part, and 7 weeks later Gowers announced on his blog that the problem was 'probably solved'. A couple of papers came out of it, published under the name 'D.H.J. Polymath'.

More info on this is on the Wikipedia page http://en.wikipedia.org/wiki/Polymath_Project

You're right, it would be cool if we could see more of this kind of thing happening. There is now a whole site dedicated to applying parallelization to other problems in math http://polymathprojects.org/



