
Most PDF scientific articles sadly don't have good embedded metadata [0], so this & the DOI issue make this not very useful (at least for the journals I read).

I would also have implemented this as a simpler shell script calling exiftool [1] and pdftotext [2], but hey, it's fun to have a Python-based implementation :)

[0] http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/
[1] http://www.sno.phy.queensu.ca/~phil/exiftool/
[2] http://poppler.freedesktop.org/
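
For anyone curious what the exiftool + pdftotext approach might look like, here is a rough sketch in Python rather than shell. It assumes both tools are installed and on PATH. The function names and the DOI regex are my own illustration, not taken from the project under discussion, and the regex is a heuristic that will miss some older or unusual DOIs.

```python
import json
import re
import subprocess

# Heuristic pattern for modern-looking DOIs (an assumption, not a full grammar).
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-looking string in text, or None if there is none."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Trailing punctuation is usually sentence punctuation, not part of the DOI.
    # This is a guess: a DOI that really ends in ")" would be truncated.
    return match.group(0).rstrip(".,;:)")


def embedded_metadata(pdf_path):
    """Return the PDF's embedded metadata as a dict, using `exiftool -j`."""
    result = subprocess.run(
        ["exiftool", "-j", pdf_path],
        capture_output=True, text=True, check=True,
    )
    # exiftool -j prints a JSON array with one object per input file.
    return json.loads(result.stdout)[0]


def first_pages_text(pdf_path, pages=2):
    """Return the text of the first few pages, using `pdftotext ... -`."""
    result = subprocess.run(
        ["pdftotext", "-l", str(pages), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def guess_doi(pdf_path):
    """Look for a DOI in the embedded metadata first, then in the page text."""
    metadata_blob = json.dumps(embedded_metadata(pdf_path))
    return find_doi(metadata_blob) or find_doi(first_pages_text(pdf_path))
```

Calling `guess_doi("paper.pdf")` (a placeholder filename) would try the metadata first and fall back to the text of the first two pages, which is where a DOI is usually printed when the metadata is empty.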




What's "the DOI issue"?

(Also, if your favourite publisher isn't putting the metadata you want in PDFs, write to them and ask! It may make a difference...)


The DOI issue is the one described by "aroch".

I think Quickscrape can download PDFs given a DOI: https://github.com/ContentMine/quickscrape

If publishers listened to their readers we would have had 100% Open Access ten years ago. Traditional academic publishers do not listen and don't care.

Even if I could convince say PLOS to do something about this it wouldn't change much. We need all or the majority of publishers to provide good embedded metadata. Not just an isolated one or two. I don't see a good mechanism for making that happen, sadly.


Well hopefully the coverage of full-text link metadata in Crossref will increase over time. Until then, best of luck to ContentMine!

Improving metadata coverage is a different issue from changing the business model. Objectively, one is easier to implement than the other.

It is indeed tricky getting ~5,000 publishers to provide optimal metadata, but it's a good thing to have an industry-standard platform to do it in.

(I work at Crossref)
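
To make "full-text link metadata" concrete, here is a minimal sketch of looking it up for one DOI through the public Crossref REST API. It assumes the `https://api.crossref.org/works/{doi}` endpoint and a `link` list inside the `message` object of the response. The helper names are hypothetical, and many records will simply have no `link` entries at all.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.crossref.org/works/"


def works_url(doi):
    """Build the lookup URL for one DOI (percent-encoding anything unusual)."""
    return API_ROOT + urllib.parse.quote(doi)


def full_text_links(record, content_type=None):
    """Return the full-text URLs listed in a works record.

    `record` is the decoded JSON response. If `content_type` is given
    (e.g. "application/pdf"), only links declaring that type are returned.
    """
    links = record.get("message", {}).get("link", [])
    return [
        link["URL"]
        for link in links
        if "URL" in link
        and (content_type is None or link.get("content-type") == content_type)
    ]


def fetch_record(doi):
    """Fetch and decode the works record for a DOI (needs network access)."""
    with urllib.request.urlopen(works_url(doi)) as response:
        return json.load(response)
```

Something like `full_text_links(fetch_record(doi), "application/pdf")` would then return whatever PDF links the publisher has deposited, which may well be an empty list.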


Do you mean having the DOIs for all referenced papers included in the PDF metadata somehow? In my experience, authors don't know LaTeX well enough to do that. Half of them don't even read the submission guidelines that clearly say not to put page numbers on the camera-ready paper; they're never going to understand complicated metadata commands...


The metadata in papers is the responsibility of publishers not authors.



