Show HN: PDFx – Extract Metadata and URLs from PDFs, and Download Referenced PDFs (metachris.com)
93 points by metachris on Oct 26, 2015 | 35 comments



Sadly, most scientific PDF articles don't have good embedded metadata [0], so this and the DOI issue make it not very useful (at least for the journals I read).

I would have implemented this with a simpler shell script calling exiftool [1] and pdftotext [2], but hey, it's fun to have a Python-based implementation :)

[0] http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/

[1] http://www.sno.phy.queensu.ca/~phil/exiftool/

[2] http://poppler.freedesktop.org/
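
For reference, the kind of shell sketch I had in mind (untested; assumes exiftool and poppler's pdftotext are on the PATH):

    #!/bin/sh
    # Dump the embedded metadata as JSON, then pull URLs out of the extracted text.
    exiftool -json "$1"
    pdftotext "$1" - | grep -Eo 'https?://[^ )>"]+' | sort -u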


What's "the DOI issue"?

(Also, if your favourite publisher isn't putting the metadata you want in PDFs, write to them and ask! It may make a difference...)


The DOI issue is as described by "aroch".

Quickscrape is capable of downloading PDFs given a DOI, I think: https://github.com/ContentMine/quickscrape

If publishers listened to their readers we would have had 100% Open Access ten years ago. Traditional academic publishers do not listen and don't care.

Even if I could convince, say, PLOS to do something about this, it wouldn't change much. We need all, or at least the majority, of publishers to provide good embedded metadata, not just an isolated one or two. I don't see a good mechanism for making that happen, sadly.


Well hopefully the coverage of full-text link metadata in Crossref will increase over time. Until then, best of luck to ContentMine!

Improving metadata coverage is a different issue than changing business model. Objectively, one's easier to implement than the other.

It is indeed tricky getting ~5,000 publishers to provide optimal metadata, but it's a good thing to have an industry-standard platform to do it in.

(I work at Crossref)


Do you mean having the DOIs of all referenced papers included in the PDF metadata somehow? In my experience, authors don't know LaTeX well enough to do that. Half of them don't even read the submission guidelines that clearly say not to put page numbers on the camera-ready paper; they're never going to understand complicated metadata commands...


The metadata in papers is the responsibility of publishers, not authors.


On a related note, these past couple of weeks I've found myself wanting to import several years' worth of accumulated PDFs into a BibTeX file. This has involved metadata extraction, text scraping, querying Google Scholar (good, but rate-limited) and CrossRef (no limit, but not as accurate).

I've written a very rough guide to the approaches I've taken so far at http://chriswarbo.net/essays/pdf-tools.html , with a bunch of links to external tools, some NixOS package definitions, command-line snippets and descriptions of Emacs macros.

Not quite the same problem as the author's, but the tools and scripts I've been using can do similar things :)
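
In case it's useful here too, a minimal sketch of the CrossRef lookup step (assuming curl and jq, and that a rough title string has already been scraped out of the PDF):

    # Ask CrossRef for the best match on a scraped title, take its DOI,
    # then fetch a ready-made BibTeX entry via DOI content negotiation.
    DOI=$(curl -s 'http://api.crossref.org/works?query=Some+Scraped+Title&rows=1' \
            | jq -r '.message.items[0].DOI')
    curl -sLH 'Accept: application/x-bibtex' "http://dx.doi.org/$DOI"

The top hit isn't always the right paper, of course, so the result still needs checking against the PDF.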


Thanks, interesting read.


This is really neat! For work, I've found myself from time to time exploring the tech around PDFs. I find this tech strangely fascinating. It's like a shim on top of something old and ugly that enables integration with much more modern systems.

Some quick feedback (and a shameless plug):

The CLI should output JSON. It would be nice to combine it with a CLI JSON parser such as jq [0].

Shameless plug: I've been working on a PDF CLI aimed at making it easier to programmatically fill out PDF forms: https://github.com/adelevie/pdfq. It provides an interface and some wrappers on top of the main PDF form-filling tool, pdftk. For example, you can get JSON out of a PDF form like this:

    pdftk hello.pdf dump_data_fields | pdfq
Or you can generate FDF from a JSON file:

    cat hello.json | pdfq json_to_fdf
You can also fill a PDF without touching any FDF code:

    pdfq set foo bar input.pdf output.pdf
[0] https://stedolan.github.io/jq/


PDF is less proprietary than most people think. It is an ISO standard, after all, and while the format is a bit complicated, it does solve the problem of making "printable" documents produced by all sorts of tools available online.


pdfx will output JSON if you use the -j flag!

    pdfx -j <file-or-url.pdf>
jq looks neat btw.


I should have read the [] manual :)


Slightly off-topic, pardon me. But does anyone have any good tips on how to remove PDF security?

Let me clarify why: I frequently come across datasheets (e.g. for flash memory ICs) that have security enabled for some strange reason. Nothing secret, just plainly downloaded from the Internet. I can open and print them, but not highlight or add remarks.

The solutions I've found so far are inadequate, since they typically amount to 'download this obscure-sounding executable', 'upload and convert on this sketchy, possibly-malware-injecting website', or printing the entire thing to a new PDF document (e.g. via PDF Creator), but that makes the text un-highlightable.

I don't mind anything involving hex editing, some Node.js or Python library, or chanting and dancing, as long as it gets the job done.

I just want to be able to highlight and copy text :(
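
Edit: in case it helps anyone else, qpdf can apparently rewrite a file with its restrictions removed, as long as the file doesn't need a user password just to open it. Something like:

    # Rewrite the PDF without its encryption/permission flags.
    qpdf --decrypt restricted-datasheet.pdf unrestricted-datasheet.pdf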


Use a reader like evince and toggle the checkbox.


Nice! Thanks for that! If you only knew how much I've looked for something like this, and, not to brag or anything, my internet-search-skills are prettttttty sharp.

For others' information (from Wikipedia): "Evince used to obey the DRM restrictions of PDF files, which may prevent copying, printing, or converting some PDF files, however this has been made optional, and turned off by default [...]"


You're welcome, have fun! I only found out about it by luck myself.


Nice work Chris .. it doesn't work on all my PDFs, though:

    j@w1x8-dev:~/Documents/PDF Documents {}
    $ pdfx xhyve\ –\ Lightweight\ Virtualization\ on\ OS\ X\ Based\ on\ bhyve\ _\ pagetable.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdfx", line 9, in <module>
        load_entry_point('pdfx==1.0.1', 'console_scripts', 'pdfx')()
      File "build/bdist.macosx-10.10-x86_64/egg/pdfx/cli.py", line 66, in main
      File "build/bdist.macosx-10.10-x86_64/egg/pdfx/__init__.py", line 137, in __init__
    AttributeError: 'NoneType' object has no attribute 'items'
    j@w1x8-dev:~/Documents/PDF Documents {}
If you want some sample PDFs on which it is borked, just let me know .. in the meantime I'm using pdf_scraper for most of these ..


Thanks for the stack trace. Yes please, sample PDFs with problems would be great! You can find my email in my profile.


Maybe I'll just show you next time we run into each other at the 'lab or so ..


While nice, it really only works when the reference has a direct link to the PDF, while the majority of citations in the sciences use the DOI.

DOI traversal would be required.


DOIs are great for humans; unfortunately they'll take you to the publisher's webpage, and I don't know of a standard way of getting an actual PDF from a DOI. Maybe with PLOS; I know they're good at serving up different versions (XML with a different Accept header, IIRC).

Searching the page for something that looks like a 'download PDF' button and trying that might get you 80% of the way there, and at the very least you could give the user the remaining DOI URLs to visit themselves.

This is actually really relevant to a lot of problems I see so if anyone has a general solution / a 90% solution then I'm all ears :)
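
A crude sketch of that button-hunting heuristic (assuming curl, and that $DOI holds whatever was pulled out of the reference list; many landing pages expose the PDF location in a citation_pdf_url meta tag, though plenty don't):

    # Follow the DOI redirect to the publisher landing page, then look for the
    # Highwire-style meta tag that often points at the full-text PDF.
    curl -sL "http://dx.doi.org/$DOI" \
      | grep -oE '<meta[^>]+citation_pdf_url[^>]+>'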


The standard way of getting the actual PDF from a DOI, when it's a Crossref DOI (which it probably is), is to use the full-text link available in the CrossRef API.

For DOI 10.1155/2010/963926:

http://api.crossref.org/works/10.1155/2010/963926

In the returned JSON, follow message -> link, and there's the PDF!

    [
      {
        "intended-application": "text-mining",
        "content-version": "vor",
        "content-type": "application/pdf",
        "URL": "http://downloads.hindawi.com/journals/jo/2010/963926.pdf"
      },
      {
        "intended-application": "text-mining",
        "content-version": "vor",
        "content-type": "application/xml",
        "URL": "http://downloads.hindawi.com/journals/jo/2010/963926.xml"
      }
    ]
Publishers are still getting round to including the full-text links in their metadata, but there are 16,000,000 DOIs with such data. Not all are open access, however.
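
Concretely, pulling that link out with curl and jq looks something like this:

    # Fetch the Crossref metadata for the DOI and keep only the full-text
    # link whose content-type is application/pdf.
    curl -s 'http://api.crossref.org/works/10.1155/2010/963926' \
      | jq -r '.message.link[] | select(."content-type" == "application/pdf") | .URL'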

When a PDF has Crossref CrossMark, the DOI is embedded in the metadata (I can't say exactly how, but I can find out).

http://www.plosone.org/article/fetchObject.action?uri=info:d...

Drop us a line on labs@crossref.org


Thanks, I'd not thought of the CrossRef API for this. I use the API pretty heavily for other things though; really good work!

Just noticed this part of the response:

    "affiliation": [],
How well filled in is that? I find affiliation is currently really poorly provided on many sites (although there are meta tags, they're often wrong).

> When a PDF has Crossref CrossMark, the DOI is embedded in the metadata (I can't say how but I can find out)

That's likely to come in really useful, thanks.


    http://api.crossref.org/works?filter=has-affiliation:true

    => total-results: 964,696
Do ask us questions on labs@crossref.org or raise a ticket on https://github.com/CrossRef/rest-api-doc


Maybe consider a Google Scholar integration? Its search results sometimes include links to the full-text PDF. Even when they don't, extracted links to publisher websites could be helpful for a batch review of referenced articles.


Not wishing to nit-pick, but a DOI is a link to the publisher website.


afandian's reply below is exactly how I would go about it. Most DOIs in science papers are CrossMark, or convertible to a CrossMark DOI.


Good point, and I will definitely take a look at that!

Would you open an issue on Github, and perhaps reference a few papers?


Sure, I'll do it this evening.


Glad if this tool/lib is useful to some. I'm happy to answer any and all questions!

A Kivy [1] based cross-platform GUI would be a nice addition at some point.

[1] http://kivy.org


How does this tool handle XFA forms?

Or rather, I should ask: how does any tool besides Adobe LiveCycle Designer handle XFA? I've been using Apache Tika with Apache Solr to index PDF documents -- it worked quite well up until I tried to index XFA-based documents and got nothing. Well, I got metadata, but not the content.

[pdftk](https://www.pdflabs.com/docs/pdftk-man-page/#dest-drop-xfa) seems to have an option called drop_xfa, which does just that.
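
If it's any use, the invocation appears to be along these lines (untested on my end):

    # Rewrite the form with the XFA data stripped, so downstream tools see
    # only the regular (non-XFA) form and content.
    pdftk xfa-form.pdf output stripped.pdf drop_xfa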


I don't know! If you could send me sample PDFs I'm more than happy to look into it!


I've installed pdfx and saw the help info as in the demo. But when I tried to download the 17 example PDF files, the following error message appeared at the end:

    ERROR 2: len() takes exactly one argument (2 given)

What does this mean??


Thanks, you found a bug! I've just now fixed it. You can update pdfx with:

    $ easy_install -U pdfx




