Hi HN!
This is my pet project, written from scratch because there is so much to discover and learn in the process. The focus is on simplicity and incremental updates.
Progress is slow because I do not have much spare time to work on this, but I would love to hear some feedback.
Regards
Be careful with PDF! There are many ambiguities in the specification that are implemented differently between parsers, as well as implicitly accepted malformations that almost all parsers will silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia: when the same file renders differently in two parsers. For example, with PDF, what if you have a PDF stream object whose length doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?
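To make the length case concrete, here is a contrived sketch (the bytes are a minimal fragment, not a complete PDF) showing two strategies a parser might take when /Length and endstream disagree:

    # A stream object whose /Length disagrees with its endstream token.
    raw = b"<< /Length 5 >>\nstream\nHello, world\nendstream"

    start = raw.index(b"stream\n") + len(b"stream\n")
    declared = 5  # what the /Length entry claims

    by_length = raw[start:start + declared]           # parser A: trust the dictionary
    by_token = raw[start:raw.rindex(b"\nendstream")]  # parser B: scan for endstream

    print(by_length)  # b'Hello'
    print(by_token)   # b'Hello, world'

Two parsers that make different choices here will "see" different stream contents in the same file.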
The PDF format is diverse enough for a new project like this to still have plenty of incompatibilities. If you wanted to find many of them quickly, you might want to have a look at the documents used as test cases in other projects, such as pdf.js.
J is for complex numbers. While math.sqrt(-1) raises an exception, cmath.sqrt(-1) returns 1j.
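For instance:

    import cmath
    import math

    try:
        math.sqrt(-1)
    except ValueError as e:
        print(e)            # math domain error

    print(cmath.sqrt(-1))   # 1j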
There is no distinction between values and references in Python; everything is a reference. In fact, primitives like numbers are full struct objects in CPython; you cannot just manipulate the raw numbers.
The difference is rather whether you can modify an object or not. You cannot modify numbers, as they are immutable: any increment produces a new object. But you can modify a list. This gives the feeling that numbers are passed by value and lists are passed by reference.
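A quick demonstration of the distinction, using id() to show object identity:

    x = 10
    print(id(x))
    x += 1            # ints are immutable: x is rebound to a new object
    print(id(x))      # a different identity

    lst = [1, 2]
    print(id(lst))
    lst += [3]        # lists are mutable: modified in place
    print(id(lst))    # the same identity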
This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec), and doing things with a focus on browsing the internal structure etc. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: https://github.com/shreevatsa/pdf-explorer)
In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.
I'm the author and I just meant I had left behind a small uncommitted diff back when I stopped working on it, and I didn't bother to read the diff before committing. I actually understand it just fine, on second look…
Overall, at least so far, I haven't encountered much "WTF" dealing with PDFs, actually. The spec (especially the Adobe version: the ISO version based on it is only slightly different but feels much worse) is quite pleasant to read. There are some warts from backward compatibility with earlier poor decisions, but not too many of them. And while it's surprising what different PDF programs will produce as long as any PDF reader in existence happens to accept it (Hyrum's law) (e.g. in this example, the dictionary key having a space in it), for my purposes it hasn't been a big deal, as I'm only trying to do the first level of parsing, and when even that is problematic I can happily just declare the PDF malformed.
I’ve done a bit of PDF wrangling in Python, so figured I’d describe the lay of the land.
PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.
Reportlab [2], however, has a great API for generating those things, but is lacking in the file I/O and page-management department. The content streams it generates can be plugged into PyPDF pretty easily, though.
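A sketch of that combination, assuming the modern pypdf package name and a local input.pdf: draw an overlay page with reportlab in memory, then stamp it onto existing pages with pypdf.

    import io

    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    # Draw an overlay page into memory with reportlab.
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(612, 792))  # US Letter, in points
    c.drawString(72, 720, "Stamped with reportlab")
    c.rect(72, 72, 200, 100)
    c.save()

    # Merge the overlay onto every page of an existing file with pypdf.
    overlay = PdfReader(buf).pages[0]
    writer = PdfWriter()
    for page in PdfReader("input.pdf").pages:
        page.merge_page(overlay)
        writer.add_page(page)
    with open("stamped.pdf", "wb") as f:
        writer.write(f)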
Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.
There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.
I've found out the hard way that boxing/unboxing of PDF primitives into Python objects is _really_ expensive, so my workflow ended up, counter-intuitively, quite a lot slower than with PyPDF2.
As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.
If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.
As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant; see https://github.com/gettalong/annotated-pdf-spec. That might help you in parsing some invalid PDFs.
Merely for your consideration, if those were actual issues on that repo, (a) it would allow adding labels to them (as in https://github.com/pdf-association/pdf-issues/issues?q=is%3A... ) (b) folks could comment, acting as a low-rent stackoverflow, and (c) it would allow anyone to contribute new ones versus the "PR against README.md" situation right now
That also more closely matches the mental model of those items: bugs against the specification, whether the official PDF Association agrees that they are or not
Thank you, it is much needed. Right now, the most reliable way of generating PDFs that I used not so long ago is (roughly sketched below):
- create a DOCX with the content and some template variable strings, like {{}}
- unpack the document, get at the text, and replace the placeholders
- use a DOCX->PDF Linux tool to generate the final document.
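A rough sketch of that workflow, assuming the placeholders survive as literal {{...}} strings inside word/document.xml (Word sometimes splits them across XML runs, which breaks naive replacement):

    import subprocess
    import zipfile

    def fill_template(template, output, values):
        # A .docx is a zip archive; the body text lives in word/document.xml.
        with zipfile.ZipFile(template) as zin:
            xml = zin.read("word/document.xml").decode("utf-8")
            for key, val in values.items():
                xml = xml.replace("{{%s}}" % key, val)
            with zipfile.ZipFile(output, "w") as zout:
                for item in zin.infolist():
                    data = (xml.encode("utf-8")
                            if item.filename == "word/document.xml"
                            else zin.read(item.filename))
                    zout.writestr(item, data)

    fill_template("template.docx", "filled.docx", {"name": "Jane Doe"})
    # The DOCX->PDF step, via headless LibreOffice.
    subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf",
                    "filled.docx"], check=True)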
LaTeX seems way better than either to me (but then, I know LaTeX). Certainly LaTeX makes it much easier to get consistency and precision. For your use case, generate LaTeX code from a template using Python or whatever, substituting in what you need, then compile the LaTeX into a PDF. If you need graphical elements or precise layout control, use the LaTeX package TikZ.
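A minimal sketch of that template step (the file names and <<...>> markers are made up; plain string replacement avoids clashing with TeX's own $ and {} syntax):

    import subprocess

    with open("invoice_template.tex") as f:
        tex = f.read()
    tex = tex.replace("<<NAME>>", "Jane Doe").replace("<<TOTAL>>", "32.56")
    with open("invoice.tex", "w") as f:
        f.write(tex)

    # Compile to PDF; pdflatex must be on the PATH.
    subprocess.run(["pdflatex", "-interaction=nonstopmode", "invoice.tex"],
                   check=True)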
If what you need is very simple (e.g. no word wrapping, same number of variable strings in the same positions), even manipulating the code of a template PDF directly is not too hard. This library would help with that.
And a variety of intermediate or input text formats, where you can pick your preferred poison whether for book publishing, research papers, math papers, technical documentation, slides, etc.
Good luck! I once started to scratch the same itch to learn this file format; several years later, I think I got about 30% of the way through!
More open-source PDF code is good. If you can find a copy of the iText RUPS application somewhere on the internet, it's a useful tool for viewing the syntax/structure.
I once had to help an accountant friend to fill in 1000's of docx files, and convert them to pdf. No open source tool does a proper conversion, it really sucked.
Presumably the key word here is "proper" because LibreOffice, etc., read docx and write pdf. For example, `libreoffice --headless --convert-to pdf myfile.docx`.
I have used Perl and Win32::OLE for this kind of job.
Converting to PDF is actually quite easy. Before Office 2010, you had to print to PostScript and then convert to PDF using Ghostscript. Nowadays Word gives you the option of saving to PDF.
Good to see work in the PDF space. It’s still one of the most important formats. I would love to see more time invested in tools that can create PDF/A documents, which I believe to be the sane subset of PDF.
As far as FOSS tools go, I've only found paged.js (a polyfill) in combination with a browser print-to-PDF (e.g. wkhtmltopdf (WebKit) or Puppeteer (Chrome)) that has any semblance of CSS support.
There's also ghostscript - but AFAIK it doesn't support much/any css3 for print.
I've seen Puppeteer being used successfully in a few projects, it's ugly as hell (you need a VM, a browser and extra software to generate a file) and might require maintenance but it works quite well for simple, mostly textual documents.
The main issue with the browser based tools is that developers seem to have given up on implementing html/css print standards. So there's a limit to what automation tools have to work with, so to speak. That said, paged.js makes a good effort.
Circa 25 years ago I was newly wed and my wife happened to be watching over my shoulder as I looked up something online. I started typing "fr" and "freshmeat.net" came up as a possible completion.
She was, to put it mildly, immediately suspicious of my browsing habits.
Rust is probably a recent development; it's been a long time since I last checked up on PrinceXML. It might also mean that Mercury is on the way out over the medium to long term.
It's actually not so bad: it's mostly ASCII, even though some parts of it really need to be treated as binary. If you open up a simple/old PDF in your favorite text editor, you can begin to grok the basic structures quite easily.
One trick for getting started: PDFs are read from the bottom. The first thing a reader consumes is at the very end of the file: an offset pointing back to the xref table. The xref table, in turn, points to the latest version of all of the objects.
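A minimal sketch of that bottom-up read (happy path only; real files can have trailing whitespace quirks and incremental updates):

    def find_xref_offset(path):
        with open(path, "rb") as f:
            f.seek(0, 2)                      # jump to end of file
            f.seek(max(0, f.tell() - 1024))   # the trailer lives in the last bytes
            tail = f.read()
        # The file ends with: startxref\n<byte offset>\n%%EOF
        idx = tail.rindex(b"startxref")
        return int(tail[idx:].split()[1])     # offset of the xref table

    print(find_xref_offset("sample.pdf"))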
The part you're most likely interested in is the content streams, which contain postscript-like drawing commands. To get a feel for it, following the official spec when reading a simple-looking document can help.
edit: I didn't link any actually useful resources, in part because I actually just have a corpus of files in various file formats that I keep handy as a reference for some weird reason. However, Googling for simple PDF files yielded this, which I feel is very readable in a text editor. https://www.africau.edu/images/default/sample.pdf
The structure, however, is still largely ASCII text. It needs to be treated as binary of course, due to the use of byte offsets everywhere and the fact that each xref table entry is hardcoded to a specific width in bytes. But if you look at a lot of simple or old PDFs, it's not hard to find some that don't use any binary encodings.
Neat. Another use case for which you might want to think about a sample is extracting data from filled PDF forms. (That use case is why I once had to write a PDF parser.)
Since you read & write, maybe also a use case of programmatically filling some form fields in an editable PDF form. Such as pre-filling some of the fields for a particular web site user in a dynamically modified PDF form they download. But the source PDF form can be hand-crafted and maintained separately, like people often want to do, not generated from scratch by your code.
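For reference, pre-filling AcroForm fields with pypdf looks roughly like this (the field name here is hypothetical; list the real ones with reader.get_fields() first):

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("form.pdf")
    writer = PdfWriter()
    writer.append(reader)  # copy all pages, keeping the form fields

    writer.update_page_form_field_values(
        writer.pages[0], {"patient_name": "Jane Doe"}
    )
    with open("prefilled.pdf", "wb") as f:
        writer.write(f)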
I recently tried pdfplumber [1] to extract (relatively) difficult, oddly formatted tables from PDFs, and it was a great experience. I can recommend it. Before I ended up using pdfplumber, I tried at least three other PDF packages and they did not work as easily or as expected.
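Basic table extraction with pdfplumber looks like this (the file name is made up; tricky layouts may need table_settings tuning):

    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    print(row)  # each row is a list of cell strings (or None)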
Fun project story... During the first covid school shutdown, my son's day care wanted parents to print a daily symptom screening checklist, take a photo of it, and email it to them every morning. This was a tedious process that I automated with PyPDF2 + PDFtk + pypdftk. It's easy to generate your own PDFs, but it's harder to take an existing, outdated, non-editable PDF and automatically fill it out.
Eventually I turned it into a website, added AWS API Gateway + Lambda and put the whole thing up for other daycare parents to use. Two weeks later the daycare switched to google forms and my project was not useful anymore.
> it's harder to take an existing, outdated, non-editable PDF and automatically fill it out.
That has been on my wishlist for several years: build a "PDF annotation" service that takes in a PDF that is not an XObject form (e.g. this random example: https://www.dentalworks.com/wp-content/uploads/2021/08/Patie... ) and replace those _____ areas with actual PDF inputs. My handwriting is terrible, and it's a waste of human capital for some poor soul to try and decipher handwriting only to (almost undoubtedly) re-type it into a computer on their end
I am sure we ended up in this situation because people just "File > Print to PDF" from Word or whatever, because knowing that PDF forms exist and then how to use Adobe(R) whatever(tm) to make a real editable PDF is "too much to ask."
I once had to parse a reference manual, provided as a PDF and emit a mostly-usable CSV of its content.
That shit was hard. Writing PDF is one thing, but there are some psychopathic PDFs out there when you scratch below the surface. People do .... well, you'll find out.
Table elements not being in consecutive display order in any direction. Like elements 0-5 from column a, followed by elements 10-12 from column b, elements 6, 8, 19 from column a, then elements 19-4 in reverse order from column d, then all of column c...
I asked the vendor if they had a 3d model viewer, so we could inspect a part they were making. He sent me a PDF. Full pan, tilt, zoom, and hideable pieces. I don't know what kind of witchcraft was involved, but I suspect OP will come out of this cursed by its unholy nature.
Most PDF viewers don't support the 3D models feature, and just show a static image (literally an embedded image; they don't look at the 3D data at all). I've used it to make 3D diagrams for multivariable calculus, done with Asymptote (https://asymptote.sourceforge.io/).
I wish you all the best! This space has a lot of stuff in it, and all of it is lacking in some aspect. And that's not an admonishment: PDF is such a complicated format that there will never be a library that doesn't come with asterisks. It's just a matter of picking the thing you want your library to focus on and be good at, and you can pretty easily become someone's favorite lib.
My understanding is that this is largely because you're fighting an adversarial format provider in Adobe. I've read a few papers and journal entries on file format polyglotting, with some focus on PDF, and the approaches are constantly shifting in nature due to Adobe mooting pathways to success. I think that's partly for security and, IMO, partly for obscurity, since PDF is a horrific format in all reality except for human visual interpretation. Many organisations and industries ONLY produce public data via PDF, and it makes parsing that information a far more difficult task (again, I suspect by design). After trying a few parsing options for PDF, I've come to hate it as a format. Luckily, some options from a few cloud providers seem to be really hammering the problem's complexity down, but the cost of those solutions can be very steep.
PDF is a subset of Postscript, which is a full-blown programming language disguised as a page description language.
People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.
The problem is that PDF is not a file format, it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable, in the sense that it should render the same way on whatever device it's rendered on (printed on a page or mastered to a display). And it's portable because it doesn't allow any postscript job-level commands, and it tries to ensure that each PDF File is standalone and can be concatenated together into a multi-page document or embedded in another document.
Postscript (and PDF) are also postfix, which can be confusing.
That’s a bit of an oversimplification. There’s a whole layer of structure atop the postscript subset. Much software deals only with that layer, never looking into the chunks of rendering code. That’s plenty complicated already!
> Postscript (and PDF) are also postfix, which can be confusing.
I handwrote quite a bit of PostScript way back when. It wasn't that bad, really; you just had to keep the state of the stack firmly in your head. Being used to HP scientific calculators helped. I would never dream of handwriting a PDF file, though. Even the low-level parts are harder to deal with, since most command names have been shortened to one or two letters for efficiency.
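For the curious, this is what that postfix style looks like in a PDF content stream (a fragment embedded here as a Python bytes literal; % starts a comment in PDF syntax as well):

    content = b"""
    BT                % begin a text object
    /F1 12 Tf         % operands first (font name, size), then the operator Tf
    72 720 Td         % move the text cursor to (72, 720), operator Td
    (Hello) Tj        % show the string, operator Tj
    ET                % end the text object
    """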
To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.
Your second comment rings true, and in my opinion, we are there. I highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely state it's there now. I threw stuff at it that previously would just spit out trash, and it handled it fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).
Cost is the kicker for me: 1,000 pages for $15 adds up fairly quickly at any sort of scale!
Hey, I took a look at your GitHub page and I'm wondering: can you provide more information in the readme, so I can understand the value of the product better?
PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.
[Disclaimer: I work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]
Curious to hear what you hate about it? Any specific point? If it's just the code's age, then keep in mind that the main author, Robin, is celebrating his 75th birthday today. Like most early Python pioneers, he comes from a Lisp background, and that's pretty much the same style of code you see from Peter Norvig and others of his generation.
In the field of PDF parsing, I think the most interesting project I have encountered is pdfquery [1], where the PDF is parsed into an XML tree that you can query with XPath. You might encounter rough edges when putting it into production work, but the idea makes you think "how come I never thought of this", because PDF has a tree-like structure, so it should be a straightforward solution.
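A hedged sketch of what that looks like in practice (the file name is made up, and the selector depends entirely on the document's layout tree):

    import pdfquery

    pdf = pdfquery.PDFQuery("invoice.pdf")
    pdf.load()  # parses the PDF into an lxml/XML tree

    # Query the tree with CSS-ish selectors or XPath.
    label = pdf.pq('LTTextLineHorizontal:contains("Total")')
    print(label.text())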
Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?
Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.
Plus all the fun of the fact that you can embed the following formats inside a PDF:
PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length-compressed data, Deflate-compressed data.
All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.
Note the lack of OpenType fonts, also lack of proper Unicode!
PDF was defined way back before Unicode was ever a thing. It is natively an 8-bit character-set format for text handling. It gets around this limit of only 256 available characters by also allowing custom byte-to-glyph mappings to be defined (think both ASCII and EBCDIC encodings in different parts of the same document). To typeset a glyph that is not in the currently selected 256-character mapping, you switch to a different custom mapping that contains a byte value for that character.
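A toy Python illustration of that model (the maps here are made up; real PDFs express them via /Encoding, /Differences, and CMaps):

    # Each font carries its own (up to) 256-entry byte->glyph map.
    font_f1 = {0x41: "A", 0xE9: "eacute"}   # a Latin-ish encoding
    font_f2 = {0x41: "alpha"}               # the same byte, a different glyph

    def show(byte, font):
        return font[byte]

    print(show(0x41, font_f1))  # "A"
    print(show(0x41, font_f2))  # "alpha"; the byte alone is meaningless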
Others have given some good answers already, but I'll add one more: PDF is all about creating a final layout, but given a final layout, there are an infinite number of inputs that could have produced it. When you parse a PDF, most of the time you are trying to get back something higher level than, e.g., a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that type of semantic extraction in mind, so you have to sort of fudge things to get what you want, and that's where it becomes complex and crazy.
For example, I once had to try to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like, "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, and there was nothing in the PDF itself that correlated the two in any way at all (they didn't appear anywhere close to each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc., etc.).
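In a case like that, you end up pairing strings by geometry. A hedged sketch with pdfplumber (the file name and tolerances are made up):

    import pdfplumber

    with pdfplumber.open("invoice.pdf") as pdf:
        words = pdf.pages[-1].extract_words()
        label = next(w for w in words if w["text"] == "Total:")
        # Candidates: words on (roughly) the same baseline, to the label's right.
        same_line = [w for w in words
                     if abs(w["bottom"] - label["bottom"]) < 2
                     and w["x0"] > label["x1"]]
        amount = min(same_line, key=lambda w: w["x0"])["text"]
        print(amount)  # e.g. "$32.56"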
1. PDF is mostly designed as a write-only (or render-only) format. PDF’s original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any other purpose than rendering resembles the task of disassembling machine code back into intelligible source code.
2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.
3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.
So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.
My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black magic wizardry because they also spend time contributing to infra code before they hired more people like me on board. If anyone wants a great startup idea, look to solving a problem involving parsing PDFs en masse. Maybe something in legal tech. That market is absolutely ripe for disruption.
PDF has always seemed to be a janky Adobe product.
Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF2?
I know it's fun to hate on XML, but compared to inventing a new pseudo-text, pseudo-binary format, its parsing mechanics are well understood.
I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema, allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English prose, which makes for shitty constraint boundaries.
I think it's not too late to create a modern open-source alternative to PDF.
I find it unacceptable that something that has become so widely used doesn't have proper free tools for editing.
Society shouldn't be limited by income when people want (or have) to use PDFs, or else suffer a bad experience.
The other, bigger problem with PDF is that a lot of the time it's used for something it wasn't made for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDFs. Government forms should not use PDFs with hacky embedded scripts either.
It's not that bad; it's just that the problem is big enough in scope that the state of the art is provided by private industry and, to a lesser extent, some open-source tools. You're probably way better off joining them rather than trying to beat them, unless you want to grind your brain into the dust over a page-layout spec.
Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.
https://github.com/trailofbits/polyfile