Hi HN!
This is my pet project, written from scratch because there is so much to discover and learn in the process. The focus is on simplicity and incremental updates.
Progress is slow because I do not have much spare time to work on this, but I would love to hear some feedback.
Regards
Be careful with PDF! There are many ambiguities in the specification that are implemented differently between parsers, as well as implicitly accepted malformations that almost all parsers will silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia: when the same file renders differently in two parsers. For example, with PDF, what if you have a PDF stream object whose length doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?
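To make the length case concrete, here is a contrived sketch (the bytes are a minimal fragment, not a complete PDF) showing two strategies a parser might take when /Length and endstream disagree:

    # A stream object whose /Length disagrees with its endstream token.
    raw = b"<< /Length 5 >>\nstream\nHello, world\nendstream"

    start = raw.index(b"stream\n") + len(b"stream\n")
    declared = 5  # what the /Length entry claims

    by_length = raw[start:start + declared]           # parser A: trust the dictionary
    by_token = raw[start:raw.rindex(b"\nendstream")]  # parser B: scan for endstream

    print(by_length)  # b'Hello'
    print(by_token)   # b'Hello, world'

Two parsers that make different choices here will "see" different stream contents in the same file.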
The PDF format is diverse enough for a new project like this to still have plenty of incompatibilities. If you wanted to find many of them quickly, you might want to have a look at the documents used as test cases in other projects, such as pdf.js.
J is for complex numbers. While math.sqrt(-1) raises an exception, cmath.sqrt(-1) returns 1j.
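For instance:

    import cmath
    import math

    try:
        math.sqrt(-1)
    except ValueError as e:
        print(e)            # math domain error

    print(cmath.sqrt(-1))   # 1j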
There is no distinction between values and references in Python; everything is a reference. In fact, primitives like numbers are full struct objects in CPython; you cannot just manipulate the raw numbers.
The difference is rather whether you can modify an object or not. You cannot modify numbers, as they are immutable: any increment produces a new object. But you can modify a list. This gives the feeling that numbers are passed by value and lists are passed by reference.
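A quick demonstration of the distinction, using id() to show object identity:

    x = 10
    print(id(x))
    x += 1            # ints are immutable: x is rebound to a new object
    print(id(x))      # a different identity

    lst = [1, 2]
    print(id(lst))
    lst += [3]        # lists are mutable: modified in place
    print(id(lst))    # the same identity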
This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec), and doing things with a focus on browsing the internal structure etc. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: https://github.com/shreevatsa/pdf-explorer)
In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.
I'm the author and I just meant I had left behind a small uncommitted diff back when I stopped working on it, and I didn't bother to read the diff before committing. I actually understand it just fine, on second look…
Overall, at least so far, I haven't encountered much "WTF" dealing with PDFs, actually. The spec (especially the Adobe version: the ISO version based on it is only slightly different but feels much worse) is quite pleasant to read. There are some warts from backward compatibility with earlier poor decisions, but not too many of them. And while it's surprising what different PDF programs will produce as long as any PDF reader in existence happens to accept it (Hyrum's law) (e.g. in this example, the dictionary key having a space in it), for my purposes it hasn't been a big deal, as I'm only trying to do the first level of parsing, and when even that is problematic I can happily just declare the PDF malformed.
I’ve done a bit of PDF wrangling in Python, so figured I’d describe the lay of the land.
PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.
Reportlab [2], however, has a great API for generating those things, but is lacking in the file I/O and page-management department. The content streams it generates can be plugged into PyPDF pretty easily, though.
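A sketch of that combination, assuming the modern pypdf package name and a local input.pdf: draw an overlay page with reportlab in memory, then stamp it onto existing pages with pypdf.

    import io

    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    # Draw an overlay page into memory with reportlab.
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=(612, 792))  # US Letter, in points
    c.drawString(72, 720, "Stamped with reportlab")
    c.rect(72, 72, 200, 100)
    c.save()

    # Merge the overlay onto every page of an existing file with pypdf.
    overlay = PdfReader(buf).pages[0]
    writer = PdfWriter()
    for page in PdfReader("input.pdf").pages:
        page.merge_page(overlay)
        writer.add_page(page)
    with open("stamped.pdf", "wb") as f:
        writer.write(f)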
Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.
There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.
I've found out the hard way that boxing/unboxing of PDF primitives into Python objects is _really_ expensive, so my workflow ended up, counter-intuitively, quite a lot slower than with PyPDF2.
As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.
If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.
As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant; see https://github.com/gettalong/annotated-pdf-spec. That might help you in parsing some invalid PDFs.
Merely for your consideration, if those were actual issues on that repo, (a) it would allow adding labels to them (as in https://github.com/pdf-association/pdf-issues/issues?q=is%3A... ) (b) folks could comment, acting as a low-rent stackoverflow, and (c) it would allow anyone to contribute new ones versus the "PR against README.md" situation right now
That also more closely matches the mental model of those items: bugs against the specification, whether the official PDF Association agrees that they are or not
Thank you, it is much needed. Right now, the most reliable way of generating PDFs that I used not so long ago is (roughly sketched below):
- create a DOCX with the content and some template variable strings, like {{}}
- unpack the document, get at the text, and replace the placeholders
- use a DOCX->PDF Linux tool to generate the final document.
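A rough sketch of that workflow, assuming the placeholders survive as literal {{...}} strings inside word/document.xml (Word sometimes splits them across XML runs, which breaks naive replacement):

    import subprocess
    import zipfile

    def fill_template(template, output, values):
        # A .docx is a zip archive; the body text lives in word/document.xml.
        with zipfile.ZipFile(template) as zin:
            xml = zin.read("word/document.xml").decode("utf-8")
            for key, val in values.items():
                xml = xml.replace("{{%s}}" % key, val)
            with zipfile.ZipFile(output, "w") as zout:
                for item in zin.infolist():
                    data = (xml.encode("utf-8")
                            if item.filename == "word/document.xml"
                            else zin.read(item.filename))
                    zout.writestr(item, data)

    fill_template("template.docx", "filled.docx", {"name": "Jane Doe"})
    # The DOCX->PDF step, via headless LibreOffice.
    subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf",
                    "filled.docx"], check=True)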
LaTeX seems way better than either to me (but then, I know LaTeX). Certainly LaTeX makes it much easier to get consistency and precision. For your use case, generate LaTeX code from a template using Python or whatever, substituting in what you need, then compile the LaTeX into a PDF. If you need graphical elements or precise layout control, use the LaTeX package TikZ.
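A minimal sketch of that template step (the file names and <<...>> markers are made up; plain string replacement avoids clashing with TeX's own $ and {} syntax):

    import subprocess

    with open("invoice_template.tex") as f:
        tex = f.read()
    tex = tex.replace("<<NAME>>", "Jane Doe").replace("<<TOTAL>>", "32.56")
    with open("invoice.tex", "w") as f:
        f.write(tex)

    # Compile to PDF; pdflatex must be on the PATH.
    subprocess.run(["pdflatex", "-interaction=nonstopmode", "invoice.tex"],
                   check=True)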
If what you need is very simple (e.g. no word wrapping, same number of variable strings in the same positions), even manipulating the code of a template PDF directly is not too hard. This library would help with that.
And a variety of intermediate or input text formats, where you can pick your preferred poison whether for book publishing, research papers, math papers, technical documentation, slides, etc.
Good luck! I once started to scratch the same itch to learn this file format; several years later, I think I got about 30% of the way through!
More open-source PDF code is good. If you can find a copy of the iText RUPS application somewhere on the internet, it's a useful tool for viewing the syntax/structure.
I once had to help an accountant friend to fill in 1000's of docx files, and convert them to pdf. No open source tool does a proper conversion, it really sucked.
Presumably the key word here is "proper" because LibreOffice, etc., read docx and write pdf. For example, `libreoffice --headless --convert-to pdf myfile.docx`.
I have used Perl and Win32::OLE for this kind of job.
Converting to PDF is actually quite easy. Before Office 2010, you had to print to PostScript and then convert to PDF using Ghostscript. Nowadays Word gives you the option of saving to PDF.
Good to see work in the PDF space. It’s still one of the most important formats. I would love to see more time invested in tools that can create PDF/A documents, which I believe to be the sane subset of PDF.
As far as FOSS tools go, I've only found paged.js (a polyfill) in combination with a browser print-to-PDF (e.g. wkhtmltopdf (WebKit) or Puppeteer (Chrome)) that has any semblance of CSS support.
There's also ghostscript - but AFAIK it doesn't support much/any css3 for print.
I've seen Puppeteer being used successfully in a few projects, it's ugly as hell (you need a VM, a browser and extra software to generate a file) and might require maintenance but it works quite well for simple, mostly textual documents.
The main issue with the browser based tools is that developers seem to have given up on implementing html/css print standards. So there's a limit to what automation tools have to work with, so to speak. That said, paged.js makes a good effort.
Circa 25 years ago I was newly wed and my wife happened to be watching over my shoulder as I looked up something online. I started typing "fr" and "freshmeat.net" came up as a possible completion.
She was, to put it mildly, immediately suspicious of my browsing habits.
Rust is probably a recent development; it's been a long time since I last checked up on PrinceXML. It might also mean that Mercury is on the way out over the medium to long term.
It's actually not so bad: it's mostly ASCII, even though some parts of it really need to be treated as binary. If you open up a simple/old PDF in your favorite text editor, you can begin to grok the basic structures quite easily.
One trick for getting started: PDFs are read from the bottom. The first thing a reader consumes is at the very end of the file: an offset pointing back to the xref table. The xref table, in turn, points to the latest version of all of the objects.
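A minimal sketch of that bottom-up read (happy path only; real files can have trailing whitespace quirks and incremental updates):

    def find_xref_offset(path):
        with open(path, "rb") as f:
            f.seek(0, 2)                      # jump to end of file
            f.seek(max(0, f.tell() - 1024))   # the trailer lives in the last bytes
            tail = f.read()
        # The file ends with: startxref\n<byte offset>\n%%EOF
        idx = tail.rindex(b"startxref")
        return int(tail[idx:].split()[1])     # offset of the xref table

    print(find_xref_offset("sample.pdf"))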
The part you're most likely interested in is the content streams, which contain postscript-like drawing commands. To get a feel for it, following the official spec when reading a simple-looking document can help.
edit: I didn't link any actually useful resources, in part because I actually just have a corpus of files in various file formats that I keep handy as a reference for some weird reason. However, Googling for simple PDF files yielded this, which I feel is very readable in a text editor. https://www.africau.edu/images/default/sample.pdf
The structure, however, is still largely ASCII text. It needs to be treated as binary of course, due to the use of byte offsets everywhere and the fact that each xref table entry is hardcoded to a specific width in bytes. But if you look at a lot of simple or old PDFs, it's not hard to find some that don't use any binary encodings.
Neat. Another use case for which you might want to think about a sample is extracting data from filled PDF forms. (That use case is why I once had to write a PDF parser.)
Since you read & write, maybe also a use case of programmatically filling some form fields in an editable PDF form. Such as pre-filling some of the fields for a particular web site user in a dynamically modified PDF form they download. But the source PDF form can be hand-crafted and maintained separately, like people often want to do, not generated from scratch by your code.
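For reference, pre-filling AcroForm fields with pypdf looks roughly like this (the field name here is hypothetical; list the real ones with reader.get_fields() first):

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("form.pdf")
    writer = PdfWriter()
    writer.append(reader)  # copy all pages, keeping the form fields

    writer.update_page_form_field_values(
        writer.pages[0], {"patient_name": "Jane Doe"}
    )
    with open("prefilled.pdf", "wb") as f:
        writer.write(f)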
I recently tried pdfplumber [1] to extract (relatively) difficult, oddly formatted tables from PDFs, and it was a great experience. I can recommend it. Before I ended up using pdfplumber, I tried at least three other PDF packages and they did not work as easily or as expected.
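Basic table extraction with pdfplumber looks like this (the file name is made up; tricky layouts may need table_settings tuning):

    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    print(row)  # each row is a list of cell strings (or None)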
Fun project story... During the first covid school shutdown, my son's day care wanted parents to print a daily symptom screening checklist, take a photo of it, and email it to them every morning. This was a tedious process that I automated with PyPDF2 + PDFtk + pypdftk. It's easy to generate your own PDFs, but it's harder to take an existing, outdated, non-editable PDF and automatically fill it out.
Eventually I turned it into a website, added AWS API Gateway + Lambda and put the whole thing up for other daycare parents to use. Two weeks later the daycare switched to google forms and my project was not useful anymore.
> it's harder to take an existing, outdated, non-editable PDF and automatically fill it out.
That has been on my wishlist for several years: build a "PDF annotation" service that takes in a PDF that is not an XObject form (e.g. this random example: https://www.dentalworks.com/wp-content/uploads/2021/08/Patie... ) and replace those _____ areas with actual PDF inputs. My handwriting is terrible, and it's a waste of human capital for some poor soul to try and decipher handwriting only to (almost undoubtedly) re-type it into a computer on their end
I am sure we ended up in this situation because people just "File > Print to PDF" from Word or whatever, because knowing that PDF forms exist and then how to use Adobe(R) whatever(tm) to make a real editable PDF is "too much to ask."
I once had to parse a reference manual, provided as a PDF and emit a mostly-usable CSV of its content.
That shit was hard. Writing PDF is one thing, but there are some psychopathic PDFs out there when you scratch below the surface. People do .... well, you'll find out.
Table elements not being in consecutive display order in any direction. Like elements 0-5 from column a, followed by elements 10-12 from column b, elements 6, 8, 19 from column a, then elements 19-4 in reverse order from column d, then all of column c...
I asked the vendor if they had a 3d model viewer, so we could inspect a part they were making. He sent me a PDF. Full pan, tilt, zoom, and hideable pieces. I don't know what kind of witchcraft was involved, but I suspect OP will come out of this cursed by its unholy nature.
Most PDF viewers don't support the 3D models feature, and just show a static image (literally an embedded image; they don't look at the 3D data at all). I've used it to make 3D diagrams for multivariable calculus, done with Asymptote (https://asymptote.sourceforge.io/).
I wish you all the best! This space has a lot of stuff in it, and all of it is lacking in some aspect. And that's not an admonishment: PDF is such a complicated format that there will never be a library that doesn't come with asterisks. It's just a matter of picking the thing you want your library to focus on and be good at, and you can pretty easily become someone's favorite lib.
My understanding is that this is largely because you're fighting an adversarial format provider in Adobe. I've read a few papers and journal entries on file format polyglotting, with some focus on PDF, and the approaches are constantly shifting in nature due to Adobe mooting pathways to success. I think that's partly for security and, IMO, partly for obscurity, since PDF is a horrific format in all reality except for human visual interpretation. Many organisations and industries ONLY produce public data via PDF, and it makes parsing that information a far more difficult task (again, I suspect by design). After trying a few parsing options for PDF, I've come to hate it as a format. Luckily, some options from a few cloud providers seem to be really hammering the problem's complexity down, but the cost of those solutions can be very steep.
PDF is a subset of Postscript, which is a full-blown programming language disguised as a page description language.
People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.
The problem is that PDF is not a file format, it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable, in the sense that it should render the same way on whatever device it's rendered on (printed on a page or mastered to a display). And it's portable because it doesn't allow any postscript job-level commands, and it tries to ensure that each PDF File is standalone and can be concatenated together into a multi-page document or embedded in another document.
Postscript (and PDF) are also postfix, which can be confusing.
That’s a bit of an oversimplification. There’s a whole layer of structure atop the postscript subset. Much software deals only with that layer, never looking into the chunks of rendering code. That’s plenty complicated already!
> Postscript (and PDF) are also postfix, which can be confusing.
I handwrote quite a bit of PostScript way back when. It wasn't that bad, really; you just had to keep the state of the stack firmly in your head. Being used to HP scientific calculators helped. I would never dream of handwriting a PDF file, though. Even the low-level parts are harder to deal with, since most command names have been shortened to one or two letters for efficiency.
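For the curious, this is what that postfix style looks like in a PDF content stream (a fragment embedded here as a Python bytes literal; % starts a comment in PDF syntax as well):

    content = b"""
    BT                % begin a text object
    /F1 12 Tf         % operands first (font name, size), then the operator Tf
    72 720 Td         % move the text cursor to (72, 720), operator Td
    (Hello) Tj        % show the string, operator Tj
    ET                % end the text object
    """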
To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.
Your second comment rings true, and in my opinion, we are there. I highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely state it's there now. I threw stuff at it that previously would just spit out trash, and it handled it fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).
Cost is the kicker for me: 1,000 pages for $15 adds up fairly quickly at any sort of scale!
Hey, I took a look at your GitHub page and I'm wondering: can you provide more information in the readme, so I can understand the value of the product better?
PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.
[Disclaimer: I work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]
Curious to hear what you hate about it? Any specific point? If it's just the code's age, then keep in mind that the main author, Robin, is celebrating his 75th birthday today. Like most early Python pioneers, he comes from a Lisp background, and that's pretty much the same style of code you see from Peter Norvig and others of his generation.
In the field of PDF parsing, I think the most interesting project I have encountered is pdfquery [1], where the PDF is parsed into an XML tree that you can query with XPath. You might encounter rough edges when putting it into production work, but the idea makes you think "how come I never thought of this", because PDF has a tree-like structure, so it should be a straightforward solution.
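A hedged sketch of what that looks like in practice (the file name is made up, and the selector depends entirely on the document's layout tree):

    import pdfquery

    pdf = pdfquery.PDFQuery("invoice.pdf")
    pdf.load()  # parses the PDF into an lxml/XML tree

    # Query the tree with CSS-ish selectors or XPath.
    label = pdf.pq('LTTextLineHorizontal:contains("Total")')
    print(label.text())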
Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?
Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.
Plus all the fun of the fact that you can embed the following formats inside a PDF:
PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length-compressed data, Deflate-compressed data.
All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.
Note the lack of OpenType fonts, also lack of proper Unicode!
PDF was defined way back before Unicode was ever a thing. It is natively an 8-bit character-set format for text handling. It gets around this limit of only 256 available characters by also allowing custom byte-to-glyph mappings to be defined (think both ASCII and EBCDIC encodings in different parts of the same document). To typeset a glyph that is not in the currently selected 256-character mapping, you switch to a different custom mapping that contains a byte value for that character.
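A toy Python illustration of that model (the maps here are made up; real PDFs express them via /Encoding, /Differences, and CMaps):

    # Each font carries its own (up to) 256-entry byte->glyph map.
    font_f1 = {0x41: "A", 0xE9: "eacute"}   # a Latin-ish encoding
    font_f2 = {0x41: "alpha"}               # the same byte, a different glyph

    def show(byte, font):
        return font[byte]

    print(show(0x41, font_f1))  # "A"
    print(show(0x41, font_f2))  # "alpha"; the byte alone is meaningless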
Others have given some good answers already, but I'll add one more: PDF is all about creating a final layout, but given a final layout, there are an infinite number of inputs that could have produced it. When you parse a PDF, most of the time you are trying to get back something higher level than, e.g., a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that type of semantic extraction in mind, so you have to sort of fudge things to get what you want, and that's where it becomes complex and crazy.
For example, I once had to try to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like, "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, and there was nothing in the PDF itself that correlated the two in any way at all (they didn't appear anywhere close to each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc., etc.).
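In a case like that, you end up pairing strings by geometry. A hedged sketch with pdfplumber (the file name and tolerances are made up):

    import pdfplumber

    with pdfplumber.open("invoice.pdf") as pdf:
        words = pdf.pages[-1].extract_words()
        label = next(w for w in words if w["text"] == "Total:")
        # Candidates: words on (roughly) the same baseline, to the label's right.
        same_line = [w for w in words
                     if abs(w["bottom"] - label["bottom"]) < 2
                     and w["x0"] > label["x1"]]
        amount = min(same_line, key=lambda w: w["x0"])["text"]
        print(amount)  # e.g. "$32.56"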
1. PDF is mostly designed as a write-only (or render-only) format. PDF’s original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any other purpose than rendering resembles the task of disassembling machine code back into intelligible source code.
2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.
3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.
So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.
My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black magic wizardry because they also spend time contributing to infra code before they hired more people like me on board. If anyone wants a great startup idea, look to solving a problem involving parsing PDFs en masse. Maybe something in legal tech. That market is absolutely ripe for disruption.
PDF has always seemed to be a janky Adobe product.
Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF2?
I know it's fun to hate on XML, but compared to inventing a new pseudo-text, pseudo-binary format, its parsing mechanics are well understood.
I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema, allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English prose, which makes for shitty constraint boundaries.
I think it's not too late to create a modern open-source alternative to PDF.
I find it unacceptable that something that has become so widely used doesn't have proper free tools for editing.
Society shouldn't be limited by income when people want (or have) to use PDFs, or else suffer a bad experience.
The other, bigger problem with PDF is that a lot of the time it's used for something it wasn't made for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDFs. Government forms should not use PDFs with hacky embedded scripts either.
It's not that bad; it's just that the problem is big enough in scope that the state of the art is provided by private industry and, to a lesser extent, some open-source tools. You're probably way better off joining them rather than trying to beat them, unless you want to grind your brain into the dust over a page-layout spec.
Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.
https://github.com/trailofbits/polyfile