Show HN: I am building a new Python library to read/write PDF files (github.com/desgeeko)
279 points by desgeeko on Nov 17, 2022 | 121 comments
Hi HN! This is my pet project, written from scratch because there is so much to discover and learn in the process. The focus is on simplicity and incremental updates. Progress is slow because I do not have much spare time to work on this, but I would love to hear some feedback. Regards



Be careful with PDF! There are many ambiguities in the specification that are implemented differently between parsers, as well as implicitly accepted malformations that almost all parsers will silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia: when the same file is rendered differently by two parsers. For example, with PDF, what if you have a stream object whose declared length doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?
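
To make the duplicate-key question concrete, here is a minimal sketch in Python. It assumes the dictionary has already been tokenized into (key, value) pairs; `first_wins` and `last_wins` are hypothetical helper names for illustration, not functions from any real PDF library.

```python
def first_wins(pairs):
    """Build a dict where the first occurrence of a key is kept."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result


def last_wins(pairs):
    """Build a dict where later occurrences overwrite earlier ones."""
    return dict(pairs)


# Hypothetical tokenized form of << /Title (Invoice) /Title (Receipt) >>
pairs = [("/Title", "Invoice"), ("/Title", "Receipt")]
print(first_wins(pairs))
print(last_wins(pairs))
```

Two parsers that each pick a different policy will both "work" and still disagree on the title, which is exactly the kind of divergence described above.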

Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.

https://github.com/trailofbits/polyfile


In the README for that repository it mentions "schizophrenic files". What is a schizophrenic file, out of interest?


Not OP. It seems they're files which display different contents depending on the program you open them in.

Here's a CCC talk on it: https://media.ccc.de/v/MRMCD2014_-_6008_-_en_-_grossbaustell...

And the slides from the talk: https://www.slideshare.net/ange4771/schizophrenic-files-v2


"The ALAN Parsers Project"

Bravo! Best wordplay I've read today.


I never knew about the J number suffix in python: https://docs.python.org/3/reference/lexical_analysis.html#im... which it would appear is used to represent references: https://github.com/desgeeko/pdfsyntax/blob/main/tests/test_p...

I wish you good luck, this file format has tripped up many, many a developer. It blew up on a pdf I had lying around:

    ValueError: could not convert string to float: b'5.0.0'


    104 0 obj <<
    /Producer (pdfTeX-1.40.10)
    /Creator (TeX)
    /CreationDate (D:20131209161146-08'00')
    /ModDate (D:20131209161146-08'00')
    /Trapped /False
    /PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian) kpathsea version 5.0.0)
    >> endobj
as it seems a string with nested parens jams up the parser
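
For what it's worth, PDF literal strings may contain balanced, unescaped parentheses, so the reader has to track nesting depth instead of stopping at the first `)`. A rough sketch of that idea follows; `read_literal_string` is a made-up name, not the project's API, and escape sequences are left undecoded to keep it short.

```python
def read_literal_string(data: bytes, start: int = 0):
    """Return (content, end_index) for the literal string opening at data[start].

    Balanced, unescaped parentheses are part of the string. Escape sequences
    are kept raw (backslash included) rather than decoded, to keep this short.
    """
    if data[start:start + 1] != b"(":
        raise ValueError("expected '(' at start of literal string")
    depth = 0
    i = start
    while i < len(data):
        byte = data[i:i + 1]
        if byte == b"\\":
            i += 2  # skip the backslash and the byte it escapes
            continue
        if byte == b"(":
            depth += 1
        elif byte == b")":
            depth -= 1
            if depth == 0:
                return data[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated literal string")


sample = b"(This is pdfTeX (TeX Live 2009/Debian) kpathsea version 5.0.0) >>"
content, end = read_literal_string(sample)
print(content)
```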


The PDF format is diverse enough for such a new project to still have plenty of incompatibilities. If you wanted to find many of them quickly, you might want to have a look at the documents used as test cases in other projects, such as pdf.js:

https://github.com/mozilla/pdf.js/tree/master/test/pdfs


Valuable practical, actionable advice


This is tangential to your submission, but PDF is the file format I use for exercising any library that claims to be a declarative file format (ala https://github.com/kaitai-io/kaitai_struct_formats#readme )


J is for complex numbers. While math.sqrt(-1) raises an exception, cmath.sqrt(-1) returns 1j.

There is no distinction between values and references in Python; everything is a reference. In fact, primitives like numbers are big struct objects in CPython; you cannot just manipulate the raw numbers.

The difference is rather whether you can modify an object or not. You cannot modify numbers, as they are immutable. Any increment will produce a new object. But you can modify a list. This gives the feeling that numbers are passed as values and lists are passed as references.
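
A small sketch of both points, using only the standard library (the function names are just for illustration):

```python
import cmath
import math

# math.sqrt rejects negative input; cmath.sqrt returns a complex result.
try:
    math.sqrt(-1)
    real_sqrt_failed = False
except ValueError:
    real_sqrt_failed = True

root = cmath.sqrt(-1)  # the imaginary unit, written 1j in Python


def bump_number(n):
    # Integers are immutable: += rebinds the local name to a new object.
    n += 1
    return n


def bump_list(items):
    # Lists are mutable: append changes the object the caller also sees.
    items.append(1)


count = 10
bump_number(count)  # the caller's count is unchanged
values = []
bump_list(values)   # the caller's list now has one element
```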


Parent was referring to "references" in the PDF spec (eg 42 0 R)
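
For readers who have not seen them: an indirect reference is an object number, a generation number, and the keyword `R`. A naive sketch of picking them out of a dictionary is below. The `Ref` tuple and `find_refs` are hypothetical names, and a bare regex like this would also match inside strings or stream data, so treat it as an illustration and not a parser.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    """Hypothetical container for an indirect reference."""
    obj_num: int
    gen_num: int


# Object number, generation number, then the keyword R.
REF_PATTERN = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_refs(data: bytes):
    """Return every 'N G R' triple found in data, in order of appearance."""
    return [Ref(int(m.group(1)), int(m.group(2)))
            for m in REF_PATTERN.finditer(data)]


refs = find_refs(b"<< /Pages 2 0 R /Outlines 42 0 R >>")
```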


This is really wonderful, thank you! It's great to see someone focusing on the internal structure of PDF files (the "Syntax" chapter of the spec), and doing things with a focus on browsing the internal structure etc. (I had a similar idea and did something in Rust/WASM back in May; let me see if I can dust it off and put it on GitHub. Edit: not very usable, but here FWIW: https://github.com/shreevatsa/pdf-explorer)

In particular, there are so many PDF libraries/tools that simply hide all the structure and try to provide an easy interface to the user, but they are always limited in various ways. Something like your project that focuses on parsing and browsing is really needed IMO.


  Commiting a change (from Jun 5 20:57) that I don't understand any more 

     // From real life, lightly modified. Note the "/companyName, LLC" as key!
With absolutely no slight toward the author, that matches my mental model of dealing with PDFs: `git commit -mwtf`


I'm the author and I just meant I had left behind a small uncommitted diff back when I stopped working on it, and I didn't bother to read the diff before committing. I actually understand it just fine, on second look…

Overall, at least so far, I haven't encountered much "WTF" dealing with PDFs actually. The spec (especially the Adobe version: the ISO version based on it is only slightly different but feels much worse) is quite pleasant to read. There are some warts from backward compatibility with earlier poor decisions, but not too many of them. And while it's surprising what different PDF programs will produce as long as any PDF reader in existence happens to accept it (Hyrum's law) (e.g. in this example, the dictionary key having a space in it), for my purposes it hasn't been a big deal as I'm only trying to do the first level of parsing, and when even that is problematic I can happily just declare the PDF malformed.


I’ve done a bit of PDF wrangling in Python, so figured I’d describe the lay of the land.

PyPDF [1] is great for reading and writing PDF files, especially dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.

However, reportlab [2] has a great API for generating those things, but is lacking in the file IO and page management department. But the content streams it generates can be plugged into PyPDF pretty easily.

Finally, there’s pdfplumber which does an amazing job of parsing tabular data from PDF structures, and pytesseract which can perform OCR on PDFs that are actually just image data rather than structured data.

There’s not really a one-stop-shop for PDFs, but some pretty good tools that can be combined to get the job done.

Will be curious to see how this project develops!

[1] https://pypi.org/project/PyPDF2/

[2] https://pypi.org/project/reportlab/


QPDF is a good C++ library for "content preserving" PDF transformations, and is used by the Python PikePDF library.


I've found out the hard way that boxing/unboxing of PDF primitives to Python is _really_ expensive, so that my workflow has been counter-intuitively quite a lot slower than with PyPDF2.


I had the same experience. Thanks for the summary. Need to read that the next time.


As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.

If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.

As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant, see https://github.com/gettalong/annotated-pdf-spec. That might help you in parsing some invalid PDFs


Merely for your consideration, if those were actual issues on that repo, (a) it would allow adding labels to them (as in https://github.com/pdf-association/pdf-issues/issues?q=is%3A... ) (b) folks could comment, acting as a low-rent stackoverflow, and (c) it would allow anyone to contribute new ones versus the "PR against README.md" situation right now

That also more closely matches the mental model of those items: bugs against the specification, whether the official PDF Association agrees that they are or not


Thank you, it is much needed. Right now the most reliable way of generating PDFs that I have used, not so long ago, is: create a DOCX with content and some template variable strings like {{}}, unpack the document and replace the text, then use a DOCX->PDF Linux tool to generate the document.

Maybe this will be a good solution


Latex seems way better than either to me (but then, I know Latex). Certainly Latex makes it much easier to get consistency and precision. For your use-case, generate Latex code from a template using Python or whatever, substituting in what you need, then compile the Latex into a PDF. If you need graphical elements or precise layout control, use the Latex package TikZ.

If what you need is very simple (e.g. no word wrapping, same number of variable strings in the same positions), even manipulating the code of a template PDF directly is not too hard. This library would help with that.
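
A minimal sketch of the "generate LaTeX from a template" step, assuming `{{name}}` placeholders like the ones mentioned upthread. `fill_template` is a hypothetical helper, and it does no escaping of LaTeX special characters (`&`, `%`, `$` and so on) in the substituted values, which real input would need.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: dict) -> str:
    """Replace each {{name}} with values[name]; raise KeyError if missing."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


latex_template = r"\section*{Invoice for {{customer}}} Total due: {{total}} EUR"
filled = fill_template(latex_template, {"customer": "ACME", "total": "32.56"})
print(filled)
```

The filled string would then be written to a `.tex` file and compiled with whatever LaTeX toolchain you already use.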


> the most reliable way of generating PDFs

See pandoc: https://pandoc.org/

And a variety of intermediate or input text formats, where you can pick your preferred poison whether for book publishing, research papers, math papers, technical documentation, slides, etc.

Including the author's own djot: https://github.com/jgm/djot

EDIT:

Sibling reply suggests latex. OK, but then you're also learning latex.


Good luck, I once started to scratch the same itch to learn this file format, several years later I think I got about 30% of the way through!

More open source PDF code is good. If you can find a version of iText RUPS application from somewhere on the internet it's a useful tool for viewing the syntax / structure.


> find a version of iText RUPS application from somewhere on the internet

You mean this, right? https://github.com/itext/i7j-rups#readme


Yes, that's the one. I found a compiled version somewhere because I was too lazy to install/learn Maven/Java stuff in order to build from source.


I once had to help an accountant friend to fill in 1000's of docx files, and convert them to pdf. No open source tool does a proper conversion, it really sucked.


I once had to do this and turned all the docx into one document, used Word to export as PDF and then used a PDF splitter to get separate documents.


You could script Libre Office to do this.


>No open source tool does a proper conversion //

Presumably the key word here is "proper" because LibreOffice, etc., read docx and write pdf. For example, `libreoffice --headless --convert-to pdf myfile.docx`.
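
If you have thousands of files, that command can be wrapped in a few lines of Python. This is only a sketch: it assumes a `libreoffice` binary on the PATH (some installs name it `soffice`), and the function names are made up.

```python
import shutil
import subprocess


def build_convert_command(docx_path: str, out_dir: str = ".") -> list:
    """Return the argv list for a headless DOCX to PDF conversion."""
    return ["libreoffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, docx_path]


def convert(docx_path: str, out_dir: str = ".") -> None:
    """Run the conversion; raises if LibreOffice is missing or fails."""
    if shutil.which("libreoffice") is None:
        raise FileNotFoundError("libreoffice not found on PATH")
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)


command = build_convert_command("myfile.docx", "out")
```

Whether the result counts as a "proper" conversion still depends on how faithfully LibreOffice lays out the particular document.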


I have used Perl and Win32::OLE for this kind of job.

Converting to PDF is actually quite easy. Before Office 2010, you had to print to Postscript and then convert to PDF using Ghost. Nowadays Word gives you the option of saving to PDF.


Yeah, this would be horrible, but on flip side relatively easy to do with Word to hand.


it's such a complex messy format that i'm really not surprised.


ilovepdf.com is free but isn't open source.


Good to see work in the PDF space. It’s still one of the most important formats. I would love to see more time invested in tools that can create PDF/A documents, which I believe to be the sane subset of PDF.


A PDF generator library that only generated guaranteed PDF/A-compliant PDFs would actually be a really good selling point for a new PDF library.


Is there a list of open source PDF libraries for various languages?

And related: the best tools to generate PDFs from HTML.


As I'm currently fighting with css3/paged media[pm] - I've recently tried to figure it out.

There's a rather comprehensive list at: https://www.print-css.rocks/tools

As far as FOSS tools go, I've only found paged.js (a polyfill) in combination with a browser print-to-pdf (e.g. wkhtmltopdf (WebKit) or Puppeteer (Chrome)) that has any semblance of CSS support.

There's also ghostscript - but AFAIK it doesn't support much/any css3 for print.

[pm] https://www.w3.org/TR/css-page-3/


I've seen Puppeteer being used successfully in a few projects, it's ugly as hell (you need a VM, a browser and extra software to generate a file) and might require maintenance but it works quite well for simple, mostly textual documents.


Playwright is a new favorite browser automation tool, I wonder if they've done anything to help with generating PDFs?



The main issue with the browser based tools is that developers seem to have given up on implementing html/css print standards. So there's a limit to what automation tools have to work with, so to speak. That said, paged.js makes a good effort.


"list" is probably harsh, but I've had good luck trawling through the GitHub topics to find such things: https://github.com/search?p=1&q=topic%3Apdf&type=Repositorie...

Makes me miss freshmeat.net which would have been my answer a few years ago (freshcode.club just isn't the same, although bless them for trying)


Circa 25 years ago I was newly wed and my wife happened to be watching over my shoulder as I looked up something online. I started typing "fr" and "freshmeat.net" came up as a possible completion.

She was, to put it mildly, immediately suspicious of my browsing habits.


The best generator is Prince XML but it can be expensive.


Seconded. I've recreated a corporate CI with it and had a great time. To have a single compile target is a blessing.

Also: the only software I know of written in Mercury.


Mercury and Rust (according to Wikipedia).


Rust is probably a recent development; it's been a long time since I last checked up on PrinceXML. It might also mean that Mercury is on the way out in the middle to long term.


In python I would choose WeasyPrint most of the time these days.


Dompdf is a good html to pdf library for PHP.


Please, for the love of all that is holy, run away!


You brave soul, I wish you luck.


I desperately need to be able to display .SVG files with gradients on .PDFs, but no library currently exist in python as far as I know.

I would be willing to help make this happen, but I do not know much about the PDF format.


It's actually not so bad: it's mostly ASCII, even though some parts of it really need to be treated as binary. If you open up a simple/old PDF in your favorite text editor, you can begin to grok the basic structures quite easy.

One trick for getting started: PDFs are read from the bottom. The first thing that is read is actually an offset pointing back to the xref table, at the end of the file. Then, the xref table itself points to the latest version of all of the objects.
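
As a rough sketch of that first step: look near the end of the file for the `startxref` keyword and read the number after it. The sample bytes below are a synthetic, incomplete file built only for this illustration, and `find_startxref` is a hypothetical name.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near end of file")
    return int(matches[-1])


# Synthetic, incomplete file: only the parts this sketch needs.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)
fake_pdf = (body
            + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
            + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
            + b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n")

offset = find_startxref(fake_pdf)
print(fake_pdf[offset:offset + 4])
```

Real files can carry several xref sections from incremental updates, and newer ones may use cross-reference streams, so this only covers the simplest case.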

The part you're most likely interested in is the content streams, which contain postscript-like drawing commands. To get a feel for it, following the official spec when reading a simple-looking document can help.

edit: I didn't link any actually useful resources, in part because I actually just have a corpus of files in various file formats that I keep handy as a reference for some weird reason. However, Googling for simple PDF files yielded this, which I feel is very readable in a text editor. https://www.africau.edu/images/default/sample.pdf


If you look for "pdf inspector" apps, there's also lots of those that will let you poke around the parsed tree.


You can include binary data in PDF files, so it's not necessarily ASCII.


The structure however is still largely ASCII text. It needs to be treated as binary of course, due to the use of offsets everywhere and the fact that the xref table is hardcoded to have a specific length per xref in bytes. But if you look at a lot of simple or old PDFs, it's not hard to find some that don't use any binary encodings.
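
To illustrate the fixed-width point: in a classic xref table each entry is exactly 20 bytes, which is why it must be handled as bytes and not as loosely split text. A sketch, with `parse_xref_entries` as a made-up name:

```python
# 10-digit offset, space, 5-digit generation, space, flag, 2-byte end of line
ENTRY_SIZE = 20


def parse_xref_entries(block: bytes):
    """Split a run of fixed-width xref entries into (offset, generation, in_use)."""
    if len(block) % ENTRY_SIZE != 0:
        raise ValueError("xref block is not a multiple of 20 bytes")
    entries = []
    for i in range(0, len(block), ENTRY_SIZE):
        entry = block[i:i + ENTRY_SIZE]
        offset = int(entry[0:10])
        generation = int(entry[11:16])
        in_use = entry[17:18] == b"n"  # 'n' = in use, 'f' = free
        entries.append((offset, generation, in_use))
    return entries


block = b"0000000000 65535 f \n0000000009 00000 n \n"
entries = parse_xref_entries(block)
```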


Sounds a lot like the tar file format.


The PDF spec is quite readable. Use the “Adobe equivalent” of PDF 1.7 here: https://www.pdfa.org/resource/pdf-specification-index/


ReportLab can render gradients, but it's poorly documented. See eg https://stackoverflow.com/questions/452074/creating-a-gradie...

I use the method with `canvas.clipPath(path, stroke=False, fill=True)` on a path I've parsed manually from SVG then `canvas.linearGradient`.



Neat. Another use case for which you might want to think about a sample is extracting data from filled PDF forms. (That use case is why I once had to write a PDF parser.)

Since you read & write, maybe also a use case of programmatically filling some form fields in an editable PDF form, such as pre-filling some of the fields for a particular Web site user in a dynamically modified PDF form they download. But the source PDF form can be hand-crafted and maintained separately, like people often want to do, not generated from scratch by your code.


I recently tried pdfplumber [1] to extract tables from (relatively) difficult formatted tables in PDF, and it was a great experience. I can recommend it. Before I ended up using pdfplumber, I tried at least three other PDF packages and they did not work as easily or as expected.

[1]: https://github.com/jsvine/pdfplumber


Fun project story... During the first covid school shutdown my son's day care wanted parents to print a daily screening symptom checklist, take a photo of it and email it to them every morning. This was a tedious process that I automated with PyPDF2 + PDFtk + pypdftk. It's easy to generate your own PDF's but it's harder to take an existing, outdated, non-editable PDF and automatically fill it out.

Eventually I turned it into a website, added AWS API Gateway + Lambda and put the whole thing up for other daycare parents to use. Two weeks later the daycare switched to google forms and my project was not useful anymore.


> it's harder to take an existing, outdated, non-editable PDF and automatically fill it out.

That has been on my wishlist for several years: build a "PDF annotation" service that takes in a PDF that is not an XObject form (e.g. this random example: https://www.dentalworks.com/wp-content/uploads/2021/08/Patie... ) and replace those _____ areas with actual PDF inputs. My handwriting is terrible, and it's a waste of human capital for some poor soul to try and decipher handwriting only to (almost undoubtedly) re-type it into a computer on their end

I am sure we ended up in this situation because people just "File > Print to PDF" from Word or whatever, because knowing that PDF forms exist and then how to use Adobe(R) whatever(tm) to make a real editable PDF is "too much to ask."

I have had about 10% success with Preview.app detecting the lines and allowing me to click on them and type, but having https://notstupidpdf.example.com/www.dentalworks.com/wp-cont... would be much better for humanity


I once had to parse a reference manual, provided as a PDF and emit a mostly-usable CSV of its content.

That shit was hard. Writing PDF is one thing but there are some psychopathic PDF's out there when you scratch below the surface. People do .... well, you'll find out.


Table elements not being in consecutive display order in any direction. Like elements 0-5 from column a, followed by elements 10-12 from column b, elements 6,8.19 from column a, then elements 19-4 in reverse order from column d, all of column c....

This is a real thing I dealt with.


This reminds me of back in the day when we got some properties and thought, PDF is a defined file format. Every PDF has these values…

We were so naive and didn’t know.


I asked the vendor if they had a 3d model viewer, so we could inspect a part they were making. He sent me a PDF. Full pan, tilt, zoom, and hideable pieces. I don't know what kind of witchcraft was involved, but I suspect OP will come out of this cursed by its unholy nature.


Most PDF viewers don't support the 3D models feature, and just show a static image (literally an embedded image; they don't look at the 3D data at all). I've used it to make 3D diagrams for multivariable calculus, done with Asymptote (https://asymptote.sourceforge.io/).


PDF has a load of advanced features - 3D models, video, Flash. Basically only Adobe supports any of it though.

I wish more readers supported video but IIRC the standard doesn't actually support a normal modern format.


I wish you all the best! This space has a lot of stuff in it and they’re all lacking in some aspect. And that’s not an admonishment: PDF is such a complicated format that there will never be a library that doesn’t come with asterisks. It’s just a matter of picking the thing you want your library to focus on and be good at, and you can pretty easily be someone’s favorite lib.


My understanding is that this is largely because you're fighting an adversarial format provider in Adobe. I've read a few papers and journal entries on file format polyglotting, with some focus on PDF, and approaches are constantly shifting in nature due to Adobe mooting pathways to success. I think it's partly for security and also IMO partly for obscurity, as PDF is a horrific format in all reality except for human visual interpretation. Many organisations and industries ONLY produce public data via PDF and it makes the parsing of that information a far more difficult task (again, I suspect by design). After trying a few parsing options for PDF, I've come to hate it as a format. Luckily though, some options from a few cloud providers seem to be really hammering the problem's complexity down, but the cost of the solutions can be very steep.


PDF is a subset of Postscript, which is a full-blown programming language disguised as a page description language.

People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.

The problem is that PDF is not a file format, it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable, in the sense that it should render the same way on whatever device it's rendered on (printed on a page or mastered to a display). And it's portable because it doesn't allow any postscript job-level commands, and it tries to ensure that each PDF File is standalone and can be concatenated together into a multi-page document or embedded in another document.

Postscript (and PDF) are also postfix, which can be confusing.


> PDF is a subset of Postscript

That’s a bit of an oversimplification. There’s a whole layer of structure atop the postscript subset. Much software deals only with that layer, never looking into the chunks of rendering code. That’s plenty complicated already!

> Postscript (and PDF) are also postfix, which can be confusing.

I handwrote quite a bit of PostScript way back when. It wasn’t that bad, really, you just had to keep the state of the stack firmly in your head. Being used to HP scientific calculators helped. I would never dream of handwriting a PDF file, though. Even the low-level parts are harder to deal with, since most command names have been shortened to a single letter for efficiency.
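
To show what postfix plus short operator names looks like in practice, here is a toy sketch that groups the tokens of a content-stream fragment into (operator, operands) pairs. It is a deliberate simplification: only numbers and /Names count as operands, so strings, arrays and dictionaries are not handled, and `group_operations` is a hypothetical name.

```python
def group_operations(stream: str):
    """Group whitespace-separated tokens into (operator, operands) pairs."""
    operations = []
    operands = []
    for token in stream.split():
        if token.startswith("/"):
            operands.append(token)
            continue
        try:
            operands.append(float(token))
        except ValueError:
            # Not an operand, so treat it as an operator and flush the stack.
            operations.append((token, operands))
            operands = []
    return operations


# rg = set fill colour, m = move to, l = line to, S = stroke the path
ops = group_operations("0 0 1 rg 72 720 m 144 720 l S")
```

The operands always come first and the one- or two-letter operator consumes them, which is the "keep the stack in your head" part.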


Postfix is fine, you get used to it fairly quickly. But when you've finished and want to go to the toilet, you walk there backwards.


What libraries do you see as being SOTA? Fitz? Tika?

My hope is that computer vision + OCR will solve this once and for all in near future.


To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.

Your second comment rings true, and in my opinion, we are there. Highly recommend throwing some PDFs at AWS Textract and checking out the quality, it wasn't there a few years ago, can safely state it's there now though. I threw stuff at it that previously would just spit out trash, and it handled it fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).

Cost is the kicker for me, 1000 pages for $15, adds up fairly quickly at any sort of scale!


OCR is built into Adobe's PDF reader, issue is it's 15$ a month.

I really want to see OCR become easier to use, but I don't know why it's such a hard problem in the first place.


There is the python library ocrmypdf https://ocrmypdf.readthedocs.io/en/latest/ that works really well. I have found the results comparable to Adobe in accuracy.

I believe it uses tesseract, ghostscript and some other libraries.

Speaking of ghostscript, one way to deal with problematic PDFs is to print them to file and deal with the result instead.


Any open source apps integrate this ?

I'd love to just be able to search a PDF document for a string and get a list of results.


I had some luck with Camelot (https://camelot-py.readthedocs.io/en/master/). However, as many of the comments here say, PDF is a beast.


Hey desgeeko,

from a past project we‘ve left a python PDF renderer - might be somehow useful or inspirational…

https://github.com/systori/bericht


there is a Perl library that does this, but it only supports pdf 1.5

https://metacpan.org/pod/CAM::PDF

I have used it in the past.


Hey, I took a look at your GitHub Page and I'm wondering can you provide more information in the readme so I can understand the value of the product better.


You’re in for it! I highly recommend checking out mupdf, it was one of the more pleasant Python libraries I dealt with for this purpose.


I think you might mean PyMuPDF (https://github.com/pymupdf/PyMuPDF), a Python library built on top of the MuPDF C library (https://mupdf.com/).

PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.

[Disclaimer, i work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]


Good luck! Really! I hate ReportLab!

I hate using ReportLab … reading its code is fascinating. Interesting seeing what 1990s Python code looked like.


Curious to hear what you hate about it? Any specific point? If it’s just the code age then keep in mind that the main author, Robin, is celebrating his 75th birthday today. Like most early Python pioneers he comes from a Lisp background, and that’s pretty much the same style of code you see from Peter Norvig and others of his generation.


Is it possible to directly edit the text?


Wish you luck.


Awesome work


In the field of PDF parsing, I think the most interesting project I encountered is pdfquery[1], where the PDF is parsed as an XML tree and you can use XPath to query it. You might encounter rough edges when you put it into production, but the idea is like “how come I never thought of this”, because PDF has a tree-like structure and this should be a straightforward solution.

[1]: https://github.com/jcushman/pdfquery


There is a command line utility (pdf2txt.py) that will also parse the PDF to an XML tree, which you can then query with XPath. I found it works well.

https://pdfminersix.readthedocs.io/en/latest/reference/comma...


That makes sense, as "pdfquery" uses pdfminer.six as a dep: https://github.com/jcushman/pdfquery/blob/master/requirement...


This is great, thank you for posting it.


Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?


Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.

Plus all the fun of the fact that you can embed the following formats inside a PDF:

PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type1 fonts (dead), PostScript Type3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC Profiles, PostScript functions defining Color spaces, XML forms (the worst), LZ compressed data, Run-length compressed data, Deflate-compressed data.

All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.

Note the lack of OpenType fonts, also lack of proper Unicode!
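
Of everything in that list, Deflate is the easy one, since Python's standard `zlib` module handles it. A small sketch follows; the sample bytes stand in for the data between `stream` and `endstream`, and `safe_inflate` is a hypothetical helper showing one lenient way to treat damaged input, not how any particular reader does it.

```python
import zlib

# Stand-in for the bytes between "stream" and "endstream" of a Flate stream.
original = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(original)

decoded = zlib.decompress(compressed)


def safe_inflate(data: bytes) -> bytes:
    """Return what can be decompressed; empty bytes if the data is corrupt."""
    inflater = zlib.decompressobj()
    try:
        return inflater.decompress(data)
    except zlib.error:
        return b""


partial = safe_inflate(compressed[:-4])  # trailing checksum bytes cut off
```

The strict `zlib.decompress` call would raise on the truncated input, which is the kind of case where a lenient reader and a strict one start to disagree.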


JPEG 2000 (dead)

Not sure what you mean by "dead", but tons of book scans, particularly those at archive.org, are PDFs of entirely JPEG2000 images.


Believe it or not, but digital cinema projection is done with jpeg 2000 https://en.wikipedia.org/wiki/Digital_cinema


That’s wild!


I mean dead as in the fact that it’s used somewhere is noteworthy.

I’d love for JPEG XL to replace such uses!


> lack of proper Unicode

What do you mean?


PDF was defined way back before Unicode was ever a thing. It is natively an 8-bit character-set format for text handling. The way it gets around this limit of only 256 characters available is because it also allows defining custom byte to character glyph mappings (think both ASCII and EBCDIC encoding in different parts of the same document). To typeset a glyph that is not in the current in use 256 character sub-set mapping you switch to a different custom byte value to character glyph mapping to typeset that other character.
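
A toy sketch of that idea: the same byte values decode to different characters depending on which mapping is currently in use. Both mappings here are invented for the example and `decode_with_mapping` is a hypothetical name; real fonts carry their own encoding tables.

```python
def decode_with_mapping(raw: bytes, mapping: dict) -> str:
    """Translate each byte through a font-specific byte-to-character table."""
    # Bytes with no entry become U+FFFD, the Unicode replacement character.
    return "".join(mapping.get(byte, "\ufffd") for byte in raw)


# Two hypothetical fonts that assign different characters to the same bytes.
font_a = {0x01: "H", 0x02: "i"}
font_b = {0x01: "Ω", 0x02: "ж"}

text_a = decode_with_mapping(b"\x01\x02", font_a)
text_b = decode_with_mapping(b"\x01\x02", font_b)
```

This is also why copy and paste from some PDFs yields gibberish: without the right mapping the raw bytes say nothing about which characters were drawn.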


> PDF was defined way back before Unicode was ever a thing.

Unicode 1.0 was released in 1991.

PDF 1.0 was released in 1993.


Others have given some good answers already, but I'll add one more: PDF is all about creating a final layout, but given a final layout, there are an infinite number of inputs that could have produced it, but if you are parsing the PDF, most of the time you are trying to get back something higher level than e.g. a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that type of semantic extraction in mind, so you have to sort of fudge things to get what you want, and that's where it becomes complex and crazy.

For example, I once had to try to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like, "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, but there was nothing in the PDF itself that correlated the two in any way at all (they didn't appear anywhere close to each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc., etc.).
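
One common fudge for this is to ignore the file order entirely and regroup text fragments by where they land on the page. A minimal sketch, assuming fragments have already been extracted as (x, y, text) tuples with y growing upward; the function name, the tolerance and the sample coordinates are all made up, and this heuristic certainly won't rescue every layout.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster (x, y, text) fragments into lines by similar y, then sort by x."""
    lines = []  # each item: [reference_y, [(x, text), ...]]
    # Visit fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Hypothetical fragments in the order they appear in the file, not on the page.
fragments = [
    (400.0, 100.5, "$32.56"),
    (72.0, 700.0, "Invoice"),
    (300.0, 100.0, "Total:"),
]
lines = group_into_lines(fragments)
```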


1. PDF is mostly designed as a write-only (or render-only) format. PDF’s original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any other purpose than rendering resembles the task of disassembling machine code back into intelligible source code.

2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.

3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.

So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.


My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black magic wizardry because they also spend time contributing to infra code before they hired more people like me on board. If anyone wants a great startup idea, look to solving a problem involving parsing PDFs en masse. Maybe something in legal tech. That market is absolutely ripe for disruption.


Droit does similar things.


That is awesome! Can “cross border” do things like process GDPR compliance regulations or is that not the intended use case?


PDF has always seemed to be a janky Adobe product.

Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF2?



Sadly XPS is not supported by most software, I'd love to use something better than PDF, but even LibreOffice can't export as OXPS.


Anything related to XML is arguably even worse.


I know it's fun to hate on XML, but as compared to inventing a new pseudo-text-pseudo-binary format, its parsing mechanics are well understood

I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English which makes for shitty constraint boundaries


It was a fight between DiskPaper and PDF. PDF won because the tools were better and it was cross-platform.

And PDF is a subset of PostScript, the product that made Adobe and the DTP industry.

It's janky because the goal was to render identically everywhere. If you think it's easy look at the code abortion that is CSS.


I know this is likely a case of "you know what I meant," but there already is a PDF 2.0: https://www.pdfa.org/resource/iso-32000-pdf/


I think it's not too late to create a modern open-source alternative to PDF. I find it unacceptable that something that has become so widely used doesn't have proper free tools for editing. Society shouldn't be limited by income if they want (have?) to use PDFs, or else suffer from a bad experience. The other bigger problem with PDF is that a lot of the times it's used for something for which it wasn't made to be used for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDFs. Government forms should not use PDFs with hacky embedded scripts either.


It does seem like that would be a good opportunity to weed out some of the insecure aspects of the format.

Unfortunately in practice it would mean that everyone would have to support both PDF and PDF2.


It's not that bad, it's just that the problem is big enough in scope to get right that the state of the art is provided by private industry, and to a lesser extent some open source tools, and you're probably way better off joining them rather than trying to beat them unless you want to grind your brain into the dust for a page layout spec.


The PDF format is itself a weird hybrid of text and binary.

(I have written a PDF parser myself.)


Amazing Python guide, thank you. How long have you been working as a developer?



