Can you give us an example of something you'd consider to be a complicated probl...

layer8 · on May 31, 2023

> you could look at PDF as a boring-ass "follow the spec" experience

If only… the problem is that the spec is underspecified.

gumballindie · on May 31, 2023

I honestly have no clue about what makes pdf parsing a complex task. I wasnt trying to sound condescending. Would be great to know what makes this so difficult, considering the pdf file format is ubiquitous.

yalue · on May 31, 2023

I'm not anyone involved in this thread (so far), but I've written a minimal PDF parser in the past using something between 1500-2000 lines of Go. (Sadly, it was for work so I can't go back and check.) Granted, this was only for the bare-bones parsing of the top-level structures, and notably did not handle postscript, so it wouldn't be nearly enough to render graphics. Despite this, it was tricky because it turns out that "following the spec" is not always clear when it comes to PDFs.

For example, I recall the spec being unclear as to whether a newline character was required after a certain element (though I don't remember which element). I processed a corpus containing thousands of PDFs to try to determine what was done in practice, and I found that about half of them included the newline and half did not---an emblematic issue where an unclear official "spec" meant falling back to the de facto specification: flexbility.

It's honestly a great example of something a GPT-like system could probably handle. Doable in a single source file if necessary, fewer than 5k lines, and can be broken into subtasks if need be.

jahewson · on May 31, 2023

Having spent considerable time working on PDF parsers I can say that it’s a special kind of hell. The root problem is that Acrobat is very permissive in what it will parse - files that are wildly out of spec and even partially corrupt open just fine. It goes to some length to recover from and repair these errors, not just tolerate them. On top of that PDF supports nesting of other formats such as JPEG and TTF/OTF fonts and is tolerant of similar levels of spec-noncompliance and corruption inside those formats too. One example being bad fonts from back in the day when Adobe’s PostScript font format was proprietary and 3rd parties reverse-engineered it incorrectly and generated corrupt fonts that just happened to work due to bugs in PostScript. PDF also predates Unicode, so that’s fun. Many PDFs out there have mangled encodings and now it’s your job to identify that and parse it.

MrTortoise · on May 31, 2023

TEXT ISNT STORED AS TEXT ITS STORED AS POSTSCRIPT PRINTING INSTRUCTIONS IIRC