"You either die a hero, or you live long enough to see yourself become the villa...

sillysaurusx · on April 18, 2021

Many of the problems could’ve been avoided if PDFs embedded their own source (e.g. the TeX that generated them), much like a website has a View Source option.

Alas.

aardvark179 · on April 18, 2021

PDF is not like HTML; it solves a different problem. Viewing the source the way you can with a webpage would just show the long list of drawing commands used to render the document, because that is what a PDF represents.
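To make "a long list of drawing commands" concrete, here is a rough sketch in Python. The content stream is hand-written for illustration (it is not taken from any real file), and the whitespace tokenizer is a simplification that would break on strings containing spaces. Real page streams are typically Flate-compressed, which is why the sketch round-trips through `zlib`.

```python
import zlib

# Hand-written example of a page content stream (an assumption for
# illustration): select a font, move the text cursor, show two strings.
raw = b"BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (world) Tj ET"
compressed = zlib.compress(raw)  # real streams are usually Flate-compressed

def show_operators(stream: bytes) -> list[str]:
    """Decompress a Flate-encoded content stream and list its operators.

    Only the handful of text operators used above are recognized; a real
    parser would need a proper tokenizer for strings, arrays and dicts.
    """
    text = zlib.decompress(stream).decode("latin-1")
    operators = {"BT", "ET", "Tf", "Td", "Tj"}
    return [tok for tok in text.split() if tok in operators]

ops = show_operators(compressed)
print(ops)
```

Nothing in that output says "paragraph" or "heading"; it is just font selection, cursor moves and glyph runs.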

Embedding the source document that generated a PDF seems pointless unless you want to handle hundreds of formats and the processes that might have turned those documents into PDFs. In the GIS software I’ve worked on, the closest you could get is an XML document specifying the viewport and visibility, but all the interesting stuff would have been read from the database and pushed through a style system that you wouldn’t know anything about.

What might have been useful is a meta layer that could tie lines of text together into blocks so that less guesswork had to be done when implementing things like text selection and screen reading. We have tricks like adding blank fonts which allow for text selection on OCRed documents, but they could probably have been improved if they had been baked in more from the start.
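As an illustration of the guesswork involved, here is a minimal sketch of the kind of heuristic a viewer might use to tie lines into blocks. The `Line` type, the `group_into_blocks` helper and the 16-point threshold are all made up for this example; real implementations also have to consider columns, font sizes and reading order.

```python
from dataclasses import dataclass

@dataclass
class Line:
    x: float
    y: float   # baseline in PDF coordinates (origin at bottom-left)
    text: str

def group_into_blocks(lines, max_gap=16.0):
    """Group lines into blocks purely by vertical spacing.

    Lines are visited top to bottom; a line joins the current block when
    its baseline is within max_gap points of the previous line's baseline.
    """
    blocks = []
    for line in sorted(lines, key=lambda l: -l.y):
        if blocks and blocks[-1][-1].y - line.y <= max_gap:
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return blocks

lines = [
    Line(72, 720, "First paragraph,"),
    Line(72, 706, "continued."),
    Line(72, 670, "Second paragraph."),
]
blocks = group_into_blocks(lines)
grouped = [[l.text for l in block] for block in blocks]
print(grouped)
```

A threshold like this works until the document changes font size or uses two columns, which is the sort of thing explicit block metadata would avoid.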

teddyh · on April 18, 2021

LibreOffice Writer has an option to do that when you export a document to PDF; i.e. it has a “Hybrid PDF” option which embeds the ODF document in the PDF.

twoWhlsGud · on April 18, 2021

Has anyone tried the relatively new "tagging" features, which let you indicate the structure of the document?

https://en.wikipedia.org/wiki/PDF#Logical_structure_and_acce...

Wondering whether these are usable and how far they go toward fixing the problem.
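For context, as far as I understand it, tagged PDFs wrap runs of page content in marked-content sequences (`BDC` ... `EMC`) carrying an `MCID`, which the document's structure tree then refers to. The sketch below pulls those groups out of a hand-written stream; the stream, the regexes and the `tagged_runs` helper are assumptions for illustration only, and a real reader would resolve the MCIDs through the structure tree rather than trust the tag name in the stream.

```python
import re

# Hand-written tagged content stream (illustrative, not from a real file).
stream = (
    "/H1 <</MCID 0>> BDC BT /F1 18 Tf 72 720 Td (Title) Tj ET EMC\n"
    "/P <</MCID 1>> BDC BT /F1 12 Tf 72 690 Td (First line) Tj "
    "0 -14 Td (second line) Tj ET EMC\n"
)

# One marked-content sequence: tag name, MCID, then everything up to EMC.
SEQUENCE = re.compile(r"/(\w+)\s*<<\s*/MCID\s+(\d+)\s*>>\s*BDC(.*?)EMC", re.DOTALL)
# A literal string shown with Tj (ignores escapes and nested parentheses).
SHOWN_TEXT = re.compile(r"\((.*?)\)\s*Tj")

def tagged_runs(content: str):
    """Return (tag, mcid, [strings]) for each marked-content sequence."""
    return [
        (tag, int(mcid), SHOWN_TEXT.findall(body))
        for tag, mcid, body in SEQUENCE.findall(content)
    ]

runs = tagged_runs(stream)
print(runs)
```

Here the two lines of the paragraph arrive already grouped under one tag, with no spacing heuristics involved. How well that holds up depends on whether the producing software tags its output properly.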