The story of the PDF (2018) (vice.com)
138 points by mr_tyzic on Nov 1, 2020 | 91 comments



I'm thankful PDF won, because otherwise I think it would have been Microsoft Word. There was a time when papers, books, resumes, contracts, etc. almost always came as Word. Does anyone else remember getting a book as preface.doc, chap1.doc, chap1a.doc, chap2.doc, subchap2a2.doc, and so on, and a mess of jpegs and gifs and trying to figure out how it had to be assembled, and discovering something was missing, or that one chapter was newer than the others? That's one reason I really like PDF -- it's one file, self-contained, and linear.

On the other hand, I really wish it was more diff'able. If for example a credit card company changes one word in their terms & conditions PDF, it seems like 90% of the document changes at the binary level. I know that PDF diff tools exist, but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure.


PDF objects within the file are usually compressed. That means if anything changes, the whole compressed binary blob changes.

Other than compression and such encodings, PDF files are actually text files, with a drawing model largely based on PostScript but without the programming. If you want to diff them, use `mutool clean -d -a` to first turn them into pure ASCII text.

That said, since it's a "baked" layout format, if one word pushes the rest of the text forward, everything after that will show up with changed coordinates. It's closer to a vector image format like SVG than a markup format like HTML or ODF.

There are also things like font subsetting, where removing a word that was the only use of a character, or adding a word that uses a new character, might change the font data to add/remove those characters.
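The compression point at the top of this comment can be seen with a rough sketch. It assumes zlib as the compressor (PDF's Flate filter is the same DEFLATE family) and uses a made-up `make_terms` helper as a stand-in for a document's text stream; the exact fractions printed will depend on the input.

```python
import zlib


def differing_fraction(a: bytes, b: bytes) -> float:
    """Fraction of byte positions at which a and b differ (a length mismatch counts as differing)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return (longest - same) / longest


def make_terms(word: str) -> bytes:
    # Hypothetical stand-in for the text stream of a terms-and-conditions PDF.
    lines = [f"Clause {i}: the holder owes fee {i * 7919 % 1000}." for i in range(300)]
    lines[10] = lines[10].replace("holder", word)  # change one word in one clause
    return "\n".join(lines).encode("ascii")


before = make_terms("holder")
after = make_terms("member")

print("raw bytes differing:       ", differing_fraction(before, after))
print("compressed bytes differing:", differing_fraction(zlib.compress(before), zlib.compress(after)))
```

The uncompressed streams differ in only a handful of bytes, which is why decompressing first (as with `mutool clean -d -a`) gives a diff tool something to work with.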


"but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure."

Imagine a simple file format that doesn't support text wrapping, but allows you to specify elements as (x, y, s) where (x, y) specify a position, and s is a string that will be written left-to-right, truncated at the edge of the screen.

That's a simple file format, right?

But inserting a word somewhere early in that document would change the string within every element in the rest of the paragraph. And maybe move the y position of every element later in the document.

That would be a PITA to diff. Even more so if the document has more than one column.
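A minimal sketch of that thought experiment, assuming a toy single-column format where each wrapped line becomes one (x, y, s) element. The names `layout` and `changed_elements` and the 20-character width are made up for illustration.

```python
import textwrap


def layout(text: str, width: int = 20) -> list[tuple[int, int, str]]:
    """Bake text into (x, y, s) elements: one element per wrapped line."""
    return [(0, y, line) for y, line in enumerate(textwrap.wrap(text, width))]


def changed_elements(old, new) -> int:
    """Count element slots whose content differs, including added or removed ones."""
    shared = sum(a != b for a, b in zip(old, new))
    return shared + abs(len(old) - len(new))


paragraph = "the quick brown fox jumps over the lazy dog " * 5
before = layout(paragraph)
after = layout("a " + paragraph)  # insert one short word at the very start

print(f"{changed_elements(before, after)} of {len(after)} elements changed")
```

A one-word edit to the source text shows up as changes to many baked elements, so a naive element-by-element diff reports far more than the single insertion.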


The irony is when I get told that people want an application to output PDF instead of Word, because it is read only.

I always get amused by showing those people how to edit PDFs.

It is the same logic by which documents sent by fax are legally binding but the same document sent by email is not.


As someone that does this, it’s because they are harder to change for the average user, and that barrier also means ‘don’t do it’ in a soft sense. If I send a PPT you are almost saying to a client “you can edit this if you want, because I provided it in a format which is designed for editing rather than a format that was designed for view-only”.

99% of the time it stops the “Oh great, so when you took that presentation we prepared for you as a consultancy, you kept our logo on it but changed the content and also removed our caveats!”

Also it stops people seeing my personal notes and comments that I have included throughout the document if it’s a ppt, and it also stops me sharing the data behind any graphs which is normally internally stored in the ppt file. You can set a ppt as view only, but nobody does it and clients don’t like it.


Yes - the dangers of sending a PPT to someone are real... your theme, your logo, someone else’s words. Unless a person is set up specifically to edit PDFs, there are limited tools to do so. Change too many words and the spacing and alignment get off. And it’s annoying.


The right way to do this is via digitally signed PDFs. The signature is invalidated if the document is edited (other than adding a signature).

Disclaimer: I work for Adobe, but not directly on Document Cloud.
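The tamper-evidence idea can be sketched in a few lines. This is only an illustration of "any edit invalidates the signature": it uses a shared-secret HMAC from the Python standard library as a stand-in, whereas real PDF signatures rely on certificate-backed public-key signatures embedded in the file. `SIGNING_KEY`, `sign`, and `is_untouched` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key for the demo; a real signed PDF uses a certificate-backed private key.
SIGNING_KEY = b"demo-key-not-for-real-use"


def sign(document: bytes) -> str:
    """Return a keyed digest over the exact bytes of the document."""
    return hmac.new(SIGNING_KEY, document, hashlib.sha256).hexdigest()


def is_untouched(document: bytes, signature: str) -> bool:
    """True only if the document bytes still match what was signed."""
    return hmac.compare_digest(sign(document), signature)


original = b"Report v1: results are preliminary; see caveats on page 3."
signature = sign(original)
edited = original.replace(b"preliminary", b"final")

print(is_untouched(original, signature))  # True
print(is_untouched(edited, signature))    # False
```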


Wait until you get a PDF which is just a bunch of poorly scanned JPGs and no OCR.


That is why one pays for Adobe Acrobat.


>I'm thankful PDF won, because otherwise I think it would have been Microsoft Word.

Well, probably Microsoft XPS, which was actually a fairly well designed format. But Microsoft didn't have the fight in them to really push it as a competitor to PDF. In part, I suspect b/c it's hard to justify investing a lot of money in your competing document standard as there is not much revenue you can derive from it. As of 2018, Microsoft no longer bundles XPS support in Windows 10.


XPS happened during the Microsoft era, which meant no one really wanted another format dictated by them. So there was very little incentive, interest, or adoption.

XPS ultimately became an Open Standard as Open XML Paper Specification. But the fear and burn during IE era were far too great.


Interestingly it still exists inside the print spooler, it's the default spool format for modern printer drivers.


They also had RTF, which was one of the best formats created by Microsoft.


I don't have an opinion on how good a format RTF is, but I kind of like it.

As part of a Java project (a while ago), I studied the RTF format, partly by reading the spec, and partly through reverse engineering - by creating multiple incrementally larger RTF docs, starting from zero content, then adding a word, then a font style, then a paragraph, a table, etc. And after each addition, opened the RTF in a hex editor and viewed the content, to help decipher the format rules. Then wrote a small RTF generation library in Java, that we used in the project to programmatically generate reports from DB data fetched via EJB. I also provided some ability to vary content and style independently. Good fun.


Just out of curiosity, why the hex editor? Isn’t RTF just ASCII?


You are right, it is just ASCII text. Probably a brain fart there, sorry. I may have said that (hex editor) out of sheer habit of using one to inspect various formats. Or I may have used a text editor, if so TextPad, IIRC, since the project was on Windows (at least dev env was). It was years ago, so not sure.


Aight :)


In what ways is it preferable to ODT (OpenDocument format)?


sidpatil answered (sibling comment). The RTF spec was freely available (on MS's site and/or MSDN CDs then) and should still be around, at least on some sites, since RTF is still used a lot as an exchange format between word processors and even other software. So you can read the spec; it is straightforward.

RTF is to Word like CSV is to Excel. In fact, we generated RTF because the output was to be input to Adobe InDesign.


RTF is a much simpler format than ODT. RTF source code resembles TeX at first glance; ODT is based on XML.

Unfortunately, it's not an open standard AFAICT.


It's not a standard in the sense of an ISO-style organization approval, but RTF has been thoroughly documented by Microsoft for a very long time. [0]

[0] https://interoperability.blob.core.windows.net/files/Archive...


Right, I think it still may be a de facto standard from MS. It was so then.


The main thing for me is I don't need the originating application or fonts. I have twenty year old PDF files created by some long gone software that I can still read.


PDFs act more like images than text. I made a tool for diffing PDFs at the visual level a little while ago (http://parepdf.com) because I needed a way to see the explicit differences between PDFs.

Diffing PDFs at the textual level is a much harder problem though since lines of text need to be reordered and concatenated with each other. Unfortunately there is nothing built into the format that allows you to know what line belongs with what other line beyond guesswork.
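The guesswork involved looks roughly like the sketch below: group text fragments whose vertical positions are close, then read each group left to right. The `Fragment` type, the tolerance value, and the assumption that y grows downward are all choices made for this illustration, not anything the format provides.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float
    y: float
    text: str


def rebuild_lines(fragments: list[Fragment], y_tolerance: float = 2.0) -> list[str]:
    """Group fragments whose y values are close, then read each group left to right."""
    lines: list[list[Fragment]] = []
    # Top of page first; y growing downward is a simplification for this sketch.
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Fragments in arbitrary drawing order, as a PDF is free to emit them.
page = [
    Fragment(120, 50.4, "income"),
    Fragment(10, 80, "Line 2:"),
    Fragment(10, 50, "Line 1:"),
    Fragment(60, 49.8, "Taxable"),
    Fragment(60, 80.3, "Deductions"),
]
print(rebuild_lines(page))  # ['Line 1: Taxable income', 'Line 2: Deductions']
```

A heuristic like this falls apart with multiple columns, rotated text, or superscripts, which is where the guesswork gets hard.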


That's a very nice tool!

I attempted something similar (https://nicediff.com), and found the textual approach to be basically useless:

Tax form example: https://www.nicediff.com/view/7a5f41ba3c76ae9bb45f42a4faa8b6...


We'd probably be using PostScript and maybe later XPS. Word never had a print-oriented format with exact layout.


I don't know of a better alternative to PDF that was around at the time, but I can't say I'm a fan. It undeniably works well as a way of placing pixels precisely on a page but then so does PNG, and PNG is far simpler and compresses better for computer generated content.

Sadly some information I only get as PDFs, so I have to scrape them. Easy, right? It can be, if the PDF is structured sanely. But PDF isn't some well defined data structure for laying out the page, it's a Turing complete stack based computer program that can do whatever it damned well pleases. The font tables don't necessarily have ' '=32, 'A'=65, 'a'=97. Why not optimise it and get rid of all those gaps, so now ' '=0, 'A'=30? And it doesn't have to be drawn in any sane order. It can be just a mess that makes even copy & paste near impossible, and some are.

Did we really need to invent a DSL that has to be executed every time we wanted to view a page? I remember it being pushed as a cool solution at the time. It doesn't look so cool now. SVG would be an improvement.
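The font-table complaint above can be made concrete with a toy example. It assumes a made-up subset encoding that hands out codes in order of first appearance; real PDFs can ship a ToUnicode mapping that plays the role of `table` here, and when that mapping is missing or wrong, extraction yields garbage. The helper names are hypothetical.

```python
def build_subset_encoding(text: str) -> dict[str, int]:
    """Assign codes 0, 1, 2, ... in order of first appearance, with no ASCII gaps."""
    encoding: dict[str, int] = {}
    for ch in text:
        encoding.setdefault(ch, len(encoding))
    return encoding


def encode(text: str, encoding: dict[str, int]) -> bytes:
    """Store the text as the compact codes from the subset encoding."""
    return bytes(encoding[ch] for ch in text)


def decode(data: bytes, encoding: dict[str, int]) -> str:
    """Recover the text by inverting the encoding table."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[code] for code in data)


text = "Total due: 42"
table = build_subset_encoding(text)
stored = encode(text, table)

print(stored.decode("latin-1") == text)  # False: the raw codes are not ASCII
print(decode(stored, table) == text)     # True: the mapping table recovers the text
```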


PNG doesn't support multiple pages and didn't supplant GIF until 2000 or so. TIFF does, but in practice it's always uncompressed (did it even support compression in the 90s?). Neither solution allowed for text blocks, vector zooming, or form fields.

It's not difficult to improve upon a sane subset of PDF, but that would require backing and coordination. Reviving XPS (but not under MS auspices) should also be possible.


PNG is of course an image format, and that means it doesn't really do text well. (Oh, and PDF is fully Turing complete and can even execute JavaScript, so in some contexts calling it a DSL is straining the definition a bit.)


> If for example a credit card company changes one word in their terms & conditions PDF, it seems like 90% of document changes at the binary level.

Convert to text with: pdftotext -layout


> because otherwise I think it would have been Microsoft Word.

No, those are formats with completely different scopes. They don't compete and are essentially non-interchangeable.

> There was a time when papers, books, resumes, contracts, etc. almost always came as Word.

There was never such a time. I mean, sure, you could (and can) send people Word/LibreOffice documents, but things that needed some reproducibility and finality [1] were almost never distributed or published in MS-Word format. PostScript used to be pretty popular though.

[1] - Yes, PDFs can be edited too, I know.


It is a pity that DjVu[0] wasn't even mentioned; an open format that was superior to PDF in many ways[1], including better optimization, efficient storage.

[0] http://djvu.org/ [1] https://en.wikipedia.org/wiki/DjVu


DjVu is a great format for scanned images, which is its primary use-case, but I'm not seeing where you can have actual, selectable text in a DjVu document, like you can with PDF and PostScript. It seems like it's all images.


> 3.3.2 Hidden text

> Every DjVu image optionally includes a hidden text layer that associated graphical features with the corresponding text. The hidden text layer is usually generated by running Optical Character Recognition software. This textual information provides for indexing DjVu documents and copying/pasting text from DjVu page images.

I copied that text from the DjVu spec, which is in the DjVu format.


I have not read the specification, but the DjVu format must have a way to store the plain text besides the images and that way is frequently used.

I do not remember ever reading a DjVu file that did not allow searching and selecting the text, while PDF files which do not allow those, because they store only the scanned images, are quite frequent.


Man, I haven't seen a DjVu file in years. It used to be somewhat common in scans of magazines and other media that relied on images. Pity that it didn't catch on, although I suppose it still could, if some of its benefits were refined. I find that larger PDFs tend to tank optimization, is that a problem for DjVu files at all?


I don't see how DjVu solves vector graphics, which is a pretty important usecase for PDF.


It's crazy that Yann LeCun was involved in the creation.


Yes, Yann and another machine learning celebrity: Leon Bottou!


One issue I have with the "archival" aspect of pdfs discussed in the article, is you are archiving a picture of something, not the blueprint.

So much pain and time will be spent on machine learning models extracting semantic meaning from pdfs that could have been saved if archivers were to also save source formats or machine readable data. But for some reason, publishers have an allergy to submitting those so it's a lost cause.


In the early summer of 1995, the Mac community was fairly small. But it dominated the publishing industry.

At the conference for Macintosh network administrators, we were all super excited about this World Wide Web thing. It had the potential for a whole new paradigm for information publishing, from creation to distribution, for in-house corporate operations or mass media companies; it was a new medium that would make paper obsolete.

The Adobe reps were visibly exasperated by all this. They had solved this problem, years ago. You could click on any element of a PDF, and go to a different place in the current document, or open any other file on your computer. Powerful tools for graphical interactive PDF creation and editing. Even the ability to trigger AppleScript actions in response to mouse or keyboard events...

The Web, by comparison, was primitive and naive. Why was it getting all the attention?


• Because one was proprietary, the other was not.

• Because one was top heavy, the other was not.

• Because one was a document format shared between an application that creates and one that displays, the other was a whole server/protocol/client stack.

• Because one would insist on rigidly paginating its content as output by the generating application, while the other defined content that would be streamed to your client and allow it to adapt the content to your display and reflow its (admittedly primitive-looking) text, et cetera.

• Because one was designed for use within corporations to distribute documents, while the other was intended to allow collective authorship beyond corporate confines and that this consumer/researcher technology would later seep back into the corporate domain and possibly screw up their plans.

Another way to look at it: if the Adobe folks were angsty, irritable, annoyed, or otherwise flustered, it’s probably because they knew (some?) of the above (and perhaps more) and realised that they were going to have a fight on their hands.


HyperCard did some of those things!


There was a period of time when I thought PDF’s days were numbered. That was over a decade ago.

There is now first class support in many applications. I don’t think it’s going anywhere.


It's pretty much essential to the publishing industry. Until we actually stop printing books, we'll be using pdfs.


There's nothing special about PDF. Only reason PDF is useful is because it's easy to convert to/from Postscript and Adobe was pushing out a free viewer for PDFs but not Postscript files. That, and the difference in fees they chose to charge on the formats.


That may be true, but it is so deeply embedded into publishing workflows now it would be hard to dislodge.


Some in the publishing industry have actually moved to HTML - O'Reilly comes to mind.


For online and authoring, sure. If they print the books, it gets converted to a pdf.


Which is what pdfs should be (mostly) restricted to.


https://en.wikipedia.org/wiki/PDF

It became an open standard (ISO 32000) in 2008 so it’s definitely here to stay.


It is a format that doesn't know what it wants to be.

Is it an image?

Text?

Vector graphics?

Electronic forms?

How about all of the above.


It's basically digital paper and since paper can be all those things so can PDFs.


Can paper run ad-tracking analytics scripts though? ;-)


Have a single copy of the paper in a place with a sign in sheet and now you're tracking everyone who looks at it.


can pdf?


Given that PDFs can embed JavaScript, they can embed ad tracking software, yes. And Linux VMs. And a port of WinAmp.


I don’t think any of that is possible with the JavaScript APIs exposed (if any) by common PDF readers. I’ve tried to do useful things with JavaScript in a PDF and failed utterly. https://stackoverflow.com/questions/32597283/can-javascript-...


Paizo uses JS to trigger map layers (for RPG products)


Tracking is one possible problem. The other is, that JavaScript can modify the document itself, so a part of a contract for example might print out differently depending on conditions.

That's why PDF/A for archiving was created which disallows various components, cf. https://en.wikipedia.org/wiki/PDF%2FA


I didn't know they could embed javascript. That's horrifying.


Fortunately many PDF viewers don't run them.



It renders properly. HTML doesn't even know if it wants to be an application or a document.


> there was a period of time when I thought PDF’s days were numbered

And indeed you are correct! A time shall arise, sooner or later in the future, a moment when the last PDF file is created, as well as a moment when a PDF is consulted for the last time. Depending on your definition of format obsolescence, this might be well beyond its expiry date, or it might actually mark the moment of death.

(Let’s forget that the Apple lineage of OSes derived from Display PostScript-using NeXTstep such as OS X [latterly macOS], iOS, iPadOS, watchOS and tvOS all use PDF as a mechanism for drawing primitive sources onto the screen.)

Anyway... after that long preamble, statements like these remind me very much of Goldfinger’s famous quote, and in honour of Sean Connery’s passing yesterday I will allow myself to elucidate:

Bond: “Do you expect me to talk?” Goldfinger: “I expect you to die!”

The latter being a very reliable expectation, but one that can sometimes take a lot longer to come true than the utterer might have in mind when they make the assertion.


Original article, without so many obnoxious ads: https://tedium.co/2018/02/27/pdf-file-format-history/



PDF has been bad news, as it embodies assumptions from an earlier age: how paper works.

I want to read flowable text that adapts to my screen and my size needs. I want to be able to reliably select and extract text. I don’t need something that apes an archaic IO system (printer+paper) with all its flaws and, when on screen, none of its advantages.


Agreed. Adobe's recently-announced [1] Liquid Mode for mobile devices is a step in the right direction.

1: https://techcrunch.com/2020/09/23/adobes-liquid-mode-uses-ai...


Seems to be a feature of their reader rather than an improvement of the format. I am unsure if that is actually in the right direction


That is my understanding also. It would certainly be better if it were part of the format, but I imagine they're very concerned with backward compatibility. So perhaps this is the best we can hope for from Adobe.


I use PDFs for data sheets. The last thing I want is flowable text. Also, 25 years on, selecting and extracting text from HTML is still hot garbage.


I still use lots of paper and PDF is the ideal format for it.

There are other formats for flowable text in screens.


The difficulty is that people don't worry about it and turn everything into PDF, even if it shouldn't. How many scientific papers are actually printed out for reading? Yet the huge majority of them are published as nothing but PDFs. Good luck reading one on mobile.


Most of them, conference proceedings by Springer and friends are still a thing.

I have Acrobat and Moon+ Reader on my mobile.



This has something of a misleading argument in it in the form that PDF is the "basis" for document world. PDF is not the basis.

Lemme explain: for each format there is a basis and there is the most used format. For sound that's .WAV / .MP3; for pictures that's .BMP / .JPEG (or .PNG if you're a purist).

And for documents that's .RTF / .PDF. You see a PDF is not the absolute basis, it's just the most convenient trade between usability and fidelity. Nobody except snobs wants pure .WAV files for their preferred songs and everybody uses .MP3 instead. If you want the absolute purest form of a document, you use .RTF

My 2 cents.


The analogy is not particularly apt.

For one thing, PDF can encode information that rtf cannot.

There are also lots of approaches to document layout (the underlying descriptions of what should appear, not just different styles).

Pedantically, the analogy works somewhat better for postscript than rtf, but not really, except maybe the bmp->png part.


if only .PDFs could easily be converted back to a useful raw format. parsing them is a bloody minefield, irregularly stuffed with proprietary metadata galore


PDF is a printing format, not an editing format - hence the trouble when you want to convert it back to an editable document. It's the same as going back from .JPEG to .BMP, you'll never get back your original pixels.


yes but unlike .bmp -> .jpg compression is optional. you can display exactly the same content and layout in e.g. HTML, but there is no standard to govern or reverse this


pdftotext -layout


Sometimes works well, depending on the structure and content of the PDF. Other times it's hopeless.

Certainly not a general solution. Indeed, there isn't one, because the design of PDF allows far too many things that can't be reliably deciphered back to the source data.

That's why Adobe is throwing all their ML at it, to try and come up with something that guesses near enough right more of the time.


as with the hundreds of other converters, it probably will produce varying results


Even a snob wouldn't want a WAV file; FLAC is lossless.


Sometimes it's not about preferences but about what's most widely supported. For example, my Octatrack only supports wav/aiff files so that's what I'm stuck with.


The only problem with PDFs is that they are misused.

They are amazing at exactly reproducing a printed document, and far superior to a jpg at doing that because it is vector, searchable, can contain links, etc

If you've ever tried to read a math textbook in ebook format on an iPad and then switched to PDF, you can see how PDF shines.


When I first got a computer magazine in PDF format in the late 90s, I knew this was going to be the future. It looked so slick on my CRT monitor, and I kept looking at pages, zooming in, zooming out, just for the sake of it. Whenever I open a PDF file my mind goes back in time and relives these moments of joy.


Man, this reminded me of XPS (https://en.m.wikipedia.org/wiki/Open_XML_Paper_Specification), which I haven't thought about in ten years. Glad it never won.


Anyone have a recommendation for a good FOSS PDF reader for Linux?


zathura with the mupdf engine, evince (if you can stomach poppler's speed), sumatrapdf inside wine.



