Unit Testing PDF Generation (nibblestew.blogspot.com)
71 points by ingve on Feb 27, 2023 | 39 comments



This is more of an integration test than a unit test. And if you're going to test for a pixel-perfect image match, why not check for full equality with a pre-existing PDF file, byte for byte? But then what are you testing? That something has changed? You'd likely know that the output was going to change, so to fix the broken test you'd use the failure result to create the new comparison file. And if you're always going to use the failure output as an input for correcting the test, what is the point? "Don't test that the code is like the code" is a similar principle.


It's more like "golden" or "snapshot" testing.

These are very common for web apps, because at the end of the day you don't care about the actual HTML & CSS, only how they are rendered.

> This is more of an integration test than a unit test.

That's debatable. An integration test generally tests two or more systems. This kind of test has two systems, the generator and the renderer, and we care about the output of the renderer, so it kind of looks like an integration test. However, in an integration test you also have control over the implementation of both systems; a regression can be in any of them. That's not true in snapshot tests: the renderer is a given. If the test fails, it's very unlikely to be due to a regression in the renderer. So in that sense you are really only testing a single component (the generator), hence it is more like a unit test.


I was also going to mention snapshot tests. But since you beat me to it,

> because at the end of the day you don't care about the actual HTML & CSS, only how they are rendered

It really depends on what you’re testing. I’ve generally been skeptical of this kind of test, similarly to OP, because “nothing changed” vs “update snapshots” feels intuitively low value to me.

Despite all that, I recently added a slew of snapshots (literally >1M lines, yikes) along with a custom snapshot serializer. For this use case I do care about the HTML (and XML), because those are the project’s primary responsibilities. The custom serialization slightly relaxes the snapshot value from “nothing changed” to accept known-insignificant changes: it collapses runs of whitespace, sorts attributes alphabetically because their order doesn’t matter, and trims their values because no downstream users are that pedantic about leading/trailing attribute whitespace. Everything else is treated as an API contract violation. There are some project-specific details which might result in additional custom serialization logic and new snapshots, where downstream users are expected to treat certain markup values as semantically equivalent to their representation in another output property.
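
A rough sketch of that kind of normalization in Python (a hypothetical helper, not the project's actual serializer):

    import re
    import xml.etree.ElementTree as ET

    def normalize_snapshot(markup: str) -> str:
        """Relax a snapshot: collapse whitespace runs, sort attributes
        alphabetically, and trim attribute values, so only meaningful
        differences break the comparison."""
        root = ET.fromstring(markup)
        for el in root.iter():
            # Sort attributes and strip leading/trailing whitespace from values.
            items = sorted((k, v.strip()) for k, v in el.attrib.items())
            el.attrib.clear()
            el.attrib.update(items)
            # Collapse runs of whitespace in text content.
            if el.text:
                el.text = re.sub(r"\s+", " ", el.text)
            if el.tail:
                el.tail = re.sub(r"\s+", " ", el.tail)
        return ET.tostring(root, encoding="unicode")

    # Two snapshots that differ only in attribute order and whitespace
    # now serialize identically:
    assert normalize_snapshot('<p   class="x"  id="y">hello\n  world</p>') == \
           normalize_snapshot('<p id="y" class="x">hello world</p>')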

Allllllll of that is a really long way around the barn to get to: these snapshot tests are more valuable than “exactly equal” comparison specifically because there’s a known and finite set of things that can fluctuate and a lot of caution around accepting anything into that category. And adding them at all, with known flexibilities, provides value because the underlying library has very high expectations for stability. It’s very unlikely they’ll ever be updated for any change which isn’t either additive or strategic.

(And the reason they were added in the first place was to allow for much needed performance improvements and refactors to proceed with high confidence that they’re safe. Since adding them, the project’s performance monitoring charts needed to run for a period of time to crop to a whole new Y axis range, and a refactor is in review which will enable it to run in client environments without need for server deployments. None of this would have been reasonable by my team’s standards or my own without a large body of evidence that it didn’t introduce regressions. And we very well may retire it after this exercise!)


Check out the "characterization test" article:

https://en.wikipedia.org/wiki/Characterization_test


This seems particularly pertinent:

> Traditional tests check individual properties (whitelists them), where characterization testing checks all properties that are not removed (blacklisted).

And I’d encourage anyone who finds snapshot testing appealing but problematic to consider this approach.


Bravo! Excellent summary


This is what I do

I have ca. 190 test cases on which I run my software and compare the MD5 sums of the resulting PDFs. If they are not the same, I create a PNG for every page and compare visually with ImageMagick.

The trick is to remove all random stuff from the PDF (like ID generation or such).

This takes about 3 seconds on an M1 Pro laptop. I think this is very much okay.

Links: https://github.com/speedata/publisher/tree/develop/qa (the tests) https://github.com/speedata/publisher/blob/develop/src/go/sp... (the Go source code for the comparison)
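
The gist of that flow, as a simplified Python sketch (the real implementation is the linked Go code; file names and paths here are placeholders):

    import hashlib
    import subprocess
    from pathlib import Path

    def md5_of(path: Path) -> str:
        return hashlib.md5(path.read_bytes()).hexdigest()

    def compare_pdfs(expected: Path, actual: Path, outdir: Path) -> bool:
        # Fast path: byte-identical PDFs. This only works because the
        # non-deterministic parts (like ID generation) are stripped or
        # fixed at generation time.
        if md5_of(expected) == md5_of(actual):
            return True
        # Slow path: rasterize every page with Ghostscript and diff the images.
        outdir.mkdir(exist_ok=True)
        for name, pdf in (("expected", expected), ("actual", actual)):
            subprocess.run(
                ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m", "-r150",
                 f"-sOutputFile={outdir}/{name}-%03d.png", str(pdf)],
                check=True)
        ok = True
        for exp_png in sorted(outdir.glob("expected-*.png")):
            act_png = outdir / exp_png.name.replace("expected", "actual")
            # ImageMagick's `compare` exits non-zero when the images differ;
            # the diff image highlights changed pixels for manual inspection.
            result = subprocess.run(
                ["compare", "-metric", "AE", str(exp_png), str(act_png),
                 str(outdir / ("diff-" + exp_png.name))])
            if result.returncode != 0:
                ok = False
        return ok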


These are typically called smoke tests, and can be valuable for regression testing of third party libraries you depend on.

An alternate approach: generate the PDF, then run it through a PDF reader library to scrape the text out and ensure it is there.
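
A minimal sketch of that, using pypdf as one example of a reader library (the helper name and strings are illustrative):

    from pypdf import PdfReader

    def assert_pdf_contains(pdf_path: str, *expected: str) -> None:
        # Scrape the text layer out of every page and check that the
        # expected snippets are present somewhere in it.
        text = "".join(page.extract_text() for page in PdfReader(pdf_path).pages)
        for snippet in expected:
            assert snippet in text, f"missing from {pdf_path}: {snippet!r}"

    # e.g. after the generator under test has written the file:
    # assert_pdf_contains("invoice.pdf", "Invoice #42", "Total: $99.00")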


Your approach will completely miss big changes like missing pictures, broken layout, missing backgrounds and other breakage in rendering. It will also miss text which isn't embedded as a text layer.


Of course. It was meant for the sake of argument, not as an omnibus approach to comprehensively testing PDFs. :)


This sort of test can be useful when you change things under the hood in such a way that the output shouldn’t have changed.


> why not check for full equality with a pre-existing PDF file, byte for byte?

This was what I built once to do unit tests on a PDF generator. The use case: I was working for a software vendor at a very important financial client. Our software was used in their research group to produce output (massive reports) which fed into their trading decisions, and it had been heavily customized by the client in ways that our dev team couldn’t test (the client had a very exacting internal confidentiality regime, and these reports and the various code customizations required to generate them were heavily restricted). We were trying to upgrade our software by two major versions and make sure everything was still going to run correctly in spite of the massive upgrade.

Each report contained >100 pages with 9 widgets per page, and all the data going into these widgets was restricted, as were the calculations and outputs themselves. So we needed to somehow prove that the new system would generate these reports exactly the same as the previous system, in spite of the fact that everything had changed under the hood and we couldn’t see either the old or the new reports.

What we decided to do was treat the entire system as a big black-box test. I built a JUnit harness that would generate the PDF of each of a set of reports using the old system and the new system and then diff the output. Initially we literally just used a normal text diff on the PDF file, but once we had fixed the first few (hundred) differences we refined it to snapshot both PDFs to images and produce diff images, because that made it easier to find and fix the problems.

It was a very painful process but an extremely effective test because the reports were the end product of the entire system and nailing all the differences in the reports proved that all the input data, all the calculated outputs and everything else was working correctly. It ended up with us doing the upgrade successfully.


At a previous job, we created a PDF visual diff tool for this. In automated tests, we could look for either red (present in sample but not test output) or green (not present in sample, but present in test output) to fail a test, or issue an automated change approval request.
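
Roughly the idea, sketched with Pillow (names are placeholders, not the actual tool; assumes both renders are the same size): red marks ink present in the sample but missing from the test output, green marks the reverse.

    from PIL import Image

    def visual_diff(sample_path: str, output_path: str, diff_path: str) -> bool:
        sample = Image.open(sample_path).convert("L")
        output = Image.open(output_path).convert("L")
        diff = Image.new("RGB", sample.size, "white")
        changed = False
        for x in range(sample.width):
            for y in range(sample.height):
                s, o = sample.getpixel((x, y)), output.getpixel((x, y))
                if s < 128 <= o:      # ink in the sample, missing from the output
                    diff.putpixel((x, y), (255, 0, 0))
                    changed = True
                elif o < 128 <= s:    # ink in the output that the sample lacks
                    diff.putpixel((x, y), (0, 255, 0))
                    changed = True
        diff.save(diff_path)
        return changed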


I’ve never seen a clear distinction between unit tests and integration tests. If you have a black box, “F”, with input/output pairs you want it to replicate, you encode these and call them “tests of ‘F’”. Why have different names for whether “F” is simple or complex?


They both revolve around a coherent concept of what a 'Unit' is: if you have a (shared, project-level) understanding of a Unit, then a 'Unit test' is what tests it, and an 'Integration test' involves more than one Unit.


I find the distinction between the two extremely frustrating.

Some people act like there's an obvious definition, and maybe there is if you're doing pure TDD Java as described in one specific textbook... but in my experience most developers can't provide a good explanation of what a "unit" is.

And those that do... often write pretty awful tests! They mock almost everything and build tests that do very little to actually demonstrate that the system works as intended.

So I just call things "tests", and try to spend a lot more time on tests that exercise end-to-end functionality (which some people call "integration tests") than on tests that operate against one single little function.


Unit tests test a component in isolation; integration tests test a component when it's connected to something else.


A not-terrible solution is to convert to PNG and check pixels up to some threshold, e.g., that the average squared pixel diff doesn't exceed 1e-4. You can also perform this test over windows to get a finer-grained view.
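
A sketch of that check with NumPy (the threshold is the one from the comment; assumes both pages were rendered to same-sized PNGs):

    import numpy as np
    from PIL import Image

    def pages_match(expected_png, actual_png, threshold=1e-4):
        a = np.asarray(Image.open(expected_png), dtype=np.float64) / 255.0
        b = np.asarray(Image.open(actual_png), dtype=np.float64) / 255.0
        # Average squared pixel difference over the whole page.
        return np.mean((a - b) ** 2) <= threshold

    def windows_match(expected_png, actual_png, window=64, threshold=1e-4):
        # Finer-grained variant: the same check per window, so a localized
        # glitch can't hide inside a page-wide average.
        a = np.asarray(Image.open(expected_png), dtype=np.float64) / 255.0
        b = np.asarray(Image.open(actual_png), dtype=np.float64) / 255.0
        for y in range(0, a.shape[0], window):
            for x in range(0, a.shape[1], window):
                if np.mean((a[y:y+window, x:x+window]
                            - b[y:y+window, x:x+window]) ** 2) > threshold:
                    return False
        return True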


> if you're always going to use the failure output as an input for correcting the test

Sometimes it's useful to know that something changed.


It is extremely hard to make two PDFs have the same output binary, especially on CI vs. locally.


This can get very difficult, especially with pages that are more than just text and images. Lines, interactive content, optional layers, annotations, embedded content, blend-mode transparencies: all of this and more makes things complex.

The real problem is that reading a PDF is vastly more complex than writing a PDF.

The spec (1000+ pages) is open to interpretation, and different readers interpret it differently. A page that renders perfectly in Adobe may look different when viewed in Firefox, Chrome, or Ghostscript.


It's interesting how different people's use of testing terminology is across teams/companies/professions. Vocabulary is standardized by various ISO standards, ASQ, and ISTQB so we can all share the same language; then we don't have to debate what integration/unit/smoke/component/regression/golden/snapshot testing means.


Isn’t testing the physical generation of a PDF more aligned with an “integration” test than with unit testing? Testing the API that makes the PDF is ok, but testing like this post suggests, with bitwise comparison, is integration testing, no?


Is naming these tests a seriously useful thing to bikeshed on?


There is a distinct and meaningful difference between unit tests and integration tests. flandish is not bikeshedding.

Unit tests are about testing a single unit in isolation. Integration tests are about testing the integration of multiple units.

With unit tests, the industry's general attitude is that there should be no side effects, such as reading/writing to databases or the disk. Side effects are generally embraced for integration tests, on the other hand.

As a result, unit tests are mostly useful for "pure" functions, ones where the output is 100% derived from the input, regardless of any state external to the function. (Such as database records.) However, a large portion of the industry hasn't realized this and so you get millions of lines of dependency-injected unit tests that really don't provide much value in terms of catching actual bugs. (If these tests were integration tests, they'd catch actual bugs 10x more often.)

A unit test for generating a PDF will not actually involve writing a PDF to disk. An integration test, however, might.

So as I said, this isn't bikeshedding. ;-)


…yes. Because different energy, documentation, and sometimes entire groups of people are involved in different phases.

It’s not always a single 100x-elite-monster-drinking coder cranking out monoliths in a silo.

I have a hard enough time with project management getting it wrong:

- testing an API’s public methods is far “faster” than testing how files are made on different procs or fstabs..

- that translates to silly Gantt charts…

You get the idea.


Well, the blog post could have just called it a "test", and nobody would bikeshed it.


Yeah, the blog author accidentally triggered one of the fundamental laws of the internet:

"The best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer."

Also known as Cunningham's Law.


Having well-defined terms, and using them well, is essential to any type of engineering. I don't know why aiming for precise terminology is controversial only in software engineering.


> Isn’t testing the physical generation of a PDF more aligned with an “integration” test than with unit testing? Testing the API that makes the PDF is ok, but testing like this post suggests, with bitwise comparison, is integration testing, no?

The fact that it writes the PDF out to a file potentially makes it an integration test, but I don't think the rendering aspect does. The poster is not testing the integration of the tool with Ghostscript; rather, Ghostscript is simply used as an oracle for verifying the result. The only thing actually tested is the original a4pdf API, but some way of verifying the resulting PDF was needed, which is what Ghostscript accomplishes. Effectively it's no different from a fancy assertion.


I have a much more liberal view of what constitutes a unit test: everything that can be run inside a single container is a unit test. Writing files? Unit test. Using databases? As long as that database is started by the test fixture in the same container and destroyed along with the container, still a unit test.

Of course, if your test needs a database, the natural follow-up question is whether it can populate the database with data known at build time, or whether it needs to reach out to get some realistic-looking data. Only the latter makes it an integration test.


I reckon so. It could align nicely with a mock fs, I suppose.

But if differences in fs or architecture are crucial, the real proof is in the integration.


We do something similar, but in my experience small changes, like fonts or lines rendering a tad differently after library changes, can be quite frequent. They're usually changes you can't really see unless you compare them as two layers in paint.net or something.

Adding something like an error margin for all pixels or subsections sometimes makes sense, but this can be tricky. Downscaling the image and comparing grayscale values with a small error margin is another option. It all depends on how accurate your tests have to be.
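
The downscale-and-compare variant, sketched with Pillow (the target size and per-pixel margin are illustrative, not tuned values):

    from PIL import Image, ImageChops

    def roughly_equal(expected_png, actual_png, size=(200, 283), margin=8):
        a = Image.open(expected_png).convert("L").resize(size)
        b = Image.open(actual_png).convert("L").resize(size)
        # Tiny rendering shifts mostly vanish in the downscaled grayscale
        # image; anything that survives beyond the margin fails the test.
        diff = ImageChops.difference(a, b)
        return max(diff.getdata()) <= margin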


Well, but those changes are triggered by something, aren't they? So when you upgrade your font lib or PDF rendering library, you're warned that you're now generating different output and can update the golden set.

Your dependencies aren't changing without a cause, are they?


Yeah, sure, it just starts to be a problem when you have dozens of tests failing because of small rendering changes which can be ignored. Someone still has to look at all the test output, compare it to the old state and update the tests with the new state. In our case this happened quite a lot.

This is not an issue at first, but the more you use tests like this and the more people work with your code, the more false positives start to drag you down.


A PDF may have a generation date etc.; it's much better to use OCR and compare strings.


No, at the end of the day the proposed approach of rendering to an image and comparing pixels is best. Things can go wrong graphically that OCR won't catch, like an entire background color missing or an image missing.

If you're worried about a generation date in the margin, then compare inside a bounding box that includes most of the page but not that margin. Or, even better, just use a fixed date for the test, since otherwise you've got to be careful about running the test within a few seconds of midnight anyway.


The example here is drawing a red rectangle, so OCR won’t do anything.



