This is more of an integration test than a unit test. And if you're going to test for a pixel-perfect image match, why not check for full equality with a pre-existing PDF file, byte for byte? But then what are you testing? That something has changed? You'd likely know that the output was going to change, so to fix the broken test you use the failure result to create the new comparison file. And if you're always going to use the failure output as the input for correcting the test, what is the point? "Don't test that the code is like the code" is a similar principle.
These are very common for web apps, because at the end of the day you don't care about the actual HTML & CSS, only how they are rendered.
> This is more of an integration test than a unit test.
That's debatable. An integration test generally tests 2 or more systems. This kind of test has 2 systems, the generator and the renderer, and we care about the output of the renderer, so it kind of looks like an integration test. However, in an integration test you also have control over the implementation of both systems; a regression can be in any of them. That's not true in snapshot tests: the renderer is a given. If the test fails, it's very unlikely to be due to a regression in the renderer. So in that sense you are really only testing a single component (the generator), hence it is more like a unit test.
I was also going to mention snapshot tests. But since you beat me to it,
> because at the end of the day you don't care about the actual html & CSS, only how they are rendered
It really depends on what you’re testing. I’ve generally been skeptical of this kind of test, similarly to OP, because “nothing changed” vs “update snapshots” feels intuitively low value to me.
Despite all that, I recently added a slew of snapshots (literally >1M lines, yikes) along with a custom snapshot serializer. For this use case I do care about the HTML (and XML), because those are the project’s primary responsibilities. The custom serialization slightly relaxes the snapshot from “nothing changed” to accept known-insignificant changes: it collapses runs of whitespace, sorts attributes alphabetically because their order doesn’t matter, and trims attribute values because no downstream users are that pedantic about leading/trailing whitespace. Everything else is treated as an API contract violation. There are some project-specific details which might call for additional custom serialization logic and new snapshots, where downstream users are expected to treat certain markup values as semantically equivalent to their representation in another output property.
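For concreteness, here is a rough sketch of that normalization idea in Python for well-formed XML; the real serializer is the project's own code and also handles HTML, so the names and details below are illustrative only:

```python
import re
import xml.etree.ElementTree as ET

def normalize_markup(markup: str) -> str:
    """Relax the snapshot: collapse runs of whitespace, sort attributes
    alphabetically, and trim leading/trailing whitespace from attribute
    values. Anything this doesn't touch still has to match exactly."""
    root = ET.fromstring(markup)
    for elem in root.iter():
        # Rebuild the attribute dict in sorted order with trimmed values;
        # ElementTree serializes attributes in insertion order.
        items = sorted((k, v.strip()) for k, v in elem.attrib.items())
        elem.attrib.clear()
        elem.attrib.update(items)
        # Collapse whitespace in text content.
        if elem.text:
            elem.text = re.sub(r"\s+", " ", elem.text)
        if elem.tail:
            elem.tail = re.sub(r"\s+", " ", elem.tail)
    return ET.tostring(root, encoding="unicode")

# The snapshot assertion then compares normalized forms:
#   assert normalize_markup(actual) == normalize_markup(stored_snapshot)
```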
Allllllll of that is a really long way around the barn to get to: these snapshot tests are more valuable than “exactly equal” comparison specifically because there’s a known and finite set of things that can fluctuate and a lot of caution around accepting anything into that category. And adding them at all, with known flexibilities, provides value because the underlying library has very high expectations for stability. It’s very unlikely they’ll ever be updated for any change which isn’t either additive or strategic.
(And the reason they were added in the first place was to allow for much needed performance improvements and refactors to proceed with high confidence that they’re safe. Since adding them, the project’s performance monitoring charts needed to run for a period of time to crop to a whole new Y axis range, and a refactor is in review which will enable it to run in client environments without need for server deployments. None of this would have been reasonable by my team’s standards or my own without a large body of evidence that it didn’t introduce regressions. And we very well may retire it after this exercise!)
> Traditional tests check individual properties (whitelists them), where characterization testing checks all properties that are not removed (blacklisted).
And I’d encourage anyone who finds snapshot testing appealing but problematic to consider this approach.
I have ca. 190 test cases on which I run my software and compare the md5 sums of the resulting PDFs. If they are not the same, I create a PNG for every page and compare visually with ImageMagick.
The trick is to remove all random stuff from the PDF (like ID generation or such).
This takes about 3 seconds on the M1 Pro laptop. I think this is very much okay.
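A rough Python sketch of that workflow, assuming the nondeterministic bits are the /ID array and the creation/modification dates (your generator may differ) and that Ghostscript is available for the rasterization step:

```python
import hashlib
import re
import subprocess
from pathlib import Path

def stable_md5(pdf_path: Path) -> str:
    """md5 of the PDF with the usual nondeterministic bits stripped.
    Which bits those are depends on the generator; /ID and the
    creation/modification dates are assumptions here."""
    data = pdf_path.read_bytes()
    data = re.sub(rb"/ID\s*\[[^\]]*\]", b"", data)
    data = re.sub(rb"/(Creation|Mod)Date\s*\([^)]*\)", b"", data)
    return hashlib.md5(data).hexdigest()

def check_case(generated: Path, golden_hash: str) -> bool:
    if stable_md5(generated) == golden_hash:
        return True
    # On mismatch, rasterize every page so a human can compare the PNGs
    # (for example with ImageMagick's `compare`). Ghostscript shown here;
    # pdftoppm works just as well.
    subprocess.run(
        ["gs", "-dBATCH", "-dNOPAUSE", "-r150", "-sDEVICE=png16m",
         f"-sOutputFile={generated.stem}-%03d.png", str(generated)],
        check=True,
    )
    return False
```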
Your approach will completely miss big changes in rendering: missing pictures, broken layout, missing backgrounds, and other breakages, as well as missing text that isn't embedded as a text layer.
> why not check for full equality with a pre-existing PDF file, byte for byte?
This is what I built once to do unit tests on a PDF generator. The use case: I was working for a software vendor at a very important financial client. Our software was used in their research group to produce output (massive reports) which fed into their trading decisions, and it had been heavily customized by the client in ways that our dev team couldn’t test (the client had a very exacting internal confidentiality regime, and these reports and the various code customizations required to generate them were heavily restricted). We were trying to upgrade our software by 2 major versions and make sure everything was still going to run correctly in spite of the massive upgrade.
Each report contained >100 pages of 9 widgets per page and all the data going into these widgets was restricted as well as the calculations and outputs themselves. So we needed to somehow prove that the new system would generate these reports exactly the same as the previous system in spite of the fact that everything had changed under the hood and we couldn’t see either the old or new reports.
What we decided to do was treat the entire system as a big black-box test. I built a JUnit harness that would generate the PDF of each of a set of reports using the old system and the new system and then diff the output. Initially we literally just used a normal text diff on the PDF files, but once we had fixed the first few (hundred) differences we refined it to snapshot both PDFs to images and produce diff images, because that made it easier to find and fix the problems.
It was a very painful process but an extremely effective test because the reports were the end product of the entire system and nailing all the differences in the reports proved that all the input data, all the calculated outputs and everything else was working correctly. It ended up with us doing the upgrade successfully.
At a previous job, we created a PDF visual diff tool for this. In automated tests, we could look for either red (present in sample but not test output) or green (not present in sample, but present in test output) to fail a test, or issue an automated change approval request.
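The tool itself was internal, but the red/green idea is easy to sketch with Pillow; the channel assignment and the ink-is-dark-on-light assumption below are mine, not the original tool's:

```python
from PIL import Image, ImageChops

def red_green_diff(sample_png: str, output_png: str, diff_png: str) -> bool:
    """Red marks ink present in the sample but missing from the test
    output; green marks ink present in the output but not in the sample.
    Assumes dark ink on a light background."""
    sample = Image.open(sample_png).convert("L")
    output = Image.open(output_png).convert("L")
    missing = ImageChops.subtract(output, sample)  # sample darker -> bright
    added = ImageChops.subtract(sample, output)    # output darker -> bright
    diff = Image.merge("RGB", (missing, added, Image.new("L", sample.size, 0)))
    diff.save(diff_png)
    # A test can fail (or open a change-approval request) if either
    # channel contains anything at all.
    return missing.getbbox() is None and added.getbbox() is None
```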
I’ve never seen a clear distinction between unit tests and integration tests. If you have a black box, “F”, with input/output pairs you want it to replicate, you encode these and call them “tests of ‘F’”. Why have different names for whether “F” is simple or complex?
They both revolve around a coherent concept of what a 'Unit' is: if you have a (shared, project-level) understanding of a Unit, then a 'Unit test' is what tests one, and an 'Integration test' involves >1 Unit.
I find the distinction between the two extremely frustrating.
Some people act like there's an obvious definition, and maybe there is if you're doing pure TDD Java as described in one specific textbook... but in my experience most developers can't provide a good explanation of what a "unit" is.
And those that do... often write pretty awful tests! They mock almost everything and build tests that do very little to actually demonstrate that the system works as intended.
So I just call things "tests", and try to spend a lot more time on tests that exercise end-to-end functionality (which some people call "integration tests") than on tests that operate against one single little function.
A not-terrible solution is to convert to PNG and check pixels up to some threshold, e.g., the average squared pixel difference doesn't exceed 1e-4. You can also perform this test over windows to get a finer-grained view.
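Something like this minimal Python/Pillow/NumPy sketch, where the threshold and window size are just example values and would need tuning:

```python
import numpy as np
from PIL import Image

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def pages_match(expected_png: str, actual_png: str,
                threshold: float = 1e-4, window: int = 64) -> bool:
    """Mean squared pixel difference on grayscale values normalized to
    [0, 1], checked globally and per window so a localized breakage
    can't hide inside an otherwise identical page."""
    a = np.asarray(Image.open(expected_png).convert("L"), dtype=np.float64) / 255
    b = np.asarray(Image.open(actual_png).convert("L"), dtype=np.float64) / 255
    if a.shape != b.shape or mse(a, b) > threshold:
        return False
    for y in range(0, a.shape[0], window):
        for x in range(0, a.shape[1], window):
            if mse(a[y:y+window, x:x+window], b[y:y+window, x:x+window]) > threshold:
                return False
    return True
```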
This can get very difficult. Especially with pages that are more than just text and images. Lines, interactive content, optional layers, annotations, embedded content, blend mode transparencies. All of this and more make things complex.
The real problem is that reading a PDF is vastly more complex than writing one.
The spec (1000+ pages) is open to interpretation, and different readers interpret it differently. A page that renders perfectly in Adobe may look different when viewed in Firefox, Chrome, or Ghostscript.
It's interesting how different people's use of testing terminology is across teams/companies/professions. Vocabulary is standardized by various ISO standards, the ASQ, and the ISTQB so that we could all share the same language; then we wouldn't have to debate what integration/unit/smoke/component/regression/golden/snapshot testing means.
Isn’t testing the physical generation of a PDF more aligned with an “integration” test than unit testing? Testing the API that makes the PDF is ok, but testing like this post suggests, with bitwise comparison, is integration testing, no?
There is a distinct and meaningful difference between unit tests and integration tests. flandish is not bikeshedding.
Unit tests are about testing a single unit in isolation. Integration tests are about testing the integration of multiple units.
With unit tests, the industry's general attitude is that there should be no side effects, such as reading/writing to databases or the disk. Side effects are generally embraced for integration tests, on the other hand.
As a result, unit tests are mostly useful for "pure" functions, ones where the output is 100% derived from the input, regardless of any state external to the function. (Such as database records.) However, a large portion of the industry hasn't realized this and so you get millions of lines of dependency-injected unit tests that really don't provide much value in terms of catching actual bugs. (If these tests were integration tests, they'd catch actual bugs 10x more often.)
A unit test for generating a PDF will not actually involve writing a PDF to disk. An integration test, however, might.
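As an illustration only (pytest-style, with a made-up build_report stub standing in for the real generator), the split could look like this:

```python
from pathlib import Path

def build_report(title: str) -> bytes:
    """Stand-in for the real generator; returns PDF bytes in memory."""
    return b"%PDF-1.7\n% " + title.encode() + b"\n%%EOF\n"

def test_build_report_unit():
    # Pure input -> output: no filesystem, no database, no renderer.
    pdf = build_report(title="Q3")
    assert pdf.startswith(b"%PDF-")

def test_build_report_integration(tmp_path: Path):
    # Deliberate side effect: write the file to disk, then check it landed.
    out = tmp_path / "report.pdf"
    out.write_bytes(build_report(title="Q3"))
    assert out.stat().st_size > 0
```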
Having well defined terms, and using them well, is essential to any type of engineering. I don't know why aiming for precise terminology is only controversial in software engineering.
> Isn’t testing the physical generation of a PDF more aligned with an “integration” test than unit testing? Testing the API that makes the PDF is ok, but testing like this post suggests, with bitwise comparison, is integration testing, no?
The fact that it writes the PDF out to a file potentially makes it an integration test, but I don't think the rendering aspect does. The poster is not testing the integration of the tool with Ghostscript; rather, Ghostscript is simply used as an oracle for verifying the result. The only thing actually tested is the original a4pdf API, but some way of verifying the resulting PDF was needed, and that is what Ghostscript accomplishes. Effectively it's no different from a fancy assertion.
I have a much more liberal view of what constitutes a unit test: everything that can be run inside a single container is a unit test. Writing files? Unit test. Using databases? As long as that database is started by the test fixture in the same container and destroyed along with the container, still a unit test.
Of course, if your test needs a database the natural follow-up question is whether it can populate the database with data known at build time, or it needs to reach out to get some realistic looking data. Only the latter makes it an integration test.
We do something similar, but in my experience small changes, like fonts or lines rendering a tad differently after library changes, can be quite frequent. Usually they're changes small enough that you can't really see them unless you compare the pages as two layers in paint.net or something.
Adding something like an error margin for all pixels or for subsections sometimes makes sense, but it can be tricky. Downscaling the image and comparing grayscale values with a small error margin is another option. It all depends on how accurate your tests have to be.
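A minimal sketch of the downscale-and-compare option with Pillow and NumPy; the scale factor and tolerance are made-up starting points, not values anyone here has validated:

```python
import numpy as np
from PIL import Image

def roughly_equal(expected_png: str, actual_png: str,
                  scale: int = 8, tolerance: float = 4.0) -> bool:
    """Downscale both pages and compare grayscale values with a margin,
    so sub-pixel font and line rendering differences stop tripping the
    test. Scale factor and tolerance need tuning per project."""
    def load(path: str) -> np.ndarray:
        img = Image.open(path).convert("L")
        img = img.resize((img.width // scale, img.height // scale),
                         Image.Resampling.BILINEAR)
        return np.asarray(img, dtype=np.float64)
    a, b = load(expected_png), load(actual_png)
    return a.shape == b.shape and float(np.abs(a - b).max()) <= tolerance
```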
Well, but those changes are triggered by something, aren't they? So when you upgrade your font lib or PDF rendering library, you're warned that you're now generating different output and can update the golden set.
Your dependencies aren't changing without a cause, are they?
Yeah sure, it just starts to be a problem when you're having dozens of tests failing because of small rendering changes which can be ignored. Someone still has to look at all the test output, compare it to the old state and update the tests with the new state. In our case this happened quite a lot.
This is not an issue at first, but the more you use tests like this and the more people work with your code, the more false positives start to drag you down.
No, at the end of the day the proposed approach of rendering to an image and comparing pixels is best. Things can go wrong graphically that OCR won't catch, like an entire background color is missing or an image is missing.
If you're worried about a generation date in the margin, then compare inside a bounding box that includes most of the page but not that margin. Or, even better, just use a fixed date for the test, since otherwise you've got to be careful about running the test within a few seconds of midnight anyway.
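The bounding-box variant is only a few lines with Pillow; the margin width below is a placeholder to measure against your own template:

```python
from PIL import Image, ImageChops

def equal_ignoring_margin(expected_png: str, actual_png: str,
                          margin_px: int = 40) -> bool:
    """Compare everything except an outer margin where a generation date
    might live. The margin width is a placeholder; measure your template."""
    def body(path: str) -> Image.Image:
        img = Image.open(path)
        w, h = img.size
        return img.crop((margin_px, margin_px, w - margin_px, h - margin_px))
    diff = ImageChops.difference(body(expected_png), body(actual_png))
    return diff.getbbox() is None
```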