BreezyPDF Lite: HTML to PDF generation as a Service

jim-a-1020401 · on June 29, 2018

I've been working on a documentation process with Markdown -> HTML + CSS -> PDF for a while and I found that Weasyprint works best. It supports everything I need except CSS target-counter for auto-generation of page numbers in a TOC, but otherwise it's really good.

I really don't like that this uses something non-standard for headers/footers and other margin content, since that's already covered by the standardized CSS @page rules: https://www.w3.org/TR/css-gcpm-3/. I'm using that with Weasyprint for automatic headers/footers with auto page-numbering, including setting strings from the HTML to use for heading names, document title, author, etc. Example CSS: `h1 { string-set: h1-title content(), h2-title ""; }` + `@page { @bottom-left { content: string(h1-title) string(h2-title); } }`.

Weasyprint seems closest to princexml and is free.
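The string-set and margin-box approach described in this comment can be sketched as a small Python script. This is only an illustration under stated assumptions: the `h2` rule, the `@bottom-right` page counter box, and the helper names `build_document` and `render_pdf` are not from the comment, and the render step needs the third-party `weasyprint` package.

```python
# Sketch only: the h1 and @bottom-left rules mirror the CSS quoted in the comment;
# the h2 rule, the page counter box, and the helper names are assumptions.
PAGE_CSS = """
h1 { string-set: h1-title content(), h2-title ""; }
h2 { string-set: h2-title content(); }
@page {
  @bottom-left { content: string(h1-title) " " string(h2-title); }
  @bottom-right { content: counter(page); }
}
"""


def build_document(body_html: str) -> str:
    """Wrap body HTML in a page that carries the paged-media stylesheet."""
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        "<style>" + PAGE_CSS + "</style></head>"
        "<body>" + body_html + "</body></html>"
    )


def render_pdf(html: str, pdf_path: str) -> None:
    """Render with WeasyPrint; imported lazily so the rest runs without it."""
    from weasyprint import HTML  # third-party: pip install weasyprint

    HTML(string=html).write_pdf(pdf_path)


document = build_document("<h1>Install</h1><h2>Linux</h2><p>Steps go here.</p>")
# render_pdf(document, "manual.pdf")  # uncomment when WeasyPrint is installed
```

Resetting `h2-title` to an empty string on every `h1` keeps a stale subsection name from leaking into the footer of the next chapter.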

deedubaya · on June 29, 2018

You can use CSS page sizing as well, passing it via meta isn't required.

https://docs.breezypdf.com/metadata#use-css-for-page-size

Headers and footers are just HTML strings and can be super rich with images etc. and customized with CSS. Page numbering is free as well in headers/footers.

Of course you could just use properly positioned <header> and <footer> tags and do whatever you need to with JS for page numbering.

jim-a-1020401 · on June 29, 2018

That's good for page sizing but, as other comments on this thread noted, no browsers seem to have proper support for CSS paged media, including Chrome. When I tested it, none of the @bottom type stuff or other print/page related CSS worked at all, which is why I ended up with Weasyprint since it has its own engine (and is really great). Example of some more paged media stuff not supported by Chrome: https://www.smashingmagazine.com/2015/01/designing-for-print.... Another thing Weasyprint does really well, which surprised me, is automatically carrying over table headers onto subsequent pages when a table is split across pages.

oliwarner · on June 29, 2018

Speaking as somebody who also puts a lot of stuff through WeasyPrint, I love it but it's bloody slow.

Tasked with solving the same problems again, I'd probably look at headless Chromium or Firefox. Their JS PDF renderers are fast and if you're doing enough to keep an instance loaded all the time there's no start up time.

deedubaya · on June 29, 2018

Turns html like this: https://github.com/danielwestendorf/breezy_pdf_lite-ruby/blo...

Into a pdf like this: https://www.dropbox.com/s/v4j4n1cvtm032w9/breezy-pdf-dashboa...

radiusK · on June 29, 2018

Very impressive. I wonder how it got the div heights to fit perfectly with the PDF page height. Was it tweaked a lot to fit, or does the flexbox just magically fill the page following the page break rule?

brendandahl · on June 29, 2018

And it all comes back to HTML as dropbox uses pdf.js to show that PDF.

deedubaya · on June 29, 2018

The circle is complete

nightbrawler · on June 29, 2018

Despite being a bit pricey, PrinceXML is the best tool I've seen and used for HTML to PDF conversion. Massive set of features and very reliable output. We've used it as part of our reporting engine to output PDFs since 2011. Supports spanning tables across multiple pages, JavaScript, image loading and lots of other cool stuff.

Scarbutt · on June 29, 2018

Prince renders HTML somewhat differently than browsers and doesn't support the latest HTML + CSS standards. Why choose it for new stuff over headless Chrome and/or Puppeteer?

wolfgang42 · on June 29, 2018

I use PrinceXML at work, for generating press-ready output. (E.g. a catalog, generated straight from the product database and ready to be sent to the printer. Previously they were doing this all manually with PageMaker, which was very tedious and prone to mistakes.) It supports the CSS Paged Media spec, so you have full control over page layout: margins, widows and orphans, page breaks, bleeds, and so on. It also understands page numbers, and gives you full control over headers and footers. Other things it supports include press crop marks, color management, and PDF options like bookmarks and PDF link regions.

Headless browser rendering is fine if all you need is a two-page invoice PDF, but it falls down when you need control of anything other than basic stuff like the font size.

deedubaya · on June 29, 2018

> it falls down when you need control of anything other than basic stuff like the font size

This is definitely not the case anymore, and BreezyPDFLite supports most of the features you mention, while supporting the same HTML/CSS/JS you might be displaying to end users across evergreen browsers.

wolfgang42 · on June 29, 2018

Does BreezyPDFLite support CSS like this? Chrome didn't support any of this last I checked.

  h2 {flow: static(header);}
  @page :right {
      margin-bottom: 1.4cm;
      @top-left {content: flow(header);}
      @bottom-right {content: counter(page);}
  }

Edit: Also, tables of contents, with page numbers and links in the PDF:

  ul.toc a::after {content: leader('.') target-counter(attr(href), page);}

jim-a-1020401 · on June 29, 2018

No browsers support those features from all the testing I've been doing over the past month or two. Weasyprint supports everything I tried except target-counters and dot leaders which I've only found support for with princexml. Weasyprint does author a proper outline within the PDF so my current workflow (no $ for prince) is to use Weasyprint to make the PDF with a TOC that has no page numbers, extract the outline from the PDF with a python tool to get the page numbers, update the original HTML with page numbers then run it through Weasyprint again. That'll be bundled in an automated build thing triggered by a git commit of the original markdown file straight to PDF so it's not something I have to do manually each time.
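A minimal sketch of the second half of that two-pass idea, assuming the page numbers have already been pulled out of the first-pass PDF outline into a plain mapping keyed by anchor id. The function name `add_page_numbers`, the `page` span class, and the sample data are all hypothetical; the outline extraction itself is not shown.

```python
import re
from typing import Mapping

# Matches simple TOC links of the form <a href="#some-id">text</a>.
TOC_LINK = re.compile(r'<a href="#(?P<id>[^"]+)">(?P<text>.*?)</a>')


def add_page_numbers(html: str, pages: Mapping[str, int]) -> str:
    """Append a page-number span after every TOC link whose anchor is in `pages`."""

    def repl(match):
        page = pages.get(match.group("id"))
        if page is None:
            # Unknown anchor: leave the link exactly as it was.
            return match.group(0)
        return f'{match.group(0)} <span class="page">{page}</span>'

    return TOC_LINK.sub(repl, html)


toc = (
    '<ul class="toc">'
    '<li><a href="#intro">Intro</a></li>'
    '<li><a href="#setup">Setup</a></li>'
    "</ul>"
)
# Hypothetical result of reading the outline from the first-pass PDF.
pages = {"intro": 3}
updated = add_page_numbers(toc, pages)
```

The updated HTML would then go through the renderer a second time. Adding page numbers could itself shift the layout, so a real build may want to check that the outline is unchanged after the second pass.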

deedubaya · on June 29, 2018

I'm not sure about those specific selectors, you'd need to check caniuse.com

Building a TOC is just as easy as building the HTML and linking to the IDs appropriately.

Page numbers are supported in header/footer templates, or via manual computation when you render the HTML or with JS.
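As a rough illustration of building a TOC by linking to heading IDs, here is a sketch using only Python's standard library. The class name `HeadingCollector`, the function `build_toc`, the `toc` class, and the choice to index only `h1` and `h2` elements that carry an `id` are assumptions for the example, not anything BreezyPDF prescribes.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (id, text) pairs for h1/h2 elements that have an id attribute."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._current_id = dict(attrs).get("id")
            self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2") and self._current_id is not None:
            self.headings.append((self._current_id, "".join(self._text).strip()))
            self._current_id = None


def build_toc(html: str) -> str:
    """Return a <ul class="toc"> whose links point at the collected heading ids."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    items = "".join(
        f'<li><a href="#{escape(hid)}">{escape(text)}</a></li>'
        for hid, text in parser.headings
    )
    return f'<ul class="toc">{items}</ul>'


doc = '<h1 id="intro">Intro</h1><p>Hi</p><h2 id="setup">Setup &amp; run</h2><h2>No id</h2>'
toc = build_toc(doc)
```

Headings without an `id` are skipped rather than given generated ids, which keeps the sketch from rewriting the source document.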

kyriakos · on June 29, 2018

No browser supports CSS paged media features. So there's no proper control over repeated elements, paging, etc.

jacquesm · on June 29, 2018

PDF is 'lossy' when compared to HTML: you will lose a ton of semantic information, and on top of that the resulting PDF will be much larger than the source material.

I'm currently involved in an effort to do the reverse and there isn't a day that I don't curse the PDF specification and the various implementations. And with the 'data:' source for graphical elements and MathJax, there isn't much reason for, say, scientific papers to be published as PDFs to begin with.

https://github.com/thomaspark/pubcss

PubCSS has the right idea.

invaliduser · on June 29, 2018

PDF is a format designed for the exact display of documents, not for anything else. It's like ranting about your printer losing the semantics of all those HTML tags when printing a web page. The PDF format is not to blame here.

sebazzz · on June 29, 2018

Converting to PDF can be very useful for generating reports. For instance, you display a chart using JavaScript, and get exactly the same chart in the generated report.

We use EvoPDF for this purpose, which also uses a web browser under the covers. Unfortunately it is quite slow, especially when JavaScript is required for the report. It also handles tables badly across multiple pages, and full page backgrounds are also cumbersome.

contravariant · on June 29, 2018

There's a pretty good reason for scientific papers to be published in PDF. First, TeX was originally designed to work with PostScript, which was intended for communication to printers, so the step to PDF was easy to make. Second, PDF has had fairly broad support and unambiguous representation for longer than HTML has (and even though HTML appearance is now somewhat standardised, it's still nowhere near as stable as PDF has been).

jacquesm · on June 29, 2018

PDF stable? The only party that properly renders PDFs is the one that came up with the format, it's as bad as Microsoft Word in terms of interop.

That it is a supposedly open format is a joke, there is so much old cruft in there that you could implement if you had access to Adobe's source code but unfortunately nobody but Adobe does.

PostScript (which PDF is a restricted version of in many ways) is clean, but PDF is not.

nicolasMLV · on June 29, 2018

It is mainly a wrapper around this Chrome headless command (maybe you only need that):

`#{chrome_alias} --headless --disable-gpu --print-to-pdf="#{pdf_path}" "#{html_url}"`
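For anyone who wants that one-liner outside of Ruby, a hedged Python equivalent might look like the sketch below. The three flags are the ones quoted in the comment; the function names, the `google-chrome` binary name, the example URL and path, and the timeout are assumptions, and the binary name in particular varies by platform.

```python
import subprocess
from typing import List


def chrome_pdf_command(chrome: str, html_url: str, pdf_path: str) -> List[str]:
    """Build the argv list for a headless Chrome print-to-PDF run."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        html_url,
    ]


def render_pdf(chrome: str, html_url: str, pdf_path: str, timeout: float = 60.0) -> None:
    """Run Chrome and raise if it exits non-zero or exceeds the timeout."""
    subprocess.run(
        chrome_pdf_command(chrome, html_url, pdf_path),
        check=True,
        timeout=timeout,
    )


# Building the command does not launch anything; call render_pdf to actually run it.
cmd = chrome_pdf_command(
    "google-chrome",  # assumed binary name
    "https://example.com/report.html",
    "/tmp/report.pdf",
)
```

Passing an argv list instead of an interpolated shell string avoids quoting problems when the URL or output path contains spaces or shell metacharacters.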

mapgrep · on June 29, 2018

I am guessing the API/service architecture is what differentiates it from the command line tool https://wkhtmltopdf.org/ ?

forgot-my-pw · on June 29, 2018

WKHTMLTOPDF is pretty inactive nowadays. It uses an older webkit engine, and there's only been 1 release in the past 2 years (which happened to be 19 days ago, actually).

deedubaya · on June 29, 2018

Wkhtmltopdf uses webkit, and at last check, an ancient version of WebKit at that.

This uses Google Chrome headless for the actual PDF generation.

cygned · on June 29, 2018

We have built something similar internally. Macroservice running a cluster of headless Chromes that turn an uploaded HTML file into a PDF. Used for reporting.

We are now switching to a Microsoft Word rendering backend where we process Word files with template strings in them and then run Word in headless mode to save files as PDFs. While the HTML-to-PDF approach works, most of our users work with Word all day so we are solving the wrong problem.

wilsonfiifi · on June 29, 2018

Can you elaborate a bit more on your Word-based reporting? To date the only solution I've been satisfied with is Aspose Words. Unfortunately licensing is a bit steep, so it would be good to know if any cheaper options are available.

cygned · on June 29, 2018

We basically process the template with a node.js library. We then have a node.js server on a Windows machine that takes a file, saves it onto disk and runs a Visual Basic application on it which itself starts Microsoft Word headless to save the file as PDF.

benbristow · on June 29, 2018

> runs a Visual Basic application on it

Clever, but that does sound a bit messy.

cygned · on June 29, 2018

Sounds messy. However, this VB app is literally just 15 LOC, pretty straightforward. And it is the only solution we have found so far that allows us to leverage Word.

laktek · on June 29, 2018

If you like a hosted API, I've built https://pdf.cool

deedubaya · on June 29, 2018

Or BreezyPDF.com ;)