Wkhtmltopdf, shell utility to convert html to pdf using webkit rendering engine (code.google.com)
92 points by tilt on April 16, 2012 | 39 comments



I spent a lot of time 2-3 years ago assessing different tools to convert HTML+CSS to PDF [1]. At the time, this was to convert HTML plus custom tags into well-formatted legal documents.

At the time the hands down winner was Prince XML [2]. It's relatively expensive ($3800 for a single server license) but it just works, works from many languages and produces beautiful results quickly (look at their samples). It doesn't take a lot of developer time to make up that purchase cost.

I haven't checked out this particular project, but the others I tried tended to work for smaller samples and would die, take forever or produce unpredictable results on even moderately large documents (~150k).

For any commercial project, honestly I'd just fork over the $3800 for Prince without hesitation. It's simply that good.

EDIT: actually, looking over the SO question I think I did check out an early version of this project and didn't have much success with it. The one thing that concerns me about this project now is the last news item is over a year old. Is it still being actively maintained?

[1]: http://stackoverflow.com/questions/391005/convert-html-css-t...

[2]: http://princexml.com/


Not to mention that Prince XML has excellent support for CSS paged media (margins, page breaks, headers & footers, etc). Contrast with the printed output of any major browser -- they're all quite disappointing.

It would be nice if that $3800 included free upgrades to subsequent releases, though...


+1 for Prince XML.

Ryan Tomayko (a githubber) is extremely impressed. And when Ryan (or any other community respected hacker) is impressed, I can use that product without any further thinking.

http://tomayko.com/writings/princexml

If you are on .NET, I also recommend the Essential Objects PDF library (http://www.essentialobjects.com/Products/EOPdf/Default.aspx). I have been using it in a production project and it is rock solid. At $549/developer, it is much more affordable.


I used wkhtmltopdf in a previous project and found it to be extremely reliable and easy to use. I was extracting the HTML mime parts from incoming email, converting them to PDFs with wkhtmltopdf, then converting that to a PNG with ImageMagick and displaying the PNG to the user in a web browser.
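
A minimal Python sketch of that pipeline, for illustration only. The function names are hypothetical, and rendering only the first PDF page (`[0]`) is an assumption; the comment above does not say how the HTML parts were extracted or which ImageMagick options were used.

```python
import email
from email import policy


def html_parts(raw_message: bytes) -> list[str]:
    """Return the decoded body of every text/html part in a raw email."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    return [
        part.get_content()
        for part in msg.walk()
        if part.get_content_type() == "text/html"
    ]


def conversion_commands(html_path: str, pdf_path: str, png_path: str) -> list[list[str]]:
    """Build the two command lines: HTML -> PDF, then PDF -> PNG."""
    return [
        ["wkhtmltopdf", html_path, pdf_path],
        # Assumption: only the first page is wanted; "[0]" is ImageMagick's
        # syntax for selecting the first page of a multi-page input.
        ["convert", f"{pdf_path}[0]", png_path],
    ]
```

Each HTML part would be written to a temporary file and the two commands run in order, for example with `subprocess.run(cmd, check=True)`.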


Why bother with the intermediate stage and not go direct html-to-png?


Originally because I didn't find a free app which would do that. Then I decided to keep the PDF as it was quite useful. Unlike with the PNG, the HTML links were retained in the PDF. I.e., HTML anchor tags are still clickable in PDFs generated by wkhtmltopdf.

EDIT: A PDF will also let you select text, unlike an image. However, an image is nicer to embed in a webpage. So I utilised both in order to get the best of both worlds.


Oops just noticed this comment. See my other comment (wkhtmltoimage is part of the package, allows you to render HTML+CSS into PNG, compile into x64 and place binary in your git repo directory to use on heroku)


There are two issues I have with wkhtmltopdf that still make me fall back on the printing code I wrote in 2004 to create PDFs for our users:

* WebKit's support for printing is a bit behind the times and stuff like "display: table-header-group;" isn't quite supported, so whenever you have to print big lists across multiple pages, you are practically forced to do your own page breaking.

* Due to an issue somewhere between qt and webkit, it's not possible to hyphenate text. Well. It is possible, but it causes the hyphen not to be painted in most cases.

Not having to deal with manual page breaks and being able to hyphenate (and thus do real justification) were my two reasons for wanting to move off my home-grown solution, but as those are exactly the two things missing in wkhtmltopdf, I'm staying with my own solution.
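
To make "do your own page breaking" concrete, here is a rough Python sketch, not the commenter's actual code. It assumes every row has about the same height, so a fixed `rows_per_page` (a hypothetical parameter) fits on one page, and it emits one table per page with the header repeated and a CSS `page-break-after` between them.

```python
from html import escape


def paginate_table(header: list[str], rows: list[list[str]], rows_per_page: int) -> str:
    """Split rows into fixed-size chunks, repeating the header on each page."""
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be at least 1")
    head = "<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>"
    tables = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
            for row in chunk
        )
        # Force a page break after every table except the last one.
        is_last = start + rows_per_page >= len(rows)
        style = "" if is_last else ' style="page-break-after: always"'
        tables.append(f"<table{style}><thead>{head}</thead><tbody>{body}</tbody></table>")
    return "\n".join(tables)
```

Rows of varying height would need real measurement instead of a fixed count, which is exactly the work a functioning `table-header-group` would save.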

Aside from that: if you can live with these shortcomings and with the fact that you are for all intents and purposes forced into using their static build (patched qt, kerning issues for everybody trying a build without the same qt patches), then this might be the perfect solution for PDF generation.

It feels great to use CSS with mm measurements and getting exactly what you need. Or creating barcodes by just embedding SVG or being able to use the full capabilities of HTML, CSS and even JS when building your page.


It's worth noting this log entry:

> Aug 11 2009: Development moved to git http://github.com/antialize/wkhtmltopdf


WebKit is quite powerful and can quite easily be used to generate a PDF, SVG, PostScript or PNG with almost no effort.

I wrote a simple Deck.JS [1] and S5 [2] PDF converter using a few lines of scripting. These programs take a slide presentation written in HTML5 and convert them into a portable PDF document. This is very handy since you can then share a single file that includes all graphical elements (fonts, images, layout) intact.

I have a GitHub toy repo [3] where I made a few tests with WebKit. One of the programs there (screenshot.pl) even lets you use XPath to find the subnode to grab.

[1] https://github.com/potyl/perl-App-deckjs2pdf

[2] https://github.com/potyl/perl-App-s5pdf

[3] https://github.com/potyl/Webkit


Every time I see a utility like this, I think maybe I could switch to producing some materials in HTML as the primary, or main intermediary, source format. Then I try the utility and realize that that would be silly.

For example, I currently make PDF slides for talks. In theory I'd like to make HTML slides, but would still like the ability to render a PDF for a robust record. However, neither this utility nor PhantomJS (which I just tried) immediately does a good job of converting something like: http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-...

EDIT: also just tried cutycapt, with similar results to wkhtmltopdf (got all slides rather than just visible one, with bad page breaks, and no TeX maths).


I take some of it back. Getting the latest version of wkhtmltopdf and telling it to wait (probably longer than necessary) to process javascript, works pretty well.

    wkhtmltopdf --javascript-delay 10000 --no-stop-slow-scripts 'http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-03-30/ltg-talk.html#Lotka-Volterra' slide.pdf

It's a bit slow, and a bit too hacky for me. But this tool does the best job of those I've seen.
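
For anyone scripting this, a small Python wrapper around the same invocation might look like the sketch below. The two flags are the ones from the command above; the function names and the timeout are assumptions added for illustration.

```python
import subprocess


def wkhtmltopdf_command(url: str, output: str, js_delay_ms: int = 10000) -> list[str]:
    """Build the wkhtmltopdf command line with a JavaScript delay."""
    return [
        "wkhtmltopdf",
        "--javascript-delay", str(js_delay_ms),  # wait before rendering
        "--no-stop-slow-scripts",                # let long-running scripts finish
        url,
        output,
    ]


def render(url: str, output: str, js_delay_ms: int = 10000, timeout_s: int = 120) -> None:
    """Run wkhtmltopdf and raise if it fails or exceeds the (assumed) timeout."""
    subprocess.run(
        wkhtmltopdf_command(url, output, js_delay_ms),
        check=True,
        timeout=timeout_s,
    )
```

Passing the arguments as a list avoids shell quoting problems with the `#` fragment in the URL.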

And I've just received an email pointing me to: http://search.cpan.org/perldoc?deckjs2pdf https://github.com/potyl/perl-App-deckjs2pdf that will specifically deal with Deck.JS slides.


Well, I am looking for some feedback on a project that converts XML to PDF. Give it a try: https://github.com/kelvin0/PyXML2PDF


I am looking for a command-line utility that could do:

    webpage2pdf 'http://bit-player.org/deck.js/limits-to-growth-Harvard-2012-03-30/ltg-talk.html#Lotka-Volterra' slide.pdf

and actually work (create a sensible PDF representation of what I can see in a browser). So my feedback wouldn't be useful, as my use case is out of scope for your project: "PyXML2PDF is NOT compatible with any XHTML/HTML/CSS. It uses a small set of tags to quickly allow generation of PDFs."


It seems that what you want is deckjs2pdf, get it from CPAN [1] or GitHub [2]

[1] http://search.cpan.org/perldoc?deckjs2pdf

[2] https://github.com/potyl/perl-App-deckjs2pdf


Would it be sufficient to create PNGs of the web pages and extract the text of the webpage to place in the background of a PDF file (for search, screenreading)?


Not for me. Personally I'll stick to ways of making decent PDFs that don't go via HTML.


Over the years I've tried various HTML to PDF utilities and have yet to find one that works correctly.

Previously I was using htmldoc, a project that was abandoned years ago and doesn't work with CSS. It worked for what I was doing but without CSS it's very inflexible and hard to maintain.

I recently moved to wkhtmltopdf but it has plenty of its own issues. The biggest problem I've found is that it doesn't wrap text between pages correctly. If you have a multi-page document it's likely the last line of text on a page will be split over 2 pages. IMO this is a show stopping bug. It has been known for a while but it seems no one is working on it.

The OS X version is broken. It was creating 5MB+ PDFs that should be about 50k. The Linux version doesn't have this bug.


It's practically unusable when not in an environment with X11. I had to use it on a Windows system and any text would have incorrect letter-spacing. Every letter would bleed into the next; it's a typographic nightmare. You could use Arial Unicode MS to get a somewhat acceptable result but that won't support bold or italic text cleanly.

I'm not quite sure, but I think the fix isn't even part of the 0.11 release. You have to compile wkhtmltopdf yourself to get it working.

When this issue is resolved this will be perfect, though. It has great capabilities to render footers and headers and JS-based output (in my case Highcharts). For the time being you can't even switch to commercial systems - ActivePDF, for one, has the same issue in the latest release.


You can work around the text kerning issue by using custom web-fonts. But I agree, they definitely need to fix this issue.

http://code.google.com/p/wkhtmltopdf/issues/detail?id=72


You can use xvfb as a lightweight alternative to full X; it works well.
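
One common way to do that is the `xvfb-run` wrapper script, which starts a temporary virtual display for a single command. A minimal Python sketch, with a hypothetical helper name and assuming `xvfb-run` is installed:

```python
def headless_command(command: list[str], use_xvfb: bool = True) -> list[str]:
    """Prefix a command with xvfb-run so it gets a virtual X display."""
    if not use_xvfb:
        return list(command)
    # -a (--auto-servernum) asks xvfb-run to pick a free display number.
    return ["xvfb-run", "-a", *command]
```

For example, `headless_command(["wkhtmltopdf", "page.html", "page.pdf"])` yields a command list that can be passed to `subprocess.run`.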


Tip: wkhtmltoimage is part of the package, allows you to render HTML+CSS into PNG.

I used this for a project which needed a CSS powered image builder, which created sharable images:

Builder: http://circlek-flugtag.heroku.com/entries/shipomatic Thumbnails: http://circlek-flugtag.heroku.com/entries


I used to use this, but it has started segfaulting on a large range of websites.

I've since switched to cutycaps which handles all my needed features out of the box.


I think you mean http://cutycapt.sourceforge.net/ which I only found by switching back to Google; DDG was not able to decipher your type-o. :-)


Wkhtmltopdf does have its share of warts, but if you need to do a quick and dirty PDF dump of an entire site, it can help.

I used it with wget to scrape a site for conversion: http://darrenknewton.com/blog/2011/10/30/mirror-site-and-con...
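
The linked post is not reproduced here, so the following is only a guess at the general shape of such a workflow, sketched in Python: mirror the site with wget, then hand every mirrored HTML file to wkhtmltopdf as one multi-page job. The function names and the choice of wget options are assumptions.

```python
from pathlib import Path


def mirror_command(url: str, dest: str) -> list[str]:
    """Build a wget command that mirrors a site into dest for offline use."""
    return [
        "wget", "--mirror", "--convert-links", "--adjust-extension",
        "--page-requisites", "--no-parent", "--directory-prefix", dest, url,
    ]


def pdf_command(mirror_dir: str, output: str) -> list[str]:
    """Build a wkhtmltopdf command covering every mirrored .html file."""
    pages = sorted(str(p) for p in Path(mirror_dir).rglob("*.html"))
    if not pages:
        raise FileNotFoundError(f"no .html files found under {mirror_dir}")
    # wkhtmltopdf accepts several input pages followed by one output file.
    return ["wkhtmltopdf", *pages, output]
```

Alphabetical order is used only to keep the output deterministic; a real site dump would probably want an explicit page order.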


I used this for a project the other day, but after discovering PhantomJS I feel like it has more traction.


How specifically did you use PhantomJS for PDF generation?


There's an example here: http://code.google.com/p/phantomjs/wiki/QuickStart#Rendering (gives examples of .png and .pdf generation)


Thank you for posting. I recently did a bit of research on producing thumbnails with Ruby using this project: https://github.com/csquared/IMGKit.

The only problem was getting the screenshot to work with Flash. It seems as though the javascript delay option on wkhtmltopdf didn't delay at all.

Can PhantomJS handle pages with Flash?


Not any more, as a web search immediately reveals: http://code.google.com/p/phantomjs/issues/detail?id=418


PhantomJS has a "render(fileName)" API which supports PDF. Essentially, it is trivial to implement something like Wkhtmltopdf. However, Wkhtmltopdf is a shrinkwrap solution, which does not require you to install nodejs.


PhantomJS doesn't run on node.js


Does anyone know of any other engines like this, either paid or free? We are using this to produce catalogs of 200+ complex pages and it is not handling generating PDFs of this size very well. It will often become unresponsive and create a memory leak.


That seems to be a qt problem, as far as I know. I think I saw somewhere how to recompile qt to get a more robust solution. The issue tracker of wkhtmltopdf is quite helpful here.

An easy solution could be to just use extremely short URLs as these seem to affect the space used by wkhtml as well. But that was just my solution for a 200 page output. In addition, if you are using HTML footers or headers, try not to give them any parameters, if possible.

Edit: I can't find the best entry at StackOverflow (I remember there must be a Python based solution as well), but this might be a good overview:

http://stackoverflow.com/questions/633780/converting-html-fi...

Some of those are commercial.


I think wkhtmltopdf and PhantomJS are the most active open source solutions.

There's also http://pypi.python.org/pypi/xhtml2pdf/, written in Python and using ReportLab (which certainly has some nice properties).



If you are not allergic to java, flying-saucer is quite good http://code.google.com/p/flying-saucer/ .


Catchy name.


While this maintains hyperlinks as such, it breaks those using relative URLs.
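
One possible workaround, offered as an untested assumption and not something confirmed in this thread, is to rewrite relative `href` values to absolute URLs before handing the HTML to the converter. A rough Python sketch with a hypothetical helper name; the regex only handles quoted attributes and is no substitute for a real HTML parser:

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...' and captures the quote and the target.
_HREF = re.compile(r"""(\bhref\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)


def absolutize_links(html: str, base_url: str) -> str:
    """Rewrite relative href targets so they are absolute against base_url."""
    def fix(match: re.Match) -> str:
        prefix, quote, target = match.groups()
        if target.startswith("#"):
            # Leave in-document anchors alone.
            return match.group(0)
        # urljoin returns already-absolute URLs unchanged.
        return f"{prefix}{quote}{urljoin(base_url, target)}{quote}"

    return _HREF.sub(fix, html)
```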



