Ask HN: please review my app - html to pdf API (pdfcrowd.com)
52 points by jgresula on March 31, 2010 | 55 comments



Have you thought about the reverse, i.e., a tool that could convert pdfs to html faithfully?

I would be willing to pay money for a reliable tool that didn't need much manual editing after processing.

Unfortunately, the pdftohtml project (http://pdftohtml.sourceforge.net/) has been inactive, and the current version has trouble with even moderately complex layouts.


That's a non-trivial task. PDF has no objects such as tables, styles, lists, or paragraphs, so you would need to reconstruct this kind of information. Also, text and vector graphics are positioned absolutely. Tagged PDFs contain some meta-information about the document structure, which could help, but it is still a lot of work.

The fundamental problem is that PDF stores the document's presentation, while HTML defines the document and leaves the presentation to the browser. And obviously, restoring a document definition from its presentation is hard, as a lot of information is missing.
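
To make the reconstruction problem concrete, here is a minimal sketch of the kind of heuristic a converter might start from. It assumes the text has already been extracted as `(x, y, text)` fragments with `y` growing downward; the `Fragment` type, the function name, and both thresholds are made up for illustration and do not come from any particular tool.

```python
from typing import List, Tuple

# Hypothetical input shape: (x, y, text), with y growing downward.
Fragment = Tuple[float, float, str]


def group_into_paragraphs(
    fragments: List[Fragment],
    line_tolerance: float = 2.0,   # assumed: max y difference within one line
    paragraph_gap: float = 18.0,   # assumed: y gap that starts a new paragraph
) -> List[str]:
    """Rebuild paragraphs from absolutely positioned text fragments."""
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))

    # Step 1: fragments with nearly equal y values form one line.
    lines: List[Tuple[float, List[Fragment]]] = []
    for frag in ordered:
        if lines and abs(frag[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))

    # Step 2: a large vertical gap between lines starts a new paragraph.
    paragraphs: List[List[str]] = []
    previous_y = None
    for y, frags in lines:
        # Within a line, read fragments left to right.
        text = " ".join(t for _, _, t in sorted(frags, key=lambda f: f[0]))
        if previous_y is None or y - previous_y > paragraph_gap:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        previous_y = y
    return [" ".join(p) for p in paragraphs]


sample = [
    (10.0, 100.0, "Hello"),
    (50.0, 100.5, "world."),
    (10.0, 114.0, "Second line."),
    (10.0, 150.0, "New paragraph."),
]
result = group_into_paragraphs(sample)
```

Even this toy version has to guess two thresholds, and it knows nothing about columns, tables, or lists, which is where the real work would be.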


That's a non-trivial task.

Yes, that's true.

I only bring it up b/c if your goal is to turn pdfcrowd into an app that people would pay money for (and I would be one of them), solving that problem would go a long way towards achieving it.


Solving it perfectly is non-trivial (I've known entire PhDs to be spent working on a small subset of the problem). There are a number of products/projects that solve it to some extent (techniques include absolute positioning & making sweeping assumptions about what constitutes a paragraph) - would this be enough for you to consider paying for, given that their assumptions/workarounds might produce HTML files that aren't quite to your 'taste'?


There are already many apps and pieces of software that charge for the feature he already has, so I don't see why it is a requirement for him to monetize. It definitely would be an easy feature to charge for, but I think what he already has has potential.


Total noob question, couldn't you programmatically capture a browsershot and then convert that into a PDF?

HTML -> png seems to have been figured out. Is .png -> pdf that hard to do?


No, .png to .pdf is not difficult.

I believe dpapathanasiou's suggestion is not to blindly convert a pdf into html file with one giant image file of the pdf.

Instead, he wants to create an html document that maintains the same content and layout from the pdf.


D'Oh! Got myself mixed up there a bit.


NitroPDF does a remarkably good job translating PDF to Doc and RTF. I think the application (windows :() is better/has more output options, but they have a free web service: http://www.pdftoword.com/


I can easily see a use for this. I'm doing a pro bono project for a small non profit, and part of the project requires generating simple PDF reports. They don't have any money so we need to keep it low cost.

One of the ways of doing this is to host it on a simple shared server (it's not a heavily used app).

Downside of this is that it's unlikely we'll be able to use any of the PDF tools I've used in the past (since they need to be installed). This should work fine for our purposes.

Thanks, I was wondering how I'd get around this.

To all those who were dissing this because they couldn't immediately see a use for it, try to have a more open mind.


I'm also developing an HTML->PDF feature and jumped when I saw this! I tried smashingmag.com - which is funny because I meant smashingmagazine.com, but smashingmag.com is actually some Japanese site. Either way, I got back a totally blank PDF - maybe the Japanese character set is to blame?

One other caveat is that having the ability to view Flash would be awesome as well. The main function of PDF, as I understand it, is to create a document that PRINTS completely identically on every setup, so frequently people are going to want to print Flash, which is already a huge pain in the ass. Unfortunately it looks like it blanks out completely if there is Flash on the page (2advanced.net).

if you could solve that i would start paying tomorrow.


mpdf is really nice if you care about page breaks and the like...


The quality of http://www.princexml.com is amazing. It's not open source and there is a cost to use it commercially (<1K, if I recall). I used it to convert my HTML documentation for Sleep into a camera-ready PDF.

http://sleep.dashnine.org/manual/ - original docs http://sleep.dashnine.org/download/sleep21manual.pdf - result


I have exhausted myself trying to persuade prince xml to not blur my images. That's the biggest hurdle for me.

If PDFCrowd can effectively handle images, I'll brand their logo into my bicep.


Have you tried flying saucer? I found it to be excellent for my purposes.


Have not. I will definitely look into it. Thanks!


Nice execution - as per the comment below, something like this would've saved me lots of manual fiddling back when I was doing lots of PDF stuff.

Given the focus on APIs I guess you're aiming it at those wanting to programmatically generate PDFs using a familiar markup, rather than conversion of existing (static) content into PDF? If so, maybe investigate the ability to overlay rendering onto an existing PDF template at some point - in my experience it's been a common requirement (think form letters, account statements, etc).

Interesting that it appears to execute Javascript; guess it's a sign of the times that you need to in order to render many sites correctly nowadays. I haven't poked it too hard, but suspect there might be one or two security challenges there...


Well, your default HTML generates one screwy PDF. When viewed in Mac OSX Preview, I get the text "T pe our HTML here..." Then, when I select the text, certain letters get partially removed or overwritten and I end up with gibberish.

I've just spent weeks working on HTML -> PDF conversion code, so I know it's not just my viewer. I've put all kinds of crazy stuff through there.


Exact same thing works perfectly for me (OS X Preview, version 5.0.1). I'd be interested to know what this turns out to be.


Thanks for the report. I will look into it as there should be no default HTML in the editor.


Slick design, but out of curiosity, why wouldn't developers just use http://code.google.com/p/wkhtmltopdf/ ?


There is no doubt that many developers will use wkhtmltopdf.

I think Pdfcrowd's selling points could be: 1) wide availability - only HTTP is needed, so it can theoretically be used on any platform; 2) no need to install any 3rd party software, which makes applications more portable; 3) API bindings.
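
As a rough illustration of the "only HTTP is needed" point, a client needs nothing beyond a standard library that can POST a form. The sketch below only builds the request and does not send it. The endpoint URL and the field names (`src`, `username`, `key`) are hypothetical placeholders, not the actual Pdfcrowd API.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names, for illustration only;
# they are not taken from the Pdfcrowd documentation.
API_URL = "https://api.example.com/convert"


def build_conversion_request(
    html: str, username: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the HTML to convert."""
    body = urllib.parse.urlencode(
        {"src": html, "username": username, "key": api_key}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_conversion_request("<h1>Hi</h1>", "demo-user", "demo-key")
```

Against a real service, passing the request to `urllib.request.urlopen` and reading the response body would presumably yield the PDF bytes; that part depends on the actual API.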


We used an html->pdf conversion service (I believe it was http://www.htm2pdf.co.uk/ but I'm not positive) for a while to do billing, and our biggest problem was that it went down all the time. We ended up purchasing a (pretty cheap) license to a Java library that does pdf generation for us and is pretty easy to use. This is definitely a service that people will pay money to use - best of luck to you!


A lot of html to pdf conversion is useless if page-break-* properties are not followed. Shame, too. I've been building something like this all week.


I don't know the exact status of how WebKit handles these properties. I know that at least "page-break-after: always" works since that is what I use when the user clicks the 'Insert Page Break' button in the editor (http://pdfcrowd.com/editor/).
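
For anyone generating the HTML programmatically, the same trick can be sketched as below. The helper name is made up, and whether a given renderer honors the property is something to verify; the only thing stated above is that "page-break-after: always" works here.

```python
from typing import Iterable

# An empty block that forces the content after it onto a new page.
PAGE_BREAK = '<div style="page-break-after: always"></div>'


def join_with_page_breaks(sections: Iterable[str]) -> str:
    """Join HTML sections so that each one starts on a new page."""
    body = PAGE_BREAK.join(sections)
    return f"<html><body>{body}</body></html>"


document = join_with_page_breaks(
    ["<h1>Cover</h1>", "<h1>Report</h1>", "<h1>Appendix</h1>"]
)
```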


NICE! You have beaten me (and I am sure a dozen other hackers) to the realization of this idea...

Here is an idea for an extra feature: make a print bookmarklet -- clicking on it you get a nice PDF version of the page you are viewing right now. I can't stand firefox's print renditions of some pages... terrible...

(also you might want to set the page size to letter or A4 depending on the geolocation of your visitor's ip address)
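
A minimal sketch of that page-size choice, assuming the visitor's IP has already been resolved to an ISO country code by some geolocation lookup (not shown). The country list is a deliberately short assumption; a real service would need a fuller one.

```python
from typing import Tuple

# Page sizes in PostScript points (1/72 inch).
LETTER = ("Letter", 612.0, 792.0)  # 8.5 x 11 in
A4 = ("A4", 595.28, 841.89)        # 210 x 297 mm, rounded

# Assumption: a short, incomplete list of countries that default to Letter.
LETTER_COUNTRIES = {"US", "CA", "MX"}


def page_size_for_country(country_code: str) -> Tuple[str, float, float]:
    """Return (name, width_pt, height_pt) for a two-letter country code."""
    if country_code.strip().upper() in LETTER_COUNTRIES:
        return LETTER
    # Everything else, including unknown codes, falls back to A4.
    return A4


us_size = page_size_for_country("us")
de_size = page_size_for_country("DE")
```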


Excellent stuff.

I notice there are some questions about how to make money. One may be to position yourself as a way to get PDF reports generated from phone apps, in which case you may want to do per app licensing and provide facilities for email delivery of PDFs.

I could see this being useful porting apps from iPhone (can easily generate PDFs) to Android (which does not appear to support PDF output).



I used this for a major company's site-edit auditing system. (No, they didn't want HTML snapshots of each revision. It had to be a screenshot of the browser...)

It works really well. The only quirk is that it needs a fake X server (for font loading), but Xvfb works just fine for that.
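
For a command-line renderer, one common way to supply that fake X server is the `xvfb-run` wrapper. The sketch below only builds the command line. `render-page` is a placeholder name, not a real tool, and its arguments are made up.

```python
from typing import List


def headless_command(renderer_argv: List[str]) -> List[str]:
    """Wrap a renderer command so it runs under a throwaway Xvfb display."""
    if not renderer_argv:
        raise ValueError("renderer_argv must name a command to run")
    # -a asks xvfb-run to pick a free display number automatically.
    return ["xvfb-run", "-a", *renderer_argv]


# "render-page" is a placeholder command name, not a real tool.
command = headless_command(["render-page", "http://example.com/", "out.pdf"])
```

On a machine with Xvfb installed, passing `command` to `subprocess.run(command, check=True)` would run the renderer against the virtual display.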


Yes, I have seen it but have not tried it yet.


Woah. A 3rd party does your entire value add and yet you hack up your own?


I did not know about this project at the time I started with pdfcrowd. But anyway, I just took my existing pdf library and integrated it with WebKit, which was not as hard as one might think.


First of all, I don't know how well wkhtmltopdf works, but there are many, many solutions to the HTML-to-PDF problem, and most of them suck. It's not surprising the creator decided to put together a library from scratch, it's the special sauce for his business.

Also, the "value add" comes from the fact that wkhtmltopdf is a library, and PDFcrowd is an API.


The pdf conversion is awesome! I just tried printing http://times.com/ to a pdf in firefox and it ended up putting the main content of the site on page 2, whereas yours seemed to render it perfectly.


Looks good. I'm keen to use (and pay for) a service like this - if it's reliable and quick. With a ruby gem it's particularly attractive, as all other rails to pdf solutions are incomplete, require a pdf specific dsl or are very expensive.


This is awesome. I'm at once excited about using this in the future, and dismayed thinking of the time I've spent manually generating PDFs because none of the HTML -> PDF options worked.

I fed it my homepage, and it nailed it. I'm impressed.


Haven't tested this, but great idea. I've used a couple of the PDF creation tools and it seems so tedious to build out even a simple table view on a PDF. Good luck with this!


This is great! The only downside that I saw after converting one of my pages is that the colors dulled substantially.


That's a known problem on my todo list. The colors are dulled only in Acrobat but other PDF readers render the colors correctly. Please, could you post the link to that page if possible? Thanks.


http://your.gridspy.co.nz/powertech dulled substantially.

Also, you don't support the CSS3 styling of the header text.

The fonts look super aliased.

Finally, you don't snap the rendered HTML to the nearest page, leading to a page containing only the footer.


Worked great for me, good job. I'd be interested in a PHP binding too, and knowing what the eventual cost will be.


I like it, but my site didn't come out correctly (www.convertyourcds.com). Perhaps my html is screwed up?


Sorry, don't know why. Your site does not validate, but it could be a problem on my side as well.


I tried a relatively complex site - CNN - expecting the results to look bad, but it looks great


How much will it cost?


My current plan is to charge for conversion tokens, but I haven't decided how much yet.


Check out: http://www.htm2pdf.co.uk/htm2pdf-web-service.aspx Their pricing indicates 40,000 conversions for $90. I'd pay that.


Similar, and it also supports PDF by e-mail: http://www.web2pdfconvert.com


Uh, "a2ps file.html"? Doesn't even need an API key...


Why no PHP API binding?


I've gone ahead and built one.

Try it:

http://www.tokyomuslim.com/2010/04/php-class-to-run-pdfcrowd...

I don't blame anyone for not wanting to use PHP's poorly documented CURL classes.


Thanks for getting this done. But come on, CURL is documented pretty well. There are even examples. What's there to know? Init a connection, set the flags, pass in whatever you like, submit, and check response. Pretty straightforward to me.


The CURL stuff seems so oddly unlike the rest of the PHP commands; it's more-or-less a direct port of the C library (libcurl), names & all included.

The place where it reallllly irritates is the cookie management, but thankfully I didn't have to deal with that in this case. (I did for a client at a newspaper - nightmarish.)


No special reason - I just don't know PHP. But it is on the todo list.


Why do I need this?



