Show HN: Docverter: convert plaintext to PDF, Docx, or ePub. Now in open beta.

pestaa · on Oct 12, 2012

What does this add to pandoc? In other words, why would I pay for this service? It obviously runs in the cloud, but I haven't found much else.

In fact, my observation is that the API documentation was merely copy-pasted from the original. Example:

Pandoc docs[1]: http://dl.dropbox.com/u/144454/hn/from.png

Docverter API[2]: http://dl.dropbox.com/u/144454/hn/to.png

[1] http://johnmacfarlane.net/pandoc/README.html#header-identifi...

[2] http://www.docverter.com/api.html#toc_425

However, the author went the extra mile to rename sections, thus making it sound like the Markdown extensions are in fact Docverter's.

Sorry to be so negative, but this almost seems like acting in bad faith and selling GPL-licensed software as a service under a new name.

kevinmcconnell · on Oct 12, 2012

I don't think it's bad faith at all. They're up front about what tools the system is built on. It's more about whether it makes the most sense for you to build and host something like this yourself, or pay for what you use on a service that someone else runs for you. There are pros and cons to each, of course, but I think there are a lot of situations where the latter is going to be better.

One example being: if I was building a new product and wanted to add some reporting/exporting features, I would much rather use a service like this. Document conversion is just infrastructure for those features, and so any time I spend setting up my own system for that is potentially wasted time until I can prove that the features are a success.

samstokes · on Oct 12, 2012

If the author is conversant in Haskell (or willing to become so), an interesting way to add value would be to support and document the various output formats, and add new output formats.

Pandoc's an awesome conversion tool, but because of its many supported output formats, some are more reliable than others. For example, it can actually output an HTML/JS-based slide deck - using any of four different JS slideshow libraries - but in my experience only one of them is actually usable, and it's not clear how to customise/style the output.

Of course fixing that as a developer is a simple matter of reading docs / code, but if this product is aiming to be "Pandoc for non-developers", that would be an interesting angle.

zrail · on Oct 12, 2012

You're right, I went through and renamed some sections. I did that to reduce confusion, because there are a bunch of parts where the pandoc docs talk about things that don't make sense when you're POSTing instead of using command line options. I did give attribution at the bottom in the Authors section.

You'd generally pay for a service like this if you're running on Heroku or another PaaS and don't want to have to deal with getting Pandoc and various other supporting tools up and running.

alexchamberlain · on Oct 12, 2012

Whilst I agree that people should be open about the software they run, let us remember that selling access to GPL software running on your machine is perfectly legal.

rodw · on Oct 12, 2012

Also, for what it's worth the author was very open and explicit about the provenance. In the footer:

"This is a copy of the Pandoc README file, modified to suit Docverter's manifest format."

I'm confused though. In another comment [1] zrail says this (the HTML-to-PDF in particular) is built on a Java library. Is Flying Saucer based on Pandoc? Do you use one sometimes and the other at other times?

[1] http://news.ycombinator.com/item?id=4645497

zrail · on Oct 12, 2012

Docverter is a collection of a few pieces of software that get used at various times. For example, if you do Markdown to docx you'll just be using Pandoc. If you convert to PDF you'll be going through Flying Saucer. If you go Markdown to PDF you'll go through both. MOBI conversions go through Pandoc to get an ePub and then through Calibre to get the MOBI.

The point is that you don't have to worry about those pieces, though, since Docverter abstracts over them with a simple API.
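The routing described above can be sketched roughly as follows. This is only an illustration of the comment, not Docverter's actual code: the function name `plan_conversion` and the tool labels are hypothetical, and the assumption that non-HTML input is first turned into HTML by Pandoc before reaching Flying Saucer is inferred from the thread rather than confirmed.

```python
from typing import List


def plan_conversion(source: str, target: str) -> List[str]:
    """Return the ordered list of tools a conversion would pass through.

    Hypothetical sketch of the routing described in the thread.
    """
    source, target = source.lower(), target.lower()

    if target == "pdf":
        # HTML input goes straight to Flying Saucer. Anything else is
        # assumed to be converted by Pandoc first (an assumption).
        if source == "html":
            return ["flying-saucer"]
        return ["pandoc", "flying-saucer"]

    if target == "mobi":
        # Pandoc produces an ePub, then Calibre turns it into MOBI.
        return ["pandoc", "calibre"]

    # Everything else (e.g. Markdown to docx) is handled by Pandoc alone.
    return ["pandoc"]
```

For example, `plan_conversion("markdown", "pdf")` yields both Pandoc and Flying Saucer, matching the "you'll go through both" case.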

memset · on Oct 12, 2012

This is super cool. One thing that would be useful is an "examples" page. That is, have some sample .txt or .html files, and show us what the output .doc, .docx, .pdf, .epub, etc files look like. Just a list of static files, really.

You could also do something more fully-featured, like a sandbox where you can upload your own files. Perhaps it could be an "evaluation plan" which has a maximum of 10 conversions per month. (Then again, $5 isn't really that much to pay to evaluate a service.) Or maybe unlimited conversions with the evaluation plan, but the output files have a watermark?

I had no idea this was based on pandoc until I read the HN comments. So, cool!

zrail · on Oct 12, 2012

Thank you for the kind words. There are a few examples on the API page in the Advanced section but you're right, they should be featured more prominently. Also, the free dev plan will give you full access but will indeed insert watermarks.

alexchamberlain · on Oct 12, 2012

I'm impressed someone has had the balls to launch a service without a free-tier. Note that there is a developer's access plan for free though, so no moaning!

endlessvoid94 · on Oct 12, 2012

Same here. It's amazing what having customers vs. users can do for a business, right from the get-go.

hntester123 · on Oct 12, 2012

nlh · on Oct 12, 2012

This looks great (and is something I'll very likely use on a project I'm working on now!)

My quick "dumb" question -- what's the pitch for using this vs. what I would call more traditional conversion tools? My project will need some HTML -> PDF goodness and I was planning on researching and running some sort of local / server-side package (which I presume exists, though I haven't researched them yet).

Either way, congrats on the launch - this makes a lot of sense and sounds like a great utility.

zrail · on Oct 12, 2012

Thanks! The pitch is that HTML to PDF conversion tools, as a rule, are not very good. Docverter is actually on its third iteration of that particular conversion because the first two didn't provide even close to satisfactory results.

highace · on Oct 12, 2012

That's interesting. So you've built your own HTML to PDF converter? Can you provide an example (a screenshot, maybe) where your version excels against an existing solution?

zrail · on Oct 12, 2012

It's a small web service that wraps around a Java library named Flying Saucer, so there isn't anything to look at really.

Flying Saucer excels against the alternatives I looked at in a few ways. First, Pandoc's built-in PDF writer uses a LaTeX intermediary, which doesn't support anything that web writers have come to expect. Second, the other tools were WebKit-based and variously didn't support the paged media spec, didn't support embedding fonts, or both. Others were custom one-offs or desktop tools that wouldn't work how I need for Docverter.

sciboy · on Oct 12, 2012

Just so you are aware, Flying Saucer, while nice when you first use it, has tonnes of bugs and isn't really being developed these days. You'll find yourself diving into the code more and more because the output is substandard. We used it for years and have now moved away because we couldn't stand customising it for every little edge case any more; plus, it doesn't support HTML5/CSS3, which is essential nowadays. Take a look at the codebase: you won't want to be adding to that! Additionally, it expects documents to be completely in memory, which means it will take down your Java server sometimes.

zrail · on Oct 12, 2012

Out of curiosity, what did you move to?

sciboy · on Oct 13, 2012

We've been using PhantomJS, but it has its issues too. I'm not sure there is a good solution anywhere in the open source world, unfortunately, and tools like Prince are expensive.

zrail · on Oct 12, 2012

Good to know. Thanks.

vamega · on Oct 12, 2012

What features do web writers expect that LaTeX doesn't have? I think LaTeX supports fonts (well, XeLaTeX does).

zrail · on Oct 12, 2012

In particular, styling and layout with CSS. People seem to want seamless HTML to PDF conversion, and providing that without a LaTeX intermediary seems to be the best way to go.

zrail · on Oct 12, 2012

Hi HN,

I posted last week about Docverter, my plain text to rich text conversion tool. It's actually ready for people to start using now. I'll be here all day to answer questions.

Edit to add: I added docverter PDF conversions to my blog last night and it took all of an hour. Check out the Download PDF link toward the bottom here: http://bugsplat.info/2012-08-11-task-oriented-dotfiles.html

Code is here: https://github.com/peterkeen/bugsplat.rb/blob/master/app.rb#...

liamk · on Oct 12, 2012

Looks great. If you added docx and pptx to the inputs then you could easily compete with some big names (your prices are very competitive).

zrail · on Oct 12, 2012

Thanks. I added docx and pptx inputs to the todo list. I'll have to look into how to parse them into HTML.

eric_bullington · on Oct 12, 2012

You're in for a world of pain when you start parsing docx and pptx. On the bright side, if you can figure out a good solution, you'll likely have a solid business model. I would imagine that there would likely be significant demand for converting docx and pptx files into html or markdown, as a service. If you do come up with a nice, well-documented API for all of this, I'd certainly recommend your service. If you come up with an outstanding docx parser, then I'd use your service myself (I am using my own somewhat primitive solution for a current project involving the conversion of docx files).

Here's a few projects to look at, if you haven't already:

1. http://www.docx4java.org/trac/docx4j

2. https://github.com/mikemaccana/python-docx (there are some interesting and more active forks, but this is the original python-docx)

zrail · on Oct 12, 2012

Thanks for the links! I'll dig into them later.

Having never looked at it myself I'm not really sure why it would be so painful. Are the formats just super wacky?

pknight · on Oct 12, 2012

Are you going to add a way to convert urls to pdfs? It would be awesome to create a pdf of a webpage from the push of a button.

josephlord · on Oct 12, 2012

You might need to be careful that you don't end up liable for any copyright infringement. I would speak to a lawyer before pulling content from a URL and transferring it to an unrelated destination.

Maybe T&C can protect you in this scenario but maybe not.

pknight · on Oct 13, 2012

The use case I was thinking of was using it on my own website though, for things like printing out a nicely formatted invoice or a printable mockup of a webpage.

Recently I designed some flexible forms meant for printing, but I used php/html/css to generate it. I discovered that it's really hard to get a good quality print out of a webpage. If you use screenshots you get poor resolutions, and direct to print/pdf conversion tools didn't render the CSS all that well.

Don't see how T&C couldn't cover such scenarios in which the user has explicit permission to generate a copy of a webpage.

zrail · on Oct 12, 2012

I was actually just talking to someone else about that over email. It's definitely possible, I just haven't had time to explore it yet.

evolve2k · on Oct 12, 2012

Can this service handle more complex structures like converting a table in Textile or Markdown into a Table in docx? Can it handle word styling?

I'm building software where the output must be in docx, wondering how far I can go in not having to deal with word automation to get the output I want into a Word Doc.

zrail · on Oct 12, 2012

It can definitely handle word styling if you provide it with a reference docx to copy the styles from. I haven't tried tables. If you'd like, feel free to create a dev account and give it a spin.
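For comparison, plain Pandoc exposes the same reference-document idea as a command-line option. The sketch below builds such a command from Python; it says nothing about how Docverter's own API names the option. The helper names `pandoc_docx_command` and `convert` are hypothetical, and the flag shown is `--reference-doc` from Pandoc 2.x (older 1.x releases called it `--reference-docx`).

```python
import subprocess
from typing import List, Optional


def pandoc_docx_command(
    input_path: str,
    output_path: str,
    reference_doc: Optional[str] = None,
) -> List[str]:
    """Build a pandoc command line for Markdown to docx conversion.

    Hypothetical helper. The flag spelling assumes Pandoc 2.x or later.
    """
    cmd = ["pandoc", "--from", "markdown", "--to", "docx",
           "--output", output_path]
    if reference_doc is not None:
        # Styles are copied from this existing .docx file.
        cmd.append(f"--reference-doc={reference_doc}")
    cmd.append(input_path)
    return cmd


def convert(input_path: str, output_path: str,
            reference_doc: Optional[str] = None) -> None:
    """Run pandoc; requires the pandoc binary to be installed locally."""
    subprocess.run(
        pandoc_docx_command(input_path, output_path, reference_doc),
        check=True,
    )
```

Calling `convert("report.md", "report.docx", "styles.docx")` would then produce a Word file that picks up its styles from `styles.docx`, assuming Pandoc is on the PATH.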

eblade · on Oct 12, 2012

This is likely a service I'd use once in a while. Say I want to convert an HTML file to MOBI once per month: the listed pricing does not fit my use case. That means one conversion for 5 bucks.

I know it isn't easy to set up pricing, but would there be any "pay per use" option for people like me?

lrem · on Oct 12, 2012

Is there any benefit to this over simply running Pandoc?

zrail · on Oct 12, 2012

If you're running your app on Heroku or another free PaaS, it's a pain to get pandoc running and stay within the slug size limits. Additionally, if you want to convert HTML to PDF and have reasonable results (i.e. CSS support of any sort) you need to run a secondary conversion process which is also not trivial to set up.

don_draper · on Oct 12, 2012

Can you watermark and password protect pdf files? If so this would be a great service for selling ebooks.

zrail · on Oct 13, 2012

Just wanted you to know that I added support for password protection. There's an example and documentation on the API page.

zrail · on Oct 12, 2012

Not at the moment. I'm not really sure how to do that, but I'll look into it.

zrail · on Oct 12, 2012

Edit: Actually it looks pretty easy to add. I'll work on that tonight and update the @docverter twitter account when it's ready.

sebnukem2 · on Oct 12, 2012

I want to evaluate the service before paying anything for it.

zrail · on Oct 12, 2012

There's a dev plan link at the bottom of the pricing page, specifically for evaluating before you purchase. It watermarks your documents but other than that it's the real deal.