
A question to experienced Pandoc users:

I want to write a small book that I want to generate in 3 formats: HTML pages, EPUB and PDF. What is the best input format (source format) for the book? Pandoc Markdown? CommonMark? GFM?

I'm a little hesitant to commit myself to Pandoc Markdown or any Markdown because they all have tiny differences from each other. Each is like its own standard.

I considered Org-mode for some time but there are so many edge cases in which Pandoc does not parse Org-mode properly. I mean sometimes simple things like internal links are not rendered properly by Pandoc in the generated output.

So what's the best format to write the input in? Any ideas? Opinions?




FYI, I've done this myself several times, though usually working with an extant book, either from a text or OCR dump, or hand-typing it myself (don't ask).

For works which consist principally of standard sections (e.g., Book / Part / Chapter / Section / Subsection / ...), fairly standard font styles (normal/roman, italic, bold, code blocks / pre / poetry), footnotes/endnotes, and perhaps a few tables, illustrations or images, Pandoc-flavour Markdown is far more than sufficient. Writing or formatting is virtually seamless.

If you're writing something with more complex internal formatting, then I'd lean more strongly into LaTeX. You can do most of your initial authoring in Markdown and generate LaTeX from that, for further finishing work, or simply start with LaTeX. The key discriminator here would be either mathematical formulae or complex image placement. Note that creating the output you want in HTML or ePub (itself effectively a specialised HTML format) might still be challenging.

The next step up would be a specific layout tool (Krita or Adobe Illustrator, say).

But start with Markdown + Pandoc and see if you like the results. It should be Good Enough, and if not, it offers a smooth path to more powerful tools.


If you're familiar with markdown and your book is basically text with some images, I'd strongly recommend Pandoc Markdown.

Pandoc Markdown-as-input is probably the best-supported input format for Pandoc, as far as "reasonable defaults for outputs in other formats" is concerned, and it's broadly compatible with the norms of other markdown styles.

You can always drop into latex, include custom CSS headers, etc. It's also a format where the "formatting" won't generally get in the way of actually-writing, unlike HTML or LaTeX (speaking on behalf of mere-mortals, here).


Nice! Good luck.

I'm writing a small book. I shared my experiences with Pandoc and Asciidoctor in case it helps you or anyone:

https://adammonsen.com/post/2122/

Your use case may differ from mine (I didn't see you mention printing), but my anecdote above might help suss out tooling differences between Pandoc and Asciidoctor.

Here's an example printable book generator using Asciidoctor PDF:

https://github.com/meonkeys/print-this/


What snet0 suggested might be your best bet. You can use Pandoc Markdown but then slip into LaTeX if you need to do something more complicated.


If anyone else was looking for what snet0 suggested and where, here's their comment: https://news.ycombinator.com/item?id=39227851


I'd use leanpub markdown. They generate those formats plus you get a page for the book, can charge for it, and advertise it. (Some of those tasks cost money).

Won't work if you want to keep everything local, though.


I came here to say the same thing. Laying content out for a book comes with way more issues than "just" converting between document formats. Leanpub does it well out of the box.

If you want to look down the path of implementing book layout yourself, here are two breadcrumbs from my bookmarks:

https://journal.stuffwithstuff.com/2014/11/03/bringing-my-we...

https://iangmcdowell.com/blog/posts/laying-out-a-book-with-c...


For leanpub, I started using it for the layout. And kept using it for the marketing.

That's one thing I learned, having written a couple of books. Even though writing is tough, it is easier in many ways than the marketing.


Thanks! Didn't know Leanpub has its own markdown too. Yes I do want to keep everything local.


Ah, then probably not a fit.


There is not a good answer. Markdown is a poor format and there are a number of almost-compatible variations. Markdown in all its variants is also somewhat limited, so on a complex project there will be things you want to do and cannot.

However, Markdown, if you stick with the subset that everyone supports, is the most widely supported option. If you go with a specific Markdown flavor you lose support for something else you might want. If you go for non-Markdown you will lose support for most of the world.

I personally selected reStructuredText, which is really powerful for the complex documentation I'm trying to create. However, I keep running into "nothing else supports it" problems (I can extract doxygen from C++, but only with tools that don't support the latest. I haven't figured out what to do about Rust documentation).


Is there any reason for why you’re not considering AsciiDoc?


Seconding this. Asciidoc was created to be a simple mapping to DocBook - which was specifically designed for writing books. It should be able to handle everything you need, but if there is some esoteric requirement, you can write your own processor.

Then again, content is king. Write it on napkins if you must. When complete, you can spend two days to transcribe to whatever format the publisher requires.


That sounds like what I need. I have been using LaTeX (which I find distracting) and reStructuredText (which I am less familiar with at the moment) with Sphinx. How does AsciiDoc compare to reStructuredText? Both were intended for writing documentation, so are they similar in capabilities?


I've written a book in asciidoc using proprietary tooling. It was a fine experience.

However, the open source tooling doesn't support what I need for a physical book, so I chose to use pandoc and markdown instead, mostly due to the market size of markdown.

(I previously wrote my own tool chain for rst to latex and epub, so I'm well aware of what features are needed to make digital and physical books. I was sick of using a format that had limited tooling while the world had moved on to markdown.)


No such reason. I'm willing to try out AsciiDoc.

I mean I did not try AsciiDoc until now because there are so many choices of input formats and the ones I've tried so far have been disappointing one way or the other.

I talked about Org-mode rendering being broken in edge cases. Same with LaTeX too. I see Pandoc has first-class support for its own Pandoc Markdown format. But the support for all other input formats seems patchy.

If you think Pandoc has good support for AsciiDoc without any edge case issues, I'll be most certainly trying it out.


Pandoc has no support for asciidoctor as an input format — you're expected to just use asciidoctor itself to convert adoc files (and there's no reason not to). Asciidoctor can do HTML and PDF, not sure about EPUB though.


My opinions:

* write in Pandoc Markdown.

* give each sentence its own line. It helps with composing and reordering, and gives much cleaner diffs if you keep this in a git repo.

* personally, I used GitHub for html, Pandoc to make the epub, then Calibre to turn the epub into a pdf.

The "internal links" thing is a pain, admittedly. I have an idea for a workaround:

* sprinkle hidden, unique <a id="ch1.2"></a> around

* on GitHub, use links like chapter1#ch1.2

* for Pandoc, preprocess to remove the filename before the #

I'm working with a big enough book that it's an undertaking, so I haven't done this yet.
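A minimal sketch of the "preprocess to remove the filename before the #" step, in Python. It assumes inline links written as [text](chapter1#ch1.2); the function name and the regular expression are illustrative, not something from Pandoc or GitHub, and reference-style links or other URL schemes are not handled.

```python
import re

# Matches the target part of an inline Markdown link, "](file#anchor)",
# capturing only the anchor. Targets starting with http:// or https://
# are skipped so external links keep their full URL.
# Assumption: links look like [text](chapter1#ch1.2).
_LINK_TARGET = re.compile(r"\]\((?!https?://)[^)#\s]*#([^)\s]+)\)")


def strip_link_filenames(markdown: str) -> str:
    """Rewrite [text](file#anchor) to [text](#anchor) for a single-file build."""
    return _LINK_TARGET.sub(r"](#\1)", markdown)
```

Running the Markdown through this before handing it to Pandoc would leave the GitHub-facing source untouched while the combined document gets purely local anchors.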


Interesting idea re:internal links. For sufficiently complex issues of this nature, pandoc filters[0] are a powerful tool for this kind of mid-conversion processing. I've made some cool projects with the Python package panflute[1]

[0] https://pandoc.org/filters.html

[1] https://github.com/sergiocorreia/panflute
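For anyone curious what such a filter looks like without a helper library, here is a rough sketch using only the Python standard library. A Pandoc JSON filter reads the document AST as JSON on stdin and writes the modified AST to stdout; this one applies the link-rewriting idea above to Link nodes. The function names are made up for illustration, and panflute would express the same thing more compactly.

```python
import json


def localize_links(node):
    """Recursively rewrite Link targets like 'chapter1#ch1.2' to '#ch1.2'.

    In Pandoc's JSON AST a link is {"t": "Link", "c": [attr, inlines, [url, title]]}.
    The tree is modified in place and also returned.
    """
    if isinstance(node, list):
        for child in node:
            localize_links(child)
    elif isinstance(node, dict):
        if node.get("t") == "Link":
            _attr, _inlines, target = node["c"]
            url = target[0]
            # Leave absolute URLs (anything with a scheme) alone.
            if "#" in url and "://" not in url:
                target[0] = "#" + url.split("#", 1)[1]
        for value in node.values():
            localize_links(value)
    return node


def run_filter(instream, outstream):
    """Read a Pandoc JSON document, localize its links, write it back out."""
    json.dump(localize_links(json.load(instream)), outstream)

# A real filter script would end with:
#     import sys; run_filter(sys.stdin, sys.stdout)
```

Saved as, say, localize_links.py (a hypothetical name) with that final call added, it could be passed to Pandoc with the --filter option.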


If you are committed to using Pandoc to generate the three formats then I don't see why you can't commit to Pandoc's flavor of markdown.


I'd just use Pandoc Markdown.

For simple things, you can easily write Markdown that is compatible with all dialects. Mainly, remember to indent your lists 4 spaces if you soft wrap.

But Pandoc Markdown has the most extensive support for other extensions, like footnotes, figures, etc. That's useful because it minimizes how much HTML or Latex you need to write, which in turn makes your documents more portable.

There are other formats that support more features, but in my experience the communities are smaller and the syntax is not as pretty. Ultimately you're betting on that format continuing to exist longer than Pandoc, which I think is not a great bet in most cases. The only format which I think might have better long-term support and compatibility is CommonMark, but it comes at the tradeoff of substantially fewer features. Which again means sacrificing portability because everything you can't do in the base language you need to do in HTML or Latex.


Like others above, I write long form in Word/docx and convert with pandoc. Word supports styles that help experiment document-wide with look, it has an outline mode that helps with re-organizing material, and it handles inline media gracefully. Any of that is painful in markdown. There's no way I would write a 20-500 page document without Word.

I use markdown for the 90% of my writing that is blurbs (and parse it into something like a graph/knowledge kb, and often render via pandoc to pdf, word, Anki, and html).

In both cases, I restrict myself to using features that can be parsed.


I have used markdown with custom lua filters for things like chapter delimiters, non breaking spaces and notes for inserting images. But my setup was kinda wonky as I exported from markdown to ODT and then used LibreOffice to convert it to PDF while adding images to the layout manually as this is impossible to mechanize with LibreOffice.


Depends on what features you want in your book. Out of most of the free/open source tooling, pandoc is probably the best.

I've written multiple books with it and have also used proprietary tooling of publishers. I have a few plugins I use to customize my books, but have yet to find a tool that I wouldn't need to customize.


I've translated HTML markup to Markdown, and from Markdown to LaTeX before fine-tuning the LaTeX, to produce PDFs for printing hard-cover books.

Does Markdown have any way to specify eg "begin a chapter on a new page" ? I don't think this is really a thing in Markdown or HTML but I'm admittedly a casual Pandoc user.


If you are writing a book, presumably you would have each chapter in a separate markdown file.

So, then you convert each chapter file to pdf, and then join the pdfs.


You can inline LaTeX chapter/section break commands even when processing Markdown, and in several ways (dropping \newpage directly in the content before chapters, using header templates, as YAML metadata in the Markdown file, even on the command line).

Google has many examples; one's here using a header file: https://medium.com/@sydasif78/book-creation-with-pandoc-and-...

More are here, including an example using the header-includes YAML metadata param: https://github.com/Wandmalfarbe/pandoc-latex-template/issues...
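As a small illustration of the "drop \newpage before chapters" approach combined with the one-file-per-chapter layout suggested above, here is a sketch in Python. The function name is hypothetical. It relies on Pandoc Markdown passing raw LaTeX commands through to LaTeX/PDF output; other output formats such as HTML would simply not get a page break from this.

```python
def join_chapters(chapters: list[str]) -> str:
    """Join chapter texts into one Markdown document, separated by \\newpage.

    Each chapter is stripped of leading/trailing whitespace, and the raw
    LaTeX command sits in its own paragraph between chapters.
    """
    separator = "\n\n\\newpage\n\n"
    return separator.join(chapter.strip() for chapter in chapters) + "\n"
```

The combined text can then be written to a single file and converted in one Pandoc run, which avoids having to join PDFs afterwards.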


I use LaTeX mostly but if it's not a very complex document I'd say markdown wins for simplicity.


I have written several books in pandoc and I love it. I make an easy script to output drafts in PDF, docx, odt, and epub with one command. I am very happy with pandoc Markdown for this.
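The commenter's script isn't shown, but a one-command draft builder along those lines might look roughly like this Python sketch. The file names, the drafts directory and the function names are assumptions; the only Pandoc feature used is the -o option, which picks the output format from the file extension.

```python
import subprocess
from pathlib import Path

# Output formats named in the comment above.
FORMATS = ("pdf", "docx", "odt", "epub")


def build_commands(sources, out_dir="drafts", stem="book"):
    """Return one pandoc command (as an argument list) per output format."""
    out = Path(out_dir)
    return [
        ["pandoc", *sources, "-o", (out / f"{stem}.{ext}").as_posix()]
        for ext in FORMATS
    ]


def build_all(sources, out_dir="drafts", stem="book"):
    """Run every command; requires pandoc (and a PDF engine) to be installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(sources, out_dir, stem):
        subprocess.run(cmd, check=True)
```

Calling build_all(["ch1.md", "ch2.md"]) would then produce drafts/book.pdf, drafts/book.docx and so on, stopping at the first conversion that fails.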


Might depend on what kind of book it is - do you have a lot of images, tables, cross-references, ... or is it mostly plain text?


Mostly plain text but some images, tables and cross-references too.


Personally, whatever helps with the specific writing part of it all the most is what's best. If you find writing in a given dialect of Markdown or LaTeX or Org-mode is easiest, do that. For me, that's Markdown with embedded LaTeX, for others it's Org-mode, or RST, and so on.

Pandoc handles these fairly seamlessly, and with many options for PDF engines, though I'd say it has a preference for LaTeX and HTML in the backend and Markdown in the frontend, based on my experiences with the edge cases (sometimes entirely solvable with a little Haskell or Lua).

Since LaTeX is the default for PDFs, it pays to keep that in mind and help LaTeX help you (you can use it inline with Markdown or included as preamble in configuration), but sometimes I've just had better luck converting via HTML to PDF ("-t html output.pdf" or directly chaining on from output.html) for what I'm writing in the moment, though other times I'm not stressing LaTeX as much and can just go straight from Markdown to PDF (for example, just writing up something with inline maths). I prefer to avoid LaTeX or HTML's escaped character encoding and often need far more than a single Latin font can provide, so I've ended up dealing with LaTeX's limitations here (even in lualatex and xelatex) more than what I'd suspect is typical. Meanwhile, the standard HTML to PDF backend uses Qt, and I've found it works for everything else I've needed when LaTeX isn't the right backend (and it does come up). On one occasion, I did have to switch that to weasyprint, and that was everything sorted. Alternative backends is an unsung power that few have, while pandoc not only has many built-in (or it is at least internally aware of) but will also integrate with any CLI needed.
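To make the HTML route concrete: the shorthand above corresponds to passing -t html together with -o output.pdf, and the backend can be swapped with --pdf-engine. A hedged Python sketch that only builds the argument list (the function name and defaults are illustrative; which engine Pandoc uses when none is given depends on the Pandoc version):

```python
def html_pdf_command(source, output="output.pdf", engine=None):
    """Build a pandoc command that renders PDF via HTML instead of LaTeX.

    engine is optional, e.g. "weasyprint"; when omitted, pandoc chooses
    its own default engine for the HTML route.
    """
    cmd = ["pandoc", source, "-t", "html", "-o", output]
    if engine:
        cmd.append(f"--pdf-engine={engine}")
    return cmd
```

The resulting list can be handed to subprocess.run, so switching backends for one troublesome document is a one-argument change.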

Output to all three with HTML, EPUB, and PDF can just need a bit of fiddling before it comes out right, depending on how much you're willing to mess with specific metadata for each versus accepting the limits of what Pandoc can handle universally in its AST. Invariably, some compromise is required, but the core semantics of Markdown (including extensions) almost always translate without an issue. The dialect problem of Markdown is really just in the confluence of said semantics with things that have not been separately included, such as the lack of an actual header in Markdown (Pandoc here allows YAML for some, or you just fall back to HTML).

So, tldr; there's no "best" input format, except the one that you find most comfortable to just write the book in, but I find Pandoc is usually best approached from Markdown with the LaTeX or HTML backends. It's powerful and oh so very handy, but it's not going to do all the thinking for you, just a lot of the grunt work, same as any other tool. When in doubt, the user manual is quite readable, and I've found it answered almost every question I had. When it doesn't, other people do, and when they don't, it means I'm either going about it the wrong way or I get to solve an actual problem (but usually the former). But, as always, the most important thing is actually writing it, distribution comes later, so focus your efforts on that and the tools you need to do that effectively.


> If you find writing in a given dialect of Markdown or LaTeX or Org-mode is easiest, do that.

I find Org-mode the easiest but like I said in my comment, the conversion quality is not great. Pandoc breaks a lot of stuff in Org-mode in edge cases. One example I shared in my comment was Pandoc breaking internal links.

So by selecting something I find the easiest I have burned many hours of troubleshooting figuring out why the output does not look right.

That's why I want to draw upon the wisdom of the community here to find out which input format works best, and by best I mean flawlessly. No edge case issues. No rendering flaws. If I get specific recommendations, I'll try them out for some time and then commit myself to one instead of burning more time trialling all of the different input formats.


Unfortunately, the perfect is very much the enemy of the good here. Aside from HTML, I'm afraid that PDF and EPUB are very much driven by purpose-built tools designed to show interactively what it will look like as output. This means that they've both delved into a depth of subtle semantic differences that makes flawless output an extremely difficult task. Of course, practically, pandoc can resolve the vast majority of what people actually use, but everything will still be hit by edge cases from time to time, leading to subtle issues or incompatibilities between EPUB, PDF, and HTML. Each edge case can, of course, be solved in isolation, so finding something that's solved the ones you are encountering already is the ideal, providing a seamless experience for your work. Sadly, each of those is built to solve someone else's specific work, and so sometimes we just have to accept that we either need to compromise on something, we need to paper over the gaps by combining the right tools, or we have to write something ourselves. Fortunately, it isn't the 80s anymore, so many of the tools we have are the "right" ones, and pandoc is very good at combining them.

Again, I find that Markdown (with inline LaTeX or HTML) seems to be Pandoc's preferred starting point, and that the HTML backends are quite useful (particularly when not needing full LaTeX), so perhaps there's some luck to be had there, since HTML may preserve Org's linking and such a bit better, though I don't use Org myself so can't attest to it. And if there's really a problem, then perhaps Pandoc needs some help sorting Org-mode out!


Riffing on crafting pipelines by combining tools...

Org mode can also export html and markdown, so that's three potential pandoc inputs, with potentially different properties. All of which might be massaged before input. And in extremity, an org-mode parser permits emitting customized input. Then pandoc's parsing and filters permit altering the pandoc ast in flight. And the ast isn't hard (assuming comfort with ASTs), so if some other tool has templates and output one likes, one might skip the pandoc backend and emit it oneself from pandoc ast json. Rather than hoping to persuade that other tool to both accept and generate what's needed.

So for instance, last year I had a project written in a project-specific markdown dialect, kludged to pandoc-flavored markdown, parsed with `pandoc -t json`, and html emitted custom from the pandoc ast. With embedded directives from dialect to emitter. And html templates copied from non-pandoc tools. In a language with nice pattern matching (julia's Match), the emitter was a short page of code.
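The original emitter was in Julia and isn't shown; as a rough Python stand-in, here is what emitting HTML directly from the output of `pandoc -t json` can look like. It handles only a handful of node types (Header, Para, Str, Space, SoftBreak, Emph, Strong) and raises on anything else, so it is a sketch of the approach rather than a usable converter. All function names are made up.

```python
import html
import json


def inlines_to_html(inlines):
    """Render a list of Pandoc inline nodes to an HTML string."""
    out = []
    for node in inlines:
        t, c = node["t"], node.get("c")
        if t == "Str":
            out.append(html.escape(c))
        elif t == "Space":
            out.append(" ")
        elif t == "SoftBreak":
            out.append("\n")
        elif t == "Emph":
            out.append(f"<em>{inlines_to_html(c)}</em>")
        elif t == "Strong":
            out.append(f"<strong>{inlines_to_html(c)}</strong>")
        else:
            raise NotImplementedError(f"inline type {t!r} not handled in this sketch")
    return "".join(out)


def blocks_to_html(blocks):
    """Render a list of Pandoc block nodes to HTML, one block per line."""
    out = []
    for node in blocks:
        t, c = node["t"], node.get("c")
        if t == "Para":
            out.append(f"<p>{inlines_to_html(c)}</p>")
        elif t == "Header":
            # Header content is [level, [id, classes, key-values], inlines].
            level, (ident, _classes, _kv), inlines = c
            id_attr = f' id="{html.escape(ident)}"' if ident else ""
            out.append(f"<h{level}{id_attr}>{inlines_to_html(inlines)}</h{level}>")
        else:
            raise NotImplementedError(f"block type {t!r} not handled in this sketch")
    return "\n".join(out)


def emit(ast_json: str) -> str:
    """Take the JSON text produced by `pandoc -t json` and return HTML."""
    return blocks_to_html(json.loads(ast_json)["blocks"])
```

The appeal of this design is that every node type is an explicit branch, so project-specific directives can be added as extra cases instead of fighting a template system.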

"Avoid reinventing wheels, but sometimes it's easier to assemble a satisficing custom vehicle, than to find and adapt a previously-built one."


Great comment! Thanks for engaging in this discussion and offering some good perspective about my Pandoc issues. Really appreciate it!



