From Word to Markdown to InDesign: Fully Automated Typesetting Using Pandoc

Animats · on Dec 2, 2015

Well, yes, if you dumb your document down to the level of "markdown", it's not hard to grind it out as plain text with some styling. You can write HTML in that style, too. People did that 20 years ago.[1] That was the original vision of HTML.

Some people wanted to stop there, and limit HTML to describing the semantics of a document, not its visual appearance. They lost.[2]

Pandoc doesn't really use "markdown". It uses "enhanced markdown":

"Pandoc’s enhanced version of Markdown includes syntax for footnotes, tables, flexible ordered lists, definition lists, fenced code blocks, superscripts and subscripts, strikeout, metadata blocks, automatic tables of contents, embedded LaTeX math, citations, and Markdown inside HTML block elements."

Pandoc's "markdown" now has roughly the feature set of HTML 3.

[1] http://www.animats.com/papers/articulated/articulated.html

[2] http://www.w3.org/People/Raggett/book4/ch02.html

rhythmvs · on Dec 2, 2015

You are right. But “markdown” — not the brand, but the concept of lightweight markup — is a work in progress still.

Just look at the list of “feature requests” on the CommonMark forum.¹ Pandoc’s Markdown supports lots of extra features from Github Flavored Markdown (GFM), MultiMarkdown and MarkdownExtra. John MacFarlane, Pandoc’s author, is also the author of the CommonMark spec. And while, eventually, the goal is to extend the CommonMark syntax with a sufficiently broad vocabulary of content element types, the focus now is on getting the basics right, speccing the edge cases, and defining standard parser behavior.

Meanwhile, there is AsciiDoc (which is a port of DocBook XML to lightweight markup), for more demanding use cases. And then there is also reStructuredText, and org-mode, etc. But the Markdown-ish style of syntactic sugar won the race, and having done so, it will be the basis for a future lightweight markup language that will indeed support written content that is not dumbed down, with the full variety of nuance offered by the likes of TEI, DocBook and TeX, and that will be fit to reproduce the acme of complex documents and typesetting, such as critical editions and technical manuals.

Until we reach that point, I bet on dumbed-down Markdown, and while doing so, one can always add in raw HTML ad libitum. The bulk of most written content, after all, is just headings and plain text, with a few emphasized phrases here and there, a link, an image. That covers a lot of use cases already.

¹ http://talk.commonmark.org/c/extensions

jolux · on Dec 2, 2015

If you try to prove that those who thought HTML should be exclusively for semantics and not visual appearance "lost" by linking to a history of HTML that was last updated in 1998, it seems as if you missed HTML5, where great pains have been taken to remove visual styling information from HTML in favor of semantic markers.

In fact most of the web developers I speak with regularly agree. The consensus in recent years is that HTML is supposed to be exclusively semantic and markup should be styled with CSS.

http://www.csszengarden.com/ is this concept taken to an extreme. Same HTML, same semantics, different styling.

Animats · on Dec 2, 2015

The claim was made for HTML5 that it would do that. Some HTML5 page source does look like that, but not most of it.

jolux · on Dec 5, 2015

No, there is a specification for HTML5, and it does exactly what I said it does. Granted, most "HTML5" is a mix of HTML 4 and HTML5 (sometimes even a little XHTML) but these are not strict validated sites.

Here, try running a few websites through https://validator.w3.org/ and see what pops up. You'd be surprised.

For example, https://news.ycombinator.com is extremely bad HTML5, and if you look at the report on it, https://html5.validator.nu/?doc=https%3A%2F%2Fnews.ycombinat... you can see a bunch of "Use CSS Instead" suggestions. This is the separation of style from semantics that HTML5 was made to reinforce.

leephillips · on Dec 2, 2015

Even the list you quote has features that are not in any version of HTML and never will be (automated tables of contents in HTML 3?). Plus, there is much else in pandoc-markdown that HTML will never be able to do, as it requires computing and is beyond declarative markup.

maxerickson · on Dec 3, 2015

html tags are valid markdown. I don't think that is just pedantic nitpicking either, part of the design philosophy was to lean on html to do things that were not supported with markdown shortcuts.

It's discussed early here:

https://daringfireball.net/projects/markdown/syntax

rubidium · on Dec 3, 2015

So 30% of my current job is writing technical user guides for large, custom, automated research systems (the rest is design and development of those systems).

Each one is slightly different, and so needs its own guide. We currently are stuck using Word b/c it's so fast. Yes it's ugly. Yes it's a pain to work with sometimes. But I haven't found anything that's faster to produce these docs with.

The reason we haven't switched to html, latex, or xml etc... is because I haven't found a typesetting program that is as easy to drag and drop images into and out of. Each guide has 30+ images on average. I _need_ that interface to be drag and drop otherwise it's too tedious to write the guide. The formatting of the text I'd love to do in some sort of markup language, but the images have to be dead simple.

Short version: Anyone know of a typesetting solution with some markup language, version control, print-to-PDF at the click of a button, and drag-and-drop images?

rhythmvs · on Dec 3, 2015

That’s exactly the sort of use case we are focussing on and building our automated typesetting service for.¹ You’ll probably also benefit from file inclusions so that you can keep a library of text snippets, with variables, keeping them versioned separately, and which you can then re-use and assemble to form a new, slightly different manual, with the variables automatically populated at typesetting time.

Could you be more specific on the problem you’re facing with placing images? Is it that there are like hundreds of them, for which you would want to have created the references (and ![Caption](syntax)) automatically, because typing out the file paths manually is too tedious? That is doable, although it would be more of a drag-and-drop feature to be implemented by a dedicated Markdown editor, or an ST plugin.
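To make the "created automatically" idea concrete, here is a minimal sketch of generating the image references from a list of file names. The function names, the caption-from-file-name rule, and the `images` directory are all assumptions for illustration, not part of any existing editor or plugin.

```javascript
// Hypothetical helper: derive a caption from a file name.
// "a_install-step.png" becomes "a install step".
function captionFromFileName(fileName) {
  return fileName
    .replace(/\.[^.]+$/, "") // drop the extension
    .replace(/[-_]+/g, " ")  // separators become spaces
    .trim();
}

// Hypothetical helper: build Markdown image references for every
// image file in `fileNames`, assuming they all live in `dir`.
function imageReferences(fileNames, dir) {
  return fileNames
    .filter((name) => /\.(png|jpe?g|gif|svg)$/i.test(name)) // images only
    .sort()
    .map((name) => `![${captionFromFileName(name)}](${dir}/${name})`);
}

const refs = imageReferences(
  ["b_front-panel.jpg", "notes.txt", "a_install_step.png"],
  "images"
);
console.log(refs.join("\n\n"));
```

An editor plugin could run something like this on a drop event and paste the result at the cursor; that wiring is editor-specific and left out here.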

Or would you like to have more fine-grained control over floats, sizing and placement of the images relative to their place in the text narrative? For that, a WYSIWYG interface is indeed better suited, re: many document authors’ gripe with TeX’s figure placement.

¹ http://textus.io/

rubidium · on Dec 3, 2015

Glad to hear it.

Here are some bullet-point thoughts:

1) Workflow needs to be uber-fast and not require saving/naming the images (at least at the user level; the program will, of course). Many images are screenshots from SW installs. Workflow goes: Run SW in VM -> screenshot portion of interest -> Copy+Paste into Word doc. I never even save the image in that case. The rest of the images are photos of HW. I just want to drag and drop into the document, not rename them all and move them to the same folder as the document. The document/CMS/program should take care of all that for me (like Word does).

2) I do need the text and images to match better and be more WYSIWYG. Location is more important than aesthetics for a guide. It cannot be the abomination that LaTeX is. LaTeX figures are terrible to work with (and that's having published multiple academic papers + PhD thesis in LaTeX). Sizing is less important (that can be [width=x, height=y]).

3) Variables/snippets: Libraries of text snippets and variables is what we do quite a bit of. Right now with a few kludged together macros and custom Word doc properties. Lots of docs go "Run the <SW install name> for <device>" with <> items defined at document level. Sounds like you do this, and it's probably better than how word does :)
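As a rough illustration of the `<>` substitution described in 3), here is a minimal sketch of filling placeholders from a document-level table. The function name and the example values are made up for illustration; this is not how Word or any particular typesetting service does it.

```javascript
// Hypothetical helper: replace every <name> placeholder in a snippet
// with the value defined for that name at document level.
function fillSnippet(snippet, variables) {
  return snippet.replace(/<([^<>]+)>/g, (match, name) => {
    const key = name.trim();
    if (!Object.prototype.hasOwnProperty.call(variables, key)) {
      // Fail loudly rather than ship a guide with a hole in it.
      throw new Error(`Undefined variable: ${key}`);
    }
    return variables[key];
  });
}

// Example values are invented for illustration.
const docVariables = {
  "SW install name": "AcquireSuite Setup",
  "device": "Stage Controller",
};
console.log(fillSnippet("Run the <SW install name> for <device>.", docVariables));
```

Throwing on an undefined variable is a design choice: for a user guide, a missing name is usually worse than a failed build.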

sorry for dumping on ya, but I'm excited to find people working in this space. It's something I feel like there should be better tools for but just haven't found yet.

brazzledazzle · on Dec 3, 2015

You might be interested in Problem Steps Recorder[0]. Word should be able to import the resulting mhtml file.

[0] http://blogs.msdn.com/b/patricka/archive/2010/01/04/using-th...

ian97531 · on Dec 3, 2015

Try a product called Habitat made by a company called Inkling.

https://www.inkling.com/inspiration/

https://www.inkling.com/habitat/

rubidium · on Dec 3, 2015

looks promising! Definitely attacking the same problem.

hsitz · on Dec 3, 2015

I think Org-mode (in Emacs) does something similar to what you want. Here's link to page with add-on function to add drag'n'drop images to Org files. The web page itself is what was exported from Org, although the export could have as easily been to pdf. There's also link on this page to youtube video illustrating things: http://kitchingroup.cheme.cmu.edu/blog/2015/07/10/Drag-image...

Org-mode itself uses its own rules for adding Markdown-like formatting, but Org-mode has features that go way beyond Markdown. Here's a fairly simple technical doc with a few images and footnotes: http://kieranhealy.org/files/misc/workflow-apps.pdf

and here's the org file and images that were used to author that document and export it to pdf: https://github.com/kjhealy/workflow-paper

girzel · on Dec 3, 2015

tl;dr: In your Org mode file, hit C-u C-c C-l, navigate to the image file, hit RET RET, and you're done!

Depending on your computer usage habits, it's likely faster than drag-n-drop.

hsitz · on Dec 3, 2015

There are probably a few different ways out there. Here's link to another add-on method that allows for drag and dropping of images directly from a Firefox web page into Org, code handles async downloading of image to local directory: http://thread.gmane.org/gmane.emacs.orgmode/77708

EDIT: the above-linked thread is actually about the beginnings of the 'org-download' extension, now found here: https://github.com/abo-abo/org-download

tambourine_man · on Dec 3, 2015

In Vim, I usually type <img src="images/…

and AutoComplPop[1] sets in, showing all files in that directory. As long as they have a descriptive name, it's probably easier/faster/more reliable than drag and drop.

In fact, I actually type img ctrl-E, which expands to the tag above and places cursor at the right place.[2]

I know other editors have similar features, so it might be an alternative.

  [1] http://www.vim.org/scripts/script.php?script_id=1879
  [2] https://github.com/rstacruz/sparkup

pjmlp · on Dec 3, 2015

Professional DocBook- and DITA-based systems like Oxygen XML or FrameMaker, among others?

https://www.oxygenxml.com/

http://www.adobe.com/products/framemaker.html

That is how we did it in a few enterprise projects.

jolux · on Dec 2, 2015

Is Markdown really "relatively new?" According to Wikipedia, it's older than OOXML, aka .docx .xlsx .pptx etc.

https://en.wikipedia.org/wiki/Markdown https://en.wikipedia.org/wiki/Office_Open_XML

rhythmvs · on Dec 2, 2015

Relative to widespread brand recognition, I guess, and relative to similar initiatives. But indeed, Markdown (as in, the concept of _lightweight markup_) predates the Web and is at least as old as Usenet. In footnote 9, I linked to my Github repo with Markdown resources; there you’ll find the following chronology:

- Setext (Ian Feldman, 1992)
- AFT (Todd Coram, 1999)
- Grutatxt (Ángel Ortega, 2000)
- atx (Aaron Swartz, 2002)
- AsciiDoc (Stuart Rackham, 2002)
- MediaWiki (Magnus Manske & Lee Daniel Crocker, 2002)
- reStructuredText (David Goodger, 2002)
- Org-mode (Carsten Dominik, 2003)
- Textile (Dean Allen, 2004)
- Markdown (John Gruber & Aaron Swartz, 2004)

jolux · on Dec 2, 2015

Yes, I'm quite aware of the history, which is why I thought it was strange that you referred to it as "relatively new." Otherwise, it's a great post.

rhythmvs · on Dec 2, 2015

Thanks! And yeah, if people would only know their history… Relatively new, to me, unfortunately, means that lightweight markup still is not something which the broader public is familiar with, not even by its proxy brandname “Markdown”, even with big outlets (e.g. Reddit) pushing it forward.

tjl · on Dec 2, 2015

It's been around since 2004, but only really gained momentum when the iPhone app store came out. Since you couldn't have styled text, people would use apps that supported Markdown on their iPhone and once processed, see it styled.

shortformblog · on Dec 3, 2015

From the perspective of someone with experience working in the print industry: This is impressive work, and you nail down the biggest issue with Word—if you don't use pure style sheets, you get cruddy code. (Google Docs has this problem, too.)

I think that lots of folks are trying to find different ways to solve this problem in the print world, especially as stories have to be pulled into CMSes as well.

Personally, I'd be curious if it'd be possible to cull copy from an InDesign file to convert into Markdown, complete with links to images used in the document that could then be edited.

katabasis · on Dec 3, 2015

It's great to see others in the publishing world moving away from proprietary software like Word and towards plain-text based processes. As the author points out there are many benefits of this.

Here's something to consider: you've ditched MS Word, maybe you can ditch InDesign too? In my experience once content goes into an Adobe program, it's hard to get it out again in a clean way. It's pretty impressive what you can accomplish using CSS3 for print layout[1].

I work at an academic publisher, and I've spent much of the last year preaching the benefits of a similar workflow. Some of the editors are now editing manuscripts in Markdown files directly. Currently we're building a system where a single set of text files gets fed into a program like a static website generator. This produces a web version, but also PDF, ePUB, etc. automatically. We're getting pretty close[2]. I think this is the future for many forms of publishing.
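For readers who want a feel for the single-source idea, here is a minimal sketch that builds one Pandoc command line per output format. It assumes Pandoc as the converter, which is not necessarily what our system uses; the file paths and the `pandocCommands` name are hypothetical. PDF is left out because it additionally needs a LaTeX (or similar) engine.

```javascript
// One entry per output format. The writer names are Pandoc writers;
// "icml" is the InCopy format that InDesign can place.
const TARGETS = [
  { writer: "html", ext: "html" },
  { writer: "epub", ext: "epub" },
  { writer: "icml", ext: "icml" },
];

// Hypothetical helper: build the argument list for each target
// from a single Markdown source file.
function pandocCommands(source, outDir) {
  const base = source
    .replace(/^.*\//, "")     // strip the directory part
    .replace(/\.[^.]+$/, ""); // strip the extension
  return TARGETS.map(({ writer, ext }) => [
    "pandoc", "--standalone",
    "--from", "markdown",
    "--to", writer,
    "--output", `${outDir}/${base}.${ext}`,
    source,
  ]);
}

for (const args of pandocCommands("manuscript/book.md", "build")) {
  console.log(args.join(" "));
}
```

The point is only that every format is derived from the same text files; the build step itself could just as well be a Makefile or a static site generator plugin.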

I think one of the big remaining pieces of the puzzle is creating a better Markdown editor, something suited for the needs of scholars & academics with support for things like footnotes, bibliographies, etc, while remaining a plain-text format.

[1] http://alistapart.com/article/building-books-with-css3

[2] http://egardner.github.io/posts/2015/building-books-with-mid...

zyxley · on Dec 2, 2015

Unlinked footnotes in a webpage?

With all that unused margin space on a desktop, it seems like they would have made more sense as Tufte-style marginal notes anyway.

rhythmvs · on Dec 2, 2015

Though they are linked, see e.g. http://rhythmus.be/md2indd/#fn2 — it’s Pandoc after all, which creates those ;-)

The wide margins are due to typography best practices, which dictate between ~50 and 70 characters on a line. One could of course blow up font-size, as is in fashion, since Medium. But then there are too few lines above the fold.

But you’re entirely correct as regards marginal notes. Unfortunately, they’re not trivially implemented, since you’d need to swap end-notes to side-notes relative to the viewport, i.e. at a media query breakpoint, and since that involves DOM manipulation, it cannot be done with CSS alone. Let alone dealing with collision detection for stacking longer footnotes vertically…
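The stacking part, at least, is simple once the positions are measured. Here is a minimal sketch, assuming each note comes with the `top` of its anchor and its own rendered `height` in pixels; the `stackNotes` name and the 8 px gap are arbitrary, and the DOM measuring and repositioning are left out.

```javascript
// Hypothetical helper: given notes with a desired top and a height,
// push each note down just far enough that it clears the one above.
function stackNotes(notes, gap = 8) {
  const sorted = [...notes].sort((a, b) => a.top - b.top);
  let nextFree = -Infinity; // first y coordinate not yet occupied
  return sorted.map((note) => {
    const top = Math.max(note.top, nextFree);
    nextFree = top + note.height + gap;
    return { ...note, top };
  });
}

const placed = stackNotes([
  { id: "fn1", top: 100, height: 60 },
  { id: "fn2", top: 120, height: 40 }, // would overlap fn1
  { id: "fn3", top: 400, height: 30 },
]);
console.log(placed.map((n) => `${n.id}: ${n.top}`).join(", "));
```

This greedy pass only ever moves notes down, so a long run of dense notes can drift far from their anchors; that is the hard part the sketch does not solve.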

munificent · on Dec 2, 2015

I implemented this in my online book[1] and it wasn't too bad. Try resizing the window to see.

The entirety of the JS (not including jQuery) is:

    $(document).ready(function() {
      $(window).resize(refreshAsides);

      // Since we may not have the height correct for the images, adjust the asides
      // too when an image is loaded.
      $('img').load(function() {
        refreshAsides();
      });

      // On the off chance the browser supports the new font loader API, use it.
      if (document.fontloader) {
        document.fontloader.notifyWhenFontsReady(function() {
          refreshAsides();
        });
      }

      // Lame. Just do another refresh after a short delay (200 ms), when the font
      // is *probably* loaded, to hack around the fact that the metrics changed a bit.
      window.setTimeout(refreshAsides, 200);

      refreshAsides();
    });

    function refreshAsides() {
      // Don't position them if they're inline.
      if ($(document).width() < 800) return;

      // Vertically position the asides next to the span they annotate.
      $("aside").each(function() {
        var aside = $(this);

        // Find the span the aside should be anchored next to.
        var name = aside.attr("name");
        var span = $("span[name='" + name + "']");
        // jQuery never returns null; an empty match has length 0.
        if (span.length === 0) {
          window.console.log("Could not find span for '" + name + "'");
          return;
        }

        aside.offset({top: span.position().top - 3});
      });
    }

In my book, the asides are positioned very precisely next to certain lines. If you don't need that level of precision, a pure CSS solution is possible, I think.

Overlapping footnotes could be a problem, but to me that's a case where trying to completely separate design from content is a bad idea. Design should optimize for the actual prose you have and not all possible copy you might write. Likewise, it's often worth tweaking copy a bit to make it look better with your design.

In a couple of cases with my book, I tweaked or rearranged asides to avoid them overlapping.

[1]: http://gameprogrammingpatterns.com/introduction.html

Scarbutt · on Dec 2, 2015

Nice book, what did you use to create the HTML from the Markdown? Or did you do some CSS fiddling?

munificent · on Dec 4, 2015

A very simple cobbled together Python script:

https://github.com/munificent/game-programming-patterns/blob...

Arnavion · on Dec 3, 2015

> // On the off chance the browser supports the new font loader API, use it.

> if (document.fontloader) {

> document.fontloader.notifyWhenFontsReady(function() {

FYI it's the `document.fonts.ready` promise now (a property, not a method).
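A minimal sketch of how the quoted check could look with that promise. `whenFontsReady` is a made-up name, and the document object is passed in as a parameter only so the function can be tried outside a browser; `refreshAsides` is the function from the snippet above.

```javascript
// Hypothetical helper: run `callback` once web fonts have loaded.
// Falls back to running it on the next microtask when the
// CSS Font Loading API (document.fonts) is not available.
function whenFontsReady(doc, callback) {
  const ready = doc && doc.fonts && doc.fonts.ready;
  if (ready && typeof ready.then === "function") {
    return ready.then(() => callback());
  }
  return Promise.resolve().then(() => callback());
}

// In the page this would be called as:
//   whenFontsReady(document, refreshAsides);
```

This would replace both the `document.fontloader` branch and, in browsers that support the API, the timeout hack.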

dredmorbius · on Dec 3, 2015

ORLY?

http://codepen.io/dredmorbius/pen/OVmKZX

https://m.imgur.com/a/TXpis

akavel · on Dec 3, 2015

@rhythmvs Have you considered using SILE [1][2] instead of LaTeX (and maybe even InDesign)? (I'm not affiliated, but I believe it may become a worthy successor to TeX in future.)

[1]: http://video.fosdem.org/2015/main_track-typesetting/introduc...

[2]: https://archive.fosdem.org/2015/schedule/event/introducing_s...

cossatot · on Dec 2, 2015

Will it work with equations? A major peeve of mine is trying to get TeX to deal with figures in a better manner, especially in space-limited situations (e.g. grant proposals), and I used to use InDesign for that before my work got too mathy.

adiM · on Dec 3, 2015

Try ConTeXt. The float placement is much more flexible than in LaTeX. Being built on top of TeX, it supports all the math.

pjstew · on Dec 3, 2015

I've been looking into the same problem for a while, and came to the exact same solution a few weeks ago, coincidentally. I haven't actually got round to completing all the code for it, but have tested each section. I was delighted when I spotted your article this morning, but was hoping you would have also shared your code... No git repository? I'm sure I can make the whole system myself, but I'm always happy to use others' work if it exists. If you do have a working version of this process, please do share it!

pessimizer · on Dec 3, 2015

I prefer asciidoc: http://powerman.name/doc/asciidoc http://asciidoctor.org/docs/what-is-asciidoc/

todd8 · on Dec 3, 2015

I'm really looking forward to the evolution of Markdown, but it will not be a complete replacement for TeX (for many years). TeX is designed around a powerful (macro based) programming system. This is an excerpt from a comment that I posted on HN a while back that is apropos this discussion:

TeX's macro style of programming is too difficult. Nevertheless, people have done amazing things with it.

TeX has somewhere around 325 primitives, and one of the most important is the \def primitive used to define macros. These primitives are used to define additional macros, hundreds of them, available in different so-called formats. A basic format known as Plain TeX includes about 600 macros in addition to the 325 primitives. LaTeX is another format, the most widely used, but there are others, like ConTeXt, that are also very capable. Each of these extends TeX's primitives with its own macros, resulting in different kinds of markup language. TeX's primitives are focused on the low-level aspects of typesetting (font sizes, text positions, alignment, etc.). LaTeX provides a markup language that is focused on the logical description of the document's components: headings, chapters, itemized lists, and so forth. The result is a system that does simple things easily while allowing very complex typesetting to be performed when needed.

In addition to the TeX core primatives and the hundreds of commands (implemented as macros) in a format like LaTeX there are additional packages, classes, and styles that are used to provide support for any conceivable document. LaTeX has a rich ecosystem of packages. Typesetting chess? There's a LaTeX package for that. Complex diagrams and graphics, there's a LaTeX package for that. Writing a paper in the style of Tufte? Writing a book? or a musical score? or building a barcode? there are packages for that. The documentation for the Tikz & PGF graphics package is over 1100 pages long! The documentation for the Memoir package is 570 pages.

The amazing thing is that all of this is built out of macros. Diving into this (and once one needs to customize the look of a document, it's inevitable), you find yourself in a maze of twisty little passages.

Once upon a time, while writing assembly language for large computers, I enjoyed writing fancy assembler macros. I was fascinated with Calvin Mooers' TRAC programming language based on macros and Christopher Strachey's General Purpose Macrogenerator. These were early (mid-1960s) explorations into the viability of macro processors as a means for expressing arbitrary computations. Readers interested in trying out macros for programming can try the m4 programming language (by Kernighan and Ritchie) found on Unix and Linux systems. m4 is used in autoconf and sendmail config files.

Yet TeX macros are in a whole other dimension. All of these powerful macro systems have one thing in common: parameterized macros can be expanded into text that is then rescanned, looking for newly formed macro calls (or new macro definitions) to expand, as many times as one wants. This isn't just an occasional leaky abstraction; it is programming by way of leaky abstractions. Working through TeX packages is some of the most difficult programming that I've done. It's unbelievably impressive what people have come up with (e.g. floating point implemented via macro expansion in about 600 lines of TeX), but it's also unbelievably frustrating to program in such an environment.

The LaTeX3 project is an attempt to rewrite LaTeX (still running on top of the TeX core). Started in the early 1990s, it is still not done. I think it's just that they are mired in a swamp of macros. They do have a relatively stable set of macros written, with the catchy name expl3, that are intended for use when writing LaTeX3. Here's a sample:

     \cs_gset_eq:cc
    { \cf@encoding \token_to_str:N  #1 } { ? \token_to_str:N #1 }

This is described in the documentation as being a big improvement over the old macros and "far more readable and more likely to be correct first time". I can't wait.

I think LaTeX is absolutely without peer, but I wish improving its programming method wasn't so daunting. I keep toying with starting a project to do just that, but so many others have tried and failed. It's disheartening.
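For anyone who hasn't met the expand-then-rescan style of programming, here is a toy sketch of it. This is emphatically not TeX: the `\name{arg}` call syntax, the single `#1` parameter, and both function names are simplifications invented for illustration, and there are no catcodes or nested braces.

```javascript
// One pass: replace every call \name{arg} whose argument contains no
// braces with the macro body, substituting the argument for #1.
function expandOnce(text, macros) {
  return text.replace(/\\([A-Za-z]+)\{([^{}]*)\}/g, (call, name, arg) =>
    Object.prototype.hasOwnProperty.call(macros, name)
      ? macros[name].split("#1").join(arg)
      : call // unknown macro: leave the call untouched
  );
}

// Rescan the output until a pass changes nothing.
function expand(text, macros, maxPasses = 50) {
  for (let pass = 0; pass < maxPasses; pass++) {
    const next = expandOnce(text, macros);
    if (next === text) return text; // nothing left to expand
    text = next;
  }
  throw new Error("Expansion did not terminate");
}

const macros = {
  upper: "[#1]",
  shout: "\\upper{#1}!",
  make: "\\#1{x}", // the expansion *forms* a new macro call
};
console.log(expand("\\shout{hi}", macros));   // \upper{hi}! and then [hi]!
console.log(expand("\\make{upper}", macros)); // \upper{x} and then [x]
```

The `make` macro is the interesting one: its output only becomes a macro call after substitution, so no amount of reading the definition tells you what will run without knowing the arguments. Scale that up to a few thousand macros and you have the swamp.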

Links:

[TRAC] https://en.wikipedia.org/wiki/TRAC_(programming_language)

[GPM] http://comjnl.oxfordjournals.org/content/8/3/225.full.pdf

[m4] info pages available on Unix and Linux

[Tikz & PGF] https://www.ctan.org/pkg/pgf?lang=en

[Memoir] https://www.ctan.org/pkg/memoir?lang=en

[expl3] https://www.tug.org/TUGboat/tb30-1/tb94wright-latex3.pdf

akavel · on Dec 3, 2015

Did you maybe have a look at SILE [1][2]? (I'm not affiliated, but it got me interested very much and I believe it may become a worthy successor to TeX in future.)

[1]: http://video.fosdem.org/2015/main_track-typesetting/introduc...

[2]: https://archive.fosdem.org/2015/schedule/event/introducing_s...

todd8 · on Dec 3, 2015

Thank you. I'm definitely going to look into it.

todd8 · on Dec 3, 2015

Thank you.