I do all of my academic writing in pandoc. As compared to LaTeX this means no boilerplate (yet you can still use full LaTeX syntax for equations and the like) and, if the publisher 'needs' a Word file, you are one click away from providing it. All with plain text files that you can put under version control, get meaningful diffs, etc. It's just great.
Sorry to be pedantic - but I didn't think that 'pandoc' was an actual document format, but purely a tool for converting between formats. Do you mean that you do your writing in a kind of 'pandoc flavoured' markdown? [0]
Well, between Pandoc's markdown flavor and how it has its own way of letting you insert latex code anywhere, you're not going to be able to process your document using anything that's not trying to be compatible with pandoc documents.
I’ve never understood the impetus for not using full LaTeX in an academic context, given that the boilerplate is so minimal and presumably one has built up a personal template over time.
For blog posts and notes I see the appeal, since the boilerplate can be a hindrance to spontaneous writing.
Latex can't produce web output, which is increasingly a target I want.
Also, Latex can't produce any output which is accessible to blind people (other than giving them the raw LaTeX). The PDFs latex produces are probably the least accessible format available (much worse than a Word-produced PDF, or some HTML). This matters to me, and should matter more to other people (in my opinion).
BUT that is what makes Pandoc powerful. You convert your latex or your whatever into: (Can we please add Racket's Scribble? It is by far the main reason why Racket has the best documentation of any language. https://docs.racket-lang.org/scribble/)
Markdown, reStructuredText, textile, HTML, DocBook, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Vimwiki markup, OPML, Emacs Org-Mode, Emacs Muse, txt2tags, Microsoft Word docx, LibreOffice ODT, EPUB, or Haddock markup to
HTML formats: XHTML, HTML5, and HTML slide shows using Slidy, reveal.js, Slideous, S5, or DZSlides
Word processor formats: Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML, Microsoft PowerPoint
Ebooks: EPUB version 2 or 3, FictionBook2
Documentation formats: DocBook version 4 or 5, TEI Simple, GNU TexInfo, Groff man, Groff ms, Haddock markup
Archival formats: JATS
Page layout formats: InDesign ICML
Outline formats: OPML
TeX formats: LaTeX, ConTeXt, LaTeX Beamer slides
PDF: via pdflatex, xelatex, lualatex, pdfroff, wkhtmltopdf, prince, or weasyprint
Well, except LaTeX probably isn't the best base format to write in -- Pandoc's LaTeX parser isn't very good, it doesn't parse (from a quick check) any of the papers I've written. They've tried hard, but I think it's a losing battle, particularly once people start using a large range of packages.
That's not surprising -- it's basically impossible to "parse" LaTeX, as it's defined by execution.
iirc pandoc's markdown provides the set of functionality that one is capable of transforming back and forth. So as long as you stay within those formatting confines, you are set.
This works for everything except table notes a la `threeparttable`
What about htlatex? It is quite powerful. In most of the cases, it produces nice HTML pages out of the box, with automatic rendering of figures and mathematical equations into PNG. It is part of most LaTeX distributions. On Linux, for example, just type
For me, at least, htlatex never works quite right. There are a lot of edge cases where it's broken. If you want to keep the option of non-PDF output, starting in something like Pandoc Markdown is a better idea. And I do most of my documents in regular LaTeX.
>Also, Latex can't produce any output which is accessible to blind people
This sounds like it should definitely be a target of a grant. I guess most government organisations around the world are using Word et al, which isn't too bad these days accessibility wise (AFAIK).
Can you provide a small example of a LaTeX document that produces an inaccessible PDF?
If you grab any academic paper (particularly two columns) there is a good chance getting the text out will be hard, and any part of the paper with maths or tables will be unusable. Sorry, I'm away from a computer right now, so I can't make a smaller example.
Any chance you could post the source code for this? It's using bitmaps for characters instead of proper fonts, which shouldn't happen nowadays. Maybe you should put "\usepackage{lmodern}" at the start? See for example https://tex.stackexchange.com/questions/1291/why-are-bitmap-...
I work with course materials made in Latex, and students sometimes need/want to copy and paste from them, so I try to avoid these kinds of problems.
Accessibility is a big current push from the TeX Users Group. The president, Boris Veytsman, has made moving it forward a big goal. I know that a lot of people are working on aspects of that, but the name I hear the most is Ross Moore, who I have heard talk on making the output be PDF/A-3a compliant. I understood that it is a long way there.
I hope so, because honestly, Tex generated PDFs are the single biggest problem with being a blind researcher (I'm not blind, but I know a blind researcher).
>I’ve never understood the impetus for not using full LaTeX in an academic context, given that the boilerplate is so minimal and presumably one has built up a personal template over time.
I don't find the boilerplate minimal at all. Contrast the following:
\begin{itemize}
\item First
\item Second
\item Third
\end{itemize}
with
- First
- Second
- Third
I won't even get into the hell that is tables.
I loved LaTeX until I discovered Org Mode. Pandoc also scratches the same itch.
I agree. If one is going to use LaTeX directly or indirectly via Pandoc, eventually one would have to build up a personal template to fine-tune the look and feel of the documents.
If one is going to write LaTeX code anyway, it seems easier and cleaner to use LaTeX all the way, move all the boilerplate along with the personal template to say, a file named preamble.tex, and \input{preamble.tex} in the documents.
However, there are situations where Pandoc can be convenient. For example, I wanted a document[1] to be written primarily as README.md (CommonMark format), so that GitHub could render it as the project README. At the same time I wanted to render a PDF output from a customized form of the content. Pandoc is convenient for cases like this although it takes a bit of work to fine-tune the formatting and customize the content for each output format.
>If one is going to write LaTeX code anyway, it seems easier and cleaner to use LaTeX all the way, move all the boilerplate along with the personal template to say, a file named preamble.tex, and \input{preamble.tex} in the documents.
Not sure why you think it has to be that way. I author LaTeX documents using org mode. Org mode handles most of the boilerplate, and I can still put pretty much any custom LaTeX within the org document, wherever I want it (this includes \newcommand, etc). I lose nothing by going to org mode, and I gain much in terms of reduced boilerplate.
Yup. I’ve got a pandoc template for doing org-latex-pdf conversion, as well as some org templates for common documents that my clients need. Hack away on the document in org (which I’m probably going to be doing anyway, since the rest of my life is in there too), and then when it’s ready to hand off, turn it into a PDF using a shell script.
My absolute favourite moment with that flow was a client who wanted one as a docx instead of a PDF. Pandoc obliged and they commented that I must have spent a lot of time reformatting things for them :)
That's a good question! The flow started out as markdown->latex->pdf via pandoc, and then when I got back into Org, it just slid right into that workflow to replace Markdown.
It isn't clear to me whether you are saying that Pandoc is necessary or if you are saying that Pandoc is unnecessary and LaTeX alone is sufficient for all purposes.
I think your parent comment was saying that LaTeX alone is sufficient. You also seem to be saying that LaTeX alone is sufficient while using Org mode. Would you please clarify if I am interpreting your comment correctly or not?
>It isn't clear to me whether you are saying that Pandoc is necessary or if you are saying that Pandoc is unnecessary and LaTeX alone is sufficient for all purposes.
I'm not saying either. The parent said it's easier and cleaner to use LaTeX all the way. I was pointing out that it is easier to write in a format like Org mode and export to LaTeX (whether via Pandoc or Org mode's built-in exporter).
Of course LaTeX is "sufficient". It is also, IMO, painful.
I wrote my dissertation using Pandoc. It might seem that the LaTeX boilerplate is minimal, but Markdown is even more minimal, and it preempts the urge to fuss with your layout. Writing in Markdown means that you can wave your hand at the document and say, "It's a draft, I'll fix the formatting once I'm sure I even want this material." Afterwards, fixing the layout is really easy because you can drop raw LaTeX in wherever you need to, and you haven't wasted countless hours laying out a float you later end up cutting.
I haven't used LaTeX for years. Is it still hell to make tables? And is it still very difficult to use templates to generate good-looking documents that don't look like an academic paper?
Yes! It's great to be able to put LaTeX-formatted equations directly into your pandoc-flavored markdown source file.
Incidentally, I really like the thoughtful syntax additions Pandoc makes over olde Markdown (eg., tables, definition lists, and span & div syntax as well). Such a great all-around doc tool.
Not OP, but I used `+citations` and `pandoc-citeproc` along with a bibtex file that I managed by hand for https://bernsteinbear.com/dat-paper/ (a small senior project paper). It worked pretty well for me.
Add `bibliography: path/to/library.bib` (and optionally specify a `csl` file for bibliography formatting; I like econometrica) to the YAML front matter. Insert citations with `@bibcitekey`. Compile with `--filter pandoc-citeproc`.
It was a couple of years ago that I wrote my dissertation using Pandoc, so things may have changed. At the time, I started out using pandoc-citeproc with my BibTeX database, but eventually I needed more control over formatting and I switched to writing \cite everywhere. Even with hundreds of references, it only took an afternoon, so I'm happy I did it the way I did. My approach with Pandoc is to use it until you have to invest LaTeX-level effort into making it do what you want. At that point, swapping in LaTeX is rarely painful. Often you can get away with editing Pandoc's generated LaTeX and pasting it back in to your source.
You can control the formatting pandoc-citeproc (which is now built in to pandoc) produces with a CSL file. That's great if your institution provides one, otherwise... you'll have to learn CSL ;-/
> if the publisher 'needs' a Word file, you are one click away from providing it
Once the work has moved into a Word file, isn't that where it stays? Editors and publishers often make heavy use of features like track changes and notes. Doesn't pandoc lose that information?
It does. I think the assumption here is that the author is the only contributor to the document. Exporting into a Word doc would serve the same function as exporting to a .pdf, others could read it and even mark it up, but the author would have to make the noted changes in their original plain text document themselves.
I tried and it didn't work for me. Pandoc's conversion functionality is good but unfortunately also fails very often, at least in my experience. I suppose with custom templates and a lot of trickery I could get it working for the kind of papers I write, but I've found it easier to convert LaTeX to Word manually when needed - which is a pain in the ass, too, of course.
I have to put in a word for Racket's Scribble. Programmatically creating documents is powerful, and this system makes it simple. You can also basically use it as a "Markup-less" system.
Scribble Code Example:
#lang scribble/base

@title{On the Cookie-Eating Habits of Mice}

If you give a mouse a cookie, he's going to ask for a
glass of milk.

@section{The Consequences of Milk}

That ``squeak'' was the mouse asking for milk. Let's
suppose that you give him some in a big glass.

He's a small mouse. The glass is too big---way too
big. So, he'll probably ask you for a straw. You might as
well give it to him.

@section{Not the Last Straw}

For now, to handle the milk moustache, it's enough to give
him a napkin. But it doesn't end there... oh, no.
Scribble -
Scribble is a collection of tools for creating prose documents—papers, books, library documentation, etc.—in HTML or PDF (via Latex) form. More generally, Scribble helps you write programs that are rich in textual content, whether the content is prose to be typeset or any other form of text to be generated programmatically. - https://docs.racket-lang.org/scribble/
Some languages based on Scribble
Skribilo -
Skribilo is a free document production tool that takes a structured document representation as its input and renders that document in a variety of output formats: HTML and Info for on-line browsing, and Lout and LaTeX for high-quality hard copies.
The input document can use Skribilo's markup language to provide information about the document's structure, which is similar to HTML or LaTeX and does not require expertise. Alternatively, it can use a simpler, “markup-less” format that borrows from Emacs' outline mode and from other conventions used in emails, Usenet and text. https://www.nongnu.org/skribilo/
Pollen -
Pollen is a publishing system built on top of Scribble and Racket. So far, I’ve optimized Pollen for web-based books, because that’s mainly what I use it for. But it can be used for small projects too, and non-webby things like PDF.
As a publishing system, Pollen includes:
A programming language. The Pollen language is a variant of Scribble, with specific dialects tailored to different kinds of source files. You don’t need to use the programming features to do useful work, but they’re available when you need them.
A set of tools & libraries. Pollen can produce output in any format, but it’s especially useful for markup-style formats like XML and HTML.
A development environment. Pollen works with the DrRacket IDE. It also includes a project web server so you can dynamically preview and revise your publication. http://docs.racket-lang.org/pollen/Backstory.html
They are domain-specific languages that excel at outputting awesome HTML and PDF. They aren't really markup; they are a macro system built on top of a full Lisp (Racket). It is easier and much more powerful than anything I have seen with Pandoc and LaTeX (I still use LaTeX for specific targets, but not for general papers anymore).
Racket has the best documentation period and it is because the documentation
Finally, a great feature, that hasn't been mentioned here, is pandoc filters. Basically, pandoc provides a way for scripts (in any programming language) to hook into the transformation pipeline and modify the document AST (similar to the HTML DOM) in-between the reading and writing steps. See http://pandoc.org/filters.html
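For anyone curious what that looks like in practice, here is a minimal sketch of the core of a filter in plain Python, standard library only. It assumes the JSON form of the AST that pandoc produces with `-t json`, where each element is an object with a `t` (type) key and usually a `c` (contents) key. The names `walk`, `shout` and `apply_filter` are made up for illustration, and the sample document is hand-written rather than real pandoc output.

```python
import json


def walk(node, action):
    """Rebuild a pandoc JSON AST, giving `action` a chance to replace each element."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:  # AST elements carry their type under the "t" key
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node  # plain strings and numbers are copied unchanged


def shout(element):
    """Hypothetical action: upper-case the text of every Str element."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None  # anything else is left alone, and walk() descends into it


def apply_filter(json_text):
    """Take the JSON pandoc writes, return the JSON pandoc should read back."""
    return json.dumps(walk(json.loads(json_text), shout))


# A hand-written stand-in for `pandoc -t json` output of the text "hello world".
sample = json.dumps({
    "pandoc-api-version": [1, 17],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "hello"}, {"t": "Space"}, {"t": "Str", "c": "world"},
    ]}],
})
print(apply_filter(sample))
```

To use something like this for real, have the script print `apply_filter(sys.stdin.read())` and hand it to pandoc with `--filter`; pandoc then takes care of the JSON conversion on both sides.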
Every time I see a project using Google Groups, I think it is already dead. Happily, yours seems to be used quite often. At least you can search it even years later, unlike an IRC or Slack channel.
IRC channels and mailing list are excellent for informal questioning about a project. You can search for guidance, see if a feature would be well received, and receive a green light before starting to implement something.
The other day I thought about contributing to Yarn, the Javascript package manager, but the only way I found to communicate with the developers was through issues on GitHub. Since I didn't know if the feature I wanted would be well received, I just quit.
i'm not the parent, but that is the main reason i try to abstain from posting on public forums unless it's under a pseudonym, which my github account isn't.
I'm not trying to say that my anonymity is guaranteed with irc, it's just unlikely that future employers and the like would link it to me.
Have a look at Firefox Multi-Account Containers. You can open a tab that has a different color, and it uses a different cookie database. Very useful, because you can create an extra Github account and quickly switch between those accounts.
Hey, just a one-time contributor here (fixed a small bug in the wikitext parser), I have to say that the community is really, really great! I had never done any haskell before, but with just a little guidance from the IRC channel (#pandoc on freenode), I was put on the right track and submitted my small PR, which was merged quickly.
Overall great experience. Thanks for the great tool :).
Probably. I'm not too familiar with leanpub, but seems like they're actually using pandoc to import docx.[0] And with pandoc you can also export to epub, pdf, docx, indesign, etc.
My favorite pandoc hack is using it to convert word docs into markdown which can then be diffed similar to source code. Works great for legal redlining.
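A rough sketch of that workflow in Python, in case it helps someone. It assumes the `pandoc` executable is on the PATH; the function names and the file names in the usage note are hypothetical. `--wrap=none` is there so each paragraph stays on one line, which tends to make the diff read paragraph by paragraph.

```python
import difflib
import subprocess


def docx_to_markdown(path):
    """Convert a .docx file to Markdown text by calling the pandoc CLI."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "markdown", "--wrap=none", path],
        check=True,           # raise if pandoc exits with an error
        capture_output=True,  # collect the Markdown from stdout
        text=True,
    )
    return result.stdout


def redline(old_markdown, new_markdown, old_name="old.docx", new_name="new.docx"):
    """Return a unified diff between two Markdown renderings as one string."""
    diff = difflib.unified_diff(
        old_markdown.splitlines(keepends=True),
        new_markdown.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(diff)
```

Usage would be something like `print(redline(docx_to_markdown("contract_v1.docx"), docx_to_markdown("contract_v2.docx")))`. Writing the two Markdown files to disk and running your usual diff tool on them works just as well.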
Can you diff two Word docs with Word? Afaik you can only hit the "track changes" button, which doesn't help if you got a new version of a document from someone else.
Setting aside the hyperbole, I believe it would be a conceptual nightmare to start defining concat of two docs. For simple cases, feel free to make copy pasta :-)
> I believe it would be a conceptual nightmare to start defining concat of two docs.
Just append the pages of document 2 to the end of document 1. Then the user can decide whether to remove the page break introduced by it. I have this in my env for doing this with PDF:
$ cat =concatpdf
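#!/bin/bash
# Usage: concatpdf to out.pdf in.pdf in2.pdf ...
if [ "$1" = "to" ] && [ "$#" -ge 3 ]; then
    shift
    out=$1   # first argument after "to" is the output file
    shift
    # Ghostscript writes all remaining input PDFs, in order, into "$out"
    gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
else
    echo "Usage: concatpdf to out.pdf in.pdf in2.pdf in3.pdf" >&2
    exit 1
fi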
The peculiar syntax with "to" ensures that I do not invoke it incorrectly.
Classic Word workflow: I make a document and send it to my boss; he makes some changes but suggests further research etc. and sends me his altered version, but I stupidly work on my own version. Making this right takes like four clicks.
Syncdocs [0] is also pretty good for merging and tracking changes between Word and Google Docs. It also has a feature with real-time collaboration between Google Docs and Word.
It works if you don't change formatting, don't edit the same section, don't rearrange chapters and don't have lots of changes or a large document. You do any of that, and the application crashes.
It's good for a small fix, but not something to rely upon in your main documentation workflow.
Even if you can do it with Word, GUI interactions aren't composable and extensible the way shell commands are, so you're limited to the features the GUI designer thought of.
>GUI interactions aren't composable and extensible the way shell commands are
They might not be in word, but they absolutely can be, and in fact are a superset of CLI interactions (since a GUI interaction step in e.g. Automator can invoke any shell command).
>so you're limited to the features the GUI designer thought of
And in the traditional shell pipeline (that is, not Powershell) I'm limited to working on dumb streams from one command to another.
Aside from being multi-platform, BC has things like CSV compare, and its marking of unimportant changes is fairly robust. I personally prefer open source, but BC is certainly one of the few that I didn't mind paying for.
Is this underlining, and not redlining as defined in financial services? (redlining: differential pricing based on demographic makeup of a zip code or neighborhood)
I built a pipeline to convert a Markdown file to publishing-ready files for ebooks, Kindle and paperback for my novel; the whole thing is described here: http://www.gabrielgambetta.com/tgl_open_source.html
My website itself is static, generated from a bunch of Markdown files, some HTML templates, and a bit of postprocessing. But most of the work is done by Pandoc.
"Soon the structure underlying The Da Vinci Code and Angels and Demons and The Lost Symbol was laying bare before my eyes. I could see why the stories worked.
I had reverse-engineered Dan Brown."
Could you talk a little more in depth about what Dan Brown's pattern/structure is?
I get that request a lot, so I'll have to write something about it :)
All I can offer right now is my raw notes, which are in Spanish. This is a structural analysis of Angels and Demons, The da Vinci Code, and The Lost Symbol: https://imgur.com/bX6ByJA
This is a one-page treatment of the three books, with the "blanks" filled appropriately for each: https://imgur.com/LlDVUKn
I doubt any of this is groundbreaking. Story structure is a widely studied topic (and one that I find fascinating). But it seems like Dan Brown uses a very well defined, customised version of this, that makes for engaging, fast-paced books.
I sort of proved (for myself, at least) that this works, by writing a novel whose structure was originally based on this pattern (although it later diverged a bit), and which causes the expected effect - a couple of readers have read it in a single sitting :)
It appears that Pandoc generates PDF documents via LaTeX. One problem with this is that, as far as I can tell, LaTeX can't generate tagged PDFs. This is an accessibility problem. Granted, for documents that are heavy on math and/or graphics, the point is probably moot. But many technical documents that are distributed as PDFs would benefit from being tagged.
Luckily, LibreOffice can produce tagged PDFs. And unoconv is a convenient utility for doing this from the command line. So you can use pandoc to convert to a format that LibreOffice can consume, then issue a command like this:
Pandoc can convert into ConTeXt which can produce PDF/A (tagging included) easily.
Why this can't be done in one command, like with xelatex, wkhtmltopdf and whatever else is supported, I don't know.
Many programs can be used to create PDFs but the quality of output isn't always the same.
Pandoc's creator, John MacFarlane, is also the lead guy on CommonMark[1].
There are a small number of corner cases that need to be spec'd out before CommonMark can declare a v1.0 release[2]. If you have the skills for this kind of thing, please weigh in!
In a similar vein, I use pandoc to convert markdown pages to man pages, and write new/add notes to manpages. I think it's definitely easier than actually writing groff files.
I find it easier to write man pages directly. Admittedly, I write mdoc (not the ancient "man" macros), which has been around only since the 80s. It's easier for me to remember the semantics ("Is this a flag/command/function?") than the correct traditional markup ("Should this be bold/italic/nothing?").
I sometimes use pandoc to clean up my markdown-formatted documents, especially given its abilities to "wrap" text and add indentation-style whitespace that makes plain-text documents look nearly suitable for publishing as-is (almost kinda like RFC docs but without header/footer cruft).
There are a few things (in latest version, 2.2.3.2) that don't really survive round-trip from markdown back to markdown:
- reference-style links (e.g. `[foo][f]`). They are converted to inline links e.g. `[foo](http://...)`.
- setext vs hashmark headers. `foo\n=====` will get converted to `# foo`.
- markdown allows for forced-linebreak <br>s to be added with two trailing blank spaces at the end of a line. Pandoc escapes these with a trailing `\` at the end of the line.
These are only occasional nuisances, but overall the documents (at least in my experience) are not butchered.
I also occasionally go from markdown to docx for the purposes of uploading to google-docs and copy/pasting large sections into other docs. This is the only markdown-to-google-docs workflow I've found that works to preserve formatting. It's never really butchered anything, except a few times the syntax-highlighting for code-blocks gets confused and keywords get the wrong colors.
I "love" how many comments are one person praising pandoc for helping them in some workflow, and then commenters ripping into them for not using some other tool. I wonder if there's a corollary to some internet rule that the more generally useful a tool is, the more detractors will push for other tools to be used? It would help explain why programming language discussions get so contentious.
Pandoc is seriously a great tool! I love the way it's designed and have found it useful off and on over the years. Truly marvelous for making information available in any needed format.
I love pandoc. I've been using it intermittently for years to turn my Markdown and org-mode documents into other formats. Just wish it would take Asciidoc as an input format.
Asciidoctor and the other asciidoc tools do the job that I use pandoc for: tables, custom numbering, all the other markdown extensions that one needs to be able to create a highly structured document. With Asciidoc, you don't need md extensions. It's all in there.
I mainly use Asciidoc for two reasons. 1) Ability to include external code snippets. This is not possible in pandoc without installing the pandoc-include-code filter, which doesn't have Windows binaries. I am on Windows. 2) Tables. Asciidoc has powerful support for tables. You can create tables that include rowspan and colspan among other features. You can even specify an external CSV file as a table.
I tried creating a workflow from Asciidoc through Pandoc to MS Word but that didn't work so well. Tables being the biggest issue.
I used pandoc to format my book [0]. Not everything worked perfectly, but I'm pretty happy with how everything turned out (especially the print version).
It was a little work to set up the workflow with scripts etc, but being able to write the book in markdown and still having full control over the design was definitely worth it.
I used a similar workflow. The CSS is for the EPUB and maps to the html elements supported. But if you get too fancy cross device support could get hairy.
Maybe I used an older version but my attempts to use pandoc usually resulted in the document being butchered because the internal representation was not as expressive as the source or target formats.
You could also just download one of the packages from the "Installing" page on the Pandoc web site, which has prebuilt binaries for Windows, macOS, and Linux. Installing a whole Docker image to do this seems like it might be overkill for a lot of uses.
Yet another pandoc user here. I built a blog engine using Pandoc as the core. Code available here : https://github.com/subinsebastien/kyll And the website built using the blog engine is available here : http://xtel.in/
I tried to use pandoc a while ago to convert the latex-sources of arxiv.org documents to epub, since those are often much more comfortable to read on small devices than pdfs.
The problem I had was that latex was turned into images, but changing the font-size of the reader did not change the size of the images, making the text readable, but the maths barely readable.
This is something I would love to see happen though.
You can add some CSS to the generated EPUB to change that. But if your EPub reader supports MathML, you can do that with pandoc. See http://pandoc.org/epub.html#math
Emacs and Emacs Org mode, and then you can export to HTML5, LaTeX/PDF, etc. My notes, calendar, todo, data science workbooks, etc. all live in Emacs Org mode. I especially love the ability to call programs on the fly in my data science workbooks, so I can call R, Julia, Python, and bash all in one place.
Asking what editor HN uses is a pretty loaded question, but it looks like there's a couple neo/vim plugins for live markdown preview. This one[0] says it can use pandoc as a backend. I'm pretty sure that emacs offers something similar, and org-mode may be worth consideration all on its own. I hear spacemacs and spaceneovim are nice.
I tried Caret and loved it but had to uninstall because of the huge font size on equation renders in a math-heavy document. Is there a way to fix that? I tried to look but they don't have much documentation yet.
> We provide a binary package for amd64 architecture on the download page. This provides both pandoc and pandoc-citeproc. The executables are statically linked and have no dynamic dependencies or dependencies on external data files.
I wish I'd known about this sooner. I don't spend much time with text documents outside the web, but when I do, pandoc handles the disparate formats admirably. The only inconvenience is when I update my system, there's guaranteed to be a huge pile of Haskell libraries to download.
Do any of your tools use long options (prefixed with a double dash)? If so, make sure you disable the "smart" extension, otherwise you might end up with en dashes.
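A small sketch of what disabling it can look like when you build the command from a script. It assumes pandoc 2.0 or later, where `smart` is an extension of the input format and is switched off by appending `-smart` to the format name. The helper name `pandoc_command` and the file name are made up.

```python
def pandoc_command(source, target="html", smart=False):
    """Build a pandoc argument list; `smart=False` switches the extension off."""
    # Extensions are toggled by appending +name or -name to the format name.
    reader = "markdown" if smart else "markdown-smart"
    return ["pandoc", "--from", reader, "--to", target, "--standalone", source]


print(" ".join(pandoc_command("mytool.md")))
```

Putting the options in code spans or code blocks in the source also sidesteps the problem, since smart punctuation is not applied to code.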
Maybe he edited his post, but it now ends with "... Let's just say I get a bit lost when it isn't available. " suggesting that he actually does use pandoc for stated tasks.
It sounded like a rhetorical question, and based on the context that you use it for a lot of things, I sensed a bit of excitement there too, and thus the exclamation, like "Isn't that awesome‽".
For novels, I tend to just use Markdown, as kerning will be done in CSS.
For academics, I use LaTeX and Asciidoc together, but some paragraphs might be inserted in various other formats - whatever is easier. The build tool doesn't care what the format is, it'll take any input pandoc accepts.
i have been using catdoc and pdftotext to convert doc and pdf files, respectively. nice to see that there's an alternative that also includes a library, will be checking this out.
a couple questions i have, seems firstly that old school .doc files are not supported, docx yes. unfortunately i still get a lot of docs in .doc format which seems to be microsoft's proprietary format (docx seems to be more open).
my second question is whether or not there's a filter for golang, most of my development is in golang, so i either need to call your cli as a forked process or best to have a native library. i have never worked with haskell so not sure if i can import a haskell library from golang directly. i imagine there'd need to be a golang wrapper around the cli.
You could use Libreoffice's command line interface to convert from .doc to a more manageable format.
lowriter --convert-to odt some-document.doc
odt is not the only supported target, but doc --libreoffice--> odt --pandoc--> plain seems to give better results than e.g. doc --libreoffice--> txt or doc --libreoffice--> docx --pandoc--> plain.
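Since you mentioned calling a CLI from your own code, here is a rough sketch of that two-step chain. It is written in Python for brevity, but the same two commands can be spawned from Go. It assumes `lowriter` and `pandoc` are both on the PATH; the function names are hypothetical, and `--outdir` is used so the intermediate .odt lands next to the .doc.

```python
import pathlib
import subprocess


def doc_to_plain_commands(doc_path):
    """Build the two commands for doc --libreoffice--> odt --pandoc--> plain."""
    doc = pathlib.Path(doc_path)
    odt = doc.with_suffix(".odt")  # LibreOffice keeps the base name
    to_odt = ["lowriter", "--convert-to", "odt", "--outdir", str(doc.parent), str(doc)]
    to_plain = ["pandoc", "--from", "odt", "--to", "plain", str(odt)]
    return to_odt, to_plain


def doc_to_plain(doc_path):
    """Run both steps and return the plain text (needs both tools installed)."""
    to_odt, to_plain = doc_to_plain_commands(doc_path)
    subprocess.run(to_odt, check=True)
    result = subprocess.run(to_plain, check=True, capture_output=True, text=True)
    return result.stdout
```

For a full-text index you would call `doc_to_plain("some-document.doc")` per file and feed the returned string to the indexer.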
if that's the case, i'll stick with catdoc. my use case is to create a full text search index of the content, trading libre office cli for catdoc, i'd rather just stick with catdoc, but thanks.
Maybe this ( http://tyorex.com/iWorkConverter/ ) and then a batch doc->odt converter? (Though for the sake of sanity, I recommend avoiding word processors where possible.)
Pandoc is great! I use pandoc for all kinds of formal writing (conversion to PDF via LaTeX). We also run pandoc in production to produce customer-facing PDFs.
I believe the best you could do is extract the raw OCR'd text from the document (with some other tool). No formatting or text hierarchy is preserved in the OCR process, only the physical locations and size of the text on the page. From text, you can convert to Markdown or whatever and then manually edit to give the OCR text some structure.
I write any document that doesn't need extensive custom typesetting (which is 90% of stuff) in org-mode and then use pandoc to convert it to "normal people" formats at the end. I have made a basic template for MS Word that looks pretty good.
The core problem with Pandoc is that the internal document representation is limited to its particular flavor of Markdown. Any feature PD-MD doesn't support is ignored or loses semantics. You can see this in the poor ReST support (try converting captioned figures). It would be useful to rearchitect it with a Docbook-style semantics internally since they are more comprehensive than Markdown.