Pandoc – A universal document converter (pandoc.org)
756 points by johnsonjo on Oct 24, 2020 | 168 comments



I can't express enough my gratitude on a daily basis for what pandoc enables me to do. I made a simple Emacs script that I use to output files, and I use it constantly for LaTeX PDFs, HTML output, RevealJS slides, and odt/docx/etc. All with bibliographies from Zotero in zillions of formats. As a professor and journalist, I need to use a wide range of output formats, but as a human being I like to work in clean, simple text files that will never be obsolete. Pandoc, way more than any tool, gives me the freedom to work in any writing environment I like and keep that distinct from whatever weird formatting preferences a journal, magazine, or publisher might have. I've written two books with Markdown and a huge variety of articles. I am so thankful for the care with which it has been built and maintained. Thank you.


One thing I love about pandoc that I don't see mentioned here is the ability to apply filters to transform documents mid-conversion.

I'm using Pandoc to write my PhD thesis at the moment, from Markdown source, using certain filters to "augment" what Markdown can do. Examples:

https://github.com/LaurentRDC/pandoc-plot

https://github.com/lierdakil/pandoc-crossref

More info here: https://pandoc.org/filters.html


Yeah, filters are great. Writing filters is easy: Pandoc basically converts the input document into a universal AST (json), and a filter is just any program that takes this json as an input and outputs a modified json AST.
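
To make that concrete, here is a minimal sketch of such a filter in plain Python, with no helper library. It assumes pandoc's JSON layout, where each element is an object with a type tag `"t"` and optional contents `"c"`; the names `walk`, `shout`, `run_filter` and `main` are made up for illustration.

```python
import json
import sys


def walk(node, action):
    """Recursively apply `action` to every AST element (a dict with a "t" key)."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            node = action(node)
        return {key: walk(value, action) for key, value in node.items()}
    return node


def shout(element):
    """Uppercase the text of every Str element; leave other elements unchanged."""
    if element.get("t") == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return element


def run_filter(doc):
    """Return a modified copy of the whole document AST."""
    return walk(doc, shout)


def main():
    """Filter protocol: JSON AST in on stdin, modified JSON AST out on stdout."""
    json.dump(run_filter(json.load(sys.stdin)), sys.stdout)

# To use this as a real filter, end the script with a call to main().
```

Saved as a script that ends with a call to `main()`, this could be run with something like `pandoc --filter ./shout.py input.md -o output.html` (the file names are placeholders). Libraries such as panflute wrap this kind of traversal in a nicer API.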

I wrote a filter that automatically converts URL citations in markdown to "real" citations in any style you want - very useful for writing papers without fighting with bibtex and managing bibliographies manually: https://github.com/phiresky/pandoc-url2cite


What a coincidence! I am also writing my PhD thesis with pandoc and filters, but I use panflute for the latter: http://scorreia.com/software/panflute/


I'm the author of panflute, and--in fact--wrote my PhD thesis in pandoc+panflute!


I'll be starting a PhD soon and would love to use a pandoc-based workflow (with MD or another format) for the bulk of my writing.

How did you all structure the commenting on your writing? I find converting to odt/doc before sending, managing all the exported versions with comments etc. becomes quite tedious. But I'm a bit reluctant to force my supervisors to use e.g. git+criticmarkup[1]. I would love to hear your experiences!

[1]: http://criticmarkup.com


A few things:

- In my case, my supervisors mostly had handwritten notes, which rendered that point a bit moot. However, when I sent the almost-complete draft to a professional copy editor, it was indeed a pain to add the comments. Whether they came handwritten and scanned, as Acrobat comments, or as Word comments, they had to be manually entered into the markdown file.

- Everything else worked relatively better. It was a bit tedious to type loooong pandoc commands "pandoc --filter=... etc etc" so I recently coded pandocmk [1] to make my life easier. It's not super well documented (but it's a quite short script, so readable), but the idea is that you type the command line options as metadata at the top.

[1] https://github.com/sergiocorreia/pandocmk


I've used that to process tables, essentially using markdown as a commented CSV format. The only nuisance is that a table can't yet have attributes — https://github.com/jgm/pandoc/issues/6317 — the workaround being a pre-filter to copy them from a surrounding div.

I've also toyed with using it to process code blocks, as a dead-simple literate programming tool.
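
A rough sketch of what that code-block processing might look like, assuming the document has first been dumped with `pandoc -t json` and that a `CodeBlock` element carries `[[identifier, classes, key-value pairs], text]` as its contents. The helper names `code_blocks` and `tangle` are hypothetical.

```python
def code_blocks(node):
    """Yield (classes, text) for each CodeBlock found anywhere in the AST."""
    if isinstance(node, list):
        for item in node:
            yield from code_blocks(item)
    elif isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            (_identifier, classes, _keyvals), text = node["c"]
            yield classes, text
        else:
            for value in node.values():
                yield from code_blocks(value)


def tangle(doc, language):
    """Join the text of all code blocks tagged with `language`, in document order."""
    return "\n\n".join(
        text for classes, text in code_blocks(doc) if language in classes
    )
```

Feeding it the parsed JSON (for example via `json.load`) and writing the result to a source file gives a very small "tangle" step; anything fancier, such as named chunks, would need more than this.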


Pandoc tables can have attributes now! see https://github.com/jgm/pandoc-types/blob/master/src/Text/Pan...


This is going on a decade old, but I wrote my bachelor's thesis in pandoc. It made an otherwise painful process very straightforward.


I wrote my MSc in markdown with Pandoc. Loved it.


my uni requires MS Word...


That is the point of pandoc.

You can write in markdown and then convert it to word for your uni.


What do you do when your advisor sends you back changes using the Word track feature? Can Pandoc apply the diff?


yes, see the `--track-changes` option: https://pandoc.org/MANUAL.html#option--track-changes
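
For illustration, a small Python wrapper around that option might look like the sketch below. The function names and file paths are invented; the only pandoc-specific assumption is that `--track-changes` takes `accept`, `reject`, or `all` when reading docx, as the manual describes.

```python
import subprocess

TRACK_CHANGES_MODES = ("accept", "reject", "all")


def docx_to_markdown_cmd(docx_path, mode="accept"):
    """Build the pandoc argument list for reading a docx with tracked changes."""
    if mode not in TRACK_CHANGES_MODES:
        raise ValueError(f"mode must be one of {TRACK_CHANGES_MODES}, got {mode!r}")
    return [
        "pandoc",
        "--from=docx",
        "--to=markdown",
        f"--track-changes={mode}",
        docx_path,
    ]


def docx_to_markdown(docx_path, mode="accept"):
    """Run pandoc (it must be installed and on PATH) and return the Markdown text."""
    result = subprocess.run(
        docx_to_markdown_cmd(docx_path, mode),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

With `accept` the advisor's edits are applied, `reject` discards them, and `all` keeps insertions, deletions and comments as marked-up spans, which is probably the closest thing to reviewing the diff in the Markdown source.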


For MS Word Pandoc is a blessing: I configured a master file and now pandoc does all the tedious formatting!


It must be pretty slow if you have time to change the settings while it is converting?


Not sure if this is a joke but if you read the linked docs [0] you'll see that the concept of filters is that you the user can write programs (essentially plugin-style) that modify the AST Pandoc generates in order to perform the conversion. But this explanation is ultimately worse than the actual doc page, so I'd recommend just reading that.

[0] https://pandoc.org/filters.html


None of this has anything to do with how much time is taken. The whole thing could run in a millisecond.


> One thing I love about pandoc that I don't see mentioned here is the ability to apply filters to transform documents mid-conversion.

I read that as "mid-conversion" meaning that he can apply filters while the document is being converted


Pandoc applies the filters, not the user.


like using Middleware


fwiw Pandoc's author, John MacFarlane, is also behind these projects that try to unify the Markdown ecosystem:

- Babelmark, a tool to compare how different Markdown parsers interpret the same Markdown input. https://johnmacfarlane.net/babelmark2/

- CommonMark, the first formalized Markdown standard, and now the de-facto Markdown standard. https://commonmark.org/ (He's the first listed member of the team.)

I feel like John is probably the single largest contributor to what Markdown is today, other than perhaps the creator of Markdown. Thank you for your work!


> other than perhaps the creator of Markdown.

The creator of Markdown hasn't touched it in over a decade and yet decided to throw a temper tantrum because CommonMark dared to initially call itself Standard Markdown.


As a software engineer working in a data interoperability role (not that I would claim authority, but pragmatic experience):

I'm not sure of the specifics but personally I prefer formats that don't evolve over time. So not changing a spec for over a decade should not be considered pathological but actually commendable, if the nature of spec is complete enough for its purpose.

I know vanilla Markdown is too limited for some use cases. But that is no reason to "overwrite" it.


> I know vanilla Markdown is too limited for some use cases.

CommonMark does not make Markdown less vanilla -- it's not like GFM or one of the other standards that adds support for tables or other features.

From Jeff Atwood (who is one of the CommonMark creators):

> The goal of CommonMark is not to redefine what Markdown is, or change the syntax, but make it parseable and predictable.

https://talk.commonmark.org/t/what-changed-in-commonmark/15

Here's CommonMark's statement on why it exists:

https://spec.commonmark.org/0.29/#why-is-a-spec-needed-

CommonMark's entire purpose is to fix data interoperability.


> So not changing a spec for over a decade

The problem was there was no _specification_. It was a 'how to use' summary, and each implementation could be (and was) different in subtle edge cases.

The point of CommonMark was to define the specification and stick to it.

I agree with GP, I thought it rather sucky that he objected to using the name.

Various markdowns have extension mechanisms, they always have. That's not what the GP was talking about.


The general idea behind your points is sound and correct.

However, the problem is you seem to be generally ignorant of widely known points of knowledge about Markdown.

> not changing a spec for over a decade should not be considered pathological but actually commendable, if the nature of spec is complete enough for its purpose

100% agree in theory, but Markdown's creator never wrote any spec when creating it. Initiatives like CommonMark are efforts to specify an unspecified language, not to evolve or replace any existing spec.


> So not changing a spec for over a decade should not be considered pathological but actually commendable, if the nature of spec is complete enough for its purpose.

The "nature of the spec is complete enough for its purpose" is the part that's not met, though (at least in many people's minds). The Markdown "spec" (either the description written by Gruber or the `Markdown.pl` file) has ambiguities and inconsistent behavior. My understanding is that there were many requests from the community for this to be clarified, but it never was. So I think a decade of inattention is not commendable in this instance. The CommonMark landing page[0] has some more about this issue.

[0] https://commonmark.org/#why


I agree with your characterization. (I didn't always -- I actually advocated at the time for CommonMark to respect Gruber's wishes and create their own branding [1].)

[1] https://talk.commonmark.org/t/the-logo-and-name-should-proba...

Sure, Gruber didn't allow CommonMark to use the Markdown name, but I feel like that's not a super big deal compared to what he did do. The Markdown ecosystem wouldn't exist if Markdown hadn't been created in the first place! I'm not confident someone would have made something like Markdown if Markdown was never created: AsciiDoc and reStructuredText came out before Markdown but have not been as successful.

Gruber's original Markdown spec lacked formality -- and that's where CommonMark eventually filled the gaps -- but I think that Markdown's focus on user experience over technicality was the key to its success over competing formats and WYSIWYG editors (the real competition). By the time CommonMark came around, Markdown had already seen viral adoption; three of CommonMark's creators are from large companies that were already prominently using Markdown.

tl;dr I think the original Markdown spec and CommonMark are both significant contributions in their own right!


I had an interesting conversation with John MacFarlane, the maintainer and author of Pandoc (lovely human being and excellent maintainer), and the subject of day jobs came up. He's a professor of logical philosophy at UC Berkeley which I thought was fascinating. It certainly makes sense given the number of document formats and such that academia deals with.


I love that he calls [1] the incredibly useful tools he built a product of structured procrastination [2].

[1] https://johnmacfarlane.net/tools

[2] http://www.structuredprocrastination.com/


And a great fiddle player!


What is it with amazing professors and musical prowess? My Cryptology professor is also a fiddle player! Ivan Damgård, of the Merkle-Damgård construction.


My pet theory is that they both require perseverance to get to a fun level and once there doing it provides intrinsic reward.


I've noticed many people in STEM are into music


Almost every upper class parent invests in musical education. Learning to play an instrument is both an enjoyable and a discipline-building activity.


That observation doesn't seem right. There are a lot of people in STEM. Most do not come from upper class parents. Many of the children of upper class parents go into non-STEM fields, including law. (Some go into music, which is not STEM.)

My decidedly working class parents insisted we learn to play piano. Most of my relatives had a piano in the house. I think it was a holdover of the days when home entertainment was self-made.

A friend lived in both the Los Angeles and New Orleans areas. He compared the two as: in LA, the parties of rich people have live music. In New Orleans, the parties of poor people have live music.

And Damgård, mentioned earlier, was born in 1957 Denmark, and plays Danish and Nordic folk music. Postwar Denmark was poor. Perhaps this interview (in Danish) explains why he started? https://www.youtube.com/watch?v=AUF_EkN4Z-g

So, 1) is there a significantly higher proportion of people in STEM who are into music than in non-STEM? (and not simply some sort of observational bias), and 2) is the major contributing factor to the high proportion that the parents of those people were upper class? (and not some other factor like STEM fields paying enough so people have free time for hobbies.)


I would add lots of kids in lower income homes are exposed to music being played by family members, peers, school programs, or church groups (examples). It's true that these kids might not be playing Mozart but there is nothing wrong with bluegrass, gospel, or whatever, to instill a love of playing music.

Maybe listen to "Juke Box Hero" or "Coat Of Many Colors" for inspiration on how people from modest backgrounds can have the same fulfilling experiences as wealthy people. (sorry - personal soapbox)


Actually, Music is STEM, in the true and classical sense - it's only our perverted modern view that has severed music from its moorings in mathematics.

It was even part of the quadrivium of medieval education: Arithmetic, Geometry, Astronomy, and MUSIC! A classical education not only taught these subjects, but how they were all inextricably interrelated (or intertwingled, as Ted Nelson famously says...)


STEM is not a classical term. Don't go thinking that because we don't follow ancient Greek philosophy or medieval education philosophy we are somehow perverted.

FWIW, I strongly dislike "STEM" as a term because it makes no sense to me in an educational or philosophical sense. I see it more as an attempt to lower the cost of hiring engineers and scientists by increasing the supply. For example, compare the funding going into getting more programmers and EEs, vs. marine biologists and paleontologists, even though all of them are STEM.

To clarify "no sense to me", I despise Pirsig's "Zen and the Art of Motorcycle Maintenance" because of its insistence on a clean division between romantic and classical views. I view "STEM"'s treatment of the rest of the liberal arts as being similarly incorrect in its dichotomous classification. Eg, mathematics is important for the humanities too.

But it's clear what _def is talking about by "STEM", and there's no need to suggest we or modern culture are following along with a perversion because the conversation isn't aligned with your personal views.


They also provide an excellent break from the kind of thinking required in things like programming IME. I can be exhausted from a day of coding and happily sit down and practice with the piano in a way I couldn't with other intellectual topics like maths.


Pandoc is great at bridging the gap between science-oriented data control needs, and management-oriented reporting needs.

I was on a modeling project that used scripts to generate hundreds of input parameters, embed them in models, run the models, and produce reports. The inputs and outputs shifted a lot over the course of the project, as we came to understand the domain and implications of the work better. At every update, the changes had to be transferred to a Microsoft Word document that went to the project sponsors.

Pandoc made this easy -- we just added scripts to write out the model inputs as Markdown tables, then embed those tables in a larger writeup, also written in Markdown. Pandoc turned it all into a Word document. Thus, the same toolchain that did the actual work, also drove the final report. I really don't think we could have had confidence all the tabular data was right, had it not been automated through Pandoc.
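
A minimal sketch of that "write the inputs out as Markdown tables" step, in Python. This is not the project's actual code: the function name, the column names, and the parameter values are invented, and it assumes pandoc's pipe-table syntax as the target.

```python
def markdown_table(rows, columns):
    """Render a list of dicts as a pipe table that pandoc's Markdown reader accepts."""

    def cell(value):
        # Escape literal pipes so they are not read as column separators.
        return str(value).replace("|", "\\|")

    header = "| " + " | ".join(cell(c) for c in columns) + " |"
    rule = "|" + "|".join("---" for _ in columns) + "|"
    body = [
        "| " + " | ".join(cell(row[c]) for c in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, rule, *body])


# Hypothetical model inputs, just to show the shape of the data.
model_inputs = [
    {"parameter": "alpha", "value": 0.5},
    {"parameter": "beta", "value": 2},
]
table_text = markdown_table(model_inputs, ["parameter", "value"])
```

The resulting text can be pasted or included into the larger Markdown writeup and converted with something like `pandoc report.md -o report.docx`, so the numbers in the Word document always come from the same run that produced the model.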


I would like to start using Pandoc in my commercial software [1] to help convert documents into different formats, but the GPL license makes that difficult (or at least confusing.) I think it's generally fine to call a GPL program from a SaaS application. I believe it's fine as long as it is providing an optional or tangential feature, and your application can continue to perform the core functions when that GPL tool is not present. AGPL licenses go a step further and prevent access to any AGPL commands over the network, so that's when a commercial license is always required.

Am I allowed to distribute GPL programs contained inside a Docker image for on-premise installations? Do I just need to provide proper credit and a link to the source code?

Or is there a commercial license available for Pandoc? (I couldn't find anything.)

[1] https://docspring.com

UPDATE: I've decided to evaluate pandoc and see if it might be useful for supporting Markdown and Word formats, etc. If it is, then I'll reach out to John MacFarlane and ask about a commercial license (or just something in writing), perhaps in exchange for sponsorship on GitHub.


As a lawyer -- If you are actively running a commercial enterprise, which you seem to be, these are questions for an attorney in the field. Not me, unfortunately, licenses were never in my area of practice. But you probably want to take the time and bit of cash to make sure you're not potentially opening yourself up to litigation.


It should not be a problem if GPL code is called from a separate app and its output is used. Of course, it's best to consult a lawyer.

Also, what in the GPL makes it difficult to use in commercial software? You are even free to sell it, after all.

Also, using AGPL software doesn't require a commercial license; where does that come from?!


> I've decided to evaluate pandoc and see if it might be useful for supporting Markdown and Word formats, etc. If it is, then I'll reach out to John MacFarlane and ask about a commercial license (or just something in writing), perhaps in exchange for sponsorship on GitHub.

Better to just use a GPL compatible distribution method: pandoc has 349 contributors; none of them signed a copyright assignment, so you'd need permission from each and every contributor to use the software in a way not permitted by the GPL.

If you need a freelancer with deep pandoc knowledge, please do reach out. I'm happy to help.


You seem to be focused on the intersection of GPL and AGPL code with commercial software which is actually not really relevant other than that you may care more about the legalities under those circumstances. For the GPL, the question is whether your work links in the GPL code. If it merely executes another program in userspace that shouldn't be an issue but you should consult a lawyer if you have serious questions.


I'm a long time (7 years) contributor to pandoc. Other frequent contributors often drop by here as well. Happy to answer questions, ask us anything.


I speak (and write) a right-to-left language.

I'm not a pandoc user (so far); and have struggled many times in the past with bugs and lacking features in LibreOffice and LaTeX regarding right-to-left text layout and language-specific issues.

My question: How "trustworthy" is pandoc in handling right-to-left content and side-stepping the minefield of target format issues involving such content? Is this subject getting explicit attention from maintainers?


Pandoc should be usable for users of all languages and scripts. It is possible to define the document's language via the `lang` metadata field; the `dir` attribute (`ltr` or `rtl`) can be set for individual text elements.

Core contributors are westerners or Russian (US, UK, Switzerland, Germany, Russia), and we rely heavily on user reports to improve non-LTR scripts and languages. But the goal is to make pandoc work flawlessly for everyone.


I have used Xe(La)TeX and the bidi package for mixed rtl and ltr script documents. I don't recall any problems with that. There's also a polyglossia package, but I have less experience with that.



There seem to be not so many haskell applications that succeed to the point where they are of general use, as in not simply useful to programmers doing programming (probably in Haskell). At least this is a frequent observation about Haskell and one I've made myself. https://news.ycombinator.com/item?id=11907839 Obviously around here the ideal is we keep language wars/boosterism/accusations of being a virus etc. out of it (Hey I /like/ Haskell, I've just found it useful for my brain rather than being especially useful for performing data transformations that come my way).

/If/ you accept that premise, why do you think Pandoc has been so very successful where perhaps other applications written in haskell have not? The Problem domain (something about writing parsers)? The contributors? The culture? Something else entirely?

Of course if you reject that premise I'd also be interested to hear your thoughts on it in as much detail as you care to provide.

Cheers.


First, let me challenge the premise: the list of popular Haskell projects on GitHub is far longer than you might expect. Pandoc isn't even the most popular one: https://github.com/search?q=language%3Ahaskell+stars%3A%3E10...

But there still may be some truth to the claim. A simple fact is that smaller mind share -> fewer programs -> less chance for extremely successful projects. From personal experience: it took me three tries and multiple months to get comfortable enough with Haskell to the point that I was able to write my first contribution to pandoc (the org-mode parser), despite having dabbled in functional-style Lisp for years before that. But Haskell, as used by pandoc, isn't difficult. In fact, I often find it easier to use Haskell, thanks to its excellent type system. It's just very different and requires a bit more investment up front, with huge benefits lurking down the road.

Data to support my claim that Haskell is actually easy to use: over 300 people have contributed to pandoc, with over 100 contributing Haskell code. Many of those contributors have never written any Haskell before, but the type system helped them to find their way.

I talked a bit about the whole topic here: https://youtu.be/JpNEIpLtCHs


Just to address the premise with the data in the link you provide. Click your link, remove anything that is a compiler, a linter, some other parser of programming languages, a library for use when programming haskell or a programming framework and that list gets very, very dramatically shorter.

I don't think that's entirely fair fwiw, it's github ordered by stars, that will turn up things used by programmers for programming in any language. But either way I don't find the refutation convincing.

I'd love it if the premise was no longer fair. That the data really does not support it. I want monad tutorials, there are thousands. That is no exaggeration. I want Haskell applications useful for something that isn't programming a computer - really not much.

I was kind of hoping you'd say something about the parsing problem domain and why that /seems/ to work particularly well with haskell but other domains not quite so much, at least yet, and whether that can be changed or is simply the nature of statically typed, pure functional programming languages (I really hope not).

It's not "successful" let alone "extremely successful" programs so much as "existent" that is the bar that needs clearing first.

Pandoc is great. Haskell works well for those of you hacking on it. I've used it, liked it and thank you for it! It isn't necessary to have an opinion on the topic at all, of course.


Thank you for the ever-improving org-mode parser. Org-mode is in general difficult since it's a bit of a moving target, so I'm surprised that it's so well supported!


Thanks, comments like yours make my day :)

Not sure if I'll ever find the time, but I'd like to make the org-parser less useful for Emacs users. The idea is to write an org exporter which produces pandoc's AST JSON format; all Emacs Org settings would be respected that way, the detour through pandoc's parser would no longer be necessary, and remaining parser incompatibilities wouldn't matter for users exporting from Emacs through pandoc. Well, some day...


That will be great. Org’s greatest power is also its weakness: coupling with Emacs. I mean it’s great in all aspects except getting other people to use it.

Pandoc makes it possible/bearable to interact with the rest of the world (I’m in the process of moving more things to org).

Being able to export directly to pandoc’s AST JSON will probably make it possible to avoid using other programs to edit content at all! I’ll wait for this day to come; perhaps I’ll even learn enough Elisp to contribute until then. ;)


> There seem to be not so many haskell applications that succeed to the point where they are of general use, as in not simply useful to programmers doing programming (probably in Haskell). At least this is a frequent observation about Haskell and one I've made myself.

Yes, repeatedly, and I'd love to know why you think it matters and what it is indicative of!


Perhaps performance plays a role; transforming documents is usually not a bottleneck (unless you are running some server farm).

Also transforming documents seems like a task well suited to functional languages.


I heard somebody say "Haskell people tend to write libraries, Rust people commandline tools". Pandoc is the exception that proves the rule ;-)


Are there any filters/plugins that could create a good workflow for converting a pdf that is multiple pages of very clear text images? Think of each page having a few printed multiple choice questions. Is there an easy way to get it into a text document?

Some command (or commands) that can be wrapped in a script:

> convert2txtViaOCR.sh -i input.pdf -o output.txt

Thanks.


Could you shoot me an email? I’m always on the lookout for pandoc freelancers.


I want to use it in a commercial product, is it allowed?


I presume you mean a proprietary license. Probably yes, you just have to obey the license. The Linux kernel and git are also GPL. In general, if you're not linking it into your software you're fine, but see the license for details.

Under US law at least, open source software is commercial: https://dwheeler.com/essays/commercial-floss.html


Pandoc is licensed under the GPL version 2 or later. I know of a couple of companies where pandoc is used in proprietary systems server-side. IANAL, so best to consult one for your specific use case.


Pandoc is a tool used daily by those of us who write code notebooks (rmd or jupyter) or are into using markdown for their notes and occasionally need to print said notes. It is hard to overstate how useful Pandoc is for me.

I would bet many people who use Pandoc have no idea they rely on it. I don't think Jupyter or RStudio make a big fuss about it even though they both use it.


I’m a big fan of keeping md documents in source control, then publishing them wherever they need to go in the CI/CD pipeline, and I’ve used pandoc a lot for that.

I always ponder whether it’s the most practically useful Haskell tool ever written.


Either pandoc or shellcheck, for sure. Both of them are sensible choices to use Haskell for


Yes, RStudio uses it and I find it lives up to the title "The Swiss Army Knife of document conversion."


This is great to know. I use markdown for journaling, note taking, and documentation. I don't need to print anything but if I did then I'd probably go the way of markdown to html with custom css - now I will give pandoc a try first.


Probably overkill, but I use Pandoc to generate tailored resumes for roles and jobs I’m interested in.

I keep a list of all my skills, experience and education in a YAML file and have a LaTeX template that I clone when creating a new resume. Then it’s just a matter of replacing the template fields with YAML metadata and running Pandoc.


I have the same set up to generate both my resume and my website using an HTML template. Makes it easy to update one YAML file and update both my CV and my personal website

https://mehalter.com


The man page is a very nice touch! Do you have source in GH or elsewhere about this harness? I am using Restructured text and rst2pdf but this looks so much nicer!


Disregard, I followed your Keybase to your personal git server. Very nice and inspired, I will check it out!


Glad you were able to find it! I should have linked originally. For anyone else interested the repo is at https://git.mehalter.com/mehalter/mehalter.com

It was also fun to make a groff template to output a file you can open with man too lol


I also use pandoc to generate CVs, happy to know I'm not alone :) I don't do anything as sophisticated as you, but my main resume is in markdown so I use it to create a .pdf or word doc and to apply .css styling where appropriate.


You can write filters in Python and several other languages. These let you perform arbitrary computation triggered by tags in your source document, and let you extend Pandoc’s Markdown to include your own custom tags to do anything you can imagine.

Here is an article where I show how to use Panflute, a library that lets you write filters in Python, and how I wrote a set of filters to automate the tedious parts of writing a complex technical manual:

https://lee-phillips.org/panflute-gnuplot/


Pandoc is awesome! One of my favorite use cases is for Orger [0], which I'm using to automatically convert data from different services into org-mode for easier local-first/offline search, navigation etc. Often an API gives you Markdown (e.g. GitHub), and while I could embed a markdown source block in org-mode, with Pandoc I can just convert it and display it in native Org syntax.

[0] https://github.com/karlicoss/orger#readme


Neat. Not quite the same thing but here is a small hack that I use to view pandoc supported formats in emacs:

https://gist.github.com/imarko/ec8f39550662fcd16908b7ec9d100...

Can be changed to use .txt or .md if preferred.


If you want to do single-file conversions with Pandoc without having to install it, try http://markup.rocks/. It’s a compilation of Pandoc into 2.2MB of JavaScript so you can convert documents (and preview their HTML conversion) in your browser as you type. Its source code: https://github.com/osener/markup.rocks.

I most often use http://markup.rocks/ for converting HTML to Markdown and for testing that my reStructuredText syntax is correct when contributing to docs.

Pandoc also has a demo web page for trying it out (https://pandoc.org/try/). The demo supports all of Pandoc’s formats and doesn’t require a large JS download, but it silently truncates inputs to 3,000 characters.


I haven't updated markup.rocks in 5 years, glad to hear it is still useful for others! Reminds me to update Pandoc and switch to https, likely sometime next month. Maybe I can try compiling it to wasm instead of JS this time around.

Let me know if there's anything you'd like to see that would make it more useful for you!


Pandoc is one of the programs that always surprises me with how good it is. Everything I throw at it works perfectly. I write my assignments for class as Markdown or plain text and it seamlessly turns them into a good-looking Word or LaTeX document.

It's also fantastic for converting my class notes from Markdown with LaTeX equations into beautiful PDFs.


Pandoc is a true work of art. Everything about it embodies the Unix philosophy of "Do One Thing and Do It Well".

I've been using Pandoc (and make) daily for over 6 years for all sorts of document writing (letter, report, thesis, design doc, performance review, you name it) and to solve the occasional "interesting" format conversion problem. It's robust, reliable, fast, and a pleasure to use (and script).


If curious see also

a large thread from 2018: https://news.ycombinator.com/item?id=17855104


Always glad to see pandoc get some attention. This tool is probably in my top 5 overall, I barely make it through a day without it.


Huh. What do you use it for on a daily basis?


I'm in college, and my profs send a lot of .docx files. In general I prefer not to start up libreoffice, so I just use a script and mailcap file to view it automatically with pandoc and zathura. I also use it to write for both assignments and personal stuff, though for anything long or with weird formatting I prefer Latex.


Pandoc works great as a high-level wrapper around latex, where you can write the content in highly-readable markdown, while adding embedded latex for more complex stuff. Being able to use BibTex instead of MSWord's god-awful reference system for footnotes was an eye-opener, as was the ability to keep your manuscripts in text-based .md and .tex formats instead of docx, so you can track your revisions with git.


I've been using pandoc for a while now, but did not know that it could handle docx (I should probably read through the manual again!).

Does it handle embedded pictures well?


It handles them pretty OK when converting to PDF, though obviously not when converting to markdown.


Thank you! This has been useful to know, and I now have some amount of fiddling around with pandoc and docx files on my TODO list :)


If you do, I highly recommend looking into using a reference doc. I struggled with the markdown -> docx conversion until I set a few reference docs up to keep a consistent style.
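
A rough sketch of the reference-doc workflow, as Python helpers that only build the pandoc command lines. The function and file names are placeholders; the `--print-default-data-file` and `--reference-doc` options are the ones described in the pandoc manual.

```python
def default_reference_command(out_path="custom-reference.docx"):
    """Dump pandoc's built-in reference.docx so its styles can be edited in Word or LibreOffice."""
    return ["pandoc", "-o", out_path, "--print-default-data-file", "reference.docx"]


def docx_command(source, output, reference_doc):
    """Convert source to docx, taking the styles from the edited reference document."""
    return ["pandoc", source, f"--reference-doc={reference_doc}", "-o", output]
```

The idea is to run the first command once, restyle the resulting file, and then pass it to every later conversion (for example via `subprocess.run(docx_command(...), check=True)`).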


I'm not the OP, but for me it's converting statistical analyses done in Rmarkdown to PDF or HTML.


pandoc is one of the few packages (along with tetex) I blacklisted on my distribution for automatic updates, because it seems to pull in hundreds of other packages which are not used by anything else.

I don't know how they did it, but somehow they put dependency hell on a completely new level.

Yes, I'm sure it's a great tool, but there's a limit to how much bloat I can tolerate for a single program.


That's your distro's problem.

    $ zypper info --requires pandoc

    libm.so.6()(64bit)
    libpthread.so.0()(64bit)
    libm.so.6(GLIBC_2.2.5)(64bit)
    libpthread.so.0(GLIBC_2.2.5)(64bit)
    libm.so.6(GLIBC_2.29)(64bit)
    libdl.so.2()(64bit)
    libdl.so.2(GLIBC_2.2.5)(64bit)
    libz.so.1()(64bit)
    libc.so.6(GLIBC_2.17)(64bit)
    ld-linux-x86-64.so.2()(64bit)
    ld-linux-x86-64.so.2(GLIBC_2.3)(64bit)
    libgmp.so.10()(64bit)
    libpthread.so.0(GLIBC_2.3.2)(64bit)
    libm.so.6(GLIBC_2.27)(64bit)
    librt.so.1()(64bit)
    libutil.so.1()(64bit)
    libpthread.so.0(GLIBC_2.12)(64bit)
    libnuma.so.1()(64bit)
    libnuma.so.1(libnuma_1.1)(64bit)
    libnuma.so.1(libnuma_1.2)(64bit)
    libffi.so.8()(64bit)
    libffi.so.8(LIBFFI_BASE_8.0)(64bit)
    libffi.so.8(LIBFFI_CLOSURE_8.0)(64bit)

    $ rpm -ql pandoc | grep -v '^/usr/share'

    /usr/bin/pandoc

    $ ll -h /usr/bin/pandoc

    -rwxr-xr-x 1 root root 162M Sep 30 13:33 /usr/bin/pandoc


It would seem that your distro is statically linking all the Haskell libraries. On distros that use dynamic linking for everything, it's also going to pull in (directly or indirectly) ~130 Haskell libraries.


Correct.


This has little to do with pandoc and everything to do with how awfully Haskell packages are packaged for some distros. Imagine if installing a program that runs on node would pull in every single npm dependency as its own package.


That's the point though: you should only need one package manager.


Can I hazard a guess that you are on Arch?

The Arch maintainers (and some other Linux maintainers) have made the decision to package all Haskell libraries as separate OS packages and install those as dependencies when you install, say, pandoc. This model of distribution doesn't really make much sense for distributing Haskell binaries, though.

There are a few reasons for this: 1) most people don't have many Haskell binaries, and the few that people use don't share many libraries; 2) Haskell packages are normally statically linked when building executables.

If Linux maintainers would simply build/ship pandoc as a single static executable, all these issues would disappear.


Nice!

I'm on Arch, but I was under the impression that Debian did the same thing for Node/Haskell modules. Or am I mistaken?


> That's the point though: you should only need one package manager.

That's orthogonal to the issue. Even with just one package manager, how packages are created and maintained is a separate task.

So, if the pandoc package has dependencies vendored, dependency hell is avoided regardless of which package manager is used to install it.

If, however, the pandoc package has all dependencies listed as separate packages, dependency hell is created, again regardless of which package manager is used to install it.

So this is a matter of policy, not tooling.


I think the difference is in the visibility of dependency hell, not the existence of it.


Does that actually matter? To me dependency hell is when you have lots of conflicts where some software requires one version but some other software needs a different version. So you can't upgrade one version without breaking something else.

With pandoc and all the haskell dependencies, the only downside is the length of the list of packages when you upgrade. If it was all bundled up as haskell-all I doubt I'd even notice.


That's probably because it's compiled dynamically in your distro's package manager. If you look for a statically compiled option, it might be more to your taste.


And there are statically-compiled versions available for multiple platforms on Pandoc's download page. (I tend to use those for the Mac, rather than installing through Homebrew.)


Arch, I presume? That's mostly due to a man-power problem on the side of the Arch Haskell maintainers. Try our pandoc Docker images or use pandoc-bin from AUR for a bloat-less version. https://hub.docker.com/u/pandoc


Considering what pandoc does and how it is used, docker is a massive overkill imho. What pandoc should actually do, is come as a tar ball and be buildable the traditional configure make make install way like all unix tools of a similar fashion do. Haskell, atm, is no language for this.


Hahaha, that's actually some quality and funny trolling. Not bad :D

For everybody interested in alternative installation methods: all pandoc releases are available as statically compiled binaries for Linux, and via installers on macOS and Windows. All major package managers ship a more-or-less recent version of pandoc. Compiling is as simple as getting the "stack" tool and running `stack install`.


If you’re concerned, run it in docker and dispose of the container after you’ve got your output docs.


Pandoc is great but I think it falls a bit short of being a Swiss army knife; there are a lot of conversions it cannot do, like PDF-to-anything. Thankfully Calibre's 'ebook-convert' tool covers many of pandoc's blindspots.


But a real Swiss army knife does not include any magic either - even simply extracting text from PDF (ignoring all formatting) is completely non-trivial. I do not know of any (non-magical) specialized tool that can convert PDF formatting.


Exactly, Pandoc chooses robustness over buggy, half-baked conversion. A Swiss Army knife is no good when you need to debone a tuna. Every tool on a Swiss Army knife is sub-optimal. It's a popular but terrible analogy in general.


I don't know much about tuna, but I once cleaned a bass with my swiss army knife.

Anyway, I don't really expect Pandoc to do everything, but when you have both Calibre and Pandoc in your toolbox, it sure feels like you could manage close to anything.


Calibre (`ebook-convert`) makes a decent attempt at converting PDFs to other formats. This of course is very far from perfect, but it takes a good stab at it and I've sometimes found the results to be usable (often with some manual cleanup.)

Another example where Calibre complements Pandoc well is when generating ebooks for sideloading onto Kindles. Pandoc can create epubs which Calibre can in turn convert to mobi.


Great thing about Pandoc - it has a clear, descriptive and yet unique name that aptly describes what it does.

That aside, I find the markdown + additional features (e.g. latex math, inline code eval), mainly as implemented in Rstudio and Rmarkdown, to be the sweet spot of power and convenience of typing and legibility in plain text form. Thanks pandoc!


I've been using pandoc a lot recently to convert DRM-free epubs into plain text, which I pipe into the Mac's say command (a text-to-speech program). I generally then pipe that to ffmpeg and output the file as mp3 for compression's sake. Obviously I only use the audio output for myself. I find the Mac's Books app useful for the audio because you can set the speed up to 2x the original. (I'm sure the say command has some similar settings too.) I even set up my own Automator task to do most of the work for me. I am so thankful to those who made pandoc; it has come in handy time and time again. I used it for tons of my school papers back when I was in school, and now it's my go-to document converter.

EDIT: I've also used this workflow for reading RFCs for OAuth and such. That's basically just a small curl piped to say. Sometimes, if I feel like reading an article, I'll add a readability-like CLI tool between the curl and say commands. Unix is awesome!
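
A minimal sketch of the epub-to-audio workflow above, with Python building the three command lines. It uses intermediate files rather than pipes, which is my simplification and not necessarily how the original Automator task works; `audiobook_commands` and the file names are hypothetical, and `say` exists only on macOS.

```python
import subprocess


def audiobook_commands(epub, stem):
    """Commands for epub -> plain text -> spoken AIFF -> mp3, run in this order."""
    return [
        ["pandoc", epub, "-t", "plain", "-o", f"{stem}.txt"],
        ["say", "-f", f"{stem}.txt", "-o", f"{stem}.aiff"],
        ["ffmpeg", "-i", f"{stem}.aiff", f"{stem}.mp3"],
    ]


def make_audiobook(epub, stem):
    """Run each step, stopping at the first one that fails."""
    for command in audiobook_commands(epub, stem):
        subprocess.run(command, check=True)
```

Something like `make_audiobook("book.epub", "book")` would leave `book.mp3` next to the intermediate files.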


A lot of tech book publishers actually release their books from their own websites in DRM-free formats like epub, such as Manning, No Starch Press and often O'Reilly if you get them from the right place (humblebundle.com is generally a pretty good source for that if you're patient). Sadly, O'Reilly has stopped selling books directly from its website, so you have to get them from somewhere else (back when they did sell them directly, they were DRM free).


I've self-published a couple of paperback novels that I create using LaTeX, then I run them through pandoc to get a perfectly formatted .epub that I use to sell the e-book versions.

Flawless!


I'm using pandoc for generating pdf/epub ebooks from GitHub style markdown. The default output is good enough and there are various themes that can be selected. But I wanted to customize a lot of things like chapter breaks, background color for inline code, bullet styles, blockquote style, etc. I didn't know Latex but was able to find snippets from stackexchange sites to suit my needs. I wrote a blog post on this: https://learnbyexample.github.io/customizing-pandoc/


I absolutely love Pandoc, I use it in my Makefile based static site generator. Pandoc is probably one of the most valuable pieces of open source tooling next to ffmpeg and imagemagick.


Pandoc for text, ffmpeg for audio/video and imagemagick for images?

I've used pandoc for pdf generation and ffmpeg for some audio recording/encoding/playback. I can't imagine what I would use imagemagick by itself for though (that I wouldn't use some common image processing application for). What do you use imagemagick to do?


> What do you use imagemagick to do?

Automate various transformations:

- resize
- change orientation or ratio
- adjust colors
- convert format
- do all of the above to generate thumbnails of large photos, in one command
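
As a sketch of the "one command" thumbnail case, here is a Python helper that builds a single ImageMagick call. The helper name and default values are made up for illustration, and it assumes the classic `convert` binary (ImageMagick 7 installs it as `magick`).

```python
import subprocess


def thumbnail_command(source, dest, box="320x320", quality=85):
    """Fix orientation, shrink to fit inside box (never enlarge), set the output quality."""
    return [
        "convert", str(source),
        "-auto-orient",
        "-resize", f"{box}>",
        "-quality", str(quality),
        str(dest),
    ]


def make_thumbnail(source, dest):
    """Run the command; the output format follows the extension of dest."""
    subprocess.run(thumbnail_command(source, dest), check=True)
```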


Hadn't heard of pandoc before. Momentarily thought it converted from PDF to anything, and my heart leapt. Alas, it only converts to PDF. My hopes dashed...


That's not really a reasonable expectation, as PDF is an output format, not an input format. If you want to make a PDF that others can read, the best solution is to generate a PDF that embeds the original input. LibreOffice can do this.


Poster child for Haskell


Yes, it is the only program coded in Haskell I have ever used for anything practical, to my knowledge.

I have heard of others, like git-annex, but not used them myself. I wonder if there are any I just didn't know were.

I also wonder if anything about Haskell makes it particularly suited as the implementation language for Pandoc. It must have a lot of parsers in it, and Haskell is supposed to be good for coding parsers.

There are parser generation libraries and meta-libraries for certain other languages, notably C++. I wonder what Pandoc in C++ would look like. Probably a pretty good parser meta-library could be spun out of such a project.


Allow me to shill another amazing Haskell program for general use then: https://www.shellcheck.net/



If this list rounds up the most-used Haskell programs, I can safely conclude that I don't use any Haskell program besides Pandoc.

Apparently I use a few Go programs--Docker, maybe others?--but no Java programs at all, because I delete all the JVMs from my machines without noticeable effect. Likewise, no C# programs, because I have no Mono runtime. Probably no Lisp, Smalltalk, Julia, or OCaml. Some things I run almost certainly are or use Lua, and of course Python, Perl, and even Tcl. I don't know of any in Rust, but it would be hard to tell because of static linking.



Xmonad is another famous one


I have never understood the interest in window managers. All I have seen that try to automate to expectations fail to approach mine.

I.e. a window manager does not seem to me like a useful practical application, once there is a minimal one already. Fvwm was fine.


I used pandoc with filters written in Haskell for my blog. I was surprised how far I could stretch it before I had to switch to Rust with pulldown-cmark (just went for Rust for learning although it turned out to be a good decision).

Pandoc filters allowed me to transform the AST in useful ways. For example I turned the image tag into HTML figures with captions, used the video tag if the URL was a video, and called ffmpeg to encode the video in another format for browsers that didn't support the other format.
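
To give a flavour of this kind of transformation, here is a minimal sketch in Python (not the Haskell I actually used) that works directly on the JSON form of the AST. It assumes a standalone image arrives as a `Para` whose only inline is an `Image` with content `[attr, caption, [url, title]]`, which may differ between pandoc versions. `plain_text`, `to_figure` and `transform` are illustrative names, and the caption is flattened to plain text.

```python
import html
import json
import sys


def plain_text(inlines):
    """Flatten inline nodes to text; only Str, Space and SoftBreak are kept."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


def to_figure(block):
    """Turn a Para holding a single Image into a raw HTML figure; leave other blocks alone."""
    if block["t"] != "Para" or len(block["c"]) != 1:
        return block
    inline = block["c"][0]
    if inline["t"] != "Image":
        return block
    _attr, caption, (url, _title) = inline["c"]
    text = html.escape(plain_text(caption))
    src = html.escape(url, quote=True)
    markup = f'<figure><img src="{src}" alt="{text}"><figcaption>{text}</figcaption></figure>'
    return {"t": "RawBlock", "c": ["html", markup]}


def transform(doc):
    """Apply to_figure to every top-level block (nested blocks are not visited)."""
    doc["blocks"] = [to_figure(block) for block in doc["blocks"]]
    return doc


def main():
    # Filter protocol: JSON AST in on stdin, modified JSON AST out on stdout.
    json.dump(transform(json.load(sys.stdin)), sys.stdout)
```

Saved as an executable script that ends by calling `main()`, it could be run with something like `pandoc --filter ./figure.py post.md -o post.html`.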


I write my lectures and labs in .md and convert to pdf with pandoc. I like the results tex produces but I don't love the language, so pandoc is ideal.


Why not use LyX as your front end into latex/Tex?


Pandoc is wonderful. I don’t use it often, but I always have it installed and available.

+1 for being written in Haskell. Indeed, way back when I became interested in Haskell, I think it was noticing that this tool I was using was written in a strange programming language that influenced me to eventually adopt it for many side projects and to write a little book on it.


As much as I like pandoc, I hate how many Haskell dependencies it has on Arch Linux. And the distro is not to blame here. They do it right. In that sense pandoc might be an excellent tool, but for me it's also a reason to think twice whenever you want to use Haskell in production. Because apparently, this is a Haskell ecosystem issue.


This is very much an Arch issue. The publicly available debian/fedora pandoc packages are statically linked, and, until two years ago, so was the Arch Linux pandoc package. The change to dynamic linking (and therefore 700+ MB of Haskell-related dependencies) was a deliberate decision made at the time to reduce maintainer burden. A statically linked pandoc is still available on the AUR under the name pandoc-bin.


How come on Ubuntu and Debian I don’t have any problem whatsoever?


This is probably a silly question, but the last (and first) time I used pandoc, my conversion of org files to markdown resulted in a lot of whitespace within the document itself. I followed the instructions on the website, but is there a flag that I should have used to get rid of excess whitespace?


I'm the author of pandoc's org-mode parser. Can you drop me a mail (listed on my GitHub profile <https://github.com/tarleb>) or post to the pandoc-discuss mailing list?


Thanks for writing this parser!

FYI, https://orgmode.org/list/87y2jvkeql.fsf@gnu.org is about enhancing Org's syntax documentation. If you have specific needs/ideas that you'd like to share, please don't hesitate.


Long term pandoc user here!

Been using it with https://github.com/Wandmalfarbe/pandoc-latex-template to generate my documents.

Please comment if there are other nice templates, either for LaTeX or for Doc


I'm working on one! [0]

However it's not quite done, yet. I'm mostly interested in PDF output, and not having LaTeX was one of the goals, so I use weasyprint for PDF generation. Too bad they are very slow with releases, and I encountered many bugs...

[0] https://github.com/runxel/Morris


It surprised me that I couldn't find a decent tool to read markdown in a shell. I tried about a dozen tools, but pandoc did it best: its output reads sufficiently well when fed into the man command.
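
Roughly, the idea looks like this as a Python sketch. The helper names are made up, and reading a page from stdin with `man -l -` is an assumption about the local man (it works with man-db on Linux, probably not with the macOS man).

```python
import subprocess


def man_page_command(markdown_path):
    """Ask pandoc for a standalone man page built from a markdown file."""
    return ["pandoc", "--standalone", "--to", "man", markdown_path]


def read_in_man(markdown_path):
    """Render the file as a man page and hand it to man on stdin."""
    page = subprocess.run(
        man_page_command(markdown_path), check=True, capture_output=True
    ).stdout
    subprocess.run(["man", "-l", "-"], input=page, check=True)
```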



Does anyone have practical experience maintaining an entire website through pandoc generated HTML? Is it worth it, and what are some pitfalls to be aware of?


That's how I generate my website (and everything else). There really are no pitfalls. Whenever something is not working, I discover that the answer is in the official Pandoc manual. I suggest getting a recent Pandoc; the version in your package manager may be a bit old.

https://lee-phillips.org


use Hakyll if you want pandoc generated HTML for the website

https://jaspervdj.be/hakyll/


And with hakyll, you get a static site generator powered by all the goodness that is pandoc. Blazingly fast (compared to say, pelican) and easy to extend.


This is great! Anyone know what the format for Google Docs is and whether Pandoc or another tool is good for importing GDocs into other formats?


Google Docs exports pretty well to docx, pandoc can handle it.


Pandoc is great though I struggle with latex. Is there an easier way to go from md to pdf with your own template?


There's a popular template [0] which you can adapt to your needs. I didn't know LaTeX either, so I cobbled together snippets I found on stackexchange sites [1] (this was before I knew about that template, else I'd have probably started with that).

[0] https://github.com/Wandmalfarbe/pandoc-latex-template

[1] https://learnbyexample.github.io/customizing-pandoc/


- Style using CSS: Use Pandoc to HTML, and use wkhtmltopdf or chrome headless to convert HTML+CSS to PDF.

- Style using XSL-FO: Use Pandoc to DocBook, XSLT docbook-xsl stylesheets to convert to XSL-FO, Apache FOP to convert XSL-FO to PDF.
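
A hedged sketch of the CSS route, as a Python helper that builds one pandoc command. Rather than the two separate steps described above, it relies on pandoc's own `--pdf-engine` option to call the HTML-to-PDF tool; `css_pdf_command` and the file names are placeholders.

```python
def css_pdf_command(source, css, output, engine="wkhtmltopdf"):
    """Markdown -> HTML styled with a CSS file -> PDF, with pandoc driving the HTML-based engine."""
    return [
        "pandoc", source,
        "--to", "html",
        "--css", css,
        f"--pdf-engine={engine}",
        "-o", output,
    ]
```

Passing `engine="weasyprint"` or `engine="prince"` should swap in one of the other HTML-based engines, provided it is installed.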


Or HTML+CSS with WeasyPrint or Prince; the latter is free for personal use.


Can anyone point me to docs/code where the internal pandoc format (AST) is described please?


I’ve used many converters in my life, but Pandoc is the one I always end up using every time


I rather expected more than just two ebook formats on something described as a universal document converter.


Pandoc ubuntu apt installation is horrible.

I have installed the latest texlive in home directory.

When I invoke 'sudo apt install pandoc' it requires me to install a massive texlive setup at the system level as part of it.

This is not specific to pandoc; many other packages do the same. I have anaconda3 installed in my home, but image-magick requires a massive numpy/scipy system-level install (ignoring for the moment my bewilderment at why image-magick would require numpy/scipy).

I refuse to put up with this kind of bloated bs.


You're asking the system to install a package. System packages are available to all users. If the package is going to work for all users, its dependencies also need to be available to all users. This naturally leads to what you're seeing: the system will not consider software installed only for your user, so it'll end up installing the same dependency system-wide that you had installed in your home directory. While I understand your frustration, I can't immediately think of a better way to handle this.


What are you complaining about exactly? That your package manager doesn't automatically know you've installed something manually?


Considering what you get for 1G, it is worth it for most users. I would guess that you aren't the target audience for it if you're that concerned over space. 1G of space these days is nothing unless you're using an older system. It's just sitting on your disk, and that takes nothing away from you if you aren't loading it. It handles dozens of file types, and that requires a lot of libraries.


In Ubuntu and Debian the dependency from pandoc to texlive is of the "suggests" type, not "required". So you do not have to install texlive to use pandoc. You may use an interactive front-end like aptitude and simply deselect all the suggested dependencies you don't care about (or configure aptitude not to install suggested deps by default).


I think these packages contain PDFs as well, which pushes the whole texlive installation over 1 GB. Even without the PDFs, texlive is pretty big. I don't think there is a way around it. You can use a docker image to isolate pandoc from the system.



