Latex to HTML5 Conversion for Scientific Papers (github.com/smarr)
96 points by smarr on Sept 21, 2015 | 40 comments



pdf2htmlEX (https://github.com/coolwanglu/pdf2htmlEX) would be another route if you aren't concerned about preserving the semantics of the markup. Obviously you would have to go LaTeX -> PDF -> HTML5. It maintains the paginated style and looks good for scientific papers (example: http://coolwanglu.github.io/pdf2htmlEX/demo/demo.html)


That's definitely very cool - I wanted to find some more info about the project for my own interests (https://news.ycombinator.com/item?id=4528797)


Cool. That said, I use LaTeX for typesetting equations. Neither of the linked examples shows output of typeset equations.


I was wondering about that as well. Getting equations right is critically important. But, from my limited experience, the MathJax syntax is LaTeX-enough that equations might be one of the easier parts of doing this?


> MathJax syntax is LaTeX-enough

It is LaTeX syntax.
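
To make that concrete, here is a minimal sketch of embedding a LaTeX equation in an HTML5 page and letting MathJax render it. The helper name `mathjax_page` is made up for this example, and the script URL is an assumed MathJax 3 CDN location; adjust it to whatever MathJax build you actually serve.

```python
import html

# Assumption: a MathJax 3 build loaded from a public CDN.
MATHJAX_URL = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"


def mathjax_page(latex_equation: str) -> str:
    """Return an HTML5 page showing one display equation written in LaTeX."""
    # Escape only HTML-significant characters (<, >, &); backslashes and
    # braces pass through untouched, so the LaTeX source stays as written.
    body = html.escape(latex_equation, quote=False)
    # \[ ... \] is one of MathJax's default display-math delimiters.
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f'<script async src="{MATHJAX_URL}"></script>\n'
        "</head>\n"
        "<body>\n"
        f"<p>\\[ {body} \\]</p>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(mathjax_page(r"\int_0^1 x^2 \, dx = \frac{1}{3}"))
```

The equation string is the same source you would put inside a LaTeX `\[ ... \]` block, which is the point being made above.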


> It is LaTeX syntax.

That's LaTeX-enough for me :)

But in all seriousness, thanks for clarifying/confirming. When I posted I didn't want to risk overstating.


Nice! There's also this one for getting LaTeX SVGs on the fly: http://tex.la/


What does it do beyond what tex4ht does? If there is some way tex4ht could be improved, perhaps it would be best to contribute to that project? https://www.tug.org/applications/tex4ht/mn.html


It is built on top of tex4ht. It merely provides a few settings for tex4ht and post-processing scripts that beautify the generated HTML. You might ask why post-processing? Well, because it was simpler for me than figuring out how to get tex4ht to do the desired thing. I just don't find TeX/LaTeX pleasant to use as a programming language, but that's personal taste.
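
As a rough illustration of what an HTML post-processing step of this kind might look like (this is not the project's actual script; the function name and the regex approach are assumptions made for this sketch), the following Python merges directly adjacent `<span>` elements that share the same class. It only handles spans whose sole attribute is `class` and whose content is plain text.

```python
import re

# Matches: <span class="X">text</span><span class="X">
# i.e. a text-only span immediately followed by another span of the same class.
_ADJACENT_SPANS = re.compile(
    r'<span class="([^"]*)">([^<]*)</span><span class="\1">'
)


def merge_adjacent_spans(html_text: str) -> str:
    """Collapse runs of same-class, text-only spans into a single span."""
    previous = None
    # Repeat until nothing changes, because one pass only merges pairs.
    while previous != html_text:
        previous = html_text
        html_text = _ADJACENT_SPANS.sub(r'<span class="\1">\2', html_text)
    return html_text


if __name__ == "__main__":
    sample = '<span class="cmr-10">fi</span><span class="cmr-10">nal</span>'
    print(merge_adjacent_spans(sample))
```

A regex is enough for a sketch like this; for real documents an HTML parser would be the safer choice.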


If the post-processing stage is useful to others, perhaps it could be upstreamed into tex4ht?

(I sometimes think that the user interface of github puts too much emphasis on cloning and not enough on cooperation. Many useful tools end up in a dozen forks, all with slightly different features, all equally inactive.)


What science needs, I believe, is not another tool for making TeX more usable for information interchange, but a simple (much simpler than LaTeX, whose complexity and user experience are terrible), web-oriented standard language for writing libre scientific and technical documents that browsers would support (or that can be translated to valid HTML+CSS seamlessly). Web documents are cheaper and more accessible to people, so we should concentrate on those. I don't know if there is a usable language and platform of this kind already. Once we have that, tools for converting web documents to other, less important formats such as paper-printable ones can be created.


Web documents are cheaper and more accessible, but there's still a quite large usage of print documents, so at least I, as a document author, don't want to commit to a "web-only" toolchain without a good to-print workflow also being available.

It's possible for HTML+CSS to also provide a good to-print workflow, but I don't think it's there yet, at least using open-source tools. I have heard PrinceXML can produce good results, with sufficient control over the print layout to make HTML+CSS usable as a print-oriented markup language. But between the cost, and the prospect of becoming dependent on a proprietary tool with no obvious alternatives, I haven't tried it.


PrinceXML is definitely the best tool, but there are a couple of similar commercial alternatives, and wkhtmltopdf has become a decent open source solution as long as you only want basic docs.


Agreed. I would use latex for everything but I need to work with collaborators who are only used to writing in Word with track changes. Something that would allow me to write in latex then still share for editing with less tech savvy people would be wonderful.


Have you tried pandoc for converting LaTeX to markdown or word? It generally works well unless you have a lot of equations or references.
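
For anyone who wants to script that conversion, here is a minimal sketch that shells out to pandoc from Python. The file names are placeholders and the helper names are made up for this example; it assumes pandoc is installed and on your PATH, and it relies on pandoc picking the output format from the target file's extension.

```python
import shutil
import subprocess


def pandoc_command(source: str, target: str) -> list[str]:
    """Build the pandoc invocation for converting a LaTeX file."""
    # --from latex states the input format explicitly; the output format
    # is inferred by pandoc from the target extension (.docx, .md, ...).
    return ["pandoc", "--from", "latex", source, "--output", target]


def convert(source: str, target: str) -> None:
    """Run pandoc, failing loudly if it is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(source, target), check=True)


if __name__ == "__main__":
    # Placeholder file names; replace with your own paper.
    convert("paper.tex", "paper.docx")
```

Swapping the target for `paper.md` gives the Markdown route instead of the Word one.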


You should definitely post some info about your project to the tex4ht mailing list. I hope some interesting and more informed discussion might happen there.

You may also want to take a look at make4ht (https://github.com/michal-h21/make4ht). It is a build tool for tex4ht, included in TeX distributions, and it also solves some of the same problems as your script (ligatures, spurious span elements, image conversion, unicode, etc.). It can execute custom commands on all output files, so your script could be used with it as well.


The funny thing is, I find HTML5 to be much more predictable when it comes to formatting :)

But maybe that's just my ignorance of LaTeX showing, although I did manage to write two theses in it.


LaTeX is a very bad format and a dead end when it comes to conversion to other formats.

I also like Pandoc and Markdown... if somebody is interested in a workflow for writing a scientific paper with them, check this project out: https://github.com/tompollard/phd_thesis_markdown


Strictly speaking, isn't this false by virtue of latex compiling to pdf?

And I have been looking hard for years and haven't found anything to replace TeX that fulfills even half my needs (as a person who does math typesetting in large documents almost daily). LaTeX has its issues, but it's the best we've got.


Compiling to PDF as a display format, and being amenable to translation to another markup format that retains the structure of the original markup, are slightly different tasks.

In general, I'd agree with the parent that LaTeX is something of a dead end in terms of translation. LaTeX will happily compile to PS/PDF/DVI, but translation to something like HTML is pretty reliant on using a subset of LaTeX. You can, after all, write LaTeX code to do computation.

In my experience with Pandoc, there are a good number of packages that simply don't work when translating from LaTeX. The more specialized or formatting specific the package--the further you deviate from the standard Article class and simple section headers--the less likely that you'll have a good result.


A format is a format, but point taken.

I've found that pandoc (and every other conversion utility) doesn't even produce useful results for LaTeX even if the LaTeX source doesn't use any complex macros or fringe packages.

The problem seems not to be that LaTeX is too powerful, but rather that the point of HTML and Markdown is to be as lightweight as possible. LaTeX, on the other hand, is meant to be a useful tool for humans to use to minimize the amount of boilerplate needed to write a large technical document (in print or on the web!).


This is actually kind of true. LaTeX is a Turing-complete language, which makes converting it to other things somewhat unreliable.

But, there's kind of an informal subset of LaTeX that you can use and not be too bad off with Pandoc.

In general, I'd like to do academic and report writing in Markdown (or Org!), but limitations of Markdown usually make it easier for me to do my writing directly in LaTeX, portability be damned. The biggest impediment for Markdown in academic / report writing, in my opinion, is the lack of internal references (things like "see figure X").

Org-mode can be converted via Pandoc much the same as markdown, but doesn't have the issue with cross-references.
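
To show what is missing, here is a toy sketch of the kind of preprocessing that internal references need. The `{#fig:name}` label and `@fig:name` reference syntax is hypothetical, chosen only for this example, and is not part of plain Markdown; the function simply numbers labels in document order and substitutes the numbers into references.

```python
import re

# Hypothetical syntax for this sketch: a figure is labelled with {#fig:name}
# and referenced elsewhere in the text with @fig:name.
LABEL = re.compile(r"\{#fig:([\w-]+)\}")
REFERENCE = re.compile(r"@fig:([\w-]+)")


def resolve_figure_refs(markdown: str) -> str:
    """Replace @fig:name with 'figure N', numbering labels in document order."""
    numbers: dict[str, int] = {}
    for match in LABEL.finditer(markdown):
        # The first occurrence of a label decides its number.
        numbers.setdefault(match.group(1), len(numbers) + 1)

    def substitute(match: re.Match) -> str:
        number = numbers.get(match.group(1))
        # Leave references to unknown labels untouched so they stay visible.
        return f"figure {number}" if number is not None else match.group(0)

    return REFERENCE.sub(substitute, markdown)


if __name__ == "__main__":
    text = "See @fig:setup.\n\n![Test rig](rig.png){#fig:setup}\n"
    print(resolve_figure_refs(text))
```

Note that a reference can appear before the figure it points to, which is why the labels are collected in a separate first pass.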


Seconded the cross-reference gripe with Markdown. Somebody clever has to be working on that already, it's such an obvious gap. I think it would enable things like RMarkdown to reach "critical mass".


Latex has too much inertia for users to switch anytime soon. The only competition is from commercial tools like Adobe Framemaker.

I hope the future will turn out to be LyX + Asciidoctor export. If Microsoft Word ever improves its math input interface, I think it would just kill off LaTeX completely. Word is already almost mandatory for document interchange.

Markdown is too limited. It has no standard for extensions - Asciidoc does. Markdown IMO is strictly for writing blog posts that are mostly text.


Also, once you get used to LaTeX (and Emacs in my case), you can't find any other editor or format that does exactly what you want the way LaTeX does.


Nice! I would suggest a sans-serif font though, aren't they better for display reading? Hyphenation would be nice too.


The main historical reason for using sans-serif fonts on displays was the displays' lousy resolution. With modern display technology that reason has mostly disappeared.


At what point does the resolution become good enough for serif fonts I wonder? Unfortunately 1366 x 768 is still quite the norm (118 ppi on a 13.3" screen).
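
For reference, the pixel density figure comes from dividing the diagonal resolution in pixels by the diagonal size in inches. A small sketch of that arithmetic (the function name is made up for this example):

```python
import math


def pixels_per_inch(width_px: int, height_px: int, diagonal_in: float) -> float:
    """Pixel density: diagonal length in pixels over diagonal length in inches."""
    return math.hypot(width_px, height_px) / diagonal_in


if __name__ == "__main__":
    # The 1366 x 768, 13.3" panel mentioned above.
    print(round(pixels_per_inch(1366, 768, 13.3)))
```

The same function can be used to compare other panels, given their resolution and diagonal size.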


I find serif fonts to be perfectly readable even on 1280x800 displays.


On screens it tends to be the other way around - serifs for headlines and sans-serif for body text.

Although it'd be great to see some decent research into readability of different fonts.


I've had so many ideas for studies about the impact of fonts I wish I'd pursued. A psych professor of mine was interested in them and did at least one study that I know of[1], but information on speed and legibility would be fascinating.

[1]http://www.researchgate.net/publication/237229931_The_Influe...


I find that Pandoc fulfills all of my needs.


I tried Pandoc before reverting to tex4ht. Unfortunately, it models a rather small subset of the things I was interested in, specifically around the typesetting of citations and listings, as far as I remember. So, tex4ht and HTML post-processing it was.


Not enough Web knowledge to easily dive into the codebase...

Could anyone be so kind as to compare this with: https://github.com/pyramation/LaTeX2HTML5


Is there full support for math mode? (it doesn't appear from the examples that the author uses math mode, so I'd guess no)


tex4ht does support math mode. By default it puts out pictures. But it is highly configurable. See for example https://www.tug.org/applications/tex4ht/mn11.html.


I have found all these utilities unsatisfying. It might make sense to write a JavaScript library that parses LaTeX.


In general, mapping LaTeX to HTML is an unsolvable problem (and I speak as the author of one attempt to solve it (http://github.com/softcover/softcover)).


Link has an extra closing parenthesis, this link works: https://github.com/softcover/softcover


Thanks. Looks like it's due to a bug in the Hacker News link parser. I'll be more mindful of that in the future.



