Converting my PhD thesis into HTML (2021)

periheli0n · on Dec 19, 2022

The real shocker is that it’s 2022 and LaTeX is still the best writing environment for a PhD thesis. It has so many downsides: the markup syntax is ugly, it really works best only if one used paginated output such as PDF, a zoo of partly incompatible packages, need for compilation, obscure figure placing algorithms that are difficult to control, and so on.

It still beats the competition because of rock-solid referencing, both to in-text elements like equations, chapters, etc as well as citing literature with bibtex.

Plus, it’s extremely stable, so someone who learnt LaTeX 20 years ago, like yours truly, can download the newest TeX distribution and feel at home immediately.

Nevertheless, I would prefer a Markdown-based system that can use CSS and MathML, and has a 100% bibtex clone for references.

Yes, pandoc goes quite a long way along this route, but setting up such a pipeline is still too complicated for many.

analog31 · on Dec 19, 2022

It must depend on the field. A close relative of mine is a PhD advisor in a science field. He's hands-off about it, but is also aware of what his students are doing. If asked, he recommends MS Word, which is also what he uses for his manuscripts.

My own experience was as a physics student, 30 years ago. Students paid a heavy price for being able to print and submit the entire thesis with no manual intervention. The students who chose LaTeX took the longest at it. I didn't have access to a Unix terminal anyway, and banged out my thesis on an MS-DOS machine. Whatever my word processor couldn't support, I added by hand. The readers were OK with this.

My solution to all typographic problems was "take care of it after defense." I spent a few days after my defense getting my copy to be ready for duplication, including sticking all of the page numbers on with glue because I couldn't make inline figures work.

periheli0n · on Dec 19, 2022

Sure, one can write a thesis in MS Word. It has come a long way with support for large documents. But I still find its referencing clumsy, opaque and unstable.

For example, automatic updates of figure numbers in captions and references: Countless times it failed on me and I had to manually recreate the fields, bookmarks, cross-references, and whatnot is needed.

Bibliographies are hardly doable without an external tool that comes with its own headaches.

Typography in MS word is quite decent these days, though. Anyway, the content of a PhD thesis shouldn't be judged by its typography (as long it maintains a readable standard).

godelski · on Dec 19, 2022

I think things have changed a bit since you were a physics student. Conferences hand out latex templates and expect you to use them (wish they would also hand out an overleaf template. If any conference organizers are reading this...). Universities also do this with their undergrad/masters/thesis templates. Arxiv expects you to upload tex source code (it'll reject a PDF if you wrote that PDF with latex. It also is terrible at error messaging which is a huge pain since submission timing is for some stupid reason important). I'm sure latex is also easier than back then, but there's a lot of momentum in the latex direction that I think would be really difficult to undo. Even paper acceptance is highly influenced by formatting and figure design. I think it is just a different world as we have a lot more researchers now than even 30 years ago.

analog31 · on Dec 20, 2022

Amusingly, some things haven't changed. I was the first student to turn in a word processed term paper at my college, I think in 1983. And I estimate that I earned as much as a full letter grade on my GPA because the prof's had never thought about how to grade a paper that was 100% mechanically perfect. It didn't hurt that I had become a very fast typist thanks to programming. I selectively chose courses where the grading was primarily based on written work, something that most students feared.

godelski · on Dec 20, 2022

I'm sorry, I'm failing to realize why this story is about how things have/n't changed w.r.t. word/latex usage withing the last 30 years in academic writing.

analog31 · on Dec 20, 2022

Your comment about formatting and figure design influencing acceptance, triggered my droll little reminiscence. I certainly wasn't disagreeing with you.

KronisLV · on Dec 20, 2022

> If asked, he recommends MS Word, which is also what he uses for his manuscripts.

My university actually required that people use MS Word for their thesis, which seemed to work out okay for many, despite such a top down approach not seeming like the best option.

Personally, I used LibreOffice anyways and while it was certainly as clunky as Word (especially once images, diagrams and formulas got involved), it was also passable.

Except that things like bibliography refused to work correctly and completely broke, about which I wrote a bit of a rant: https://blog.kronis.dev/everything%20is%20broken/libreoffice...

nextos · on Dec 19, 2022

LaTeX has, like Org Mode, this mythical aura of being super hard. However, replicating the functionality of Word is trivial and takes an hour or two for a savvy computer user to grasp.

There's always Overleaf, Pandoc or LyX to make things even simpler. LyX in particular deserves to be better known.

Complex things, like TikZ, are of course difficult and time consuming. But those are impossible using Word.

IMHO, the biggest advantages of LaTeX are reproducibility and reference management. Big Word documents are quite fragile. And reference management is a mess.

godelski · on Dec 19, 2022

Honestly, it isn't the writing part that annoys me the most. It is tikz and the fact that I can't make animations in beamer. Just resolving these issues would go a long way for me. Tikz could be fixed simply if there was a GUI that could allow for sliders or moving specific objects. Or at least a better way to make a good grid (tip: draw a grid on your canvas, draw whatever you want, remove grid). Things are so difficult to properly line up, even if we have mathematical representations. It shouldn't be that hard...

abdullahkhalids · on Dec 19, 2022

I recently discovered this python interface for tikz https://github.com/allefeld/pytikz

While it does not directly address the issues you point at, it does alleviate some issues.

* The syntax is somewhat easier to parse.

* It is a lot easier to write functions to redraw the same components over and over again.

* Doing math calculations to systemically place objects in relation to each other is a lot easier because python's arithmetic syntax is a lot more intuitive than TeX's.

Of course, this does mean that you have to fire up python to draw figures.

godelski · on Dec 19, 2022

Since it looks like you've contributed to this project, I have one MAJOR suggestion. Show examples. There's countless tikz projects I've seen that promise a lot and show absolutely nothing. I've invested lots of time to fruitless ends that would have been resolved if I just could see some examples that would show me if this is even in the right ballpark of what I'm looking for or not.

Examples code is a must for any software (test cases work as examples btw) and example graphics are a must for any graphics software. I know this isn't your fault, but I'm just venting a bit here. I'll check this out the next time I'm writing a paper but I just don't get how people can put software out there without any examples (even toy examples). If you have any examples you're willing to share I'd find that extremely helpful.

abdullahkhalids · on Dec 20, 2022

I merely fixed a small bug in the code.

There are some examples in this notebook https://nbviewer.org/github/allefeld/pytikz/blob/master/pyti...

Otherwise, this is just a python interface for tikz. Whatever is possible in tikz is possible in this. For every command in tikz, the corresponding python command is here https://allefeld.github.io/pytikz/tikz/

However, I do agree that more examples would help. In its current form, its mostly useful to people who are already familiar with tikz and need simpler syntax, and can quickly get up to speed with this.

godelski · on Dec 20, 2022

Thanks, these examples do help!

enriquto · on Dec 20, 2022

> I can't make animations in beamer

Not yet. But you can easily learn to do them:

    \usepackage{animate}
    ...
    \animategraphics[width=10em,loop,autoplay]{4}{a_}{0}{10}

will animate at 4fps the sequence of images a_0.png through a_10.png

nicodjimenez · on Dec 19, 2022

Mathpix Markdown is an attempt and bringing together the best of words (Markdown and LaTeX) while providing excellent interoperability with LaTeX, meaning you can easily export your Mathpix Markdown documents to LaTeX, including equation references, tabular environments, images, etc:

https://github.com/Mathpix/mathpix-markdown-it

Disclaimer: I'm the founder of Mathpix.

runningmike · on Dec 19, 2022

I would strongly recommend MyST. MyST extends Markdown for technical and scientific communication. See https://www.myst.tools/

abdullahkhalids · on Dec 19, 2022

I tried MyST recently. All I see is a markup language that slowly become more and more complex over time to support more and more features that LaTeX already supports while at the same time acquiring the same syntax complexity of latex.

What people don't acknowledge is that there is a base level of syntax complexity needed to produce fully general documents. If you do, the natural conclusion is that to fix latex, you need a full rewrite of latex with minor changes to fix all the inconsistencies that have crept into it.

rowanc1 · on Dec 31, 2022

My PhD written in MyST is here:

https://phd.row1.ca/phd/introduction

Allows web-first, as well as a PDF output (via LaTeX).

V1ndaar · on Dec 19, 2022

Going by the documentation it does it by... drumroll... converting to LaTeX!

(edit: generating PDFs that is)

periheli0n · on Dec 19, 2022

To be fair, there is no better free tool than LaTeX to typeset PDFs. But it fails at non-paginated, free-flowing content.

chaoxu · on Dec 19, 2022

Have you tried Quarto? It should tick everything in your box (except MathML, but hey that might work too since Quarto is built on pandoc)

countrymile · on Dec 19, 2022

+1 for quarto, i wrote my thesis in rmarkdown which flipped easily between latex and html output, with a bibtex referencing system. It also allowed you to inline latex for more complex outputs. And inlining calculated tables and charts meant i could keep my writing and code together. Quarto is the successor.

periheli0n · on Dec 19, 2022

Thanks for the pointer, that looks interesting. Especially because it is open source!

I see it supports Jupyter notebook. Math support in those isn’t too bad at all, so it might just work for many cases.

vlmutolo · on Dec 20, 2022

I'm looking forward to trying Typst when it's available. It's the first LaTeX alternative that's ever interested me.

http://typst.app/

mcbuilder · on Dec 19, 2022

think the most solid way, and the way I'd do it now if starting my PhD all over, would be a bunch of org docs with LaTeX weaved in.

Pandoc I think really e a different niche, hard to imagine really a complicated document benefiting from even more pipelining. You're editing a lot...

thangalin · on Dec 19, 2022

> Nevertheless, I would prefer a Markdown-based system

My free, cross-platform desktop Markdown editor, KeenWrite[1], integrates with the ConTeXt typesetting software[2]. I'm working on a branch to make integration containerized[3] because its installation is painful. KeenWrite limits math to plain TeX[4] so that the output can be rendered using any TeX-based typesetter (ConTeXt, LaTeX, MathJax, εχTEX, etc.).

Here's a sample document typeset using ConTeXt (skip to page 40 for the math):

https://pdfhost.io/v/4FeAGGasj_SepiSolar_Highlevel_Software_...

That document theme is called Solare[8].

> that can use CSS and MathML

Adding CSS mixes presentation logic with content, which is something KeenWrite strives to avoid. Instead, KeenWrite implements Pandoc's annotation syntax to keep presentation logic out of the content. I've written about this extensively in my Typesetting Markdown series[5].

You can produce some pretty amazing documents just with annotations, such as the following that I wrote in Markdown and typeset using ConTeXt:

https://impacts.to/downloads/lowres/impacts.pdf

> has a 100% bibtex clone for references.

Markdown fails at references. At some point, I'd like to implement cross-references in KeenWrite. Except there's at least six competing standards for the syntax, which I've also remarked upon[6], making the choice of syntax difficult[7].

> setting up such a pipeline is still too complicated for many

FWIW, my Typesetting Markdown series, which explains how to set up a typesetting pipeline using Pandoc, is one of the reasons I developed KeenWrite: to replace that entire pipeline (R, Markdown, externalized variable interpolation, math, and typesetting) with a single tool.

[1]: https://github.com/DaveJarvis/keenwrite

[2]: https://wiki.contextgarden.net/Installation

[3]: https://github.com/DaveJarvis/keenwrite/blob/1_typeset_using...

[4]: https://github.com/DaveJarvis/keenwrite/blob/main/docs/scree...

[5]: https://dave.autonoma.ca/blog/2020/04/28/typesetting-markdow...

[6]: https://talk.commonmark.org/t/cross-references-and-citations...

[7]: https://xkcd.com/927/

[8]: https://github.com/DaveJarvis/keenwrite-themes/tree/main/sol...

vouaobrasil · on Dec 19, 2022

I disagree with the author that PDFs are a terrible format. They guarantee layout, which is very important for complex scientific presentations. Even slight differences in layout can make a complex set of equations difficult to parse. LaTeX also has a much superior word-break/hyphening algorithm to the HTML engines of browsers.

I find PDF math papers easy to browse, unlike the author. They're much easier and more organized than a website, can be easily searched and have a *proper table of contents* compared to websites. As for poorly browsable on a phone -- well I think that is irrelevant because nobody is going to read a complex technical paper in practise on a phone. They do look decent in tablets, and as for screen readers...well that's a valid point but screen readers don't work well for material with lots of equations anyway.

I applaud the author for the effort but looking at the result, I would not want to read math that way.

periheli0n · on Dec 19, 2022

> nobody is going to read a complex technical paper in practise on a phone

I do, in fact. Or rather, I often would like to but with PDF? No chance. IEEE explore online reading sometimes works, but it would work better if they cleaned up their UI to be compatible with phones.

I have read thousands of pages of fiction on a phone and quite enjoyed it. Phones are great for reading if the content reflows properly.

Now publishers and content creators would need to embrace non-paginated, reflowing output. This would not only facilitate reading on phones, but also on tablets and laptop screens.

O‘Reilly‘s online platform does a good job with their app.

There is zero reason why paginated output should be the default in 2022.

oplaadpunt · on Dec 19, 2022

Yes, fiction works because the layout is simple, consisting of text, and maybe images?

Research papers are far more complex, and have established standards that aid quick reading and parsing. I absolutely don't want to deal with reflowing equations, reflowing figures, or whatever when publishing papers. Precise margins and column widths.

periheli0n · on Dec 19, 2022

Yet, by far the vast majority of content produced today, technical or prose, is read on screens.

Responsive webdesign has been around for quite a while. I don’t see a reason, other than lack of effort/investment, why we shouldn’t be able to read technical papers on variable-width screens, in a non-paginated form.

Dealing with the technical challenges should not be the task of the author, but the publisher. And indeed, most publishers are on it.

What‘s missing is a standardised format that can be downloaded, annotated, re-shared like a PDF.

bccdee · on Dec 19, 2022

I wish there were a convention for sharing whole websites. Even a zip file containing an index.html plus images, css, other pages, etc. would be fine if browsers just supported it.

warkdarrior · on Dec 20, 2022

The SingleFile browser extension can export a webpage into one HTML with all images, fonts, etc that are needed embedded in the HTML.

bccdee · on Dec 20, 2022

Yeah but that works through base64 data urls, which are clumsy. Epubs are zips of separate resources and that works great. We should have an equivalent for webpages and other similar documents—a digital-first competitor to pdfs. Or maybe just broader compatibility for epubs, such as first-class browser support.

gildas · on Dec 20, 2022

You might prefer SingleFileZ, see https://github.com/gildas-lormeau/SingleFileZ

auggierose · on Dec 19, 2022

O'Reilly doesn't publish math books. All math books in epub/mobi format look like garbage. There isn't a single exception. If you know of one, please tell me. It seems currently too hard to get layout, resolution and inline formulas right in a portable format.

goosedragons · on Dec 19, 2022

With MathML epubs can look decent. For example take a look at the sample MathML epub "A First Course In Linear Algebra" [0] (in a reader that supports MathML of course). It looks pretty good. The problem is Amazon STILL doesn't support MathML, so publishers just churn out a gross version where all the equations are images and so then it doesn't scale properly with the text and the book becomes 300+ MB because of it. And they can't be bothered to make two versions for readers like Kobo that do support MathML.

[0]: https://github.com/IDPF/epub3-samples/releases/download/2017...

auggierose · on Dec 20, 2022

MathML is not good enough. HTML+KaTeX/MathJax would be good enough, but is not properly supported in any epub readers I know of.

abdullahkhalids · on Dec 19, 2022

I tried the book. There are several places where long equations are cut off. Other minor spacing issues here and there.

periheli0n · on Dec 19, 2022

O‘Reilly‘s online offer has not only O‘Reilly books, but ones from other publishers as well. Some of them have equations. However, they are often rendered as images.

IEEE explore does a good job rendering equations on phone screens. Therefore, it is possible.

There is no technical reason why equations couldn’t be rendered on a screen just as well as on a PDF. Sure, canvas size constraints might interfere, but this problem exists in principle also on paginated output. Plus, horizontal scrolling is a thing.

I‘m not saying a phone is the ideal platform to read a paper containing free energy-like math, but it can go a long way. Much longer than with the artificial restriction to paginated output like PDF.

auggierose · on Dec 19, 2022

Of course it is technically possible, but I haven't seen it done properly. I have never seen a book with math rendered as images that was of satisfactory quality or even close to what PDF can offer. I doubt IEEE explore is an exception, but I don't have an account, so cannot check.

I would like to be able to read a book also on a phone, but I am not going to compromise on quality for that, given that I can just read it on a large tablet in PDF format.

periheli0n · on Dec 19, 2022

It is possible to find Open Access articles with math on ieeexplore with little effort. Have a look here: https://ieeexplore.ieee.org/document/6767058

Does this live up to your maths standards?

auggierose · on Dec 20, 2022

Thank you for the example! Yes, that definitely looks good, but is still just a webpage. Also, it has pictures with bad resolution, and a latex table that has been rendered as an image for some strange reason. So as usual, it is not consistent in its quality, which is usually the problem.

To compare, open the accompanying PDF, which is also provided along with the webpage. It is of MUCH higher quality, which is partially due to the fact that layout is static.

Furthermore, the webpage doesn't support pagination. The problem is turning it into a book, and there doesn't seem to be a good standard that supports HTML+KATEX/MATHJAX properly. In theory there is no reason they shouldn't, as epub support javascript, but in practice it just doesn't work properly.

periheli0n · on Dec 20, 2022

> but is still just a web page

> doesn’t support pagination

Isn’t that exactly the point? Please, no pagination on a screen. A web page will do just fine. In fact, that was what the www was conceived for: publishing science.

> open the accompanying PDF, […] of much higher quality.

The only thing that is higher quality in the PDF is the justified alignment and pagination. The figures have the same (poor) resolution.

The bottom line is: * it is perfectly possible to publish technical papers in a format that is accessible on a phone. * non-paginated, free flowing output also works better on larger screens (e.g) I can resize the window and have my note-taking app open next to the paper. * PDFs are still great for annotating and printing, if required.

auggierose · on Dec 20, 2022

You are right, the figures have the same poor resolution. I didn't notice that because they just look higher quality in the PDF because they are seen in context.

As for your bottom line:

* Of course you can make HTML pages with KaTeX/MathJax that look great. I have done it myself, and this very HN post also points to such a web page.

* While a webpage is often fine, for anything I will spend a significant amount of time reading, I definitely want a separate entity, a book or a paper, that I can have in my library. There is no such format currently that has satisfactory quality, except for PDF.

* Non-paginated, free flowing output is the only real option for a phone, but does not work well on a large screen. Your IEEE example looks bad on a large screen, compared to a PDF. I am looking at it on a 27 inch screen, and on an iPad 12.9 inch, and it shows me one column, which separates the images from each other and from their context, and makes it harder for me to scan back and forth between related content. Pages provide a context, even though an admittedly rather arbitrary one, and non-paginated text is missing that context. You could do two columns, but how would that work for non-paginated text that knows no boundaries?

* For me resizing is not an issue, I take notes on paper or on my computer, but I can see how that might be important for somebody who is willing to trade reading quality. For me, PDFs are clearly superior for digital books and papers to anything else currently out there, at least if they have been created with care. PDFs that have been created by printing an epub or webpage, on the other hand, are useless for me and land immediately in the garbage bin whenever I encounter them.

jmhammond · on Dec 19, 2022

> and as for screen readers...well that's a valid point but screen readers don't work well for material with lots of equations anyway.

This is something that we’d like to change. There are many visually impaired students who need to learn mathematics the same as you and I.

My “eyes were opened” when I was working with a blind student in my class. The textbook I’d written in pretext (transpiled to pdf and HTML) could be read on his BrailleNote but some of the equations were wonky, so I rewrote them to work for everyone.

It would be better if we developed tools to make them work for everyone straight away, instead of relying on authors. That’s one of my career goals.

felixfbecker · on Dec 19, 2022

I applaud you for this.

I think MathML (which has gotten much better in browsers, thanks to Igalia[1]) is a much better bet we have to make this possible than LaTex compiling to PDF.

[1] https://mathml.igalia.com/

godelski · on Dec 19, 2022

You can't have animations with PDFs. Anyone using beamer is familiar with this frustration. But animations are incredibly helpful in explaining many works. 3Blue1Brown became so popular in major part due to his use of (fantastic) animations that more easily explain the material than any static image could.

jimhefferon · on Dec 20, 2022

The animate package from CTAN draws animations in PDFs. It has limitations (most pdf readers won't show them because they rely on some JavaScript), but it does work.

godelski · on Dec 20, 2022

Maybe that's my issue. But I've tried about a dozen pdf readers and failed.

jimhefferon · on Dec 22, 2022

Adobe works. I've heard of at least one other but I couldn't get it to go. People say Adobe Reader is bloated, but if you need it then it is not bloat.

jech · on Dec 19, 2022

> I find PDF math papers easy to browse

So do I. Still, I wish LaTeX produced easily reflowable PDFs, especially when a document is formatted in two columns.

enriquto · on Dec 19, 2022

But it does, doesn't it? You add the "twocolumn" option and recompile. Unless your LaTeX is too fancy this will tipically give a very good result (at worst, some figures with hardcoded sizing will be awkardly placed).

jech · on Dec 19, 2022

I cannot do that when I'm reading a paper written by somebody else, and I only have the produced PDF.

abdullahkhalids · on Dec 19, 2022

That's why arxiv is a god send, because the source is available there, if the author has uploaded it there.

Science needs a culture of open sharing, the same way physics and math has it.

hgsgm · on Dec 19, 2022

twocolumn is terrible for mobile.

what's needed are narrow margins.

mistrial9 · on Dec 19, 2022

what you are asking for is called a "round-trip" by some printers.. This was requested the week after PDF was invented! It does work, unless it does not.. the company that invented this technology is apparently infested by MBAs and charismatic nobodies, since they announced they are exiting the type "business" ? Our house of cards is showing.

baby · on Dec 19, 2022

Check zotero. It has that feature

gnull · on Dec 19, 2022

If your equations are in MathML, the browsers should be able to screen read them at some point.

> Even slight differences in layout can make a complex set of equations difficult to parse.

Such set of equations should normally be represented by a single block, I can't imagine a reason why layout should change inside that block.

The layout of pdf is unnecessarily rigid. When I'm reading it on my screen, there's no reason the text should be split into A4 pages with very specific margin values. Latex also often moves your figures a few pages ahead because they didn't fit on the specific page. There's absolutely no reason for that when you have access to the big continuous canvas of an html page. This works for equations too; if you have a long equation block that happens to be right between two pages, you either have to let one page have a gap, or reorder/rewrite your paragraphs to make the equations fit. None of this has a good excuse when it's read on a screen.

I don't think we need a website, but a js-free webpage with hyperlinks would be a lot better than pdf. Pdfs I find imperfect but ok.

periheli0n · on Dec 19, 2022

> I don't think we need a website, but a js-free webpage with hyperlinks

Wasn't this precisely the use case for HTML and the WWW as originally conceived by Berners-Lee and his fellow internet pioneers?

TacticalCoder · on Dec 19, 2022

> LaTeX also has a much superior word-break/hyphening algorithm to the HTML engines of browsers.

And because the PDF has a fixed layout it's also much easier to prevent "rivers" in paragraphs. Which hence makes it a no-brainer to use justification. To me many print publication using justified text (including LaTeX documents) are a thing of beauty and I do hate how "left align" breaks the flow of reading. I'm taking slightly different spacing between words due to justification every day over horizontal lines of different length, which I find fugly and confusing beyond repair.

More hyphenation controls are coming to CSS and, one can dream, it may be possible one day to programatically detect rivers?

Meanwhile rivers be damned, I override anyway many sites and add "text-align: justify". The nice thing is: because "text-align: left" is the default many sites and minifiers do not bother with text-align at all, so adding "text-align: justify" works for many, many, many sites.

And I only half-buy anyway the justifications (ah!) for left alignment on the Web.

It's basically saying: "We know better than people who've been working in print since decades (or more), left align is easier to read". I don't buy it. Left align breaks my reading flow. And I cannot be the only one.

To me left align is trading potentially ugly looking paragraphs (due to rivers) for certainly ugly looking paragraphs (due to left justification: just look at the right of each paragraph... Such lack of clarity, such chaos cannot be unseen. It's pure fail).

P.S: I've actually typeset books both in LaTeX and QuarkXPress and their were justified, not left-aligned.

extra88 · on Dec 19, 2022

> I override anyway many sites and add "text-align: justify".

I think you're an outlier in your strong preference for justified text but this serves as an example in favor of using HTML to present content. Well made web content is much more malleable by users to make it meet their needs and preferences.

dan-robertson · on Dec 19, 2022

I think you give latex more credit than it deserves. It gives little straightforward control over layout and the only reason documents are manageable is that pages are fixed size and layout changes are mostly local.

It’s paragraph breaking was state of the art when it was new but other systems break paragraphs now and potentially better. I also think ragged margins aren’t really a problem.

I think if layout mattered as much as you imply, scientists would have to use a tool that offers more control like indesign.

None of this is to say that getting good layout in HTML is easy, of course.

periheli0n · on Dec 19, 2022

> I think if layout mattered as much as you imply, scientists would have to use a too that offers more control like indesign.

Yes, precisely that. As a scientist I don't even want to have to deal with layout. That's what publishers are paid extremely well for. When I self-publish content I want the process to be as simple as possible. If this means ragged margins, browser-default styles for headings etc., default colors and fonts — so be it.

(but to be fair, optimising the layout is an excellent way to procrastinate on doing hard research)

ta123456789 · on Dec 19, 2022

PDF papers are also much easier to save/archive and use offline. And great for printing

hgsgm · on Dec 19, 2022

We need either an app that can compile LaTeX source (+all included libs, which sounds like a lot, but it's equivalent to a JS-heavy web page) on all the clients (preferably as a browser plugin or integrated feature!)

or authors should distribute their PDFs as bundles that include formatted versions for all of paper, large screen, and small screen.

patrick451 · on Dec 19, 2022

The standard single column of content layout of nearly every webpage is a bad fit for scientific content because the information density is way too low. A pdf, where I can display multiple pages, each with two columns, side-by-side is much better. This is really handy when you need to do something like refer to a figure/equation/table on the last page. I have yet to see any website solve this lack of density problem in any meaningful way. Of course, paper is still better, but I'll take a pdf over a web site any day.

A pdf is also much easier to archive. The job of sci-hub would but a lot more difficult if every paper came with separate html, css, javascript, and images.

CJefferson · on Dec 19, 2022

Screen readers work perfectly fine with mathml. At worst one can just get the screen reader to read the latex for maths and browse the rest in nice HTML.

On the other hand, PDFs generated from Latex are completely useless for screen readers.

mavhc · on Dec 19, 2022

Get rid of the 2 column thing and most people would be happy.

What guarantees of layout do you require?

In related news, MathML is back in Chrome v109

michaelt · on Dec 19, 2022

> What guarantees of layout do you require?

Some people write documents that can only be clearly presented on a 15" or larger display. Maybe a comparison table with a bunch of columns, maybe a detailed chart, maybe a PCB schematic, whatever.

These people, being considerate of their readers, want to ensure if someone with a 13" screen comes along, they'll get scrollbars or small text, rather than a badly reflowed table where the word 'Yes' gets split over 3 lines.

Other people want to read those documents on 5" phone displays.

dwheeler · on Dec 19, 2022

The problem here is specific to LaTeX. I wrote my PhD dissertation using OpenOffice.org (now use LibreOffice), and generating HTML was easy (I posted the HTML).

But the author is right, LaTeX is widely used, translating it to HTML is hard, and there are no incentives to make or improve tools. Even if you don't want HTML, it'd be good for the LaTeX tools to automaticalky generate reflowable PDF for accessibility. There should be a process for funding infrastructure to accelerate science, and this would be a good example.

There's an interesting trick you could try. PDF supports embedding other contents. LibreOffice, for example, can slip its original edited file into a generated PDF, producing perfectly editable PDF. Maybe a variant of this idea could be used, e.g., store the LaTeX source of HTML in the PDF, so people can "get the PDF" yet still have options. But that's just a side idea, the real issue is funding infrastructure for science.

DominikPeters · on Dec 19, 2022

I would recommend using the lwarp package for turning large latex documents into HTML. Pretty much all other converters attempt to parse the tex files, which is an almost hopeless task. Lwarp has a different strategy: it redefines all macros to produce HTML (e.g. \textbf{example} writes "<strong>example</strong>" into the output pdf) within latex, thereby producing a PDF containing HTML code. It then uses a pdf2txt extractor to get the finished HTML file. Thus, it uses latex to parse the latex.

Lwarp worked for me to produce an HTML version of the TikZ documentation (https://tikz.dev), and that's probably one of the more complicated tex documents that exists. (Though granted, this was still a major effort.)

gdprrrr · on Dec 19, 2022

Yeah, it's well known that only LaTeX can parse LaTeX because you can redefine all syntax (catcodes) in the middle of the document.

V1ndaar · on Dec 19, 2022

Currently finishing up my own PhD thesis. My approach to the same problem is quite different. I write my thesis in Org mode. Exporting to HTML is pretty painless. Been doing the same for years for my notes. PDF export via LaTeX & HTML export. LaTeX and PDFs fail pretty hard when including source code (some literate programming in Org). That was my initial motivation behind also producing HTML.

The final thesis that I will hand in is of course a regular PDF (well, a print based on that). But the HTML version can contain lots more stuff that doesn't fit (and belong) into the actual paper thesis, e.g. code snippets to generate plots etc. (optional export of Org subsections). By publishing the git repository of the thesis, linking all code and data + a bit of work -> full reproducible thesis.

hoosieree · on Dec 19, 2022

Heh, I also write papers in org and am currently writing my dissertation in org.

Source code is always a pain to export for PDF, especially when switching from 1 to 2 column layout depending on the publication.

My blog is written in org too, but I post-process to make it fit in with the rest of my static site. At some point maybe I'll get enough free time to swap out my makefile setup for org-publish, but if it ain't broke...

To anyone who'll listen I advocate for org-mode as a better alternative to Jupyter notebooks, Markdown, and LaTeX. It's in some ways the antithesis to "do one thing well". If you try to do N things well while adhering to the unix philosophy you end up learning N different tools. But org-mode is one tool that does N things well, and some of the things you learn doing thing N transfer to thing N+1, so you get economies of scale.

taink · on Dec 19, 2022

How do you plot graphs with org? I've been trying to use it for that purpose but I can't wrap my head around how to do it without some tikz incantation I don't really understand. I've seen gnuplot mentioned here and there but the setup seems pretty involved.

I'm looking for a way to plot simple numeric data signals in time series, which are pretty trivial in jupyter notebooks.

V1ndaar · on Dec 19, 2022

Well, personally as I write almost all my code in Nim and am the developer of ggplotnim [0], I simply write a source code snippet with some short Nim code, generate a plot and dump the filename into the Org file.

If I had more time and wanted something more convenient and magical, I would probably write a elisp function that takes X Y (Z) columns and generates a plot from those using a simple Nim program in the back that receives the data, generates the plot and returns it somehow. Haven't given this much thought though.

[0]: https://github.com/Vindaar/ggplotnim

hoosieree · on Dec 20, 2022

I mostly use matplotlib+seaborn in a python code block, and tell matplotlib to write pdf output. Then I'll include the pdf with a link. Here's a sketch:

    # first run this code block:

    #+begin_src python
    # some code to generate a plot at images/plot.pdf
    #+end_src

    # then put the image where you want it:

    #+CAPTION: Some plot or other
    #+NAME: fig:asdfhjkl
    #+ATTR_LATEX: :width 0.7\textwidth
    [[file:./images/plot.pdf]]

You can put some incantation on the top of your .org file to make it run all the code blocks before exporting, but I usually just run them manually if I need to make a change to a figure.

Here's a full example from one of my papers. You can see I made quite a few revisions with all the commented-out lines. Note the :results none :exports none arguments to the org-babel code block, which makes the code itself invisible in the resulting paper.

    #+begin_src python :results none :exports none
    import matplotlib.pyplot as plt
    import seaborn as sns

    def get_loss(filename):
    loss = []
    with open(filename) as f:
        for line in f:
        loss.append(float(line.split(' = ')[-1]))
    return loss

    data = {}
    # data['NLP (P=0.1)'] = get_loss( '../sample-programs/loss-epochs-nlp-0.1-30.txt')
    # data['NLP (P=0.2)'] = get_loss( '../sample-programs/loss-epochs-nlp-0.2-30.txt')
    # data['NLP (P=0.4)'] = get_loss( '../sample-programs/loss-epochs-nlp-0.4-30.txt')
    # data['NLP (P=0.8)'] = get_loss( '../sample-programs/loss-epochs-nlp-0.8-30.txt')
    data['NLP (P=0.1)'] = get_loss( '../sample-programs/loss-epochs-both-0.1-30.txt')
    data['NLP (P=0.2)'] = get_loss( '../sample-programs/loss-epochs-both-0.2-30.txt')
    data['NLP (P=0.4)'] = get_loss( '../sample-programs/loss-epochs-both-0.4-30.txt')
    # data['NLP (P=0.8)'] = get_loss( '../sample-programs/loss-epochs-both-0.8-30.txt')
    data['Plain'] = get_loss( '../sample-programs/loss-epochs-30.txt')
    sns.lineplot(data=data, palette='deep')
    # TODO add plots for combined Plain+NLP at other probabilities
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()

    filename = 'images/loss-plot.pdf'
    plt.savefig(filename)
    return filename
    #+end_src

    #+CAPTION: Training loss for data augmentation
    #+LABEL: fig:loss
    #+ATTR_LATEX: :width 0.5\textwidth
    #+RESULTS: fig:loss
    [[file:./images/loss-plot.pdf]]

janeway · on Dec 19, 2022

IMO, the pdf version should contain the exact information that any other format might contain. Which version is the thesis and is everything contained. Anything else is just for your own interest.

But your approach sounds great. Good luck!

aitchnyu · on Dec 19, 2022

In The Art of Unix Programming (2003) the authors assert simple text formats can be grepped, awked and are easy to compose in a text editor. Hand typing xml is cruel. But now text editors have perfect syntax highligting and squiggles to pinpoint errors, autocomplete, automatic formatting and toolchains to eliminate errors and extract info.

Isnt html or a high level abtraction (like Spectacle) the best tool for the job today?

https://formidable.com/open-source/spectacle/docs/#one-html-...

pjmlp · on Dec 19, 2022

Hence why one should use stuff like https://www.oxygenxml.com/

Similar products have been in business since the early 2000's.

bravura · on Dec 19, 2022

I think the simplest solution is uploading your thesis to arxiv.org, then using arxiv-vanity (based upon LaTeXML) to render your arxiv link as a responsive web page.

jimhefferon · on Dec 20, 2022

LaTeXML does great things, but it also has limitations. A doc that is in generic LaTeX is going to process much better than one with significant customizations. But it is a good tool for sure.

BrandoElFollito · on Dec 19, 2022

It was very cool for OP to do a writeup of the effort it took to convert a thesis.

I wanted to do the same with mine but I lost the sources (it was 20+ years ago and did not survive some upgrade/technology change). I was mostly concerned with the .eps files that are hardly portable to .png or similar.

This made me think a bit about preservation of ~recent data (1990-2010). a lot falls in the category of "not natively on the web yet" and "stored on stuff that does not work anymore".

jltsiren · on Dec 19, 2022

PDF documents have two main benefits. The entire document is a single file, and we know that old documents work.

I regularly read papers from 10 or 20 years ago and sometimes even ones from 30 years ago. The old PDF documents work without any major issues. I have much less confidence that future browsers will continue displaying old HTML documents laid out using then-obsolete techniques in a sane way on future hardware.

felixfbecker · on Dec 19, 2022

What HTML document from 10 or 20 years ago does not still render in a modern browser? Modern browsers are extremely good to maintaining backwards compatibility at all costs (aka "don't break the web"), to the degree many on HN often argue it hinders evolving web technologies.

jltsiren · on Dec 19, 2022

The ones that relied on Flash, for example. And the layouts designed for 800x600 displays may not work particularly well on modern computers.

easygenes · on Dec 20, 2022

I have almost the opposite opinion to the author on the use of PDFs. I quite enjoy the experience of reading PDFs as opposed to just about any other e-document format, at least for the types of documents where PDFs are typically an option (e.g. books, magazines, and scholarly papers).

In particular, the points they specifically call out about the difficulties with PDFs are either contrary to my experience or irrelevant to me:

They cite: "... difficult to browse, impossible to read on a phone, uncomfortable to read on a tablet, hostile to screen readers, impractical to search engines ..."

  Browsability: I use Calibre [1] as an e-document library and PDFs are a first class citizen. It is a lovely piece of software for which I have few complaints and much praise.
  Phone and tablet readability: I've used a mix of GoodReader and more recently Documents on my iPhones and iPads and have never had troubles with reading PDFs on them.
  Screen reader access: I'm not much of a screen reader user, so no comment.
  Impractical to search: While you may occasionally need to go to an extra step to search PDF collections, it's hardly impractical and there are many performant options. Calibre includes a full-text indexing feature for your whole document library, to name one.

Am I in the minority here for actually preferring PDFs?

  1: https://calibre-ebook.com/

davidpolberger · on Dec 19, 2022

Back in 2010, I used plasTeX (http://plastex.github.io/plastex/) to convert my thesis to HTML (http://www.polberger.se/components/). plasTeX is "a Python package to convert LaTeX markup to DOM." If memory serves, plasTeX worked rather well, and still seems to be maintained today.

patrick451 · on Dec 19, 2022

> PrinPDFs are difficult to browse, impossible to read on a phone, uncomfortable to read on a tablet, hostile to screen readers, impractical to search engines, and the list goes on. It's just a terrible format, unless you're trying to print things on paper. ting things is a perfectly reasonable thing to do, but that's really not the main use case we should be optimizing for.

The printed page is the optimal format for consuming a thesis or paper, so it absolutely makes sense to optimize for that, rather than dilettante browsing on a phone.

nicodjimenez · on Dec 19, 2022

Mathpix (https://mathpix.com) provides a drag and drop tool that converts PDF -> Markdown -> (HTML, LaTeX, PDF, DOCX). It's very handy and a lot of researchers and publishers use this tool, as well as people in the accessibility space (we make math content accessible for visually impaired students).

Disclaimer: I'm the founder of Mathpix.

pclmulqdq · on Dec 19, 2022

I originally wanted to blog using LaTeX, and convert that to HTML. I ran into all of these options, and also started writing my own LaTeX->HTML flow, but it got too complicated to be a hobby project. It turns out that there are parts of LaTeX that are really hard to generically convert to web constructs.

I settled for markdown with KaTeX for math, although I would like to return to the LaTeX->HTML project at some point soon.

kvakkefly · on Dec 19, 2022

A great alternative to latex is doconce. It can output html, latex and various other formats.

http://hplgit.github.io/doconce/doc/web/index.html

the-printer · on Dec 19, 2022

If he expects people (me) to read or read about his online PhD thesis then I think he should’ve chosen a font with a larger x-height. Reading the type feels like peeking through a dense bush during a hail storm.

bspammer · on Dec 19, 2022

The fact that it's published as HTML means that you can choose your own font if you so desire. A PDF wouldn't let you do that.

131hn · on Dec 19, 2022

The thesis french transcript of the tl;dr is written in verse/alexendrin.

Because, why not

082349872349872 · on Dec 19, 2022

excellent !

breck · on Dec 19, 2022

If I was writing a PhD thesis today, I'd use Scroll: https://scroll.pub/

CJefferson · on Dec 19, 2022

Please tell us when you are self-promoting your own stuff. I would currently recommend against any PhD student using it for their PhD -- it can't even generate a PDF with required formatting, which is required by most Universities.

breck · on Dec 20, 2022

> Please tell us when you are self-promoting your own stuff.

Sorry! I should have said our. You are right to call me out on that. Sometimes I don't reread my comments before posting.

> I would currently recommend against any PhD student using it for their PhD -- it can't even generate a PDF with required formatting, which is required by most Universities.

If it doesn't do something someone needs, file an issue or email me and we can probably add what is needed.

amelius · on Dec 19, 2022

The FAQ does not explain what scroll is or does near the top of the document. It seems to produce HTML only.

breck · on Dec 19, 2022

Better printable PDF support is coming.