Hacker News
Sphinxtr: Creating a Portable PhD Thesis (jterrace.github.com)
93 points by jterrace on Oct 3, 2012 | 73 comments



Unfortunately this (good) product will take at least a generation to make it to the academic world.

I put my thesis together in HTML using Pandoc, customized to look a bit like a Tufte publication with margin notes. It had animations, anchor links when cross-referencing paragraphs, and in my eyes, made sense for a document like a thesis.

My committee members, on the other hand, were in consensus that a thesis should be a PDF with page numbers, not some multimedia document with hyperlinks. When I made a PDF[0] and submitted it to the university, someone even checked it to make sure the page numbering switched from Roman to Arabic at the correct point as stipulated by the submission guidelines (it didn't, hence my knowledge of their thoroughness).

Considering that universities today often don't even want a hard copy of a thesis, this is a world that is tied very strongly to the paper paradigm.

[0] Fortunately Pandoc did this quite nicely, though I did have to make a few changes to deal with citations. Incidentally, there is no good, cross-platform way to put an animation in a PDF.
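
For anyone curious what that Roman-to-Arabic check amounts to, here is a minimal sketch. It assumes the front matter is numbered i, ii, iii, ... and that the body restarts at 1. The names `to_roman`, `page_label` and `front_matter_pages` are made up for illustration; they are not part of Pandoc or of any university's guidelines.

```python
def to_roman(number):
    """Convert a positive integer to a lowercase Roman numeral."""
    if number <= 0:
        raise ValueError("Roman numerals need a positive integer")
    symbols = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in symbols:
        # Take as many copies of this symbol as fit, then continue with the rest.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_label(physical_page, front_matter_pages):
    """Return the printed label for a 1-based physical page index.

    front_matter_pages is a hypothetical parameter: the number of leading
    pages numbered in Roman before Arabic numbering restarts at 1.
    """
    if physical_page <= front_matter_pages:
        return to_roman(physical_page)
    return str(physical_page - front_matter_pages)
```

With six front-matter pages, the seventh physical page should carry the label "1"; a PDF that keeps counting in Roman at that point is the kind of slip the checker caught.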


There's a good reason to require proper page numbers: citations. In general you can't refer to a particular passage in a huge HTML document. Anchor links are a technical solution, but that's not how normal citations work.

I think that more generally when a document becomes sufficiently complex it is better to use a page-oriented, typesetting approach. You'll want to fiddle with where certain page breaks and figures end up, and if you do that you might as well stick with one canonical format.

You can view it as the paper paradigm. Alternatively, you can also see it as sticking to a uniform, canonical format. Getting fluid layouts right is a very difficult problem, and manually getting the layout right for a fixed format is simply a much better solution for this purpose.


The HTML format does generate multiple pages for different sections and chapters, completely configurable. The format above is the singlehtml format where everything is on one page. I'd much rather have a link to a specific anchor tag on a page that goes directly to where I want than a citation to a page number on a 200 page document.


"As shown on pages 105-107" looks much better than "As shown in sections 4.5.4, 5.1 and 5.1.2". Really. You don't have the first option with HTML and/or Sphinx. Oh, and can I have some bold italic text please?


Actually, I despise page number references because popular resources often get republished (because no one has the book from the 70s anymore), changing the page numbers, whereas a section title / number is much likelier to remain correct, and a URL to a paragraph can always be updated.


You can link to a section with a URL. No "section xyz" or "page n" needed.


Yes, you can place a link. But the link, unlike the page range, points only to the place, not to the scope.


Why can't you link to divs on a page? When I'm given a page range, I don't typically first go to some arbitrary page in the middle of a range.


But the rules of citation prescribe that you give a section or page number. Giving a URL is a neat technical solution, but it's not a proper citation.


You can cite a URL.


You can if you must but you know what I mean. Typically you cite a peer reviewed work and reference it in the bibliography with author, year, publisher, etc.


reference it in the bibliography with author, year, publisher

Which is more than a little quaint, in 2012.


Not really. These are useful bits of bibliographic information, which can give you an idea of a source without having to consult it. And the kind of publisher gives an indication of reliability ranging from peer reviewed to self-published. Including a URL as well, when applicable, is of course very useful (given that it's a stable one).


Sure, but that's what the PDF format exists for.


There's a good reason to require proper page numbers: citations.

Right, and I have even seen page numbers in citations of theses. It wasn't common in my field at least, but it happens. For a journal citation, generally page numbers tell you where in the volume that article is, not the content that is being referred to. I think there is flexibility in citations—one can refer to a figure, table, section or even paragraph number with at least equal clarity as a page number. There is convention here, I agree, and my point in my original post is that the convention is followed dogmatically rather than pragmatically in academia.

Getting fluid layouts right is a very difficult problem, and manually getting the layout right for a fixed format is simply a much better solution for this purpose.

I have given this topic a lot of thought. Partly from working on an HTML document for a thesis, partly because I've worked in the area of cognitive ergonomics. At the moment our tools aren't great for this, but that's really our fault as software designers. Fluid layouts should be better than fixed ones.

Concerns about fluid layouts are new, a byproduct of current technology. Printing technology didn't afford us these problems. You can remember when it was commonplace to see websites that contained text as bitmaps, to force readers to consume content as laid out by the designer. There were sites that rigidly spec'd font sizes and browsers that allowed that. Today, the best web designers accept the fact that content will be consumed on different screens (sizes and pixel densities) by users with different preferences. They design layouts that can gracefully handle a large font size stipulated by the reader, rather than design the way that a designer would for print.

So what is the underlying issue with fluid layouts?

Well, we have text that relates to graphics, and as Tufte points out in some great books on the topic, it's particularly useful (and historically very common) to have the relevant text presented in conjunction with graphics. That's an obvious point, but with fluid layouts this can unpleasantly 'break.'[0] This is a solvable issue (even using current web technologies), in a way that can evolve beyond what paper can do for us.

I'd like to see a markup that allows me to designate which graphics are relevant to a block of text, and have those figures shown together. For example, if the display allows it, once you start scrolling past a relevant figure it could slide to the margin of the text and stay pinned alongside as you read relevant text for reference. Perhaps the reader could select other figures to pin, or unpin figures as well.

Think of the number of times you've flipped between pages of a PDF to read text and look at the figure described. Sometimes I end up opening the same document twice and putting them on my screen side-by-side. That's the kind of hack us human factors types love putting in slide shows about how we should be designing for users.

With high-resolution displays getting to mass-market price points, we are only missing the right tools to take technical documents to the next level. As I said, this could be done in HTML, CSS and JavaScript today, but it's not author-friendly. Even if the tools did exist to make this easy, it would be frowned upon in the academic world for being different. But some fields are more progressive than others, and eventually we'll move past the page as a paradigm. I've got to applaud the OP for taking steps in that direction.

My futurist speculation: the move away from pages in the scientific community will happen at the same time that it will move away from traditional journals as an idea distribution channel. I don't think that's in the immediate future, though.

[0] In fact, this often breaks with fixed layouts too, except it's only really bad during document creation. Think figure placements in LaTeX. Fortunately the document creator has to do this work once, and then all of the readers see the exact same thing. Better tools for fluid documents would make things friendlier for the author as well as the readers.


Yeah, my university formatting requirements are crazy also. This allowed me to have one source that outputs (nicely) to a bunch of different formats, including the university PDF requirement.


As someone who typeset his thesis recently enough to remember all the tiny tweaks it took to get a perfect printout from LaTeX, I would say that this project has a lot of issues. Some of them are more or less easy to fix, but some of them will be hard. Consider wrapping some specific paragraph in a \fussy / \sloppy pair, or hammering a specific float to this page, or ... (this list is really long). Not to mention bibliography tools. I don't claim that these issues are unsolvable, but the deeper you go into solving them the more you will discover that you are basically re-implementing LaTeX.

While the idea of a multi-format thesis, or at least a dual-format one (PDF and HTML), is very compelling, I doubt there could be a good enough solution for any thesis that contains more than just text and an insignificant number of figures, tables and formulae.


You made me twitch from cumulative LaTeX PTSD.

"Just write and pay attention to content, not formatting," has led to staring at the clock, wondering how 5am came around so quickly more nights than I would care to admit.

God forbid you use Sweave.

I'm looking back 22 years and getting the twitch again. All that time I saved not playing with fonts, kerning, margins, and line height has been burned aligning figures or tables or yes, trying to get that frickin' float on a page at least vaguely proximal to its reference.


Agreed, but for my case, I didn't really care too much about the PDF output. The printed version of my thesis is going to sit in my university library untouched forever. I'm much more concerned with having a semantically correct, searchable, beautiful HTML output.


Why not go the other way round in this case? Typeset your thesis in semantically correct, searchable, beautiful HTML5, and then just add some specific CSS to get the hard print copy?


Unfortunately, browsers are not up to the job so far. For example, page-break-inside:avoid is only supported by Opera[0]. Maybe Prince [1] could do it, though it is not free.

[0] http://www.w3schools.com/cssref/pr_print_pagebi.asp [1] http://www.princexml.com/


One big reason is math, which TeX is very good at.


I don't dispute that TeX is good for typesetting a thesis. I doubt that Sphinx, even with some plug-ins, is good for that. Also don't forget that reStructuredText has its own problems (not found in HTML and LaTeX). Can I have some bold italic or italic monospaced phrase example in reStructuredText please? No, really?


Mathjax is very good at it too.


This is what thesis writing procrastination looks like.


To be fair, I only released it after my thesis was done :)


This is an impressive undertaking, but good lord, what an unfortunate name.


To be totally fair, a sphincter (I'm assuming this is what you're referring to) refers to a circular muscle structure in the general case. Not just the butthole.

There are sphincters in your eyes.

There is a sphincter in your esophagus.

Your body is full of sphincters.

The anal sphincter is just one of many sphincters.

And now you really do know.


That's irrelevant, because the association is ever-present.

As an example, if people lived on Venus, they would be called "Venerials" by the proper genitive form. Alas, doctors got to it first (Venus, Roman goddess of love...) so it was changed to "Venutians". (Source: Neil deGrasse Tyson's podcast, StarTalk Radio).


I know, I was being a little pedantic and tongue-in-cheek at the same time, hah.

Still true though.


...would be called "Venerials" by the proper genitive form...

I'm going to keep telling myself that your use of 'genitive' right next to 'Venerials' was entirely conscious and deliberate.


Heh, not at all.


"Your body is full of sphincters."

I'm sorry, but this cracked me up. It brought out my inner Beavis and Butthead. I apologize. But, I mean, come on. Most people do associate the word sphincter with the butthole, and surely the creator should have known that. (Unless the, umm, cheeky association is intentional?)


100% intentional


Haha, don't apologize. I was being a little goofy for the sake of it.


Whoa, you mean we've got buttholes, like, all over? Cool. Huh huh huh.


it had to be done



Sorry, but the project name is totally unfortunate.


intentional :)


Love the name, laughed my ass off ;)

Disclaimer: I'm immature...


Oh.. OOOHH. I get it :)))))


How does this compare to existing text formats that generate multiple outputs, in particular pandoc (which supports an extended version of Markdown)?


The section on typography is actually about formatting. Also, an epigraph is not the same as a quotation:

     [...]
     2. (Literature) A citation from some author, or a sentence
        framed for the purpose, placed at the beginning of a work
        or of its separate divisions; a motto. Epigraphic


Pull requests accepted :)


This is awesome! I can totally see myself using this. We also need better makefiles for LaTeX. Changing one line, hitting make, and then watching crap fly across your screen is not a pleasant experience ...


It's using http://code.google.com/p/latex-makefile/ which is a really awesome makefile. It has nice colored output and throws away all the garbage output.


At least you can build your own scripts around this stuff. Building LaTeX for me goes like this:

I use vim, change a line, and hit ":w". My git-onNotify [0] script detects a change and issues "make show". The Makefile uses rubber or latexmk to build a pdf, then issues "gnome-open $PDF", which opens the new version in my pdf viewer. If my screen is tiled, the preview on the side just updates.

Essentially, I just save my tex file and wait for the change.

[0] https://github.com/beza1e1/dot/blob/master/bin/git-onNotify
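
The same save-and-rebuild loop can be sketched in plain Python. This is not how git-onNotify works internally (I haven't modelled it); it swaps in simple mtime polling, and `snapshot`, `changed` and `watch` are hypothetical names. The only thing taken from the setup above is the "make show" target.

```python
import os
import subprocess
import time
from pathlib import Path


def snapshot(paths):
    """Map each existing path to its last-modified time in nanoseconds."""
    return {str(p): os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed(before, after):
    """Return the paths that are new or whose mtime differs between snapshots."""
    return sorted(p for p, mtime in after.items() if before.get(p) != mtime)


def watch(directory, command=("make", "show"), interval=1.0):
    """Poll *.tex files in `directory` and run `command` whenever one changes.

    Runs until interrupted. Polling is an assumption made to keep the sketch
    portable; an inotify-based watcher would react faster.
    """
    root = Path(directory)
    before = snapshot(root.glob("*.tex"))
    while True:
        time.sleep(interval)
        after = snapshot(root.glob("*.tex"))
        if changed(before, after):
            # check=False: a failed LaTeX build should not kill the watcher.
            subprocess.run(command, cwd=root, check=False)
        before = after
```

Calling watch(".") in the thesis directory would then rebuild on every ":w", assuming the Makefile there has a "show" target.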


Looks nice. Currently, I am quite happy writing my thesis in LyX [0], which outputs to various formats, including HTML.

I also have a script which periodically converts the source files from LyX to both HTML and PDF, then dumps them in the webroot of the Apache server running on my Uni computer. This folder has a .htaccess file which restricts access to my supervisors and myself using the Uni's LDAP server.

It works a treat for me.

[0] - http://www.lyx.org/


I have written my Bachelor thesis in org-mode and was incredibly happy with it. There is certainly a difference in requirements for a PhD Thesis, but I cannot think of any missing feature or hard to overcome problem right away.


What did you use to convert to pdf / html?


org-mode comes with HTML/pdf/DocBook export and you can extend the exporter with your own formats.


Did you find it easy to extend with your own formats? Or to modify an existing format, say html, to produce slightly different output?

I usually use perl/sed/awk + markdown for generating html from my own made-up mini-formats. I'd love to keep the sources in org-mode instead, but I wonder whether elisp is the right language for the type of text munging I want to do.


elisp isn't really good at text munging (that sounds counterintuitive at first, given it is the extension language of an editor) and works much better if your data is structured as s-expressions. AFAIK there used to be a project to build a modern parser for org-mode files, but I couldn't tell you how much this has progressed or how usable it is. So, if your export process contains heavy text-munging I'd avoid it.

Modifying the HTML output (adding classes etc.) was fairly OK, I haven't tried anything crazy though.


Looks cool. But I guess doing things like using IEEE classes is out of the question? And does it really miss a way of typesetting inline math?


It has inline math. I just forgot to add an example for it. If by IEEE style, you mean bibtex, then yes, you can easily swap the bibtex format.


I mean writing things meant for publishing in IEEE conferences, using the IEEEtran class.

Another concern is: is it practical to include figures drawn with TikZ? I find it the easiest way to lay out many things, but it effectively means LaTeX lock-in.



This is a nice project, but different goals. I don't want my HTML output to look like a PDF. I want it to look like a nicely formatted web document.


I don't get it, what's the problem with writing good handcrafted (no WYSIWYG editors) LaTeX?!


It seems out of place in a web-dominated world. Compare Bret Victor's "Learnable Programming" [0] or "Scientific Communication As Sequential Art" [1], which include animated or interactive things.

[0] http://worrydream.com/LearnableProgramming/ [1] http://worrydream.com/ScientificCommunicationAsSequentialArt...


Here's an Eclipse-based editor specifically built for writing/managing your thesis. http://www.chapterlab.com/


I read that as Sphincter...


If the tex2html tool is poor, why not improve that tool? Re-inventing TeX seems like quite a waste of time.

Every time someone invents a new markup format for absolutely no reason, I die a little bit inside.


TeX and restructuredtext have very different goals. The first lets you specify every little typographical detail for a fixed page-oriented output. The latter focuses on a minimal set of semantic markup which can be used for different presentation media.


What you said is just the difference between TeX and LaTeX. If you want just semantic meaning, use LaTeX without lower level TeX commands, then use an automated tool to convert it to html or other formats.


No it's not. LaTeX is still page-oriented. You can convert it to HTML and other formats but not as well as multi-format markup specifically designed for that purpose.


There is nothing page oriented about LaTeX. All commands assume only content, the exact formatting is left to the combination of class/packages used. In fact, this is one of the weaknesses of LaTeX, because people like so much to be able to position pictures, tweak with title formatting, etc.


The whole point of LaTeX and TeX is not to worry about the exact format, but to worry about the semantics of the content instead, and let the layout engine decide for you. Like all automation, this generally works very well, except when it doesn't. When it doesn't, the solution is to give it a little manual shove in the right direction, not to rewrite everything in the markup language flavor of the week.

Incidentally, semantics, not presentation, was originally the point behind HTML, but it got warped and twisted over time into a kinda-sorta presentation-oriented language.

restructuredtext, on the other hand, was originally developed to create documentation for Python programs. It might be good for that purpose (wouldn't know; haven't used it.) But it's certainly not good for typesetting mathematics, research papers, academic quotations and so forth. Hence the large amount of wheel reinvention going on here. It's a little bit like writing your research paper using JavaDoc comments. Sphincter indeed.


You've never used ReST but you somehow know it's not good for typesetting research papers?


I used it a little bit back when I still wrote Python. It seemed a lot like HTML, but much terser, and whitespace-sensitive. ReST could do the same kind of stuff as HTML: build bullet lists, create tables, italic text, and so forth. However, since I had already learned HTML, learning another language that did the same thing felt like a waste of time. I fully believe that Restructured Text is a better markup language than HTML in some ways; however, I simply don't care because the differences are minor, and HTML is so much more powerful.

On the other hand, TeX was developed by Donald Knuth, a guy who spent his entire life doing research and writing papers about it. It has excellent math support, and is a true semantic language. I've written a few papers in TeX and been very happy with it.

Anyway, if RestructuredText were good at typesetting research papers, there would be no need for this project, would there?


You're confusing the language with the build system. LaTeX wouldn't be very useful without the awesome build system. This project is a build system for producing both high-quality HTML and high-quality PDF (through latex) with a single, high-level ReST markup language. It also uses latex formulas for math.
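
To make the "one source, several outputs" idea concrete, here is a rough sketch that drives the stock Sphinx command line from Python. It is not the project's own Makefile. `build_commands` and `build_all` are hypothetical helpers, and the directory names are placeholders. The only real interface used is `sphinx-build -b <builder> <sourcedir> <outdir>` with the standard html, singlehtml and latex builders.

```python
import subprocess
from pathlib import Path


def build_commands(source_dir, output_dir, builders=("html", "singlehtml", "latex")):
    """Return one sphinx-build argv list per requested builder.

    Every builder reads the same reST sources and writes to its own
    subdirectory of output_dir.
    """
    out = Path(output_dir)
    return [
        ["sphinx-build", "-b", builder, str(source_dir), str(out / builder)]
        for builder in builders
    ]


def build_all(source_dir, output_dir):
    """Run every builder in turn; raises CalledProcessError if one fails."""
    for argv in build_commands(source_dir, output_dir):
        subprocess.run(argv, check=True)
```

Note that the latex builder only emits .tex sources, so a separate LaTeX run is still needed to get the PDF.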


This uses an existing format, restructuredtext, not a new markup format.



