Conversion from (La)TeX to HTML

Translating LaTeX documents (partially or fully) to HTML is a difficult problem, primarily because the two document formats address very different needs. TeX is intended to produce statically laid out documents with fixed dimensions, ultimately representing ink on paper. HTML, on the other hand, assumes a variety of differently sized and scaled screens and consequently prefers to express layouts in more abstract terms, the typesetting of which is ultimately left to the browser, ideally responsively: the document layout should adapt to different screen sizes, ranging from 8K desktop monitors to cell phone screens.

This means that there is no one “correct” way to convert TeX to HTML. Rather, there are many choices to be made, most notably which aspects of the static, fixed-dimension layout described by the TeX code to preserve, and which to discard in favour of leaving them up to the rendering engine. This explains the plurality of existing converters.

Naturally, many LaTeX macros are somewhat aligned with tags in HTML; for example, sectioning macros (\chapter, \section, etc.) correspond to <h1>, <h2>, etc.; the {itemize} and {enumerate} environments and the \item macro correspond to <ul>, <ol> and <li>, respectively; and so on. Most converters therefore opt for the reasonable strategy of mapping common LaTeX macros directly to their closest HTML relatives, with little or no (simple) CSS, effectively focusing on preserving the document semantics of the constructs used (e.g. “paragraph”, “section heading”, “unordered list”). In many situations this is the natural approach to pursue, especially if we can reasonably assume that the document sources to be converted are sufficiently “uniform” that a similarly uniform CSS style sheet can be provided to style them. This is largely the way existing converters work. To name just a few:

LaTeXML focuses strongly on semantics, using XML as the primary output format and heuristically determining an author’s intended semantics of everything from text paragraphs (definitions, examples, theorems, etc.) down to the meaning of individual symbols in mathematical formulae. It has achieved great success with ar5iv.org, which hosts HTML documents generated from TeX sources available on arxiv.org.

TeX4ht focuses on plain HTML as output with minimal styling, going so far as to (optionally) replace the \LaTeX macro with the plain ASCII string “LaTeX”.

Pandoc largely focuses on the most important macros and environments that have analogues in all of its supported document formats, so that it can convert between any two of them, e.g. TeX, Markdown, HTML, or docx.

MathJax focuses exclusively on macros for mathematical formulae and symbols, allowing TeX syntax to be used directly in HTML documents, where it is subsequently replaced via JavaScript by the intended presentation.
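
To make the direct-mapping strategy concrete, the following is a minimal sketch in Python. It is not how any of the converters named above is implemented; the lookup tables SECTION_TAGS and LIST_TAGS and the function convert are hypothetical names introduced purely for illustration, and the sketch assumes one construct per input line and no nested braces.

```python
import re
from html import escape

# Hypothetical lookup tables (illustration only): each supported LaTeX
# macro or environment is paired with its closest HTML relative.
SECTION_TAGS = {"chapter": "h1", "section": "h2", "subsection": "h3"}
LIST_TAGS = {"itemize": "ul", "enumerate": "ol"}

SECTION_RE = re.compile(r"\\(chapter|section|subsection)\{(.*)\}")
LIST_RE = re.compile(r"\\(begin|end)\{(itemize|enumerate)\}")
ITEM_RE = re.compile(r"\\item\s+(.*)")


def convert(tex: str) -> str:
    """Map each non-empty line to an HTML element, preserving semantics only."""
    out = []
    for raw in tex.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = SECTION_RE.fullmatch(line)
        if m:
            tag = SECTION_TAGS[m.group(1)]
            out.append(f"<{tag}>{escape(m.group(2))}</{tag}>")
            continue
        m = LIST_RE.fullmatch(line)
        if m:
            tag = LIST_TAGS[m.group(2)]
            out.append(f"<{tag}>" if m.group(1) == "begin" else f"</{tag}>")
            continue
        m = ITEM_RE.fullmatch(line)
        if m:
            out.append(f"<li>{escape(m.group(1))}</li>")
            continue
        # Anything not in the tables is treated as paragraph text, so a
        # macro the tables do not know passes through unconverted.
        out.append(f"<p>{escape(line)}</p>")
    return "\n".join(out)
```

Note that no layout information (fonts, dimensions, spacing) survives the conversion; styling is left entirely to a separate style sheet. A macro that is missing from the tables, such as a hypothetical \mymacro{x}, simply ends up as literal text in the output.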

However, the approach described above has notable drawbacks. Firstly, it requires special treatment of LaTeX macros that plain TeX would expand into primitives, and the number of LaTeX macros is virtually unlimited: CTAN (currently) has a collection of 6399 packages, a number that keeps growing, which get updated regularly, and authors can add their own macros at any point. Supporting the former alone is a never-ending task, and providing direct HTML translations for the latter is impossible. This is made worse by the very real and ubiquitous practice among LaTeX users of copy-pasting and reusing various macro definitions and preambles assembled from StackOverflow, friends and colleagues, and handed down for (by now literally) generations, even in situations where (unbeknownst to them) “official” packages with better solutions (possibly supported by HTML converters) exist.

Sources:

Dennis Müller