Simple XML to LaTeX Transformation Tutorial

Today, I wanted to share this super simple XML to LaTeX tutorial. Using XSLT, you are going to transform XML data to LaTeX output which you can then go on to compile into your desired output PDF. There will be no fancy stuff whatsoever in this post, just the basics and what to keep in mind with these transformations. It is the quick intro to XML to LaTeX I did with my students a while ago which was done one day after they had their first contact with XSLT, so it should really be beginner-friendly. I labeled it “Advanced LaTeX” anyway because I think starting to automate things is always a step in the right direction 😉

Edit March 2022: Sadly, with WordPress changes (and source code support never working all that well to begin with), the code formatting of this post is pretty broken. Since it tends to re-break soon after I fix it, here is a similar / almost the same template described in this post.

Configuring the transformation scenario in Oxygen

I am going to assume you use Oxygen now because that’s what a lot of people in the DH do and this post is directed towards my friends in the DH. Especially those who think print editions are an obsolete concept in times of the Digital Edition. Maybe having a nice little intro to XML to LaTeX transformations available will change their minds 😉

To set up the transformation scenario, choose XML transformation using XSLT, then choose your XML document and your XSL stylesheet (set up and open those document in the editor before you configure the scenario). Then choose a Saxon 9 version (whichever you like). Then ignore the FO tab and get right to the output tab.

Here you configure how to name your output. Best click the green arrow and choose cfn, then append -latex.tex. So the result the current filename with -latex.tex appended to it. This is an important step so you don’t accidentally overwrite the original. Which in this case is not so dramatic since it has a different ending anyway but if you do XML to XML transformations, this is even more crucial. Then tell Oxygen to show it in the editor as XML (even though you know it isn’t). The editor will then, of course, complain about the non-valid XML but don’t worry. Just copy all (CTRL+ACTRL+C) and paste it (CTRL+V) into a completely empty (!) project in Overleaf  or just compile the LaTeX directly if you have it installed on your machine.

There it is, you’re set. Now let’s get to the stylesheet.

The stylesheet

So, this is the whole thing. You can just grab it and go if you don’t care for the explanation or read on to find out why things were done the way they were done. This tutorial assumes you’re already familiar with how XSLT works, just haven’t done transformations to LaTeX yet, by the way. I am also assuming, your base XML is in the TEI standard.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"     xmlns:xs="http://www.w3.org/2001/XMLSchema"     xmlns:t="http://www.tei-c.org/ns/1.0"     exclude-result-prefixes="xs"     version="2.0">
    <xsl:strip-space elements="*"/>
    <xsl:output method="text" encoding="UTF-8" indent="no" omit-xml-declaration="yes"/>

    <xsl:template match="/">
        <xsl:text>\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\DeclareUnicodeCharacter{2060}{\nolinebreak} % might not be necessary for you

\title{</xsl:text><xsl:apply-templates select="//t:title"/>
        <xsl:text>}
\author{</xsl:text><xsl:apply-templates select="//t:author"/>
        <xsl:text>}\date{\today}
\begin{document}
\maketitle
\tableofcontents\newpage</xsl:text>
        <!-- get some metadata from the TEI header using the push paradigm -->
<xsl:text>\begin{itemize}</xsl:text>
<xsl:for-each select="//t:persName[ancestor::t:teiHeader]">
    <xsl:text>\item </xsl:text>
    <xsl:value-of select="." />
    <xsl:text>

    </xsl:text>
</xsl:for-each>
<xsl:text>\end{itemize}
\newpage</xsl:text>

        <!--  <xsl:apply-templates/> OR -->
        <xsl:apply-templates select="//t:text"/>
        <!-- just use the pull paradigm on the TEI body so you don't get a meaningless TEI header dump in your document -->

        <xsl:text>\end{document}</xsl:text>
    </xsl:template>

    <xsl:template match="t:head">
        <xsl:text>\section{</xsl:text><xsl:apply-templates/><xsl:text>} </xsl:text>
    </xsl:template>

    <xsl:template match="t:p">
        <xsl:apply-templates/>
        <xsl:text>

        </xsl:text>
    </xsl:template>

    <xsl:template match="t:hi">
        <xsl:text>\emph{</xsl:text><xsl:apply-templates/><xsl:text>} </xsl:text>
    </xsl:template>

    <xsl:template match="text()">
        <xsl:analyze-string select="." regex="([&amp;])|([_])|([$])">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(1)">
                        <xsl:text>\&amp;</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(2)">
                        <xsl:text>\_</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(3)">
                        <xsl:text>\

lt;/xsl:text>
</xsl:when>
<xsl:otherwise/>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select=”.” />
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

The XSLT declaration and the LaTeX preamble

Put the following after your XML declaration. This will ensure output is LaTeX-friendly.

<xsl:strip-space elements="*"/> <!-- for LaTeX -->
<xsl:output method="text" encoding="UTF-8" indent="no" omit-xml-declaration="yes"/>

\DeclareUnicodeCharacter{2060}{\nolinebreak} was added because LaTeX complained about an undefined character in some of my students XML data. Personally, I had never gotten this error before, so you might as well leave it out.

Creating environments

As we had just learnt some XSLT basics, I wanted the students to use at least one push and one pull paradigm type template. So the task was to process any element from the TEI header using the push paradigm and then a pull paradigm template for the body. To make this little template more efficient teaching-wise, I decided to introduce how to create a LaTeX environment using XSLT for the TEI Header / push paradigm and do at least one other command using the pull paradigm on the TEI body. Also this demonstrates as opposed to for the body.

So this next piece of code sets up an itemize environment for persons present in the header. In a “real” stylesheet, it would probably be more wise to check, using whether there is one potential element like that present and only paste the \begin and \end on that condition.

Also, as you might have noticed, all the LaTeX commands are inside . This looks a bit confusing at first but really isn’t. Just make sure you don’t create invalid XSLT by shuffling them around.

<xsl:text>\begin{itemize}</xsl:text>
<xsl:for-each select="//persName[ancestor::t:teiHeader]">
    <xsl:text>\item </xsl:text>
    <xsl:value-of select="." />
    <xsl:text>

    </xsl:text>
</xsl:for-each>
<xsl:text>\end{itemize}</xsl:text>

Pull paradigming emphasis

When creating simple commands using the pull paradigm, be sure that you don’t end up with too many “overlaps” since “simple” commands in LaTeX don’t take multiple paragraphs as arguments. If in doubt, always use environment and the global switches (like \bfseries). Since you are automating things, you always have to take into account that data might not always be marked-up in a way which makes sense to you. There can easily be linebreaks inside a single italic highlight. If this is the case in you data, better create an environment. For simple purposes, however, this is good enough:

<xsl:template match="t:hi"><xsl:text>
\emph{</xsl:text><xsl:apply-templates/><xsl:text>} <xsl:text></xsl:template>

With these commands (and genereally when transforming to LaTeX), you sometimes need to make sure you don’t involuntarily add spaces or lack space which will make the output hard to read and debug.

In this example, I made sure not to add any whitespace inside the template rule and also did not have Oxygen format or indent the XML to avoid these unwanted spaces.

And finally: Escaping entities

As you might remember, markup languages tend to use entities to escape certain characters. Bad thing is, LaTeX and XML use different entities. So we need to escape them. I know that the OxGarage standard stylesheet does this using the translate() function but I prefer to use since it’s less “messy” than a nested translate() construct.

Ah, and side info by the way: There is this standard stylesheet from the TEI consortium which might be of help if you are looking for inspiration. Since it is very generic, however, it might not be helpful if you are a newbie at both XSLT and LaTeX. The XSLT is pretty advanced and also the LaTeX probably uses some commands you might not be aware of.

<xsl:template match="text()">
        <xsl:analyze-string select="." regex="([&amp;])|([_])|([$])">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(1)">
                        <xsl:text>\&amp;</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(2)">
                        <xsl:text>\_</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(3)">
                        <xsl:text>\

lt;/xsl:text>
</xsl:when>
<xsl:otherwise/>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select=”.” />
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
I say some more about escaping the entities and the whole <xsl:analyze-string> process in this post: Automating XML annotation: Get more done using RegEx Search&Replace and xsl:analyze-string.

And that’s it. I hope this was useful to you.

Cheers,

the Ninja

Buy me coffee!

If my content has helped you, donate 3€ to buy me coffee. Thanks a lot, I appreciate it!

€3.00

I like LaTeX, the Humanities and the Digital Humanities. Here I post tutorials and other adventures.

13 thoughts on “Simple XML to LaTeX Transformation Tutorial

  1. Sounds interesting! Do you think, this could work in RMarkdown ? I want to use an xml-file to import abbreviations into a tex-template for RMarkdown (pdf-output). I am meanwhile quite familiar with RMarkdown. I am familiar with the way to integrate tables of content as well as lists of tables and of figures. What I am looking for is a way of integrating lists of abbreviations.

    Like

    • Overall I think it must be possible but I think I didn’t fully understand the question 😉
      Do you mean transforming XML to RMarkdown directly? If yes – XSLT can transform into any text-based format, so you’re in luck. If not, please explain again what the question was 😉

      Like

      • Thanks für your answer! My issue is as follows: My RMarkdown file is – for pdf-output – based on a Latex template (*.tex-file), which has been supplied at University Hannover. This template provides for a table of contents as well als lists of figures and tables. These directories are generated automatically. What I don´t have (but need), is a list of abbreviations. RMarkdown does not support – as far as I have figured out – the respective Latex-packages (e.g. acronym). On the other hand I dont´t want to write my list of abbreviations into my Latex template. I would prefer to maintain a separate abbreviations-file in a simple format (xml, json). On rendering the pdf-output my abbreviations should be integrated into the pdf. I thought this would be a common requirement – perhaps in other contexts. Nonetheless I haven´t found a solution for this via google. I am grateful for any hints how to proceed. Sincerely, H.-C. Schneider

        Like

      • Yeah I think an XSLT template might be what you need – just let it create .Tex code in a specific file, like “abbrev.tex” (= output file of the XSLT transformation) and /include or /input{abbrev} into your doc. Does this help?

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.