By request, today I want to address the subject of JATS-XML to LaTeX transformations. The post might still be interesting for you even if you’re not particularly interested in this specific transformation, since it also addresses more general requirements for transformations.
What is JATS-XML and why would we transform from and into it?
First things first: What is JATS-XML? It is an XML standard called the Journal Article Tag Suite (JATS).
Journal Article Tag Suite
… is an application of NISO Z39.96-2019, which defines a set of XML elements and attributes for tagging journal articles and describes three article models.
The content on this site is the supporting documentation for the standard. JATS is a continuation of the NLM Archiving and Interchange DTD work begun in 2002 by NCBI. (source & JATS documentation)
It has the <article> element, and in that, you get <front>, <body>, and <back>. Learn more about it and see examples in the links.
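To make that structure a bit more tangible, here is a minimal sketch in Python. The sample document is hand-written by me for illustration and not validated against the JATS DTD, and the names JATS_SAMPLE, top_level and title are my own. It only parses the skeleton and lists what sits directly inside <article>.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified JATS-like sample (an assumption for
# illustration; a real article carries far more metadata).
JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A Sample Article</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>Some text.</p></sec>
  </body>
  <back>
    <ref-list><title>References</title></ref-list>
  </back>
</article>"""

root = ET.fromstring(JATS_SAMPLE)

# The direct children of <article>: metadata, content, back matter.
top_level = [child.tag for child in root]

# The article title lives in the metadata part, not in <body>.
title = root.findtext("front/article-meta/title-group/article-title")

print(root.tag, top_level)
print(title)
```

The point is simply that you always know where to look: metadata in <front>, the text in <body>, references and the like in <back>.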
The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012. […]
The JATS provides a set of XML elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews. JATS allows for descriptions of the full article content or just the article header metadata; and allows other kinds of contents, including research and non-research articles, letters, editorials, and book and product reviews. (Wikipedia)
So the quest was the following:
LaTeX to JATS or JATS to LaTeX?
In this transformation, it makes a big difference which input format we start from. While LaTeX can contain lots of custom home-made fancy workarounds, an XML standard is (wait for it) standardized 😉 It’s always good to transform from something standardized because then you don’t need to inspect every single document in advance (though you should, of course, always know what your data looks like!). You can more or less rely on the data conforming to the standard. Then you need transformation rules for all possible elements, or just the ones actually in use, and you’re good to go.
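If you only want rules for the elements actually in use, you first need to know which ones those are. A small sketch of such an inventory, assuming Python's standard library; the sample snippet and the HANDLED set are made up by me and stand in for your real data and your real rule set.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up sample; in practice you would loop over all your article files.
SAMPLE = """<article><body>
<sec><title>Results</title>
<p>Water is <italic>essential</italic>.</p>
<p>See <xref ref-type="fig" rid="f1">Figure 1</xref>.</p>
</sec></body></article>"""

# Hypothetical: the elements we already have transformation rules for.
HANDLED = {"article", "body", "sec", "title", "p", "italic"}


def element_inventory(xml_text):
    """Count how often each element name occurs in the document."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


inventory = element_inventory(SAMPLE)

# Elements that occur in the data but have no rule yet.
missing = sorted(set(inventory) - HANDLED)

print(inventory)
print(missing)
```

Running something like this over the whole corpus tells you how much rule-writing is ahead of you before you commit to it.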
Code which is not standardized is, let me say, not great starting material for a transformation. So either you clean your data first so that the transformation can be automated later, or you settle for a transformation which only handles what it knows and leaves the rest for you to hand-correct afterwards.
In the concrete case, the Journal of Engineering & Processing Management (@JEPM_OA) gets submissions in LaTeX. They were now thinking about transforming this to JATS-XML:
Thanks a million to Peter Flynn (@docum3nt) for jumping in. He basically said it all: LaTeX should be an end-of-pipeline format, one that you only convert into. But that’s not necessarily the only possible way. What you really need from data that you want to transform automatically into other formats, however, is that it adheres strictly to a standard. That’s why we use XML standards in the digital humanities (DH): you know (more or less) what your data will look like and which default transformation steps need to take place. So going from JATS-XML to LaTeX is just a normal case of the simple XML to LaTeX transformation.
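To show what I mean by a simple XML to LaTeX transformation, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the RULES table covers only a handful of JATS elements, the LaTeX output choices are mine, and the escaping is deliberately incomplete (backslash, ~ and ^ are not handled). Elements without a rule are kept but wrapped in a searchable [[tag: ...]] marker, so they can be hand-corrected afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> (LaTeX before, LaTeX after).
RULES = {
    "article": ("", ""),
    "body": ("", ""),
    "sec": ("", ""),
    "title": ("\\section{", "}\n"),
    "p": ("", "\n\n"),
    "italic": ("\\emph{", "}"),
    "bold": ("\\textbf{", "}"),
}

# Incomplete on purpose: only the most common LaTeX special characters.
SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
            "_": r"\_", "{": r"\{", "}": r"\}"}


def escape(text):
    """Escape LaTeX special characters; None becomes an empty string."""
    return "".join(SPECIALS.get(ch, ch) for ch in text or "")


def to_latex(el, unknown):
    """Recursively convert an element; collect unhandled tags in `unknown`."""
    if el.tag in RULES:
        before, after = RULES[el.tag]
    else:
        unknown.add(el.tag)
        before, after = f"[[{el.tag}: ", "]]"
    parts = [before, escape(el.text)]
    for child in el:
        parts.append(to_latex(child, unknown))
        parts.append(escape(child.tail))  # text following the child element
    parts.append(after)
    return "".join(parts)


SAMPLE = (
    "<article><body><sec><title>Yield &amp; purity</title>"
    "<p>The yield was <italic>high</italic>, "
    "see <xref rid=\"t1\">Table 1</xref>.</p>"
    "</sec></body></article>"
)

unknown = set()
latex = to_latex(ET.fromstring(SAMPLE), unknown)
print(latex)
print(sorted(unknown))
```

In real life you would probably reach for XSLT here, but the principle is the same: one rule per element, and a visible fallback for everything you have not covered yet.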
In theory, if your data are standardized, you should be able to come up with a custom transformation process for them no matter how complicated they are. I say “in theory” because in practice, this means a lot of work. And it’s only really worth it if you honestly have tons of similar data so all the effort will pay off. Unless that’s the case (that you have lots of similar, standard-adhering data), it’s probably not worth it.
Apparently, according to the editors of the journal, the problem thus far seems to be chemical formulas. For those, a transformation scenario could be written if their implementation adheres to standards. It might just be that available out-of-the-box transformation tools like Pandoc only implement very basic LaTeX and not some of the packages used, even if they are fairly common packages in your discipline (as far as I know, at least). This problem could be avoided if one is willing to put in the work of writing a custom scenario (is it worth it, though?). So, in theory, it’s not impossible. However, if the problem is that authors add many custom commands etc. (as is common practice in LaTeX), then it isn’t really an option.
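Before deciding whether a custom scenario is worth it, one could at least check how "home-made" the submissions are. The following Python sketch is a rough heuristic under my own assumptions: SUPPORTED_PACKAGES is a hypothetical list of packages your converter copes with, the sample source is invented, and the regular expressions are simplistic (they do not skip commented-out lines, for instance).

```python
import re

# Hypothetical: packages the chosen converter is assumed to handle.
SUPPORTED_PACKAGES = {"amsmath", "graphicx"}

# \usepackage[options]{pkg1, pkg2}
PACKAGE_RE = re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]*)\}")
# \newcommand{\foo}, \renewcommand*\foo, \def\foo
MACRO_RE = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{?\\([A-Za-z@]+)\}?|\\def\s*\\([A-Za-z@]+)"
)


def audit(tex):
    """Return (unsupported packages, author-defined macro names)."""
    packages = set()
    for group in PACKAGE_RE.findall(tex):
        packages.update(n.strip() for n in group.split(",") if n.strip())
    macros = {a or b for a, b in MACRO_RE.findall(tex)}
    return sorted(packages - SUPPORTED_PACKAGES), sorted(macros)


# Invented submission using a chemistry package and two custom macros.
SAMPLE_TEX = r"""
\documentclass{article}
\usepackage{amsmath}
\usepackage[version=4]{mhchem}
\usepackage{graphicx, siunitx}
\newcommand{\water}{\ce{H2O}}
\def\degC{^\circ\mathrm{C}}
\begin{document}
\water{} boils at $100\degC$.
\end{document}
"""

unsupported, custom = audit(SAMPLE_TEX)
print("Packages without support:", unsupported)
print("Custom macros:", custom)
```

If that list is short and the same few packages keep showing up, writing rules for them might pay off. If every author brings their own pile of macros, probably not.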
Also, it’s always hard to say without knowing what the data really look like in a concrete case. I still hope this was a little bit helpful.
Best,
the Ninja
Resources
- GitHub JATS-XML topics
- Official Documentation Site
- JATS on Wikipedia
- More info from official site
- Short tutorial on Typeset.io (with a long example of JATS-XML code for those who are curious)
- The Typeset.io blog (has some general JATS-related posts)
- JATS-XML Generator
- JATS Intro on xml.com
- Overleaf apparently had a job offer for transforming LaTeX to JATS at some point in 2017: for the curious, here it is. So technically, there should be someone out there at Overleaf who knows more about this than we do and maybe is about to come up with a general solution for this, so we shouldn’t “homecook” something?
We’re looking for an expert in automatic LaTeX to JATS XML and HTML conversion with LaTeXML. You’ll be working closely with our team to develop and improve our automatic LaTeX to XML conversion pipeline. (from the 2017 Overleaf job ad)