In 2017, I did an internship at the German Historical Institute in Paris, where my task was to annotate around 1,200 regests. The goal was to get them from source data in a Word document into TEI-XML, but the project requirement was to keep everything in Word as long as possible. So, the annotations had to be done using Formatvorlagen (these are called ‘styles’ in English Word; I loosely call them ‘macros’ throughout this post). In this post, I want to reflect on what helped me finish the annotation in three weeks (when previous estimates were that it would take about a year). Even if your project is set up differently, these lessons may help you make your own annotation projects more efficient.
Back then, I wrote a tutorial for people in the project who might want to continue my approach after I was gone. It’s in German but you can use LLMs to translate it for you if you’re interested. It’s a bit of a legacy thing but maybe it’s still useful to somebody out there.
You may also want to look into my tutorial Automating XML annotation: Get more done using RegEx Search&Replace and xsl:analyze-string. Additionally, if you want to annotate fast, I highly recommend learning the keyboard shortcuts of whatever editor you’re using (for example, how to annotate and move around an XML document in the Oxygen XML editor without clicking or taking your hands off the keyboard).
Using MS Word macros to create TEI-XML documents
The benefit of using these macros in MS Word is that they allow you to make visual modifications, like changing the background color of a word, and labeling it so that when you translate the document to TEI later, the label remains. This is useful for later transforming the document into the TEI elements you want using an XSLT transformation. For example, you can transform the .docx document into simple TEI using the TEI Garage transformation tool.
At the time, this was a method we used to create TEI from Word documents with project partners who were not comfortable working with XML data themselves. In this case, the data was needed for book publication afterward and the partners didn’t want to create the critical edition from TEI, which is theoretically possible using LaTeX (see Simple XML to LaTeX Transformation Tutorial). Also, stay tuned for a tutorial in the making, where I’ll collaborate with an expert to show how you can use LaTeX in the DH—just to honor the name of this blog and actually work with LaTeX a bit, which I haven’t done in a long time.
Anyway, the partners preferred to stay in MS Word for the entire project, so I had to make the most out of these macros. I made them as visual as possible so I wouldn’t need to click into the word to see what annotation it had (that’s a real pain in MS Word), but instead could immediately recognize it at a glance. For example, I set up background colors and different formatting styles for the different annotation categories. There were about 15 to 20 different types of annotations, so I used both paragraph-level and character-level macros, which are the two available options in Word.
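To make the two levels a bit more concrete, here is a minimal Python sketch (not something I used in the project) that counts which paragraph-level and character-level styles occur in a document. It relies on how .docx files store them: paragraph styles as w:pStyle and character styles as w:rStyle inside the main document part, which is usually word/document.xml in the .docx zip archive. The function name count_styles, the style IDs and the sample text are made up for illustration. Note that Word stores style IDs here, which can differ from the display names you see in the interface.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of WordprocessingML, the XML vocabulary inside a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_styles(document_xml):
    """Count paragraph-level (w:pStyle) and character-level (w:rStyle) style IDs."""
    root = ET.fromstring(document_xml)
    counts = Counter()
    for tag, kind in (("pStyle", "paragraph"), ("rStyle", "character")):
        for el in root.iter(f"{{{W}}}{tag}"):
            counts[(kind, el.get(f"{{{W}}}val"))] += 1
    return counts


# Made-up sample: one paragraph with a hypothetical "Regest" paragraph style
# and one run with a hypothetical "PersName" character style.
sample = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Regest"/></w:pPr>
<w:r><w:t>Letter from </w:t></w:r>
<w:r><w:rPr><w:rStyle w:val="PersName"/></w:rPr><w:t>Jean Dupont</w:t></w:r>
</w:p></w:body></w:document>"""

for (kind, style_id), n in sorted(count_styles(sample).items()):
    print(kind, style_id, n)
```

For a real file, you could read the XML with zipfile.ZipFile(path).read("word/document.xml") from the standard library and pass the result to count_styles. A quick count like this can reveal misspelled or accidentally duplicated styles before the TEI transformation.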
The paragraph-level macros ensured that the nesting in TEI would be correct, so that elements belonging together would be in one coherent paragraph. The character-level macros were for the smaller-level annotations, such as personal names. After the automated TEI transformation, the paragraph-level macros result in <p> elements (with the macro name as the @type attribute), and the word-/character-level macros result in <hi> elements with their names in @rend. If you have TEI skills, you will realize that this is a relatively flat (i.e. not well-nested), not really TEI-conforming markup. You need some XSLT skills to turn it into usable TEI in a second transformation, which, of course, only works if the intermediary TEI from the initial transformation comes out as expected. If this is all Greek to you, you may want to look into A shamelessly short intro to XML for DH beginners (includes TEI).
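To illustrate what that second transformation has to do, here is a small sketch of the idea in Python rather than XSLT (the project itself used XSLT, and this is not the original stylesheet). It renames <hi rend="…"> elements to proper TEI elements. The mapping REND_TO_TEI, the function name promote_hi and the sample sentence are assumptions for illustration, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word style names (found in @rend) to TEI element names.
REND_TO_TEI = {"persName": "persName", "placeName": "placeName"}


def promote_hi(root, mapping=REND_TO_TEI):
    """Rename <hi rend="X"> to the TEI element mapped to X and drop @rend."""
    for el in list(root.iter("hi")):
        target = mapping.get(el.get("rend"))
        if target:
            el.tag = target
            del el.attrib["rend"]
    return root


# Made-up example of the flat intermediate markup described above.
flat = '<p type="regest">Letter from <hi rend="persName">Jean Dupont</hi> to the abbot.</p>'

root = promote_hi(ET.fromstring(flat))
print(ET.tostring(root, encoding="unicode"))
# <p type="regest">Letter from <persName>Jean Dupont</persName> to the abbot.</p>
```

Any <hi> whose @rend value is not in the mapping stays untouched, which makes unexpected style names easy to spot afterwards.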
Anyway, I came up with these classes based on the project requirements. My boss, who arranged the internship and was interested in getting the data into the Monasterium repository and into TEI and CEI (the Charters Encoding Initiative markup), helped set up these categories.
Working like this used to be common practice at our institute for many projects back then and I had done similar work before. Nowadays, we don’t really do this anymore because these transformations from Word often contain many mistakes. For complex markup, fixing errors can take more work than just creating the TEI or XML directly in an XML editor. So now, we try to have at least one person from the project learn XML and create the data directly. However, at the time, this was our standard approach, and I think it’s still okay for relatively simple data. Once the data becomes too complex, though, this method is too error-prone, and you need good XSLT skills to transform it properly after the Word documents are prepared.
Setting up the documents and getting started
For this project, I split the large document into multiple Word documents because the file size was becoming unmanageable. Each document was about 200 regests long. Then, I marked the beginning and end of each regest with a text indicator, for example, the word “END” formatted with a black background, so it could be easily and reliably found again and turned into a <div> element in TEI, thereby introducing some more nesting. Next, I applied all the paragraph-level markings. These are a bit trickier because they’re less visually prominent than the bright background colors I used for character-level annotations.
I think we needed different types of paragraphs for the actual regests and for editorial information. Although I don’t remember all the details, the process involved going through each type of annotation or sometimes doing two or three types in one go, switching between different macros using keyboard shortcuts. This really helped keep the time I needed down.
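The marker-to-<div> step can be sketched roughly as follows. This is a Python illustration of the idea, not the XSLT used in the project. It assumes that each marker ends up as its own paragraph whose entire text is “END” in the intermediate TEI. The function name group_into_divs, the @rend value “marker” and the sample paragraphs are made up.

```python
import xml.etree.ElementTree as ET

MARKER = "END"  # assumed marker text, as in the "END" example above


def group_into_divs(body):
    """Wrap the paragraphs between END markers in <div> elements."""
    new_body = ET.Element(body.tag, body.attrib)
    current = ET.SubElement(new_body, "div")
    for child in list(body):
        if "".join(child.itertext()).strip() == MARKER:
            # A marker closes the current regest, so start a new <div>.
            current = ET.SubElement(new_body, "div")
        else:
            current.append(child)
    # Drop the empty <div> left over when the document ends with a marker.
    for div in list(new_body):
        if len(div) == 0:
            new_body.remove(div)
    return new_body


# Made-up flat input: two regests, each closed by an END marker paragraph.
flat = """<body>
<p type="regest">First regest.</p>
<p type="editorial">Note on the first regest.</p>
<p><hi rend="marker">END</hi></p>
<p type="regest">Second regest.</p>
<p><hi rend="marker">END</hi></p>
</body>"""

body = group_into_divs(ET.fromstring(flat))
print([len(div) for div in body])  # [2, 1]
```

The marker paragraphs themselves are dropped, since their only job is to signal where one regest ends and the next begins.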
Don’t click around
Initially, they estimated this project would take about a year to complete, but I finished it in three weeks. I made sure to stay focused because I also wanted to see a bit of Paris, so I wasn’t keen on working crazy hours or nights. Instead, I concentrated on minimizing wasted time by avoiding unnecessary clicks.
That’s my biggest takeaway for speeding up annotation: don’t click around. Use keyboard shortcuts as much as you can. Also, if you have multiple items requiring the same steps, do step one for all of them, then move on to step two for all of them.
The ideal implementation of this principle may vary depending on your data, but you should know your own data well enough to come up with a system that works for you inspired by these principles (“Don’t click around” and “Complete each sub-step for the whole document before proceeding to the next step”).
In my experience, this has proven extremely effective and sped up my work significantly. Initially, there’s a bit more setup involved, but for over a thousand items of data, this approach will save you a lot of time.
That’s it for today—hope this helps!
PS: Let me know if you also enjoy these legacy-type posts with some insights into past project practices!