How to create training (ground truth) data to improve OCR/HTR

In this blog post, I want to introduce you to the basics of creating your own ground-truth training set for OCR/HTR, whether you’re building your own model or improving an existing one. Essentially, this is about how to create more data that can be used to train or retrain an algorithm to perform better on your historical texts. While this may seem obvious to many digital humanities practitioners, the task is often carried out by subject matter experts from the humanities who may not be as familiar with the technical aspects.

Improving a baseline model by fine-tuning

When we talk about creating your own ground-truth dataset for handwritten text recognition (HTR) or optical character recognition (OCR), what you often want to do is fine-tune or improve an existing model. Many models are already available (for example on HTR United or Transkribus), so you might provide additional training pages to address specific issues where the model consistently makes mistakes. In fact, it can often be faster to create your ground truth by running an imperfect OCR model first and then correcting that initial output. (However, this approach might not work for you if the existing model’s transcription guidelines differ significantly from your own, making it more time-consuming to correct the output to meet your standard than to start from scratch.)

But let’s not get ahead of ourselves and start with the basics.

What tool should I use? Transkribus versus eScriptorium

First of all, you can perform historical OCR with many different tools. Two typical examples in Digital Humanities are Transkribus (now freemium) and eScriptorium (free & open source).

I don’t want to go into much detail about Transkribus here, but some context: I used to be a big fan, and there are old blog posts and videos where I teach the basics of using Transkribus if you’re interested. However, they moved to a subscription model last year due to reduced funding, which is understandable (they need a stable income) but also somewhat problematic, since Transkribus is particularly strong because of the great models it offers, models more often than not created through contributions from various projects, funded by sources like the European Union. It’s just weird to now be selling the models to the people who helped create them in the first place; without the scholars, there would be no models. Then again, it’s also an understandable and probably necessary business decision. It’s a bit of a conundrum. I’ve actually heard about a grant proposal that was not funded because it wanted to use Transkribus instead of the open-source alternative (although I don’t have first-hand knowledge that this is actually 100% correct). Anyway: while certain functionalities of Transkribus remain free, more features are now behind a paywall, essentially charging users for functionality that is only so effective because people contributed their training data for free (and were maybe even funded by public money to do the work). The superior models are, in fact, Transkribus’ main competitive advantage, alongside its user-friendly interface, which caters more to traditional humanities scholars than to digital humanities experts who might not need such user-friendly features.

Because of these changes, I’m leaning towards recommending eScriptorium, which offers much more control over the process. eScriptorium is open-source and provides great flexibility, making it an excellent option if you want to have more control and understand the technical details. However, setting it up on your own can be complicated, so it’s often better to use an instance set up by someone else (if you’re in DHd, you might want to get in touch with AG OCR).

The machine learning that powers Transkribus and eScriptorium is essentially the same, or at least equally performant. The main advantages of Transkribus are its user-friendly interface for non-technical users and its strong models for historical data.

These are just some initial considerations when creating your own ground-truth training set for OCR. But again, back to the basics.

What do we mean by ground truth, and how is the Character Error Rate (CER) used for quality evaluation?

Ground truth is a transcription of a document that is 100% correct. Any error in the ground truth will cause the algorithm to learn something incorrect, so it’s crucial that it closely matches the original text and follows consistent transcription conventions. Depending on the material, Transkribus recommends transcribing between 5,000 and 50,000 words, with printed text requiring less and handwritten text requiring more. For example, they recommend 10,000 words for each distinct handwriting style.

There are also models trained on large datasets covering many hands from the same period, which can recognize unseen hands, albeit with a higher Character Error Rate (CER). Nowadays, there are also “supermodels” trained on highly variable data, designed to cover a broad range of texts. This is what Transkribus has been going for in recent years: creating supermodels that are not specialized but that generalize well.

The CER is the metric used to measure how well an OCR model performs. There are further metrics, like the Word Error Rate (WER), that can help home in on exactly what is going on with a given model. But if you just want to use an OCR/HTR platform and aren’t interested in actually training your own model, this may not be all that relevant to you.
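
In case you want to compute it yourself: here is a minimal Python sketch of how a CER can be calculated, namely as the Levenshtein (edit) distance between the recognized text and the ground truth, divided by the length of the ground truth. This is my own illustrative implementation, not code taken from Transkribus or eScriptorium.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(recognized: str, ground_truth: str) -> float:
    """Character Error Rate: edit distance normalized by the
    length of the ground truth."""
    return levenshtein(recognized, ground_truth) / max(len(ground_truth), 1)


# One substituted character in an 11-character line -> CER of about 0.09
print(cer("Lorem ipsvm", "Lorem ipsum"))
```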

How you transcribe matters

A consistent transcription accurately represents the document, including errors and punctuation. This is called a diplomatic transcription, where everything is included as it appears, such as combined words, capitalization, superscript and subscript characters, and punctuation. It’s crucial to be consistent and follow set conventions for effective OCR/HTR (re-)training.

You can choose either a diplomatic transcription (i.e. non-normalized) or normalize the text according to modern orthography, but consistency is essential so that the model understands how you want the text transcribed. Don’t switch between diplomatic and normalized approaches. For example, with letters like I and J or U and V, you can transcribe them as they appear or follow modern spelling conventions, but pick one approach and stick to it.

When dealing with ligatures, you can transcribe them using the single characters that make up the ligature; similarly, you can transcribe, for example, the round s and the long s (ſ) as distinct characters, or just render both as s.
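
One practical way to keep such conventions consistent (long s versus round s, resolved versus unresolved ligatures, and so on) is to look at the full character inventory of your ground truth from time to time. Below is a small sketch, assuming your transcriptions live as plain-text files in a folder (the folder name is made up), that counts every character used, so stray conventions stand out at a glance:

```python
from collections import Counter
from pathlib import Path


def character_inventory(folder: str) -> Counter:
    """Count every character used in all .txt transcriptions in a folder."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):
        counts.update(path.read_text(encoding="utf-8"))
    return counts


# "ground_truth/" is a made-up folder name for this example
for char, n in character_inventory("ground_truth/").most_common():
    # Printing the code point makes look-alikes (e.g. s vs. ſ) visible
    print(f"{char!r} (U+{ord(char):04X}): {n}")
```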

You should transcribe hyphenated words as they appear in the original text, including end-of-line hyphens. A convention used in the NOSCEMUS model (which I like a lot and am contributing to a little; learn more here if you read German) is to use the logical negation sign (¬) for a hyphen at the end of a line, which is really useful because it can later be used to reunite those words automatically.
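
As an illustration of why that convention is so handy, here is a minimal sketch (my own example, not NOSCEMUS code) of how the ¬ marks at line ends can later be used to reunite broken words automatically:

```python
def rejoin_lines(lines: list[str]) -> str:
    """Join transcribed lines, merging words that were broken at the
    line end and marked with the logical negation sign (¬)."""
    text = ""
    for line in lines:
        line = line.rstrip()
        if line.endswith("¬"):
            text += line[:-1]   # drop the mark; the word continues directly
        else:
            text += line + " "  # a normal line break becomes a space
    return text.strip()


print(rejoin_lines(["Dies ist ein Bei¬", "spiel mit einem ge¬", "trennten Wort."]))
# -> "Dies ist ein Beispiel mit einem getrennten Wort."
```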

In Transkribus, you can also tag words as bold, italic, etc., because these tags will then automatically be added in future recognitions. However, you don’t need to mark different fonts, like Courant or Antiqua.

For abbreviations, you can either keep the abbreviated form and transcribe it as it appears, or you can transcribe the expanded form. Neural networks can learn to recognize expansions, especially if they’re frequent, but you must always expand them in the same way. And remember to handle end-of-line hyphens consistently throughout your transcription.
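
If you do decide to expand abbreviations, one low-tech way to make sure they are always expanded in the same way is to keep a small lookup table and apply it mechanically to your draft transcriptions. A hypothetical sketch (the abbreviations and expansions are made-up examples, not a recommended list):

```python
# Hypothetical abbreviation table: every abbreviation is always
# expanded in exactly the same way.
EXPANSIONS = {
    "Dr.": "Doctor",
    "etc.": "et cetera",
}


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their fixed expansions."""
    for abbreviation, expansion in EXPANSIONS.items():
        text = text.replace(abbreviation, expansion)
    return text


print(expand_abbreviations("Dr. Faustus etc."))
# -> "Doctor Faustus et cetera"
```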

An iterative approach to creating Ground Truth (GT)

To get started, we follow an iterative approach to creating ground truth data:

First, we transcribe a number of pages to create the ground truth. It’s recommended to transcribe between 25 and 75 pages in Transkribus, depending on whether the text is printed or handwritten. You can build on base models and use your extra data for fine-tuning, as I said earlier. This approach will speed up the transcription process.

As soon as you have a few training pages, you train your own model. Then you run it on new pages, correct those pages, and repeat the process, because correcting is usually faster than transcribing from scratch. It’s essential to adhere to the transcription guidelines that you want the model to learn. Consistency is key: if you are inconsistent, the model will learn those inconsistencies, which can lead to a significant number of unnecessary errors across many pages. So, keep your training data as accurate and consistent as possible.

You then train your model on these “gold standard” (or ground truth) transcriptions and apply it to the new pages you want transcribed. There will be errors, which you then fix, and the extra training data created in this process is used to further improve the model by retraining it, and so on. Once your model reaches a high level of accuracy, you might consider publishing it if it’s of reusable quality.

For more information, you can refer to my two older blog posts How to historical text recognition: A Transkribus Quickstart Guide and Training my own Handwritten Text Recognition (HTR) model on Transkribus Lite and my video tutorial on the basics of using Transkribus Lite (of course, also check out Transkribus’ own tutorials which are probably more accurate). There are also many other tutorials available on the Transkribus website and Transkribus YouTube channel (and I’m following their transcription guidelines in this post).

How to make it fast and efficient

As I’ve mentioned before when giving advice on how to make annotation or data preparation as fast as possible, you can be faster and more accurate if you split tasks into their parts, i.e. finish part 1 for all its occurrences/instances, then switch to part 2, and so on. For example, first check the whole page for occurrences of one type of error, then proceed to other types of OCR correction.

Doing a linear reading of the text can slow you down considerably. For correcting errors in a text that doesn’t have many, it can actually be faster not to actively read but rather to compare characters or character patterns. Your brain will automatically do enough reading to get it right without getting sidetracked by the content of the text, which at this stage only matters to you in terms of form.

Don’t get bogged down by reading the text if possible. You will still be more accurate if you can theoretically read and understand the text, i.e. if you know the language and maybe even have subject-matter expertise. That way, you’ll notice errors more readily and intuitively, even without actively reading the text. This might sound strange, but try it out, and you’ll probably see what I mean. For example, my Latin alchemical texts, which often contain abbreviations of recipe ingredients, are easier to transcribe or correct for someone with some Latin proficiency, and better still if they know the subject matter and are thus aware of the names of potential recipe ingredients. Someone without that background knowledge will both suffer a lot and make more mistakes in the process. Generally, the more mistakes you need to correct, the more likely it is that you should do a second round of corrections to catch any leftover strays.

So, how should you go about it in practice? Linear reading is probably not the most effective approach, as I have said above. I personally suggest that my students try the following when they’re proofreading my alchemy-symbol-laden pages, which contain relatively good first OCR/HTR results from a trusted model that nevertheless needs to be improved for this specific genre of text:

  1. Check if columns are identified correctly.
  2. Copy one special character and check all instances of it on your page. Fix if necessary by pasting it in (a small helper sketch for this step follows after the list).
  3. If you need to search for a new special character, save it in a text file for reference (maybe with a link and name/description).
  4. Proceed with all special characters as described above.
  5. Look at line endings: with a highly performant model, most remaining errors are minute, such as a missing logical negation sign at broken words, or missing final full stops or commas.
  6. Check if any words are glued together (i.e., are missing whitespace).
  7. Once the most obvious things are done, do an “overview reading” of the whole page to catch any other errors you may have missed.
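
For steps 2 to 4 (and the whitespace check in step 6), a small helper script can take over some of the drudgery. The following is a hypothetical Python sketch, not part of any existing tool, that lists all non-ASCII characters on a page together with the lines they appear on, and flags suspiciously long tokens that might be glued-together words:

```python
from collections import defaultdict


def check_page(lines: list[str], max_word_len: int = 20) -> None:
    """List special (non-ASCII) characters with the lines they occur on,
    and flag unusually long tokens that may be glued-together words."""
    special = defaultdict(set)
    for no, line in enumerate(lines, start=1):
        for char in line:
            if ord(char) > 127:
                special[char].add(no)
        for token in line.split():
            if len(token) > max_word_len:
                print(f"Line {no}: possibly glued word: {token!r}")
    for char, line_nos in sorted(special.items()):
        print(f"{char!r} (U+{ord(char):04X}) on lines {sorted(line_nos)}")


# Made-up two-line transcription for demonstration
check_page(["Recipe ℞ aqua fortis und Queck¬",
            "silberauflösungmitvielzuwenigLeerzeichen"])
```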

I would not suggest reading linearly or actively “reading” the text when correcting it (unless you have enough Latin to immediately notice errors).

In pages that contain many errors, I would also recommend that you systematically check for all those types of errors first (separately, one error source after another, ideally working from a checklist like the one above). Only towards the end do you move to something like a (superficial!) linear reading to catch the last mistakes. But you won’t catch many mistakes (or at least won’t catch them efficiently) if you start out with an extremely poor-quality text. However, please also note that starting out with a poor-quality intermediary OCR output may be unavoidable, especially if you work with handwritten documents. My workflow works best on print (with a high-performing HTR model but challenging layout and special characters).

And in terms of fatigue: You might be better off doing this in multiple sessions, as your eyes will get tired. If you need to have your improved HTR model ready by a certain date, factor this into your planning.

Anyway, that’s it from me for today.

Hope this helped!

Best,
the Ninja

Buy me coffee!

If my content has helped you, donate 3€ to buy me coffee. Thanks a lot, I appreciate it!
