How to create training (ground truth) data to improve OCR/HTR

In this blog post, I want to introduce you to the basics of creating your own ground-truth training set for OCR/HTR, whether you’re building your own model or improving an existing one. Essentially, this is about how to create more data that can be used to train or retrain an algorithm to perform better on your historical texts. While this may seem obvious to many digital humanities practitioners, the task is often carried out by subject matter experts from the humanities who may not be as familiar with the technical aspects.

Improving a baseline model by fine-tuning

When we talk about creating your own ground-truth dataset for handwritten text recognition (HTR) or optical character recognition (OCR), what you often want to do is fine-tune or improve an existing model. Many models are already available (for example on HTR United or Transkribus), so you might provide additional training pages to address specific issues where the model consistently makes mistakes. In fact,


The LaTeX Ninja is nominated for the DH Awards for 2019! Please vote!

I don’t usually do self-promotional posts, but today I’ll make an exception: The LaTeX Ninja blog’s DH category was nominated for the DH Awards 2019. So I warmly encourage you to vote for the Ninja and share this because, as the site’s 2019 FAQ states quite clearly, in the end the DH Awards are a popularity contest. So this is the reason for some shameless self-promotion today. I’ll be back with useful content soon. But this also serves me well in terms of my own blog time management, since I’m currently too busy with conference organization to be able to provide an insightful, thoughtful blog entry. Generally, anyone can vote: you don’t even need to be in DH yourself! So please do vote and get your grandma to vote too 😉

Am I allowed to vote?

Everyone is allowed to vote, voting is entirely open to the public. You do not even have to view yourself as
