How to create training (ground truth) data to improve OCR/HTR

In this blog post, I want to introduce you to the basics of creating your own ground-truth training set for OCR/HTR, whether you’re building your own model or improving an existing one. Essentially, this is about how to create more data that can be used to train or retrain an algorithm to perform better on your historical texts. While this may seem obvious to many digital humanities practitioners, the task is often carried out by subject matter experts from the humanities who may not be as familiar with the technical aspects.

Improving a baseline model by fine-tuning

When we talk about creating your own ground-truth dataset for handwritten text recognition (HTR) or optical character recognition (OCR), what you often want to do is fine-tune or improve an existing model. Many models are already available (for example on HTR United or Transkribus), so you might provide additional training pages to address specific issues where the model consistently makes mistakes. In fact,


How to annotate fast

In 2017, I had an internship at the German Historical Institute in Paris, where my task was to annotate around 1,200 regests. My job was to get them from source data in a Word document to TEI-XML in the end, but the project requirement was to keep everything in Word as long as possible. So, the annotations had to be done using Formatvorlagen (which I believe are called ‘styles’ in English). In this post, I want to reflect on what helped me finish the annotation in three weeks (when previous estimates were that it would take about a year). Even if your project is different from this setup, these lessons may help you make your own annotation projects more efficient. Back then, I wrote a tutorial for people in the project who might want to continue my approach after I was gone. It’s in German but you can use LLMs to translate it for you if you’re interested.


How to write a (Digital Humanities) abstract? Lessons Learned from Reviewing

Writing abstracts and conference submissions is a key element of academic life, yet I find that there is little guidance for those new to the activity. There are many things to know that will (in my experience) drastically increase your chances of getting accepted. Acting as a reviewer teaches you a lot about becoming a better writer, just like evaluating applications teaches you how to write good applications. However, very few young academics actually get the opportunity to see the other side of the process. Usually, evaluation work is dominated by more senior academics, leaving newcomers dependent on their guidance, support, and mentorship. But not everybody has equal access to dedicated mentors with enough time on their hands to teach these basics effectively. It’s a flaw in our academic system that we invest so little time and energy in training, but it’s not the individuals’ fault – the system simply doesn’t value activities that don’t result in


About the Radio Silence

I haven’t blogged for a while. I owe you an apology and an explanation, so here goes…

Why I’m not blogging

I’ve been reflecting a lot over the last few years on why I don’t really blog anymore. I think part of it is that when I was still actively maintaining the blog, I was creating teaching content but wasn’t really teaching much in comparison. Now, with the increased amount of teaching I do, it has kind of replaced my blogging, which is sad. But I also didn’t initially know what to do about it – or maybe it just wasn’t the right time. I keep thinking about the many concepts I teach each semester that would make great blog posts and about how I should put them out there. Given that I have all those teaching materials, it would be easy to distill some of it into a blog post, yet I find that I simply don’t. Don’t get


What’s the deal with modelling in Digital Humanities?

Modelling is central to the Digital Humanities. So much so that some claim it is what unites DH as a field or discipline! But what is modelling? What do we mean by it anyway? This post will hopefully provide you with the primer you need.

Sorry for the very sporadic blogging lately. I still haven’t figured out how to fit blogging into my PostDoc life. I think I want to get to a rhythm of around 1-2 posts per month. More than that is absolutely not realistic but, as you may have realized, I didn’t even manage that consistently over the last year. Then again, it’s not like I’m not producing teaching materials anymore. Most of my efforts this year have gone into all the classes I have been teaching (I’m hoping to share slides and teaching materials for all of them once they are cleaned up) – I have taught an intro to text mining, my usual information


What you really need to know about Digital Scholarly Editing

Today’s post is a short introduction to digital scholarly editing. I will explain some basic principles (so mostly theory) and point you to a few resources you will need to get started in a more practical fashion. I’m teaching a class on digital scholarly editing this term, so I thought I could use the opportunity to write an intro post on this important topic.

How does a Digital Edition relate to an analogue scholarly edition?

Unlike analogue scholarly editions, digital editions are not exclusive to text and they overcome the limitations of print by following what we call a digital paradigm rather than an analogue one. This means that a digital edition cannot be given in print without loss of content or functionality. A retrodigitized edition (an existing analogue edition which is digitized and made available online), thus, isn’t enough to qualify as a digital edition because it follows the analogue paradigm. Ergo: It’s not about the storage medium. A


A Primer on Version Control and Why You Need It

Today’s post is a quick introduction to version control as a concept and to version control systems. It explains what they are and why you should be using them. I was just sending one of my best old-timey blog posts to a friend (How to quit MS Word for good), ended up re-reading it and realized that I had promised in it to write a blog post on version control some day. And, if I’m not mistaken, I never followed up on that. So here you are, a short post on version control just to keep things going on the blog.

What is Version Control?

So I read this book a few years ago: The Complete Software Developer’s Career Guide: How to Learn Programming Languages Quickly, Ace Your Programming Interview, and Land Your Software Developer Dream Job by John Sonmez (Simple Programmer 2017). While I’m not that fond of its author anymore since I realized that he uses his platform to


First ever LaTeX Ninja workshop at Harvard: “Beyond TEI: Digital Editions with XPath and XSLT for the Web and in LaTeX”

It’s been awfully quiet on this blog but actually, there’s lots of Ninja activity going on right now: I’m excited to announce that I will give the first ever official LaTeX Ninja workshop, in person at Harvard in about two weeks! It’s called “Beyond TEI: Digital Editions with XPath and XSLT for the Web and in LaTeX”. (Apart from that, there’s a short book review coming up in TUGboat.) Since there are probably a good number of people who would be interested in such a workshop but can’t attend in person, I will share the slides and teaching materials on GitHub later on. That way, they can be reused for self-study. This blog post gives an outline of the contents of the workshop and contains links to related posts on this blog. Participants might want to read some of them in preparation or as an additional resource. Get to the GitHub repo with all the materials (‘additional resources’ directory)


Why I stopped my Twitter bots

As some of you might know, I have written about my Twitter bots on this blog a number of times. Now I have decided to shut them all down and wanted to give at least a short explanation of why. TLDR: It got unexpectedly expensive and I reasoned the benefit wasn’t really worth the price.

Off-topic note on posting schedule: As my devoted readers have probably noticed by now, we’re down to a bi-weekly posting schedule at the most. I have been thinking about it and I’m aiming for two posts per month for now. Actually that’s less than every second week. The point is, I have been reflecting on what makes this blog what it is and I think that’s the good posts I come up with every once in a while. When those add up, that makes the blog a useful resource for some time to come. It’s these posts that I want to focus on. And frankly,


How to write (Ancient) Greek in LaTeX

Because I’m a classicist by training, I have been wanting to broach the topic of how to typeset Ancient Greek in LaTeX for a long time. So today comes a short post on the topic. There are a number of ways you can approach this but most importantly, you need to decide whether you need just any Greek letters or Ancient Greek letters, because Ancient Greek has diacritics which aren’t featured in all (“normal”) Greek keyboards. This blog post covers the three sub-topics I deemed relevant to the question:

1. How do I get my Greek letters in the first place?
2. (related to 1) How/Where do I get a Greek keyboard and which one to choose?
3. How to typeset Greek in LaTeX?

How to get your Greek letters

If you’re just adding, say, a note on the origin of a word to your text, you might not even need to install a Greek keyboard at all. When I just


How to use Deliberate Practice to reach your Peak [Book Review]

Have you heard of the concept of “deliberate practice”? It’s a method for rapid skill acquisition through practicing in a certain way. The concept is discussed in detail in a highly recommended book: Peak: Secrets from the New Science of Expertise by K. Anders Ericsson and Robert Pool (2016). So here it is. At last. The long promised book review and summary of the most important takeaways from Peak. Ever wonder why you’re not improving at skills despite using them every day? You’re not using deliberate practice is why.

So what is deliberate practice anyway?

“[…] deliberate practice [is] a term coined by Ericsson to refer to the specific learning method used by experts to achieve superior performance in their fields, and mental representations.” (Wikipedia entry on Peak)

The book resulted from one of the top researchers in the science of expertise, K. Anders Ericsson, cooperating with science communicator Robert Pool to make his research understandable to the masses. Malcolm
