How to create training (ground truth) data to improve OCR/HTR

In this blog post, I want to introduce you to the basics of creating your own ground-truth training set for OCR/HTR, whether you’re building your own model or improving an existing one. Essentially, this is about how to create more data that can be used to train or retrain an algorithm to perform better on your historical texts. While this may seem obvious to many digital humanities practitioners, the task is often carried out by subject matter experts from the humanities who may not be as familiar with the technical aspects.

Improving a baseline model by fine-tuning

When we talk about creating your own ground-truth dataset for handwritten text recognition (HTR) or optical character recognition (OCR), what you often want to do is fine-tune or improve an existing model. Many models are already available (for example on HTR United or Transkribus), so you might provide additional training pages to address specific issues where the model consistently makes mistakes. In fact,

How to annotate fast

In 2017, I had an internship at the German Historical Institute in Paris, where my task was to annotate around 1,200 regests. My job was to get them from source data in a Word document to TEI-XML in the end, but the project requirement was to keep everything in Word as long as possible. So, the annotations had to be done using Formatvorlagen (which I believe are called ‘styles’ in English). In this post, I want to reflect on what helped me finish the annotation in three weeks (when previous estimates were that it would take about a year). Even if your project is different from this setup, these lessons may help you make your own annotation projects more efficient. Back then, I wrote a tutorial for people in the project who might want to continue my approach after I was gone. It’s in German, but you can use LLMs to translate it for you if you’re interested.

What’s the deal with modelling in Digital Humanities?

Modelling is central to the Digital Humanities. So much so that some claim it is what unites the DH as a field or discipline! But what is modelling? What do we mean by it anyway? This post will hopefully provide you with the primer you need.

Sorry for the very sporadic blogging lately. I still haven’t figured out how to fit blogging into my PostDoc life. I think I want to get to a rhythm of around 1-2 posts per month. More than that is absolutely not realistic but, as you may have realized, I didn’t even manage that consistently over the last year. Then again, it’s not like I’m not producing teaching materials anymore. Most of my efforts this year have gone into all the classes I have been teaching (I’m hoping to share slides and teaching materials for all of them once they are cleaned up) – I have taught an intro to text mining, my usual information

What you really need to know about Digital Scholarly Editing

Today’s post is a short introduction to digital scholarly editing. I will explain some basic principles (so mostly theory) and point you to a few resources you will need to get started in a more practical fashion. I’m teaching a class on digital scholarly editing this term, so I thought I could use the opportunity to write an intro post on this important topic.

How does a Digital Edition relate to an analogue scholarly edition?

Unlike analogue scholarly editions, digital editions are not exclusive to text and they overcome the limitations of print by following what we call a digital paradigm rather than an analogue one. This means that a digital edition cannot be given in print without loss of content or functionality. A retrodigitized edition (an existing analogue edition which is digitized and made available online), thus, isn’t enough to qualify as a digital edition because it follows the analogue paradigm. Ergo: It’s not about the storage medium. A

A Primer on Version Control and Why You Need It

Today’s post is a quick introduction to version control as a concept and version control systems. It explains what they are and why you should be using them. I was just sending one of my best old-timey blogposts to a friend (How to quit MS Word for good), ended up re-reading it and realized that therein, I had promised that I would write a blog post on version control some day. And, if I’m not mistaken, I never followed up on that. So here you are, a short post on version control just to keep things going on the blog.

What is Version Control?

So I read this book a few years ago. The Complete Software Developer’s Career Guide: How to Learn Programming Languages Quickly, Ace Your Programming Interview, and Land Your Software Developer Dream Job by John Sonmez (Simple Programmer 2017). While I’m not that fond of its author anymore since I realized that he uses his platform to

Navigating interdisciplinarity (as a DH scholar)

Today’s post is something at the interface of rant and rambling. While I love being interdisciplinary, it’s also quite the hassle at times, which is why I guess most interdisciplinary scholars sometimes wish they weren’t doing interdisciplinary work. There are so many negative stereotypes, like…

“You have it easier being interdisciplinary” vs it’s actually twice the work

So do you really think that we have it easier? I hate how we always get this reproach that we’re taking the easy route. Can somebody please explain to me what’s “easy” about having to follow the state of the art in multiple fields at the same time? And then not even knowing where to get published because scholars from discipline A don’t understand half of your research, and the same in the other direction. I tend to be somewhat “too historical” for the Digital Humanities but then waaaay too technical for the “normal Humanities”. I think being in the DH and doing

The most important book to read if you want to learn Digital Humanities, Computer Science, Maths, Programming or LaTeX

Today I wanted to share a tiny book review of the book I claim to be the most important book you should read if you want to learn any technical topic but are unsure if you are up for it. The book I’m talking about is not by Donald Knuth (although his books are highly recommended, especially if you’re a (La)TeX nerd!). It’s not even a computer book! I’m talking about: Mindset: The New Psychology of Success by Carol Dweck (New York: Random House 2006).

The fixed mindset versus the growth mindset

This will be a short post because Dweck’s message is simple. There are two mindsets, the ‘fixed mindset’ and the ‘growth mindset’, and which one you have greatly impacts your success in learning and self-development. The ‘fixed mindset’ assumes your abilities and talents are fixed. Thus, you are proud of what you’re good at because you link it to your personality (“I’m a person who is good at…”). But

LaTeX for thesis writing

Having re-read my LaTeX for PhD students post, I realized I hadn’t mentioned a lot of things I would like to impart to you. So here comes LaTeX for thesis writing – a few more arguments in favour of starting to learn LaTeX now. Just to sum up what has already been said in the last post: The main points speaking in favour of you typesetting your thesis in LaTeX are the citation management, tables, maths and images, which can be more of a hassle in MS Word. In the aforementioned blogpost, I also added that you should take into account that a thesis will yield two PDF outputs with very different requirements from the same document – another reason to use LaTeX. But there are many more things to take into account.

LaTeX for maths, images and the like (in short, everything MS Word isn’t good at)

A lot of people say that the “LaTeX is great for maths”

List of Resources for getting started with (teaching) digital methods

Having just attended a talk at an event on Digital Humanities and Neo-Latin, I was inspired to share a short list of introductory resources on DH, especially for teachers who feel more like Humanities scholars and don’t have tons of time to learn everything autodidactically. They can use those resources to learn for themselves and pass on this knowledge or pass on this link. But also, since you’ve found this blog, you’re already on a great path to learning DH! 🙂 I’ll try to keep this updated – and it’s not really done yet, so feel free to contribute.

Discipline-independent DH

dariahTeach: great MOOCs on many topics
Source criticism in a digital age
DARIAH-EU DH course registry
EADH Courses List

Digital Classics

Article by yours truly in German: Digitale Lernplattformen und Open Educational Resources im Altsprachlichen Unterricht I. Technische Spielräume am Beispiel des ›Grazer Repositorium antiker Fabeln‹ (GRaF). It contains a few resources on digital resources and digital teaching, mostly

Machine Learning for the Humanities: A very short introduction and a not-so-short reflection

Machine Learning is one of those hot topics at the moment. It’s even starting to become a really hot topic in the Humanities and, of course, also in the DH. But Humanities and Machine Learning are not the most obvious combination for many reasons. Tutorials on how to run machine learning algorithms on your data are starting to pop up in large quantities, even for the DH. But I find it problematic that they often just use those methods, just show you those few lines of code to type in and that’s it. Frameworks have made sure that ML algorithms are easy to use. They actually have a super-low entry level programming-wise thanks to all those libraries. But the actual thing about ML is that you need to understand it or it’s good for nothing. (Ok, I admit there are some uses which are pretty straightforward and don’t need to be fully understood by users, such as Deep Learning powered
