How to create training (ground truth) data to improve OCR/HTR

In this blog post, I want to introduce you to the basics of creating your own ground-truth training set for OCR/HTR, whether you’re building your own model or improving an existing one. Essentially, this is about how to create more data that can be used to train or retrain an algorithm to perform better on your historical texts. While this may seem obvious to many digital humanities practitioners, the task is often carried out by subject matter experts from the humanities who may not be as familiar with the technical aspects.

Improving a baseline model by fine-tuning

When we talk about creating your own ground-truth dataset for handwritten text recognition (HTR) or optical character recognition (OCR), what you often want to do is fine-tune or improve an existing model. Many models are already available (for example on HTR United or Transkribus), so you might provide additional training pages to address specific issues where the model consistently makes mistakes. In fact,

read more How to create training (ground truth) data to improve OCR/HTR
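To find the pages where a model "consistently makes mistakes", one common approach is to compare its output with a hand-corrected transcription line by line. The following is a minimal sketch of that idea, not taken from the post: it computes a character error rate (edit distance divided by the length of the ground-truth line) in plain Python. The function names and the two sample lines are hypothetical and invented for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, prediction: str) -> float:
    """Edit distance divided by the length of the ground-truth line."""
    if not ground_truth:
        raise ValueError("ground truth line must not be empty")
    return levenshtein(prediction, ground_truth) / len(ground_truth)


# Invented sample pairs: (ground-truth transcription, model output).
lines = [
    ("vnd das ist war", "vnd das ist war"),
    ("anno domini 1523", "anno dornini 1528"),
]

for truth, predicted in lines:
    # Lines with a high rate are candidates for additional training pages.
    print(f"{character_error_rate(truth, predicted):.3f}  {predicted!r}")
```

Sorting lines or pages by this rate is one way to decide which material is worth transcribing by hand first.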

How to annotate fast

In 2017, I had an internship at the German Historical Institute in Paris, where my task was to annotate around 1,200 regests. My job was to turn them from source data in a Word document into TEI-XML, but the project requirement was to keep everything in Word as long as possible. So, the annotations had to be done using Formatvorlagen (which, I believe, are called ‘styles’ in English). In this post, I want to reflect on what helped me finish the annotation in three weeks (when previous estimates were that it would take about a year). Even if your project is different from this setup, these learnings may help you make your own annotation projects more efficient. Back then, I wrote a tutorial for people in the project who might want to continue my approach after I was gone. It’s in German but you can use LLMs to translate it for you if you’re interested.

read more How to annotate fast
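The general idea of annotating with Word styles and converting to TEI-XML later can be sketched in a few lines. This is only an illustration under stated assumptions, not the workflow used in the project: the style names (`Person`, `Ort`, `Datum`), the mapping, and the sample sentence are all invented, and the text runs are given as plain Python tuples rather than read from an actual Word file.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word character-style names to TEI element names.
STYLE_TO_TEI = {
    "Person": "persName",
    "Ort": "placeName",
    "Datum": "date",
}


def runs_to_tei(runs):
    """Build a TEI <p> from (text, style_name) pairs; style_name may be None."""
    paragraph = ET.Element("p")
    last_child = None
    for text, style in runs:
        tag = STYLE_TO_TEI.get(style)
        if tag is not None:
            # A styled run becomes its own TEI element.
            last_child = ET.SubElement(paragraph, tag)
            last_child.text = text
        elif last_child is None:
            # Unstyled text before the first element.
            paragraph.text = (paragraph.text or "") + text
        else:
            # Unstyled text following an element is stored as its tail.
            last_child.tail = (last_child.tail or "") + text
    return paragraph


# Invented sample: text runs as they might come out of a styled document.
runs = [
    ("On ", None),
    ("12 May 1250", "Datum"),
    (", ", None),
    ("Louis IX", "Person"),
    (" issues a charter in ", None),
    ("Paris", "Ort"),
    (".", None),
]

print(ET.tostring(runs_to_tei(runs), encoding="unicode"))
```

The point of such a mapping is that the annotator only ever picks a style in Word, while the decision about which TEI element it becomes lives in one small table.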

What you really need to know about Digital Scholarly Editing

Today’s post is a short introduction to digital scholarly editing. I will explain some basic principles (so mostly theory) and point you to a few resources you will need to get started in a more practical fashion. I’m teaching a class on digital scholarly editing this term, so I thought I could use the opportunity to write an intro post on this important topic.

How does a Digital Edition relate to an analogue scholarly edition?

Unlike analogue scholarly editions, digital editions are not exclusive to text and they overcome the limitations of print by following what we call a digital paradigm rather than an analogue one. This means that a digital edition cannot be given in print without loss of content or functionality. A retrodigitized edition (an existing analogue edition which is digitized and made available online), thus, isn’t enough to qualify as a digital edition because it follows the analogue paradigm. Ergo: It’s not about the storage medium. A

read more What you really need to know about Digital Scholarly Editing

A Primer on Version Control and Why You Need It

Today’s post is a quick introduction to version control as a concept and version control systems. It explains what they are and why you should be using them. I was just sending one of my best old-timey blogposts to a friend (How to quit MS Word for good), ended up re-reading it and realized that therein, I had promised that I would write a blog post on version control some day. And, if I’m not mistaken, I never followed up on that. So here you are, a short post on version control just to keep things going on the blog.

What is Version Control?

So I read this book a few years ago: The Complete Software Developer’s Career Guide: How to Learn Programming Languages Quickly, Ace Your Programming Interview, and Land Your Software Developer Dream Job by John Sonmez (Simple Programmer 2017). While I’m not that fond of its author anymore since I realized that he uses his platform to

read more A Primer on Version Control and Why You Need It

First ever LaTeX Ninja workshop at Harvard: “Beyond TEI: Digital Editions with XPath and XSLT for the Web and in LaTeX”

It’s been awfully quiet on this blog but actually, there’s lots of Ninja activity going on right now: I’m excited to announce that I will give the first ever official LaTeX Ninja workshop, in person at Harvard in about two weeks! It’s called “Beyond TEI: Digital Editions with XPath and XSLT for the Web and in LaTeX”. (Apart from that, there’s a short book review coming up in TUGboat.) Since there are probably a good number of people who would be interested in such a workshop but can’t attend in person, I will share the slides and teaching materials on GitHub later on. That way, they can be reused for self-study. This blogpost gives an outline of the contents of the workshop and contains links to related posts on this blog. Participants might want to read some of them in preparation or as an additional resource. [Get to the github repo with all the materials (`additional resources’ directory)

read more First ever LaTeX Ninja workshop at Harvard: “Beyond TEI: Digital Editions with XPath and XSLT for the Web and in LaTeX”
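For readers who have not yet seen XPath applied to TEI, here is a small sketch of what such a query looks like. It is not part of the workshop materials: the TEI snippet is invented, and the example uses Python's standard `xml.etree.ElementTree`, which covers only a subset of XPath and no XSLT, so it should be read as a first taste rather than the tooling the workshop uses.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the "tei" prefix is a local choice for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, kept tiny on purpose.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>A letter from <persName ref="#p1">Ada</persName> to
       <persName ref="#p2">Grace</persName>, sent from
       <placeName>London</placeName>.</p>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# All person names anywhere in the document.
names = [el.text for el in root.findall(".//tei:persName", TEI_NS)]

# Only the person name carrying a specific @ref value.
with_ref = root.findall(".//tei:persName[@ref='#p2']", TEI_NS)

print(names)
print([el.text for el in with_ref])
```

Forgetting the namespace is the most common reason such queries return nothing, which is why the prefix mapping is passed to every `findall` call.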

The LaTeX Newbie’s Guide to Using Overleaf for Conference Paper Submissions

Conference paper submissions in LaTeX are becoming increasingly popular in fields outside the technical disciplines (which embraced them long ago). Be it the Digital Humanities or historians wanting to contribute to events such as HistoCrypt, LaTeX templates for submission are getting more widely adopted. That’s why I wanted to dedicate this first post of 2022 to this important topic, so you are ready for LaTeX conference paper/abstract submissions in 2022!

How to get started quickly and what to be aware of

Find the template to use. The conference will probably have prepared a template you’re supposed to use. Since a principle behind TeX/LaTeX is the separation of form and content, you really only need to focus on writing your text. The layout will be provided by the conference organizers in said template. If this template isn’t on Overleaf yet, download it and upload it as a new project in the online LaTeX editor Overleaf. This will

read more The LaTeX Newbie’s Guide to Using Overleaf for Conference Paper Submissions

How to maintain Twitter with little effort as an academic: The Ninja’s “How to better promote your content on Twitter” Guide. Part 5

Academic Twitter can be an important tool for networking, we get it. But I’ve talked to more and more colleagues who have given up on Twitter because they felt that they couldn’t make it work and also didn’t want to spend unreasonable amounts of time on it. I get that too. Apart from the Twitter experiment I did in November 2020 and times when there’s relevant stuff going on, I also want to minimize time spent on social media/Twitter as much as possible. But, to my great surprise, I realized my accounts are still growing even though I’m not doing much. That’s when I thought “Wait, this could be relevant for my readers” and decided to explain to you what I did.

The goal: Setting your Twitter account up right for a relatively low-maintenance Twitter presence with some growth

In my experience, many academics sign up for Twitter and then never get on Twitter again because they don’t know how

read more How to maintain Twitter with little effort as an academic: The Ninja’s “How to better promote your content on Twitter” Guide. Part 5

List of Resources for getting started with (teaching) digital methods

Having just attended a talk at an event on Digital Humanities and Neo-Latin, I was inspired to share a short list of introductory resources on DH, especially for teachers who feel more like Humanities scholars and don’t have tons of time to learn everything autodidactically. They can use those resources to learn for themselves and pass on this knowledge or pass on this link. But also, since you’ve found this blog, you’re already on a great path to learning DH! 🙂 I’ll try to keep this updated – and it’s not really done yet, so feel free to contribute.

Discipline-independent DH

dariahTeach: great MOOCs on many topics
Source criticism in a digital age
DARIAH-EU DH course registry
EADH Courses List

Digital Classics

Article by yours truly in German: Digitale Lernplattformen und Open Educational Resources im Altsprachlichen Unterricht I. Technische Spielräume am Beispiel des ›Grazer Repositorium antiker Fabeln‹ (GRaF). It contains a few resources on digital resources and digital teaching, mostly

read more List of Resources for getting started with (teaching) digital methods

Long-Term Twitter Strategizing: The Ninja’s “How to better promote your content on Twitter” Guide. Part 4

Twitter is an important professional networking platform for the Digital Humanities. But it’s not exactly self-evident how to make it work in your favour. This part explains long-term Twitter strategy. This involves a few elements which might not seem so important at first but which will help keep your Twitter presence active and growing steadily, without you having to constantly put in lots of effort. This includes scheduled Tweets and using analytics and reflection to determine how best to cater to the interests of your followers. As some of you might remember, I did a Twitter Engagement Experiment sometime last autumn. Now I wanted to share my most important learnings, so you can make your Twitter presence more effective with just as little work as you want to put in. Actually, this was all just meant to be one post but it got so crazy long that I decided to make it into a

read more Long-Term Twitter Strategizing: The Ninja’s “How to better promote your content on Twitter” Guide. Part 4

Bio Engineering, Tweet Structure or How to lure your audience: The Ninja’s “How to better promote your content on Twitter” Guide. Part 3

Twitter is an important professional networking platform for the Digital Humanities. But it’s not exactly self-evident how to make it work in your favour. This part explains how you can improve the rate at which you gain followers by immediately giving them reliable information about the value you provide for them. The best and quickest way to achieve that is a one-time improvement session for your bio! If that’s not a life hack 😉 As some of you might remember, I did a Twitter Engagement Experiment sometime last autumn. Now I wanted to share my most important learnings, so you can make your Twitter presence more effective with just as little work as you want to put in. Actually, this was all just meant to be one post but it got so crazy long that I decided to make it into a series of digestible short posts.

More Twitter Growth/Strategy advice

8) Set up your

read more Bio Engineering, Tweet Structure or How to lure your audience: The Ninja’s “How to better promote your content on Twitter” Guide. Part 3