Creating an annotation for Computer Vision

How to create a ground truth data set for computer vision using Humanities data

Today’s blog post is a teaser for a video class called Computer Vision for Digital Humanities (funded by CLARIAH-AT with the support of BMBWF). The self-learning resource (video lessons plus Jupyter notebooks) is an introduction to Computer Vision methods for Digital Humanists. It addresses some Humanities-specific issues that typical introductions to computer vision do not cover, and this post is an example of that kind of reflection. It gives an insight into the first exercise of the class: filtering a list of metadata to create a ground truth dataset for training a classification algorithm. This blog post does not contain the actual data (so stay tuned for the video class!) but discusses the issues that arise when creating a ground truth data set for computer vision using Humanities data.

Citation suggestion: Suzana Sagadin & Sarah Lang, How to create a ground truth data set for computer vision using Humanities data, in LaTeX Ninja Blog, 04.07.2023. https://latex-ninja.com/2023/07/04/how-to-create-a-ground-truth-data-set-for-computer-vision-using-humanities-data/

Further links to the whole school: The YouTube playlist “Computer Vision for Digital Humanists” contains the video lectures, but there are also slides, a cheatsheet and, most importantly, Jupyter Notebooks so you can actually get your hands dirty and follow the PyTorch code. The accompanying blogpost is Learning how to Distant View: Computer Vision for Digital Humanists.

Goal of this session

This exercise is meant to fill a common gap in applying computer vision to the Humanities: if existing methods, tools, or software don’t work out of the box for my data, I might need to train my own model. For that to be possible, I first need to create a usable ground truth dataset. This is not a trivial task, however, and a lot of Humanities reflection goes into it.

Following the principle of GIGO (garbage in, garbage out), it is an integral part of a Digital Humanities computer vision task to include Humanities expertise and careful reflection already at this early stage, as this may have a profound impact on the model we end up training. When fine-tuning the performance of the model later, we need to investigate both the engineering aspects (can we improve the technical implementation?) and the Humanities interpretation (do these results make sense?).

Many computer vision tutorials and ‘Hello World!’ examples you can find online work with an existing dataset (such as Iris or MNIST). However, the skills you learn from running a hello-world machine learning algorithm on those datasets don’t necessarily transfer well to Humanities research questions, and tutorials that do transfer well are still hard to find. Also, since those common tutorials usually start with a ready-made dataset, Humanists can be at a loss as to how to even get started. There is a great Programming Historian tutorial on Deep Learning for Computer Vision but here, too, the focus is not on creating the dataset.

This session demonstrates how to pre-process and filter a .csv file containing information about our image dataset. We have simplified it from the real dataset to make it more manageable but retained common issues. The only issue we have spared you is making sure you are allowed to reuse the data (!). Please make sure you have the appropriate licenses before starting your own project. Preparing a dataset is a lot of work and we would not want you to do it in vain, only to realize later that you are not allowed to share your results.
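
As a very first step, it helps to simply load the CSV with pandas and look at it. Below is a minimal sketch of what that could look like; the filename metadata.csv is a placeholder, and the column names are the ones discussed further down in this post.

```python
import pandas as pd

# Placeholder filename; the actual CSV comes with the class materials.
df = pd.read_csv("metadata.csv")

# First overview: which columns are there and what do they contain?
print(df.head())
print(df["Type"].value_counts(dropna=False))        # which object types occur, and how often?
print(df["Date (From)"].isna().sum(), "rows have no start date")
```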

Another common task in creating ground truth might be annotating instances of objects in images and assigning labels to them. This can be used to train object detection or segmentation algorithms. For this, you might want to use a tool such as supervise.ly. For data management, we show you the Tropy software (https://tropy.org) in another session.

Creating ground truth

The goal of the practical sessions is to build an image classifier. Most images in our example dataset (which I will not go into detail about in this blog post, but stay tuned for the actual video class!) date from approximately 1860 to 1950 (with a few outliers). With this data, we will build an image classifier that estimates the century of origin (19th or 20th century) for a given image. To train a classifier, we need a labeled dataset. That means we have to associate each image in our dataset with a corresponding label. In our case, the label is the century of origin, either the 19th or the 20th century. These labels serve as the ‘ground truth’ that our classifier will learn from. This process, known as supervised learning, allows the model to learn the patterns and features of each century, which in turn makes it possible for the model to correctly guess the century of new images it hasn’t seen before.
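
As a side note, once such image–label pairs exist, feeding them to PyTorch is straightforward. The sketch below is not the code from the class but a minimal illustration, assuming the images sit in a single folder and the (filename, label) pairs come from the filtered metadata; the class name and structure are ours.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class CenturyDataset(Dataset):
    """Pairs each image file with its century label ('19' -> 0, '20' -> 1)."""

    def __init__(self, records, image_dir, transform=None):
        # records: list of (filename, label) tuples taken from the filtered CSV
        self.records = records
        self.image_dir = Path(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        filename, label = self.records[idx]
        image = Image.open(self.image_dir / filename).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        target = 0 if label == "19" else 1   # integer class index for the loss function
        return image, target
```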

So we want to prepare a dataset for a computer vision classification problem: distinguishing historical photos of the 19th century from those of the 20th century. In order to create the ground truth dataset (with photos labeled either ’19’ or ’20’), we need to filter the data. First, we will collect the relevant images. That is, we’re looking for images that we can confidently classify as either ’taken in the 19th century’ or ’taken in the 20th century’. We begin by looking very closely at our data in the CSV file and working out the criteria we need to filter by, to make sure we only label data from the 19th and 20th centuries as their respective century (this is more difficult than it might sound at first; remember these are Humanities data we are talking about!). After inspecting the CSV file, it becomes clear that the data is a bit inconsistent and, unfortunately, we cannot use all the pictures. We will need to filter the entries based on certain criteria. Here are some definitions we came up with.

Definitions: 

  • The 19th century: 1800 to 1899. 
  • The 20th century: 1900 to 1999. 

Please note that the question of how to define a century is actually a bit of a tricky one as well. You could also go with 1801-1900 and so on.

Filtering criteria: 

  • Only include rows where the ’Type’ is ’Photograph’ 
  • Remove entries where ’Date (From)’ is empty 
  • Exclude entries where ’Date (From)’ is less than 1900 and ’Date (To)’ is greater than or equal to 1900 
  • Exclude entries where ’Date (From)’ is less than 1900 and ’Date (Appendix)’ is ’after’
  • Exclude entries where ’Date (From)’ is greater than 1899 and ’Date (Appendix)’ is ’before’
  • Exclude entries where ’Date (From)’ is greater than 2000

Labels: 

Assign ’19’ to: 

  • ’Date (To)’ is less than 1900, or 
  • ’Date (From)’ is less than 1900 and ’Date (Appendix)’ is ’before’, or
  • ’Date (From)’ is less than 1900 and ’Date (Appendix)’ and ’Date (To)’ are empty

Assign ’20’ to: 

  • ‘Date (From)’ is greater than 1899

Some context for understanding what’s happening here: the dataset does not contain any data from before 1800 (the first photographs don’t appear until around 1830), so we don’t need to filter out data before that date, but it does contain data after 1999, which is why we do need to filter that out.
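
To make these criteria concrete, here is a minimal pandas sketch of how the filtering and labelling could be implemented. The column names and the values ’Photograph’, ’before’ and ’after’ are taken from the criteria above; the filename metadata.csv and the assumption that the date columns can be coerced to numbers are ours, so treat this as a sketch rather than the actual notebook code from the class.

```python
import pandas as pd
import numpy as np

df = pd.read_csv("metadata.csv")  # placeholder filename

# Make sure the date columns are numeric; anything unparseable becomes NaN.
for col in ["Date (From)", "Date (To)"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Filtering criteria from above.
df = df[df["Type"] == "Photograph"]
df = df[df["Date (From)"].notna()]
df = df[~((df["Date (From)"] < 1900) & (df["Date (To)"] >= 1900))]
df = df[~((df["Date (From)"] < 1900) & (df["Date (Appendix)"] == "after"))]
df = df[~((df["Date (From)"] > 1899) & (df["Date (Appendix)"] == "before"))]
df = df[df["Date (From)"] <= 2000].copy()

# Label assignment.
is_19 = (
    (df["Date (To)"] < 1900)
    | ((df["Date (From)"] < 1900) & (df["Date (Appendix)"] == "before"))
    | ((df["Date (From)"] < 1900) & df["Date (Appendix)"].isna() & df["Date (To)"].isna())
)
is_20 = df["Date (From)"] > 1899

df["label"] = np.select([is_19, is_20], ["19", "20"], default="drop")
ground_truth = df[df["label"] != "drop"]
```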

Thinking about what classification we are teaching our algorithm

Centuries are, in fact, relatively arbitrary borders for classification. Distinguishing certain style periods would be much more distinctive. Also, decisions like whether 1900 is included in the 20th century or not can cause data errors, so this is an important lesson when preparing a ground truth dataset. Pay close attention! The turn of the century is something of an artificial caesura that does not necessarily coincide with a sudden change in style. Photos from around the turn of the century, say around 1890 to 1910, are hard to distinguish and classify, even for humans. Therefore, we have to expect the machine learning algorithm to struggle with this particular time span as well.

In fact, we should make use of confusion matrices in the analysis and interpretation stage to check where our algorithm particularly falls short.

A confusion matrix (also table of confusion) is a table with two rows and two columns that reports the number of true positives, false negatives, false positives, and true negatives. (https://www.geeksforgeeks.org/confusion-matrix-machine-learning/)
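
For illustration, scikit-learn can compute such a matrix in a couple of lines. The labels below are made up; in practice they would come from the test set and the trained classifier.

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Made-up ground-truth and predicted labels for a handful of test images.
y_true = ["19", "19", "20", "20", "19", "20"]
y_pred = ["19", "20", "20", "20", "19", "19"]

cm = confusion_matrix(y_true, y_pred, labels=["19", "20"])
print(cm)  # rows = true century, columns = predicted century

# Optional: plot the matrix for the write-up.
ConfusionMatrixDisplay(cm, display_labels=["19th c.", "20th c."]).plot()
plt.show()
```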

In fact, it might make sense to exclude particularly ambiguous data from the training set, or to define separate classes for more coherent time periods rather than working with an artificial distinction like we do in our example. However, since this very simple (maybe oversimplified!) binary classification problem is supposed to function as a “Hello world!” introduction to applying computer vision to a Humanities problem, the simplification still makes sense here.

However, on the Humanities side, it becomes obvious that Humanities data remain difficult and should not be underestimated (read more here: https://mfafinski.github.io/Historical_data/). If we wanted to do this properly and produce actual, sensible research from this problem, there would be many more factors to consider. If we gave this some more thought, we might realize that the assumption of distinguishing between centuries does not make all that much sense from a research perspective in the first place.

But we will not overthink this for now, to keep things simple, in the hope that this will help us understand the machine learning process before we progress to working on actual research questions for Digital Humanities. We have, thus, simplified the date entries and the whole research question a little to make the learning process easier, as our main focus in this class is not to explore the complexities of dating historical artifacts, but to show you how to organise your own collection effectively for machine learning purposes.

Filtering and pre-processing our dataset to create ground truth

The dataset needs to be filtered based on specific criteria to create a ground truth dataset for training the computer vision classification problem of distinguishing 19th-century from 20th-century historical photos. The filtering process is necessary due to the makeup of the dataset we procured: it was created for purposes different from ours, so it needs to be pre-processed first.

  1. Filtering by ’Type’: Only including rows where the ’Type’ is ’Photograph’ ensures that only entries relevant to the classification problem, namely photographs, are considered. The dataset includes a few other object types that we do not wish to include.
  2. Filtering by empty ’Date (From)’ values: Removing entries where ’Date (From)’ is empty ensures that we have complete and reliable date information for each photo. We cannot use images with incomplete metadata, so we have to exclude them from the ground truth (since we cannot be sure when they are from).
  3. Filtering by date ranges:
  • Excluding entries where ’Date (From)’ is less than 1900 and ’Date (To)’ is greater than or equal to 1900 ensures that photos whose century we cannot determine with certainty are excluded from our ground truth: feeding the algorithm ambiguous or incorrect examples could confuse it. It is common in Humanities data to have date ranges rather than exact dates when the precise date is unknown. Many computer applications, however, do not fare well with such imprecise data. Thus, we need to eliminate the rows in our data frame that are unsuitable for our ground truth dataset.
  • Excluding entries where ’Date (From)’ is less than 1900 and ’Date (Appendix)’ is ’after’ follows basically the same reasoning as above: it ensures that photos with uncertain dates, or dates that extend beyond the 19th century, are not included in the 19th-century category.
  • Excluding entries where ’Date (From)’ is greater than 1899 and ’Date (Appendix)’ is ’before’ ensures that photos with uncertain or vague dating that starts before the 20th century are not included in the 20th-century category.
  • Excluding entries where ’Date (From)’ is greater than 2000 removes data beyond the scope of our classification problem (i.e. dated after 2000) from the dataset.

These filtering criteria are essential to ensure that the ground truth dataset accurately represents the distinction between historical photos from the 19th and 20th centuries. When we do the filtering, however, it is crucial to be as precise as possible. Since data is often sparse in the Humanities and many datasets are not exactly ‘big data’, we cannot afford to lose large amounts of data to imprecise filtering. For proper filtering, we need to know our dataset well, so that we neither include misleading or incomplete data nor throw away data we could have kept had we investigated properly.
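
One practical way to stay precise is to log how many rows each filtering step removes and to check the class balance at the end. The snippet below sketches this idea, again using the placeholder metadata.csv and the column names from above.

```python
import pandas as pd

df = pd.read_csv("metadata.csv")  # placeholder filename, as above
print("rows before filtering:", len(df))

df = df[df["Type"] == "Photograph"]
print("after keeping only photographs:", len(df))

df = df[df["Date (From)"].notna()]
print("after dropping rows without a start date:", len(df))

# ... apply the remaining date-range filters from the earlier sketch ...

# Once the labels are assigned, check the class balance: a heavily skewed
# split between '19' and '20' would bias the classifier.
# print(ground_truth["label"].value_counts())
```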

Conclusion

It is important to note that categorizing photos based on centuries is a relatively arbitrary approach and may not always capture distinct style periods. The inclusion of the year 1900 in the 20th century can also lead to data errors or uncertainties. Photos from around the turn of the century (for example 1890–1910) may exhibit similar characteristics, making it challenging for both humans and machine learning algorithms to classify them accurately.

This highlights the need to employ tools like confusion matrices to assess algorithm performance and identify areas where classification may be particularly challenging. However, for this introductory exercise in applying computer vision to a humanities problem, this simplified binary classification problem can still serve as a valuable starting point. It emphasizes the complexity of humanities data and the importance of approaching research questions with careful consideration of various factors.

I hope this made you excited for the video school – I will keep you updated when it comes out (presumably early September 2023)! EDIT: The class is now available; see the links to the YouTube playlist, slides, cheatsheet and Jupyter Notebooks at the beginning of this post.
