Machine Learning for the Humanities: A very short introduction and a not-so-short reflection

Machine Learning is one of those hot topics at the moment. It’s even starting to become a really hot topic in the Humanities and, of course, also in the DH. But Humanities and Machine Learning are not the most obvious combination for many reasons. Tutorials on how to run machine learning algorithms on your data are starting to pop up in large quantities, even for the DH. But I find it problematic that they often just use those methods, just show you those few lines of code to type in and that’s it.

Frameworks have made sure that ML algorithms are easy to use. They actually have a super-low entry level programming-wise thanks to all those libraries. But the actual thing about ML is that you need to understand it or it’s good for nothing. (Ok, I admit there are some uses which are pretty straightforward and don’t need to be fully understood by users, such as Deep Learning powered lemmatization of orthographically non-normalized historical languages.) But I still felt like I needed to share my (theoretical) thoughts on it too, and give a little primer on the practicals.

In this post, I will give you a few little rants on ML and DH, a short tutorial and some resources. If you don’t want my ramblings, you can skip to the short tutorial right away.

Tutorials on the practicals are easy to find, however. That’s why I wanted to supplement the existing materials out there with a short theoretical primer. But bear with me, it won’t get boring. It also won’t go incredibly deep but I hope it might fill in the gaps for many of you who are interested, feeling like they’re falling behind if they don’t know all that much about ML or need to learn it for a job. Ready-to-use tutorials are great. But when you’re in the Humanities, you should at least have a vague idea of what’s actually going on.

Computational Methods & DH in a Nutshell: The answer is 42

I promised that this would be fun, so here goes: Enter The Hitchhiker’s Guide to the Galaxy. As per the Hitchhiker’s Guide, we all know (or should know!) that the answer to everything is 42. But also, that it becomes obvious really quickly that this information isn’t all that helpful if we don’t know what the question was. This is ML / Computational Methods and the DH in a nutshell (and, of course, horribly simplified but hey, remember didactical reduction).

If we were to simplify things quite a bit, then we could say that essentially, any result we get from all Computational Methods is 42. Or a number like 42. And like with 42, this answer will only really mean anything to us if we know what the question was, and if we can determine that the question made sense, that the answer is actually a valid answer for the given question and that the answer actually helped us progress with our Humanities type research questions.

In the DH, a result is only a result if we can explain how we got there using classical Humanities hermeneutics. However, most computational methods don’t exactly lend themselves to being explained in classical Humanities terms. So technically, ML results aren’t really results in the context of the Humanities. What we do with the result is the result, if you get what I mean. We need to very carefully select our research questions so that we may even get a half-useful result at all.

Given that explaining results using Humanities hermeneutics is not at all possible in Machine Learning (at least most times), we are faced with a dilemma: Either we ask questions so detailed that we can actually get an understandable answer to them, or we ask a big question, let an unsupervised algorithm run over it and then try to make research out of the disparities between what the humans saw in the data and what the machine saw in it. That way, we might be able to learn new methodologies from a machine or identify new features the machine determined to be helpful in answering our questions that we hadn’t seen before and that might not even be analyzable by hand due to their complexity. That’s a win then.

But (and I have put out this cheeky statement now, so please prove me wrong!) I feel that this sort of reflection has not entered DH tutorials for ML yet. They often seem scarily mechanical and start using code right away before anyone has even had the slightest chance to understand the principles involved, what it means when a machine “learns”, etc. So far, the theoretical reflection remains hidden in theoretical volumes (which you can only really understand once you already kind of understand machine learning algorithms) but the tutorials trying to get DH people to use ML quickly (using frameworks which are quick but also hide practically everything that’s actually going on) don’t contain even a side note on potential fallacies. This is a problem! We need to resolve this disparity as soon as possible or we will both split the field into those doing ML in secret and the rest who don’t understand it, so they can’t criticize it, and also pull the pin on our Humanities roots (by working with data like a Computer Scientist would and not in the way a Humanities scholar would). ML only gets really cool for the DH when we’re doing it in a well-reflected way. We can only get the most out of it once we’ve arrived at this point.

So, long story short – in order to be able to actually use ML in DH research, we need to learn how we can validate the results ML gives us in terms of the hermeneutics of the Humanities and understand the algorithms it uses.

After all of this ramble, let’s get to my short tutorial.

The promised short intro to ML for the DH

How ML algorithms “learn”

Tom Mitchell’s much-quoted definition (from his 1997 textbook Machine Learning) says that a program learns from Experience E with respect to a Task T and a Performance measure P if its performance at T, as measured by P, improves with E. In the case of a spam filtering program, the Task T is classifying emails as spam or not spam. The Experience E comes from the program (say, Gmail) watching you label your own emails as spam or not. The Performance P is measured as the fraction of emails classified correctly.

But what does that actually mean? This definition doesn’t really tell us how it’s done in practice.

Now, ML is very mathematical really. Take the example of supervised learning. Imagine you plot all your given data (which might be a lot but not an infinite number) on a coordinate system. The machine will learn by trying to “fit” a function around your given data which fits it as well as possible. So basically it “invents” a function which can explain the data that was fed into it. We basically assume that the data you gave to it is incomplete, so we ask the algorithm to fill in all the parts until we arrive at a continuous function/line (as opposed to just the single coordinate points we had before). Please ponder this until it makes sense, this is very important and I feel that I didn’t find the most appropriate words for it. Once you’ve pondered this, you’re allowed to go on ; )

This resulting function is then called a model for your given use case. As we know from other models in the DH, the performance and results gained using models are highly dependent on the quality of the model and how well the features represented in the model model the reality to be modelled. (Sorry for saying model so many times.)

Once the model exists, we use it on the test data to make what we call a prediction. If you go back to our mathematical Cartesian coordinate system, remember that we supplied it with a few data points and basically filled in the rest, as best as we could, by connecting the dots with a line (i.e. inventing a mathematical function which describes the data as well as possible). Now we can test whether it works on the rest too by inputting new x values and checking (using the test data) whether the output of our model function (i.e. the predicted y-value) agrees with the y-value we actually know to be correct.

We can further optimize our model using error or cost functions. (I’ll say a few more words about those later.) Anyway, this process of finally using the ML model we generated by fitting (aka training) the model on our data, is called predicting. Now that you’ve (hopefully) understood more or less how this process works, you’ll probably see why fitting is actually the more fitting term for the process described (haha, terrible pun here).
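
If you would like to see the fit-then-predict idea in a few lines, here is a minimal sketch in plain Python. Please read it as an illustration under assumptions, not as how any particular library does it: the data points are invented, the function names fit_line and predict are my own, and I assume the simplest possible model (a straight line fitted by ordinary least squares).

```python
# Invented toy data: five (x, y) points that lie roughly on a line.
xs = [30.0, 45.0, 60.0, 75.0, 90.0]
ys = [96.0, 138.0, 187.0, 229.0, 275.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted line to produce a y-value for a new x."""
    slope, intercept = model
    return slope * x + intercept


# "Fitting" (training): invent the function that describes the data best.
model = fit_line(xs, ys)
slope, intercept = model

# "Predicting": ask the model about an x value it has never seen.
prediction = predict(model, 50.0)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={prediction:.2f}")
```

The slope comes out at roughly 3, so the model has “filled in” the line between and beyond the five dots. Whether that line means anything for your research question is, of course, the part no code can answer for you.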

Selecting training data and measuring accuracy or success of ML

We have learned that training (fitting) occurs on data we provide for it. Thus, in ML many problems can arise from having the model learn on badly chosen samples of data. If the data given is, in fact, not representative of the data the model is later supposed to be used on, shit will not work. I repeat: Won’t work. This might seem obvious now but with tons of data you don’t understand, determining this might be less obvious than it sounds now.

This is why in ML, it’s also very important to have functions which automatically measure the success of what’s going on, i.e. give an accuracy rate. The accuracy rate shows how well the model performs on test data it hasn’t seen before (i.e. whether its predictions are good). It can happen that a model learns so much from the training data that so-called “overfitting” occurs, i.e. the model learns the training data by heart and can spit it out perfectly but it didn’t actually learn anything it could apply to new data. It’s like with learning a language: just because you can perfectly spit back many items of vocabulary which were fed to you doesn’t say much about whether you actually learned the language. That is evaluated by testing how well you can react to circumstances you haven’t seen before and by how effectively you can use what you learned, i.e. communicate in the language. It’s the same thing for ML algorithms.

Also, one might assume that your success communicating in a given target language will depend on how well chosen the vocabulary was that you were given to learn. If it wasn’t representative of the language or the requirements you’re expected to fulfill afterwards, you’ll have learned something but your skills will not really be applicable to the task you were actually meant to do. Same goes for ML models and algorithms.

If you want to learn more about properly selecting data for training and testing purposes, the keyword is sampling.
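
To make the “learning by heart” problem a bit more tangible, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the numbers and labels are invented, and the two toy “models” (memorizer and threshold_model) are my own constructions. Real overfitting is usually subtler than a lookup table, but the symptom is the same: great on the data the model was trained on, poor on data it hasn’t seen.

```python
def accuracy(model, examples):
    """Fraction of examples for which the model predicts the right label."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)


# Invented data: one numeric feature x and a label y (0 or 1).
train = [(12, 0), (25, 0), (38, 0), (55, 1), (67, 1), (81, 1)]
test = [(18, 0), (31, 0), (44, 0), (59, 1), (73, 1), (90, 1)]

# Model A learns the training data by heart: a lookup table.
memory = dict(train)


def memorizer(x):
    # Anything it has not seen before gets a blind guess of 0.
    return memory.get(x, 0)


# Model B learns a general rule: a cut-off halfway between the classes.
def fit_threshold(examples):
    largest_zero = max(x for x, y in examples if y == 0)
    smallest_one = min(x for x, y in examples if y == 1)
    return (largest_zero + smallest_one) / 2


threshold = fit_threshold(train)


def threshold_model(x):
    return 1 if x >= threshold else 0


for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

Both models are perfect on the training data, but on the unseen test data the memorizer drops to 0.5 (no better than guessing) while the threshold rule stays at 1.0. That gap between training and test accuracy is what you watch for.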

In order to optimize results you get from ML algorithms, you need to play around with the parameters given to the ML functions. Because what’s going on is probably outside of the range of your own limited imagination, it’s important to get a lot of experience playing around. Only like that can you really get a feel for how the algorithm you’re using actually behaves.

One could talk about each of the topics I have mentioned here for ages but nobody has time, including me, so we won’t 😉 But if you want to get into this, remember that this is only a very short intro tutorial. Let’s get on with some more fancy facts to know…

Types of ML

Usually, we distinguish three different types of ML (of which only the first two are relevant for the Humanities).

  1. Supervised Learning: You give the algorithm training examples with target values (for example, human-made “gold standard” annotations) and the goal is to predict corresponding target values for new data that the algorithm hasn’t seen before, based on those training data. Meaning that we supply what we think would be “correct solutions” for the problem at hand. Examples can be OCR or speech recognition. We can use many different types of algorithms to achieve this. So just to throw around a few names, these are among others: linear regression, logistic regression (more on regression and classification later), neural networks (i.e. Deep Learning), k-nearest neighbour classifiers, support vector machines and Bayes classifiers. I won’t go into detail for most of them but just so you know there are different methods to approach supervised learning.
  2. Unsupervised Learning: Here, training examples are given without target values (meaning without indication what an expected result may look like), i.e. the algorithm isn’t told what we consider a correct solution. The goal here is to detect patterns and extract structure from data. This is useful when we don’t know enough about the data yet ourselves so that we’re unable to formulate a precise research question or give examples of correct classifications/solutions. This approach thus is more exploratory than supervised approaches. Ergo, if you don’t know what you want, try unsupervised learning. Examples of tasks are clustering, segmentation and embedding; examples of algorithms are k-means clustering, Principal Component Analysis (PCA) and Latent Dirichlet Allocation (LDA, the algorithm behind most topic modelling; not to be confused with Linear Discriminant Analysis, which shares the abbreviation but needs labels and thus is supervised).
  3. Reinforcement Learning: While both approaches above are characterized by algorithms learning from data, reinforcement learning algorithms learn by doing (i.e. trial and error). This can be useful for robotics. The algorithm knows what the goal is and will get positive reinforcement when it was successful and negative reinforcement (“punishment”) when it’s doing “something wrong”. This feedback in the form of reward or cost during trial-and-error episodes is used to maximize reward and minimize cost (in terms of how the algorithm is implemented). I won’t go into this further because a) I don’t know all that much about it 😀 and b) it’s not really relevant for the DH (yet?). Robots are trained like this and this has allowed machines to learn many things Michael Polanyi once called tacit knowledge, or things we do intuitively. (Of that sort of knowledge, he stated that it could never be made explicit and thus, theoretically, could not be learned by a computer, like riding a bicycle, which robots can now do.)
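
Since unsupervised learning is probably the least intuitive of the three, here is a minimal sketch of k-means clustering in plain Python. It is a deliberately stripped-down version under my own assumptions: only one-dimensional data, invented values (think of them as sentence lengths in words), hand-picked starting centers and a fixed number of rounds. The function name kmeans_1d is hypothetical; real implementations handle many dimensions and choose starting points more cleverly.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the given number of centers (1-D k-means)."""
    for _ in range(iterations):
        # Step 1: assign every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Step 2: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Invented data without any labels: we never say what a "correct" group is.
sentence_lengths = [8, 9, 11, 12, 30, 33, 35, 38]

centers, clusters = kmeans_1d(sentence_lengths, centers=[0.0, 50.0])
print("centers:", centers)
print("clusters:", clusters)
```

The algorithm ends up with centers at 10.0 and 34.0 and splits the values into a “short” and a “long” group, without ever being told that such groups exist. Note that it would also happily produce two groups if the data had no meaningful structure at all, which is exactly why the interpretation remains our job.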

As a short side note on the history of ML

For a long time, linguists thought that teaching a computer language would be possible using rule-based approaches. Once enough rules are laid out in an explicit way for the computer, tasks like Machine Translation would become possible, people thought. Since the 1990s, however, (if I’m remembering this correctly) people realized that statistical approaches to language are more powerful because human language is so illogical and has so many exceptions when you look at the details that we humans would become a bottleneck in laying out those rules. It’s way easier if a machine learns those statistics itself. If you’re interested, a fascinating book on the topic is Jun Wu, The Beauty of Mathematics in Computer Science (2018). It comes from a series of blog posts, thus is definitely readable for laypeople interested in the topic and it covers mostly language-related tasks.

Getting philosophical: How do we define intelligent machines?

Oh and also, another question in AI is, of course, how we can even define Artificial Intelligence. Does a machine need to feel like a human being to achieve human intelligence? Does the machine need to learn to hate its own reflection in the mirror to become human (see Steven Pressfield, Turning Pro (2012): The Human Condition, p.3 or the Hitchhiker’s Guide‘s Marvin the Paranoid Android or depressed robot)?

This is the question people are trying to answer in the context of the “imitation game” or “Turing Test”. It’s a fascinating topic in the philosophy behind AI and the limitations of computation but not a topic for this post here and now.

(But, if you’re feeling adventurous, you can also look up John Searle’s Chinese Room argument, which has become super popular and well-known in philosophy, which makes a case why the Turing test might not be a valid test for whether machines have become “intelligent” after all. The general gist is the question whether it might be possible for a machine to spit out correct/expected/adequate answers even though it has no semantic understanding of what it’s doing, so it really would be very much mechanical and not very intelligent in the way we humans define intelligence.)

Oh and also because I’m a fan girl, there is Harry Collins’s (sociology/epistemology of science) Artifictional Intelligence: Against Humanity’s Surrender to Computers (2018), which gives a good grasp of what ML/AI algorithms actually do (and don’t do) and why that’s pretty far away from the image of AI that science fiction is feeding us, hence the title “arti-fiction-al intelligence”.

Maths concepts to know, minimizing error functions and types of learning

In order to “get fluent in ML”, you’ll have to get to know tons of algorithms and maths/statistics concepts such as Gaussians (or normal distributions), standard deviations, probability density functions, parametric models and all that jazz. I think this would also be beyond the scope of this super short intro, so I’ll try and keep it super short. But some stuff still needs mentioning, in my opinion, and those are a few more key concepts. As for the details of the algorithms, you can always look those up and understand them later (when you set out to actually become “ML fluent”). But once you have understood the principles behind the different implementations, I’d say you can call yourself “ML literate”, which is probably the goal I’m trying to achieve with this post.

“Minimizing cost/error” is one of those underlying principles of ML in general. It means that we “give penalties” for bad results in order to minimize error. This is a way of “steering” our algorithm in the direction we want it to go. We have special “cost or error functions” for that. A famous one is the Least Squares Error (LSE), in which the error values are squared so that the bigger the error, the more it gets emphasized by the squaring, thus punished harder (an error of 2 costs 4, an error of 10 costs 100). Gradient Descent, which you will also hear about a lot, is not a cost function itself but a method for minimizing one: it repeatedly nudges the model’s parameters a small step in the direction in which the cost decreases fastest.
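
Here is a minimal sketch of how a squared-error cost and gradient descent play together, in plain Python. It rests on several assumptions of mine: the data is invented, the model is the simplest one imaginable (y = w * x, with a single parameter w), and the learning rate and number of steps are hand-picked values that happen to work for this toy case.

```python
# Invented toy data that follows y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def cost(w):
    """Mean squared error of the model y = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w):
    """Slope of the cost with respect to w (derivative of the function above)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # start with a deliberately bad guess
learning_rate = 0.05  # how big each correction step is (hand-picked)
history = [cost(w)]   # keep track of the cost to watch it shrink

for _ in range(100):
    # Step "downhill": against the direction in which the cost grows.
    w -= learning_rate * gradient(w)
    history.append(cost(w))

print(f"learned w = {w:.4f}")
print(f"cost at start = {history[0]:.2f}, cost at end = {history[-1]:.6f}")
```

The cost starts at 30.0 for the bad guess w = 0 and shrinks towards zero as w approaches 2. The squaring is what makes large errors dominate the cost, and the gradient is what tells the algorithm which way to correct. Try a much larger learning rate to see the steps overshoot, which is a nice first experience of “playing around with the parameters”.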

Some more terms I do think you should know are:

  1. Regression Learning: prediction of continuously valued output (e.g. in Linear Regression, we assume that the result values form a more or less linear function. An example is the rise of housing prices depending on the size of the apartment).
  2. Classification Learning: We expect discrete valued output (e.g. the result is either 0 or 1). Internally, many classifiers first compute a value somewhere between those discrete values, which can be read as a probability, and then assign the class that value is closest to. You can also have more than two classes. For example, the MNIST dataset for digit classification from images (often the first example used to learn neural networks) has the classes 0-9 and every digit found will be classified into one of them.
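
To illustrate the step from “a value between 0 and 1” to “a class”, here is a minimal sketch in plain Python in the style of logistic regression. Big caveat: the weight and bias below are hand-picked by me for illustration, not learned from any data, and the function names are my own. In a real setting, these two numbers are exactly what the training step would have to find.

```python
import math


def sigmoid(z):
    """Squash any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weight, bias):
    """Value between 0 and 1, readable as 'probability of class 1'."""
    return sigmoid(weight * x + bias)


def predict_class(x, weight, bias, threshold=0.5):
    """Turn the probability into a discrete class label (0 or 1)."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0


# Hand-picked (hypothetical) parameters: the tipping point lies at x = 50.
weight, bias = 0.2, -10.0

for x in [30, 48, 52, 70]:
    p = predict_probability(x, weight, bias)
    print(f"x={x}: probability={p:.3f}, class={predict_class(x, weight, bias)}")
```

Inputs far from the tipping point get probabilities close to 0 or 1, inputs near it get something close to 0.5. The class label alone hides how (un)certain the model was, which is worth remembering when you interpret the results.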

We should also know about model representation in ML: We have a certain number of training examples (in supervised learning), which is usually denoted as m. Then we have the input variables x (i.e. the features to watch) and the y-value, which is the target value. One x-y combination like this is a training example; a set of them, in turn, is a training set. The expected value for y determines, like we heard above, if it’s a regression problem (y can take continuous values) or a classification problem (we expect y to take only a very limited number of possible values/options).

There are a few datasets you will get to know after completing a few intro tutorials because usually, the same ones are used all over again. There is the Titanic dataset, the MNIST (image classification for digits or fashion), the flower (Iris) one, and many more. A problem with this is that those are not easily applied to most DH research questions or at least the application won’t be obvious to a first-time inexperienced user with little knowledge of the theoretical principles. (Remember I criticized that many tutorials on ML are super “mechanical” like “Type in this code and shit will work”. This does not give you any basis whatsoever to actually apply “what you learned”, if you want to say you actually learned anything, to a use case which is only slightly different from the one shown in the example, at least in most cases.) I think we in the DH should make it our responsibility to generate a few example data sets on which new “ML in the DH” scholars can learn. It’s too much to ask, in my opinion, to both understand a new technology and then also be expected to transfer what you learned to completely different research questions and datasets.

I now feel like this is horribly incomplete and also, I didn’t include neural networks/deep learning at all. But since this was supposed to be a short tutorial, I’ll stop here. This info probably is enough to digest to start with. Let me know if this helped you or what else you would have liked to learn! Or if you’re a teacher of ML, let me know what I missed which you would have included!

This tutorial was really short (I mean not really, but given the immense nature of the subject matter). I might follow up at some point but I thought, given that I’m lately very short on time for blogging, you will profit more from a published imperfect blogpost than a perfect one which never comes out. I would be happy to hear from you if you’re about to get into ML, if I have motivated you to do so or if you’re already doing DH ML research or making tutorials and disagree with my points. I have done my research for this post but not as much as I would usually do, especially regarding existing DH ML tutorials. Feel free to point out helpful resources to me. Until then, here are some resources I quickly came up with.

Resources

Help

If you need help with anything computational, I can highly recommend the Computational Humanities Forum. Here, you will find experienced colleagues who are more than willing to help, I’m sure.

Specific Courses for “DH and ML”

I highly recommend the Python Humanities website and Youtube channel for tutorials on using Python for NLP and machine learning. They are done with lots of love, still I feel that the theory got a bit lost along the way (like in my overall criticism of many ML tutorials) and most of the tutorials’ content will be “this is how you type it into Python” (which is fair because that’s the goal of the channel). But then again, it’s hard to strike a balance. There are many theoretical ML tutorials which, on the other hand, remain so theoretical that they alone also won’t really let you grasp what ML actually means. I have tried to supply a “bridge” here but as to whether I have succeeded only you, my dear readers, can decide.

By the way, if you want to get into ML, there is a #100DaysofML challenge on Twitter and I suggest you could combine this with a #100DaysofDH project 😉

The Deep Learning for DH course on github has interesting material. It has many really good applications of Deep Learning, such as annotation, spelling-correction, OCR, stylometry and more.

However, I personally don’t get why people use GANs for DH. They are incredibly hard to actually understand and they’re generative, as the name says. I mean, surely there is a very limited number of research questions which lends itself to be answered using such models. But really? Mostly, I don’t want to generate new poetry or new images from historical plants (maybe for an educational computer game?). As a historian, I want to analyze the existing historical sources. So, yeah, I’m not convinced by GANs yet but maybe I don’t understand them enough. Convince me if you think I’m wrong about this! Looking forward to hearing from you guys in the comments!

Edit: An article on ‘Generative Digital Humanities’ by Fabian Offert and Peter Bell, dealing with the question of how GANs can be useful to DH, just came out: http://ceur-ws.org/Vol-2723/short23.pdf – I’m happy that this subject is now being addressed in a meaningful way 😉

Massive Open Online Courses (MOOCs) on ML in general

If you want to learn about ML in general, there are many great MOOCs and resources out there, such as the Stanford ML course on Coursera, for example, or the OpenHPI course on Deep Learning for Computer Vision (in German). However, having done those courses, I’m not sure I should recommend them because a lot of work goes into them if you want to do them properly. But still, the direct application for the DH is not immediately clear. The courses do, however, give you an enhanced understanding of what the techniques and algorithms actually do behind the scenes. So it’s good to get a proper understanding of the details and implementations if you decided to “get fluent in ML” after all. Maybe a good idea / compromise would be to go through the courses quickly, skipping over technical details (which are relevant, of course, but not when you’re just trying to get a first overview).

Journals

I think the Computational Humanities group were also setting up a journal (which I couldn’t find anymore, but here’s a link to their site). A related journal in the area of Computational Humanities probably is The Journal of Data Mining and Digital Humanities.

Reading / Publications

You can also check out this short PDF on Critical Digital Humanities and Machine Learning or read Ted Underwood’s publications such as Distant Horizons (2019). Also James E. Dobson, Critical Digital Humanities. The Search for a New Methodology (2019).

A little criticism on ML in the DH

I think a problem that we still have is that too few people who are Humanities scholars through and through understand Machine Learning enough to fully exploit its potential for the DH. Of course, there are some useful applications and some are also pretty obvious ones because they are just repetitive tasks a machine can do better than us. (You might also be interested in my post on generating useful research questions quantitative methods can answer.)

But then, it seems that, so far, many who are on the more technical end of the DH seem to be including ML applications in their courses which just aren’t DH applications and regarding which I personally cannot yet see what useful Humanities questions such methods could ever answer (GANs are a hot candidate for this criticism). I think, as DH people, our job would be first to find an application which is actually useful and meaningful to the Humanities before making big headlines that we’re also able to replicate/use this fancy-as-hell technology from our friends in the Computer Science department.

That’s why I think it’s important that as many people as possible who strongly identify with “classical” Humanities work and who themselves have (not only digital) Humanities research questions still should learn to understand quantitative or computational methods and machine learning (become what I called “ML literate”) so we can come up with useful applications for them as quickly as possible. Because I really don’t like this “l’art pour l’art” approach to computational methods in the DH. Not because I dislike “l’art pour l’art” per se. Not when it’s about, well, art. Then this type of thinking totally makes sense to me. But it doesn’t when it’s about digital algorithms. I don’t think the Humanities have to use some fancy new algorithms just because they exist. I think we should use them because they enhance our work. But in order to make that assessment, we first have to properly understand what these methods do, how they do it and what they are capable of (i.e. achieve “ML/computational methods literacy”).

This is it for now. Hoping you are well,

cheers,

the Ninja

PS: If you find any mistakes in the post or have suggestions to make it better or more concise, feel free to let me know! Also, maybe send links if you think I missed some useful resources on the topic.

Afterthoughts

Since there was some discussion about this topic and the post on Twitter, I wanted to include some of the tweets and the resources which were mentioned to me, namely by Michael Piotrowski. (Thanks!)

Resources pointed out by Michael Piotrowski

  1. Michael Piotrowski, Historical Models and Serial Sources, in: Journal of European Periodical Studies Vol 4 No 1 (2019): Digital Approaches Towards Serial Publications. DOI: https://doi.org/10.21825/jeps.v4i1.10226
  2. Gilles-Gaston Granger, The Découpage of Phenomena, in: Formal Thought and the Sciences of Man. Boston Studies in the Philosophy of Science, vol 75, Springer, Dordrecht (1983). https://doi.org/10.1007/978-94-009-7037-3_4
  3. Mateusz Fafinski, Historical data. A portrait, in: History in Translation (blog), 29 Sep 2020, https://mfafinski.github.io/Historical_data/. (This site also has a “Buy me coffee” button, so I suggest you do so if you like the linked post 😉 It’s really good, so read it; here follows a quote to motivate you.)

Your data is not what you think it is
Historical data is not any data. It might look like one, even superficially behave like one, but it isn’t. And historical data stored in collections (be it museums, archives, galleries or databanks) is even less like normal data. And what historical data most certainly isn’t is big data: that unicorn of data science that is supposed to change into gold under the influence of the modern philosophers’ stone: machine learning. Leaving aside the deeply flawed assumptions about what big data can and cannot achieve (and there is a lot to unpack there) let us focus on why historical data cannot be treated that way.

No data is neutral and apart from the fanatical bigdatists (believe me, they exist) very few data scientists will dispute the existence of inherent bias in datasets. Algorithms are the equivalent of a parrot: they will repeat and amplify any bias embedded in the dataset. [… ML fails] keep happening and will keep happening, because the creators of those algorithms don’t look at their data. And when I write look I mean: read. With comprehension.

Now, one can make an argument that all data is historical data. Correct. […] In other words: scientists and algorithmsmiths cannot just willy-nilly produce 4-page papers claiming “discoveries” based on “pure data”. Context is everything, because all data is historical.

[…]

Historical data does not consist of simple data points. Each instance, before it can be analysed, must be contextualised. And yes, this principle should be applied for all data, because all data is historical data. But if we move further in time or further from our area of expertise the gap between our intuitive understanding and the pattern of data instances increases. Similarly, the further back in time the higher the chance of data instances not surviving. The bigger the impact of curating in collections.

[…]

Just, quite simply, ask your friendly neighbourhood humanist […] Even more importantly: if you can amplify humanists, their work and their critique in the public sphere, do it! Make humanist expertise more exciting than badly done regression graphs. (Fafinski, Historical data. A portrait)

Resources mentioned by me in the original post (summarized in list form)

  1. Jun Wu, The Beauty of Mathematics in Computer Science (2018).
  2. Harry Collins, Artifictional Intelligence: Against Humanity’s Surrender to Computers (2018).
  3. Ted Underwood, Distant Horizons (2019).
  4. James E. Dobson, Critical Digital Humanities. The Search for a New Methodology (2019).
  5. Also maybe check out the project An Agile Approach Towards Computational Modeling of Historiographical Uncertainty
  6. Please get everything else (that isn’t an actual book) from the Resources section or the links!
