Since I am about to prepare a workshop on natural language processing and a pre-workshop-workshop where I need to quickly crash-course my (non-digital) Classicist friends through some programming basics, let me share a list of programming concepts I compiled. I would be happy about your suggestions and comments regarding mistakes. I will probably publish this together with some key concepts of quantitative text analysis (blogpost to come) on a cheatsheet or as slides later 😉
Intro to key concepts of programming
This list of concepts is not super-structured and is meant to work as a ‘reference tool’ as well as a text to be read, so I tried to give it a more or less useful ‘chronology’, meaning that later parts kind of build on earlier ones. I start off with what a computer program or algorithm actually is and how we translate between source code (the code we write) and the code which gets fed to the computer in the background (which looks nothing alike and is not the same code!). This is a bit theoretical, but bear with me; it gets (a little) more practical soon. Sorry that there is some theory to be learned (or at least heard about once, so you know where to find it again when needed) before you get started. It might not make a lot of sense before you’ve written your first lines of code, so maybe read this through once, do some exercises (there are many free tutorials online for any number of programming languages of your choice) and then come back to read through it again.
Without further ado, let’s get going (with the slightly ‘dry’ parts at the beginning).
To get a computer to do anything, you (or someone else) have to tell it exactly — in excruciating detail — what to do. Such a description of ”what to do” is called a program, and programming is the activity of writing and testing such programs. […] The difference between such [everyday] descriptions and programs is one of degree of precision: humans tend to compensate for poor instructions by using common sense, but computers don’t. […] when you look at those simple [everyday] instructions, you’ll find the grammar sloppy and the instructions incomplete. A human easily compensates. […] In contrast, computers are really dumb. (Stroustrup 2014, 44)
We also […] to assume ”common sense”. Unfortunately, common sense isn’t all that common among humans and is totally absent in computers (though some really well-designed programs can imitate it in specific, well-understood cases). (Stroustrup 2014, 35)
An algorithm is a recipe providing a formalized way to automate carrying out a task. Let this sink in and reflect on whether you understand what it means. It doesn’t have to be written down in a particular programming language yet; it can be in ‘pseudo-code’ (for example, in text processing: ”Whenever you find a verb, do this. When you find a noun, do that.” etc.)
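To make this concrete, here is a toy version of such a recipe written out in Python. The word lists are invented placeholders, not a real part-of-speech tagger:

```python
# Invented mini-wordlists, just to make the 'recipe' concrete.
VERBS = {"run", "read", "write"}
NOUNS = {"book", "scroll", "stylus"}

def tag_word(word):
    """Whenever you find a verb, do this; when you find a noun, do that."""
    if word in VERBS:
        return "VERB"
    elif word in NOUNS:
        return "NOUN"
    else:
        return "OTHER"

for w in ["read", "the", "scroll"]:
    print(w, tag_word(w))
```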
The existence of the term ‘pseudo-code’ hints at the fact that computers don’t think like humans do; in fact, they don’t think at all. A beginner needs to get to know ways of communicating effectively with a computer. This involves experience, but at a higher level it also requires a certain background knowledge about the inner workings of computers (i.e. what are data types, how are libraries built, what parameters and arguments do functions expect and what do they return?)
Source code and compiled code
Source code is what we see and write. However, a computer can only understand machine language (1s and 0s), so we need to ‘bridge this gap’.
There are two ways of doing this: Either we ‘press a button’ (which might not always be a literal button in a normal programming language, but is, for example, when you compile LaTeX code into a PDF on Overleaf) and the code gets compiled. This produces an executable as output which can, in turn, be executed. Instead of ‘compiled’ and ‘interpreted’, people sometimes loosely say ‘static’ and ‘dynamic’.
Apart from compiled languages, we can use an ‘interpreted’ language where this ‘compiling process’ is done, line by line, while the code runs. Python is such an interpreted language (it’s more complicated than that in the background, but don’t worry about this). ‘Interpreting’ at runtime is a bit slower, and it means there is no compiler at (the non-existent) compile time to inform you about some inconsistencies in your code. But you will have to ‘run’ interpreted code more frequently anyway, so in practice it doesn’t make much of a difference, and standard code editors will help with pointing out inconsistencies anyway. You basically get your error messages one way or the other. Which leads us to the next point…
Debugging errors and error handling
Debugging is the act of cleaning up your code when there are errors in it which prevent it from working correctly, as intended or from running at all. The term comes from a time when computers were huge and literally had to be de-bugged because bugs were stopping the wires from working correctly. (However, this seems to have happened only once and the story is more of an urban legend. Thanks to Karl Berry for pointing it out!) There are different types of errors, most importantly:
Syntax errors mean that you didn’t respect the conventions of your programming language. Like spelling and grammar mistakes in normal writing.
Logic errors are hard to detect since the conventions of the programming language are respected and automated checking tools might not find them. An example would be a function indirectly (and thus invisibly) receiving nothing at all (a ‘null reference error’) or using the wrong conditional operator, which is a design flaw that silently produces incorrect results. Usually these are ‘bigger’ errors in the program flow compared to the more localized syntax errors (typos, a missing semicolon or closing bracket, etc.).
Semantic errors mean that the code runs alright, but it doesn’t do what you intended it to do. Remember that the computer will do exactly what you tell it to do, even if this is not, in fact, the same as what you intended. (This happens a lot, not only to beginners.)
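A tiny made-up example of a semantic error in Python: the code runs without any complaint, but it computes the wrong thing.

```python
def average_wrong(numbers):
    return sum(numbers) / 2              # semantic error: runs, but wrong

def average(numbers):
    return sum(numbers) / len(numbers)   # what was actually intended

print(average_wrong([2, 4, 6]))  # prints 6.0, although the average is 4.0
print(average([2, 4, 6]))        # prints 4.0
```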
Runtime errors make a program crash due to unexpected input: the compiler can’t know that someone will input something which causes a crash (such as an unexpected, and thus uncaught, division-by-zero error which you should actually prevent). Many recurring types of errors have consequently received their own names, such as the index-out-of-range error or the off-by-one error, to name the most well-known examples.
Errors should be handled in your program flow; this means they should be anticipated and ‘caught’ (or ‘thrown’). An error message will specify what kind of problem happened, which makes debugging easier. An expected and thus properly handled error prevents the program from crashing whenever something problematic happens (which shouldn’t happen!). If you have advanced error handling in place, your program will know how to react when different types of errors occur. However, this is maybe overkill for tiny script-like programs where a crash is not an issue.
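In Python, this ‘catching’ is done with try/except. A minimal sketch of anticipating the division-by-zero case mentioned above:

```python
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        # The error is anticipated and 'caught', so the program doesn't crash.
        print("Cannot divide by zero; returning None instead.")
        return None

print(safe_divide(6, 2))   # prints 3.0
print(safe_divide(6, 0))   # prints the warning, then None
```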
Input and Output (IO)
Input and output (sometimes referred to as I/O) are basic concepts of how computing works: You give input to a function or computer program and expect it to return a certain output. In between, some processing happens.
This is actually the same principle as in a function you know from maths, if this helps you. For example, in image processing, every pixel of an image has a colour code stored in some data format (which you don’t need to worry about). Just know that it has a numeric value which can be used as input for a function. So when processing an image in an image manipulation programme (such as turning it to grayscale), every colour value gets inputted into a function which makes some decisions, such as: If the colour value is above threshold XY, make it black – if it is below it, make it white. So in the “output graph”, like a function graph in maths, you get these transformed values which can, in turn, be “put back” (written onto) their old locations in the coordinate system that is the grid behind what you see as a digital image.
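If you want to see this thresholding idea as code, here is a sketch in Python. The cutoff value 128 is an arbitrary choice for the example:

```python
def threshold(value, cutoff=128):
    # Above the cutoff -> black (0); below it -> white (255),
    # following the description above. 128 is an arbitrary example cutoff.
    return 0 if value > cutoff else 255

# Pretend these are the colour values of four pixels:
pixels = [30, 200, 127, 129]
print([threshold(p) for p in pixels])  # prints [255, 0, 255, 0]
```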
The following terms will give you more detailed understanding of how exactly these functions work, but keep the image of maths functions in the back of your mind. Computer functions are not exactly the same, but the principle is similar.
A variable is a container for storing information, kind of like in maths where a variable is a placeholder for a yet unknown value. It is named (such as ‘x’ in maths) and can be called by this name. Calling the name usually gives you its ‘contents’.
A function is a generalized bit of code, made for general-purpose reuse. So essentially, you abstract what is being done. Instead of ‘process my document test.txt’ you would write something like `process(text)` and would be able to pass a document of any name to this `process()` function, like `process("unicorn.txt")`. This has the awesome advantage that it doesn’t matter whether your document is called ‘test’ or ‘unicorn’ or, in fact, any other crazy name you can come up with. A function usually has a ‘signature’, that is, a list of which parameters you pass to it (in the definition of a function, you say which type of data will be inputted) and what it will return.
A parameter is a variable in the declaration of a function. Parameters (which are a type of variable) contribute to the abstraction of functions. This is a good thing.
Imagine you write multiple lines of code (and you will, in the process of developing a function). Once you’ve got the thing working, you want to apply it to all the other things this automated processing needs to be applied to. If you don’t create a function where this gets abstracted, you have to copy all those lines over and over again, only to make minor changes in the actual code. This is a situation that practically screams “Define a function for this repetitive behaviour!” at you. You should listen to the call.
Once it’s abstracted and you realize a change needs to be made, you need to make that change exactly once and not in every single line of the code you copy-and-pasted around which now clutters your editor. This means, with writing a function, we have reduced redundancy – and in computer things, we always want to reduce redundancy. Amen.
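A small made-up example of what such an abstraction looks like in Python:

```python
# Without a function, the same cleaning steps get copied for every text:
#   text1 = "ARMA VIRUMQUE CANO".lower().strip()
#   text2 = "  Gallia est omnis  ".lower().strip()
#   ... and so on, again and again.

# With a function, the steps are defined once; a future change
# (say, also removing punctuation) then has to be made exactly once.
def clean(text):
    return text.lower().strip()

print(clean("  Gallia est omnis  "))  # prints 'gallia est omnis'
```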
However, the variables and parameters are abstract stand-ins (which is good, as we just learned). Their names can act like placeholders. But this can be quite confusing to a beginner. The point is that this abstract placeholder (called ‘parameter’) in a function is not the same thing as the actual content you pass to the function (which is called the ‘argument’, see below):
An argument is almost the same as a parameter, but the difference is that it means the actual value which gets passed to the function. So when using a function, you can pass the number 3 to it (3 is the argument). In the definition/declaration, you say that an integer variable will be passed to the function (the parameter is of integer type).
The return value is what a function will return as a result. It has a ‘return type’, specifying which data type will be returned. This can sometimes be good to know so that we don’t get confused in further processing steps. However, the point of the whole function-abstraction thing is that we really shouldn’t need to know exactly how the processing is done inside a function. We only need to know what we put in (input) and what we expect to get out of it (output). This makes it easier to build really complex programs where you don’t want to have to worry about how exactly things were implemented in the detailed single processing steps.
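Parameter, argument and return value in one tiny Python example:

```python
def add_three(number):    # 'number' is the parameter (a placeholder)
    return number + 3     # the return value; its type here is an integer

result = add_three(3)     # 3 is the argument (the actual value passed in)
print(result)             # prints 6
```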
A library (or package) is essentially nothing else than a bunch of functions, only that when you load them into your program, you can profit from (usually) really well-made and maintained functions somebody else wrote. Using libraries, you can accomplish quite a lot while writing very little code yourself. Unless it’s for practice purposes, it is usually better to reuse a standard library than to write things yourself — package maintainers are experienced experts and might have taken things into account which you didn’t.
When you use code from a library, this code is not part of the standard repertoire (‘vocabulary’) of your language: it is a set of functions made using this standard repertoire, just as your own functions are. When you invoke or ‘load’ them, it’s essentially as though the computer copied their text into your document behind the scenes (and in some languages, this is more or less literally what happens). Functions coming from libraries can have the name of their library prefixed as a ‘namespace’. This can sometimes be a little confusing, especially for beginners, because it obscures which names belong where in longer expressions and requires you to know/remember which libraries your functions come from.
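A minimal Python example of loading a library and using its name as a namespace prefix:

```python
import math               # load the standard-library 'math' package

# The library name acts as a 'namespace' prefix before the dot:
print(math.sqrt(16))      # prints 4.0; sqrt lives in math, not in our code
print(math.pi)            # libraries can provide constants, too
```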
By combining and reusing libraries, you can be really efficient and ”stand on the shoulders of giants”, as they say.
Program Flow and Flow Control
In the beginning, we learned that an algorithm (and thus also its concrete implementation, a program) is essentially a cooking recipe which allows us to automate things which are simple enough to lend themselves to being automated. Sometimes even things which look complicated at first can be reduced to simpler substeps. This is called the ‘divide et impera’ (‘divide and conquer’) principle in computer programming. However, cooking recipes always expect the same input (the ingredients, so to speak). If the input differs from what’s specified (even just a tiny little bit!), the computer really doesn’t know what to do with it, because the computer doesn’t think and doesn’t have common sense.
These situations where a computer doesn’t ‘get’ simple things and doesn’t compensate for our little inconsistencies (such as typos) can be irritating. So please remember always that the computer doesn’t think. It’s not its fault. You made the mistake, so go and correct it.
To make our programs more robust, we can use flow control, that is: some simple rules in the form of if-then conditions or behaviour which gets repeated as many times as needed or as long as there is still input. These regulate what we call ‘program flow’. There are a few options available, for example:
‘Looping’ means repeating an action a specified number of times. If something goes wrong in the definition of the ending condition (which always needs to be defined!), you can end up with an ‘infinite loop’ which will cause the program to hang or ‘freeze’ and possibly crash. For-loops repeat a fixed number of times, while-loops repeat while a condition is true, and do-while-loops execute at least once and then behave like while-loops.
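Both loop types in Python (which, by the way, has no do-while loop):

```python
# A for-loop repeats a fixed number of times:
for i in range(3):
    print("iteration", i)   # prints iteration 0, 1, 2

# A while-loop repeats as long as its condition is true.
# If 'count' were never increased, this would be an infinite loop!
count = 0
while count < 3:
    count += 1
print(count)                # prints 3
```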
A conditional is an if-then-type decision which can be used to regulate program flow. You can define multiple if cases, but also if-else and else cases which will handle everything which doesn’t match the condition.
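A small conditional in Python; the morphological ‘rule’ is invented purely for illustration:

```python
def guess_person(word):
    # Toy rule of thumb, made up for this example -- not real morphology!
    if word.endswith("ant"):
        return "3rd person plural?"
    elif word.endswith("at"):
        return "3rd person singular?"
    else:
        # The else case handles everything which doesn't match a condition.
        return "no rule matched"

print(guess_person("amat"))   # prints '3rd person singular?'
print(guess_person("rosa"))   # prints 'no rule matched'
```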
Data types

Suffice it to say, for now, that there are different ways of storing data. As a beginner, especially if you’re using a dynamically typed language such as Python, this isn’t all that important at the beginning, so I don’t want to be too verbose, so as not to confuse you with irrelevant facts.
There are simple (‘primitive’) data types, such as integers, characters, floating point numbers or strings, as well as complex data types such as hashmaps, lists, dictionaries, etc., which are specific to their implementation in a programming language. There are also ways for you to define your own complex data types (such as specifying that if you want to store data on people, you need their names, ages and addresses, and an address always consists of X, Y and Z).
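A few of these data types as they look in Python:

```python
age = 42                  # integer
height = 1.75             # floating point number
name = "Sappho"           # string

# A complex data type: a dictionary bundling related values,
# roughly like the 'person' record described above.
person = {"name": "Sappho", "age": 42, "address": {"city": "Mytilene"}}
print(person["address"]["city"])   # prints 'Mytilene'
```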
Python is a ‘dynamically typed’ language (and, strictly speaking, still a strongly typed one), so you don’t need to declare types yourself (as opposed to statically typed languages like C) or perform operations such as:
‘Type-casting’ means changing data from one type to another. A classic example is that ‘1’, if typed into a terminal, actually is a string to the computer, not a number. In order to calculate with it, you need to change it to an integer. In Python, this is a simple function call, so don’t worry too much about it now.
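In Python, that type-cast of ‘1’ to a number looks like this:

```python
typed = "1"               # what you get from the keyboard is a string
number = int(typed)       # type-cast the string to an integer
print(number + 2)         # prints 3
# print(typed + 2)        # would crash: can't add a string and a number
```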
So this is it for now. Got longer than I wanted anyway. I hope this helps. If you find any mistakes or have any suggestions, I would be happy about feedback.
Stroustrup 2014 = Bjarne Stroustrup: Programming. Principles and Practice Using C++. 2nd edition. New York: Addison-Wesley, 2014.