Better data leads to better research. That might sound obvious, but in practice, it’s often not where the attention goes. In conversations about AI, digital humanities, or computational research, the spotlight usually lands on models—new tools, new methods, new technical breakthroughs. Meanwhile, the data those models rely on quietly sits in the background, treated as a given. It isn’t. We know that from the data/capta discourse. If we actually care about research ethics, we need to shift that focus. Because most of the ethical issues people worry about don’t start with the model. They start with the data. And data, especially in historical research, is messy. It’s incomplete, shaped by power structures, and full of gaps. Some voices were never recorded. Others were preserved unevenly. So when we build datasets, we’re not just collecting neutral material; we’re making choices: What gets included? What gets left out? What gets cleaned up, standardised, or ignored?
This is where things get interesting—and where ethics becomes very practical. And that’s what we want, isn’t it? Don’t critical approaches always get criticised for not being practical enough? Well, here you have it: something practical. No more excuses to ignore it.
If you like me ranting about those topics, I guess you’ve come to the right place. But you can read about these things in even more depth in some recent publications like this one: Lang, Sarah A. “(Doing) Computational History: The Role of Data Work in Computational Approaches.” Histories 6, no. 2 (2026): 26. https://doi.org/10.3390/histories6020026
Focus on Data over Models: Who does the work?
The data-centric AI (DCAI) movement argues that (ethical) improvements lie in improving datasets rather than applying technical fixes to models. Computational scholarship requires significant labour to properly deal with data, including cleaning, contextualising, supplementing, and documenting data. This work, at its core, cannot be automated or rushed. It demands scholarly expertise, sustained investment, and recognition from funding bodies, institutional review boards, and research communities.
However, this labour is unevenly distributed and often undervalued. Within both history and computer science, data work has long been framed as secondary, “menial,” or feminised, and this pattern persists in computational humanities. Tasks such as digitisation, metadata creation, and corpus construction are essential to computational research, yet they are rarely foregrounded in publications or rewarded in hiring. At the same time, more highly valued and better-compensated forms of computational labour are not equally accessible. Researchers at well-resourced institutions, with specialised training or strong networks, are more likely to engage in large-scale computational projects. This raises serious questions about equity and access in a field that increasingly centres data-driven scholarship.
The growing emphasis on quantification reinforces these dynamics. Computational humanities is often perceived as more “rigorous” or “scientific” than traditional humanities work. This perception is rooted in long-standing assumptions about the objectivity of numbers. Yet numerical output alone does not guarantee meaningful knowledge. Quantification is not a marker of quality per se; it must fit the research question.
On the flip side, working with data takes time. A lot of it. Cleaning, documenting, contextualising, enriching. It’s slow, detailed work that doesn’t lend itself to shortcuts. And yet, this kind of labour is often treated as secondary. It’s the part that happens “before the real research begins,” even though it fundamentally shapes everything that follows.
And just as certain types of people usually get to do the computational work, as I described above, there’s also a pattern to who ends up doing the data work and how it’s valued. Data work has long been framed as routine, even “menial,” despite requiring significant expertise, and that was true even before the advent of digital methods (think bibliographical descriptions and the like). At the same time, the more visible, technical side of computational research tends to get more recognition. Add to that the fact that not everyone has equal access to training, infrastructure, or large-scale projects, and you start to see a bigger issue: just like the data itself, the way we organise research is not neutral.
Data work: whose work is worthy of credit?
Numbers carry authority. Computational approaches are often seen as more rigorous, more objective. But that assumption doesn’t always hold. Not every research question benefits from being turned into data, and not every dataset can support meaningful quantification. Sometimes, forcing things into numbers just flattens everything, for no benefit whatsoever.
But this bias towards quantification also affects what kinds of research get attention. It has real-life consequences for the labour we do. And it’s this labour, the question of who does the work and which work counts, that sits at the core of those omnipresent, never-ending discussions about how DH can be defined. Topics and approaches that don’t easily translate into numbers end up sidelined. Yet, instead of relying on increasingly complex models trained on opaque datasets, playing along with the whims of the tech industry, there’s real value in investing in transparent, well-documented, and thoughtfully constructed corpora. That doesn’t mean we have to abandon computational methods once and for all and jump ship. (Please note that it is a common way to shut down critical voices by opening up this either-or mindset of “Well, if you don’t like it, you can just stop using AI and leave everyone else alone”. It doesn’t have to be either-or. Don’t let them fool you into thinking you have to choose. You don’t.)
Re-focusing on data also means rethinking what counts as meaningful work. Preparing data isn’t just a technical step; it carries so many interpretive choices. It’s where all our beloved hermeneutics live. Isn’t hermeneutics our most beloved bit of the Humanities? (Or does it only count if it’s an old white would-be philosopher doing the work? Pardon my language.)
And yet, this work is often invisible. (Maybe because of who does it?) You rarely see detailed accounts of data preparation in final publications. The people involved (like librarians, archivists, digitisation teams, metadata specialists and so forth) are essential to achieving any computational results whatsoever, but not always credited in ways that reflect their contribution. If we’re serious about ethics, that needs to change.
Documenting datasets
Another piece of the puzzle is documentation. It’s not enough to have data. We need to know where it came from, how it was assembled, and what decisions shaped it. Without that, it’s hard to assess the limits of a dataset, let alone reuse it responsibly. There are frameworks and guidelines that help with this, but they only work if they’re taken seriously and supported with time and resources.
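To make this concrete: one lightweight way to record where data came from, how it was assembled, and what decisions shaped it is a small, machine-readable description that travels with the dataset. This is only an illustrative sketch, loosely in the spirit of datasheet-style documentation; every field name and value below is my own invention, not a prescribed standard:

```python
# A minimal, illustrative sketch of dataset documentation as plain data.
# All field names and values are hypothetical, not a formal standard.
corpus_datasheet = {
    "title": "Example correspondence corpus",  # hypothetical dataset
    "provenance": "Digitised from a single archive's holdings",
    "selection_criteria": "Letters with extant transcriptions only",
    "known_gaps": [
        "Voices never recorded in the archive",
        "Unevenly preserved material",
    ],
    "processing_steps": ["OCR", "spelling normalisation", "deduplication"],
    "curators": ["archivists", "metadata specialists"],  # credit the labour
}

def undocumented_fields(sheet, required):
    """Return the required documentation fields that are missing or empty."""
    return [field for field in required if not sheet.get(field)]

required = ["provenance", "selection_criteria", "known_gaps", "curators"]
print(undocumented_fields(corpus_datasheet, required))  # → []
```

The point is less the specific fields than the habit: if the gaps and selection criteria are written down alongside the data, a later reuser can assess the dataset’s limits instead of guessing at them.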
At the centre of all computational research is corpus construction. Deciding what goes into a dataset is not a neutral technical step. It is an interpretive act that shapes the questions researchers can ask and the answers they can give. Examining representation within datasets, identifying gaps, and actively addressing them are essential practices. The moment you build a corpus, you’re defining what can be seen and what remains invisible. Rather than continually optimising models developed under opaque conditions, we should focus on building transparent, reusable, and well-documented datasets. While large language models have their place, much can still be achieved with more traditional (AI or non-AI) methods if the underlying data is sound.
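Inclusion and exclusion decisions can also be recorded mechanically while a corpus is being built, so that what was left out (and why) survives alongside the corpus itself. A hedged sketch; the filter rule, record shape, and reasons here are invented purely for illustration:

```python
def build_corpus(candidates, include_rule):
    """Split candidate records into a corpus and an exclusion log with reasons.

    include_rule(record) returns None to include the record,
    or a short string explaining why it was excluded.
    """
    corpus, exclusions = [], []
    for record in candidates:
        reason = include_rule(record)
        if reason is None:
            corpus.append(record)
        else:
            exclusions.append({"id": record["id"], "reason": reason})
    return corpus, exclusions

# Invented example rule: exclude records without a transcription.
def needs_transcription(record):
    return None if record.get("transcribed") else "no transcription available"

candidates = [
    {"id": "letter-001", "transcribed": True},
    {"id": "letter-002", "transcribed": False},
]
corpus, exclusions = build_corpus(candidates, needs_transcription)
# exclusions now documents why letter-002 was left out,
# making the interpretive choice visible rather than silent.
```

The design choice worth noting: the exclusion log is a first-class output of corpus construction, not a by-product, which makes the “what remains invisible” question answerable after the fact.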
So what does better research ethics look like in practice? It looks like slowing down enough to ask what’s really in the data (and maybe trying to fix it if the result doesn’t look that good). It looks like treating data work as core research, not prep work. It looks like making labour visible and valuing it accordingly. And it looks like being honest about the limits of what our data can tell us.
In the end, better models won’t fix problematic data. That’s up to us. Robust computational research is not simply about applying algorithms. It depends on the quality of the data and the interpretive work involved in preparing it. Historians, librarians, metadata specialists, and digitisation teams all contribute essential expertise. Their work forms the foundation of computational analysis, yet it is all too often overlooked. This undervaluation is not unique to the humanities. In computer science, the often outsourced and underpaid nature of data work is well documented. While digital humanities projects may not directly replicate these practices, they still rely on forms of invisible labour that are rarely acknowledged. If research ethics is taken seriously, this labour must be made visible and properly credited.
Documentation plays a key role here. Standards such as FAIR and CARE may provide guidance, but they are not enough without detailed accounts of how datasets are constructed. Researchers need to know exactly what was included, what was excluded, and why. Without this, datasets cannot be critically assessed or reused effectively. While methods for documenting data and models are improving, this work, too, is still under-recognised within academic systems. But the fact that data papers are on the rise is a good sign at least.
Well-constructed and well-documented corpora are the basis of sound computational research. Without a clear understanding of what the data contains, even the most advanced models risk producing results that misrepresent historical realities. Looking more closely at our data is the practical way to better ethics that we’re all (pretending to be) looking for. Still, ethics cannot be reduced to checklists or compliance. It must be embedded in research design and supported by institutional structures that make more ethical (data) practices possible. Existing frameworks offer a starting point, but they do not replace critical reflection. Ethical research requires time, resources, and practical tools that translate principles into everyday workflows. And credit. We need to give credit where it’s due. We have a long backlog of people to thank.
So long and thanks for all the fish!
The Ninja
