When you control someone’s understanding of the past, you control their sense of who they are and also their sense of what they can imagine becoming. (Abby Smith Rumsey, 2016)

3.1 Enrichment of Digital Objects

After the initial headlong rush to digitisation, libraries, museums and other cultural heritage institutions realised that simply making sources digitally available did not ensure their use; what in fact became apparent was that as the body of digital material grew, users’ engagement decreased. This was rather disappointing but, more importantly, it was worrisome. Millions had been poured into large-scale digitisation projects, pitched to funding agencies as the ultimate Holy Grail of cultural heritage (cfr. Chap. 2), a safe, more efficient way to protect and preserve humanity’s artefacts and develop new forms of knowledge, simply unimaginable in the pre-digital era. Although some of this was true, what had not been anticipated was the increasing difficulty experienced by users in retrieving meaningful content, a difficulty that grew with the rate of digital expansion. Especially when paired with poor interface design, this difficulty left users frustrated, overwhelmed and dissatisfied, and made for an overall unpleasant experience.

Thus, to realise the return on investment in digitisation, institutions urgently needed novel approaches to maximise the potential of their digital collections. It soon became obvious that the solution was to simplify and improve the process of exploring digital archives, to make information retrievable in more valuable ways and the user experience more meaningful on the whole. Within the wider incorporation of technology in all sectors, it is not at all surprising that ML and AI have been more than welcomed to the digital cultural heritage table. Indeed, AI is particularly appreciated for its capacity to automate lengthy and boring processes that nevertheless enhance exploration and retrieval for conducting more in-depth analyses, such as the task of annotating large quantities of digital textual material with referential information. As this technology continues to develop together with new tools and methods, it is increasingly used to help institutions fulfil the main purposes of heritagisation: knowledge preservation and access.

One widespread way to enhance access is through ‘content enrichment’ or just enrichment for short. It consists of a wide range of techniques implemented to achieve several goals, from improving the accuracy of metadata for better content classification1 to annotating textual content with contextual information, the latter typically used for tasks such as discovering layers of information obscured by data abundance (see, for instance, Taylor et al. 2018; Viola and Verheul 2020a). There are at least four main types of text annotation: entity annotation (e.g., named entity recognition—NER), entity linking (e.g., entity disambiguation), text classification and linguistic annotation (e.g., parts-of-speech tagging—POS). Content enrichment is also often used by digital heritage providers to link collections together or to populate ontologies that aim to standardise procedures for the preservation, retrieval and exchange of digital sources (among others Albers et al. 2020; Fiorucci et al. 2020).

The theoretical relevance of performing content enrichment, especially for digital heritage collections, lies precisely in its great potential for discovering the cultural significance underneath referential units, for example, by cross-referencing them with other types of data (e.g., historical, social, temporal). Within the context of the DeXTER project, we enriched ChroniclItaly 3.0 with NER, geocoding and sentiment annotations. Informed by the post-authentic framework, DeXTER combines the creation of an enrichment workflow with a meta-reflection on the workflow itself. Through this symbiotic approach, our intention was to prompt a fundamental rethink of both the way digital objects and digital knowledge creation are understood and the practices of digital heritage curation in particular.

It is all too often assumed that enrichment, or at least parts of it, can be fully automated, unsupervised and even launched as a one-step pipeline. Preparing the material to be ready for computational analysis, for example, often ambiguously referred to as ‘cleaning’, is typically presented as something not worthy of particular critical scrutiny. We are misleadingly told that operations such as tokenisation, lowercasing, stemming, lemmatisation and removing stopwords, numbers, punctuation marks or special characters do not need to be problematised as they are rather tedious, ‘standard’ operations. My intention here is to show how it is, on the contrary, paramount that any intervention on the material is tackled critically. When preparing the material for further processing, full awareness of the curator’s influential role is required, as each action taken triggers a different chain reaction and will therefore output a different version of the material. To implement one operation over another influences how the algorithms will process such material and ultimately, how the collection will be enriched, the information accessed, retrieved and finally interpreted and passed on to future generations (Viola and Fiscarelli 2021b, 54).

Broadly, the argument I present provokes a discussion and critique of the fetishisation of empiricism and technical objectivity not just in humanities research but in knowledge creation more widely. It is this critical and humble awareness that reduces the risks of over-trusting the pseudo-neutrality of processes, infrastructures, software, categories, databases, models and algorithms. The creation and enrichment of ChroniclItaly 3.0 show how the conjuncture of the implicated structural forces and factors cannot be envisioned as a network of linear relations and, as such, cannot be predicted. The acknowledgement of the limitations and biases of specific tools and choices adopted in the curation of ChroniclItaly 3.0 takes the form of a thorough documentation of the steps and actions undertaken during the process of creation of the digital object. In this way, it is not just the product, however incomplete, that is seen as worthy of preservation for current and future generations, but equally the process (or indeed processes) for creating it. Products and processes are unfixed and subject to change; they transcend questions of authenticity and allow room for multiple versions, all equally post-authentic, in that they may reflect different curators and materials, different programmers, rapid technological advances, changing temporal frameworks and values.

3.2 Preparing the Material

Which of the preparatory operations for enrichment one should perform, and how to critically assess them, depends on internal factors such as the language of the collection, the type of material and the specific enrichment tasks to follow, as well as on external factors such as the available means and resources, both technical and financial; the time-frame, the intended users and research aims; the infrastructure that will store the enriched collection; and so forth. Indeed, far from being ‘standard’, each intervention needs to be specifically tailored to individual cases. Moreover, since each operation is factually an additional layer of manipulation, it is fundamental that scholars, heritage operators and institutions assess carefully to what degree they want to intervene on the material and how, and that their decisions are duly documented and motivated. In the case of ChroniclItaly 3.0, for example, the documentation of the specific preparatory interventions taken towards enriching the collection, namely, tokenisation, removing numbers and dates, and removing words with fewer than two characters as well as special characters, is embedded as an integral part of the actual workflow. I wanted to signal the need for refiguring digital knowledge creation practices as honest and fluid exchanges between computational and human agency, counterbalancing the narrative that depicts computational techniques as autonomous processes from which the human is (should be?) removed. Thus, in keeping with the post-authentic framework, I have considered each action as part of a complex web of interactions between the multiple factors and dynamics at play, with the awareness that the majority of such factors and dynamics are invisible and unpredictable. Significantly, the documentation of the steps, tools and decisions serves the valuable function of acknowledging such awareness for contemporary and future generations.

This process can be envisioned as a continuous dialogue between human and artificial intelligence, and it can be illustrated by describing how we handled stopwords (e.g., prepositions, articles, conjunctions) and punctuation marks when preparing ChroniclItaly 3.0 for enrichment. Typically, stopwords are reputed to be semantically non-salient and even potentially disruptive to the algorithms’ performance; as such, they are normally removed automatically. However, as they are of course language-bound, removing these items indiscriminately can hinder future analyses, with more destructive consequences than keeping them. Thus, when enriching ChroniclItaly 3.0, we considered two fundamental factors: the language of the data-set—Italian—and the enrichment actions to follow, namely, NER, geocoding and SA. For example, we considered that in Italian, prepositions are often part of locations (e.g., America del Nord—North America), organisations (e.g., Camera del Senato—the Senate) and people’s names (e.g., Gabriele d’Annunzio); removing them could have negatively interfered with how the NER model had been trained to recognise referential entities. Similarly, in preparation for performing SA at sentence level (cfr. Sect. 3.4), we did not remove punctuation marks; in Italian, punctuation marks are the typical sentence delimiters and are therefore indispensable for the identification of sentence boundaries.
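A minimal sketch of the point about stopwords, assuming NLTK’s Italian stopword list (the DeXTER pipeline itself is documented in the project repository and may handle this differently). It shows how a naive stopword-removal step would break the multiword entities mentioned above before they ever reach the NER model.

```python
# Toy illustration of why indiscriminate stopword removal can break Italian
# multiword entities prior to NER. The entity strings are the examples given
# in the text; the stopword list is NLTK's and is an assumption here.
import nltk

nltk.download("stopwords", quiet=True)  # one-off download of NLTK stopword lists
from nltk.corpus import stopwords

italian_stopwords = set(stopwords.words("italian"))

entities = ["America del Nord", "Camera del Senato", "Gabriele d'Annunzio"]

for entity in entities:
    # Drop any token that appears in the stopword list, as a naive pipeline would
    stripped = " ".join(t for t in entity.split() if t.lower() not in italian_stopwords)
    print(f"{entity!r} -> {stripped!r}")
# e.g. 'America del Nord' -> 'America Nord': the preposition is lost, so a NER
# model trained on the full surface form may no longer recognise the entity.
```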

Another operation that we critically assessed concerns whether to lowercase the material before performing NER and geocoding. Lowercasing text before performing other actions can be a double-edged sword. For example, if lowercasing is not implemented, a NER algorithm will likely process tokens such as ‘USA’, ‘Usa’, ‘usa’, ‘UsA’ and ‘uSA’ as distinct items, even though they may all refer to the same entity. This may turn out to be problematic as it could provide a distorted representation of that particular entity and of how it is connected to other elements in the collection. On the other hand, if the material is lowercased, it may become difficult for the algorithm to identify ‘usa’ as an entity at all,2 which may result in a high number of false negatives, thus equally skewing the output. We once again intervened as human agents: we considered that entities such as persons, locations and organisations are typically capitalised in Italian and therefore, in preparation for NER and geocoding, lowercasing was not performed. However, once these steps were completed, we did lowercase the entities and, following a manual check, we merged multiple items referring to the same entity. This method allowed us to obtain a more realistic count of the number of entities identified by the algorithm and resulted in a significant redistribution of the entities across the different titles, as I will discuss in Sect. 3.3. Albeit more accurate, this approach did not come without problems and repercussions; many false negatives are still present and therefore the tagged entities are not all the entities in the collection. I will return to this point in Chap. 5.
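A sketch of the merging step just described, under the assumption that the NER output can be reduced to counts of entity surface forms; all counts below are invented for illustration.

```python
# Post-NER merging: entity surface forms that differ only in casing are
# collapsed into a single item so that counts reflect one underlying entity.
from collections import Counter

raw_entity_counts = Counter({
    "USA": 1200, "Usa": 310, "usa": 45, "UsA": 3,   # same entity, different casing
    "New York": 870, "NEW YORK": 22,
})

merged = Counter()
for surface_form, count in raw_entity_counts.items():
    merged[surface_form.lower()] += count

print(merged)
# Counter({'usa': 1558, 'new york': 892})
# In DeXTER this automatic merge was followed by a manual check, since
# lowercasing alone cannot distinguish genuinely different entities.
```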

The decision we took to remove numbers, dates and special characters is also a good example of the importance of being deeply engaged with the specificity of the source and of how that specificity changes the application of the technology through which that engagement occurs. Like the large majority of the newspapers collected in Chronicling America, the pages forming ChroniclItaly 3.0 were digitised primarily from microfilm holdings; the collection therefore presents the same issues common to OCR-generated searchable texts (as opposed to born-digital texts), such as errors derived from the low readability of unusual fonts or very small characters. However, in the case of ChroniclItaly 3.0, additional factors must be considered when dealing with OCR errors. The newspapers aggregated in the collection were likely digitised by different NDNP awardees, who probably employed different OCR engines and/or chose different OCR settings, thus ultimately producing different errors which in turn affected the collection’s accessibility in an unsystematic way. Like all ML prediction models, OCR engines embed the various biases encoded not only in the engine’s architecture but, more importantly, in the data-sets used for training the model (Lee 2020). These data-sets typically consist of sets of transcribed typewritten pages which embed human subjectivity (e.g., spelling errors) as well as individual decisions (e.g., spelling variations).

All these factors have wider, unpredictable consequences. As previously discussed in reference to microfilming (cfr. Sect. 2.2), OCR technology has raised concerns regarding marginalisation, particularly with reference to the technology’s consequences for content discoverability (Noble 2018; Reidsma 2019). These scholars have argued that this issue is closely related to the fact that the most largely implemented OCR engines are both licensed and opaquely documented; they therefore not only reflect the strategic, commercial choices made by their creators according to specific corporate logics but they are also practically impossible to audit. Despite being promoted as ‘objective’ and ‘neutral’, these systems incorporate prejudices and biases, strong commercial interests, third-party contracts and layers of bureaucratic administration. Nevertheless, this technology is implemented on a large scale and it therefore deeply impacts what—on a large scale—is found and lost, what is considered relevant and irrelevant, what is preserved and passed on to future generations and what will not be, what is researched and studied and what will not be accessed.

Understanding digital objects as post-authentic entails being mindful of all the alterations and transformations occurring prior to accessing the digital record and how each one of them is connected to wider networks of systems, factors and complexities, most of which are invisible and unpredictable. Similarly, any following intervention adds further layers of manipulation and transformation which incorporate the previous ones and which will in turn have future, unpredictable consequences. For example, in Sect. 2.2 I discussed how previous decisions about what was worth digitising dictated which languages needed to be prioritised, in turn determining which training data-sets were compiled for different language models, leading to the current strong bias towards English models, data-sets and tools and an overall digital language and cultural injustice.

Although the non-English content in Chronicling America has been reviewed by language experts, many additional OCR errors may have originated from markings on the material pages or a generally poor condition of the physical object. Again, the specificity of the source adds further complexity to the many problematic factors involved in its digitisation; in the case of ChroniclItaly 3.0, for example, we found that OCR errors were often rendered as numbers and special characters. To alleviate this issue, we decided to remove such items from the collection. This step impacted the material differently, not just across titles but even across issues of the same title. Figure 3.1 shows, for example, the impact of this operation on Cronaca Sovversiva, one of the newspapers collected in ChroniclItaly 3.0 with the longest publication record, spanning almost the entire archive period, 1903–1919. On the whole, this intervention reduced the total number of tokens from 30,752,942 to 21,454,455, equal to about 30% of the overall material being removed (Fig. 3.2). Although with sometimes substantial variation, we found the overall OCR quality to be generally better in the most recent texts. This characteristic is shared by most OCRed nineteenth-century newspapers, and it has been ascribed to a better conservation status or better initial condition of the originals, which overall improved over time (Beals and Bell 2020). Figure 3.3 shows the variation of removed material in L’Italia, the largest newspaper in the collection, comprising 6489 issues published uninterruptedly from 1897 to 1919.
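A hedged sketch of this kind of filtering step: it drops tokens that are purely numeric, contain non-alphabetic characters or are shorter than two characters, and reports the share of material removed. The regular expression and the sample string are illustrative assumptions; the exact rules used in DeXTER are documented in the project repository.

```python
# Illustrative pre-processing: keep only alphabetic tokens of at least two
# characters and measure how much material the filter removes.
import re

TOKEN_KEEP = re.compile(r"^[a-zàèéìòù]{2,}$", re.IGNORECASE)

def clean_tokens(tokens):
    """Keep only alphabetic tokens of at least two characters."""
    return [t for t in tokens if TOKEN_KEEP.match(t)]

def removal_percentage(tokens):
    kept = clean_tokens(tokens)
    return 100 * (len(tokens) - len(kept)) / len(tokens)

# Invented, OCR-error-like sample line for illustration only.
sample = "Il 12 maggio 1904 , §§ la Cronaca Sovversiva usc1va c0n 4 pagine".split()
print(clean_tokens(sample))
print(f"{removal_percentage(sample):.1f}% removed")
```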

Fig. 3.1
[Line graph: percentage of removed material per issue/year, 1904–1918.]

Variation of removed material (in percentage) across issues/years of Cronaca Sovversiva

Fig. 3.2
[Bar graph: share of material preserved (c. 70%) versus removed (c. 30%) per title.]

Impact of pre-processing operations on ChroniclItaly 3.0 per title. Figure taken from Viola and Fiscarelli (2021b)

Fig. 3.3
[Line graph: percentage of removed material per issue/year, 1900–1920.]

Variation of removed material (in percentage) across issues/years of L’Italia

Finally, my experience of previously working on the GeoNewsMiner (GNM) project (Viola et al. 2019) also influenced the decisions we took when enriching ChroniclItaly 3.0. As noted in Sect. 2.4, GNM loads ChroniclItaly 2.0, the version of the ChroniclItaly collections annotated with referential entities without any of the pre-processing tasks described here in reference to ChroniclItaly 3.0 having been performed. A post-tagging manual check revealed that, even though the F1 score of the NER model—that is, the measure used to test a model’s accuracy—was 82.88, due to OCR errors the locations occurring fewer than eight times were in fact false positives (Viola et al. 2019; Viola and Verheul 2020a). Hence, the interventions we made on ChroniclItaly 3.0 aimed to reduce OCR errors so as to increase the discoverability of elements that were not identified in the GNM project. When researchers are not involved in the creation of the applied algorithms or in choosing the data-sets for training them—which, especially in the humanities, represents the majority of cases—and consequently when tools, models and methods are simply reused as part of the available resources, the post-authentic framework can provide a critical methodological approach to address the many challenges involved in the process of digital knowledge creation.
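For readers unfamiliar with the measure, the F1 score is the harmonic mean of precision and recall; the sketch below also shows the kind of frequency filter used in GNM, where locations occurring fewer than eight times turned out to be false positives. All numbers here are invented for illustration.

```python
# F1 as the harmonic mean of precision and recall, plus a minimal
# frequency-threshold filter of the kind applied to GNM's location tags.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.85, 0.81), 4))  # e.g. 0.8295 on a 0-1 scale

# Hypothetical location counts, including OCR-garbled false positives.
location_counts = {"new york": 412, "roma": 198, "chicaco": 3, "bost0n": 1}
MIN_OCCURRENCES = 8  # below this threshold, tags proved unreliable in GNM
reliable = {loc: n for loc, n in location_counts.items() if n >= MIN_OCCURRENCES}
print(reliable)  # {'new york': 412, 'roma': 198}
```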

The illustrated examples demonstrate the complex interactions between the materiality of the source and the digital object, between the enrichment operations and the concurrent curator’s context, and even among the enrichment operations themselves. The post-authentic framework highlights the artificiality of any notion conceptualising digital objects as copies, unproblematised and disconnected from the material object. Indeed, understanding digital objects as post-authentic means acknowledging the continuous flow of interactions between the multiple factors at play, only some of which I have discussed here. Particularly in the context of digital cultural heritage, it means acknowledging the curators’ awareness that the past is written in the present and so it functions as a warning against ignoring the collective memory dimension of what is created, that is the importance of being digital.

3.3 NER and Geolocation

In addition to the typical motivations for annotating a collection with referential entities such as sorting unstructured data and retrieving potentially important information, my decision to annotate ChroniclItaly 3.0 using NER, geocoding and SA was also closely related to the nature of the collection itself, i.e., the specificity of the source. One of the richest values of engaging with records of migrants’ narratives is the possibility to study how questions of cultural identities and nationhood are connected with different aspects of social cohesion in transnational, multicultural and multilingual contexts, particularly as a social consequence of migration. Produced by the migrants themselves and published in their native language, ethnic newspapers such as those collected in ChroniclItaly 3.0 function in a complex context of displacement, and as such, they offer deep, subjective insights into the experience and agency of human migration (Harris 1976; Wilding 2007; Bakewell and Binaisa 2016; Boccagni and Schrooten 2018).

Ethnic newspapers, for instance, provide extensive material for investigating the socio-cognitive dimension of migration through markers of identity. Markers of identity can be cultural, social or biological, such as artefacts, family or clan names, marriage traditions and food practices, to name but a few (Story and Walker 2016). Through shared claims of ethnic identity, these markers are essential to communities for maintaining internal cohesion and negotiating social inclusion (Viola and Verheul 2019a). But in diasporic contexts, markers of identity can also reveal the subtle, changing renegotiations of migrants’ cultural affiliation in mediating interests of the homeland with the host environment. Especially when connected with entities such as places, people and organisations, these markers can be part of collective narratives of pride, nostalgia or loss, and their analysis may therefore bring insights into how cultural markers of identity and ethnicity are formed and negotiated and how displaced individuals make sense of their migratory experience. The ever-growing amount of available digital sources, however, has created a complexity that cannot easily be navigated, certainly not through close reading methods alone. Computational methods such as NER, though presenting limitations and challenges, can help identify names of people, places, brands and organisations, thus providing a way to identify markers of identity on a large scale.

We annotated ChroniclItaly 3.0 by using a NER deep learning sequence tagging tool (Riedl and Padó 2018) which identified 547,667 entities occurring 1,296,318 times across the ten titles.3 A close analysis of the output, however, revealed a number of issues which required a critical intervention combining expert knowledge and technical ability. In some cases, for example, entities had been assigned the wrong tag (e.g., ‘New York’ tagged as a person), other times elements referring to the same entity had been tagged as different entities (e.g., ‘Woodrow Wilson’, ‘President Woodrow Wilson’), and in some other cases elements identified as entities were not entities at all (e.g., venerdí ‘Friday’ tagged as an organisation). To avoid the risk of introducing new errors, we intervened on the collection manually; we performed this task by first conducting a thorough historical triangulation of the entities and then by compiling a list of the most frequent historical entities that had been attributed the wrong tag. Although it was not possible to ‘repair’ all the tags, this post-tagging intervention affected the redistribution of 25,713 entities across all the categories and titles, significantly improving the accuracy of the tags that would serve as the basis for the subsequent enrichment operations (i.e., geocoding and SA). Figure 3.4 shows how in some cases the redistribution caused a substantial variation: for example, the number of entities in the LOC (location) category significantly decreased in La Rassegna but increased in L’Italia. The documentation of these processes of transformation is available Open Access4 and acts as a way to acknowledge them as problematic, as undergoing several layers of manipulation and intervention, including the multidirectional relationships between the specificity of the source, the digitised material and all the surrounding factors at play. Ultimately, the post-authentic framework for digital objects frames digital knowledge creation as honest and accountable, unfinished and receptive to alternatives.
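A sketch of how such a manually compiled correction list might be applied to the NER output. The correction table below contains only the illustrative cases mentioned in the text, not the project’s actual list, and the data structure is an assumption.

```python
# Post-tagging correction: apply a manually compiled table of frequent
# mis-tagged entities to (surface_form, tag) pairs produced by the NER tool.
corrections = {
    ("New York", "PER"): ("New York", "LOC"),                        # wrong tag
    ("President Woodrow Wilson", "PER"): ("Woodrow Wilson", "PER"),  # variant form
    ("venerdí", "ORG"): None,                                        # not an entity: drop
}

def correct(tagged_entities):
    """Apply the manual correction table to (surface_form, tag) pairs."""
    for entity in tagged_entities:
        if entity in corrections:
            fixed = corrections[entity]
            if fixed is not None:   # None marks items that were not entities at all
                yield fixed
        else:
            yield entity

ner_output = [("New York", "PER"), ("Woodrow Wilson", "PER"), ("venerdí", "ORG")]
print(list(correct(ner_output)))
# [('New York', 'LOC'), ('Woodrow Wilson', 'PER')]
```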

Fig. 3.4
[Bar graph: change in the number of LOC, PER and ORG entities per title after the manual intervention.]

Distribution of entities per title after intervention. Positive bars indicate a decreased number of entities after the process, whilst negative bars indicate an increased number. Figure taken from Viola and Fiscarelli (2021b)

Once entities in ChroniclItaly 3.0 were identified, annotated and verified, we decided to geocode places and locations and to subsequently visualise their distribution on a map. Especially in the case of large collections with hundreds of thousands of such entities, their visualisation may greatly facilitate the discovery of deeper layers of meaning that may otherwise be largely or totally obscured by the abundance of material available. I will discuss the challenges of visualising digital objects in Chap. 5 and illustrate how the post-authentic framework can guide both the development of a UI and the encoding of criticism into graphical display approaches.

Performing geocoding as an enrichment intervention is another example of how the process of digital knowledge creation is inextricably entangled with external dynamics and processes, dominant power structures and past and current systems in an intricate net of complexities. In the case of ChroniclItaly 3.0, for instance, the process of enriching the collection with geocoding information shares many of the same challenges as any material whose language is not English. Indeed, the relative scarcity of certain computational resources available for languages other than English, as already discussed, often dictates which tasks can be performed, with which tools and through which platforms. Practitioners and scholars, as well as curators of digital sources, often have to choose between creating resources ad hoc, e.g., developing new algorithms, fine-tuning existing ones or training their own models according to their specific needs, or more simply using the resources available to them. Either option may not be ideal or even possible at all, however. For example, due to time or resource limitations or to a lack of specific expertise, the first approach may not be economically or technically feasible. On the other hand, even when models and tools in the language of the collection do exist—as in the case of ChroniclItaly 3.0—their creation will typically have occurred within the context of another project and for other purposes, possibly using training data-sets with very different characteristics from the material one is enriching. This often means that the curator of the enrichment process must inevitably make compromises with the methodological ideal. For example, in the case of ChroniclItaly 3.0, in the interest of time, we annotated the collection using an already existing Italian NER model. The manual annotation of parts of the collection to train an ad hoc model would certainly have yielded much more accurate results, but it would have been a costly, lengthy and labour-intensive operation. On the other hand, while being able to use an already existing model was certainly helpful and provided an acceptable F1 score, it also resulted in a poor individual performance for the detection of the entity LOC (locations) (54.19%) (Viola and Fiscarelli 2021a). This may have been due to several factors, such as a lack of LOC-category entities in the data-set used for originally training the NER model or a difference between the types of LOC entities in the training data-set and those in ChroniclItaly 3.0. Regardless of the reason, due to the low score, we decided not to geocode (and therefore not to visualise) the entities tagged as LOC; they can however still be explored, for example, as part of SA or in the GitHub documentation available Open Access. Though not optimal, this decision was motivated also by the fact that geopolitical entities (GPE) are generally more informative than LOC entities as they typically refer to countries and cities (though sometimes the algorithm also retrieved counties and states), whereas LOC entities are typically rivers, lakes and geographical areas (e.g., the Pacific Ocean). However, users should be aware that the entities currently geocoded are by no means all the places and locations mentioned in the collection; future work may also focus on performing NER using a more fine-tuned algorithm so that the LOC-type entities could also be geocoded.
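A hedged sketch of geocoding the retained GPE entities, assuming the geopy library and the public Nominatim service; the geocoding service, entity list and rate-limiting strategy actually used in DeXTER may differ and are documented in the project repository.

```python
# Geocode a handful of illustrative GPE entities and collect their coordinates.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="dexter-geocoding-sketch")  # hypothetical user agent

gpe_entities = ["New York", "Chicago", "Roma"]  # illustrative GPE entities only

coordinates = {}
for place in gpe_entities:
    location = geolocator.geocode(place)
    if location is not None:            # unresolved names are simply skipped here
        coordinates[place] = (location.latitude, location.longitude)

print(coordinates)
```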

3.4 Sentiment Analysis

Annotating textual material for attitudes—either sentiment or opinions—through a method called sentiment analysis (SA) is another enriching technique that can add value to digital material. This method aims to identify the prevailing emotional attitude in a given text, though it often remains unclear whether the method detects the attitude of the writer or the polarity expressed in the analysed textual fragment (Puschmann and Powell 2018). Within DeXTER, we used SA to identify the prevailing emotional attitude towards referential entities in ChroniclItaly 3.0. Our intention was twofold: firstly, to obtain a more targeted enrichment experience than would have been possible by applying SA to the entire collection and, secondly, to study referential entities as markers of identity so as to access the layers of meaning migrants historically attached to people, organisations and geographical spaces. Through the analysis of the meaning humans invested in such entities, our goal was to delve into how their collective emotional narratives may have changed over time (Tally 2011; Donaldson et al. 2017; Taylor et al. 2018; Viola and Verheul 2020a). Because of the specific nature of ChroniclItaly 3.0, this exploration inevitably intersects with understanding how questions of cultural identities and nationhood were connected with different aspects of social cohesion (e.g., transnationalism, multiculturalism, multilingualism), how processes of social inclusion unfolded in the context of the Italian American diaspora, how Italian migrants managed competing feelings of belonging and how these may have changed over time.

SA is undoubtedly a powerful tool that can facilitate the retrieval of valuable information when exploring large quantities of textual material. Understanding SA within the post-authentic framework, however, means recognising that specific assumptions about what constitutes valuable information, what is understood by sentiment and how it is understood and assessed guided the design of the technique. All these assumptions are invisible to the user; the post-authentic framework warns the analyst to be wary of the indiscriminate use of the technique. Indeed, like other techniques used to augment digital objects, including digital heritage material, SA did not originate within the humanities; SA is a computational linguistics method developed within natural language processing (NLP) studies as a subfield of information retrieval (IR). In the context of visualisation methods, Johanna Drucker has long discussed the dangers of a blind and unproblematised application of approaches brought into the humanities from other disciplines, including computer science. With particular reference to the specific assumptions at the foundation of these techniques, she points out, ‘These assumptions are cloaked in a rhetoric taken wholesale from the techniques of the empirical sciences that conceals their epistemological biases under a guise of familiarity’ (Drucker 2011, 1). In Chap. 4, I will discuss the implications of a very closely related issue, the metaphorical use of everyday lexicon such as ‘sentiment analysis’, ‘topic modelling’ and ‘machine learning’ as a way to create familiar images whilst referring to rather different concepts from what is generally internalised in the collective image. In the case of SA, for example, the use of the familiar word ‘sentiment’ conceals the fact that this technique was specifically designed to infer general opinions from product reviews and that, accordingly, it was not conceived for empirical social research but first and foremost as an economic instrument.

The application of SA in domains different from its original conception poses several challenges which are well known to computational linguists—the technique’s creators—but perhaps less known to others. Whilst opinions about products and services are not typically problematic, as this is precisely the task for which SA was developed, opinions about social and political issues are much harder to tackle due to their much higher linguistic and cultural complexity. This is because SA algorithms lack sufficient background knowledge of the local social and political contexts, not to mention the challenges of detecting and interpreting sarcasm, puns, plays on words and irony (Liu 2020). Thus, although most SA techniques will score opinions about products and services fairly accurately, they will likely perform poorly on opinionated social and political texts. This limitation makes the use of SA problematic when other disciplines such as the humanities and the social sciences borrow it uncritically; worse yet, it raises disturbing questions when the technique is embedded in a range of algorithmic decision-making systems based, for instance, on content mined from social media. Since its explosion in the early 2000s, SA has been heavily used in domains of society that transcend the method’s original conception: it is constantly applied to make stock market predictions, in the health sector and by government agencies to analyse citizens’ attitudes or concerns (Liu 2020).

In this already overcrowded landscape of interdependent factors, there is another element that adds yet more complexity to the matter. As with other computational techniques, the discourse around SA depicts the method as detached from any subjectivity, as a technique that provides a neutral and observable description of reality. In their analysis of the cultural perception of SA in research and the news media, Puschmann and Powell (2018) highlight, for example, how the public perception of SA is misaligned with its original function and how such misalignment ‘may create epistemological expectations that the method cannot fulfill due to its technical properties and narrow (and well-defined) original application to product reviews’ (2). Indeed, we are told that SA is a quantitative method that provides us with a picture of opinionated trends in large amounts of material otherwise impossible to map. In reality, the reduction of something as idiosyncratic as the definition of human emotions to two or three categories is highly problematic, as it hides the whole set of assumptions behind the very establishment of such categories. For example, it remains unclear what is meant by neutral, positive or negative, as these labels are typically presented as a given, as if they were unambiguous categories universally accepted (Puschmann and Powell 2018). On the contrary, to put it in Drucker’s words, ‘the basic categories of supposedly quantitative information […] are already interpreted expressions’ (Drucker 2011, 4).

Through the lens of the post-authentic framework, the application of SA is acknowledged as problematic and so is the intrinsic nature of the technique itself. A SA task is usually modelled as a classification problem, that is, a classifier processes pre-defined elements in a text (e.g., sentences), and it returns a category (e.g., positive, negative or neutral). Although there are so-called fine-grained classifiers which attempt to provide a more nuanced distinction of the identified sentiment (e.g., very positive, positive, neutral, negative, very negative) and some others even return a prediction of the specific corresponding sentiment (e.g., anger, happiness, sadness), in the post-authentic framework, it is recognised that it is the fundamental notion of sentiment as discrete, stable, fixed and objective that is highly problematic. In Chap. 4, I will return to this concept of discrete modelling of information with specific reference to ambiguous material, such as cultural heritage texts; for now, I will discuss the issues concerning the discretisation of linguistic categories, a well-known linguistic problem.

In his classic book Foundations of Cognitive Grammar, Ronald Langacker (1983) famously pointed out how it is simply not possible to unequivocally define linguistic categories; this is because language does not exist in a vacuum and all human exchanges are always context-bound, view-pointed and processual (see Langacker 1983; Talmy 2000; Croft and Cruse 2004; Dancygier and Sweetser 2012; Gärdenfors 2014; Paradis 2015). In fields such as corpus linguistics, for example, which heavily rely on manually annotated language material, disagreement between human annotators on the same annotation decisions is in fact expected and taken into account when drawing linguistic conclusions. This factor is known as ‘inter-annotator agreement’ and it is rendered as a measure that calculates the agreement between the annotators’ decisions about a label. The inter-annotator agreement measure is typically a percentage and depends on many factors (e.g., number of annotators, number of categories, type of text); it can therefore vary greatly, but generally speaking, it is never expected to be 100%. Indeed, in the case of linguistic elements whose annotation is highly subjective because it is inseparable from the annotators’ culture, personal experiences, values and beliefs—such as the perception of sentiment—this percentage has been found to remain at 60–65% at best (Bobicev and Sokolova 2018).
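One common way to quantify inter-annotator agreement is Cohen’s kappa, sketched here with scikit-learn on invented annotations from two annotators; the studies cited above may use different agreement measures (e.g., raw percentage agreement or Krippendorff’s alpha), so this is illustrative only.

```python
# Cohen's kappa on two hypothetical annotators' sentiment labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
annotator_b = ["positive", "neutral",  "neutral", "positive", "negative", "positive"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 would mean perfect agreement, which
                                      # is never expected for sentiment labels
```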

The post-authentic framework for digital knowledge creation introduces a counter-narrative to the main discourse that oversimplifies automated algorithmic methods such as SA as objective and unproblematic, and it encourages a more honest conversation across fields and in society. It acknowledges and openly addresses the interrelations between the chosen technique and its deep entrenchment in the system that generated it. In the case of SA, it advocates more honesty and transparency when describing how the sentiment categories have been identified, how the classification has been conducted, what the scores actually mean, how the results have been aggregated and so on. At the very least, an acknowledgement of such complexities should be present when using these techniques. For example, rather than describing the results as finite, unquestionable, objective and certain, a post-authentic use of SA incorporates full disclosure of the complexities and ambiguities of the processes involved. This would contribute to ensuring accountability when these analytical systems are used in domains outside of their original conception, when they are implemented to inform centralised decisions that affect citizens and society at large or when they are used to interpret the past or write the future past.

The decision of how to define the scope (see for instance Miner 2012) prior to applying SA is a good example of how the post-authentic framework can inform the implementation of these techniques for knowledge creation in the digital. The definition of the scope includes defining problematic concepts of what constitutes a text, a paragraph or a sentence, and how each one of these definitions impacts on the returned output, which in turn impacts on the digitally mediated presentation of knowledge. In other words, in addition to the already noted caveats of applying SA particularly for social empirical research, the post-authentic framework recognises the full range of complexities derived from preparing the material, a process—as I have discussed in Sect. 3.2—made up of countless decisions and judgement calls. The post-authentic framework acknowledges these decisions as always situated, deeply entrenched in internal and external dynamics of interpretation and management which are themselves constructed and biased. For example, when preparing ChroniclItaly 3.0 for SA, we decided that the scope was ‘a sentence’, which we defined as the portion of text: (1) delimited by punctuation (i.e., full stop, semicolon, colon, exclamation mark, question mark) and (2) containing only the most frequent entities. If, on the one hand, this approach considerably reduced processing time and costs, on the other hand, it may have caused less frequently mentioned entities to be underrepresented. To at least partially overcome this limitation, we used the logarithmic function 2*log25 to obtain a more homogeneous distribution of entities across the different titles, as shown in Fig. 3.5.
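A sketch of the scope definition just described: split the text into ‘sentences’ at the listed delimiters and keep only those mentioning one of the selected entities. The entity selection is simplified here to a plain list passed in by the caller, whereas DeXTER selected entities per title using the logarithmic function mentioned above; the sample text is invented.

```python
# Build the SA scope: sentences delimited by the listed punctuation marks that
# contain at least one of the selected (most frequent) entities.
import re

SENTENCE_DELIMITERS = re.compile(r"[.;:!?]")   # full stop, semicolon, colon, ! and ?

def sentences_in_scope(text, selected_entities):
    sentences = [s.strip() for s in SENTENCE_DELIMITERS.split(text) if s.strip()]
    return [s for s in sentences
            if any(e.lower() in s.lower() for e in selected_entities)]

text = "Il presidente Wilson parlò ieri. La giornata era fredda! New York attende notizie."
print(sentences_in_scope(text, selected_entities=["Wilson", "New York"]))
# ['Il presidente Wilson parlò ieri', 'New York attende notizie']
```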

Fig. 3.5
[Horizontal bar graph: number of selected entities per title; L’Italia has the most, La Rassegna the fewest.]

Logarithmic distribution of selected entities for SA across titles. Figure taken from Viola and Fiscarelli (2021b)

As for the implementation of SA itself, due to the lack of suitable SA models for Italian when DeXTER was carried out, we used the Google Cloud Natural Language Sentiment Analysis6 API (Application Programming Interface) within the Google Cloud Platform Console,7 a console of technologies which also includes NLP applications in a wide range of languages. The SA API returned two values: sentiment score and sentiment magnitude. According to the available documentation provided by Google,8 the sentiment score—which ranges from −1 to 1—indicates the overall emotional polarity of the processed text (e.g., positive, negative, neutral), whereas the sentiment magnitude indicates how much emotional content is present within the document; the latter value is often proportional to the length of the analysed text. The sentiment magnitude ranges from 0 to 1, whereby 0 indicates what Google defines as ‘low-emotion content’ and 1 indicates ‘high-emotion content’, regardless of whether the emotion is identified as positive or negative. The magnitude value is meant to help differentiate between low-emotion and mixed-emotion cases, as both would be scored as neutral by the algorithm. As such, it alleviates the issue of reducing something as vague and subjective as the perception of emotions to three rigid and unproblematised categories. However, the post-authentic framework recognises that any conclusion based on results derived from SA should acknowledge a degree of inconsistency between the way the categories of positive, negative and neutral emotion have been defined in the training model and the writer’s intention in the actual material to which the model is applied. Specifically, the Google Cloud Natural Language Sentiment Analysis algorithm differentiates between positive and negative emotion in a document, but it does not specify what is meant by positive or negative. For example, if in the model sentiments such as ‘angry’ and ‘sad’ are both categorised as negative emotions regardless of their context, the algorithm will identify either text as negative, not as ‘sad’ or ‘angry’, thus adding further ambiguity to the already problematic and non-transparent way in which ‘sad’ and ‘angry’ were originally defined and categorised. To partially address this issue, we established thresholds within the sentiment range for defining ‘clearly positive’ (i.e., > 0.3) and ‘clearly negative’ (i.e., < −0.3) cases. The downside of this approach, however, was that the algorithm considered all the cases between these two values as neutral/mixed-emotion cases, which inevitably led to a flattening of nuances. In Chap. 5, I will return to the ambiguities of SA when discussing the design choices for developing the DeXTER app, the interactive visualisation tool to explore ChroniclItaly 3.0, and I will present suggestions towards visualising the complexities and uncertainties in data-models and visualisation techniques.
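A hedged sketch of this sentiment step, assuming the current google-cloud-language Python client (language_v1); the API version and client used when DeXTER was carried out may differ, and the example sentence is invented. The ±0.3 thresholds are those described above.

```python
# Classify a sentence with the Google Cloud Natural Language Sentiment API and
# map the returned score onto the 'clearly positive' / 'clearly negative' /
# 'neutral/mixed' categories used in DeXTER.
from google.cloud import language_v1

def classify_sentence(text, client, positive_threshold=0.3, negative_threshold=-0.3):
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
        language="it",                      # Italian material
    )
    sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
    if sentiment.score > positive_threshold:
        label = "clearly positive"
    elif sentiment.score < negative_threshold:
        label = "clearly negative"
    else:
        label = "neutral/mixed"             # everything in between is flattened
    return label, sentiment.score, sentiment.magnitude

client = language_v1.LanguageServiceClient()    # requires Google Cloud credentials
print(classify_sentence("La festa per l'arrivo dei nostri fu bellissima.", client))
```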

The application of the post-authentic framework to SA highlights that the technique is far from being methodologically ideal, and it calls attention to all the uncertainties of using it in fields other than IR and for tasks other than product reviews, as in the use case discussed here. The post-authentic framework therefore acts as a warning against these shortcomings and creates a space for accountability for the adopted curatorial decisions. Within DeXTER and ChroniclItaly 3.0, we thoroughly documented such decisions, which can be accessed through the openly available dedicated GitHub repository,9 which also includes the code, links to the original and processed material, and the files documenting the manual interventions. Ultimately, the post-authentic framework counterbalances the main public discourse—separate from computational research—which promotes SA as an exact way to measure emotions and opinions; it recognises when the use of the technique is disconnected from its original purpose and accordingly advocates the reworking of users’ epistemological expectations.

In this respect, the implementation of the post-authentic framework for knowledge creation in the digital relates to one of the central pillars of science, that of replicability (or reproducibility/repeatability).10 The principle postulates that, by following a study’s detailed descriptions, the claims and conclusions obtained by scientists can be verified by others. This is done in the name of transparency, traceability and accountability, which are also fundamental aspects of post-authentic work. The difference, however, lies in the purpose of these fundamental notions; whereas in science they are primarily aimed at allowing independent confirmation of a study’s results, within the post-authentic framework they are not solely concerned with this specific scientific goal and in fact move beyond it. For example, traditionally, a study is believed to be replicable if sufficient transparency has been observed regarding the data, the research purposes, the method, the conclusions, etc., and yet some studies can be perfectly transparent and not at all replicable (Peels 2019; Viola 2020b). This is, for instance, believed to be the case especially in the humanities, for which the very nature of some studies can make replication impossible, for example, due to a particularly interpretative analysis (Peels 2019).

On the opposite end of the scale, empirical works are believed to be—at least in theory—fully replicable. Thus, despite the still unresolved debate on the ‘R-words’, over the years protocols and standards for replication in science have been perfected and systematised. When computers started to be used for experiments and data analysis, things became complicated. Plesser (2018), for instance, explains how it became apparent that the canonical margins for experimental error somehow did not apply to digital research:

Since digital computers are exact machines, practitioners apparently assumed that results obtained by computer could be trusted, provided that the principal algorithms and methods employed were suitable to the problem at hand. Little attention was paid to the correctness of implementation, potential for error, or variation introduced by system soft- and hardware, and to how difficult it could be to actually reconstruct after some years—or even weeks—how precisely one had performed a computational experiment. (Plesser 2018, 1)

The post-authentic framework is comfortable with the belief that the attainability of complete objectivity (and therefore of perfect replicability) is always but an illusion. Indeed, the post-authentic relevance of transparency, traceability and, consequently, accountability lies primarily in the acknowledgement of a collective responsibility, the one that comes with the building of a source of knowledge for current and future generations. Thus, within the post-authentic framework, being transparent about both the ‘raw’ and the processed material, about the methodology, the analytical processes and the tools assumes a whole new importance: the creation of other digital forms which make it possible to trace technical obsolescence, acknowledge power relations and attempt to fluidly incorporate the exchanges that lead to symbiosis, not friction, across interactions. As argued by Fiona Cameron with regard to digital cultural heritage (2021, 12):

[digital cultural heritage] encapsulate[s] other registers of significance, temporality and agency such as planetary technological infrastructures, material agency, non-human, elemental, and earthly processes, all of which are invisible figures in their constitution.

The post-authentic framework for digital knowledge creation recognises that whatever arises out of the confluence of all these different agencies cannot be fully predicted. The role of documentation by researchers, museums, archives, libraries, software developers and so on acts therefore as a means to acknowledge that we are writing the future past and that writing the past means controlling the future. The post-authentic framework provides an architecture to meet the need for accountability to current and future generations.

Finally, the documentation of the interventions has wider resonance, particularly in relation to increasing awareness of sustainability in digital knowledge creation. In June 2020, the UN published the Roadmap for Digital Cooperation report, which set a list of key actions to be achieved by 2030 in order to advance a more equitable digital world. Whilst acknowledging that ‘Meaningful participation in today’s digital age requires a high-speed broadband connection to the Internet’ (United Nations 2020b, 5), the report also highlights that half of the world’s population (3.7 billion people) currently does not have access to the Internet. The lack of digital access, also commonly referred to as the ‘Digital Divide’, affects those mostly located in least developed countries (LDCs), landlocked developing countries (LLDCs) and small island developing states (SIDS), with an even more acute gap in regions such as sub-Saharan Africa, where only 11% of people have access to household computers and 82% lack Internet access altogether.

The digital inequality worsens already existing inequalities in society, as those who are the most vulnerable are disproportionately affected by the divide. Based as they are on a universal vision of digital transformation, current digital knowledge creation practices therefore face not only the danger of being available exclusively to half of humanity but also that of yet again imposing Western-centred perspectives on how knowledge is created and accessed. The future looks ever more digital and digitally available repositories will become larger and larger; reconceptualising digital objects within the post-authentic framework also means fostering their reconceptualisation not just in terms of what we are digitising but also how and for whom. In this sense, the creation, curation, analysis and visualisation of digital objects should, whenever possible, prefer methods and practices that make curatorial workflows sustainable, interoperable and reusable. This should include the storage of the material in an Open Access repository, the use of freely available and fully documented software and a thorough documentation of the implemented steps and interventions, including an explanation of the choices made, which will in turn facilitate research accessibility, transparency and dissemination.

In the next chapter, I will illustrate the third use case of the book, the application of the post-authentic framework to digital analysis. Through the example of topic modelling, I will show how the post-authentic framework can guide a deep understanding of the assemblage of culture and technology in software and help us achieve the interpretative potential of computation. I will specifically discuss the implications for knowledge creation of the transformation of continuous material into discrete form—binary sequences of 0s and 1s—with particular reference to the notions of causality and correlations. Within this broader discussion, I will then illustrate the example of topic modelling as a computational technique that treats a collection of texts as discrete data, and I will focus on the critical aspects of topic modelling that are highly dependent on the sources: pre-processing, corpus preparation and deciding on the number of topics. The topic modelling example ultimately shows how producing digital knowledge requires sustained engagement with software, in the form of fluid, symbiotic exchanges between processes and sources.