The mass digitisation of archives and exponential growth in born-digital archives over the past two decades, together with the publication of large volumes of aggregated cultural heritage metadata over the past decade, have resulted in an enormous volume of archives and associated (meta)data being available digitally. This has produced a valuable but under-utilised source of large-scale digital data ripe for interrogation by scholars and practitioners in the interdisciplinary field of the Digital Humanities, opening up digitised archives to a range of novel digital research methods. Using computational methods, digital humanists interrogate large digital datasets to address an ever-expanding range of humanities questions. This article has adopted a broad interpretation of the term “the Digital Humanities” as being at the intersection of computational technologies and humanities scholarship (Schreibman et al. 2004), and uses the term “digital humanist” to refer to any person engaged in scholarship or practice within the field of Digital Humanities and/or using computational methods for humanities research. The Digital Humanities is driven by digital data; however, it is not enough that archival data is available digitally: it needs to be integrated, interoperable, and interrogable (Bikakis 2021; Koho et al. 2021; Zeng 2019). Current approaches to digitisation (the process of producing a digital surrogate of an analogue object) usually result in the production of unstructured data, for example in the form of a digital image, sound or moving image file, and consequently present a number of barriers to digital humanists.

In order to increase the utilisation of digitised and born-digital archives by digital humanists, it is essential that their contents and associated metadata are made available in a machine-readable format. Such “datafication” “…renders a diverse range of information as machine-readable, quantifiable data for the purpose of aggregation and analysis” (Southerton 2020). Digitisation is often a primary stage in datafication, yet as a process it remains distinct from datafication, which emerged with the growth of data analytics and big data over the past decade (Mayer-Schönberger and Cukier 2013). It is beyond the scope of this article to provide a detailed examination of datafication. However, it must be noted that the process of datafication has been critiqued as losing much qualitative detail of the source material (Southerton 2020), as perpetuating and subjecting new harms and discrimination upon already vulnerable communities (Noble 2018; Sutherland 2019), and as having “benefited governments, medical institutions, and corporations at the expense of citizens’ and consumers’ liberty and privacy” (Schüll 2020, pp. 457), among other criticisms. On the other hand, Blanke and Prescott assert that datafication has particular significance for the humanities as it “suggests that for big data work unstructured data is not enough” (2016, pp. 193). Michael Moss et al. suggest the key consequence of the move towards datafication for archives is that they have been transformed from a collection of administrative records into “a collection of data to be mined” (Moss et al. 2018, pp. 120).

When engaged with critically, the datafication of digitised archives provides many new opportunities for digital humanists to make use of the vast quantity of archives digitised over the past two decades. However, making such heterogeneous and distributed digital data available in accordance with the FAIR Guiding Principles—that data be published as Findable, Accessible, Interoperable and Re-usable data (Wilkinson et al. 2016)—is one of the major challenges facing the cultural heritage domain. Linked Data provides a viable means of archival datafication capable of implementing the FAIR Guiding Principles, creating machine-readable, interoperable, extensible Archival Linked Data suited to interrogation and analysis using digital humanities research methods. Using Linked Data, archival data (catalogue data, metadata, data extracted from the contents of born-digital and digitised archives) can be embedded into the web, enriching and further contextualising archival data, and making it easier to discover, access, and utilise.

Soon after the introduction of Linked Data, early case studies began to explore its potential application in the archives sector (Clough et al. 2011; Ruddock 2011). In the intervening decade, a substantial body of associated research has continued to develop examining the application of Linked Data to archives. Numerous published case studies are available, with Karen F. Gracy and Jinfang Niu providing some scholarly assessment. The majority of available literature principally focuses on the technical challenges posed by the production of Linked Data and the development of solutions to them. Critical engagement and assessment of Archival Linked Data is currently absent, and there has been little consideration of the benefits that Archival Linked Data offers to digital humanities research and scholarship, or of how it can increase utilisation of digitised or born-digital archives. In digital humanities scholarship, in contrast, there is growing evidence of how Linked Data specifically supports research. Over the past decade an increased adoption of Linked Data is evident in digital humanities scholarship, where it has been presented as a means of increasing the interoperability of digital cultural heritage data (Jordanous et al. 2012; Miyakita et al. 2018). Case studies provide evidence to suggest that Linked Data benefits the Digital Humanities in that it provides a viable means of publishing the digital data which fuels the Digital Humanities in the form of integrated, harmonised, and interoperable large-scale datasets which can be reused in multiple ways. However, notwithstanding that evidence, analysis of Linked Data and how it benefits the Digital Humanities remains limited.

Despite an evident growth of interest, Linked Data remains under-examined in both archival and digital humanities scholarship. Scholarly examination has placed a heavy emphasis on the technical aspects (the “how” of Linked Data), but has given little consideration to the “why”, i.e. the benefits of Linked Data. This article begins to address this gap by examining how Archival Linked Data is beneficial for those engaged in digital humanities research and scholarship. In considering some of the barriers currently preventing digital humanists from being able to utilise digitised and born-digital archives, and suggesting the production of Archival Linked Data as part of interdisciplinary collaboration as a way forward, this article is a first step towards developing an understanding of the potential intersections between archival and digital humanities Linked Data scholarship and praxis.

It does this in four ways: first, after a very brief introduction to Linked Data, the article provides a snapshot of the Linked Data landscape in both archival and digital humanities scholarship. Secondly, extrapolating from archival scholarly, case-study, and practice-based literature, it details the benefits of Archival Linked Data which directly support increasing access to digitised archives for the purposes of digital humanities scholarship and practice. Thirdly, it examines some of the barriers currently preventing digital humanists from being able to take full advantage of the abundant potential of Archival Linked Data, and of utilising current digitised or datafied archives. Drawing these findings together, the author considers future directions of Archival Linked Data, including its role in processing and increasing access to born-digital archives, the incorporation of Artificial Intelligence (AI) and low-barrier tools to scale up the production of Archival Linked Data, and increased collaboration between archival and digital humanities scholars and practitioners. This is a theoretical account; further research is needed to validate, through real-life application, the benefits of Linked Data for the purposes of digital humanities scholarship. In particular, examination of how the technological and structural barriers highlighted here are actually affecting the critical work of, for example, a national archive institution, and identification of the changes and developments necessary to ensure the progress of digital scholarship dependent on Linked Data, would be especially valuable contributions to current scholarship and practice.

Linked Data

First conceived in 2006 as the building blocks of the Semantic Web, Linked Data is most simply described as a “set of best practices for publishing and connecting structured data on the Web” (Bizer et al. 2009, pp. 1). Whereas the Web is based upon documents (e.g. web pages) being linked to other documents, the Semantic Web is structured in the form of semantically defined data linked to other data, known as Linked Data. Linked Data itself is not a specific technology or tool, rather it is a set of principles for publishing structured data which underpin the creation of interoperable machine-readable data on the Semantic Web. The Linked Data Principles state:

  1. Use URIs [Uniform Resource Identifiers] as names for things

  2. Use HTTP URIs so that people can look up those names

  3. When someone looks up a URI, provide useful information, using the standards (RDF [Resource Description Framework], SPARQL [SPARQL Protocol and RDF Query Language])

  4. Include links to other URIs so that they [people and intelligent Semantic Web agents] can discover more things (Berners-Lee 2006)

Linked Data is structured in the form of triples as subject-predicate-object, whereby the predicate semantically defines the relationship between the two entities. By using unique identifiers, standardised ontologies and vocabularies, large, interconnected networks of these entities and their relationships are created, which can be represented in the form of a knowledge graph.
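The triple structure described above can be illustrated with a minimal sketch in plain Python. It is not drawn from any of the projects discussed here: the `example.org` URIs, the vocabulary terms, and the helper names `build_graph` and `neighbours` are all hypothetical, and a production system would normally use an RDF library rather than tuples.

```python
from collections import defaultdict

# Hypothetical namespace; these URIs do not come from any real dataset.
EX = "http://example.org/"

# Each triple is (subject, predicate, object). The predicate names the
# relationship between the two entities it connects.
triples = [
    (EX + "person/1", EX + "vocab/name", "Ada Example"),
    (EX + "person/1", EX + "vocab/createdRecord", EX + "record/42"),
    (EX + "record/42", EX + "vocab/heldBy", EX + "archive/A"),
]


def build_graph(triples):
    """Index triples by subject so each node lists its outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def neighbours(graph, subject):
    """Return the objects directly linked from a subject, in insertion order."""
    return [obj for _, obj in graph.get(subject, [])]


graph = build_graph(triples)
```

Because the object of one triple (`record/42`) is the subject of another, following the links from `person/1` leads on to the holding archive, which is the sense in which a set of triples forms a graph.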

The Linked Data principles were followed up in 2010 by the 5 Star Linked Open Data Scheme which incorporated the Open Knowledge Foundation’s definition of Open Data as that which “can be freely used, modified and shared by anyone for any purpose” (Open Knowledge Foundation (ND) 2021). The 5 Star Scheme defines a sliding scale of the fundamental technical and accessibility requirements of Linked Data as it moves towards becoming Linked Open Data: data which is available on the web (1*), as machine-readable structured data (2*), in a non-proprietary format (3*), using standards of the World Wide Web Consortium (RDF and SPARQL) (4*), and linked to other Linked Open Data (5*) (Berners-Lee 2010). This article has adopted the term Linked Data to refer collectively to Linked Data that is published under an open licence (i.e. Linked Open Data) and that which is not. A distinction has only been maintained when discussing individual case studies where the language of the author(s) has been used, and for resources which include Linked Open Data in their title. For an in-depth introduction to Linked Data, see Gracy 2015 (Fig. 1).

Fig. 1 A simple knowledge graph
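The sliding scale of the 5 Star Scheme summarised above can be sketched as a small scoring function. This is an illustrative reading only: the `Dataset` fields and the `star_rating` function are hypothetical names, and the sketch assumes that each star is awarded only when every lower level has also been met.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist mirroring the five levels of the scheme."""
    on_web: bool                # 1*: available on the web
    machine_readable: bool      # 2*: machine-readable structured data
    non_proprietary: bool       # 3*: non-proprietary format
    uses_w3c_standards: bool    # 4*: W3C standards (RDF and SPARQL)
    linked_to_other_data: bool  # 5*: linked to other Linked Open Data


def star_rating(ds):
    """Count consecutive levels met, stopping at the first unmet level."""
    levels = [
        ds.on_web,
        ds.machine_readable,
        ds.non_proprietary,
        ds.uses_w3c_standards,
        ds.linked_to_other_data,
    ]
    stars = 0
    for met in levels:
        if not met:
            break  # assumption: levels are cumulative
        stars += 1
    return stars


scanned_image = Dataset(True, False, False, False, False)
csv_export = Dataset(True, True, True, False, False)
linked_open_data = Dataset(True, True, True, True, True)
```

Under these assumptions a scanned image published online scores one star, a CSV export scores three, and only a dataset that also uses RDF and links out to other datasets reaches five.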

Archives and Linked Data

The year 2011, five years after Tim Berners-Lee’s first articulation of the Linked Data Principles, is a significant turning point in the emergence of interest in Linked Data within the archives sector. Although the Library of Congress published authority files as RDF triples in 2008 (Summers et al. 2008), it was in 2011 that reports on the potential uses and benefits of Linked Data for Libraries, Archives and Museums first emerged (Isaac et al. 2011; Keller et al. 2011). The same year saw the publication of the first complete datasets of bibliographic Linked Data—the British Library’s British National Bibliography, and Oslo Public Library Linked Data (Godby et al. 2015; Rekkavik 2014)—and early articles reporting on archives-specific Linked Data projects, such as research at the UK National Archives on geo-location referencing (2009–2011) and the LOCAH (Linked Open COPAC and Archives Hub) project (2010–2011) (Clough et al. 2011; Ruddock 2011). One of the first archival instances of a Linked Data dataset derived from the LOCAH Project, which published a subset of archival metadata taken from the Archives Hub, an archival metadata aggregator, together with bibliographic data from Copac (Consortium of Online Public Access Catalogues) (Ruddock 2011). Through identifying a number of challenges that the project team was facing, including issues of sustainability, maintenance of provenance, and lack of standardisation of archival data, the LOCAH project raised awareness of the specific barriers facing archive services in the adoption of Linked Data. On the other hand, it also made clear the numerous benefits which Linked Data could bring—including integration with other data sources, the facilitation of serendipitous search, and the creation of flexible entry points to archival data, thereby increasing the exposure of collections.

Interest has continued to grow over the following decade, as evidenced by the plethora of published case studies now available. These case studies generally fall into one of three categories. Firstly, those which develop Archival Linked Data infrastructure, such as gazetteers (Clough et al. 2011), mark-up schemas (Dobreski et al. 2019; Gartner 2015), tools and workflows (Pattuelli et al. 2013; Stevenson 2012a) and ontologies (Corn and Patrick 2019; Llanes-Padrón and Pastor-Sánchez 2017; Robledano-Arillo et al. 2020). Most notable in this category is the ontology of the International Council on Archives’ latest standard for archival description, Records in Contexts, which has been presented as the Linked Data ontology, RiC-O (Pitti and Stockting 2016). The second category, which is most common, is those which enhance discrete pre-existing collections of archival metadata and publish them as Linked Data, including for the Historical Archive of the European Commission (Damova 2020), the Zeri Photo Archive (Daquino et al. 2017), and the Getulio Vargas Foundation’s Archives (Rademaker et al. 2015).

In the third category are new digital archival resources made available via Linked Data web services and user interfaces. An early example is the Online Digital Archive of Architectural Practice in Post-War Queensland, a digital archive providing a comprehensive online resource of oral history interviews—supplemented with items from related archive and library collections (Hunter et al. 2012). The case study identified a number of benefits of adopting Linked Data in the development of digital archives, providing evidence that benefits can be experienced equally by archive services, their key stakeholders, and users. The Linked Jazz project created a Linked Open Data dataset of over two thousand jazz musicians extracted from oral history interviews (Pattuelli et al. 2013, 2015, 2017). The dataset is available via a network visualisation tool which enables analysis of the professional and social relationships between individual musicians. Linked Jazz presented a convincing argument to support further investment in Archival Linked Data: “As the amount of digital cultural heritage data continues to grow at an exponential rate, there is a call for new strategies and applications to enhance their discovery, interpretation, and use. The application of LOD [Linked Open Data] technology to cultural heritage content holds enormous potential to answer this call” (Pattuelli et al. 2013, pp. 6) (see Fig. 2).

Fig. 2 Linked Jazz project by the Semantic Lab at Pratt Institute: network visualisation tool (CC BY)

Archival scholarship examining Linked Data includes Gracy’s analysis of the opportunities and challenges of implementing Linked Data (2015), and experiments using Linked Data to increase access to library metadata for music resources and moving image archive metadata (Gracy et al. 2013, 2018). By developing crosswalks between datasets, Gracy has also provided tools for the implementation of Linked Data, and made a number of recommendations to overcome the challenges associated with the processes involved (2015). Niu’s scholarly overview of early trends in archival Linked Data practice revealed that the majority of projects up to 2015 had primarily converted existing descriptions. Furthermore, she found that the implementation of Linked Data significantly changes archival description, and argued that many of the methods of accessing Linked Data are beyond the capabilities of what she terms “generic users” of archives (Niu 2016).

The Digital Humanities and Linked Data

As is the case in archival scholarship, the trend in digital humanities Linked Data scholarship has been for individual case studies. Writing in 2012, the authors of a case study using Semantic Web techniques to enrich TEI (Text Encoding Initiative) encoded manuscripts argued that “the benefits of semantic web techniques are currently under-explored in Digital Humanities research”, positioning Linked Data as a solution to data being made available but not interoperable (Jordanous et al. 2012, pp. 44). Since then, case studies have demonstrated how various digital humanities research methods can be combined with Linked Data to support research in a range of disciplines, including Linguistics (Chiarcos et al. 2018; Cimiano et al. 2020), Literary Studies (Egloff and Picca 2020; García et al. 2016), Music (Eyharabide et al. 2019), and Digital History (Balado et al. 2015; Blanke and Riechert 2020; Bruneau et al. 2021; Horne 2020; Hyvönen et al. 2016; Miyakita et al. 2018; Romein et al. 2020; Tamper et al. 2018).

Of particular relevance are the case studies combining digital humanities methods with digital cultural heritage data; as a recent overview has argued: “Linked Data and Semantic Web technologies are becoming increasingly important in creating, publishing, and analysing Cultural Heritage data in Digital Humanities” (Bikakis et al. 2021, pp. 166). Earlier case studies focused on creating a discrete Linked Data dataset and examining the links between data within the dataset, or with a specific external dataset. The Sharing Ancient Wisdoms project, for example, created a digital edition of Greek and Arabic collections of ancient wise sayings using RDF. The project demonstrated the value of incorporating Linked Data into the process of producing online scholarly editions marked up using the TEI XML format, and identified advantages of that approach for digital humanists wishing to create similar digital editions, or to analyse the data in resulting datasets (Jordanous et al. 2012). The Dutch Ships and Sailors project developed and evaluated a method for identifying links between ships in two datasets: a dataset of Dutch maritime events, the “Northern Muster Rolls database”, and the historical newspaper archive of the Dutch National Library (Balado et al. 2015). The project combined machine learning techniques with manual assessment to identify links between the two datasets, proving not only that a considerable number of relevant links between the two datasets could be retrieved, but also that the links provided new opportunities for analysis of the source datasets.

However, as more and more interoperable datasets have become available, an increasing number of studies have focused on the development of tools, user interfaces, and web services to allow for the access of Linked Data from multiple sources without the need for advanced technical expertise (Baierer et al. 2017; Egloff and Picca 2020; Hoekstra et al. 2016; Miyakita et al. 2018; Sztyler et al. 2014; Xia et al. 2018). A key example is DIVE, a Linked Data digital cultural heritage collection browser which uses historical events and narratives as the context for searching, browsing, and providing access to objects from heterogeneous collections (de Boer et al. 2015). A particularly well-known example is WarSampo—Finnish Second World War on the Semantic Web, an initiative which integrated nineteen distributed datasets relating to Finland in the Second World War as Linked Open Data, and created the WarSampo Data for Digital Humanities Linked Open Data service to enable research and the creation of applications related to war history (Hyvönen et al. 2016; Koho et al. 2021). Additionally, the initiative has developed the WarSampo Portal to provide access to those without Linked Data skills, enabling search and browsing based on originally six, and subsequently increased to nine, different “perspectives”, including event, place, person, casualty, and army unit (Hyvönen et al. 2016; Koho et al. 2021). WarSampo has transformed and harmonised previously isolated datasets to form a unified and extensible knowledge graph which enables queries that were hitherto impossible, including presenting biographies for individual soldiers and histories of military units (Koho et al. 2016, 2021) (see Fig. 3).

Fig. 3 WarSampo Finnish Second World War on the Semantic Web: faceted search interface (CC BY)

Indeed, Finland has become a key centre in developing and investing in the Linked Open Data infrastructure required to provide access to digital cultural heritage data for the purposes of the Digital Humanities. To date this has included the development of an ontology service, Linked Data publishing platform, and various cultural heritage Semantic Web portals (Hyvönen 2020a). These portals, including BiographySampo, BookSampo, CultureSampo, NameSampo, TravelSampo and the above-mentioned multi-award winning WarSampo, have been accessed by millions of end-users, including specialist digital humanities scholars, genealogists and the general public (Hyvönen 2020a; Hyvönen et al. 2019; Koho et al. 2016, 2021). The Sampo portals, together with a number of other case studies, clearly demonstrate that investment in Linked Data tools and services for the Digital Humanities benefits users in general, and aids the development of cultural heritage applications more widely (Blanke and Riechert 2020; de Boer et al. 2015; Hyvönen 2020a; b; Hyvönen et al. 2016; Koho et al. 2021; Oldman et al. 2016; Tamper et al. 2018).

The benefits of Archival Linked Data for Digital Humanities scholarship

Whereas digital humanities scholarship demonstrates an interest in, and awareness of, the value of Archival Linked Data for the Digital Humanities, there has been little similar consideration of its value, and its benefits, within archival scholarship—this despite the fact that, at least in one instance, inspiration for the project came directly from the Digital Humanities to the archive sector. In a study developing a Linked Data informed ontology for the description of Spanish Civil War photographic archives, the authors acknowledged that the context of their research was partially informed by “the transformations in the organization of documentary collections spurred by the digital humanities” (Robledano-Arillo et al. 2019, pp. 67). However, the aim of the study itself was directed to supporting historians using traditional research methods in accessing photographs for iconographic research. Beyond this, there are some general suggestions that in processing large datasets, revealing relationships across heterogeneous sources, and enabling access to unstructured data, Archival Linked Data offers opportunities for digital humanities research (Bones 2019; Pattuelli et al. 2017). However, what exactly these opportunities and benefits are remains under-examined.

Notwithstanding the lack of explicit investigation of the benefits of Archival Linked Data for the Digital Humanities, it is evident that many of the benefits of Linked Data to archive services, their collections, and users which have been articulated from within the archival sector are especially advantageous to the Digital Humanities. Archival Linked Data meets the needs of digital humanists in that it provides interoperable, reusable, and integrated data. However, in making archival data machine-readable, connecting disparate and cross-disciplinary datasets, and revealing previously unknown relationships across collections, the benefits of Archival Linked Data for digital humanities scholarship extend much further, and deserve to be much more clearly articulated across the disciplines.

Within archival scholarship, the literature contains multiple examples of Archival Linked Data’s relevance and value to the Digital Humanities community. Improvements in knowledge discovery, information search and retrieval greatly benefit digital humanists through improved effectiveness, efficiency, and precision (Llanes-Padrón and Pastor-Sánchez 2017; McKenna et al. 2018; Rademaker et al. 2015; Robledano-Arillo et al. 2019). More complex queries can be accommodated through the use of SPARQL. With a single SPARQL query, searches can be made across multiple collections, and navigation across cultural heritage and non-cultural heritage sources of data is made possible (Gracy 2015; Gracy et al. 2013; McKenna et al. 2018). Linked Data makes it easier to reuse, align and enrich archival data (Pitti et al. 2016), and integrate it with data derived from other sources (Clough et al. 2011; Gracy 2015; Hunter et al. 2012; Niu 2016; Pattuelli et al. 2017). In this way, the data cleaning and harmonisation processes which occupy a significant part of any digital humanities research that brings together data from multiple sources are made redundant. However, it must be noted that this is in large part because these processes, which constitute one of the most time-consuming aspects of producing Linked Data, have already been undertaken during its production (Davis 2019). Linked Data pushes archival data closer to the individual user, allowing them to more efficiently access data for their specific applications (Debruyne et al. 2016; Rademaker et al. 2015), and to navigate seamlessly between datasets (Baierer et al. 2017; Clough et al. 2011; McKenna et al. 2018). By accommodating multiple search methods, Linked Data can meet the needs of digital humanists with a range of technical abilities and research interests. These methods include SPARQL endpoints, entity, semantic concept or keyword search, browsing and exploration, and serendipitous search.
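The idea of one query reaching across several collections can be sketched without any RDF tooling. The example below is not SPARQL: it is a plain-Python imitation of a single triple pattern, in which `None` plays the role of a query variable. The two collections, the `example.org` URIs, and the `match` helper are hypothetical; the only point illustrated is that a shared URI for the same person lets one pattern retrieve items from both sources.

```python
# Hypothetical identifiers under example.org; none come from a real dataset.
PERSON = "http://example.org/person/jane-doe"
MENTIONS = "http://example.org/vocab/mentions"
TITLE = "http://example.org/vocab/title"

# Two separately published collections that both reuse the PERSON URI.
archive_a = {
    ("http://example.org/a/letter-7", MENTIONS, PERSON),
    ("http://example.org/a/letter-7", TITLE, "Letter, 1921"),
}
archive_b = {
    ("http://example.org/b/photo-3", MENTIONS, PERSON),
    ("http://example.org/b/photo-9", MENTIONS,
     "http://example.org/person/john-roe"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern, sorted; None is a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Merging is a plain set union because both collections use the same model.
merged = archive_a | archive_b

# One pattern, roughly "?item mentions PERSON", answered over both sources.
hits = [s for s, _, _ in match(merged, p=MENTIONS, o=PERSON)]
```

A real deployment would express the same pattern in SPARQL against one or more endpoints; the sketch only shows why shared identifiers make the cross-collection search possible.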

As a machine-readable format, Linked Data is capable of supporting automatic reasoning and analysis of semantic data, querying large volumes of data, and offering new methods for discovery, engagement, interpretation, and use. Web-based user interfaces often incorporate tools which facilitate new methods of engagement and analysis, and “allow the use of archival data in ways that a few years ago were unimaginable or prohibitively difficult to do for both social and technological reasons” (Pitti et al. 2016, pp. 176). Adopting Linked Data creates a digital research environment ideal for digital humanists, opening up archival data to natively digital methods, and supporting dynamic research methods (Gartner 2015; McKenna et al. 2018). Furthermore, Linked Data extends the depth and breadth of archival analysis possible as it can be interrogated through, for example, graphical interfaces, data or text mining, data clustering, information visualisation, network analysis, natural language processing (NLP), and named entity recognition (NER), many of which digital methods are commonly employed in the Digital Humanities.

The process of integrating archival datasets and making them interoperable increases the quality of the data made available for the Digital Humanities as it is necessary to adhere to cross-domain standards and links are created across sources of data. Improvements in quality are manifold: archival data is richer, more expressive and granular, and semantically enriched (Browell 2016; Gartner 2015; Gracy et al. 2013; Niu 2016; Pattuelli et al. 2013; Pitti et al. 2016). The integration and harmonisation of heterogeneous sources of data facilitates automatic reasoning, providing the ability to generate new knowledge and infer additional implicit facts from those explicitly documented (Damova 2020; Tillman 2016). Archival data is also enriched through incorporating knowledge derived from both users and experts. Users can add their own knowledge to archival description through the incorporation of user-generated description into Linked Data datasets with activities such as semantic tagging and other user annotations (Gracy 2015; Niu 2016). Collection descriptions are enhanced through the provision of additional archival data, such as further descriptive, contextual, and authority data which might be internal administrative data not previously publicly available, or data drawn from external sources (Gartner 2015; Gracy 2015, 2018). Data is, therefore, better contextualised by it not being presented in isolation, enabling digital humanists to more easily gain a fuller understanding of archival data.
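The inference of implicit facts from explicit ones, mentioned above, can be illustrated with a deliberately small sketch. It assumes a hypothetical `ex:partOf` relationship that is treated as transitive (an item is part of a series, the series is part of a fonds, so the item is part of the fonds). The identifiers and the `infer_transitive` function are illustrative only; real reasoners work from rules declared in an ontology rather than hard-coded logic.

```python
def infer_transitive(triples, predicate):
    """Return the triples plus every fact implied by treating `predicate`
    as transitive: (a p b) and (b p c) together imply (a p c)."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        links = [(s, o) for s, p, o in inferred if p == predicate]
        for a, b in links:
            for c, d in links:
                if b == c and (a, predicate, d) not in inferred:
                    inferred.add((a, predicate, d))
                    changed = True
    return inferred


# Hypothetical archival hierarchy using an illustrative "ex:" prefix.
explicit = {
    ("ex:item1", "ex:partOf", "ex:series1"),
    ("ex:series1", "ex:partOf", "ex:fonds1"),
    ("ex:item1", "ex:title", "Minute book"),
}

closed = infer_transitive(explicit, "ex:partOf")
new_facts = closed - explicit
```

Here the only new fact is that `ex:item1` is part of `ex:fonds1`, which was never stated explicitly in the input.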

Current barriers to Digital Humanities scholarship

The challenges of producing and publishing Archival Linked Data are relatively well documented, and include technological issues, a prevalence of unstructured and not easily disambiguated data, a lack of financial and skilled human resources, low levels of awareness of Linked Data within the profession, and an absence of an Archival Linked Data infrastructure, including tools, standards and best practice (Gracy 2018; Hyvönen 2012; McKenna et al. 2018; Smith-Yoshimura 2018). Challenges for which both archival and digital humanities scholars and practitioners share an impetus in finding a solution include: balancing the provision of access to open data with the maintenance of Intellectual Property Rights (Gartner 2015; McKenna et al. 2018), preventing the decontextualisation and loss of nuance of archives (Gartner 2015; Gracy 2018), and providing access without complicating the user search process (Niu 2016). As the growing body of case studies has demonstrated, many challenges can be overcome with the maturation of Archival Linked Data practices. Moreover, recent large-scale projects have begun to deliver some of the fundamental infrastructure required for engagement with the production of Archival Linked Data en masse. These projects include the development of an international archives sector Linked Data ontology, RiC-O, already referred to above, and, as part of the UK Towards a National Collection project, investigation of the infrastructure required for the creation and maintenance of Persistent Identifiers (PIDs) in the UK heritage sector, and exploration of Linked Open Data, knowledge graphs, and AI as a means of making connections between online representations of heritage objects and information about them.[1]

It is clear that Archival Linked Data provides valuable advantages for digital humanists wishing to interrogate digitised archives. However, Linked Data is far from a panacea to the challenges of accessing the contents of digitised archives, and their associated metadata, as data. To date, Archival Linked Data is a minor player on the Linked Data scene. Furthermore, even though more is being contributed year on year, it is still a statistically insignificant proportion of existing digital archival data globally. In addition to this limited availability of archival data as Linked Data, a key hindrance to the success of Archival Linked Data for the purposes of digital humanities scholarship is the current lack of relevant archival and non-archival Linked Data datasets with which archival data can be interconnected and augmented.

The Linked Open Data Cloud provides a visualisation of all datasets currently available as Linked Open Data (i.e. as RDF, linked via URIs to other datasets, and with open access to the entire dataset) and the links between them. Composed, as at May 2021, of 1,301 datasets, it is dominated by datasets from the life sciences, linguistics and governmental data. At the centre of the Cloud are DBpedia[2] and Wikidata,[3] two crowd-sourced Linked Data datasets created as part of Wikimedia Foundation projects, and GeoNames, a multilingual geographical database.[4] Archival and bibliographic datasets are mostly included in the Publications Subdomain, where key datasets include VIAF (the Virtual International Authority File),[5] which links to twenty-eight other datasets, the Library of Congress Subject Headings, which is linked to twenty-seven datasets,[6] and the European digital cultural heritage metadata aggregator Europeana, which links to six.[7] A number of Archival Linked Data datasets are included, such as the National Digital Data Archive of Hungary,[8] the Bibliothèque nationale de France,[9] and the 20th Century Press Archives.[10] The first two of these datasets link to both DBpedia and VIAF, among others, while the last links to DBpedia and GeoNames. Such Archival Linked Data datasets confirm the possibility of building meaningful connections between Archival Linked Data and other Linked Data datasets, thus providing impetus for the addition of further Archival Linked Data to the Linked Open Data Cloud.

However, available Linked Data datasets present a number of challenges when identifying relevant datasets which correspond to, and can enrich, digitised archives, partially as a result of the fragmented transnational and corporate infrastructures affecting the scope of datasets and their recorded information. One dataset which illustrates many of these issues is GeoNames, a commonly used digital gazetteer in both archival and digital humanities Linked Data activity requiring historic geographic data. GeoNames integrates global geographic data and makes it available as Linked Data, including place names in multiple languages, latitude and longitude coordinates, and current population statistics. Although this is undoubtedly useful information for, say, the mapping of historic data, it provides little contextual information of relevance to historic places and spaces. Similarly, trying to integrate person-centred historic data is also not straightforward. VIAF is limited to information about published authors; Wikimedia Foundation projects are governed by a notability test which requires people to be considered “worthy of notice”, “significant”, or “remarkable” to warrant inclusion (Wikipedia 2021). Such requirements led to pages created to document enslaved people owned by Martha and George Washington at Dogue Run Farm not being published by Wikipedia due to an absence of relevant published source material (Herbert and Parilla 2021). It is clear, therefore, that for the majority of individuals documented in digitised historical archives (who may be peasants or workers, prisoners or wives and mothers), little or no related content is available as Linked Data. As well as resulting in links only being possible between a limited number of individuals, the narrow scope of these key Linked Data datasets risks prolonging the privileging of a minority of “notable” individuals captured in the now digitised historic record, further marginalising historically excluded communities.

Fortunately, relevant digital data is available. The collections of digitised archives made accessible as a result of mass digitisation projects provide almost limitless potential for the Digital Humanities. Often driven by the needs of family historians for sources of names, census records, vital records (birth, death, marriage), title deeds, wills, and many other record types have been digitised and made publicly available in the form of digital images, and/or data, created through indexing and the use of Optical Character Recognition (OCR). The inadequacy of these approaches to digitisation for the purposes of the Digital Humanities, however, is captured by Sonia Ranade, who, reflecting on the digitisation work of the UK National Archives, states:

In early digitisation projects, our emphasis (and that of most other archival institutions) was facilitation of remote access, creating digitised assets with the assumption that these will be consumed in much the same way that an on-site researcher reads a paper file. In effect, our traditional digitisation activities deliver a “picture” of the page, alongside some limited indexing. We have found that digitised resources created in this way do not readily lend themselves to computational analysis and we are beginning to realise that our established approach to digitisation does little to facilitate innovative use of the records.

(Ranade 2016, pp. 3264).

Furthermore, increased commercialisation of the digitisation of archives has left much of this vast resource of potential data trapped behind the paywalls of academic publishers and commercial genealogy companies. Such commercialisation has not only dictated the types of record prioritised for digitisation, it has also influenced how the resulting data is formatted. For example, trying to meet the demand of genealogists has led to the publication of data in forms which enable search for individuals, rather than the examination of total populations (Morris 2017).

Through datafication, the potential of this largely untapped resource could be released, opening up digitised archives to digital methods of interaction, interrogation, and analysis. A number of large-scale projects have demonstrated the value of datafying digitised archives to form large databases combining digitised archives from multiple sources, with notable examples from the UK including University College London’s Legacies of British Slave-ownership project, which created the online Encyclopaedia of British Slave-ownership,Footnote 11 and People of Medieval Scotland, which has created a database of named individuals drawn from over 8,600 documents covering the period 1093–1371.Footnote 12 One key example is the Digital Panopticon, a cross-institutional and international digital humanities collaboration, which has created a database relating to people sentenced at the London Old Bailey between 1780 and 1875.Footnote 13 The project brought together genealogical, biometric, and criminal justice datasets held by multiple organisations based in the UK and Australia. The project also created new digital archival datasets by digitising and datafying further historic records, manually combining and harmonising digitised archives from multiple sources. The problem with such databases is that they are neither open nor connected to external data: they are commonly structured as relational databases and provide only mediated access via a bespoke search tool, often developed specifically for the project. Frequently, such databases do not provide access to the underlying data, limiting its exploitation to within the parameters defined by the original project. Furthermore, when made available for download, considerable data-cleaning and harmonisation is required to enable integration with other datasets. The addition of such a wealth of data to the Linked Open Data Cloud would be a major boon far beyond digital humanists.
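
The data-cleaning and harmonisation step mentioned above can be sketched as follows. The two source layouts, the field names and the date formats are invented for illustration and do not reflect the actual schemas of the Digital Panopticon or any other project; the point is only that records from different sources have to be mapped to a shared structure before they can be integrated or converted to Linked Data.

```python
from datetime import datetime

# Hypothetical records exported from two differently structured sources.
source_a = [{"Surname": "SMITH", "Forename": "Ann", "TrialDate": "14/09/1820"}]
source_b = [{"name": "smith, ann ", "date_of_trial": "1820-09-14"}]


def normalise_name(surname, forename):
    """Trim whitespace and apply one capitalisation style: 'Surname, Forename'."""
    return f"{surname.strip().title()}, {forename.strip().title()}"


def harmonise_a(rec):
    """Map a source A record (separate name fields, dd/mm/yyyy dates)."""
    date = datetime.strptime(rec["TrialDate"], "%d/%m/%Y").date()
    return {
        "name": normalise_name(rec["Surname"], rec["Forename"]),
        "trial_date": date.isoformat(),
        "source": "A",
    }


def harmonise_b(rec):
    """Map a source B record (single 'surname, forename' field, ISO dates)."""
    surname, forename = rec["name"].split(",", 1)
    date = datetime.strptime(rec["date_of_trial"], "%Y-%m-%d").date()
    return {
        "name": normalise_name(surname, forename),
        "trial_date": date.isoformat(),
        "source": "B",
    }


harmonised = [harmonise_a(r) for r in source_a] + [harmonise_b(r) for r in source_b]


def candidate_matches(records):
    """Group records sharing a name and trial date across sources."""
    groups = {}
    for rec in records:
        key = (rec["name"], rec["trial_date"])
        groups.setdefault(key, []).append(rec["source"])
    # Keep only the groups that contain more than one record.
    return {key: sources for key, sources in groups.items() if len(sources) > 1}


print(candidate_matches(harmonised))
```

Exact matching on a normalised name and date is deliberately naive. Historic records contain spelling variants, aliases and transcription errors, which is one reason why the harmonisation described above involved manual work and cannot be assumed to reduce to a few mapping functions.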

From this perspective, then, there are a number of obstacles to overcome, and there currently seems to be little viable alternative to the commercial model of mass digitisation. Furthermore, Thomas Padilla points to the role that cultural heritage organisations play in “enclosing” their own data through commercial partnerships, questioning “Is it worth boosting the cost of admission to an enclosed garden that weakens the library community and inhibits emerging forms of research”? (2018, pp. 297). One way forward might be individual projects negotiating free access with commercial providers to subsets of their data, as was the case of the Digital Panopticon which negotiated free access to aspects of the key genealogical data held by their commercial providers, Findmypast and Ancestry. Indeed, the impact of access restrictions to digitised archives on Archival Linked Data activity can be seen from the Irish Record Linkage 1864–1913 project which used digitised archives and data provided by the General Register Office “under strict terms and conditions” which limited access only to the project team (Debruyne et al. 2016, pp. 160). While such a compromise is far from the ideal solution—in the case of the Digital Panopticon, subscriptions are still required for full access to the transcriptions and digitised images from which the data has been extracted—the partnership model of digitisation funding, from commercial partners, internal funding, and grant funding, has become a relatively common one.

Future directions of Archival Linked Data?

Engagement with Linked Data has come a long way over the past ten years in both archival and digital humanities scholarship and practice. The previous review has indicated how, to date, Archival Linked Data case studies have focused primarily on increasing access to archival metadata or, more rarely, digitised archives, and a number of barriers remain to making a significant proportion of these materials available as Linked Data. Digital humanities case studies have focused on increasing the accessibility of Archival Linked Data and the development of tools for access and analysis. However, those currently available are yet to accommodate the wide range of digital humanities research topics and methods required (Hyvönen 2020b). Resolving the challenges of making digitised archives both available and accessible as Linked Data is complex. It not only requires technological developments and mass investment in the production of Linked Data by multiple sectors, but also a change in attitude towards data being open and freely available (within legal limits), in addition to an infrastructure that enables this. The FAIR Guiding Principles can help the cultural heritage domain move in this direction and are highly compatible with Linked Data. Guidelines for the application of these principles to the digital objects and metadata held by libraries, archives and museums have been proposed, including the use of PIDs, standardised metadata formats, and external-links (Koster and Woutersen-Windhouwer 2018).

Significantly, digitised archives are only one category of the digital archives currently available. The exponential growth of the creation of born-digital records since the 1990s is set to vastly increase the volume of born-digital archives. The huge volume of digital archives requires archivists to no longer consider archives as texts to be read but as collections of data, for which new methods and tools are required to undertake a reformed approach to appraisal (Moss et al. 2018). While progress has been made in the preservation and provision of access to born-digital archives it has not been rapid enough: “born-digital archives are endangered archives, and we urgently need to preserve these collections, make them available and produce new knowledge… It is astonishing that email and born-digital archives are still treated as a new thing that few archivists really understand” (Jaillant 2019, pp. 300). Moreover, traditional search methods based on free-form natural language search are ineffective in exploring large born-digital collections (Winters and Prescott 2019). The potential of Linked Data as a means of making born-digital archives available as collections of data, accessible to multiple users and research methods, warrants further investigation. Lucy McKenna, in a paper presented at the AURA Workshop, suggested the benefits of Linked Data for born-digital archives included improved knowledge discovery, seamless navigation, and increased awareness of available resources in digital archives (2020). Whereas it is likely that many of the practices for the creation and provision of access to Archival Linked Data originating from digitised archives will be transferable to the context of born-digital archives, there will inevitably be some areas of difference and additional challenges. 
Linked Open Data is preferable for the purposes of digital humanists, yet the format is not always appropriate for archives where restrictions relating to sensitivity, confidentiality, data protection, and Intellectual Property Rights are not uncommon. Such restrictions are especially prevalent in born-digital archives and have resulted in a vast amount of born-digital archives currently being preserved but not made available for access. Tools to combat these challenges are beginning to emerge, particularly for use with email collections (Jaillant 2019; Ries and Palkó 2019; Schneider et al. 2019). For example, ePADD, an open-source software developed by Stanford Libraries’ Department of Special Collections and University Archives, incorporates machine learning and natural language processing to screen email collections for sensitive and legally restricted information, prepare emails for preservation, and make them discoverable and accessible to users (Schneider et al. 2019). As a recent editorial argued, while it is essential that the preservation of born-digital records is accelerated, “we also need to push for access to these archives through lobbying for open data respectful of privacy” (Jaillant 2019, pp. 301). One approach to this has been demonstrated by the 20th Century Press Archives which donated a large Linked Data dataset to Wikidata in 2019. By publishing only the metadata created by the 20th Century Press Archives under an open licence (CC0), the organisation found a way of contributing data regarding archives with access restrictions to the Linked Open Data cloud (providing a controlled level of access to their contents) while continuing to uphold the intellectual property rights to the original material (Neubert 2019). Linked Data, both that which is open and “closed” (i.e. made available with access and/or use restrictions), offers an as yet under-explored potential to be part of the solution to the challenge of providing some level of access to born-digital archives while maintaining certain restrictions.
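
The approach of publishing descriptive metadata openly while withholding restricted material can be illustrated with a short sketch. The record structure, the field names and the choice of which fields count as open are assumptions made for the example; they are not the 20th Century Press Archives’ actual data model or workflow.

```python
# Fields assumed, for illustration only, to be safe to publish openly.
OPEN_FIELDS = {"identifier", "title", "date_range", "subject"}

# URL of the Creative Commons CC0 1.0 public domain dedication.
CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"


def split_record(record):
    """Separate openly publishable metadata from restricted fields."""
    open_part = {k: v for k, v in record.items() if k in OPEN_FIELDS}
    open_part["licence"] = CC0  # the licence applies to the metadata only
    restricted = {k: v for k, v in record.items() if k not in OPEN_FIELDS}
    return open_part, restricted


# Hypothetical catalogue record for a folder of in-copyright material.
record = {
    "identifier": "https://archive.example.org/folder/42",
    "title": "Press clippings on an example company",
    "date_range": "1920/1949",
    "subject": "Example company",
    "page_images": ["0001.jpg", "0002.jpg"],
    "access_note": "In copyright; on-site access only",
}

open_metadata, restricted = split_record(record)
print(open_metadata)
print(sorted(restricted))
```

Using an explicit allow-list means that any field not deliberately marked as open stays restricted by default, which seems the safer choice where sensitivity or rights status is uncertain.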

Despite the scale of these challenges, the expertise of archivists and other cultural heritage professionals has an important role to play in overcoming the issues currently preventing the widespread adoption of Linked Data in the archives sector (McKenna et al. 2018). Even while the necessary infrastructure remains absent for the production, provision of access to, maintenance, and preservation of Archival Linked Data, archivists can make a significant contribution at an institutional or individual level. By increasing their own knowledge of Linked Data, and improving institutional awareness and preparedness for Archival Linked Data, archive services will be well placed to embrace Archival Linked Data when it becomes a more feasible prospect. Investing in data cleaning and standardisation, and exploring low-barrier means of contributing archival data as Linked Data, such as Wikidata, are achievable first steps for many archive services which can have a considerable impact on increasing access to high-quality archival data and ensuring its future interoperability.

The adoption of Linked Data in the archives sector is still a relatively new phenomenon and is open to expansion and development in multiple directions. In the library sector, a number of reports have been published in recent years concerning and advocating the use of Wikipedia and Wikidata (Association of Research Libraries 2019; International Federation of Library Associations and Institutions 2017; Program for Cooperative Cataloguing 2018). The library sector’s increasing engagement with Wikimedia Foundation projects may presage a future development of Archival Linked Data infrastructure. In providing a low-barrier entry to the Linked Data environment capable of scaling-up Linked Data adoption, Wikidata may prove a key tool in overcoming some of the outstanding infrastructural challenges. Cultural heritage institutions which have drawn in authority data from Wikidata to their datasets and reciprocally contributed their data to Wikimedia Foundation projects include the National Galleries of Scotland (Gould 2018), Museum of Modern Art (MOMA) (Romeo 2016), the Library of Congress (Thornton 2017), and WorldCat (Proffitt 2020). The advantages of making cultural heritage data freely accessible via Wikidata are clearly demonstrated by the Smithsonian’s experience: the number of views of an identical image is on average 1000 times higher on Wikipedia than on the Smithsonian’s own web pages (Kapsalis 2019).

The increasing attention given to AI in both the Digital Humanities and archival scholarship (as elsewhere) suggests a further future direction of Archival Linked Data development. In both disciplines, engagement with AI to support scholarship and practice is in its infancy. While a detailed examination of current research falls beyond the scope of this article we can note that machine learning techniques have already been explored in a number of Archival Linked Data case-studies, including natural language processing, and its subfield, named entity recognition (Clough et al. 2011; Damova 2020; Pattuelli et al. 2015; Rademaker et al. 2015). It can be supposed that, in automating processes, and making the production of Archival Linked Data less resource intensive, other sub-fields of AI, such as deep learning, computer vision, and neural networks, also offer much potential for the archives sector to scale up the production of Linked Data and innovate how it is accessed. Furthermore, AI has been suggested to offer a potential solution to the challenge of providing controlled access to unstructured data originating from both catalogues and digitised archives (Schreur 2020), and to enable the development of analytical tools meeting the particular and multifaceted requirements of digital humanists requiring access to digital cultural heritage data (Hyvönen 2020b). Drawing on Leopoldina Fortunati’s prediction of the increasing “robotization” of the domestic sphere due to the rapid acceptance and incorporation of automated and/or robotic systems into daily life (2018), it is as yet unclear to what extent the datafication of archives will evolve into the robotisation of archives with the increased incorporation of AI into archival processes.

AI cannot, however, be seen as a silver bullet for reducing the resource intensiveness of the production of Archival Linked Data; there will always be aspects of the production workflow which will not be scalable, and human intervention will remain essential. Regardless of how AI is incorporated into the workflow for the production or provision of access to Archival Linked Data, it is essential that any engagement is undertaken critically, with attention directed not only towards its utilisation, but also to its limitations, biases, and ethical implications. Such a conversation is beginning, for example in Jenny Bunn’s exploration of engaging with explainable artificial intelligence (XAI) from a recordkeeping perspective (Bunn 2020), and in the examination of four Australian AI case studies to assess the implications of AI for archives (Rolan et al. 2019). Much remains to be addressed, however, and the recent flourishing of archival research networks investigating AI, such as the AEOLIAN Network (Artificial Intelligence for Cultural Organisations), AURA Network (Archives in the UK/Republic of Ireland and AI), and HAIRA (the Hub for Artificial Intelligence Research in Archives), hints at future progress.

Moss et al. have argued that the move to the digital environment necessitates archive services working collaboratively with other organisations, suggesting web archives, newspaper archives, and data archives as key collaborators (2018). Collaboration is vital for the successful scaling up of the production of Archival Linked Data and the development of tools to make it accessible and interrogable to digital humanists and other users. Furthermore, increasing collaboration across cultural heritage, and with other professional sectors and academic disciplines, has been widely cited as a key benefit of adopting Linked Data in the archives sector (Llanes-Padrón and Pastor-Sánchez 2017; McKenna et al. 2018; Pitti et al. 2016; vander Sande et al. 2018), with some pointing to increased collaboration as a means of increasing the sustainability of archival Linked Data projects (Martins et al. 2021). Despite evidence of cross-institutional collaboration within Archival Linked Data activity, such collaboration has to date mainly been confined to partnerships between cultural heritage institutions and Higher Education Institutes (Smith-Yoshimura 2018). Assessing the success of the Linking Lives project in 2012, Jane Stevenson of the UK Higher Education Archives Hub concluded by pondering: “Maybe we’ve reached the point in the Linked Data story where we need to focus more strongly on how it will answer the requirements of researchers…Surely we need a more collaborative approach that draws in the technical people, the information professionals and the researchers” (2012b). Even though the value and necessity of engaging with users of archives when developing and introducing new technologies was made clear by the LEADERS project, which centralised users in its early exploration of archival engagement with the World Wide Web (Sexton et al. 2004), it has yet to become commonplace in Archival Linked Data activity.

The Digital Humanities is an inherently interdisciplinary and collaborative field that is information and data-driven, and requires access to large-scale digital datasets. Digital humanists use a range of digital methods, and have a demonstrable interest in, and impetus for, making archival data more digitally accessible. On the other hand, Hannah Lee provided a convincing argument for how archivists, and others in Library and Information Sciences, could apply their skills and methods to support digital humanities research by better connecting information to its users (2017). There is, therefore, a natural synergy between digital humanists and archival scholars and practitioners which should be exploited to increase the production of, and advocate for investment in, Archival Linked Data. Whereas collaboration between digital humanists and archives professionals to increase the availability of relevant Linked Data datasets would be mutually beneficial in increasing access to digitised archives, cross-sectoral and inter-disciplinary collaboration is also essential to support research on born-digital collections “in order to enable GLAM institutions, institutional networks and infrastructures to develop their born-digital collections in meaningful ways, improve preservation formats, curation workflows, repositories, services and access for researchers” (Ries and Palkó 2019, pp. 4). Calls for such a collaboration are not new—writing in 2013 Matthew Kirschenbaum outlined how archivists and digital humanists need each other: “Digital archivists need digital humanities researchers and subject experts to use born-digital collections…Digital humanists need the long-term perspective on data that archivists have…Digital archivists and digital humanists need common and interoperable digital tools…Digital humanists need the collections expertise of digital archivists…Digital archivists need cyberinfrastructure” (2013, para 38). 
The potential outcomes of such a collaboration are demonstrated by the work of the Shanghai Library which has undertaken a number of digital humanities projects using Linked Data, resulting in the creation of large-scale datasets drawing together archival data from multiple sources. The creation of the Chinese Genealogy Knowledge Service Platform (or Jiapu) is one example which demonstrates how the application of digital humanities methods to digitised archives, in combination with Linked Data, can bring about transformative benefits to users more widely and open up digitised archives to multiple uses: “The Jiapu platform has transformed the ways of accessing genealogy information from providing scanned images to expanding users’ ability to dig deep…The user group has expanded to include scholars for research, the public to find their ancestors, and institutions and developers to create applications by using the linked data from the platform” (Xia and Bao 2020, pp. 81).


Mass investment in digitisation over the past two decades has created a vast resource of digitised archives of varying levels of accessibility. Those which have been subjected to further datafication are more fully accessible for utilisation by digital humanists than those made available only as digital surrogates, and are ripe for further exploitation using digital methods. Linked Data provides a viable means of making digitised and born-digital archives more accessible, producing integrated, enriched, and interoperable large-scale archival datasets available for reuse in multiple ways. To date, the publication of archival metadata as Linked Data has been the key concern of Archival Linked Data activity. However, it is the contents of born-digital and digitised records which offer the greatest potential to researchers, making it increasingly necessary for archive services to datafy their collections. A shared appreciation of and desire for the provision of access to archives in digital forms and as data suggests that collaboration with digital humanists would be especially fruitful to both the Digital Humanities and the archives sector. Digital humanists could prove a key ally in advocating for Archival Linked Data, demonstrating how it benefits the users of archives, and ensuring that it continues to do so.

The infrastructure required to scale up Archival Linked Data, including standards, best practice, and tools for the creation, maintenance, preservation, and provision of access, is beginning to emerge. Low-barrier, non-sector-specific tools, such as Wikidata, and the use of machine learning and other AI techniques to further automate aspects of the production of Archival Linked Data are on the horizon. Such developments are essential to move beyond individual case studies, expand the sources of archival data being published as Archival Linked Data, and move its production further beyond research institutions in the global north. In the meantime, there are steps that archivists can take on an individual level to prepare themselves and their institutions for engagement with Linked Data when it is more feasible to do so and for progressing their data further up the 5* Linked Open Data Scheme.

The digital environment has forced archival scholars and practitioners to adopt new attitudes to records and archives, viewing them as data, as well as information and records, and to develop a new suite of tools and approaches to process, preserve, and provide access. However, Archival Linked Data requires the development of a new set of technical skills, taking archivists beyond the traditional archival toolbox. It is essential that archivists involve themselves in the production of Archival Linked Data, tools and web services, and engage in scholarly discussion about Linked Data and the Semantic Web, to ensure that the distinguishing characteristics of “archives” are understood and retained. Not only do archivists have unique knowledge of their collections and working practices; archival scholars and practitioners also bring their own set of professional concerns and priorities to the Linked Data environment, ranging from rights management, and the documentation of provenance(s) and other necessary contextual and administrative data, to issues regarding the decolonisation of archival description and descriptive practices, and the long-term preservation of Archival Linked Data. Seth van Hooland and Ruben Verborgh see the growth of archival Linked Data as providing opportunity: “Technology has profoundly influenced the humanities, but we can also leave our marks on technology. Let us hope that the application of linked data principles in our cultural heritage institutions invites people from the humanities and the sciences to pick up the gloves, step into the intellectual boxing ring and engage in a challenging sparring round” (2014, pp. 249). Concerningly, however, Dominic Oldman et al. argue that the lack of context, misrepresentation and ambiguity of historical information published as Linked Data “can be explained, in part, by the lack of engagement or involvement of domain experts themselves in the digital representation of their data, and their lack of knowledge about the possibilities of Semantic technologies, ultimately resulting in the dominance of the technologist at the so-called intersection of digital humanities” (2016, pp. 256). Without the increased involvement and advocacy of archival scholars and practitioners, their professional concerns may continue to be overlooked in the drive to increase access to and datafy digital archives. Their absence is already being felt, diminishing the usability, trustworthiness, and sustainability of the Archival Linked Data datasets and web services produced, and limiting their potential for utilisation by digital humanists and other users.