1 Introduction

In the last few decades, the “digital turn” has drastically changed the way in which at least the Western world interacts with the cultural documents of its past. As more and more memory institutions (such as libraries and archives) have begun to digitize the cultural heritage documents in their collections, researchers (as well as the general public) interact less with the original artifacts directly, and more indirectly with their readily accessible digital representations. Since libraries and archives essentially function as meeting places for researchers from any disciplinary background, and the documents in their collections are consulted for a myriad of different purposes, only a fraction of their patrons will have a nuanced understanding of how the digitization of their source materials was performed, and how this largely invisible process may impact their research. Building on Isto Huvila’s definition of paradata in his recent publication “Improving the Usefulness of Research Data with Better Paradata,” such “contextual knowledge about how data [in this case: the digital reproduction] was created and how it has been curated and used” is called paradata (Huvila, 2022, p. 28). Precisely which types of data and workflows are useful in this context will be discussed in more detail below. At this point, however, we would do well to remember that the availability of paradata (in general) is especially “critical when (re)users come from diverse disciplinary backgrounds and lack a shared tacit understanding of the priorities and usual practices of obtaining and processing data” (Huvila, 2022, p. 30; paraphrasing Doran et al., 2019). This is often the case when memory institutions share their digitized documents with their patrons. In the absence of paradata, those patrons have no choice but to take the digital representations at face value, unless they meticulously check each digital image against its original—thereby largely defeating the purpose of the reproduction. The lesson here is not so much that we should distrust the library or archive and the materials they make available, but rather that we need more data on which to build this trust, if we want to reuse those materials in a research setting. As I will argue in this chapter, this issue is especially relevant in the fields of textual scholarship and its more practice-oriented sister discipline “(digital) scholarly editing,” where researchers develop what are supposed to be critically informed, authoritative editions of cultural heritage documents—although they rarely justify the relation between the original physical documents and their digital reproductions when doing so.

By taking a closer look, from a researcher’s perspective, at workflows for digitizing cultural heritage documents across a selection of memory institutions, this chapter will consider questions such as: what kinds of problems can occur while photographing cultural heritage documents, and how may these problems influence the researcher’s results? What kind of paradata would be useful when we try to contextualize these types of digitization processes? And what kind of paradata would we need to provide when we reuse those digitized cultural heritage documents to develop our own digital scholarly editions? As such, the chapter makes a key contribution toward facilitating paradata provision in this area in the future. Still, while the chapter is based on the analysis of a selection of specific case studies, its results and conclusions are potentially useful in the broader area of information and knowledge management research and practice, especially in those areas of scholarly and professional work where digital reproduction mechanisms and the application of critical and forensic perspectives to the organization of knowledge and information are crucial to the task at hand.

2 Textual Scholarship and the Digital Reproduction

Before turning to these questions, however, I should first introduce the two academic disciplines that will frame our discussion of paradata in this chapter: textual scholarship and scholarly editing. Both disciplines share a crucial focus on textual documents of cultural, historical, or political significance that are preserved, cataloged, and made accessible in memory institutions across the world. Studying these documents (i.e., analyzing the differences and similarities between versions, what this information implies about how they have been transmitted over time, and how this knowledge may inform our understanding of those documents) falls within the purview of the field of textual scholarship. Applying this knowledge and visualizing it in the form of an authoritative edition that contextualizes such documents falls within the purview of the field of scholarly editing.Footnote 1 Both disciplines are sometimes looked down on as “auxiliary sciences” in academia—useful tools that nevertheless mainly serve to lay the groundwork for the “real” research. Formulated more positively, it has been argued that the theory and practice of scholarly editing is the bedrock of many “disciplines in the humanities, such as literary studies, philology, history, philosophy, library and information science, and bibliography” (Boot et al., 2017b, p. 15).

Whichever way you look at it, these disciplines provide us with some much-needed context for the objects of our research—objects that are increasingly being digitized.Footnote 2 And these digital surrogates are also increasingly used in different stages of textual scholarship and scholarly editing workflows (Dillen, 2017b). This happens frequently at the research stage, for example, when the scholar uses digital facsimiles to transcribe the texts of their source materials, interprets them, collates and compares different versions of the texts, etc. Ideally, of course, the scholar will not rely solely on these digital surrogates for their research, but rather complement their workflow with visits to the archives to inspect the original source materials, familiarize themselves with the materiality of the documents (which may provide new insights), and double-check their findings.Footnote 3 But here too, the fact remains that the bulk of the research time is spent handling digitized objects rather than their physical originals—which implies that a lot of trust is placed in their quality and representativeness.

This trust becomes even more critical in the reception phase. Once the digital facsimiles are published as a digital scholarly edition—an environment where they apparently receive an implicit seal of approval from subject experts in the field—they effectively become the default point of entry for the users of the digital scholarly edition, who are highly unlikely to compare each digital facsimile to its archived original. This also means that it is exactly this digital facsimile that will be used as a point of reference to evaluate the accuracy of the editor’s transcriptions and the validity of the claims and arguments they are proposing about their source materials.Footnote 4 As such, this system allows the responsibility for the quality of the digitized materials to be passed along the chain—from the critical user to the editor, and from the editor to the archive.

The fact that the users of digital editions are encouraged to take the edition’s facsimiles at face value in this way is ironic, to say the least, because textual scholarship, as a discipline, is centered around the assumption that when a text is transported from one carrier to the next, variance is likely to occur, and that it is exactly through the study of this variance that we can reconstruct the history of the text, learn how it has been transmitted over time, and how its contents and reception may have changed in the process. This is doubly true when a text is transported from one medium onto another, which will invariably alter our interaction with, and hence to some extent our understanding of, the text in some minor or major way. This was the case when our practices shifted from manuscript to print, and again when those handwritten or printed documents began to be photographed digitally. The point that such photographic reproductions are never truly “unedited” or objective was already made convincingly by textual scholar Hans Zeller in the 1970s (Zeller, 1971). More recently, the reproductive potential of digital surrogates was questioned by Lars Björk, senior conservator at the National Library of Sweden (Björk, 2015). As Björk argues, when we neglect the complexities of transmission, and uncritically take for granted that the reproduction is “capable of replacing direct access to collections and documents, we will risk basing our knowledge and assumptions on sources that have a restricted capacity to convey the information potential of the document” (Björk, 2015, p. 5). Yet this is arguably exactly what is happening in the fields of textual scholarship and digital scholarly editing.

The problem here is that while the digitized images of source documents have become an increasingly significant component of the digital scholarly edition over the last decades, scholarly editors are largely unaware of the intricacies of the digitization processes that are practiced at memory institutions across the world, and rarely (if ever) document them in their editions. This systematic lack of paradata is problematic because it does not account for the fact that the quality of the results of these digitization processes is largely determined by the availability of resources and expertise, and by the successful negotiation of terms with the third parties to which memory institutions outsource some of their digitization needs. Indeed, the results of my studies across European National Libraries as part of the DiXiT projectFootnote 5 confirmed the hypothesis that the digitization workflows and standards practiced in individual memory institutions were flexible and constructed through a process of negotiation that involved various parties, each of whom brought their own stakes and perspectives to the table (Dillen, 2017a). It is my hypothesis that documenting our interaction with these standards, practices, workflows, and decisions in the form of paradata can help fill a fundamental lacuna in the contextualization of our research documents in digital scholarly editions, and thereby also play an essential role in convincing the end user of the validity of our work.

3 Case Study: manuscripta.se

To back up this hypothesis, I will focus on the findings of my main case study in the DiXiT project: the National Library of Sweden (hereafter: KB-SE). Specifically, I will reflect on an interview I conducted with one of its resident researchers: Patrik Granholm, creator of manuscripta.se, a catalog of Medieval and Early Modern manuscripts in Sweden.Footnote 6 This interview focussed on some of the concrete problems Patrik faced while developing that resource. Although to some extent anecdotal, this discussion will serve to highlight the impact a tailored digitization process can have on the resulting digital facsimile images, and how describing that process in the form of paradata might benefit the desired end product (be that called an archive, catalog, or edition).

As is usual for a project such as manuscripta.se, it was developed in a series of phases that depend largely on research funding cycles. In its initial phase, the project was hosted at the Uppsala University Library from 2012 to 2016 and aimed to construct a new, improved digital catalog of all 130 Greek manuscripts in Sweden.Footnote 7 Most of these manuscripts are held at the University Library in Uppsala, but others are kept in holding libraries across Sweden. When I interviewed Patrik, the project was in its second phase, having moved with him to KB-SE. The project’s goal at the time was to expand the original catalog with 276 more medieval manuscripts in Sweden, held either in the Uppsala University Library or in KB-SE itself. These manuscripts were written in Old Swedish.Footnote 8 At the time of writing, the project has completed this work and has moved into a new phase, in which KB-SE’s Latin manuscripts are being added to the catalog.Footnote 9 Our interview focussed mostly on Patrik’s experiences in the project’s first phase, because that phase incorporated the project’s most significant digitization effort: it was the only phase with a budget allocated specifically to the digitization of the manuscripts in question.

As the first phase of the project was hosted by the Uppsala University Library, the library that had also collected most of the Greek manuscripts in Sweden, the bulk of the materials were digitized by this institution. Besides the Greek manuscripts in its own holdings, manuscripts held by other Swedish university libraries (specifically Gothenburg University Library, the Linköping Diocesan Library, and the Skokloster Castle Library) were transported to Uppsala to be digitized there by the university library’s designated photographer. However, some memory institutions that have their own specialized digitization services (such as KB-SE and Riksarkivet—the Swedish national archive) insisted on digitizing the manuscripts themselves rather than transporting them to Uppsala. This invariably led to differences in approaches and standards. At Uppsala, for instance, a local rather than a national memory institution, digitization was still in its early stages. This gave Patrik the chance to provide more input for the development of the institution’s digitization workflow, and to place certain demands on the quality of the reproductions. At KB-SE, on the other hand, digitization practices were more advanced and noticeably more professional, and while this may have been beneficial for the initial quality of the reproductions, it also gave Patrik less of a chance to influence the digitization process. At Riksarkivet, in turn, Patrik had even less opportunity to place demands on the quality of the reproductions, as its high-end tier of digitization services was too expensive for the project, obliging the team to opt for lower-resolution scans of the manuscripts. This means that there is considerable variance in the quality of the reproductions produced at different memory institutions over the course of the project.

Besides these differences in quality between institutions, Patrik also suggested that there may be considerable differences within the selection of reproductions produced by the Uppsala University Library—partly because he had more of an influence on the digitization practice there. As the institution did not have much experience with digitization at the time the project started, some improvements were made during the course of the project, and these were not always applied retroactively to earlier reproductions. The most important factor here is perhaps the improvement of the institution’s equipment. At the start of the project, the library acquired a new book scanner for the specific purpose of this digitization project. After producing a few test batches, however, it became clear that the quality of the reproductions did not suffice for the project. Asked for examples, Patrik explained that the resolution was too low, that the colors were inaccurate, and that the scanner’s dual light source produced a flare in the middle of the scan. When Patrik confronted the photographer with this problem, it was agreed that the quality of the reproductions needed to be improved and that a new solution had to be found. This is a clear example of how a researcher’s demands may influence the negotiation of the quality standards of digitization practices at memory institutions. Patrik explained that he pushed hard to have the manuscripts digitized with a camera, which in the end they were: the test batch was discarded and reshot with a 20 megapixel Hasselblad camera. In the course of the project, however, the institution made a new investment in its equipment and purchased a 60 megapixel camera, which, from that point onward, was used for the project instead. The result is that approximately one-third of the reproductions at the Uppsala University Library were digitized at a lower resolution.

Another improvement to the digitization practice at Uppsala University Library whose effects were not applied retroactively concerned the control of the lighting conditions in the designated photography room. Patrik explained that at the beginning of the project, the door of this room was not always closed, allowing some natural light to enter the room, which affected the lighting and color of the reproductions. This problem becomes especially apparent when you start compiling spreads—placing each recto alongside its rightful verso—which allows for a close comparison of facing leaves. If the two photographs were shot under exactly the same lighting conditions, there should be no color difference between recto and verso, allowing them to be stitched together seamlessly. But since the rectos and versos were photographed in separate, consecutive runs (a common practice in heritage photography to speed up the workflow), and natural light was allowed to influence the digitization process, this could lead to a distinct color variance between recto and verso in individual spreads. In the course of the project, this problem was discovered and eventually solved, but not retroactively corrected, leading to some differences in quality between individual reproductions digitized at Uppsala University Library, even within the same manuscript.

These instances suggest that there was a good dialogue between the researcher and the photographer, allowing both parties to learn a lot from each other in the process. The advantage of such close collaboration, at a point where a memory institution is still developing its digitization practices, was that it allowed Patrik to help steer the construction of the workflow to some extent, giving him more control over the final state and quality of the reproductions. The disadvantage, on the other hand, was that since the quality standards of the reproductions were still evolving over time, this produced a discrepancy in the quality of early versus late digitized reproductions. And this evolution continued up to the time of the interview: confident that this collaboration had taught both him and the institution’s photographer a lot and had put them more or less on the same page with regard to quality control issues, Patrik nevertheless felt that there was still room for improvement and that a continued close collaboration between both parties would be advisable and mutually beneficial.

At an institution with a developed digitization workflow and carefully constructed quality standards like KB-SE, researchers may not be able to influence the reproduction of the originals as much, but there would arguably be less reason to do so in the first place. Although Patrik suggested that he would still think it necessary to check the quality of each of the images himself, the fact that KB-SE has a longstanding tradition of developing guidelines and setting (high) standards for its digitization practices inspires more confidence in the resulting reproductions and would hopefully decrease the time spent discussing issues with the photographers and reshooting individual pages as a result. In addition, the fact that a memory institution has an established digitization workflow that is less prone to change can also be an advantage. Especially for high-volume digitization projects like manuscripta.se, this would make the quality of the reproductions more consistent, and therefore also more easily accounted for. Even in the case of low-quality images, documented digitization guidelines and workflows at least give the researcher a tangible point of reference to explain any discrepancies between the original and the reproduction to the user. In the end, it is here that the researcher will have to decide what still constitutes an acceptable quality standard for the edition, and to what extent it is worth surrendering image quality in order to secure improved accountability for the edition.

This brings us to the final stage of the manuscripta.se project: presenting the reproductions of these manuscripts as part of the digital catalog the team was developing. To do so, the catalog needed an interface—a publication platform that would make the digital information accessible to the user. The software that was used to display the images on the website is called IIPImage. To work, this software requires the images to be saved on the server in a tiled, multi-resolution TIFF or JPEG2000 format. For manuscripta.se, Patrik chose to work with the first option. To achieve this format, Patrik needed to process the (regular) TIFF images he received from the library, turning them into tiled “Pyramidal TIFF” images before putting them on the server. Patrik reported that he did not perform any additional postprocessing on the images (except changing their file names to also include their folio number). He suspected that the photographer did perform some postprocessing on the presentation copies of the digital reproductions, such as cropping, color adjustment, and sharpening—a practice that would be consistent with KB-SE’s standard workflow, but these steps were not explicitly communicated to him. This is a good example of how some aspects of the decision-making process remained invisible—even to the person who requested the images.
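For readers unfamiliar with this step, the sketch below shows how such a conversion from a flat TIFF to a tiled, pyramidal TIFF can be performed with the libvips library (through its Python binding, pyvips). The interview does not specify which tool was actually used for manuscripta.se, so this is only an illustration of the general technique under that assumption; the file names are hypothetical.

import pyvips

# Hypothetical file names: the flat TIFF delivered by the library is
# converted into a tiled, multi-resolution ("pyramidal") TIFF that an
# image server such as IIPImage can serve efficiently.
source = "delivered/ms_001r.tif"
target = "server/ms_001r_pyramidal.tif"

image = pyvips.Image.new_from_file(source, access="sequential")
image.tiffsave(
    target,
    tile=True,            # store the image as fixed-size tiles
    tile_width=256,
    tile_height=256,
    pyramid=True,         # add progressively downscaled layers
    compression="jpeg",   # compact but lossy; use "deflate" for lossless
    Q=90,                 # JPEG quality for the tiles
    bigtiff=True,         # allow output files larger than 4 GB
)

Renaming the files to include the folio number, as Patrik did, would take place at this same processing stage.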

In the delivered images, a standard margin was left around each manuscript page, and a ruler was photographed in this margin to attest to the page’s size. The images were not rescaled to 1:1 at 300 ppi (as is KB-SE’s default practice) because the manuscripts vary considerably in size (some are as little as 10 cm wide), and such rescaling would have rendered the text of the smaller manuscripts illegible. As discussed above, Patrik checked the quality of the images extensively before uploading them to the server, comparing them to the manuscript (leafing through the original while inspecting the digital copy). Sometimes issues with color correctness, brightness, or focus were discovered, but the most frequent errors were missing pages and cases where a bookmark or part of the manuscript concealed some of the text. In these cases, requests were made to shoot or reshoot the relevant pages. When Patrik was satisfied with the quality of the digital reproductions, he uploaded them to the server and made them accessible to his users.

This case study shows that, depending on the memory institution, the researcher who commissions the digitization of specific cultural heritage objects may have more or less of an influence on the digitization practice, and it illustrates what the advantages and disadvantages of such influence may be. It also highlights the kinds of demands that the researcher will place on the digital reproductions and how the negotiation of quality standards between researcher and photographer may take place. In Patrik’s case, there was already a difference here in his dealings with the various memory institutions. And indeed, for a large part, Patrik found himself in quite a rare position, where the connections he had established earlier with the library staff allowed him to be more closely involved in the process. To a less connected patron, such a privileged position may not be available, and even more parts of this process may remain invisible to them as a result. The example also indicates that such digitization still happens on an ad hoc basis and that digitization practices across memory institutions have not yet been standardized. This means that there will be considerable differences in quality and standards across memory institutions—or even within a single memory institution, as guidelines and workflows are still under development and its equipment and staff undergo changes. It should also be noted that, at least in this particular case, the researcher was more concerned with getting the best possible quality for each individual image (pushing its quality as far as possible depending on the rapport he had with the photographer and the type of quality the project’s budget could feasibly afford) than with providing a more consistent overall quality for all the images, based on a single and more easily documented quality standard.

At the moment of writing, the catalog’s reproductions of the manuscripts are offered without any reference to the quality standards that were applied when photographing them, but Patrik attested (based on his interactions with the users of his catalog) that the lack of such a description is not an issue that users are particularly concerned about. In part, this can be explained by the fact that manuscripta.se is a catalog first—not an image repository, nor a digital scholarly edition. Still, even in the case of scholarly editions, attesting to the way in which the digital reproductions that are used in the edition were produced is not a common practice. But that is of course no reason not to start doing so. Indeed, as these digital reproductions become more detailed and more important in our digital scholarly editions, we should also become more vigilant about the way we present them (and with them, their relation to the originals) to our users.

4 Paradata for Digitization Processes

As became clear in my comparative study of National Libraries for the DiXiT project, these large memory institutions usually offer several tiers of digitization services, from low-end services (with the help of librarians and the library’s most basic equipment) to high-end services (performed by photographers with professional equipment). Which of these services are relevant and available for a specific project is evaluated on a case-by-case basis, in a process of negotiation between the library’s staff and its patrons that depends on a wide range of factors—such as the material state of the document, the needs of the project, the availability of equipment, the experience of the staff, and the funding that is available.Footnote 10 As my case study has demonstrated, the fact that so many aspects of this process potentially remain invisible to the end user is problematic from a scientific perspective, because these aspects can play an important role in the project. For example, the equipment and professional expertise a library has at its disposal may change over time, and certain tiers of digitization services may be unavailable through a lack of funding. Intuitively, one would think that a given memory institution’s high-end digitization services would be the ideal environment to procure the high-quality, consistent digitization results that a prestigious digital scholarly edition (or catalog) requires—not just because it is potentially the highest-quality option (using the best equipment, producing the highest resolution scans and truest color reproductions), but also because it is usually the only option that follows high predetermined standards, where the digitization is performed by trained professionals and takes place in a more or less controlled environment.

As the above case study has demonstrated, however, digitization is not a purely mechanical process that produces identical results at every iteration. On the contrary, it is a highly personal process that involves a number of different agents, each with their own experiences and demands, who are at times quite literally involved in a process of negotiation: among themselves (based on personal preferences and professional experience), with the physical limitations of the document, with the environment where the digitization takes place, and with any contractual obligations that need to be met—all of which have a direct impact on the quality and consistency of the digitized end product. Indeed, as we have seen, there is still plenty of room for variation in quality among the delivered image files—even over the course of a single digitization project, conducted by a single team, at a single institution, performing a single digitization service. In some cases (e.g., when overall consistency is more important than factors such as image resolution), this may mean that a more mechanical, lower-end digitization service (like an automated book scanner) might be preferred over high-end alternatives.Footnote 11

Whichever suitable service or digitization process is agreed upon, however, the case study highlights how important it is to provide some sort of context for the images you use (such as their origin and some specifications of the digitization process) to justify their quality (or lack thereof). This is especially crucial because a lower quality invariably points to a greater divide between the original and the reproduction and, in the case of a scholarly edition, potentially between the reproduction (which is often the user’s only point of reference) and the transcribed text. This divide needs to be addressed and accounted for somehow before the editor’s transcriptions can really be considered “assessable” by the user—which is arguably the whole point of adding these images to the digital scholarly edition in the first place. This is where paradata could enter the picture.

Paraphrasing the definition of paradata cited at the start of this chapter, we could say that it concerns data that describes data creation processes. In some cases, paradata comes in the form of metadata (highly structured “data about data,” described in metadata terms, following a specific ontology like DCMI,Footnote 12 or CIDOC-CRMFootnote 13—to stick with examples that are relevant for memory institutions). For an example of what this may look like for digitization practices, we can consider EXIF (Exchangeable Image File Format)—a standard for storing metadata in image and audio files.Footnote 14 Nowadays, most digital cameras use this format to automatically store information about the equipment and settings that were used directly in the image file, at the moment the image was captured. This may include the make and model of the camera or scanner that was used, the model of its lens, the aperture and shutter speed that were used, etc. All of this data is structured as metadata and is highly relevant to the image’s creation process, which means it qualifies as paradata too. But not all metadata is paradata. Say, for example, that we have a book in the collection of a library, and that the book’s metadata includes a Dewey Decimal Classification (DDC) number, which assigns the book to a particular discipline and helps the library’s patrons locate the document. This number is part of the book’s metadata because it is a highly structured record that tells us something more about the book (i.e., it is “data about data”). But it is not necessarily paradata, because this number was given to the book by a librarian, in retrospect, based on their interpretation of the document—and tells us nothing about the circumstances in which the document was originally created. At the same time, not all paradata is metadata. Take, for example, the interview I conducted with Patrik. This interview is a source of data in itself, preserved as an audio recording with an accompanying transcript, and it has served to support some of the arguments I make in this chapter. It also qualifies as paradata, because it gives an account of a set of conditions and practices that had a direct impact on the creation of the data files (in this case: images) that would later become part of the manuscripta.se database. But it is not strictly metadata, because it is not structured and mapped onto an existing ontology of metadata terms.
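As a concrete illustration, the short sketch below reads this kind of embedded EXIF metadata from an image file using the Pillow library in Python. The file name is hypothetical, and which tags are actually present will of course depend on the camera or scanner that produced the file.

from PIL import Image, ExifTags

# Hypothetical file name; any camera-produced JPEG or TIFF will do.
with Image.open("facsimile_001r.jpg") as img:
    exif = img.getexif()

# Base tags (IFD0): camera make and model, capture software, timestamps, ...
for tag_id, value in exif.items():
    print(ExifTags.TAGS.get(tag_id, tag_id), value)

# Exposure settings (aperture, shutter speed, ISO, ...) live in the EXIF sub-IFD.
exif_ifd = exif.get_ifd(0x8769)  # 0x8769 is the pointer tag for the EXIF sub-IFD
for tag_id, value in exif_ifd.items():
    print(ExifTags.TAGS.get(tag_id, tag_id), value)

Extracted in this way, tags such as Make, Model, and DateTime document the capture equipment and moment, while the sub-IFD holds exposure settings such as aperture and shutter speed: precisely the kind of automatically generated paradata discussed above.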

What this means is that paradata is very much in the eye of the beholder and that what does or does not count as paradata in a given context greatly depends on the phenomenon that is being studied. So we would do well to specify exactly which data creation workflows and processes we are dealing with, before considering what the paradata around these processes may look like, and what the benefit would be of sharing them with the research community. In turn, this will help us “explicat[e] in depth what to document and how to capture it” (Huvila, 2022, p. 33).

In the context of this chapter, the process we are interested in is the digitization process. Specifically, we are looking into the way in which we create digital reproductions of physical objects. With regard to this digitization process, we are not just focussing on the moment of capture (e.g., when the digital photograph is taken, the scan completed, or the video recorded), but also (and especially) on the workflows that dictate how this act of capturing is performed, and how the resulting reproduction is processed and evaluated afterward. And as we have seen, these workflows are basically decision-making processes in which different agents interact with each other and directly or indirectly negotiate how the physical documents and their digital reproductions are handled—agents who each place their own demands on the physical and digital objects concerned, based on a strong foundation of professional expertise. This means that the primary sources of our paradata are human agents, and that the most straightforward way of obtaining such paradata is through qualitative social science research methods, such as interviews, surveys, observation studies, etc. In addition, these primary research data could potentially be complemented with secondary materials, such as institutional web pages, internal (digitization) policy documents, descriptions of workflows and standards, third-party contracts and terms of agreement, automatically generated metadata, changelogs, paper trails of internal and external communication, etc. Combined, all these materials can help paint a detailed picture of exactly how the physical object was digitized, and why it was digitized in this way. A minimal sketch of what such a combined record might look like for a single image is given below.
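The following sketch is purely illustrative: it bundles a few of the primary and secondary sources named above into one per-image record. The field names, values, and JSON serialization are assumptions made for the sake of the example, not an existing standard or ontology.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class DigitizationParadata:
    """One illustrative paradata record for a single digital reproduction."""
    image_file: str                                    # the delivered reproduction
    source_shelfmark: str                              # the physical original it reproduces
    institution: str                                   # where the capture took place
    capture_device: str                                # e.g., taken from embedded EXIF metadata
    capture_date: str
    postprocessing: list = field(default_factory=list)       # cropping, color adjustment, ...
    workflow_documents: list = field(default_factory=list)   # guidelines, contracts, changelogs
    narrative_sources: list = field(default_factory=list)    # interviews, observation notes

# Hypothetical values, loosely modeled on the case study above.
record = DigitizationParadata(
    image_file="ms_001r.tif",
    source_shelfmark="Uppsala UB (placeholder shelfmark)",
    institution="Uppsala University Library",
    capture_device="60 megapixel camera (after the mid-project equipment upgrade)",
    capture_date="2014-03-12",
    postprocessing=["cropping", "color adjustment (not documented in detail)"],
    workflow_documents=["internal digitization guidelines (unpublished)"],
    narrative_sources=["interview with the commissioning researcher"],
)

print(json.dumps(asdict(record), indent=2))

However such a record is serialized, the point is that the automatically generated metadata and the narrative sources complement each other, and only together do they document both how and why the object was digitized in the way it was.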

5 Paradata for Digital Scholarly Editions

In this last section of the chapter, I would like to reflect on what these findings mean for the digital scholarly edition. It would be too hasty to conclude, for example, that from this point onward, any digital scholarly edition worth its salt needs to incorporate all the relevant paradata pertaining to the images it uses, from EXIF metadata to research diaries and recorded interviews with any agent who was involved in the digitization process in some way or other. A close examination of digitization practices at a given memory institution, with a view to developing a portfolio of relevant paradata, may well be a worthwhile research project in itself, one that can provide us with new, valuable insights into the exact relation between physical objects and their digital reproductions. But requiring such an investigation to be attached to each and every digital scholarly editing project is simply infeasible and arguably lies well outside the purview of the editorial endeavor.

To some extent, this is due to the nature of the paradata involved. Some paradata is easy to come by, but extremely hard to understand. For example, a data dump of metadata, technical specifications, and changelogs about the images that were used in the digital scholarly edition would be relatively easy to extract and provide alongside the edition. But without additional context, such data would be lost on the average user of the edition, who is mainly interested in the edition’s documents and their interpretation, and who does not necessarily know how to read (let alone assess) this kind of technical documentation—thereby defeating the purpose of the paradata. On the other hand, some paradata is easy to understand, but extremely hard to come by. Providing detailed descriptions of digitization workflows, interactions, and decision-making processes on the basis of interviews, observation studies, and other social science methods is simply infeasible for most digital scholarly editions. It would add a whole new line of inquiry to the research, requiring the team of editors to add an entirely new research profile to their project, with its own allocated researchers, time, and budget. This would put even more of a burden on the editorial team than is already the case.Footnote 15 Requiring this kind of treatment of any self-respecting digital scholarly edition would make it even more of a prohibitively expensive research endeavor than it already is.

Still, it may be possible to find a middle way. As Huvila remarks in the concluding paragraphs of his essay: “an equally critical question to having enough [paradata] is how to avoid having too much” (Huvila, 2022, p. 41). And indeed, since different data users have different data needs in different contexts and situations, it is essential that we first try to understand which types of paradata are “useful and usable” in our particular context (Huvila, 2022, p. 29). In our case, this context is that of a scholarly editor developing a digital scholarly edition on the basis of a (series of) document(s) held by one or more memory institutions. In this sense, the scholarly editors are not exactly the creators of digital reproductions of physical cultural heritage documents, but rather the reusers of those reproductions. This means that while paradata that attests to the way these digital reproductions are created at a given memory institution would be relevant information to the scholarly editor (as well as to the resulting digital scholarly edition’s user), it is not the editor’s responsibility to commission, store, and distribute those data in the first place. Instead, the scholarly editor is reusing the memory institution’s materials (be they physical, digital, or both) in order to make their own digital object: the digital scholarly edition. And it is paradata detailing how this latter digital object was created that contains essential contextual information for the digital scholarly edition’s (re)users.

The practice of describing exactly how scholarly editions are made is not new to the field of scholarly editing. In fact, it is arguably one of its strongest foundations. If centuries of textual critical thinking have taught us anything, it is that the authority of some documents, texts, or versions over others is very much a matter of interpretation, and that opinions differ greatly when it comes to, for example, the extent to which the scholarly editor is allowed to intervene in the texts they are editing. And it is exactly this realization of how much the presentation of the scholarly edition ultimately depends on the editor’s perspective that has long required the scholarly editor to position their research in the field and justify any editorial decisions they have made in a designated section that is appended to the scholarly edition.Footnote 16 This section, which existed long before the discipline took a digital turn, is often called “Editorial Principles” or something similar and does exactly that: it explicates the editorial principles on which the edition is based, and the rules the editorial team has followed to construct the edition, with as few exceptions as possible. Since the digital turn, this section has only gained in significance and has been expanded to also include some type of “Encoding Description” and other “Technical Documentation.” In these sections the editors do not just explain the theoretical framework they have used, but also which technical standards they have used,Footnote 17 and which choices they have made within those standards, to put this framework into practice. Such sections have become a staple of the (digital) scholarly edition—to the extent that they are, for example, repeatedly mentioned in the evaluation criteria used by RIDE, the well-respected review journal for digital editions and resources in the field (Sahle et al., 2014).

These sections can be regarded as important sources of paradata avant la lettre. They are indeed narratives that embed useful information about the way in which a digital dataset was created—in our case: the digital scholarly edition. As scholarly editors realized a long time ago, this type of information is absolutely essential if we want our data (be it digital or in print) to be used, understood, and potentially reused. And it is exactly by enclosing this information in the edition that it can achieve one of its most important goals: to inspire a critical attitude toward texts in general, and the relevant work(s) in particular. What seems to be missing from these sections, however, from the perspective of this chapter, is similar information that helps contextualize the authority of the images that were used in the dataset. This information does not necessarily need to be as detailed as all the paradata we can imagine to contextualize the digitization process, as described above. The important information pertaining to the relevance and reusability of the images that were used in the digital scholarly edition is not so much exactly how the digital surrogate resembles or differs from the original document, but rather the extent to which the editor recognizes that the digital reproductions are sufficiently representative for the context in which they are used, despite their specific limitations. Ultimately, it is the editor who vouches for the authority of the images that are used in the edition, in the same way that they vouch for the authority of their texts. And just as the authority of the text is justified in dedicated sections detailing the edition’s “Editorial Principles,” the editor would do well to justify the authority of the images that were used by providing more context relating to the way in which they were obtained. In both cases, these justifications serve to convince the reader of the soundness of the editor’s research and argumentation, or, failing that, to point the reader in the direction of the originals whenever they doubt the editor’s claims.

Writing such a section should not take up a disproportionate amount of time and space, nor should it require the editor to acquire any additional skills. What is needed is simply some additional context that frames the editor’s decision-making process around the time when they made or acquired the digital images and determined to what extent these were sufficiently representative of their originals for the digital scholarly edition. This could be achieved, for example, by describing their own role in the digitization process—however large or small this may have been. Such an account may include a brief description of the specific digitization service they engaged (and, where relevant, whether this choice was the result of a process of negotiation), which requirements and technical specifications they requested from the digitization team (and, where relevant, whether these requirements were met), whether or not the editor established a rapport with the digitization team (and, where relevant, whether this feedback loop had any effect on the overall quality of the images), and which measures the editor took to ensure the quality and representativeness of the reproductions in relation to their originals (for example, by comparing the reproductions to the originals to make sure they are accurate and complete). By doing so, the editor would acknowledge their position as one agent in a complex process while, at the same time, explicitly taking responsibility for this crucial aspect of the digital scholarly edition. This would provide the reader with the necessary information to decide whether or not to trust the editor’s judgment, thereby drastically improving the accountability of the edition. At the same time, such a section would raise general awareness of the complexity of digitization processes and acknowledge that the resulting digital reproductions are objects in their own right, distinct from their physical originals and differing from them in significant ways. All of this can only help inspire a critical attitude in the reader and open their mind to new interpretations of treasured cultural heritage documents. And that, to a large extent, is exactly what the fields of textual scholarship and scholarly editing are all about.