
1 Introduction

This chapter centres on a research project based on the re-use of data detailing the material accessioned into archival institutions in the UK during the period 2007–2020. This project required the assemblage of both these data and an understanding of that data in order that it might become of use for a particular purpose different to that for which it was intended when it was originally collected. In this way, data became dataset as data were re-set and re-framed in order to serve this new use. The processing whereby the data were set will be described, as will the information which was gathered and/or recorded as part of this process of re-setting. During this process, the concept of paradata was encountered and the work being undertaken provided an opportunity to explore it further. Questions arose of whether, in the context of this work, such a concept was relevant and, if so, what, from that work, might be considered paradata. Reflections on these questions will be offered at the end of the chapter, in order to encourage readers to shape their own reflections on paradata as it applies to their own work and context.

The National Archives (TNA) holds a wealth of accumulated data concerning archival materials held across the UK and beyond. The existence of this data is the result of concerted collective effort by a number of bodies and projects over many years. These bodies have included staff at archive services operating at both local and national levels and projects including the National Register of Archives (NRA) (which was started by the Historical Manuscripts Commission shortly after the Second World War) and the Access to Archives project (which ran at the turn of the twenty-first century). The data used by the project described in this chapter are the result of the National Accessions to Archives Survey, which TNA runs annually. Accession is the term used within archive services to refer to both the process of taking ‘intellectual and physical custody of materials’ and those materials themselves—‘the materials physically and officially transferred to a repository [or archive service] as a unit at a single time’ being described as an accession (Society of American Archivists, n.d.). The National Accessions to Archives Survey requests archive services around the UK to provide brief details of all the accessions they have received within a calendar year.

To date, the main use and principal reason for the collection of this data has been to enable researchers to discover and locate sources of interest around the UK. However, a growing realisation that this data also held the potential to produce insights into both historical and contemporary patterns of collecting across the UK led to a desire to try to realise this potential. Focusing as much on investigating the possibilities and mechanisms for generating insights as on the insights themselves, a 12-month Research Fellowship was established in 2021 at The National Archives to re-assemble a subset of the data gathered during the annual survey exercise (hereafter referred to as the accessions to repositories data) and then to explore its potential for surfacing insights around patterns of archival collecting. The Fellowship concluded at the end of 2022, and further details of the work carried out during it can be found in the following section.

2 Starting to Set the Data

At the start of their work in 2021, the Fellow was presented with the data that were both readily available and thought to be of potential use given the aims of their work. These data consisted of details of annual accessions and existed in the main in two distinct forms. The first form existed as a set of CSV data frames that each contained information on accessions for the years 2016–2020. These data frames had merged the annual accessions information supplied by repositories across the UK from these years. The framing of this data had been carried out according to the Tidy Data principles outlined by Hadley Wickham and others (2019), in that each column in the data frame was a variable. These columns included ARCHON Number (a unique identifier for the individual repository or archive service into which the material had been accessioned), Name of Repository, Record Creator (who had created the material being accessioned), Brief Description of the Record (what that material consisted of), Size of the Record (how much material there was), and the Dates Covered by the Record (the time period represented by/within the material).

Data from the years before 2016 were not in such an easily accessible form, but rather in one that prevented analysis and the identification of patterns in collecting practices. The data detailing the accessions between the years 2007 and 2015 had been stored on TNA systems as a series of files in the original format returned by the individual repositories participating in the annual accessions to repositories survey. These mainly took the form of an Excel file sent out by representatives of TNA to participating institutions. This form collected the same data described in the previous paragraph, but its format and layout had changed since 2007, so a simple merging of the forms was not possible. Furthermore, many repositories had submitted their returns to the survey in a variety of other formats, including Word documents, PDFs, and in some cases image files of printouts. These were themselves attached to emails stored as MS Office Outlook (.msg) files in folders within the corporate SharePoint system. The framing of the data from between 2007 and 2015 was therefore considerably less helpful with respect to its state of readiness for subsequent use or analysis.

The data from between 2007 and 2015 therefore needed lengthy processing to bring them into the same state as the data for the years 2016–2020. The first task was to extract the attachments from the stored emails. Once this had been completed, the next task was to bring all the data contained in these many different attachments together into a single data frame for each individual year. In both cases this required a largely manual, non-computational effort that was extremely time-consuming and repetitive, yet necessary to lessen the risk of accidental data loss or misinterpretation.
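The consolidation step can be sketched in pandas, assuming the extracted returns had already been converted to CSV files; the file layout and column names below are illustrative, not those of the actual TNA systems:

```python
from pathlib import Path

import pandas as pd


def merge_year(folder: str) -> pd.DataFrame:
    """Concatenate every per-repository CSV return in `folder`
    into a single data frame for one accession year."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        # Tolerate minor header variants across returns.
        df.columns = [c.strip().lower() for c in df.columns]
        # Keep provenance: which original return each row came from.
        df["source_file"] = path.name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Recording the source file name alongside each row is one simple way of preserving a thread back to the original return when questions later arise about individual entries.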

Issues encountered during this process included the discovery that differing interpretations of what should be entered into the different columns had led to problems with a small yet significant number of the returns. For example, in many returns, information had been included in the wrong place, such as details of record donors appearing within record creator fields, or the covering dates becoming confused with the date of record creation. A number of strategies were developed to identify, and where possible rectify, these problems with the data. In cases where data had been inserted into the wrong column, it was possible to use data wrangling tools such as OpenRefine (formerly Google Refine) and the Python library pandas to help identify and move the data. For another significant number of returns, the year to which they referred was not self-evident, and a judgement call needed to be made. This was because many institutions would submit a number of returns at once. For instance, one repository had not sent in returns between 2007 and 2010, but in 2011 had sent through returns for that year as well as the preceding four. Since the goals of the Fellowship were to use the survey data to understand trends and patterns in accessions, the decision was taken in this and similar cases to separate the data by year of accession rather than by the date it was received by TNA.
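The identify-and-move strategy can be sketched in pandas. The pattern used here ('Donated by …') and the column names are hypothetical, chosen only to illustrate how misplaced donor details might be detected in a creator field and moved out of it:

```python
import pandas as pd


def move_misplaced_donors(df: pd.DataFrame) -> pd.DataFrame:
    """Move values that are clearly donor details (hypothetical
    pattern: entries beginning 'Donated by') out of the creator
    column and into a separate donor column."""
    df = df.copy()
    misplaced = df["creator"].str.match(r"(?i)donated by", na=False)
    df.loc[misplaced, "donor"] = df.loc[misplaced, "creator"].str.replace(
        r"(?i)^donated by\s*", "", regex=True
    )
    # Clear the creator field once its content has been relocated.
    df.loc[misplaced, "creator"] = pd.NA
    return df
```

In practice each such rule would need checking against the real returns, since the wording used by individual repositories varied considerably.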

Another aspect of processing that shaped the final dataset concerned the standardisation of the data within it. Substantive work was carried out to standardise many common values, such as the names of prominent individuals, dates, and locations appearing in the individual accession descriptions. Hardly any of the fields in the standard returns template were authority-controlled, and the differing styles of those completing them led to much variety. For example, dates were written in a number of different formats—‘Jun 1915’, ‘June 1915’, ‘01/06/1915’.
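The variety in date spellings can be illustrated with a small pandas sketch that reduces each form to a single year–month representation. Day-first parsing is assumed, since the returns were UK-produced, and the target precision here is an illustrative choice rather than the project's documented one:

```python
import pandas as pd


def normalise_date(value: str) -> str:
    """Render the varied date spellings found in the returns
    ('Jun 1915', 'June 1915', '01/06/1915') in one ISO-style
    year-month form. Day-first parsing is assumed throughout."""
    parsed = pd.to_datetime(value, dayfirst=True)
    return parsed.strftime("%Y-%m")
```

Reducing everything to year–month is itself a standardisation compromise of the kind discussed below: it discards the day where one was given, and silently supplies a default where none was.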

To make the data machine readable, the Fellow decided to replace natural language phrases such as ‘mid-twentieth century’ and ‘Victorian England’ with numerical representations in the format yyyy–yyyy. The nearest thing to a UK archival standard on how to do this was contained within the National Council on Archives Rules for the Construction of Personal, Place and Corporate Names, which offers the following model (National Council on Archives, 1997):

  • Early nineteenth century 1800–1840

  • Mid-nineteenth century 1830–1870

  • Late nineteenth century 1860–1899
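This model lends itself to a simple lookup table. In the sketch below, the nineteenth-century rows follow the published model; the twentieth-century rows are illustrative extrapolations of the same pattern, not values endorsed by the NCA rules or by the project:

```python
# Nineteenth-century ranges follow the NCA Rules model quoted above;
# the twentieth-century entries extrapolate the same pattern and are
# illustrative only.
PERIOD_RANGES = {
    "early nineteenth century": "1800-1840",
    "mid-nineteenth century": "1830-1870",
    "late nineteenth century": "1860-1899",
    "early twentieth century": "1900-1940",
    "mid-twentieth century": "1930-1970",
    "late twentieth century": "1960-1999",
}


def to_numeric_range(phrase: str) -> str:
    """Replace a natural-language period with its yyyy-yyyy range,
    leaving anything unrecognised untouched for manual review."""
    return PERIOD_RANGES.get(phrase.strip().lower(), phrase)
```

Leaving unrecognised phrases untouched, rather than guessing, keeps ambiguous cases such as ‘Victorian England’ visible for manual review.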

However, before applying this model, the Fellow wanted first to ascertain whether it was one that would be shared by others. He therefore sent out a tweet to ask the question, ‘what is the most accurate way of numerically representing “mid-twentieth century”?’ (Fig. 1). The tweet received 45 responses, and whilst not a scientific measure, these responses demonstrate a decided difference of opinion on how ‘mid-century’ should be numerically represented. It had initially been thought that xx25–xx75 would emerge as the most favoured choice because it rested on an intuitive division of a century into quarters, but as it turned out, it was the second least popular choice amongst researchers on Twitter, and some complained that 1933–1968 was not included as an option.

Fig. 1 Results of a poll carried out on Twitter as to the most accurate way of numerically representing mid-twentieth century. Of the four options offered, 1939–1969 received 40% of the votes, 1933–1969 26.7%, 1925–1975 24.4%, and 1930–1980 8.9%.

Bowker and Star explore the problematic nature of standardisation and formalisation within the context of formal medical nosology, namely the World Health Organization’s International Classification of Diseases (2000). They provide an ideological reading of the medical nosology, claiming that the act of codification strips social and economic factors from the conditions it lists. A concern during this Fellowship was that the standardisation required for the purposes of the work would unwittingly remove meanings, conscious or unconscious, from the data in the returns, reducing or stripping entirely the agency of those who completed them. The need to make judgements on how to standardise data such as dates had to be balanced against a desire not to lose meanings contained in diversity of expression. A number of compromises between diversity of expression and the need to make the dataset machine readable had to be made during the work to clean the data. To improve transparency, a lengthy report was produced that outlined and discussed the decisions taken to standardise, and therefore shape, the data. For instance, due to the condition of some of the returns and the time constraints upon the Fellowship, it was not possible to avoid all data loss during the processing described above. The amount of data lost is estimated at around 1% for each year. This was judged to be statistically insignificant for the exercise, but it is nonetheless a loss. The whole process of assembling and cleaning a dataset that consolidated and standardised the available accessions to repositories data from 2007 to 2020 took approximately 6 months.

3 Understanding the Data

Whilst the data were being set, and decisions were being taken about how they might best be standardised, attempts were also being made to understand better what they represented and where they had come from. As a result, the following contextual narrative was constructed.

The annual accessions to repositories survey traces its origins back to 1923 and to a practice initiated at that time of regularly publishing, within the Bulletin of the Institute of Historical Research, listings of the movements of historical manuscripts. These listings were constructed from two main sources: firstly, catalogues detailing the sale of manuscripts, and secondly, annual reports from local and national repositories, which provided details of manuscripts they had received during the year. By the early 1950s, however, the Bulletin was finding it difficult to allocate sufficient space to these ever-growing listings, and overlap was also starting to be seen with the work of the National Register of Archives, which had been established under the aegis of the Historical Manuscripts Commission (HMC) in the immediate post-Second World War period. In 1955, the first HMC List of Accessions to Repositories was published, providing a summary of the details of accessions that had been received from 86 repositories across the UK (Historical Manuscripts Commission, 1955, p. 1).

The process of compiling this annual listing was framed very much as an editorial one. Prefatory and editorial notes often preceded the published listings, and these notes provide insight into some of the decisions being made as part of that process. For example, in the listing for 1955 it was noted that: ‘it has again been necessary to curtail some of the reports considerably, and much detail has had to be omitted’ (Historical Manuscripts Commission, 1956, p. 1). Then again, the listing for 1963 noted:

In order to control the increasing size of this List, certain entries have been given in less detail than hitherto, viz,:

  a) Deeds are not described in detail when they relate to the area where the repository is situated; details are however given when they relate to other areas.

  b) Parish and school records are not described in detail unless the number of such deposits shown is few or the records are of particular interest (Historical Manuscripts Commission, 1965, p. iii).

The editors of the list seem to have faced a constant battle to keep the publication to a ‘reasonable size’ as the number of repositories approached for returns steadily increased to over 200 by the mid-1990s (Sargent, 1995).

As the size of the volume grew, changes were also made in terms of arrangement, indexing, and layout to make the listings easier to consult. An alphabetical ordering by repository name was often favoured, although indexing repositories to geographic counties was also considered important, leading to various editorial notes on changing county boundaries over the years. A major typographical change in the listing for 1972 seems to have reflected a move towards an even more selective approach to compilation, it being noted that:

Efforts are also being made to simplify the contents as far as possible by aiming to provide outline descriptions only of the more important accessions to each repository in place of what had become a welter of indiscriminate and undigested detail (The Royal Commission on Historical Manuscripts, 1974, p.v).

The annual listings were not the only way in which accessions to repositories information was published. Even more selective lists or digests—of those accessions relevant to researchers of particular subjects—were also commonly supplied for onward dissemination, often through publication in the bulletins or journals of research societies and communities. It is reported that digests of accessions for 1992 were supplied to around 30 such organisations.

The year 1992 also marked the discontinuation of the hard copy publication of the annual listings (Sargent, 1995). The Historical Manuscripts Commission—whose staff managed the annual accessions exercise—had eagerly embraced the coming of the computer and was quick to create its own presence at the birth of the Internet. Online publication of the listings of accessions (both the ‘full’ listings and the subject-specific digests) became the norm from then on, and examples of this are preserved in Web Archives (The Royal Commission on Historical Manuscripts, 1995). Copies of the subject-specific digests did continue to be submitted to (and subsequently published in hard copy by) academic journals after that date, but eventually this practice also came to an end.

The information gathered in the regular accessions to repositories exercise never fed solely into the editorial process that led to the annual lists and digests of the same; rather it had long also fed into the wider work of the National Register of Archives. This body sought to act as a central gateway to and (even before the term gained meaning) database of UK archives. As well as the accessions to repositories returns, it also gathered in (and in some cases created and published) surveys and more detailed finding aids, as well as creating and maintaining master indexes to the same. New accessions engendered new index entries, which were in due course enhanced through connection to more detailed finding aids as the information held about UK archives by the National Register of Archives grew over time.

Gathering, holding, manipulating, and linking all this information in the pre-digital era involved a great deal of paper and human labour in copying information across from returns to index cards and other such technologies. The potential for computerisation started to be explored in the mid-1980s, and a bespoke system based on a Prime minicomputer and programs written in Ampersand Pace came into operation in 1987 (Sargent, 1995). By the mid-1990s, this system had been migrated into the newer Windows/PC environment and contained eight databases: indexes to the people, organisations, families, and estates about which UK archives were held, as well as details of the location of manorial documents, of archive repositories, and of the listings and finding aids received from them (Sargent, 1995). Information from accessions to repositories returns (mostly still received in paper form) was selected to enhance, and manually input directly into, the indexes, but it also had to be selected and manually input again into a separate system/database to create the annual subject-specific and all-repository listings.

The next big system change occurred around 2002/3, at the same time as HMC was brought together with the Public Record Office and Her Majesty’s Stationery Office to create The National Archives. This new system, HMC Admin, was more interconnected, such that the need for double entry of information from the accessions to repositories returns was reduced. An explanation of the process through which information was added into the HMC Admin system from around 2012 runs as follows:

What happens to my return after it is submitted and acknowledged?

Your return is logged and saved into our electronic file management system. The Accessioner for your region (we have divided the UK and Ireland into 14 separate regions) reviews your returns and makes selections for inclusion into the NRA and Accessions. They then input the information into the NRA and tick a box in the NRA entry, which adds it to Accessions. On completion the Accessions Editor proof reads each Accessions entry and assigns it to a thematic digest if applicable. Once all of the entries have been checked and assigned to a relevant digest the complete survey is published online at http://www.nationalarchives.gov.uk/accessions/. (The National Archives, c.2012).

It is undeniable that there are a lot of choices and assumptions underlying the accessions to repositories data. The process into which it fed has prioritised its digestion over its analysis. As we saw above, there was a constant drive to surface only the most significant and interesting accessions—to facilitate use by providing the most pertinent selection from the ‘welter of indiscriminate and undigested detail’. This selection was made not only by those centrally processing the returns, but also by those submitting them, as the following extract from the editorial note to one of the early accessions listings makes clear:

Many repositories have assisted greatly by submitting their lists in a concise form suitable for publication with little alteration; it would be appreciated if others could do the same, as the selection of the most significant facts or items is far more easily done by the actual custodians of documents than by an editor at a distance (Historical Manuscripts Commission, 1956, p.1).

These multiple acts of selection are not well-documented, at least not at anything below the level of general principles as set out in the editorial notes and quoted earlier.

This absence is particularly felt in the case of data relating to accessions to repositories, because accessioning is itself a form of selection: the decision whether or not to accession, and hence preserve, certain records. It is here that the work of archivists can have its most powerful impact. In this decision lies the potential for continuing to perpetuate structural inequalities whereby the margins are rendered all but invisible to history. This potential has long been recognised. Writing in the 1970s, for example, Felix Hull observed that:

Quite bluntly so long as we are not involved in selection we can happily be all things to all archives, but once we assume the sword and scales of justice—what then? […] For some archivists, I am sure, this dilemma raises problems of action and of morality which they feel ill equipped to handle (Hull, 1979).

The archival profession is still wrestling today with those problems of action and of morality that Hull described, and it is for this reason that the decision was made to focus on the accessions to repositories data in this project. In theory, the data held the potential to reflect back an evidenced picture of what had been collected in the past, a picture that could at least provide some more robust information for those continuing to wrestle with those problems. In practice, however, its selective nature—the fact that it was a selection—raised concerns about the extent to which it could be relied on as such a picture. That selection may have served the purposes of the original use—to produce a digest or summary of recent accessions of significance for the benefit of researchers—but it did not serve the purposes of the data’s re-use—to undertake analysis of patterns of collecting in UK archival repositories over time.

4 Reflecting on Paradata

Unlike the term metadata, the term paradata does not yet have currency within the archives field, and neither of the researchers involved in the above project was previously experienced in thinking in its terms. When prompted to do so, by an invitation to contribute to this volume, they undertook an initial investigation into its conceptualisation elsewhere, identifying (as others in this volume and elsewhere have already done) a number of origin stories and extant definitions.

The origin story and conceptualisation that resonated with them the most traced back to the coining of the term ‘paradata’ by Drew Baker during the course of the Making Space project carried out at King’s Visualisation Lab to investigate ‘a methodology for tracking and documenting the cognitive process in 3-dimensional visualisation-based research’ (King’s Visualisation Lab, c.2005). In this case, the concerns from which the need to conceptualise paradata arose were open questions around the credibility and validity of 3-D visualisations within the archaeological and wider arts and humanities research communities. Such concerns led not only to the coining of the term paradata, but also to the London Charter for the Computer Based Visualisation of Cultural Heritage, which ‘defines principles for the use of computer-based visualisation methods in relation to intellectual integrity, reliability, documentation, sustainability and access’ (2009a). It too provides a definition for paradata, the glossary stating that it is:

Information about human processes of understanding and interpretation of data objects. Examples of paradata include descriptions stored within a structured dataset of how evidence was used to interpret an artefact, or a comment on methodological premises within a research publication (London Charter, 2009b).

It is also via this path that a connection has been initiated with the archival field. This connection takes the form of the work of Heidi Jacobs of the University of Windsor, who references the London Charter and interprets paradata as ‘a way to reveal the “fingerprints” of those who created the heritage object and the choices and assumptions that led to its creation’ (2020).

This conceptualisation of paradata resonated with the work being conducted on the accessions to repositories data because of its focus on ideas of fingerprints, choices, and assumptions. As has been discussed above, not only were many choices currently being made in the reassembly and standardisation of these data, but researching their history also surfaced many more choices and assumptions: those of the archivists who decided not only what material they would or would not accession in any given year, but also whether to complete the survey at all, and to what level of detail. Further choices were made by the ‘editors’ as they decided what they felt to be important enough for inclusion in the published digests. Practicalities such as the technology and time available to them played a role in such decisions, but clearly assumptions were also being made, assumptions which were not always consciously acknowledged at the time.

In many ways it was these assumptions, particularly those about what was or was not considered important, which the Fellowship had set out to uncover, looking through and working backwards from the data on what had been collected to try to draw inferences, or at the very least prompt reflection, about what had been considered important—the assumptions on which archivists of the past appeared to have been working in their selection of material for inclusion in their collections. Perhaps it was because of this—our purpose for the data in question—that the concept of paradata became so resonant? Perhaps paradata only arises as a concern when you seek to look through the data to infer something else from it, when you seek to understand whether the drawing of such an inference is justified, evidenced, or even possible? Given, then, that this concern had arisen in relation to the project, how did we seek to address it? What did we find, use, or produce ourselves that could, as a consequence of meeting this concern, be considered to have acted as paradata?

Meeting the concern with paradata required information; it required the gathering and taking into account of information about (a) the processes and choices by which the data had originally been collected, (b) the way in which our own processing of it was reshaping it and leading to some data loss, and also (c) the extent to which the dataset could be seen to be complete and/or representative of what we were using it to stand in for—the pattern of collecting undertaken by archive repositories in the UK over the period 2007–2020. This information was sourced in many ways: from previously published articles and book chapters, from internal TNA reports and documents, and from the knowledge of those who had been involved in the work of originally collecting and processing the data, which was sometimes gathered through conversation with them and sometimes through the editorial notes they had left behind. It was also gathered by consideration of the gaps between what was represented in the dataset (e.g. in terms of the number of repositories who had made returns in any given year) and what was not (e.g. in terms of how that number differed from what was known of the total number of repositories in operation). It was on the basis of this sort of information that we worked with the dataset, and in our placing of it alongside that dataset—outside it and yet fundamental to its interpretation and use—it can, from our perspective, all be considered paradata.
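The completeness check described in (c) amounts to a simple per-year comparison; the figures in the test of the sketch below are purely illustrative, not the survey's actual response counts:

```python
def coverage_by_year(returns_by_year: dict[int, int], known_repositories: int) -> dict[int, float]:
    """Proportion of the known repository population represented in
    the survey each year: one rough indicator of how complete and
    representative the dataset is."""
    return {year: count / known_repositories for year, count in returns_by_year.items()}
```

Low-coverage years flagged this way are exactly the sort of gap information that, placed alongside the dataset, acts as paradata for its future users.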

In many ways, then, this view of paradata paints it as something similar perhaps to another concept, that of a knowledge base as defined by the Reference Model for an Open Archival Information System (OAIS) as ‘a set of information, incorporated by a person or system, that allows that person or system to understand received information’ (International Standards Organisation, 2012). As part of this project, we put much effort into assembling our knowledge base, the information that allowed us to use and interpret the data with which we had been presented. To be sure, we incorporated this knowledge, but we also took pains, in our final reporting of the project, to dis-incorporate or rather to disembody it: to set it out alongside the data in sufficient detail that anyone coming across those data in the future would at the very least be able to work with them on the same base, or basis, that we had.

5 Conclusion

This chapter has focused on a project which aimed to re-use data (dating from 2007 to 2020) collected as part of the annual accessions to repositories survey in order to create a picture of collecting patterns across UK archives. The original process of collecting the data was outlined, as was the state of the data when it was first encountered. The work involved in preparing this data for its new use was described, highlighting the importance of standardisation, and the history reconstructed around the data was also set out. Leading on from this, reflections were offered on how paradata came to be conceptualised during the course of the project. Looking into earlier conceptualisations, the idea of paradata was seen to resonate with the project in its concern with looking through that which was being considered as ‘the dataset’ to something beyond—the drawing of inference or conclusion from it. In order to achieve this goal, it was necessary within the project to seek out a range of additional information beyond the original dataset. This information, as well as knowledge of the process being followed to reshape the dataset, came to act as paradata in that it addressed a concern with the basis, or base, on which the data could be used and interpreted. In the final reporting of the project, then, a selection was made that prioritised the information it was felt necessary to pass on in order that the dataset could be used on the same basis in the future. This reflection and conceptualisation has been offered to encourage readers to shape their own reflections on paradata as it applies to their own work and context. It is also hoped that readers will be encouraged to consider what information they should pass on alongside any dataset they define or set, in order that others can either understand, interpret, and use it on the same basis, or be aware of how they are not doing so.