1 Introduction

Literary theorist Gérard Genette famously described paratext (e.g., the table of contents, chapter headings, index, and other framing text in a book) as “a threshold, or… a vestibule” and a “means by which a text makes a book of itself and proposes itself as such to its readers” (Genette, 1991, p. 261). Paradata, however, is typically described in less poetic terms: it is a form of metadata that describes the ways in which a dataset was collected, processed, or manipulated (Pomerantz, 2015). This nuts-and-bolts definition elides paradata’s role in similarly presenting data and digital objects for use and interpretation. Paradata, like paratext, is a threshold through which users encounter, work with, and manage complex digital objects. It presents and preserves aspects of a dataset’s context of production and fundamentally shapes how an object is viewed. And when paradata is missing or incomplete, datasets and digital objects become harder—if not impossible—to “enter” or otherwise engage with.

In this chapter, we present a case study that illustrates the challenges of working with a particularly complex, long-lived digital object—a museum collection database—without complete documentation of its development or change over time. Museum collection databases are used to store information about the artifacts, specimens, and other objects in a museum’s collection. For many museums, their collections date back decades if not centuries, and their associated databases are decades old as well. The stewardship of these databases is “passed down” from one collection manager to the next, but, for a variety of reasons, sometimes without significant documentation. This leaves the “new” collection manager with the unenviable task of reverse engineering a database’s structure, contents, and data entry workflows for the next generation (Thomer et al., 2018; Thomer & Rayburn, 2023). Even once a database has been reverse engineered, it remains difficult to use without knowledge of the reasons underlying its design decisions.

Our case study focuses on the evolution of two collection databases with a common origin: the Matthaei Botanical Gardens and Nichols Arboretum (MBGNA) database and the University of Michigan (U-M) Herbarium collection database. Both databases grew out of a system called TAXIR in the 1960s; both have been repeatedly migrated between different software systems over the years; and both lack significant documentation explaining their origin, structure, or evolution (leading to much frustration for both databases’ current collection managers). As part of a larger project studying memory institution database maintenance over time, we interviewed database stewards at each museum and “read” each site’s databases (Feinberg, 2017) to reconstruct their histories of change. We also tried different approaches to illustrating this change over time: first, versioned entity relationship diagrams (an illustration of a database’s underlying data model), and second, Sankey diagrams (an illustration of the “flow” of records between different database versions). In retrospectively creating this documentation, we ask: how and to what degree can we record the sometimes-subtle changes that occur in a database over long periods of time?

Our chapter proceeds as follows: we first contextualize our project in prior critical scholarship on databases, their paradata, and their use in memory institutions. Then we present our case study and the diagramming techniques we used to reconstruct database histories. We conclude by reflecting on the strengths and weaknesses of these approaches. We find that when paired together, entity relationship diagrams, Sankey diagrams, and narrative histories can provide new users an entry point into these complex data systems.

2 Background

Most fundamentally, a database is a structured collection of data organized for fast search and retrieval by a computer (Manovich, 2002). More abstractly, databases are tools for encoding the world (Dourish, 2017); that is, they translate non-computational entities into a machine-readable form. Database technology has drastically expanded possibilities for organizing complex or difficult-to-collect data and has played a critical role in allowing research data to be shared between collaborators, or in building networks for data sharing (Bruns et al., 1998; Cullings & Vogler, 1998; Mineta & Gojobori, 2016; Robertson et al., 2014; Stein & Wieczorek, 2004; Vieglais et al., 2000; Williams et al., 2018). There are many different technical standards for databases; here we focus on relational database management systems (RDBMS), in which data is stored in a series of linked tables, connected through shared key fields. The structure of these tables (the data model) is customized based on the nature of the data or its relevant metadata.
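To make the relational pattern concrete, the minimal sketch below builds a toy two-table database using Python’s built-in sqlite3 module. The table and field names are our own hypothetical illustrations, not drawn from any of the systems discussed in this chapter.

```python
# A minimal sketch of the relational model described above, using Python's
# built-in sqlite3 module. Table and field names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxon (
        taxon_id INTEGER PRIMARY KEY,
        genus    TEXT,
        species  TEXT
    );
    CREATE TABLE specimen (
        specimen_id INTEGER PRIMARY KEY,
        -- the shared key field linking this table to taxon
        taxon_id    INTEGER REFERENCES taxon(taxon_id),
        locality    TEXT
    );
""")
conn.execute("INSERT INTO taxon VALUES (1, 'Rosa', 'blanda')")
conn.execute("INSERT INTO specimen VALUES (100, 1, 'Ann Arbor, MI')")

# The tables are "linked": retrieving a specimen together with its
# taxonomy means joining on the shared key.
print(conn.execute("""
    SELECT s.specimen_id, t.genus, t.species, s.locality
    FROM specimen s JOIN taxon t ON s.taxon_id = t.taxon_id
""").fetchone())  # -> (100, 'Rosa', 'blanda', 'Ann Arbor, MI')
```

The data model here is the two CREATE TABLE statements and the key linking them; it is exactly this layer that the entity relationship diagrams discussed later in the chapter document.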

Databases, like most digital artifacts, are quite fragile and require constant maintenance to persist. They are sociotechnical objects, built on ever-changing physical and digital infrastructures and fundamentally shaped by the people and organizations that create them (Bowker & Star, 2000; Dourish, 2017; Hine, 2006). Different aspects of a database change at different rates, creating significant challenges for the maintainers of these systems. For example, in memory institutions—the museums, archives, and libraries that “contain memory of peoples, communities, institutions and individuals” (Dempsey, 2000)—the database’s structure and hardware might not change very often, but the database’s users, and those users’ needs, change fairly frequently (Fig. 1). This increases the chances that a system will break or require alteration or repair over time. As different parts of a database become obsolete, the database will need to be migrated from one piece of hardware or software to another. As we explore later, this can quickly become a complex process, as migrations are often catalysts for other structural changes to the data, such as a schema change.

Fig. 1
A time-based model for sociotechnical change in museum collections. It includes students and volunteers, data entry workflows, administrative staff, software, collection managers, hardware, data models, and data collections.

A model for sociotechnical change in museum collections. Note the varied rates at which change occurs, with each factor contributing to data migrations and their accompanying paradata. Reprinted from Thomer & Rayburn, 2023, with permission

Critical to database migration is an understanding of the system’s accompanying paradata. Though the term paradata was originally used by statisticians to describe process data needed to understand the quality of statistical data such as surveys (Karr, 2010; Kreuter et al., 2010), the term has since come to be applied beyond survey data. However, as the term has gained adoption, it has also become more challenging to define. In many domains, paradata is often left undocumented or is only implicitly present within the data itself (rather than explicitly recorded) (Börjesson, 2022; Huvila et al., 2021). Paradata can take many different forms depending on the nature of the data it contextualizes. Identifying the form that paradata will take within a certain context is an endeavor in its own right. Indeed, much literature examines the forms of paradata as a research question. For example, Börjesson et al. (2022) discuss the different forms of paradata that appear within archeological field data, while West and Sinibaldi (2013) examine types of paradata within social science datasets.

In this case study, the paradata we are speaking of is any provenance or processing data related to database creation, upkeep, and migration of natural history museum data. Natural history collections often contain millions of specimens that date back hundreds, even millions, of years, with each specimen requiring documentation. These data-intensive collections were early adopters of database technology and early contributors to scholarship in data curation (Palmer et al., 2013; Strasser, 2012; Thomer et al., 2018). Because of these long data histories, the paradata associated with these collections is lengthy and complex. Further, the users, creators, and maintainers of memory institution databases have varying work practices, often very different from best practices defined in computer science (Thomer & Wickett, 2020). These idiosyncratic practices are where much paradata is created. Though there are systems designed to track the provenance of and changes to databases over time (Brodie & Stonebraker, 2015; Buneman et al., 2006, 2009), most of the databases in natural history museums (NHM) do not have this capability. Thus, in our cases, paradata must be inferred from data entry guides, documentation or communication related to database migrations, technical documentation guiding the structure of the system, and changes to the metadata schemas that guide specimen documentation.

3 Database Evolution at the University of Michigan Matthaei Botanical Gardens and Nichols Arboretum and the University of Michigan Herbarium

As part of a larger project studying database maintenance in memory institutions, we developed case studies of database migration at multiple libraries, archives, and museums. Each case study was developed through semi-structured interviews (45–75 min each) with curatorial and collection staff at each site; close analysis and comparison of different versions of legacy databases; and review of papers, memos, emails, and other documentation related to database migration. We triangulated this evidence to develop explanations of how and why migrations become necessary and to identify patterns motivating them, following a multi-case study design (Yin, 2017).

Here we present two intertwined cases from this project: database evolution at the U-M Herbarium and the MBGNA. In both cases, we had access to different versions of the legacy collection databases and were therefore better able to explore different ways of reconstructing paradata. In the process of creating narrative case reports, we also created entity relationship diagrams and Sankey diagrams to better illustrate change over time in these systems. We present our case study and diagrams below.

3.1 Common Origins

The U-M Herbarium is a substantial collection of plants and fungi that began in 1837 (University of Michigan Herbarium, 2023). Today, it holds almost 1,750,000 specimens, and its databasing efforts span 50 years. Like the Herbarium, the MBGNA is essentially a collection of plants—but unlike the Herbarium, it is a “living collection” distributed across four properties and over 700 acres of land in and around the University of Michigan. The MBGNA’s collections and catalogs date back to 1910, and its digital collections databases date back to the 1980s. The MBGNA’s digital data collections consist of tens of thousands of items and records in several different database systems. Data files include specimen records describing the type, locality, and provenance of each plant in the gardens and arboretum, as well as images, associated genetic data, and other data files.

Both the Herbarium and the MBGNA databases share a common origin: a holotype specimen database called TAXIR (Estabrook, 1979; Estabrook & Brill, 1969; Estabrook & Rogers, 1966). A holotype is the single specimen used as a reference when a species is first described (Britannica, 2023). Holotypes might be thought of as the canonical version of a species, and they are an incredibly important point of reference in biodiversity research. There are many other kinds of “type” specimens used in biodiversity research; for instance, paratypes are additional specimens that an author referenced when describing a species, and neotypes are specimens chosen to replace a holotype that has been lost or destroyed. Collectively, these specimens are referred to as a museum’s type collection. Type collections typically represent a small percentage of an entire NHM collection (indeed, some NHMs may not have any). However, they are some of the most frequently used specimens and must be extremely well documented to serve as a point of comparison for researchers.

TAXIR was a type database created not for the management of these collections but as a research tool to assist in taxonomic classification; it compared holotype records and mathematically calculated whether two specimens were of the same type by analyzing their similarities (Estabrook & Rogers, 1966). Though TAXIR was an important early NHM database, it simply did not have the capability to serve as a collection management database: a database in which every object in the collection has its own record, often resembling an analog card catalog. Both the Herbarium and the MBGNA eventually wanted collection databases, and at this point their infrastructural paths diverged as each developed its own system.

3.2 MBGNA Database Development

After TAXIR, the MBGNA transitioned in the 1980s to the collection management database BG-BASE, a system designed especially for botanic garden and arboretum collections. This transition was challenging for multiple reasons. First, it was a radical change in data model, as the information documented in a collection database differs vastly from that in a type database. Second, the migration involved both a software and a hardware shift: TAXIR ran on mainframe computers accessed through terminals, while BG-BASE ran on personal computers.

Because this transition happened so long ago, our interview participants knew relatively little about BG-BASE. However, the MBGNA was still dealing with the ramifications of this software, because staff had never been able to migrate the data out of the system. In our interviews, a curator explained that the data in BG-BASE was originally entered by a volunteer who did not document their data entry process:

We have a number of databases that people created to, basically to suit their own needs. And so, unfortunately as those staff members have left, we haven’t always known exactly how or why those files were created. And so that has been a real challenge to decipher these past records and know what they related to. And sometimes they were created and then updated, and dates weren’t kept and so it’s really hard to know exactly how to use them (Participant MBGNA-01).

When staff went to migrate the data from BG-BASE to a new system (Microsoft Access) in the early 2000s, the volunteer had moved on, and current staff did not understand the logic behind the data structure. They ultimately had to abandon the BG-BASE database and start fresh. As another participant stated,

We didn’t wanna lose a whole lot of legacy information, but we decided at a certain point it was so erratic that it wasn’t worth carrying over…it made no sense when you actually just look at the page (Participant MBGNA-02).

The new Access database was built around the International Transfer Format, a metadata standard for botanical garden data developed so that records could be shared easily between institutions (Botanic Gardens Conservation Secretariat, 1987; Botanical Gardens Conservation International, 2004). However, the standard itself is quite complex; the Access database included many small tables with specialized information that linked to the main table of the database, aptly named tblplant. While this model followed an established standard, it increased the complexity of the database, ultimately making the system harder to use for those without database expertise.

At the time of our interviews with MBGNA staff in 2018, the Access database was being migrated to a new ArcGIS GeoDatabase, a system whose data model prioritizes spatial or geographical information first and foremost. Again, the migration to ArcGIS was unexpectedly challenging because there was little detailed documentation defining fields and relationships in the MS Access database, a problem similar to the one experienced with the BG-BASE database two decades prior. The database specialist for the botanical gardens described the challenge of navigating the table relationships between the two systems:

When we link Access into [Arc]GIS, the way the data shows up is not the actual characters that were entered into the field originally. It’s numbers. Because they’re tied to the super relationship table…the primary keys show up, that link into GIS (Participant MBGNA-04).

As a result, this migration took over two years to complete.
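The participant’s point can be made concrete with a small sketch. Below, a hypothetical miniature of the Access data model is built in Python’s sqlite3; only the table name tblplant comes from the actual MBGNA database, and the lookup tables and fields are invented for illustration. Reading the main table alone yields only numeric keys, just as the participant describes; recovering the original entries requires a join per lookup table.

```python
# A hypothetical miniature of the many-small-tables pattern described
# above. Only the name "tblplant" appears in the real MBGNA database;
# the lookup tables and fields here are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small lookup tables, one per controlled vocabulary.
    CREATE TABLE tblhabit (habit_id INTEGER PRIMARY KEY, habit TEXT);
    CREATE TABLE tblsite  (site_id  INTEGER PRIMARY KEY, site_name TEXT);

    -- The main table stores only numeric keys into those tables.
    CREATE TABLE tblplant (
        plant_id INTEGER PRIMARY KEY,
        name     TEXT,
        habit_id INTEGER REFERENCES tblhabit(habit_id),
        site_id  INTEGER REFERENCES tblsite(site_id)
    );
    INSERT INTO tblhabit VALUES (1, 'Shrub');
    INSERT INTO tblsite  VALUES (7, 'Nichols Arboretum');
    INSERT INTO tblplant VALUES (42, 'Rosa blanda', 1, 7);
""")

# What a naive export of the main table shows: bare foreign-key numbers.
print(conn.execute("SELECT * FROM tblplant").fetchone())
# -> (42, 'Rosa blanda', 1, 7)

# Recovering the original entries requires one join per lookup table.
print(conn.execute("""
    SELECT p.plant_id, p.name, h.habit, s.site_name
    FROM tblplant p
    JOIN tblhabit h ON p.habit_id = h.habit_id
    JOIN tblsite  s ON p.site_id  = s.site_id
""").fetchone())
# -> (42, 'Rosa blanda', 'Shrub', 'Nichols Arboretum')
```

Without documentation stating which lookup table each numeric field points to, every column of numbers in an export is a small reverse-engineering puzzle, which helps explain why the migration took so long.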

3.2.1 Capturing Paradata Through Entity Relationship Diagrams

For the MBGNA case, we had access to versions of the Microsoft Access and ArcGIS databases. To better understand how and what changed over time and to explore different modes of capturing paradata, we developed enhanced entity relationship (EER) diagrams to document the structure of each database version. Creating EER diagrams is a standard practice in database development; they use a highly structured visual language to represent the classes of information in a database, the relationships between those classes, and sometimes the specific data types of different attributes (Chen, 1976). EERs are commonly created to support a number of different tasks, including database design, debugging, patching, and documentation.

Fig. 2
An EER diagram. It includes tables for common names, genus, family, habit, leaf duration, plant, object, conservation global, conservation MI, conservation federal, subzone, site, object type, images, plaque condition, tribute condition, object condition, zone, handicap accessible, and tribute.

EER diagram of 2008 MBGNA Access database. To generate the above diagram, we migrated the original Microsoft Access database (.mdb file extension) to MySQL using MySQL Workbench

Fig. 3
An EER diagram. It includes components under common names, synonyms, names, accessions, property layer, management subzone layer, management zone layer, sites layer, and natural communities layer.

EER diagram of the MBGNA’s current ArcGIS database. We created the diagram by using a Python script to convert the schema details available in JSON into a SQL script, which was then processed using MySQL and MySQL Workbench
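For readers who want to reproduce this pipeline, the sketch below shows the general shape of such a conversion script. The JSON layout it assumes (a top-level list of layers, each with a name and a fields list using Esri type codes) is illustrative; real ArcGIS schema exports vary by version and export route, so treat this as a template rather than the exact script we used.

```python
# A hedged reconstruction of the kind of script described in the caption:
# read an ArcGIS-style JSON schema export, emit CREATE TABLE statements,
# then reverse-engineer the resulting MySQL database into an EER diagram
# with MySQL Workbench. The assumed JSON layout is illustrative.
import json

# Rough mapping from Esri field type codes to MySQL column types.
TYPE_MAP = {
    "esriFieldTypeOID": "INT PRIMARY KEY",
    "esriFieldTypeInteger": "INT",
    "esriFieldTypeDouble": "DOUBLE",
    "esriFieldTypeString": "VARCHAR(255)",
    "esriFieldTypeDate": "DATETIME",
}

def layer_to_ddl(layer: dict) -> str:
    """Render one layer's field list as a MySQL CREATE TABLE statement."""
    cols = ",\n  ".join(
        f"`{f['name']}` {TYPE_MAP.get(f['type'], 'TEXT')}"
        for f in layer["fields"]
    )
    return f"CREATE TABLE `{layer['name']}` (\n  {cols}\n);"

if __name__ == "__main__":
    with open("schema.json") as fh:  # hypothetical schema export file
        schema = json.load(fh)
    print("\n\n".join(layer_to_ddl(layer) for layer in schema["layers"]))
```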

We created two EERs to show changes to the structure of the MBGNA’s database in the transition from Access to ArcGIS. Figure 2 shows the MBGNA database as it existed in 2008 as an Access database with 20 related tables. Figure 3 shows the database as it was revised in 2016 into an ArcGIS system, condensed into nine tables. Note that the purpose of these figures is not to convey the data structure in detail but rather to highlight the differences between the two database versions, adding to the paradata on what changed over time. There is no EER for the TAXIR or BG-BASE databases because that data was never maintained by staff: a clear example of lost paradata.

A close comparison of these EER diagrams gives a sense of a schema in transition, catalyzed in part by new standards and technologies. The 2008 Access database exhibits design choices tailored to data entry, such as Boolean fields that would have appeared as a checklist to the students and interns entering data, and it shows signs of incremental change over time (e.g., several legacy tables used for short-term projects were left in place after those projects ended, adding unnecessary clutter to the schema). The most recent database, on the other hand, was built in ArcGIS and represents a large-scale refactoring of the MBGNA’s system. It features fewer tables than its predecessors, largely because ArcGIS does not treat controlled vocabularies (“domains” in its parlance) as separate tables, as the prior databases did. Fewer tables mean simpler queries and quicker data retrieval.

3.3 The U-M Herbarium Database Development

The Herbarium’s database evolution differs considerably from the MBGNA’s. After TAXIR, its databasing efforts centered on migrating to another type specimen database, built in 1988 in an RDBMS called dBase. This transition was driven largely by worries that TAXIR’s mainframe technology would become obsolete and by a desire to access data on personal computers. The Herbarium did not adopt a collection management database until the early 2000s, after individual researchers began developing ad hoc systems for their own use. One participant described these systems:

We had specimen label data…We had some algae data that we came up with from somewhere—we merged that in. I think at one time I had a compilation of 15 or 16 or more databases, that somebody had put a little bit of fungus data here, some algae data here. Stuff that we sort of combined it all together (UMNHM_014).

While decisions were being made regarding which unified RDBMS the Herbarium would choose, staff stored records in what they called “The Container,” which, in the words of UMNHM_013, “ran through SQL with an Access front end, because it was such a monster. It was an enormous flat file intended as a conversion vehicle.” Though “The Container” was meant to be a temporary storage solution, the various Herbarium databases wound up spending over a decade in this limbo.

In 2016, the Herbarium migrated to Specify for its in-house collection management database and to Symbiota for collaborative research and data aggregation efforts. While the records in these systems are related, one challenge that surfaced in our interviews is that Symbiota and Specify have historically had trouble communicating with one another, largely because of Symbiota’s focus on georeferenced data. One participant described this challenge:

Symbiota essentially would export out a Darwin Core archive as a big flat table with one exception, and Specify essentially divides up the data into multiple, multiple tables. Getting that all in the right place was very difficult. One of the things that has happened, was as the data is getting out in Symbiota portals, some of the project managers would georeference the data. And you would have georeferenced data out here, [and then realize,] ‘Oh I need to get it back in here. Oh, now I have a problem’ (UMNHM_014).

In sum, Specify records can be imported into Symbiota, but not vice versa. The provenance challenge here is a lack of documentation of the ways in which these two systems (built for different intended audiences) fail to interoperate.
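The asymmetry the participant describes can be sketched in a few lines of Python. Flattening normalized, Specify-style tables into a flat Darwin Core row is a mechanical join; splitting an externally georeferenced flat row back into the right normalized tables is not, because the mapping is not one-to-one. The field names below are real Darwin Core terms, but the table layout and values are invented for illustration.

```python
# A sketch of the export/re-import asymmetry described above, using
# hypothetical data. Field names are real Darwin Core terms; the
# normalized table layout is invented, not Specify's actual schema.

# Specify-style normalized storage (simplified to dicts keyed by id).
localities = {10: {"locality": "Ann Arbor, MI",
                   "decimalLatitude": None, "decimalLongitude": None}}
taxa = {3: {"scientificName": "Rosa blanda"}}
collection_objects = [{"catalogNumber": "MICH-000100",
                       "locality_id": 10, "taxon_id": 3}]

# Export: joining everything into one flat Darwin Core record is easy.
flat = []
for obj in collection_objects:
    row = {"catalogNumber": obj["catalogNumber"]}
    row.update(taxa[obj["taxon_id"]])
    row.update(localities[obj["locality_id"]])
    flat.append(row)

# Downstream, a portal manager georeferences the flat record...
flat[0]["decimalLatitude"], flat[0]["decimalLongitude"] = 42.28, -83.74

# Re-import: which locality row should receive the coordinates? If
# several specimens share locality 10, overwriting it changes them all;
# creating a new row duplicates data. This ambiguity is what made
# round-tripping the georeferenced data so difficult.
```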

Though the Herbarium database was successfully migrated, the “traces” left behind by legacy databases continue to impact current migration efforts—for instance, fields with unclear definitions, or fields split across multiple tables for unclear reasons. Further, these databases had multiple purposes: defining type specimens, serving as a collection database, supporting internal research projects, and aggregating collection data with that of other herbaria. This creates the challenge of tracing not only changes to the databases but also the systems’ intended uses. The Herbarium is fortunate to have a few staff members whose tenure stretches back to the beginning of its databasing efforts and who can help track these changes over time, but their knowledge has not been clearly documented.

3.3.1 Capturing Paradata Through Sankey Diagrams

Because we did not have access to many of the legacy database files for the Herbarium, we turned to alternative methods of visualizing database migrations over time. Sankey diagrams are a type of “flow” diagram commonly used to visualize inputs and outputs in a system, e.g., the flow of nutrients in an ecosystem or the energy transferred between components of an engine; in an engineering context, they are used to analyze the efficiency of a given system. Here, we use them to show the flow of records between data systems over time. The basic components of a Sankey diagram are a beginning node, an end node, and a quantitative value linking the two (Schmidt, 2008). For our cases, the beginning and end points for each “flow” are different data systems, and the quantitative value linking the two might be the number of records stored in each system (Fig. 4).
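As a sketch of how such a diagram can be generated programmatically, the snippet below rebuilds an approximation of Fig. 4 with the plotly library; SankeyMATIC, which we used, takes a similar source–amount–target listing as its input. The link topology follows our narrative reconstruction and is approximate, and all values are set to 1 because exact record counts were not recoverable.

```python
# An approximate, programmatic rebuild of the record-flow Sankey in
# Fig. 4, using plotly (pip install plotly). The topology follows our
# narrative reconstruction; all flows get equal weight because exact
# record counts for each migration are unknown.
import plotly.graph_objects as go

labels = ["Herbarium card catalog", "MBGNA card catalog", "TAXIR",
          "dBase", "BG-BASE", "The Container", "Microsoft Access",
          "Specify", "Symbiota", "ArcGIS"]

# Each link is (source index, target index) into the labels list.
links = [(0, 2), (1, 2),          # both card catalogs feed TAXIR
         (2, 3), (2, 4),          # TAXIR -> dBase (Herbarium), BG-BASE (MBGNA)
         (3, 5), (5, 7), (5, 8),  # dBase -> The Container -> Specify, Symbiota
         (4, 6), (6, 9)]          # BG-BASE -> Access -> ArcGIS

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20),
    link=dict(source=[s for s, _ in links],
              target=[t for _, t in links],
              value=[1] * len(links)),  # equal weights, as in Fig. 4
))
fig.show()
```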

Fig. 4
A Sankey diagram. The Herbarium card catalog and the MBGNA card catalog both flow into TAXIR, which leads to dBase and BG-BASE, respectively. On the Herbarium side, records then flow through The Container into Specify and Symbiota; on the MBGNA side, into Microsoft Access and then ArcGIS.

Sankey diagram showing the flow of records between databases at the U-M Herbarium and the MBGNA. Diagram created using SankeyMATIC

By creating a Sankey diagram, we can succinctly visualize the relationships between older database systems at both the Herbarium and the MBGNA, such as their shared use of TAXIR. This diagram also shows the origin of the digital databases in legacy card catalogs (something not possible with an EER diagram). Our Sankey is somewhat incomplete, however, because we do not know the exact numbers of records that flowed through each system. Because of this, we have represented all migrations as being of equal size (and therefore drawn the flows linking systems at equal widths), which does not adequately represent changes in data scale over time. This is an unconventional form of paradata; it might be better thought of as a type of analysis made possible by paradata rather than as paradata in and of itself. But it does succeed in presenting the provenance of each data system at a high level—something that a new database manager would need when taking over stewardship of a system.

3.4 Reconstructing Paradata: How and to What Degree Can We Record Change in Complex, Sociotechnical Systems?

Through our interviews, we were able to (mostly) reconstruct the history of these intertwined databases. In doing so, we also created three ways of looking at paradata: a series of EER diagrams that show change when compared; a Sankey diagram that shows the flow of records over time; and the case narratives themselves, which provide a high-level description of the histories of database use. Each of our approaches to capturing paradata records different degrees and scales of change to a system.

EER diagrams are best at capturing detailed changes to the data model and, less directly, the data entry workflow and the fields included in an individual record. They are a common tool in database design and management and provide the clear, detailed views of a system necessary to plan a migration or understand a current system. However, they can be challenging to create, particularly for those with less database experience, and they cannot be created at all for legacy systems that are no longer accessible. These diagrams also require specialized knowledge to interpret: while they may be very useful to individuals with backgrounds in computer science, museum administrators or “domain” users may struggle to read them. Finally, these diagrams cannot capture why schema design choices were made in the first place.

Sankey diagrams are excellent for providing a broader view of how records have flowed between data systems over time and are even useful for systems that are no longer accessible or that are physical rather than digital. However, if there is no quantitative information on the number of records or the scale of the data in each system, they run the risk of overly abstracting the complexity of data systems. A Sankey alone also omits much of the behind-the-scenes work of managing a database, including the context (the why and how) of migrations, issues within the databases, and issues with the database schemas.

Finally, we want to note an unexpected form of paradata that emerged from this work: the narrative case reports we developed through interviews and presented in part in this chapter. While these narratives do not necessarily capture the nuanced changes to a data model shown in an EER diagram, or the high-level flow of information between systems in a Sankey, they preserve far more of the social and organizational context of a database, which simply cannot be captured through computational methods alone. Databases are sociotechnical objects; as we reviewed above, they are fundamentally shaped by their users and their cultural contexts. Research methods designed to understand people and culture (e.g., qualitative interviews) are needed to surface the human (vs. computational) drivers of change.

We posit that database stewards themselves could create similar narratives, akin to a README. These documents could be a low-barrier way of documenting the changing context and development of a system over time. The strength of a written narrative is its relative ease of creation and comprehension: all database managers have the skills needed to create and read narrative histories of their information systems. Our work here echoes prior work by Bates et al. (2016), Feinberg (2017), Mosconi et al. (2022), and Witt et al. (2009), who similarly use qualitative methods to create sociotechnical narratives of digital systems or objects. By documenting provenance through qualitative narratives, as well as collecting computationally generated paradata, database stewards could create something more like an extended paratext, akin to the foreword or introduction of a book. All of our participants expressed a desire for documentation that served as a better “threshold,” in Genette’s (1991) framing, to their inherited systems. Better capture of paradata to show changes in data models and record flows is one part of this, but developing qualitative narratives is likely needed as well.
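As a sketch of what such a README-style narrative might contain, consider the template below. All headings and entries are hypothetical, loosely modeled on the Herbarium’s history as we reconstructed it, and would of course be adapted to each institution.

```markdown
# Collection Database History (template sketch; entries hypothetical)

## Current system
- Software and version, where it is hosted, primary steward,
  and the date this file was last updated.

## Lineage (one entry per system)
- 1960s-1988: TAXIR (mainframe type database). No files retained.
- 1988-ca. 2000: dBase type database. Migration driver: mainframe
  obsolescence and a desire for personal-computer access.
- ca. 2000-2016: "The Container" (SQL store, Access front end).
  Intended as a temporary conversion vehicle; held merged ad hoc
  datasets for over a decade.
- 2016-present: Specify (collection management) and Symbiota
  (aggregation). Known issue: georeferenced Symbiota records cannot
  be round-tripped back into Specify.

## Why things are the way they are
- Narrative notes on schema decisions, abandoned fields, and known quirks.

## Where the paradata lives
- Locations of EER diagrams, data entry guides, and migration memos.
```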

4 Conclusion: Creating Effective Thresholds to Complex Digital Objects

In this chapter, we have shown several approaches to retrospectively reconstructing paradata and other contextualizing documentation for a complex, long-lived digital object: the museum collection database. At our study sites, the U-M Herbarium and the MBGNA, collection managers face challenges using and migrating legacy database systems because they simply lack the documentation necessary to understand their predecessors’ work. We demonstrated three ways to reconstruct this documentation: through EER diagrams, Sankey diagrams, and narrative case histories developed through qualitative interviews. From a practical standpoint, each of these approaches has strengths and weaknesses, but together they could be used by database stewards either to document their systems for future generations or to reconstruct their databases’ histories for ongoing management and use. More theoretically, we have discussed how paradata, like paratext, is needed as a threshold to a digital object—a way of opening or presenting media to a new “reader.” We posit that thinking of paradata as paratext may open fruitful avenues for future research and application.