
The Case for a Database Approach

Database Computing for Scholarly Research

Abstract

The management of research data in the humanities and social sciences is a genuine, non-trivial challenge from a computational perspective. In this chapter, a case is made for a database approach that allows for the integration of all project data in a way that fits productively into the research program, and which makes the resulting data maximally useful for analysis, sharing and publication. A research database platform should accommodate highly diverse data that is dispersed over space and time, that is characterized by high variability, that is semi-structured, and that contains uncertainty and disagreements. Integration of spatial, temporal, textual, lexical, and multi-media data should be supported naturally and intuitively. In this chapter, an evaluation of traditional options for managing data leads to a discussion of a hybrid model, as implemented by OCHRE using XML, that is inspired by all the major database paradigms—the hierarchical, the relational, and the graph/network—taking advantage of the best features of each. An appropriately generic upper ontology provides an underlying framework for managing data of all kinds.


Notes

  1.

    https://en.wikipedia.org/wiki/Procrustes.

  2.

    U.K. Research and Innovation Concordat on Open Research Data (https://www.ukri.org/wp-content/uploads/2020/10/UKRI-020920-ConcordatonOpenResearchData.pdf).

  3.

    For our purposes, a text is both the observable signs used to communicate an idea and the interpretation of those signs by the reader.

  4.

    This sentiment was expressed by Jeffrey Heer, a professor of computer science at the University of Washington and a cofounder of Trifacta, a start-up based in San Francisco.

  5.

    It now seems inevitable that large language models will play a role in health care and beyond (Dave et al. 2023).

  6.

    Good tools are available to help with such tasks, like OpenRefine, “a free, open-source, powerful tool for working with messy data” (https://openrefine.org).

  7.

    By “picklist” we mean a user interface mechanism to provide a list (usually drop-down) of only valid values from which the user can choose.
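
    To make the idea concrete, here is a minimal sketch of a picklist using Python's standard tkinter/ttk toolkit; the controlled vocabulary is invented for the example, and this is only a generic illustration, not how OCHRE implements its picklists.

        import tkinter as tk
        from tkinter import ttk

        # Hypothetical controlled vocabulary for a "material" field.
        VALID_MATERIALS = ["ceramic", "stone", "bone", "metal", "glass"]

        root = tk.Tk()
        root.title("Picklist example")

        # state="readonly" prevents free-form typing, so only listed values can be chosen.
        material = ttk.Combobox(root, values=VALID_MATERIALS, state="readonly")
        material.current(0)  # preselect the first valid value
        material.pack(padx=10, pady=10)

        root.mainloop()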

  8.

    See Friedrich Nietzsche, The Will to Power, translated by Walter Kaufmann and R. J. Hollingdale (1968), section 481.

  9.

    See https://omeka.org/. WordPress and Drupal are similar platforms widely used in academic circles.

  10.

    We have had some success with CMS tools that can accommodate calls to an API as a means of fetching data, as does the more recent Omeka S version.

  11.

    On the use of virtual reality tools in foreign language teaching, see Dobrova et al. (2017).

  12.

    On the use of virtual reality tools to stimulate student engagement, see Lau and Lee (2015).

  13.

    For the Geography Markup Language, see Sharma and Herring (2018). The CIDOC Conceptual Reference Model (CIDOC-CRM) has been adopted by many in the archaeology and cultural heritage communities. See also the textual markup guidelines maintained by the Text Encoding Initiative (TEI) at https://tei-c.org/. Many other markup standards exist for domain-specific purposes.

  14.

    The term “Dublin Core” is a trademark. For more information, see https://dublincore.org/.

  15.

    https://historicengland.org.uk/images-books/publications/midas-heritage/.

  16.

    https://www.getty.edu/research/publications/electronic_publications/cdwa/.

  17.

    http://www.loc.gov/today/pr/2013/files/twitter_report_2013jan.pdf.

  18.

    Optical character recognition (OCR) refers to the conversion of typed, printed, or handwritten text into machine-readable encoded text.

  19.

    The completion of the Chicago Assyrian Dictionary as reported by the New York Times: https://www.nytimes.com/2011/06/07/science/07dictionary.html.

  20.

    As of July 2023, the OCHRE database is managing almost 10,000,000 intentionally and carefully curated database items generated by almost one hundred projects.

  21.

    Commended as being accessible even to the technically uninclined, the data set is hosted online at “To be continued: The Australian Newspaper Fiction Database” (https://readallaboutit.com.au/).

  22.

    SQL has been an ANSI (American National Standards Institute) standard since 1986 and an ISO (International Organization for Standardization) standard since 1987. It was taken up by IBM, Microsoft, Oracle, and others, and is implemented in systems like MySQL and PostgreSQL, to name two of the most popular.
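
    As a minimal illustration of SQL's declarative style, here is a sketch using Python's built-in sqlite3 module with a hypothetical table of finds (the table and its contents are invented for the example):

        import sqlite3

        # In-memory database with a hypothetical table of archaeological finds.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE finds (id INTEGER PRIMARY KEY, site TEXT, material TEXT, year INTEGER)")
        conn.executemany(
            "INSERT INTO finds (site, material, year) VALUES (?, ?, ?)",
            [("Site A", "ceramic", 2019), ("Site A", "stone", 2021), ("Site B", "glass", 2018)],
        )

        # A declarative query: say *what* you want, not *how* to retrieve it.
        for row in conn.execute(
            "SELECT site, COUNT(*) AS n FROM finds WHERE year >= 2019 GROUP BY site ORDER BY n DESC"
        ):
            print(row)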

  23.

    https://www.infoworld.com/article/3219795/what-is-sql-the-lingua-franca-of-data-analysis.html.

  24.

    Do not confuse data redundancy with a backup. A backup is an archived version of the primary data, hopefully never needed to be seen or used again. It is not another copy of the viable data, or some portion thereof, intended to be used in a different context for a different purpose or by different users. LOCKSS, “Lots of Copies Keep Stuff Safe,” seems to be a poor man’s sustainability option. It is not clear, exactly, where LOCKSS (or CLOCKSS, Controlled LOCKSS) is headed, so be careful.

  25.

    Pramod Sadalage; http://www.thoughtworks.com/insights/blog/nosql-databases-overview.

  26.

    For “your ultimate guide to the non-relational universe!” see http://nosql-database.org.

  27.

    https://www.ibm.com/ibm/history/ibm100/us/en/icons/ibmims/.

  28.

    As a computer science graduate in the early 1980s, S. Schloen began her professional career by maintaining an IMS application on an IBM System/360 mainframe for one of the world’s largest oil companies.

  29.

    Website developers will recognize the hierarchical document model in the Document Object Model (DOM) interface that lets them manipulate the content, structure, and style of a web page programmatically (e.g., using JavaScript).
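
    For readers unfamiliar with the DOM, here is a minimal sketch using Python's standard xml.dom.minidom module, which exposes the same interface web developers use from JavaScript, to manipulate a tiny invented document programmatically:

        from xml.dom.minidom import parseString

        doc = parseString("<p>The <term>database</term> approach</p>")

        # Navigate the hierarchy: find an element and read its text content.
        term = doc.getElementsByTagName("term")[0]
        print(term.firstChild.data)  # -> database

        # Modify structure and content through the same interface.
        term.setAttribute("type", "keyword")
        note = doc.createElement("note")
        note.appendChild(doc.createTextNode("added programmatically"))
        doc.documentElement.appendChild(note)

        print(doc.toxml())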

  30.

    https://tei-c.org/guidelines/.

  31.

    This simplified view of a text does not do justice to the complexity allowed by TEI, but is a general starting point for a well-documented discussion of texts represented by hierarchical XML.
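
    As a sketch of that simplified view, the following uses Python's standard xml.etree.ElementTree on a hypothetical, non-namespaced TEI-like fragment; real TEI documents declare the TEI namespace and are considerably richer:

        import xml.etree.ElementTree as ET

        # Hypothetical, simplified TEI-like fragment: a text as a hierarchy of divisions and lines.
        fragment = """
        <text>
          <body>
            <div type="chapter" n="1">
              <l n="1">First line of the first chapter.</l>
              <l n="2">Second line of the first chapter.</l>
            </div>
          </body>
        </text>
        """

        root = ET.fromstring(fragment)

        # Walk the hierarchy: every line, in document order, with its division's numbering.
        for div in root.iter("div"):
            for line in div.iter("l"):
                print(div.get("n"), line.get("n"), line.text)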

  32.

    https://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html.

  33.

    See Chap. 4 for a discussion of OCHRE’s implementation of the Lexical Markup Framework.

  34.

    The original, now-discontinued OpenOffice used the OpenDocument format, which was XML-based. This was innovative at the time and very helpful. CHD documents were saved in this format, which exposed the formatting clearly as XML elements and attributes. These were then transformed (using XSLT) to a more semantically meaningful document for import to OCHRE.
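
    A minimal sketch of this kind of XSLT transformation, using the third-party lxml library with an invented input and stylesheet (not the actual CHD stylesheets), rewriting presentation-oriented markup as a more semantically meaningful element:

        from lxml import etree  # third-party: pip install lxml

        # Hypothetical source: formatting-based markup, as exported from a word processor.
        source = etree.XML('<doc><span style="italic">example-word</span></doc>')

        # Hypothetical stylesheet: replace the formatting-based element with a semantic one,
        # copying everything else through unchanged (the "identity" template).
        stylesheet = etree.XML("""
        <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="span[@style='italic']">
            <foreign><xsl:value-of select="."/></foreign>
          </xsl:template>
          <xsl:template match="@*|node()">
            <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
          </xsl:template>
        </xsl:stylesheet>
        """)

        transform = etree.XSLT(stylesheet)
        print(str(transform(source)))  # <doc><foreign>example-word</foreign></doc>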

  35.

    See CHD (Güterbock and Hoffner 1997), Volume P, p. 58.

  36.

    The term “graph” is not intended to evoke images of data visualizations often informally referred to as graphs, like pie graphs and bar graphs. Instead, we will refer to those as charts or diagrams, reserving the use of “graph” for its mathematical meaning derived from graph theory and adopted by computer science as a data structure. Wikipedia provides a satisfactory introduction to graphs (https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)).
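
    In that mathematical sense, a graph is simply a set of nodes connected by edges. A minimal sketch in Python, using a plain dictionary as an adjacency list over invented nodes and relationships:

        # A directed graph as an adjacency list: each node maps to its outgoing (relationship, target) edges.
        graph = {
            "Inscription-42": [("found_in", "Locus-7"), ("mentions", "Person-3")],
            "Locus-7": [("part_of", "Area-B")],
            "Person-3": [],
            "Area-B": [],
        }

        def neighbors(node):
            """Return the nodes reachable from `node` in one hop."""
            return [target for _, target in graph.get(node, [])]

        print(neighbors("Inscription-42"))  # -> ['Locus-7', 'Person-3']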

  37.

    The Network Data Model (NDM) was proposed in 1971 by the Data Base Task Group (DBTG) of the Programming Language Committee (subsequently renamed the COBOL committee) of the Conference on Data Systems Language (CODASYL), the organization responsible for the definition of the COBOL programming language.

  38.

    https://en.wikipedia.org/wiki/Semantic_Web.

  39.

    Websites using structured formats like XML, or its simpler cousin JSON (JavaScript Object Notation), do make this process more user-friendly and predictable.

  40.

    The most useful SPARQL query endpoints include preconfigured sample queries that give the user various models as help for getting started.
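
    As a minimal sketch of querying such an endpoint programmatically, the following targets Wikidata's public SPARQL endpoint and assumes the third-party requests library; the query simply asks for a few items that are instances of "human" (wd:Q5):

        import requests  # third-party: pip install requests

        ENDPOINT = "https://query.wikidata.org/sparql"

        # wdt:P31 = "instance of", wd:Q5 = "human" in Wikidata's vocabulary.
        query = """
        SELECT ?item ?itemLabel WHERE {
          ?item wdt:P31 wd:Q5 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
        }
        LIMIT 5
        """

        response = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                                headers={"User-Agent": "sparql-example/0.1"})
        for binding in response.json()["results"]["bindings"]:
            print(binding["item"]["value"], binding.get("itemLabel", {}).get("value"))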

  41.

    On Digital Object Identifiers (DOI), for example, see https://www.doi.org/.

  42.

    https://www.explainablestartup.com/2016/08/the-history-of-semantic-web-is-the-future-of-intelligent-assistants.html. As summarized by a more systematic critique, "The Semantic Web: Two Decades On": "a lack of usable tools, a lack of incentives, a lack of robustness for unreliable publishers, and overly verbose standards, in particular, are widely acknowledged as valid criticisms of the Semantic Web" (https://aidanhogan.com/docs/semantic-web-now.pdf, p. 13).

  43.

    See, for example, “RIP: The Semantic Web” (https://blog.diffbot.com/rip-the-semantic-web/) and “Whatever Happened to the Semantic Web?” (https://twobithistory.org/2018/05/27/semantic-web.html).

  44.

    https://www.techradar.com/news/the-inventor-of-the-world-wide-web-says-his-creation-has-been-abused-for-too-long.

  45.

    Neo4j, for example, is structured as a labeled property graph (LPG), in contrast to Semantic Web graphs based on RDF.

  46.

    https://graphbase.ai.

  47.

    Neo4j developed its own query language, Cypher, which is a strong influence on the standard being devised. See https://neo4j.com/press-releases/query-language-graph-databases-international-standard/.
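
    For a flavor of Cypher, a minimal sketch using the neo4j Python driver against a hypothetical local instance; the connection details and the data model are invented for the example and do not describe any particular project:

        from neo4j import GraphDatabase  # third-party: pip install neo4j

        # Hypothetical connection details for a local Neo4j instance.
        driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

        # Cypher expresses graph patterns directly: nodes in parentheses, relationships in brackets.
        cypher = """
        MATCH (t:Text)-[:MENTIONS]->(p:Person)
        WHERE p.name = $name
        RETURN t.title AS title
        """

        with driver.session() as session:
            for record in session.run(cypher, name="Example Person"):
                print(record["title"])

        driver.close()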

  48.

    See Software AG, “Tamino: Advanced Concepts” (2015, p. 13) for a helpful, illustrated discussion on normalizing XML.

  49.

    https://en.wikipedia.org/wiki/Encyclopédie.

  50.

    https://www.w3schools.com/xml/xml:schema.asp. Special-purpose tools like RELAX NG, as well as many XML editors, will validate an XML document against a given schema.
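
    A minimal sketch of programmatic schema validation, using the third-party lxml library with hypothetical file names for an XSD schema and a document to be checked:

        from lxml import etree  # third-party: pip install lxml

        # Hypothetical files: an XML document and the XSD schema it claims to conform to.
        schema = etree.XMLSchema(etree.parse("project-schema.xsd"))
        document = etree.parse("project-data.xml")

        if schema.validate(document):
            print("Document is valid against the schema.")
        else:
            for error in schema.error_log:
                print(error.line, error.message)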

  51.

    The team at Pompeii uses a custom FileMaker 12 application, which seems to be well-designed but highly specific to their needs at Pompeii (Pompeii Archaeological Research Project: Porta Stabia [PARP:PS], http://classics.uc.edu/pompeii/).

  52.

    Ives was the speaker for a workshop sponsored by the Neubauer Collegium at the University of Chicago, “Data Integration to Facilitate Data Science,” October 4, 2019, https://neubauercollegium.uchicago.edu/events/data-integration-to-facilitate-data-science.

  53.

    For an in-depth discussion, we recommend Doan et al. (2012), especially section 1.3. The advent of cloud computing has also spawned a massive industry with options for cloud-based data warehouses, data lakes, lake houses, data meshes, etc. Each strategy has a range of features with corresponding pros and cons. Amazon Web Services and Microsoft’s Azure products are big players in this game, for example.

  54.

    For an example of a system in a similar academic space that is based on virtual integration, see the Digital Archaeological Record (tDAR), "your online archive for archaeological information" (https://core.tdar.org/).

  55.

    For additional benefits of a warehouse strategy, see Doan et al. (2012, p. 319).

  56.

    This is OCHRE’s version of the “pipeline of procedural ETL (extract/transform/load) tools” typical of a data warehouse (ibid.).
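
    For readers unfamiliar with the ETL pattern, a generic sketch (not OCHRE's actual pipeline) using only the Python standard library: extract rows from a hypothetical CSV export, transform them into a consistent shape, and load them into a database.

        import csv
        import sqlite3

        def extract(path):
            """Extract: read raw rows from a hypothetical CSV export."""
            with open(path, newline="", encoding="utf-8") as f:
                yield from csv.DictReader(f)

        def transform(rows):
            """Transform: normalize field names, types, and controlled values."""
            for row in rows:
                yield (row["Site"].strip(), row["Material"].strip().lower(), int(row["Year"]))

        def load(records, conn):
            """Load: insert the cleaned records into the target store."""
            conn.execute("CREATE TABLE IF NOT EXISTS finds (site TEXT, material TEXT, year INTEGER)")
            conn.executemany("INSERT INTO finds VALUES (?, ?, ?)", records)
            conn.commit()

        conn = sqlite3.connect(":memory:")
        load(transform(extract("legacy_finds.csv")), conn)  # hypothetical input file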

  57.

    https://www.gartner.com/en/information-technology/glossary/master-data-management-mdm.

  58.

    https://www.stibosystems.com/what-is-master-data-management.

  59.

    https://profisee.com/master-data-management-what-why-how-who.

  60.

    Consultants at the OCHRE Data Service guide academic researchers in the adoption and implementation of data management strategies. Our experience has been that without such a consultation service, this stage of the process can often derail a research project.

  61.

    https://www.informatica.com/services-and-training/glossary-of-terms/master-data-management-definition.html. See Berson and Dubov (2011, pp. 21–23).

  62.

    XML is well-known for its nimbleness and its ease of transformation using eXtensible Stylesheet Language Transformations (XSLT), for example.

  63.

    For justification of a graph database approach, see Harrison (2015), who explains why "graph database systems shine" in comparison to relational systems or NoSQL databases.

  64.

    Doan et al. (2012, p. 31). See also Doan et al. Chap. 11 for a rigorous discussion of XML and its aptness for data integration.

  65.

    OCHRE's item-variable-value approach resonates with, and is easily mapped to, the subject-predicate-object formulation (often described as "triples") of the Resource Description Framework (RDF), an official specification of the W3C published in 1999. OCHRE and its predecessors, XSTAR and INFRA, were using the concept of an item with its set of variables and their values from the start, beginning with INFRA in 1989. The RDF specification served only later as a welcome validation of this highly compatible approach.
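
    To illustrate the correspondence, a sketch using the third-party rdflib library: a hypothetical item-variable-value observation expressed as an RDF subject-predicate-object triple (the namespace and identifiers are invented for the example, not OCHRE's actual identifiers):

        from rdflib import Graph, Literal, Namespace  # third-party: pip install rdflib

        EX = Namespace("http://example.org/")  # hypothetical namespace for this sketch

        g = Graph()

        # Item "sherd-17", variable "material", value "ceramic"
        # becomes subject, predicate, object:
        g.add((EX["sherd-17"], EX["material"], Literal("ceramic")))

        print(g.serialize(format="turtle"))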

  66.

    For TEI, see https://tei-c.org/; for EDM, see https://pro.europeana.eu/page/edm-documentation. OCHRE accommodates conformity with “standard” ontologies by providing tools to import from, or export to, formats based on the data models documented and encouraged by these specifications.

  67.

    https://opencontext.org/.

  68.

    The benefits of ArchaeoML are described by Eric Kansa et al. (2010, pp. 309–312).

  69.

    Schloen, D. (2023). https://digitalculture.uchicago.edu/platforms/ochre-ontology/.


Copyright information

© 2023 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Schloen, S.R., Prosser, M.C. (2023). The Case for a Database Approach. In: Database Computing for Scholarly Research. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-031-46696-0_2
