
The Case for a Database Approach

Database Computing for Scholarly Research

Abstract

The management of research data in the humanities and social sciences is a genuine, non-trivial challenge from a computational perspective. In this chapter, a case is made for a database approach that allows for the integration of all project data in a way that fits productively into the research program, and which makes the resulting data maximally useful for analysis, sharing and publication. A research database platform should accommodate highly diverse data that is dispersed over space and time, that is characterized by high variability, that is semi-structured, and that contains uncertainty and disagreements. Integration of spatial, temporal, textual, lexical, and multi-media data should be supported naturally and intuitively. In this chapter, an evaluation of traditional options for managing data leads to a discussion of a hybrid model, as implemented by OCHRE using XML, that is inspired by all the major database paradigms—the hierarchical, the relational, and the graph/network—taking advantage of the best features of each. An appropriately generic upper ontology provides an underlying framework for managing data of all kinds.


Notes

  1.

    https://en.wikipedia.org/wiki/Procrustes.

  2.

    U.K. Research and Innovation Concordat on Open Research Data (https://www.ukri.org/wp-content/uploads/2020/10/UKRI-020920-ConcordatonOpenResearchData.pdf).

  3.

    For our purposes, a text is both the observable signs used to communicate an idea and the interpretation of those signs by the reader.

  4.

    This sentiment was expressed by Jeffrey Heer, a professor of computer science at the University of Washington and a cofounder of Trifacta, a start-up based in San Francisco.

  5.

    It now seems inevitable that large language models will play a role in health care and beyond (Dave et al. 2023).

  6.

    Good tools are available to help with such tasks, like OpenRefine, “a free, open-source, powerful tool for working with messy data” (https://openrefine.org).

  7.

    By “picklist” we mean a user interface mechanism to provide a list (usually drop-down) of only valid values from which the user can choose.
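
    To make the idea concrete, here is a minimal sketch of a picklist using Python's standard tkinter/ttk toolkit; the controlled vocabulary is invented for the example, and this is only a generic illustration, not how OCHRE implements its picklists.

        import tkinter as tk
        from tkinter import ttk

        # Hypothetical controlled vocabulary for a "material" field.
        VALID_MATERIALS = ["ceramic", "stone", "bone", "metal", "glass"]

        root = tk.Tk()
        root.title("Picklist example")

        # state="readonly" prevents free-form typing, so only listed values can be chosen.
        material = ttk.Combobox(root, values=VALID_MATERIALS, state="readonly")
        material.current(0)  # preselect the first valid value
        material.pack(padx=10, pady=10)

        root.mainloop()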

  8.

    See Friedrich Nietzsche, The Will to Power, translated by Walter Kaufmann and R. J. Hollingdale (1968), section 481.

  9.

    See https://omeka.org/. WordPress and Drupal are similar platforms widely used in academic circles.

  10.

    We have had some success with CMS tools that can accommodate calls to an API as a means of fetching data, as does the more recent Omeka S version.

  11.

    On the use of virtual reality tools in foreign language teaching, see Dobrova et al. (2017).

  12.

    On the use of virtual reality tools to stimulate student engagement, see Lau and Lee (2015).

  13.

    For the Geography Markup Language, see Sharma and Herring (2018). The CIDOC Conceptual Reference Model (CIDOC-CRM) has been adopted by many in the archaeology and cultural heritage communities. See also the textual markup guidelines maintained by the Text Encoding Initiative (TEI) at https://tei-c.org/. Many other markup standards exist for domain-specific purposes.

  14.

    The term “Dublin Core” is a trademark. For more information, see https://dublincore.org/.

  15.

    https://historicengland.org.uk/images-books/publications/midas-heritage/.

  16.

    https://www.getty.edu/research/publications/electronic_publications/cdwa/.

  17.

    http://www.loc.gov/today/pr/2013/files/twitter_report_2013jan.pdf.

  18.

    Optical character recognition (OCR) refers to the conversion of typed, printed, or handwritten text into machine-readable encoded text.

  19.

    The completion of the Chicago Assyrian Dictionary as reported by the New York Times: https://www.nytimes.com/2011/06/07/science/07dictionary.html.

  20.

    As of July 2023, the OCHRE database is managing almost 10,000,000 intentionally and carefully curated database items generated by almost one hundred projects.

  21.

    Commended as being accessible even to the technically uninclined, the data set is hosted online at “To be continued: The Australian Newspaper Fiction Database” (https://readallaboutit.com.au/).

  22.

    SQL has been an ANSI (American National Standards Institute) standard since 1986 and an ISO (International Organization for Standardization) standard since 1987. It was taken up by IBM, Microsoft, Oracle, and others, and is implemented in systems like MySQL and PostgreSQL, to name two of the most popular.
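
    As a minimal illustration of SQL's declarative style, here is a sketch using Python's built-in sqlite3 module with a hypothetical table of finds (the table and its contents are invented for the example):

        import sqlite3

        # In-memory database with a hypothetical table of archaeological finds.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE finds (id INTEGER PRIMARY KEY, site TEXT, material TEXT, year INTEGER)")
        conn.executemany(
            "INSERT INTO finds (site, material, year) VALUES (?, ?, ?)",
            [("Site A", "ceramic", 2019), ("Site A", "stone", 2021), ("Site B", "glass", 2018)],
        )

        # A declarative query: say *what* you want, not *how* to retrieve it.
        for row in conn.execute(
            "SELECT site, COUNT(*) AS n FROM finds WHERE year >= 2019 GROUP BY site ORDER BY n DESC"
        ):
            print(row)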

  23.

    https://www.infoworld.com/article/3219795/what-is-sql-the-lingua-franca-of-data-analysis.html.

  24.

    Do not confuse data redundancy with a backup. A backup is an archived version of the primary data, hopefully never needed to be seen or used again. It is not another copy of the viable data, or some portion thereof, intended to be used in a different context for a different purpose or by different users. LOCKSS, “Lots of Copies Keep Stuff Safe,” seems to be a poor man’s sustainability option. It is not clear, exactly, where LOCKSS (or CLOCKSS, Controlled LOCKSS) is headed, so be careful.

  25.

    Pramod Sadalage; http://www.thoughtworks.com/insights/blog/nosql-databases-overview.

  26.

    For “your ultimate guide to the non-relational universe!” see http://nosql-database.org.

  27.

    https://www.ibm.com/ibm/history/ibm100/us/en/icons/ibmims/.

  28.

    As a computer science graduate in the early 1980s, S. Schloen began her professional career by maintaining an IMS application on an IBM System/360 mainframe for one of the world’s largest oil companies.

  29.

    Website developers will recognize the hierarchical document model in the Document Object Model (DOM) interface that lets them manipulate the content, structure, and style of a web page programmatically (e.g., using JavaScript).
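
    For readers unfamiliar with the DOM, here is a minimal sketch using Python's standard xml.dom.minidom module, which exposes the same interface web developers use from JavaScript, to manipulate a tiny invented document programmatically:

        from xml.dom.minidom import parseString

        doc = parseString("<p>The <term>database</term> approach</p>")

        # Navigate the hierarchy: find an element and read its text content.
        term = doc.getElementsByTagName("term")[0]
        print(term.firstChild.data)  # -> database

        # Modify structure and content through the same interface.
        term.setAttribute("type", "keyword")
        note = doc.createElement("note")
        note.appendChild(doc.createTextNode("added programmatically"))
        doc.documentElement.appendChild(note)

        print(doc.toxml())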

  30.

    https://tei-c.org/guidelines/.

  31.

    This simplified view of a text does not do justice to the complexity allowed by TEI, but is a general starting point for a well-documented discussion of texts represented by hierarchical XML.
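
    As a sketch of that simplified view, the following uses Python's standard xml.etree.ElementTree on a hypothetical, non-namespaced TEI-like fragment; real TEI documents declare the TEI namespace and are considerably richer:

        import xml.etree.ElementTree as ET

        # Hypothetical, simplified TEI-like fragment: a text as a hierarchy of divisions and lines.
        fragment = """
        <text>
          <body>
            <div type="chapter" n="1">
              <l n="1">First line of the first chapter.</l>
              <l n="2">Second line of the first chapter.</l>
            </div>
          </body>
        </text>
        """

        root = ET.fromstring(fragment)

        # Walk the hierarchy: every line, in document order, with its division's numbering.
        for div in root.iter("div"):
            for line in div.iter("l"):
                print(div.get("n"), line.get("n"), line.text)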

  32.

    https://www.tei-c.org/release/doc/tei-p5-doc/en/html/DS.html.

  33.

    See Chap. 4 for a discussion of OCHRE’s implementation of the Lexical Markup Framework.

  34.

    The original, now-discontinued OpenOffice used the OpenDocument format, which was XML-based. This was innovative at the time and very helpful. CHD documents were saved in this format, which exposed the formatting clearly as XML elements and attributes. These were then transformed (using XSLT) to a more semantically meaningful document for import to OCHRE.
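
    A minimal sketch of this kind of XSLT transformation, using the third-party lxml library with an invented input and stylesheet (not the actual CHD stylesheets), rewriting presentation-oriented markup as a more semantically meaningful element:

        from lxml import etree  # third-party: pip install lxml

        # Hypothetical source: formatting-based markup, as exported from a word processor.
        source = etree.XML('<doc><span style="italic">example-word</span></doc>')

        # Hypothetical stylesheet: replace the formatting-based element with a semantic one,
        # copying everything else through unchanged (the "identity" template).
        stylesheet = etree.XML("""
        <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="span[@style='italic']">
            <foreign><xsl:value-of select="."/></foreign>
          </xsl:template>
          <xsl:template match="@*|node()">
            <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
          </xsl:template>
        </xsl:stylesheet>
        """)

        transform = etree.XSLT(stylesheet)
        print(str(transform(source)))  # <doc><foreign>example-word</foreign></doc>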

  35.

    See CHD (Güterbock and Hoffner 1997), Volume P, p. 58.

  36.

    The term “graph” is not intended to evoke images of data visualizations often informally referred to as graphs, like pie graphs and bar graphs. Instead, we will refer to those as charts or diagrams, reserving the use of “graph” for its mathematical meaning derived from graph theory and adopted by computer science as a data structure. Wikipedia provides a satisfactory introduction to graphs (https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)).
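
    In that mathematical sense, a graph is simply a set of nodes connected by edges. A minimal sketch in Python, using a plain dictionary as an adjacency list over invented nodes and relationships:

        # A directed graph as an adjacency list: each node maps to its outgoing (relationship, target) edges.
        graph = {
            "Inscription-42": [("found_in", "Locus-7"), ("mentions", "Person-3")],
            "Locus-7": [("part_of", "Area-B")],
            "Person-3": [],
            "Area-B": [],
        }

        def neighbors(node):
            """Return the nodes reachable from `node` in one hop."""
            return [target for _, target in graph.get(node, [])]

        print(neighbors("Inscription-42"))  # -> ['Locus-7', 'Person-3']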

  37.

    The Network Data Model (NDM) was proposed in 1971 by the Data Base Task Group (DBTG) of the Programming Language Committee (subsequently renamed the COBOL committee) of the Conference on Data Systems Language (CODASYL), the organization responsible for the definition of the COBOL programming language.

  38.

    https://en.wikipedia.org/wiki/Semantic_Web.

  39.

    Websites using structured formats like XML, or its simpler cousin JSON (JavaScript Object Notation), do make this process more user-friendly and predictable.

  40.

    The most useful SPARQL query endpoints include preconfigured sample queries that give the user various models as help for getting started.
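
    As a minimal sketch of querying such an endpoint programmatically, the following targets Wikidata's public SPARQL endpoint and assumes the third-party requests library; the query simply asks for a few items that are instances of "human" (wd:Q5):

        import requests  # third-party: pip install requests

        ENDPOINT = "https://query.wikidata.org/sparql"

        # wdt:P31 = "instance of", wd:Q5 = "human" in Wikidata's vocabulary.
        query = """
        SELECT ?item ?itemLabel WHERE {
          ?item wdt:P31 wd:Q5 .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
        }
        LIMIT 5
        """

        response = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                                headers={"User-Agent": "sparql-example/0.1"})
        for binding in response.json()["results"]["bindings"]:
            print(binding["item"]["value"], binding.get("itemLabel", {}).get("value"))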

  41.

    On Digital Object Identifiers (DOI), for example, see https://www.doi.org/.

  42.

    https://www.explainablestartup.com/2016/08/the-history-of-semantic-web-is-the-future-of-intelligent-assistants.html. As summarized by a more systematic critique, "The Semantic Web: Two Decades On": "a lack of usable tools, a lack of incentives, a lack of robustness for unreliable publishers, and overly verbose standards, in particular, are widely acknowledged as valid criticisms of the Semantic Web" (https://aidanhogan.com/docs/semantic-web-now.pdf, p. 13).

  43.

    See, for example, “RIP: The Semantic Web” (https://blog.diffbot.com/rip-the-semantic-web/) and “Whatever Happened to the Semantic Web?” (https://twobithistory.org/2018/05/27/semantic-web.html).

  44.

    https://www.techradar.com/news/the-inventor-of-the-world-wide-web-says-his-creation-has-been-abused-for-too-long.

  45.

    Neo4j, for example, is structured as a labeled property graph (LPG), in contrast to Semantic Web graphs based on RDF.

  46.

    https://graphbase.ai.

  47.

    Neo4j developed its own query language, Cypher, which is a strong influence on the standard being devised. See https://neo4j.com/press-releases/query-language-graph-databases-international-standard/.
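
    For a flavor of Cypher, a minimal sketch using the neo4j Python driver against a hypothetical local instance; the connection details and the data model are invented for the example and do not describe any particular project:

        from neo4j import GraphDatabase  # third-party: pip install neo4j

        # Hypothetical connection details for a local Neo4j instance.
        driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

        # Cypher expresses graph patterns directly: nodes in parentheses, relationships in brackets.
        cypher = """
        MATCH (t:Text)-[:MENTIONS]->(p:Person)
        WHERE p.name = $name
        RETURN t.title AS title
        """

        with driver.session() as session:
            for record in session.run(cypher, name="Example Person"):
                print(record["title"])

        driver.close()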

  48.

    See Software AG, “Tamino: Advanced Concepts” (2015, p. 13) for a helpful, illustrated discussion on normalizing XML.

  49.

    https://en.wikipedia.org/wiki/Encyclopédie.

  50.

    https://www.w3schools.com/xml/xml:schema.asp. Special-purpose tools like RELAX NG, as well as many XML editors, will validate an XML document against a given schema.
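
    A minimal sketch of programmatic schema validation, using the third-party lxml library with hypothetical file names for an XSD schema and a document to be checked:

        from lxml import etree  # third-party: pip install lxml

        # Hypothetical files: an XML document and the XSD schema it claims to conform to.
        schema = etree.XMLSchema(etree.parse("project-schema.xsd"))
        document = etree.parse("project-data.xml")

        if schema.validate(document):
            print("Document is valid against the schema.")
        else:
            for error in schema.error_log:
                print(error.line, error.message)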

  51.

    The team at Pompeii uses a custom FileMaker 12 application, which seems to be well-designed but highly specific to their needs at Pompeii (Pompeii Archaeological Research Project: Porta Stabia [PARP:PS], http://classics.uc.edu/pompeii/).

  52.

    Ives was the speaker for a workshop sponsored by the Neubauer Collegium at the University of Chicago, “Data Integration to Facilitate Data Science,” October 4, 2019, https://neubauercollegium.uchicago.edu/events/data-integration-to-facilitate-data-science.

  53.

    For an in-depth discussion, we recommend Doan et al. (2012), especially section 1.3. The advent of cloud computing has also spawned a massive industry with options for cloud-based data warehouses, data lakes, lake houses, data meshes, etc. Each strategy has a range of features with corresponding pros and cons. Amazon Web Services and Microsoft’s Azure products are big players in this game, for example.

  54.

    For an example of a system in a similar academic space that is based on virtual integration, see the Digital Archaeological Record (tDAR), "your online archive for archaeological information" (https://core.tdar.org/).

  55.

    For additional benefits of a warehouse strategy, see Doan et al. (2012, p. 319).

  56.

    This is OCHRE’s version of the “pipeline of procedural ETL (extract/transform/load) tools” typical of a data warehouse (ibid.).
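
    For readers unfamiliar with the ETL pattern, a generic sketch (not OCHRE's actual pipeline) using only the Python standard library: extract rows from a hypothetical CSV export, transform them into a consistent shape, and load them into a database.

        import csv
        import sqlite3

        def extract(path):
            """Extract: read raw rows from a hypothetical CSV export."""
            with open(path, newline="", encoding="utf-8") as f:
                yield from csv.DictReader(f)

        def transform(rows):
            """Transform: normalize field names, types, and controlled values."""
            for row in rows:
                yield (row["Site"].strip(), row["Material"].strip().lower(), int(row["Year"]))

        def load(records, conn):
            """Load: insert the cleaned records into the target store."""
            conn.execute("CREATE TABLE IF NOT EXISTS finds (site TEXT, material TEXT, year INTEGER)")
            conn.executemany("INSERT INTO finds VALUES (?, ?, ?)", records)
            conn.commit()

        conn = sqlite3.connect(":memory:")
        load(transform(extract("legacy_finds.csv")), conn)  # hypothetical input file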

  57.

    https://www.gartner.com/en/information-technology/glossary/master-data-management-mdm.

  58.

    https://www.stibosystems.com/what-is-master-data-management.

  59.

    https://profisee.com/master-data-management-what-why-how-who.

  60.

    Consultants at the OCHRE Data Service guide academic researchers in the adoption and implementation of data management strategies. Our experience has been that without such a consultation service, this stage of the process can often derail a research project.

  61.

    https://www.informatica.com/services-and-training/glossary-of-terms/master-data-management-definition.html. See Berson and Dubov (2011, pp. 21–23).

  62.

    XML is well-known for its nimbleness and its ease of transformation using eXtensible Stylesheet Language Transformations (XSLT), for example.

  63.

    For justification of a graph database approach, see Harrison (2015), who explains why "graph database systems shine" in comparison to relational systems or NoSQL databases.

  64.

    Doan et al. (2012, p. 31). See also Doan et al. Chap. 11 for a rigorous discussion of XML and its aptness for data integration.

  65.

    OCHRE's item-variable-value approach resonates with, and is easily mapped to, the subject-predicate-object formulation (often described as "triples") of the Resource Description Framework (RDF), an official specification of the W3C published in 1999. OCHRE and its predecessors, XSTAR and INFRA, were using the concept of an item with its set of variables and their values from the start, beginning with INFRA in 1989. The RDF specification served only later as a welcome validation of this highly compatible approach.
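
    To illustrate the correspondence, a sketch using the third-party rdflib library: a hypothetical item-variable-value observation expressed as an RDF subject-predicate-object triple (the namespace and identifiers are invented for the example, not OCHRE's actual identifiers):

        from rdflib import Graph, Literal, Namespace  # third-party: pip install rdflib

        EX = Namespace("http://example.org/")  # hypothetical namespace for this sketch

        g = Graph()

        # Item "sherd-17", variable "material", value "ceramic"
        # becomes subject, predicate, object:
        g.add((EX["sherd-17"], EX["material"], Literal("ceramic")))

        print(g.serialize(format="turtle"))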

  66.

    For TEI, see https://tei-c.org/; for EDM, see https://pro.europeana.eu/page/edm-documentation. OCHRE accommodates conformity with “standard” ontologies by providing tools to import from, or export to, formats based on the data models documented and encouraged by these specifications.

  67.

    https://opencontext.org/.

  68.

    The benefits of ArchaeoML are described by Eric Kansa et al. (2010, pp. 309–312).

  69.

    Schloen, D. (2023). https://digitalculture.uchicago.edu/platforms/ochre-ontology/.


Copyright information

© 2023 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Schloen, S.R., Prosser, M.C. (2023). The Case for a Database Approach. In: Database Computing for Scholarly Research. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-031-46696-0_2
