Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data

Abstract

Criticism of big data has focused on showing that more is not necessarily better, in the sense that data may lose their value when taken out of context and aggregated. The next step is to incorporate an awareness of the pitfalls of aggregation into the design of data infrastructure and institutions. A common strategy minimizes aggregation errors by increasing the precision of our conventions for identifying and classifying data. As a counterpoint, we argue that there are pragmatic trade-offs between precision and ambiguity that are key to designing effective solutions for generating big data about biodiversity. We focus on the importance of theory-dependence as a source of ambiguity in taxonomic nomenclature and hence a persistent challenge for implementing a single, long-term solution to storing and accessing meaningful sets of biological specimens. We argue that ambiguity does have a positive role to play in scientific progress as a tool for efficiently symbolizing multiple aspects of taxa and mediating between conflicting hypotheses about their nature. Pursuing a deeper understanding of the trade-offs and synthesis of precision and ambiguity as virtues of scientific language and communication systems then offers a productive next step for realizing sound, big biodiversity data services.

Fig. 1
Fig. 2

Notes

  1. We use “data aggregation” to refer to merging multiple sets of data of the same kind (e.g., multiple collections of specimens or multiple runs of the same experiment) as distinct from “data integration,” which refers to combining multiple kinds of data to solve an inference problem (Berman 2013). The limits of this distinction, where aggregation and integration become hard to tell apart, are an important topic outside the scope of this article.

  2. A taxonomic concept is a description of what a taxonomic name refers to as stated by a particular author in a particular publication. A taxonomic concept can be defined in terms of rules for appropriate use (an intensional definition), by a set of organisms included under the concept (an extensional definition), or by a mixture of these two approaches.

  3. For more on the concept of trajectory as a tool for comparative research in the social sciences, see Strauss (1993).

  4. Note that these hypotheses are about the nature of individual species as entities, not the nature of biological species in general, which has been another source of ongoing debate among biologists and philosophers.

  5. More generally, we also include vouchered occurrence records, such as image-vouchered observations or tissue samples not linked to physical specimen depositions, under our use of the term “specimen data.”

  6. While some of these treatments may provide nomenclatural synonymy information intended to resolve such conflicts, this information can nonetheless still be incomplete, incorrect, or out of date.

  7. Our argument in this article does not presuppose that named taxa are monophyletic groups, but we will set aside the debate over how to define biological species as a further complication that only magnifies the difficulty of meaningful classification of specimen data.

  8. Interestingly, the type method is not mandatory above the family level where the codes of nomenclature have no regulatory power (cf. Franz and Thau 2010).

  9. Figure 2 illustrates the relationship between these two taxon concepts.

  10. These statements are hypothetical examples and should not be taken as necessarily true.

  11. Although see Figures 3C and 3D in Remsen (2016) for a visual analog to sentence (2).

  12. Note that Berendsohn’s extended syntax maintains coherence with existing practices in taxonomy by adding onto the binomial system rather than replacing it wholesale.

References

  1. Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete. Wired, 23 June. http://www.wired.com/2008/06/pb-theory/

  2. Aronova E, von Oertzen C, Sepkoski D (eds) (2017) Data histories, vol 32. Osiris, New York (in press)

  3. Atran S (1998) Folk biology and the anthropology of science: cognitive universals and cultural particulars. Behav Brain Sci 21:547–569

  4. Berendsohn W (1995) The concept of “potential taxa” in databases. Taxon 44:207–212

  5. Berman JJ (2013) Principles of big data. Elsevier, Waltham

  6. Blomquist HL (1948) The grasses of North Carolina. Duke University Press, Durham

  7. Bowker GC (2000) Biodiversity datadiversity. Soc Stud Sci 30:643–683

  8. boyd D, Crawford K (2012) Critical questions for big data. Inform Commun Soc 15:662–679

  9. Cain AJ (1958) Logic and memory in Linnaeus’ system of taxonomy. Proc Linn Soc Lond 169:144–163

  10. Charmantier I, Müller-Wille S (2014) Carl Linnaeus’s botanical paper slips (1767–1773). Intellect Hist Rev 24:215–238

  11. Chomsky N (2002) An interview on minimalism. In: Belletti A, Rizzi L (eds) On nature and language. Cambridge University Press, Cambridge, pp 92–161

  12. Ciardelli P, Kelbert P, Kohbecker A et al (2009) The EDIT platform for cybertaxonomy and the taxonomic workflow: selected components. Lect Notes Inform 154:625–638

  13. Cui H, Xu D, Chong SS et al (2016) Introducing explorer of taxon concepts with a case study on spider measurement matrix building. BMC Bioinform 17(1):471

  14. Dayrat B (2010) Celebrating 250 dynamic years of nomenclatural debates. In: Polaszek A (ed) Systema Naturae 250: The Linnean ark. CRC Press, Boca Raton, pp 186–239

  15. Dietz B (2012) Contribution and co-production: the collaborative culture of Linnaean botany. Ann Sci 69:551–569

  16. Edwards PN, Mayernik MS, Batcheller AL et al (2011) Science friction: data, metadata, and collaboration. Soc Stud Sci 41:667–690

  17. Franz NM, Peet RK (2009) Towards a language for mapping relationships among taxonomic concepts. Syst Biodivers 7:5–20

  18. Franz NM, Thau D (2010) Biological taxonomy and ontology development: scope and limitations. Biodiv Inform 7:45–66

  19. Franz NM, Peet RK, Weakley AS (2008) On the use of taxonomic concepts in support of biodiversity research and taxonomy. In: Wheeler QD (ed) The new taxonomy. CRC Press, Boca Raton, pp 63–86

  20. Franz NM, Chen M, Yu S et al (2015) Reasoning over taxonomic change: exploring alignments for the Perelleschus use case. PLoS ONE 10(2):e0118247

  21. Franz NM, Chen M, Kianmajd P et al (2016a) Names are not good enough: reasoning over taxonomic change in the Andropogon complex. Semant Web 7:645–667

  22. Franz NM, Pier NM, Reeder DM et al (2016b) Two influential primate classifications logically aligned. Syst Biol 65:561–582

  23. Franz N, Gilbert E, Ludäscher B, Weakley A (2016c) Controlling the taxonomic variable: taxonomic concept resolution for a southeastern United States herbarium portal. Res Ideas Outcomes 2:e10610

  24. Gandy L, Gumm J, Fertig B et al (2016) Synthesizer: expediting synthesis studies from context-free data with natural language processing. bioRxiv. doi:10.1101/053629

  25. Geoffroy M, Berendsohn WG (2003) The concept problem in taxonomy: importance, components, approaches. Schrift Vegetationsk 39:5–14

  26. Gerson EM (2008) Reach, bracket, and the limits of rationalized coordination: some challenges for CSCW. In: Ackerman MS, Halverson CA, Erickson T, Kellogg WA (eds) Resources, co-evolution and artifacts: theory in CSCW (computer supported cooperative work). Springer, London, pp 193–220

  27. Godfray HCJ (2002) Challenges for taxonomy. Nature 417(6884):17–19

  28. Goodwin ZA, Harris DJ, Filer D et al (2015) Widespread mistaken identity in tropical plant collections. Curr Biol 25:R1066–R1067

  29. Gratton P, Trucchi E, Trasatti A et al (2016) Testing classical species properties with contemporary data: how “bad species” in the brassy ringlets (Erebia tyndarus complex, Lepidoptera) turned good. Syst Biol 65:292–303

  30. Griesemer JR (2012) Formalization and the meaning of ‘theory’ in the inexact biological sciences. Biol Theory 7(4):298–310

  31. Hey T, Tansley S, Tolle K (eds) (2009) The fourth paradigm: data-intensive scientific discovery. Microsoft Research, Redmond

  32. Hinchcliff CE, Smith SA, Allman JF et al (2015) Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc Natl Acad Sci USA 112:12764–12769

  33. Hitchcock AS, Chase A (1950) Manual of the grasses of the United States, 2nd edn. United States Department of Agriculture Miscellaneous Publication No. 200. US Department of Agriculture, Washington, DC

  34. Hoeppe G (2014) Working data together: the accountability and reflexivity of digital astronomical practice. Soc Stud Sci 44:243–270

  35. Hutter H, Moerman D (2015) Big data in Caenorhabditis elegans: quo vadis? Mol Biol Cell 26:3909–3914

  36. Jansen MA, Franz NM (2015) Phylogenetic revision of Minyomerus Horn, 1876 sec. Jansen & Franz, 2015 (Coleoptera, Curculionidae) using taxonomic concept annotations and alignments. ZooKeys 528:1–133

  37. Jansonius J (1981) Linnaean nomenclature. Universal language of taxonomists. And the Sporae Dispersae (with a commentary on Hughes’ proposal). Taxon 30:438–448

  38. Koperski M, Sauer M, Braun W, Gradstein SR (2000) Referenzliste der Moose Deutschlands. Schrift Vegetationsk 34:1–519

  39. Kuhn TS (1996[1962]) The structure of scientific revolutions. University of Chicago Press, Chicago

  40. Lagoze C (2014) Big data, data integrity, and the fracturing of the control zone. Big Data Soc 1(2):1–11

  41. Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Application Delivery Strategies, META Group Inc, Atlanta

  42. Lazer D, Kennedy R, King G, Vespignani A (2014) The parable of Google flu: traps in big data analysis. Science 343:1203–1205

  43. Leonelli S (2014) What difference does quantity make? On the epistemology of big data in biology. Big Data Soc 1(1):1–11

  44. Leonelli S (2016) Data-centric biology: a philosophical study. University of Chicago Press, Chicago

  45. Lepage D, Vaidya G, Guralnick R (2014) Avibase – a database system for managing and organizing taxonomic concepts. ZooKeys 420:117–135

  46. Levinson SC (2000) Presumptive meanings: the theory of generalized conversational implicature. MIT Press, Cambridge

  47. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt, New York

  48. Meng X-L (2014) A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it). In: Lin X, Genest C, Banks DL et al (eds) Past, present, and future of statistical science. CRC Press, Boca Raton, FL, pp 537–562

  49. Merriam-Webster (2016) Metonymy. http://www.merriam-webster.com/dictionary/metonymy. Accessed 13 Jan 2017

  50. Millerand F, Ribes D, Baker KS, Bowker GC (2013) Making an issue out of a standard: storytelling practices in a scientific community. Sci Technol Hum Values 38:7–43

  51. Müller-Wille S, Charmantier I (2012) Natural history and information overload: the case of Linnaeus. Stud Hist Philos Biol Biomed Sci 43:4–15

  52. O’Malley MA (2013) When integration fails: prokaryote phylogeny and the tree of life. Stud Hist Philos Biol Biomed Sci 44:551–562

  53. Ogilvie BW (2003) The many books of nature: renaissance naturalists and information overload. J Hist Ideas 64:29–40

  54. Page RD (2016) Surfacing the deep data of taxonomy. ZooKeys 550:247–260

  55. Patterson D, Mozzherin D, Shorthouse D, Thessen A (2016) Challenges with using names to link digital biodiversity information. Biodivers Data J 4:e8080. doi:10.3897/BDJ.4.e8080

  56. Peterson AT, Navarro-Sigüenza AG (1999) Alternate species concepts as bases for determining priority conservation areas. Conserv Biol 13:427–431

  57. Piantadosi ST, Tily H, Gibson E (2012) The communicative function of ambiguity in language. Cognition 122:280–291

  58. Pullan MR, Watson MF, Kennedy JB et al (2000) The Prometheus taxonomic model: a practical approach to representing multiple classifications. Taxon 49:55–75

  59. Pullan MR, Armstrong KE, Paterson T et al (2005) The Prometheus description model: an examination of the taxonomic description-building process and its representation. Taxon 54:751–765

  60. Radford AE, Ahles HE, Bell CR (1968) Manual of the vascular flora of the Carolinas. University of North Carolina Press, Chapel Hill

  61. Remsen D (2016) The use and limits of scientific names in biological informatics. ZooKeys 550:207–223

  62. Rogers N (2016) Museum drawers go digital. Science 352:762–765

  63. Rosenberg MS (2014) Contextual cross-referencing of species names for fiddler crabs (genus Uca): an experiment in cyber-taxonomy. PLoS ONE 9(7):e101704

  64. Sepkoski D (2012) Rereading the fossil record: the growth of paleobiology as an evolutionary discipline. University of Chicago Press, Chicago

  65. Shavit A, Griesemer JR (2009) There and back again, or the problem of locality in biodiversity surveys. Philos Sci 76:273–294

  66. Shavit A, Griesemer JR (2011) Transforming objects into data: how minute technicalities of recording ‘species location’ entrench a basic challenge for biodiversity. In: Carrier M, Nordmann A (eds) Science in the context of application. Boston Studies in the Philosophy of Science, vol 274. Springer Science + Business Media, Netherlands, pp 169–193

  67. Smith BE, Johnston MK, Lücking R (2016) From GenBank to GBIF: phylogeny-based predictive niche modeling tests accuracy of taxonomic identifications in large occurrence data repositories. PLoS ONE 11(3):e0151232

  68. Star SL, Griesemer JR (1989) Institutional ecology, ‘translations’ and boundary objects: amateurs and professionals in Berkeley’s Museum of Vertebrate Zoology, 1907–39. Soc Stud Sci 19:387–420

  69. Stearn WT (1959) The background of Linnaeus’s contributions to the nomenclature and methods of systematic biology. Syst Zool 8:4–22

  70. Stevens PF (2002) Why do we name organisms? Some reminders from the past. Taxon 51:11–26

  71. Strasser BJ (2011) The experimenter’s museum: GenBank, natural history, and the moral economies of biomedicine. Isis 102:60–96

  72. Strauss AL (1993) Continual permutations of action. de Gruyter, New York

  73. Suciu D (2013) Big data begets big database theory. In: Gottlob G, Grasso G, Olteanu D, Schallhart C (eds) Proceedings of the 29th British National Conference on Databases, BNCOD 2013, Oxford, UK, July 8–10, 2013. Springer, Berlin. Lect Notes Comput Sci 7968:1–5

  74. Wilson D, Sperber D (2012) Meaning and relevance. Cambridge University Press, New York, NY

  75. Witteveen J (2015a) Naming and contingency: the type method of biological taxonomy. Biol Philos 30:569–586

  76. Witteveen J (2015b) Suppressing synonymy with a homonym: the emergence of the nomenclatural type concept in nineteenth century natural history. J Hist Biol 49:135–189

  77. Zipf G (1949) Human behavior and the principle of least effort. Addison-Wesley, New York

Acknowledgements

The authors are grateful to Hong Cui, Bertram Ludäscher, and Jonathan Rees for helpful feedback on this subject. Support of the authors’ research through the National Science Foundation is kindly acknowledged (NMF: DEB–1155984, DBI–1342595; BS: SES–1153114).

Author information

Corresponding author

Correspondence to Beckett Sterner.

Cite this article

Sterner, B., Franz, N.M. Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data. Biol Theory 12, 99–111 (2017). https://doi.org/10.1007/s13752-017-0259-5

Keywords

  • Big data
  • Cognitive pragmatics
  • Concept taxonomy
  • Data aggregation
  • Knowledge representation and reasoning
  • Nomenclature