Despite an enormous philosophical literature on models in science, surprisingly little has been written about data models and how they are constructed. In this paper, I examine the case of how paleodiversity data models are constructed from the fossil data. In particular, I show how paleontologists are using various model-based techniques to correct the data. Drawing on this research, I argue for the following related theses: first, the ‘purity’ of a data model is not a measure of its epistemic reliability. Instead it is the fidelity of the data that matters. Second, the fidelity of a data model in capturing the signal of interest is a matter of degree. Third, the fidelity of a data model can be improved ‘vicariously’, such as through the use of post hoc model-based correction techniques. And, fourth, data models, like theoretical models, should be assessed as adequate (or inadequate) for particular purposes.
What has often been overlooked in many discussions of data models is that Suppes’s view of data models is tied to the Tarskian ‘instantial’ view of models. Elsewhere it is argued that the notion of data models should be disentangled from this instantial view, and that data models, like other models in science, should be understood as representations. This move is important not only philosophically for avoiding what van Fraassen (2008) calls the “loss of reality objection,” but also for making adequate sense of scientific practice. See Parker and Bokulich (in preparation) for further discussion.
For example, the mammoth Springer Handbook of Model-Based Science (Magnani and Bertolotti 2017), though covering many excellent topics in its 53 chapters, fails to have an entry on data models.
A fuller discussion of some of the interesting parallels between data in paleontology and data in climate science is taken up in Parker and Bokulich (in preparation).
Of course, the fossil record is not just critical for understanding the processes of biological evolution, but also gives information about the history of the climate and the movements of tectonic plates. Thus, one must pay attention to the purpose for which the data is intended.
For more on the MBL model see, for example, Huss (2009).
The historian David Sepkoski is the son of the paleontologist Jack Sepkoski.
This issue of the adequacy of a data model for a purpose will be discussed further below.
Due to limited space, I will only very briefly discuss the first, skip the second, and focus primarily on the third "corrected" approach to reading the fossil record.
For an excellent philosophical discussion of punctuated equilibrium in connection with paleontology see Turner (2011).
As an example, Raup notes that the observed diversity of insects during the Cretaceous is essentially zero, not because the actual diversity was zero, but because of the absence of Lagerstätten of this time period to record them.
This method was first developed by the Woods Hole benthic ecologist Howard Sanders. While ecologists tend to use the term ‘rarefaction’, paleontologists typically prefer the term ‘subsampling’ (see Alroy 2010b, p. 61 for a discussion of the terminology).
Note that the raw taxic diversity estimate is not really "raw," insofar as it already involves substantial theoretical categorization, cleaning up, and processing. Paleontologists often seem to use the term ‘raw’ to refer to the level of data model below the data-correction techniques they are investigating; hence it is a term that can shift with context.
My use of the notion of "signal" here bears some affinity to Turner's (2007) informational interpretation of traces (e.g., 18–20). More recently Currie (2018, Chapter 3) has argued that a strictly ontological notion of trace, such as the informational view, should be replaced with an epistemic notion of trace that builds in the notion of evidential relevance. A discussion of these interesting issues is outside the scope of this work.
It should be noted that there are many different ways to implement residual diversity model corrections (involving, for example, different choice of proxies); hence, Brocklehurst's conclusion here only applies to the "optimal" implementation of the method. Significant problems have been raised with other widely-used implementations of the residuals method, especially those that use the more restricted clade-bearing formations as the proxy (see Sakamoto et al. 2017 for a discussion). I thank Mike Benton (personal communication) for underscoring this point.
These tests are of course fallible, depending on the reliability of the assumptions made in the simulation; however, this is arguably no different than elsewhere in science, which is understood to be an iterative, ongoing process.
This example follows Upchurch and Barrett (2005, p. 108).
Lazarus taxa, which are genuine descendants, must be carefully distinguished from ‘Elvis taxa’, which are not actually descendants of the original taxon, but merely appear to be, due to a similar morphology resulting from convergent evolution (Erwin and Droser 1993).
The story of the coelacanth, along with clear illustrations of ghost lineages, can be found at http://www.ucmp.berkeley.edu/taxa/verts/archosaurs/ghost_lineages.php.
Lane et al. (2005) propose the term ‘zombie lineage’ for the unsampled terminal (as opposed to initial) portion of a taxon’s range (pp. 22–23), though some authors use ‘ghost lineage’ for both.
As Brocklehurst notes, a method that cannot even perform well in the simplified simulation scenario is unlikely to perform better under the more complicated conditions found in the real world (2015, p. 12).
I am very grateful to an anonymous referee for calling my attention to this important point and the following examples.
Data reduction is just another term for the process by which raw data is turned into a scientifically useful data model by being cleaned up, ordered, and corrected.
This notion of the adequacy of a data model for a purpose is elaborated in greater detail in Parker and Bokulich (in preparation).
More precisely, I have in mind those fossil rocks that have been collected, prepared, and categorized. I will not engage the difficult question here of where exactly to draw the line between (raw) data and a data model. It may very well be that the distinction is one of degree with vague boundaries, rather than a difference of kind (though as with other vague categories, that does not mean there are no important differences); and where the line is drawn may further be context dependent. My inclination here is to say that if a fossil rock has been collected, categorized, and/or prepared, that is sufficient for it to count as a data model.
As noted before, fossil data can be taken to be a representation of more than just past life (e.g., they can also represent facts about the geological or paleoclimatological record).
Although not always required, preparation is typically needed for vertebrate fossils, and sometimes needed for invertebrate fossils as well.
While most numerical data-model correction techniques are reversible, many physical data-model correction techniques are not, and hence call for more caution.
A fuller discussion of this notion of model-data symbiosis and a taxonomy of the different ways that data can be model-filtered is provided in Bokulich (forthcoming).
Of course not all model-corrected data will be better than the raw—it will depend on the particular concrete details of the scientific case. Data correction methods typically work best when there is a) a detailed, quantitative understanding of the biases and their effects on the data and b) robust, independent lines of evidence providing the grounds for the model-based corrections.
Alroy, J. (2010a). Geographical, environmental, and intrinsic biotic controls on Phanerozoic marine diversification. Palaeontology, 53(6), 1211–1235.
Alroy, J. (2010b). Fair sampling of taxonomic richness and unbiased estimation of origination and extinction rates. In J. Alroy & G. Hunt (Eds.), Quantitative methods in paleobiology (pp. 55–80). Baltimore: The Paleontological Society.
Benton, M., Dunhill, A., Lloyd, G., & Marx, F. (2011). Assessing the quality of the fossil record: Insights from vertebrates. In A. McGowan & A. Smith (Eds.), Comparing the geological and fossil records: Implications for biodiversity studies (Vol. 358, pp. 63–94). London: Geological Society.
Benton, M., & Harper, D. (2009). Introduction to paleobiology and the fossil record. Chichester: Wiley.
Bokulich, A. (forthcoming). Towards a taxonomy of the model-ladenness of data. In Presentation in Symposium session: Exploring model-data symbiosis in the geosciences. Philosophy of Science Association Biennial Meeting, November 2018, Seattle, WA.
Brocklehurst, N. (2015). A simulation-based examination of residual diversity estimates as a method of correcting for sampling bias. Palaeontologia Electronica, 18.3.7T, 1–15.
Collins, M., & Simberloff, D. (2009). Rarefaction and nonrandom spatial dispersion patterns. Environmental and Ecological Statistics, 16, 89–103.
Currie, A. (2018). Rock, bone, and ruin: An optimist’s guide to the historical sciences. Cambridge, MA: The MIT Press.
Darwin, C. (1859). On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London: John Murray. Retrieved from https://en.wikisource.org/w/index.php?title=On_the_Origin_of_Species_(1859)&oldid=6512451.
Edwards, P. (2001). Representing the global atmosphere: Computer models, data, and knowledge about climate change. In C. Miller & P. Edwards (Eds.), Changing the atmosphere: Expert knowledge and environmental governance (pp. 31–65). Cambridge, MA: MIT Press.
Edwards, P. (2010). A vast machine: Computer models, climate data, and the politics of global warming. Cambridge, MA: MIT Press.
Eldredge, N., & Gould, S. J. (1972). Punctuated equilibria: An alternative to phyletic gradualism. In T. Schopf (Ed.), Models in paleobiology (pp. 82–115). San Francisco: Freeman, Cooper, and Co.
Erwin, D., & Droser, M. (1993). Elvis taxa. Palaios, 8, 623–624.
Foote, M. (1996). Perspective: Evolutionary patterns in the fossil record. Evolution, 50(1), 1–11.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264.
Gould, S. J., Raup, D., Sepkoski, J., Jr., Schopf, T., & Simberloff, D. (1977). The shape of evolution: A comparison of real and random clades. Paleobiology, 3, 23–40.
Huss, J. (2009). The shape of evolution: The MBL model and clade shape. In D. Sepkoski & M. Ruse (Eds.), The paleobiological revolution: Essays on the growth of modern paleontology. Chicago: University of Chicago Press.
Lane, A., Janis, C., & Sepkoski, J. (2005). Estimating paleodiversities: A test of taxic and phylogenetic methods. Paleobiology, 31(1), 21–34.
Leonelli, S. (2016). Data-centric biology: A philosophical study. Chicago: University of Chicago Press.
Lyell, C. (1830). Principles of geology: Being an attempt to explain the former changes of the earth’s surface, by references to causes now in operation. London: John Murray. Retrieved from http://www.esp.org/books/lyell/principles/facsimile/contents/lyell-v1-aa-fm.pdf.
Magnani, L., & Bertolotti, T. (Eds.). (2017). Springer handbook of model-based science. Dordrecht: Springer.
Metcalfe, I., & Isozaki, Y. (2009). Current perspectives on the Permian-Triassic boundary and end-Permian mass extinction: Preface. Journal of Asian Earth Sciences, 36, 407–412.
Norton, S., & Suppe, F. (2001). Why atmospheric modeling is good science. In C. Miller & P. Edwards (Eds.), Changing the atmosphere: Expert knowledge and environmental governance (pp. 67–105). Cambridge, MA: MIT Press.
Norell, M. (1993). Tree-based approaches to understanding history: Comments on ranks, rules, and the quality of the fossil record. American Journal of Science, 293, 407–417.
Parker, W. (2010). Scientific models and adequacy for purpose. The Modern Schoolman, 87, 285–293.
Parker, W., & Bokulich, A. (in preparation). Data models, representation, and adequacy-for-purpose.
Raup, D. (1972). Taxonomic diversity during the Phanerozoic. Science, 177(4054), 1065–1071.
Raup, D. (1975). Taxonomic diversity estimation using rarefaction. Paleobiology, 1, 333–342.
Sakamoto, M., Benton, M., & Venditti, C. (2016). Dinosaurs in decline tens of millions of years before their final extinction. Proceedings of the National Academy of Sciences, 113(18), 5036–5040.
Sakamoto, M., Venditti, C., & Benton, M. (2017). ‘Residual diversity estimates’ do not correct for sampling bias in palaeodiversity data. Methods in Ecology and Evolution, 8, 453–459.
Sepkoski, J. (1982). Compendium of fossil marine families. Milwaukee Public Museum Contributions in Biology and Geology, 51, 1–125.
Sepkoski, J. (1984). A kinetic model of Phanerozoic taxonomic diversity. III. Post-Paleozoic families and mass extinctions. Paleobiology, 10(2), 246–267.
Sepkoski, J. (1994). What I did with my research career: Or how research on biodiversity yielded data on extinction. In W. Glenn (Ed.), Mass-extinction debates: How science works in a crisis. Stanford, CA: Stanford University Press.
Sepkoski, D. (2012a). Reading the fossil record: The growth of paleobiology as an evolutionary discipline. Chicago: University of Chicago Press.
Sepkoski, D. (2012b). ‘Replaying life’s tape’: Simulations, metaphors, and historicity in Stephen Jay Gould’s view of life. Studies in History and Philosophy of Biological and Biomedical Sciences, 58, 73–81.
Sepkoski, D. (2013). ‘Towards a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000. Journal of the History of Biology, 46, 401–444.
Sepkoski, D. (2016). ‘Replaying life’s tape’: Simulations, metaphors, and historicity in Stephen Jay Gould’s view of life. Studies in History and Philosophy of Biological and Biomedical Sciences, 58, 73–81.
Sepkoski, D., & Ruse, M. (2009). The paleobiological revolution: Essays on the growth of modern paleontology. Chicago: University of Chicago Press.
Signor, P., III, & Lipps, J. (1982). Sampling bias, gradual extinction patterns and catastrophes in the fossil record. In L. Silver & P. Schultz (Eds.), Geological implications of large asteroids and comets on the earth (Vol. 190, pp. 291–296). Boulder: Geological Society of America.
Smith, A. (1994). Systematics and the fossil record: Documenting evolutionary patterns. Oxford: Blackwell Science Ltd.
Smith, A., & McGowan, A. (2007). The shape of the Phanerozoic marine paleodiversity curve: How much can be predicted from the sedimentary rock record of Western Europe? Palaeontology, 50(4), 765–774.
Suppes, P. (1962). Models of data. In E. Nagel, P. Suppes, & A. Tarski (Eds.), Logic, methodology and philosophy of science: Proceedings of the 1960 international congress (pp. 252–261). Stanford: Stanford University Press.
Turner, D. (2007). Making prehistory: Historical science and the scientific realism debate. Cambridge studies in philosophy and biology. Cambridge: Cambridge University Press.
Turner, D. (2011). Paleontology: A philosophical introduction. Cambridge: Cambridge University Press.
Upchurch, P., & Barrett, P. (2005). Phylogenetic and taxic perspectives on sauropod diversity. In K. Rogers & J. Wilson (Eds.), The sauropods: Evolution and paleobiology (pp. 104–124). Berkeley: University of California Press.
van Fraassen, B. (2008). Scientific representation: Paradoxes of perspective. Oxford: Clarendon Press.
Wylie, C. (2009). Preparation in action: Paleontological skill and the role of the fossil preparator. In M. Brown, J. Kane, & W. Parker (Eds.), Methods in fossil preparation: Proceedings of the first annual fossil preparation and collections symposium (pp. 3–12).
Wylie, C. (2016). “Overcoming underdetermination” on extinct: The philosophy of palaeontology blog (April 11, 2016). Retrieved August 5, 2017 from http://www.extinctblog.org/extinct/2016/4/11/overcoming-underdetermination.
I am grateful to Wendy Parker, Adrian Currie, Mike Benton, and two anonymous referees for helpful comments on an earlier version of this paper. I also thank Demetris Portides for first encouraging me to write this paper and for his patience seeing it through to completion. I gratefully acknowledge the support of the Institute of Advanced Study at Durham University, COFUND Senior Research Fellowship, under EU grant agreement number 609412.
Bokulich, A. Using models to correct data: paleodiversity and the fossil record. Synthese (2018). https://doi.org/10.1007/s11229-018-1820-x
- Climate science
- Data models