Using models to correct data: paleodiversity and the fossil record


Despite an enormous philosophical literature on models in science, surprisingly little has been written about data models and how they are constructed. In this paper, I examine how paleodiversity data models are constructed from the fossil data. In particular, I show how paleontologists are using various model-based techniques to correct the data. Drawing on this research, I argue for the following related theses. First, the ‘purity’ of a data model is not a measure of its epistemic reliability; instead, it is the fidelity of the data that matters. Second, the fidelity of a data model in capturing the signal of interest is a matter of degree. Third, the fidelity of a data model can be improved ‘vicariously’, such as through the use of post hoc model-based correction techniques. Fourth, data models, like theoretical models, should be assessed as adequate (or inadequate) for particular purposes.

Fig. 1 (Metcalfe and Isozaki 2009, Fig. 1, after Sepkoski 1984; with permission from Elsevier)

Fig. 2 (Redrawn after Upchurch and Barrett 2005)


  1. What has often been overlooked in many discussions of data models is that Suppes’s view of data models is tied to the Tarskian ‘instantial’ view of models. Elsewhere it is argued that the notion of data models should be disentangled from this instantial view, and that data models, like other models in science, should be understood as representations. This move is important not only philosophically for avoiding what van Fraassen (2008) calls the “loss of reality objection,” but also for making adequate sense of scientific practice. See Parker and Bokulich (in preparation) for further discussion.

  2. For example, the mammoth Springer Handbook of Model-Based Science (Magnani and Bertolotti 2017), though covering many excellent topics in its 53 chapters, fails to have an entry on data models.

  3. A fuller discussion of some of the interesting parallels between data in paleontology and data in climate science is taken up in Parker and Bokulich (in preparation).

  4. Of course, the fossil record is not just critical for understanding the processes of biological evolution, but also gives information about the history of the climate and the movements of tectonic plates. Thus, one must pay attention to the purpose for which the data is intended.

  5. For more on the MBL model see, for example, Huss (2009).

  6. Such subtraction models play an important role not only in current paleontological research (e.g., Smith and McGowan 2007, “residuals method”), but also in current climate research, where they have been termed “intermediate models” (e.g., Edwards 2001, p. 61).

  7. The historian David Sepkoski is the son of the paleontologist Jack Sepkoski.

  8. This issue of the adequacy of a data model for a purpose will be discussed further below.

  9. Due to limited space, I will only very briefly discuss the first, skip the second, and focus primarily on the third "corrected" approach to reading the fossil record.

  10. For an excellent philosophical discussion of punctuated equilibrium in connection with paleontology see Turner (2011).

  11. As an example, Raup notes that the observed diversity of insects during the Cretaceous is essentially zero, not because the actual diversity was zero, but because of the absence of Lagerstätten of this time period to record them.

  12. This method was first developed by the Woods Hole benthic ecologist Howard Sanders. While ecologists tend to use the term ‘rarefaction’, paleontologists typically prefer the term ‘subsampling’ (see Alroy 2010b, p. 61 for a discussion of the terminology).

  13. Note that the raw taxic diversity estimate is not really "raw," insofar as it already involves substantial theoretical categorization, cleaning up, and processing. Paleontologists often seem to use the term ‘raw’ to refer to the level of data model below the data-correction techniques they are investigating; hence it is a term that can shift with context.

  14. My use of the notion of "signal" here bears some affinity to Turner's (2007) informational interpretation of traces (e.g., 18–20). More recently Currie (2018, Chapter 3) has argued that a strictly ontological notion of trace, such as the informational view, should be replaced with an epistemic notion of trace that builds in the notion of evidential relevance. A discussion of these interesting issues is outside the scope of this work.

  15. It should be noted that there are many different ways to implement residual diversity model corrections (involving, for example, different choices of proxies); hence, Brocklehurst's conclusion here only applies to the "optimal" implementation of the method. Significant problems have been raised with other widely used implementations of the residuals method, especially those that use the more restricted clade-bearing formations as the proxy (see Sakamoto et al. 2017 for a discussion). I thank Mike Benton (personal communication) for underscoring this point.

  16. These tests are of course fallible, depending on the reliability of the assumptions made in the simulation; however, this is arguably no different than elsewhere in science, which is understood to be an iterative, ongoing process.

  17. This example follows Upchurch and Barrett (2005, p. 108).

  18. Lazarus taxa, which are genuine descendants, must be carefully distinguished from ‘Elvis taxa’, which are not actually descendants of the original taxon, but merely appear to be, due to a similar morphology resulting from convergent evolution (Erwin and Droser 1993).

  19. The story of the coelacanth along with clear illustrations of ghost lineages can be found at

  20. Lane et al. (2005) propose the term ‘zombie lineage’ for the unsampled terminal (as opposed to initial) portion of a taxon’s range (pp. 22–23), though some authors use ‘ghost lineage’ for both.

  21. As Brocklehurst notes, a method that cannot even perform well in the simplified simulation scenario is unlikely to perform better under the more complicated conditions found in the real world (2015, p. 12).

  22. I am very grateful to an anonymous referee for calling my attention to this important point and the following examples.

  23. Data reduction is just another term for the process by which raw data is turned into a scientifically useful data model by being cleaned up, ordered, and corrected.

  24. This notion of the adequacy of a data model for a purpose is elaborated in greater detail in Parker and Bokulich (in preparation).

  25. More precisely, I have in mind those fossil rocks that have been collected, prepared, and categorized. I will not engage the difficult question here of where exactly to draw the line between (raw) data and a data model. It may very well be that the distinction is one of degree with vague boundaries, rather than a difference of kind (though as with other vague categories, that does not mean there are no important differences); and where the line is drawn may further be context dependent. My inclination here is to say that if a fossil rock has been collected, categorized, and/or prepared, that is sufficient for it to count as a data model.

  26. As noted before, fossil data can be taken to be a representation of more than just past life (e.g., they can also represent facts about the geological or paleoclimatological record).

  27. Although not always required, preparation is typically needed for vertebrate fossils, and sometimes needed for invertebrate fossils as well.

  28. While most numerical data-model correction techniques are reversible, many physical data-model correction techniques are not, and hence call for more caution.

  29. A fuller discussion of this notion of model-data symbiosis and a taxonomy of the different ways that data can be model-filtered is provided in Bokulich (forthcoming).

  30. Of course not all model-corrected data will be better than the raw; it will depend on the particular concrete details of the scientific case. Data correction methods typically work best when there is a) a detailed, quantitative understanding of the biases and their effects on the data and b) robust, independent lines of evidence providing the grounds for the model-based corrections.


  1. Alroy, J. (2010a). Geographical, environmental, and intrinsic biotic controls on Phanerozoic marine diversification. Palaeontology, 53(6), 1211–1235.

  2. Alroy, J. (2010b). Fair sampling of taxonomic richness and unbiased estimation of origination and extinction rates. In J. Alroy & G. Hunt (Eds.), Quantitative methods in paleobiology (pp. 55–80). Baltimore: The Paleontological Society.

  3. Benton, M., Dunhill, A., Lloyd, G., & Marx, F. (2011). Assessing the quality of the fossil record: Insights from vertebrates. In A. McGowan & A. Smith (Eds.), Comparing the geological and fossil records: Implications for biodiversity studies (Vol. 358, pp. 63–94). London: Geological Society.

  4. Benton, M., & Harper, D. (2009). Introduction to paleobiology and the fossil record. Chichester: Wiley.

  5. Bokulich, A. (forthcoming). Towards a taxonomy of the model-ladenness of data. In Presentation in Symposium session: Exploring model-data symbiosis in the geosciences. Philosophy of Science Association Biennial Meeting, November 2018, Seattle, WA.

  6. Brocklehurst, N. (2015). A simulation-based examination of residual diversity estimates as a method of correcting for sampling bias. Palaeontologia Electronica, 18.3.7T, 1–15.

  7. Collins, M., & Simberloff, D. (2009). Rarefaction and nonrandom spatial dispersion patterns. Environmental and Ecological Statistics, 16, 89–103.

  8. Currie, A. (2018). Rock, bone, and ruin: An optimist’s guide to the historical sciences. Cambridge, MA: The MIT Press.

  9. Darwin, C. (1859). On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London: John Murray. Retrieved from

  10. Edwards, P. (2001). Representing the global atmosphere: Computer models, data, and knowledge about climate change. In C. Miller & P. Edwards (Eds.), Changing the atmosphere: Expert knowledge and environmental governance (pp. 31–65). Cambridge, MA: MIT Press.

  11. Edwards, P. (2010). A vast machine: Computer models, climate data, and the politics of global warming. Cambridge, MA: MIT Press.

  12. Eldredge, N., & Gould, S. J. (1972). Punctuated equilibria: An alternative to phyletic gradualism. In T. Schopf (Ed.), Models in paleobiology (pp. 82–115). San Francisco: Freeman, Cooper, and Co.

  13. Erwin, D., & Droser, M. (1993). Elvis taxa. Palaios, 8, 623–624.

  14. Foote, M. (1996). Perspective: Evolutionary patterns in the fossil record. Evolution, 50(1), 1–11.

  15. Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264.

  16. Gould, S. J., Raup, D., Sepkoski, J., Jr., Schopf, T., & Simberloff, D. (1977). The shape of evolution: A comparison of real and random clades. Paleobiology, 3, 23–40.

  17. Huss, J. (2009). The shape of evolution: The MBL model and clade shape. In D. Sepkoski & M. Ruse (Eds.), The paleobiological revolution: Essays on the growth of modern paleontology. Chicago: University of Chicago Press.

  18. Lane, A., Janis, C., & Sepkoski, J. (2005). Estimating paleodiversities: A test of taxic and phylogenetic methods. Paleobiology, 31(1), 21–34.

  19. Leonelli, S. (2016). Data-centric biology: A philosophical study. Chicago: University of Chicago Press.

  20. Lyell, C. (1830). Principles of geology: Being an attempt to explain the former changes of the earth’s surface, by references to causes now in operation. London: John Murray. Retrieved from

  21. Magnani, L., & Bertolotti, T. (Eds.). (2017). Springer handbook of model-based science. Dordrecht: Springer.

  22. Metcalfe, I., & Isozaki, Y. (2009). Current perspectives on the Permian-Triassic boundary and end-Permian mass extinction: Preface. Journal of Asian Earth Sciences, 36, 407–412.

  23. Norton, S., & Suppe, F. (2001). Why atmospheric modeling is good science. In C. Miller & P. Edwards (Eds.), Changing the atmosphere: Expert knowledge and environmental governance (pp. 67–105). Cambridge, MA: MIT Press.

  24. Norell, M. (1993). Tree-based approaches to understanding history: Comments on ranks, rules, and the quality of the fossil record. American Journal of Science, 293, 407–417.

  25. Parker, W. (2010). Scientific models and adequacy for purpose. The Modern Schoolman, 87, 285–293.

  26. Parker, W., & Bokulich, A. (in preparation). Data models, representation, and adequacy-for-purpose.

  27. Raup, D. (1972). Taxonomic diversity during the Phanerozoic. Science, 177(4054), 1065–1071.

  28. Raup, D. (1975). Taxonomic diversity estimation using rarefaction. Paleobiology, 1, 333–342.

  29. Sakamoto, M., Benton, M., & Venditti, C. (2016). Dinosaurs in decline tens of millions of years before their final extinction. Proceedings of the National Academy of Sciences, 113(18), 5036–5040.

  30. Sakamoto, M., Venditti, C., & Benton, M. (2017). ‘Residual diversity estimates’ do not correct for sampling bias in palaeodiversity data. Methods in Ecology and Evolution, 8, 453–459.

  31. Sepkoski, J. (1982). Compendium of fossil marine families. Milwaukee Public Museum Contributions in Biology and Geology, 51, 1–125.

  32. Sepkoski, J. (1984). A kinetic model of Phanerozoic taxonomic diversity. III. Post-Paleozoic families and mass extinctions. Paleobiology, 10(2), 246–267.

  33. Sepkoski, J. (1994). What I did with my research career: Or how research on biodiversity yielded data on extinction. In W. Glenn (Ed.), Mass-extinction debates: How science works in a crisis. Stanford, CA: Stanford University Press.

  34. Sepkoski, D. (2012a). Reading the fossil record: The growth of paleobiology as an evolutionary discipline. Chicago: University of Chicago Press.

  35. Sepkoski, D. (2012b). ‘Replaying life’s tape’: Simulations, metaphors, and historicity in Stephen Jay Gould’s view of life. Studies in History and Philosophy of Biological and Biomedical Sciences, 58, 73–81.

  36. Sepkoski, D. (2013). ‘Towards a natural history of data’: Evolving practices and epistemologies of data in paleontology, 1800–2000. Journal of the History of Biology, 46, 401–444.

  37. Sepkoski, D. (2016). ‘Replaying life’s tape’: Simulations, metaphors, and historicity in Stephen Jay Gould’s view of life. Studies in History and Philosophy of Biological and Biomedical Sciences, 58, 73–81.

  38. Sepkoski, D., & Ruse, M. (2009). The paleobiological revolution: Essays on the growth of modern paleontology. Chicago: University of Chicago Press.

  39. Signor, P., III, & Lipps, J. (1982). Sampling bias, gradual extinction patterns and catastrophes in the fossil record. In L. Silver & P. Schultz (Eds.), Geological implications of large asteroids and comets on the earth (Vol. 190, pp. 291–296). Boulder: Geological Society of America.

  40. Smith, A. (1994). Systematics and the fossil record: Documenting evolutionary patterns. Oxford: Blackwell Science Ltd.

  41. Smith, A., & McGowan, A. (2007). The shape of the Phanerozoic marine paleodiversity curve: How much can be predicted from the sedimentary rock record of Western Europe. Palaeontology, 50(4), 765–774.

  42. Suppes, P. (1962). Models of data. In E. Nagel, P. Suppes, & A. Tarski (Eds.), Logic, methodology and philosophy of science: Proceedings of the 1960 international congress (pp. 252–261). Stanford: Stanford University Press.

  43. Turner, D. (2007). Making prehistory: Historical science and the scientific realism debate. Cambridge studies in philosophy and biology. Cambridge: Cambridge University Press.

  44. Turner, D. (2011). Paleontology: A philosophical introduction. Cambridge: Cambridge University Press.

  45. Upchurch, P., & Barrett, P. (2005). Phylogenetic and taxic perspectives on sauropod diversity. In K. Rogers & J. Wilson (Eds.), The sauropods: Evolution and paleobiology (pp. 104–124). Berkeley: University of California Press.

  46. van Fraassen, B. (2008). Scientific representation: Paradoxes of perspective. Oxford: Clarendon Press.

  47. Wylie, C. (2009). Preparation in action: Paleontological skill and the role of the fossil preparator. In M. Brown, J. Kane, & W. Parker (Eds.), Methods in fossil preparation: Proceedings of the first annual fossil preparation and collections symposium (pp. 3–12).

  48. Wylie, C. (2016). “Overcoming underdetermination” on extinct: The philosophy of palaeontology blog (April 11, 2016). Retrieved August 5, 2017 from


I am grateful to Wendy Parker, Adrian Currie, Mike Benton, and two anonymous referees for helpful comments on an earlier version of this paper. I also thank Demetris Portides for first encouraging me to write this paper and for his patience seeing it through to completion. I gratefully acknowledge the support of the Institute of Advanced Study at Durham University, COFUND Senior Research Fellowship, under EU grant agreement number 609412.

Correspondence to Alisa Bokulich.

Bokulich, A. Using models to correct data: paleodiversity and the fossil record. Synthese (2018).


  • Paleontology
  • Paleobiology
  • Evolution
  • Data
  • Model
  • Suppes
  • Fossil
  • Biodiversity
  • Representation
  • Simulations
  • Climate science
  • Sepkoski
  • Data models