Abstract
In this paper we give a mathematically precise formulation of an old idea in bacterial taxonomy, namely cumulative classification, in which the taxonomy is continuously updated and possibly augmented as new strains are identified. Our formulation is based on Bayesian predictive probability distributions. The criterion for founding a new taxon rests on a firm theoretical foundation based on prediction and has a clear-cut interpretation. We formulate an algorithm for cumulative classification and apply it to a large database of bacteria belonging to the family Enterobacteriaceae. The resulting taxonomy makes microbiological sense.
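To make the idea of cumulative classification concrete, the following is a minimal illustrative sketch, not the algorithm of the paper. It assumes that each strain is a vector of binary test results, that tests are independent within a taxon, and that each test has a uniform Beta(1, 1) prior, so that the predictive probability of a positive result follows Laplace's rule of succession. The prior weight `alpha` for founding a new taxon and the function names `log_predictive` and `cumulative_classify` are hypothetical choices made for this example only.

```python
import math


def log_predictive(x, counts, n):
    """Log predictive probability of binary vector x under one taxon.

    counts[j] is the number of the taxon's n members positive on test j.
    Assumption: independent tests with a uniform Beta(1, 1) prior each.
    """
    total = 0.0
    for xj, cj in zip(x, counts):
        p_pos = (cj + 1) / (n + 2)  # Laplace's rule of succession
        total += math.log(p_pos if xj else 1.0 - p_pos)
    return total


def cumulative_classify(strains, alpha=1.0):
    """Assign strains one at a time, founding a new taxon when that is
    the most probable explanation of the incoming strain.

    alpha is a hypothetical prior weight for "new taxon"; existing taxa
    are weighted by their current size.
    """
    taxa = []    # each taxon: {"n": member count, "counts": positives per test}
    labels = []
    for seen, x in enumerate(strains):
        d = len(x)
        # Score of founding a new taxon: prior weight times the predictive
        # probability under an empty taxon (0.5 per test).
        best_k = len(taxa)
        best = math.log(alpha / (seen + alpha)) + log_predictive(x, [0] * d, 0)
        for k, t in enumerate(taxa):
            score = math.log(t["n"] / (seen + alpha))
            score += log_predictive(x, t["counts"], t["n"])
            if score > best:
                best_k, best = k, score
        if best_k == len(taxa):
            taxa.append({"n": 0, "counts": [0] * d})
        # Update the chosen taxon so later predictions use this strain too.
        t = taxa[best_k]
        t["n"] += 1
        t["counts"] = [c + xj for c, xj in zip(t["counts"], x)]
        labels.append(best_k)
    return labels, taxa


strains = [
    (1, 1, 1, 1, 0, 0, 0, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (0, 0, 0, 0, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1, 1),
    (1, 1, 1, 1, 0, 0, 0, 1),
]
labels, taxa = cumulative_classify(strains)
print(labels)
```

In this toy run the third strain differs from the first taxon on every test, so founding a new taxon scores higher than joining the existing one, while the last strain differs on a single test and is absorbed into the first taxon. The assignment here is greedy and order dependent, which is a simplification chosen for brevity.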
Cite this article
Gyllenberg, M., Koski, T., Lund, T. et al. Bayesian predictive identification and cumulative classification of bacteria. Bull. Math. Biol. 61, 85–111 (1999). https://doi.org/10.1006/bulm.1998.0076
DOI: https://doi.org/10.1006/bulm.1998.0076