Abstract
The Human Genome Project and other genome projects provide rich sources of data that invite many new forms of statistical analysis. The nature of these data often differs from that in many other areas of science, which has led to novel forms of data analysis not found in the classical statistical literature. The purpose of this paper is to describe some of these new forms, focusing on cases where the biology drives the questions asked and where the statistical analysis both presents new features and raises further challenges.
Ewens, W.J. On the Use of Statistics in Genomics and Bioinformatics. J Stat Theory Pract 2, 159–172 (2008). https://doi.org/10.1080/15598608.2008.10411868