Identifying Change-Points in Biological Sequences via Sequential Importance Sampling

Environmental Modeling & Assessment

Abstract

The genomes of complex organisms, including the human genome, are highly structured. This structure takes the form of segmental patterns of variation in various properties and may be caused by the division of genomes into regions of distinct function, by the contingent evolutionary processes that gave rise to genomes, or by a combination of both. Whatever the cause, identifying the change-points between segments is potentially important, as a means of discovering the functional components of a genome, understanding the evolutionary processes involved, and fully describing genomic architecture. One property of genomes that is known to display a segmental pattern of variation is GC content. The GC content of a portion of DNA is the proportion of GC pairs that it contains. Sharp changes in GC content can be observed in human and other genomes. Such change-points may be the boundaries of functional elements or may play a structural role. We model genome sequences as a multiple change-point process, that is, a process in which sequential data are separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process. We consider a Sequential Importance Sampling approach to change-point modeling using Monte Carlo simulation to find estimates of change-points as well as parameters of the process on each segment. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates for the locations of change-points in artificially generated sequences and compare the accuracy of these estimates to those obtained via Markov chain Monte Carlo and a well-known method, IsoFinder. We also provide examples with real data sets to illustrate the usefulness of this method.
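As an illustration of the setting described in the abstract, the following minimal sketch generates an artificial sequence whose GC probability changes at known change-points and then computes the GC content of each segment. It is not the authors' Sequential Importance Sampling procedure. The function names, the segment probabilities, and the independent-base assumption are all hypothetical choices made for this example, and GC content is computed on a single strand as the proportion of G or C bases.

```python
import random


def simulate_sequence(change_points, gc_probs, length, seed=0):
    """Generate a DNA string with a different GC probability on each segment.

    change_points: sorted positions where a new segment starts (hypothetical input).
    gc_probs: one GC probability per segment, so len(change_points) + 1 values.
    Bases are drawn independently, which is a simplifying assumption.
    """
    if len(gc_probs) != len(change_points) + 1:
        raise ValueError("need exactly one GC probability per segment")
    rng = random.Random(seed)
    bounds = [0] + list(change_points) + [length]
    bases = []
    for start, end, p in zip(bounds[:-1], bounds[1:], gc_probs):
        for _ in range(end - start):
            # With probability p emit G or C, otherwise emit A or T.
            bases.append(rng.choice("GC") if rng.random() < p else rng.choice("AT"))
    return "".join(bases)


def gc_content(segment):
    """Proportion of G or C bases in a segment (0.0 for an empty segment)."""
    if not segment:
        return 0.0
    return sum(base in "GC" for base in segment) / len(segment)


def segment_gc(seq, change_points):
    """GC content of each segment delimited by the given change-points."""
    bounds = [0] + list(change_points) + [len(seq)]
    return [gc_content(seq[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]


if __name__ == "__main__":
    true_change_points = [3000, 6000]
    seq = simulate_sequence(true_change_points, [0.35, 0.55, 0.40], 9000)
    for i, gc in enumerate(segment_gc(seq, true_change_points), start=1):
        print(f"segment {i}: GC content = {gc:.3f}")
```

A change-point method would receive only `seq` and would have to estimate both the number and the locations of the change-points; here the true positions are supplied solely to show the segmental pattern in GC content.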

Acknowledgements

G. Yu. Sofronov and D. P. Kroese acknowledge the support of an Australian Research Council Discovery grant (DP0556631). J. M. Keith acknowledges the support of Australian Research Council Discovery grants (DP0452412, DP0556631) and a National Health and Medical Research Council grant, “Statistical methods and algorithms for analysis of high-throughput genetics and genomics platforms” (389892).

Author information

Corresponding author

Correspondence to George Yu. Sofronov.

Additional information

This is an extended version of a paper presented at the 17th Biennial Congress on Modelling and Simulation, Christchurch, New Zealand, December 2007.

About this article

Cite this article

Sofronov, G.Y., Evans, G.E., Keith, J.M. et al. Identifying Change-Points in Biological Sequences via Sequential Importance Sampling. Environ Model Assess 14, 577–584 (2009). https://doi.org/10.1007/s10666-008-9160-8

  • DOI: https://doi.org/10.1007/s10666-008-9160-8
