Indexing Finite Language Representation of Population Genotypes

  • Jouni Sirén
  • Niko Välimäki
  • Veli Mäkinen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6833)


We propose a way to index population genotype information together with the complete genome sequence, so that the index can be used to efficiently align a given sequence to the genome with all plausible genotype recombinations taken into account. This is achieved by converting a multiple alignment of individual genomes into a finite automaton that recognizes all strings that can be read from the alignment by switching the sequence at any time. The finite automaton is indexed with an extension of the Burrows-Wheeler transform to allow pattern search inside the plausible recombinant sequences. The size of the index stays limited because of the high similarity of individual genomes. The index finds applications in variation calling and in primer design.
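To make the alignment-to-automaton idea concrete, the following is a minimal sketch in Python. It is a simplification, not the authors' construction. It assumes that equal characters in the same alignment column are merged into one node and that each row links its consecutive non-gap characters, so a path may switch rows wherever rows share a node. The names build_automaton and occurs are hypothetical, and the search is a naive path walk rather than the Burrows-Wheeler-based index described in the abstract.

```python
def build_automaton(alignment, gap="-"):
    """Build a toy automaton from equal-length aligned rows.

    Assumption of this sketch: a node is (column, character); each row
    adds an edge between its consecutive non-gap characters.
    """
    edges = {}
    for row in alignment:
        prev = None
        for col, ch in enumerate(row):
            if ch == gap:
                continue  # gaps contribute no node
            node = (col, ch)
            edges.setdefault(node, set())
            if prev is not None:
                edges[prev].add(node)
            prev = node
    return edges


def occurs(edges, pattern):
    """Return True if `pattern` labels some path in the automaton.

    Naive search: track the set of nodes reachable after each character.
    """
    if not pattern:
        return True
    frontier = {n for n in edges if n[1] == pattern[0]}
    for ch in pattern[1:]:
        frontier = {m for n in frontier for m in edges[n] if m[1] == ch}
        if not frontier:
            return False
    return bool(frontier)


# Two aligned toy "genomes" that differ in columns 1 and 3.
automaton = build_automaton(["ACGT", "AGGA"])
print(occurs(automaton, "ACGA"))  # True: a recombinant of the two rows
print(occurs(automaton, "ACT"))   # False: no path spells this string
```

In this toy example "ACGA" occurs in neither input row, yet the automaton accepts it because both rows pass through the shared node for G in column 2. That is the kind of recombinant sequence the abstract refers to.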


Keywords (added by machine, not by the authors): Multiple Alignment; Exact Match; Edit Distance; Variation Calling; Outgoing Edge





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Jouni Sirén (1)
  • Niko Välimäki (1)
  • Veli Mäkinen (1)
  1. Helsinki Institute for Information Technology (HIIT) & Department of Computer Science, University of Helsinki, Finland
