Skip to main content

Fast Implementations of Markov Clustering for Protein Sequence Grouping

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2013)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8234))

Abstract

Two efficient versions of a Markov clustering algorithm are proposed, suitable for fast and accurate grouping of protein sequences. First, the essence of the matrix splitting approach consists in optimal reordering of rows and columns in the similarity matrix after every iteration, transforming it into a matrix with several compact blocks along the diagonal, and zero similarities outside the blocks. These blocks are treated separately in later iterations, thus significantly reducing the overall computational load. Alternately, a special sparse matrix architecture is employed to represent the similarity matrix of the Markov clustering algorithm, which also helps getting rid of a severe amount of unnecessary computations. The proposed algorithms were tested to classify sequences of protein databases like SCOP95. The proposed solutions achieve a speed-up factor in the range 15-300 compared to the conventionally implemented Markov clustering, depending on input data size and parameter settings, without damaging the partition accuracy. The convergence is usually reached in 40-50 iterations. Combining the two proposed approaches brings us close to the 1000 times speed-up ratio.

This work was supported by the Hungarian National Research Funds (OTKA) under grant no. PD103921, the Hungarian Academy of Science through the János Bolyai Fellowship Program.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Altschul, S.F., Madden, T.L., Schaffen, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search program. Nucleic Acids Res. 25, 3389–3402 (1997)

    Article  Google Scholar 

  2. Andreeva, A., Howorth, D., Chadonia, J.M., Brenner, S.E., Hubbard, T.J.P., Chothia, C., Murzin, A.G.: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D419–D425 (2008)

    Article  Google Scholar 

  3. Dayhoff, M.O.: The origin and evolution of protein superfamilies. Fed. Proc. 35, 2132–2138 (1976)

    Google Scholar 

  4. Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14, 755–763 (1998)

    Article  Google Scholar 

  5. Enright, A.J., van Dongen, S., Ouzounis, C.A.: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002)

    Article  Google Scholar 

  6. Everitt, B.S., Landau, S., Leese, M., Stahl, D.: Cluster Analysis, 5th edn. John Wiley & Sons, Chichester (2011)

    Book  MATH  Google Scholar 

  7. Gáspári, Z., Vlahovicek, K., Pongor, S.: Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm. Bioinformatics 21, 3322–3323 (2005)

    Article  Google Scholar 

  8. Heger, A., Holm, L.: Towards a covering set of protein family profiles. Prog. Biophys. Mol. Bio. 73, 321–337 (2000)

    Article  Google Scholar 

  9. Hegyi, H., Gerstein, M.: The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999)

    Article  Google Scholar 

  10. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)

    Article  Google Scholar 

  11. Protein Classification Benchmark Collection, http://net.icgeb.org/benchmark

  12. Structural Classification of Proteins database, http://scop.mrc-lmb.cam.ac.uk/scop

  13. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)

    Article  Google Scholar 

  14. Szilágyi, L., Medvés, L., Szilágyi, S.M.: A modified Markov clustering approach to unsupervised classification of protein sequences. Neurocomputing 73, 2332–2345 (2010)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Szilágyi, L., Szilágyi, S.M. (2013). Fast Implementations of Markov Clustering for Protein Sequence Grouping. In: Torra, V., Narukawa, Y., Navarro-Arribas, G., Megías, D. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2013. Lecture Notes in Computer Science(), vol 8234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41550-0_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-41550-0_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41549-4

  • Online ISBN: 978-3-642-41550-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics