Protein Sequence Analysis by Proximities

  • Protocol
Statistical Analysis in Proteomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 1362)

Abstract

Sequence data are widely used to gain deeper insight into biological systems. From a data analysis perspective, they are given as a set of symbol sequences of varying length and are generally compared using nonmetric score functions. In this form the data are nonstandard: they do not provide an immediate metric vector space, which complicates their analysis with standard methods. In this chapter we provide various strategies for analyzing this type of data in a mathematically accurate way, instead of the ad hoc solutions that are often seen. Our approach is based on scoring values from protein sequence data, although it is applicable in a broader sense. We discuss potential recoding concepts for the scores, as well as algorithms that solve clustering, classification, and embedding tasks for score data in a protein sequence application.
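
To make the notion of a nonmetric score concrete, the following minimal sketch recodes a small symmetric similarity matrix into dissimilarities using the standard relation d(i, j)^2 = s(i, i) + s(j, j) - 2 s(i, j) and then lists triangle-inequality violations. It is an illustration only, not necessarily the recoding used in the chapter: the function names, the clamping of negative squared values to zero, and the score values are all assumptions made for this example.

```python
from itertools import permutations
from math import sqrt


def similarities_to_dissimilarities(S):
    """Recode a symmetric similarity matrix S (list of lists) into
    dissimilarities via d(i, j)^2 = S[i][i] + S[j][j] - 2 * S[i][j].

    Negative squared values are clamped to 0; this clamping is a
    simplification made for this sketch.
    """
    n = len(S)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = S[i][i] + S[j][j] - 2.0 * S[i][j]
            D[i][j] = sqrt(max(sq, 0.0))
    return D


def triangle_violations(D, tol=1e-12):
    """Return all index triples (i, j, k) with D[i][k] > D[i][j] + D[j][k]."""
    n = len(D)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if D[i][k] > D[i][j] + D[j][k] + tol
    ]


# Hypothetical pairwise scores for three sequences (made-up numbers,
# not taken from any real alignment).
scores = [
    [10.0, 8.0, 1.0],
    [8.0, 10.0, 8.0],
    [1.0, 8.0, 10.0],
]
D = similarities_to_dissimilarities(scores)
violations = triangle_violations(D)
```

With these made-up scores, the first and third sequences are each close to the second (dissimilarity 2) but far from each other (about 4.24), so the triangle inequality fails. A positive semidefinite similarity matrix would never produce such a violation under this recoding, which is why the example signals that the scores do not define a Euclidean vector space directly.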

Acknowledgment

I would like to thank my former colleagues Andrej Gisbrecht, Barbara Hammer, and Alexander Grigor’yan, University of Bielefeld, for supporting parts of this work. I am grateful to the Max-Planck-Institute for Physics of Complex Systems in Dresden and to Michael Biehl, Thomas Villmann, and Manfred Opper, the organizers of the workshop "Statistical Inference: Models in Physics and Learning", for providing a pleasant working atmosphere that led to some of the ideas outlined in this manuscript. The author was supported by a Marie Curie Intra-European Fellowship (IEF): FP7-PEOPLE-2012-IEF (FP7-327791-ProMoS). I would also like to thank R. Duin and E. Pekalska for providing information about pseudo-Euclidean spaces and access to distools and prtools.

Author information

Corresponding author

Correspondence to Frank-Michael Schleif, Ph.D.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Schleif, FM. (2016). Protein Sequence Analysis by Proximities. In: Jung, K. (ed) Statistical Analysis in Proteomics. Methods in Molecular Biology, vol 1362. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3106-4_12

  • DOI: https://doi.org/10.1007/978-1-4939-3106-4_12

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3105-7

  • Online ISBN: 978-1-4939-3106-4

  • eBook Packages: Springer Protocols
