Protein Sequence Analysis by Proximities

Schleif, Frank-Michael

doi:10.1007/978-1-4939-3106-4_12

Part of the book series: Methods in Molecular Biology (MIMB, volume 1362)

Abstract

Sequence data are widely used to gain deeper insight into biological systems. From a data analysis perspective, they are given as a set of symbol sequences of varying length and are, in general, compared using nonmetric score functions. In this form the data are nonstandard: they do not provide an immediate metric vector space, which complicates their analysis with standard methods. In this chapter we provide various strategies for analyzing this type of data in a mathematically accurate way, instead of the often-seen ad hoc solutions. Our approach is based on scoring values from protein sequence data, although it is applicable in a broader sense. We discuss potential recoding concepts for the scores and algorithms for clustering, classification, and embedding of score data in a protein sequence application.
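
To make the idea of recoding nonmetric scores concrete, the following is a minimal sketch of two eigenvalue corrections (clipping and flipping) that are commonly applied to an indefinite similarity matrix so that kernel methods can be used on it. It is not taken from the chapter: the function names `clip_spectrum` and `flip_spectrum` and the toy score matrix are assumptions made purely for illustration, and the chapter may use different or additional recodings.

```python
import numpy as np

def clip_spectrum(S):
    """Set negative eigenvalues of a symmetric score matrix to zero."""
    S = 0.5 * (S + S.T)               # enforce exact symmetry
    w, V = np.linalg.eigh(S)          # real eigen-decomposition
    return (V * np.maximum(w, 0.0)) @ V.T

def flip_spectrum(S):
    """Replace each eigenvalue of a symmetric score matrix by its absolute value."""
    S = 0.5 * (S + S.T)
    w, V = np.linalg.eigh(S)
    return (V * np.abs(w)) @ V.T

# Hypothetical pairwise scores for three sequences (made-up values).
# The matrix is symmetric but indefinite, so it is not a valid kernel.
S = np.array([[ 4.0, 3.0, -2.0],
              [ 3.0, 4.0,  3.0],
              [-2.0, 3.0,  4.0]])

K_clip = clip_spectrum(S)
K_flip = flip_spectrum(S)

print("smallest eigenvalue of S:     ", np.linalg.eigvalsh(S).min())
print("smallest eigenvalue of K_clip:", np.linalg.eigvalsh(K_clip).min())
print("smallest eigenvalue of K_flip:", np.linalg.eigvalsh(K_flip).min())
```

Both corrected matrices are positive semidefinite up to rounding error and could be passed to a standard kernel method. The full eigen-decomposition costs cubic time in the number of sequences, so for large data sets an approximation would be needed.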

Acknowledgment

I would like to thank my former colleagues Andrej Gisbrecht, Barbara Hammer, and Alexander Grigor’yan of the University of Bielefeld for supporting parts of this work. I am grateful to the Max-Planck-Institute for Physics of Complex Systems in Dresden and to Michael Biehl, Thomas Villmann, and Manfred Opper, the organizers of the Statistical Inference: Models in Physics and Learning workshop, for providing a nice working atmosphere that led to some of the ideas outlined in this manuscript. The author was supported by a Marie Curie Intra-European Fellowship (IEF): FP7-PEOPLE-2012-IEF (FP7-327791-ProMoS). I would like to thank R. Duin and E. Pekalska for providing information about pseudo-Euclidean spaces and access to distools and prtools.

Author information

Authors and Affiliations

School of Computer Science, University of Birmingham, Birmingham, Edgbaston, B15 2TT, UK
Frank-Michael Schleif Ph.D.

Corresponding author

Correspondence to Frank-Michael Schleif Ph.D.

Editor information

Editors and Affiliations

Department of Medical Statistics, University Medical Center Göttingen, Göttingen, Germany
Klaus Jung

About this protocol

Cite this protocol

Schleif, FM. (2016). Protein Sequence Analysis by Proximities. In: Jung, K. (ed) Statistical Analysis in Proteomics. Methods in Molecular Biology, vol 1362. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3106-4_12

DOI: https://doi.org/10.1007/978-1-4939-3106-4_12
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3105-7
Online ISBN: 978-1-4939-3106-4
eBook Packages: Springer Protocols
