Abstract
Sequence data are widely used to gain deeper insight into biological systems. From a data analysis perspective, they are given as a set of symbol sequences of varying length and are, in general, compared using nonmetric score functions. In this form the data are nonstandard: they do not provide an immediate metric vector space, which complicates their analysis with standard methods. In this chapter we provide various strategies for analyzing this type of data in a mathematically accurate way, instead of the often-seen ad hoc solutions. Our approach is based on scoring values from protein sequence data, although it is applicable in a broader sense. We discuss potential recoding concepts for the scores and algorithms that solve clustering, classification, and embedding tasks for score data in a protein sequence application.
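As a rough illustration of what recoding nonmetric scores can mean, the following minimal sketch symmetrizes a small similarity matrix, clips its negative eigenvalues to zero so that it becomes a valid (positive semidefinite) kernel, and then derives squared distances from that kernel. This is one commonly used correction and is not necessarily the procedure the chapter recommends; the function names `clip_to_psd` and `kernel_to_sq_distances` and the toy matrix `S` are assumptions made up for this example, not real alignment scores.

```python
import numpy as np


def clip_to_psd(S):
    """Symmetrize a score matrix and clip its negative eigenvalues to zero.

    Hypothetical helper: an illustrative eigenvalue correction, not the
    chapter's own implementation.
    """
    S = np.asarray(S, dtype=float)
    S_sym = 0.5 * (S + S.T)  # score matrices are not guaranteed to be symmetric
    eigvals, eigvecs = np.linalg.eigh(S_sym)
    clipped = np.clip(eigvals, 0.0, None)  # drop the negative part of the spectrum
    return (eigvecs * clipped) @ eigvecs.T


def kernel_to_sq_distances(K):
    """Squared distances induced by a kernel: d_ij^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K


# Toy similarity scores (made up for illustration); this matrix is indefinite.
S = np.array([
    [5.0, 4.0, -3.0],
    [4.0, 5.0, 4.0],
    [-3.0, 4.0, 5.0],
])

K = clip_to_psd(S)
D2 = kernel_to_sq_distances(K)

print("smallest eigenvalue before:", np.linalg.eigvalsh(S).min())
print("smallest eigenvalue after: ", np.linalg.eigvalsh(K).min())
```

After the correction, `K` can be passed to standard kernel methods and `D2` to distance-based methods. Clipping discards the information carried by the negative eigenvalues, so whether it is appropriate depends on the data.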
Acknowledgment
I would like to thank my former colleagues Andrej Gisbrecht, Barbara Hammer, and Alexander Grigor’yan, University of Bielefeld, for supporting parts of this work. I am grateful to the Max-Planck-Institute for Physics of Complex Systems in Dresden and to Michael Biehl, Thomas Villmann, and Manfred Opper, the organizers of the Statistical Inference: Models in Physics and Learning workshop, for providing a pleasant working atmosphere that led to some of the ideas outlined in this manuscript. The author was supported by a Marie Curie Intra-European Fellowship (IEF): FP7-PEOPLE-2012-IEF (FP7-327791-ProMoS). I would also like to thank R. Duin and E. Pekalska for providing information about pseudo-Euclidean spaces and access to distools and prtools.
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Schleif, FM. (2016). Protein Sequence Analysis by Proximities. In: Jung, K. (ed) Statistical Analysis in Proteomics. Methods in Molecular Biology, vol 1362. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3106-4_12
DOI: https://doi.org/10.1007/978-1-4939-3106-4_12
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3105-7
Online ISBN: 978-1-4939-3106-4
eBook Packages: Springer Protocols