Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models

Raghavan, Vijay V.; Benton, Ryan G.; Johnsten, Tom; Xie, Ying

doi:10.1007/978-3-642-41218-9_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8170)

Included in the following conference series:

International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing

Abstract

Analyzing and classifying sequence data based on structural similarities and differences is a mathematical problem of escalating relevance. Indeed, a primary challenge in designing machine learning algorithms for analyzing sequence data is the extraction and representation of significant features. This paper introduces a generalized sequence feature extraction model, referred to as the Generalized Multi-Layered Vector Spaces (GMLVS) model. Unlike most models that represent sequence data based on subsequence frequency, the GMLVS model represents a given sequence as a collection of features, where each individual feature captures the spatial relationships between two subsequences and can be mapped into a feature vector. The utility of this approach is demonstrated via two special cases of the GMLVS model, namely, Lossless Decomposition (LD) and Multi-Layered Vector Spaces (MLVS). Experimental evaluation shows that the GMLVS-inspired models generate feature vectors that, combined with basic machine learning techniques, achieve high classification performance.
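
The contrast the abstract draws between frequency-based features and features built on the spatial relationship between two subsequences can be illustrated with a small sketch. This is not the GMLVS, LD, or MLVS construction from the paper, which the abstract does not specify. It assumes, purely for illustration, that the "subsequences" are single symbols and that the "spatial relationship" is the positional gap between them. The function names `kmer_counts` and `pair_gap_features` and the parameter `max_gap` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=1):
    """Frequency-based baseline: count each contiguous length-k subsequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pair_gap_features(seq, max_gap=2):
    """Hypothetical pairwise feature (an assumption, not the paper's definition).

    For each ordered symbol pair (a, b), build a vector whose entry d-1
    counts how often b occurs exactly d positions after a, for d = 1..max_gap.
    """
    feats = {}
    for i, a in enumerate(seq):
        for d in range(1, max_gap + 1):
            if i + d < len(seq):
                key = (a, seq[i + d])
                feats.setdefault(key, [0] * max_gap)[d - 1] += 1
    return feats


# Two sequences with identical symbol frequencies but different arrangement.
s1, s2 = "ABAB", "BABA"
same_frequencies = kmer_counts(s1) == kmer_counts(s2)
different_pair_features = pair_gap_features(s1) != pair_gap_features(s2)
```

Under these assumptions, `ABAB` and `BABA` are indistinguishable by symbol counts alone, while the per-pair gap vectors differ, which is the kind of positional information a pure frequency representation discards.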

Author information

Authors and Affiliations

Center for Advanced Computer Studies, University of Louisiana at Lafayette, Louisiana, USA
Vijay V. Raghavan & Ryan G. Benton
School of Computing, University of South Alabama, Alabama, USA
Tom Johnsten
Department of Computer Science, Kennesaw State University, Georgia, USA
Ying Xie

Editor information

Editors and Affiliations

University of Milano-Bicocca, viale Sarca 336/14, 20126, Milano, Italy
Davide Ciucci
Osaka University, 560-8531, Toyonaka, Osaka, Japan
Masahiro Inuiguchi
University of Regina, S4S 0A2, Regina, SK, Canada
Yiyu Yao
University of Warsaw, ul. Banacha, 2, 02-097, Warsaw, Poland
Dominik Ślęzak
Chongqing Institute of Green and Intelligent Technology, CAS, 401122, Chongqing, China
Guoyin Wang

About this paper

Cite this paper

Raghavan, V.V., Benton, R.G., Johnsten, T., Xie, Y. (2013). Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models. In: Ciucci, D., Inuiguchi, M., Yao, Y., Ślęzak, D., Wang, G. (eds) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. RSFDGrC 2013. Lecture Notes in Computer Science, vol 8170. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41218-9_3

DOI: https://doi.org/10.1007/978-3-642-41218-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41217-2
Online ISBN: 978-3-642-41218-9
eBook Packages: Computer Science, Computer Science (R0)
