Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models

  • Conference paper
Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing (RSFDGrC 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8170)

Abstract

Analyzing and classifying sequence data based on structural similarities and differences is a mathematical problem of escalating relevance. A primary challenge in designing machine learning algorithms to analyze sequence data is the extraction and representation of significant features. This paper introduces a generalized sequence feature extraction model, referred to as the Generalized Multi-Layered Vector Spaces (GMLVS) model. Unlike most models, which represent sequence data based on subsequence frequency, the GMLVS model represents a given sequence as a collection of features, where each individual feature captures the spatial relationships between two subsequences and can be mapped into a feature vector. The utility of this approach is demonstrated via two special cases of the GMLVS model, namely Lossless Decomposition (LD) and Multi-Layered Vector Spaces (MLVS). Experimental evaluation shows that the GMLVS-inspired models generate feature vectors that, combined with basic machine learning techniques, achieve high classification performance.
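
To make the contrast with frequency-based representations concrete, the following is a minimal sketch of a feature that records where two subsequences occur relative to each other rather than how often each occurs. It is not the GMLVS, LD, or MLVS definition from the paper. The sketch assumes that subsequences are single symbols and that the "spatial relationship" is the gap between two positions; the names `pair_gap_features`, `flatten`, and `max_gap` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def pair_gap_features(seq, max_gap=3):
    """Map each ordered symbol pair (a, b) to a vector of gap counts.

    Entry d-1 of the vector counts positions i with seq[i] == a and
    seq[i + d] == b, for d = 1..max_gap. Illustrative assumption only;
    the paper's features may be defined differently.
    """
    features = defaultdict(lambda: [0] * max_gap)
    for i, a in enumerate(seq):
        for d in range(1, max_gap + 1):
            if i + d >= len(seq):
                break  # no symbol at this gap; larger gaps also fall off the end
            features[(a, seq[i + d])][d - 1] += 1
    return dict(features)


def flatten(features, alphabet, max_gap=3):
    """Concatenate the per-pair vectors in a fixed pair order.

    Pairs that never occur contribute a zero vector, so every sequence
    over the same alphabet yields a vector of the same length.
    """
    vector = []
    for pair in product(alphabet, repeat=2):
        vector.extend(features.get(pair, [0] * max_gap))
    return vector


feats = pair_gap_features("ABAB", max_gap=2)
vec = flatten(feats, "AB", max_gap=2)
```

A fixed-length vector of this kind can be passed to a standard classifier; the choice of single-symbol subsequences and a small `max_gap` is only to keep the example short.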

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Raghavan, V.V., Benton, R.G., Johnsten, T., Xie, Y. (2013). Representations for Large-Scale Sequence Data Mining: A Tale of Two Vector Space Models. In: Ciucci, D., Inuiguchi, M., Yao, Y., Ślęzak, D., Wang, G. (eds) Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing. RSFDGrC 2013. Lecture Notes in Computer Science, vol 8170. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41218-9_3

  • DOI: https://doi.org/10.1007/978-3-642-41218-9_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41217-2

  • Online ISBN: 978-3-642-41218-9

  • eBook Packages: Computer Science, Computer Science (R0)
