A Hierarchical n-Grams Extraction Approach for Classification Problem

Mhamdi, Faouzi; Rakotomalala, Ricco; Elloumi, Mourad

doi:10.1007/978-3-642-01350-8_20

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4879)

Included in the following conference series:

International Conference on Signal-Image Technology and Internet-Based Systems

Abstract

We are interested in classifying proteins on the basis of their primary structures. The goal is to automatically assign protein sequences to their families. This task requires extracting a set of descriptors that are then presented to supervised learning algorithms. Many types of descriptors are used in the literature; the most popular is the n-gram, a sequence of n consecutive characters. The standard n-gram approach consists of first fixing the parameter n, extracting the corresponding n-gram descriptors, and working with this single value throughout the data mining process. In this paper, we propose a hierarchical approach to n-gram construction. The goal is to obtain descriptors of varying length that better characterize the protein families. This approach is motivated by the domain knowledge of biologists: the patterns that characterize a protein family usually vary in length. Our idea is to transpose the frequent itemset extraction principle, mainly used in association rule mining, to n-gram extraction for protein classification. The experiments show that the new approach is consistent with biological reality and achieves the same accuracy as the standard approach.
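The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of how the frequent itemset principle might carry over to n-grams, not the authors' implementation. It assumes that the support of an n-gram is the number of sequences containing it and that an n-gram is a candidate only if both of its (n-1)-length substrings were frequent at the previous level. The function names, the `min_support` and `max_n` parameters, and the toy sequences are all hypothetical.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return the set of distinct substrings of length n in a sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def frequent_ngrams(sequences, n, min_support, previous=None):
    """Return the n-grams that occur in at least min_support sequences.

    When `previous` (the frequent (n-1)-grams) is given, an n-gram is
    counted only if both its prefix and its suffix of length n-1 are in
    that set. This is the Apriori-style pruning assumed in this sketch.
    """
    counts = Counter()
    for seq in sequences:
        for gram in ngrams(seq, n):
            if previous is not None and (
                gram[:-1] not in previous or gram[1:] not in previous
            ):
                continue
            counts[gram] += 1
    return {gram for gram, count in counts.items() if count >= min_support}


def hierarchical_ngrams(sequences, min_support, max_n):
    """Grow frequent n-grams level by level, from n = 1 up to max_n."""
    levels = {}
    previous = None
    for n in range(1, max_n + 1):
        current = frequent_ngrams(sequences, n, min_support, previous)
        if not current:
            break  # no frequent n-gram left, so no longer one can be frequent
        levels[n] = current
        previous = current
    return levels


# Toy sequences, invented for illustration only.
toy_sequences = ["ACDEA", "ACDFA", "GCDEA"]
levels = hierarchical_ngrams(toy_sequences, min_support=2, max_n=4)
for n, grams in levels.items():
    print(n, sorted(grams))
```

With a fixed n, by contrast, a single call such as `frequent_ngrams(toy_sequences, 3, 2)` would return descriptors of one length only. The level-wise variant keeps descriptors of every length that passes the support threshold, which is the varying-length property the abstract describes.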

Author information

Authors and Affiliations

UTIC, Unité de recherche en Technologies de l’Information et de la Communication, École Supérieure des Sciences et Techniques de Tunis, Tunisie
Faouzi Mhamdi & Mourad Elloumi
Laboratoire ERIC, Université Lyon 2, France
Ricco Rakotomalala

Editor information

Editors and Affiliations

Dipartimento Tecnologie dell’Informazione, Università degli Studi di Milano, Via Bramante 65, 26013, Crema, Italy
Ernesto Damiani
LE2I-CNRS, Université de Bourgogne, Aile de l’Ingénieur, 21078, Dijon Cedex, France
Kokou Yetongnon, Richard Chbeir & Albert Dipanda

About this paper

Cite this paper

Mhamdi, F., Rakotomalala, R., Elloumi, M. (2009). A Hierarchical n-Grams Extraction Approach for Classification Problem. In: Damiani, E., Yetongnon, K., Chbeir, R., Dipanda, A. (eds) Advanced Internet Based Systems and Applications. SITIS 2006. Lecture Notes in Computer Science, vol 4879. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01350-8_20

DOI: https://doi.org/10.1007/978-3-642-01350-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01349-2
Online ISBN: 978-3-642-01350-8
eBook Packages: Computer Science, Computer Science (R0)
