A New Dimensionality Reduction Technique Based on HMM for Boosting Document Classification

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 375)

Abstract

Many classification problems, such as text classification, must cope with the high dimensionality of structured document representations, and the sheer size of the data makes computation burdensome. Consequently, there is a strong need to reduce the amount of information handled in order to improve the classification process. In this paper, we propose a dimensionality reduction technique for text datasets that uses a clustering method to group documents and a simple Hidden Markov Model to represent them. We applied the new method to the OHSUMED benchmark text corpus using \(k\)-NN and SVM classifiers. The results obtained are very satisfactory and demonstrate the suitability of the proposed technique for dimensionality reduction and document classification.
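The abstract gives no implementation details, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes documents are token lists, uses a tiny deterministic k-means as the clustering step, and stands in for the "simple Hidden Markov Model" with a one-state discrete HMM per cluster, which is equivalent to a smoothed unigram model. Each document is then re-represented by its length-normalised log-likelihood under every cluster model, so the feature dimension drops from the vocabulary size to the number of clusters. All function names, the toy documents, and the smoothing parameter `alpha` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(tokens, vocab):
    """Term-frequency vector of a tokenised document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def kmeans(vectors, k, iters=10):
    """Tiny k-means; the first k vectors seed the centroids (assumption, for reproducibility)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members; empty clusters keep their centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def fit_cluster_model(docs, vocab, alpha=1.0):
    """Laplace-smoothed emission distribution of a one-state HMM (a unigram model)."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_likelihood(tokens, model):
    """Log-likelihood of a token sequence; out-of-vocabulary tokens are skipped."""
    return sum(math.log(model[t]) for t in tokens if t in model)


def reduce_documents(docs, k):
    """Map each document to k features: its per-token log-likelihood under each cluster model."""
    vocab = sorted({t for d in docs for t in d})
    labels = kmeans([tf_vector(d, vocab) for d in docs], k)
    models = [
        fit_cluster_model([d for d, lab in zip(docs, labels) if lab == c], vocab)
        for c in range(k)
    ]
    features = [[log_likelihood(d, m) / max(len(d), 1) for m in models] for d in docs]
    return features, labels


# Toy corpus (hypothetical): two cardiology-like and two oncology-like documents.
docs = [
    ["heart", "cardiac", "heart", "artery"],
    ["tumor", "cancer", "tumor", "biopsy"],
    ["cardiac", "artery", "heart"],
    ["cancer", "biopsy", "tumor"],
]
features, labels = reduce_documents(docs, k=2)
```

The resulting `features` rows have two columns instead of six vocabulary dimensions and could be passed to any standard \(k\)-NN or SVM implementation. A multi-state HMM trained with Baum-Welch would replace `fit_cluster_model` and `log_likelihood` without changing the rest of the pipeline.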

Keywords

Hidden Markov model, Text classification, Dimensionality reduction, Document clustering, Similarity-based classification


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Computer Science Dept., University of Vigo, Escola Superior de Enxeñería Informática, Ourense, Spain
