Short Text Classification Using Semantic Random Forest

  • Ameni Bouaziz
  • Christel Dartigues-Pallez
  • Célia da Costa Pereira
  • Frédéric Precioso
  • Patrick Lloret
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8646)

Abstract

Traditional Random Forests perform worse on short texts than on standard-length texts. The shortness and sparseness of short texts, and their lack of contextual information, are the reasons for this degradation. Existing solutions to these issues rely mainly on data enrichment, which can also introduce noise. We propose a new approach that combines data enrichment with the introduction of semantics into Random Forests. Each short text is enriched with data semantically similar to its words. These data come from an external knowledge source organized into topics by the Latent Dirichlet Allocation model. The learning process in Random Forests is adapted to take semantic relations between words into account while building the trees. Tests performed on search snippets with the new method showed significant improvements in classification: accuracy increased by 34% compared to traditional Random Forests and by 20% compared to MaxEnt.
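
The enrichment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the topics have already been learned (in the paper, by Latent Dirichlet Allocation over an external knowledge source) and replaces them with a small hand-written table. The names TOPICS, closest_topic, enrich and the top_n parameter are hypothetical and chosen only for illustration. The adapted tree-building step is not sketched, because the abstract does not describe it in detail.

```python
# Toy stand-in for LDA output: topic name -> {word: weight}.
# ASSUMPTION: these topics and weights are invented for illustration;
# in the paper they would be learned from an external knowledge source.
TOPICS = {
    "sports": {"football": 0.4, "match": 0.3, "team": 0.2, "goal": 0.1},
    "finance": {"bank": 0.4, "stock": 0.3, "market": 0.2, "loan": 0.1},
}


def closest_topic(word, topics):
    """Return the name of the topic giving `word` its highest weight, or None."""
    candidates = [
        (weights[word], name)
        for name, weights in topics.items()
        if word in weights
    ]
    return max(candidates)[1] if candidates else None


def enrich(text, topics, top_n=2):
    """Append the top_n heaviest words of each token's closest topic.

    Tokens that belong to no topic are kept but add nothing, and a word
    is never appended twice.
    """
    tokens = text.lower().split()
    enriched = list(tokens)
    for token in tokens:
        topic = closest_topic(token, topics)
        if topic is None:
            continue
        weights = topics[topic]
        ranked = sorted(weights, key=weights.get, reverse=True)
        for word in ranked[:top_n]:
            if word not in enriched:
                enriched.append(word)
    return enriched


if __name__ == "__main__":
    print(enrich("Team wins match", TOPICS))
```

Limiting the number of added words with top_n is one simple way to bound the noise that enrichment can introduce; the value used here is arbitrary.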

Keywords

Short text classification, Random Forest, Latent Dirichlet Allocation, Semantics

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Ameni Bouaziz (1)
  • Christel Dartigues-Pallez (1)
  • Célia da Costa Pereira (1)
  • Frédéric Precioso (1)
  • Patrick Lloret (2)

  1. Laboratoire I3S (CNRS UMR-7271), Université Nice Sophia Antipolis, France
  2. Semantic Group Company, Paris, France
