Vocabulary Reduction in BoW Representing by Topic Modeling

  • Rubén Fernández-Beltran
  • Raul Montoliu
  • Filiberto Pla
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7887)

Abstract

This work presents a new approach to vocabulary reduction that filters words in the topic feature space rather than directly in the original word space. The main goal is to analyze how the Cumulative Count-based word filter (fcc) behaves when applied in the word feature space (BoW: Bag of Words) compared with when it is applied to topic descriptions obtained by LDA (Latent Dirichlet Allocation). Three well-known text datasets (Reuters, WebKB and NewsGroup) are used to evaluate the proposed approach.
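The abstract does not define the fcc filter, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a cumulative filter that keeps the highest-weight words until they cover a chosen fraction of the total weight, and it assumes that filtering "in topic space" means applying that same filter to each topic's word distribution and taking the union. The function names, the `keep_fraction` parameter, and the toy documents and topic distributions are all hypothetical; the topics are written by hand rather than fitted with LDA, and the outcome is an illustration, not a result from the paper.

```python
from collections import Counter

def cumulative_filter(weights, keep_fraction):
    """Keep the highest-weight words until they cover keep_fraction of the total.

    Assumed definition for illustration; the paper's fcc filter may differ.
    """
    total = sum(weights.values())
    kept, running = set(), 0.0
    # Sort by descending weight; break ties alphabetically for determinism.
    for word, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= keep_fraction * total:
            break
        kept.add(word)
        running += weight
    return kept

def filter_in_word_space(docs, keep_fraction):
    """Apply the filter to corpus-wide word counts (BoW space)."""
    counts = Counter(word for doc in docs for word in doc)
    return cumulative_filter(counts, keep_fraction)

def filter_in_topic_space(topics, keep_fraction):
    """Apply the filter to each topic's word distribution and merge the results."""
    vocab = set()
    for topic in topics:
        vocab |= cumulative_filter(topic, keep_fraction)
    return vocab

# Hypothetical toy corpus: two tokenized documents.
docs = [["price", "market", "the", "the"], ["goal", "match", "the", "the"]]

# Hypothetical hand-written topic-word distributions (stand-ins for LDA output).
topics = [
    {"price": 0.5, "market": 0.3, "the": 0.1, "goal": 0.05, "match": 0.05},
    {"goal": 0.5, "match": 0.3, "the": 0.1, "price": 0.05, "market": 0.05},
]

word_vocab = filter_in_word_space(docs, keep_fraction=0.5)
topic_vocab = filter_in_topic_space(topics, keep_fraction=0.7)
print(sorted(word_vocab))
print(sorted(topic_vocab))
```

In this hand-built example the two spaces retain different vocabularies, because a word that dominates the raw counts need not dominate any single topic. How the two variants compare on real data is what the paper evaluates.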

Keywords

  • Vocabulary Reduction
  • Topic Modeling
  • Word Filtering

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Rubén Fernández-Beltran (1)
  • Raul Montoliu (1)
  • Filiberto Pla (1)
  1. Institute of New Imaging Technology, Jaume I University, Castellón, Spain
