Vocabulary Reduction in BoW Representing by Topic Modeling

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNIP, volume 7887)

Abstract

This work presents a new approach to vocabulary reduction that filters words in the topic feature space rather than directly in the original word space. The main goal is to analyze how applying the cumulative count-based word filter (f_cc) in the word feature space (BoW: Bag of Words) differs from applying it to topic descriptions obtained by LDA (Latent Dirichlet Allocation). Three well-known text datasets (Reuters, WebKB and NewsGroup) are used to evaluate the proposed approach.
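
The contrast between the two filtering spaces can be illustrated with a minimal sketch. The abstract does not define f_cc, so the code assumes one plausible reading: keep the highest-weight words until their cumulative weight reaches a chosen fraction of the total. The function names, the toy vocabulary, the count matrix and the topic-word probabilities are all hypothetical; in practice the topic-word matrix would come from a fitted LDA model rather than being written by hand.

```python
# Sketch only: the definition of the cumulative count-based filter below is an
# assumption, not the definition used in the paper.

def cumulative_count_filter(weights, keep_fraction):
    """Return the indices of the highest-weight entries whose cumulative
    weight first reaches keep_fraction of the total weight."""
    total = sum(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, running = set(), 0.0
    for i in order:
        if running >= keep_fraction * total:
            break
        kept.add(i)
        running += weights[i]
    return kept


def filter_in_word_space(doc_term_counts, keep_fraction):
    """Apply the filter to corpus-level word counts (BoW space)."""
    corpus_counts = [sum(column) for column in zip(*doc_term_counts)]
    return cumulative_count_filter(corpus_counts, keep_fraction)


def filter_in_topic_space(topic_word_probs, keep_fraction):
    """Apply the filter to each topic's word distribution and take the union."""
    kept = set()
    for topic in topic_word_probs:
        kept |= cumulative_count_filter(topic, keep_fraction)
    return kept


# Hypothetical toy data: 3 documents over a 6-word vocabulary.
vocab = ["price", "market", "goal", "team", "the", "rare"]
doc_term_counts = [
    [3, 2, 0, 0, 5, 0],
    [2, 3, 0, 0, 4, 1],
    [0, 0, 1, 1, 6, 0],
]
# Hypothetical topic-word probabilities, standing in for LDA output.
topic_word_probs = [
    [0.45, 0.40, 0.01, 0.01, 0.10, 0.03],  # a "finance"-like topic
    [0.01, 0.01, 0.42, 0.41, 0.12, 0.03],  # a "sport"-like topic
]

word_space = filter_in_word_space(doc_term_counts, 0.8)
topic_space = filter_in_topic_space(topic_word_probs, 0.8)
print(sorted(vocab[i] for i in word_space))
print(sorted(vocab[i] for i in topic_space))
```

On this toy input, the word-space filter keeps "market", "price" and "the", while the topic-space filter keeps "goal", "market", "price" and "team". The example only shows that the two spaces can select different vocabularies; it says nothing about the results reported in the paper.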

Keywords

  • Vocabulary Reduction
  • Topic Modeling
  • Word Filtering

This work was partially supported by FPU-AP-2009-4435 from the Spanish Ministry of Education, PROMETEO/2010/028 project from Generalitat Valenciana and P1-1B2010-27 project from the Plan de Promoció de la Investigació UJI.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fernández-Beltran, R., Montoliu, R., Pla, F. (2013). Vocabulary Reduction in BoW Representing by Topic Modeling. In: Sanches, J.M., Micó, L., Cardoso, J.S. (eds) Pattern Recognition and Image Analysis. IbPRIA 2013. Lecture Notes in Computer Science, vol 7887. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38628-2_77

  • DOI: https://doi.org/10.1007/978-3-642-38628-2_77

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38627-5

  • Online ISBN: 978-3-642-38628-2

  • eBook Packages: Computer Science, Computer Science (R0)