Exploring Multidimensional Continuous Feature Space to Extract Relevant Words

  • Márius Šajgalík
  • Michal Barla
  • Mária Bieliková
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8791)

Abstract

As the volume of text data grows, descriptive metadata become increasingly important for processing it efficiently. Keywords are one kind of such metadata; we encounter them, for example, in everyday browsing of web pages. They are useful in various scenarios, such as web search or content-based recommendation. We study the keyword extraction problem from the perspective of vector spaces and present a novel method for extracting relevant words from an article, in which each word and phrase of the article is represented as a vector of its latent features. We evaluate the method on a text categorisation task using the well-known 20-newsgroups dataset and achieve state-of-the-art results.
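The idea of treating words as points in a feature space can be illustrated with a minimal sketch. It is not the method of the paper, which the abstract does not specify in detail. It assumes, purely for illustration, that a word's relevance is scored by the cosine similarity between its vector and the centroid of all word vectors in the article. The names `TOY_VECTORS`, `cosine`, `centroid` and `rank_words`, as well as the three-dimensional vectors, are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors (an assumption for illustration);
# real latent features would come from a trained model and have many more
# dimensions.
TOY_VECTORS = {
    "graphics": [0.90, 0.10, 0.00],
    "image":    [0.80, 0.20, 0.10],
    "render":   [0.85, 0.15, 0.05],
    "hockey":   [0.00, 0.10, 0.90],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_words(words, vectors):
    """Rank the distinct known words by similarity to the article centroid.

    The centroid counts repeated words once per occurrence, so frequent
    words pull it towards themselves. This scoring rule is an assumption,
    not the procedure from the paper.
    """
    occurrences = [vectors[w] for w in words if w in vectors]
    if not occurrences:
        return []
    centre = centroid(occurrences)
    distinct = [w for w in dict.fromkeys(words) if w in vectors]
    return sorted(distinct, key=lambda w: cosine(vectors[w], centre), reverse=True)


# A toy "article" that is mostly about computer graphics.
ranking = rank_words(["graphics", "image", "render", "image", "hockey"], TOY_VECTORS)
print(ranking)
```

In this toy article the off-topic word "hockey" lies far from the centroid and is ranked last, while the three graphics-related words occupy the top positions.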

Keywords

Feature Vector, Noun Phrase, Natural Language Processing, Restricted Boltzmann Machine, Candidate Phrase
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgement

This work was partially supported by grants No. VG1/0675/11 and APVV-0208-10, and it is a partial result of the Research and Development Operational Programme project “University Science Park of STU Bratislava”, ITMS 26240220084, co-funded by the European Regional Development Fund.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Márius Šajgalík (1)
  • Michal Barla (1)
  • Mária Bieliková (1)
  1. Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
