Abstract
Recently, implicit representation models, such as embedding or deep learning, have been successfully adopted to text classification task due to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in a large-scale text classification, like the Open Directory Project (ODP)-based text classification. However, the performance of these models is limited to the associated knowledge bases. In this paper, we incorporate word embeddings into the ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories by jointly modeling word embeddings and the ODP-based text classification. We then propose a novel semantic similarity measure, which utilizes the category and word vectors obtained from the joint model. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaging F1-score and precision at k, respectively, over state-of-the-art techniques.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Broder, A., Fontoura, M., Gabrilovich, E., Joshi, A., Josifovski, V., Zhang, T.: Robust classification of rare queries using web knowledge. In: SIGIR, pp. 231–238 (2007)
Chirita, P.A., Nejdl, W., Paiu, R., Kohlschütter, C.: Using ODP metadata to personalize search. In: SIGIR, pp. 178–185 (2005)
Ha, J., Lee, J.H., Jang, W.J., Lee, Y.K., Lee, S.: Toward robust classification using the open directory project. In: DSAA, pp. 607–612 (2014)
Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP, pp. 1746–1751 (2014)
Kurata, G., Xiang, B., Zhou, B.: Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In: NAACL, pp. 521–526 (2016)
Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML, pp. 1188–1196 (2014)
Lee, J.H., Ha, J., Jung, J.Y., Lee, S.: Semantic contextual advertising based on the open directory project. ACM Trans. Web 7(4), 24:1–24:22 (2013)
McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: ICML, pp. 359–367 (1998)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (Workshop) (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
Nam, J., Kim, J., Loza MencÃa, E., Gurevych, I., Fürnkranz, J.: Large-scale multi-label text classification — revisiting neural networks. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds.) ECML PKDD 2014. LNCS (LNAI), vol. 8725, pp. 437–452. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44851-9_28
Shin, H., Lee, G., Ryu, W.J., Lee, S.: Utilizing wikipedia knowledge in open directory project-based text classification. In: SAC, pp. 309–314 (2017)
Song, Y., Roth, D.: Unsupervised sparse vector densification for short text similarity. In: NAACL, pp. 1275–1280 (2015)
Wang, Z., Wang, H.: Understanding short texts. In: ACL (Tutorial) (2016)
Yang, Y.: An evaluation of statistical approaches to text categorization. Inf. Retr. 1(1), 69–90 (1999)
Acknowledgment
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (number 2015R1A2A1A10052665).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Kim, KM., Dinara, A., Choi, BJ., Lee, S. (2018). Incorporating Word Embeddings into Open Directory Project Based Large-Scale Classification. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_30
Download citation
DOI: https://doi.org/10.1007/978-3-319-93037-4_30
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93036-7
Online ISBN: 978-3-319-93037-4
eBook Packages: Computer ScienceComputer Science (R0)