Incorporating Word Embeddings into Open Directory Project Based Large-Scale Classification

Kim, Kang-Min; Dinara, Aliyeva; Choi, Byung-Ju; Lee, SangKeun

doi:10.1007/978-3-319-93037-4_30

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10938)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Recently, implicit representation models, such as embedding or deep learning models, have been successfully adopted for text classification tasks owing to their outstanding performance. However, these approaches are limited to small- or moderate-scale text classification. Explicit representation models are often used in large-scale text classification, such as Open Directory Project (ODP)-based text classification. However, the performance of these models is limited by the associated knowledge bases. In this paper, we incorporate word embeddings into ODP-based large-scale classification. To this end, we first generate category vectors, which represent the semantics of ODP categories, by jointly modeling word embeddings and ODP-based text classification. We then propose a novel semantic similarity measure that utilizes the category and word vectors obtained from the joint model. The evaluation results clearly show the efficacy of our methodology in large-scale text classification. The proposed scheme exhibits significant improvements of 10% and 28% in terms of macro-averaged F1-score and precision at k, respectively, over state-of-the-art techniques.
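
As a reading aid, the two evaluation metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `macro_f1` and `precision_at_k`, the single-label setup, and the toy category labels are hypothetical, and the paper's exact evaluation protocol is not given on this page.

```python
from collections import defaultdict


def macro_f1(gold, predicted):
    """Unweighted mean of per-category F1 (single-label case, assumed here)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true category but was missed
    categories = set(gold) | set(predicted)
    # F1 = 2TP / (2TP + FP + FN); every category here has a non-zero denominator
    scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in categories]
    return sum(scores) / len(scores)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked categories that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for c in ranked[:k] if c in relevant) / k


# Toy data (hypothetical category labels, not taken from the paper)
gold = ["Arts", "Arts", "Science", "Sports"]
predicted = ["Arts", "Science", "Science", "Arts"]
print(macro_f1(gold, predicted))
print(precision_at_k(["Arts", "Science", "Sports"], {"Arts", "Sports"}, 2))
```

Macro-averaging gives every category equal weight regardless of how many documents it contains, which matters in a taxonomy as large and unevenly populated as the ODP, whereas precision at k only looks at the top of a ranked category list.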

Acknowledgment

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (number 2015R1A2A1A10052665).

Author information

Authors and Affiliations

Korea University, Seoul, Republic of Korea
Kang-Min Kim, Aliyeva Dinara, Byung-Ju Choi & SangKeun Lee

Corresponding author

Correspondence to SangKeun Lee.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

About this paper

Cite this paper

Kim, KM., Dinara, A., Choi, BJ., Lee, S. (2018). Incorporating Word Embeddings into Open Directory Project Based Large-Scale Classification. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_30

DOI: https://doi.org/10.1007/978-3-319-93037-4_30
Published: 20 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93036-7
Online ISBN: 978-3-319-93037-4
eBook Packages: Computer Science, Computer Science (R0)
