Abstract
Text documents often contain information relevant for a particular domain in short “snippets”. The social science field of peace and conflict studies is such a domain, where identifying, classifying and tracking drivers of conflict from text sources is important, and snippets are typically classified by human analysts using an ontology. One issue in automating this process is that snippets tend to contain infrequent “rare” terms which lack class-conditional evidence. In this work we develop a method to enrich a bag-of-words model by complementing rare terms in the text to be classified with related terms from a Word Vector model. This method is then combined with standard linear text classification algorithms. By reducing sparseness in the bag-of-words, these enriched models perform better than the baseline classifiers. A second issue is to improve performance on “small” classes having only a few examples, and here we show that Paragraph Vectors outperform the enriched models.
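The enrichment step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how rare terms are detected, how related terms are selected, or how they are weighted, so everything below is an assumption made for illustration: the toy vectors in `VECTORS`, the document-frequency cutoff `rare_max_df`, the neighbour count `k`, cosine similarity as the relatedness measure, and a count of 1 for each added term. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical toy word vectors; a real system would load a trained model.
VECTORS = {
    "militia": [0.9, 0.1, 0.0],
    "rebels": [0.8, 0.2, 0.1],
    "army": [0.7, 0.3, 0.0],
    "election": [0.0, 0.9, 0.2],
    "vote": [0.1, 0.8, 0.3],
    "attack": [0.6, 0.0, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_frequency(corpus):
    """Count the number of training documents each term appears in."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return df


def related_terms(term, candidates, k):
    """Return up to k candidates closest to term in the vector space."""
    if term not in VECTORS:
        return []
    scored = [
        (cosine(VECTORS[term], VECTORS[w]), w)
        for w in candidates
        if w != term and w in VECTORS
    ]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


def enrich(snippet, df, rare_max_df=1, k=2):
    """Add related frequent terms to the bag for each rare term (assumed rule)."""
    bag = Counter(snippet)
    # Only terms seen often enough in training are used as complements.
    frequent = [w for w, n in df.items() if n > rare_max_df]
    for term in snippet:
        if df.get(term, 0) <= rare_max_df:  # rare or unseen in training
            for w in related_terms(term, frequent, k):
                bag[w] += 1
    return bag


training = [
    ["rebels", "attack"],
    ["army", "attack"],
    ["rebels", "army"],
    ["election", "vote"],
    ["vote", "election"],
]
df = document_frequency(training)
print(enrich(["militia", "attack"], df))
```

In this toy run "militia" never occurs in the training snippets, so it carries no class-conditional evidence on its own; the enriched bag also contains "rebels" and "army", which a linear classifier has seen during training. The enriched counts could then be passed to any standard bag-of-words classifier.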
Acknowledgment
This work was supported by the Data to Decisions Cooperative Research Centre. We thank Josie Gardner for coding the ICG DRC dataset.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Krzywicki, A., Heap, B., Bain, M., Wobcke, W., Schmeidl, S. (2018). Using Word Embeddings with Linear Models for Short Text Classification. In: Mitrovic, T., Xue, B., Li, X. (eds) AI 2018: Advances in Artificial Intelligence. AI 2018. Lecture Notes in Computer Science, vol 11320. Springer, Cham. https://doi.org/10.1007/978-3-030-03991-2_74
DOI: https://doi.org/10.1007/978-3-030-03991-2_74
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03990-5
Online ISBN: 978-3-030-03991-2
eBook Packages: Computer Science, Computer Science (R0)