Word-class embeddings for multiclass text classification

Abstract

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
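For orientation, the following is a minimal sketch of the idea described above, not the authors' exact implementation (which is available in the linked repository). The helper name, the co-occurrence-based correlation, and the L1 row normalization (cf. footnote 5) are illustrative assumptions.

```python
import numpy as np

def word_class_embeddings(X, Y):
    """Illustrative sketch (not the official implementation): represent each
    word by its correlation with the m target classes.
    X: (n_docs, |V|) document-term matrix; Y: (n_docs, m) binary label matrix.
    Returns a (|V|, m) matrix of word-class scores, L1-normalized by row."""
    W = X.T @ Y                               # word-class co-occurrence scores
    row_sums = W.sum(axis=1, keepdims=True)
    return W / np.maximum(row_sums, 1e-12)    # each word spends a "budget" of mass 1

# Concatenation with pre-trained embeddings (e.g., a GloVe matrix aligned to the
# same vocabulary, assumed available as `glove` of shape (|V|, 300)):
# wce = word_class_embeddings(X_train, Y_train)
# embedding_matrix = np.hstack([glove, wce])  # (|V|, 300 + m)
```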


Notes

  1. Given a set of classes (a.k.a. a codeframe) \(\mathcal {C}=\{c_{1},\ldots ,c_{m}\}\), a classification problem is said to be multiclass if \(m>2\); it is said to be single-label if each item always belongs to exactly one class; it is said to be multilabel if each item can belong to any number (i.e., 0, 1, or more than 1) of classes in \(\mathcal {C}\).

  2. fastText can consider not only unigrams but also n-grams and subwords as the surface forms of input.

  3. Pointwise Mutual Information (PMI) is defined as \(\mathrm {PMI}(w_{i},c_{j})=\log \frac{\Pr (w_{i},c_{j})}{\Pr (w_{i})\Pr (c_{j})}\), where \(\Pr (w_{i},c_{j})\) is the joint probability of word \(w_{i}\) and context \(c_{j}\), and \(\Pr (w_{i})\) and \(\Pr (c_{j})\) are the marginal probabilities of the word and context, respectively. PPMI takes the positive part of PMI, i.e., \(\mathrm {PPMI}(w_{i},c_{j})=\max \{0,\mathrm {PMI}(w_{i},c_{j})\}\).
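The definitions above translate directly into code; the following numpy sketch (an illustration, with zero-count cells mapped to a PMI of 0) computes PPMI from a word-context co-occurrence count matrix:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a word-context (or word-class) co-occurrence count matrix.
    counts[i, j] = number of co-occurrences of word w_i with context/class c_j."""
    total = counts.sum()
    p_wc = counts / total                          # joint probabilities Pr(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)          # marginals Pr(w_i)
    p_c = p_wc.sum(axis=0, keepdims=True)          # marginals Pr(c_j)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                   # zero counts: treat PMI as 0
    return np.maximum(pmi, 0.0)                    # PPMI = max{0, PMI}
```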

  4. The compatibility between a label embedding matrix \(E\) and a document embedding \(h\) is defined to be proportional to \(\sigma (EU+b_u)\sigma (Vh+b_v)\); this contrasts with previous related literature, which customarily relied on bilinear models of the form \(EWh\) for the same purpose (\(U,b_u,V,b_v,W\) are learnable parameters).

  5. Put another way, L1 normalization fixes a “budget” of mass 1 for the score a term can deliver across the classes, irrespective of its prevalence in language or in the corpus.
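A toy numpy illustration of this “budget” effect (the values are made up): two terms with the same class profile but very different frequencies end up with identical L1-normalized rows.

```python
import numpy as np

# Each row holds the (non-negative) scores a term receives for the m classes.
scores = np.array([[3.0, 1.0, 0.0],     # frequent term
                   [0.3, 0.1, 0.0]])    # rare term with the same class profile
l1 = scores / scores.sum(axis=1, keepdims=True)
print(l1)   # both rows become [0.75, 0.25, 0.0]: a total mass of 1 per term
```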

  6. It is worth recalling that the bag-of-words model tends to produce matrices that are highly sparse. Many software packages take advantage of this sparsity in order to compute matrix multiplication efficiently, at a cost that, in practice, falls far below the asymptotic bound \(O(|\mathcal {V}|nm)\). We discuss empirical computational complexity issues in Sect. 4.7.

  7. PCA is based on (truncated) Singular Value Decomposition (SVD). The SVD of a matrix \(\mathbf {M}\) is a factorization of the form \(\mathbf {U}\boldsymbol{\Sigma }\mathbf {V}^\top \), in which \(\mathbf {U}\) and \(\mathbf {V}\) are orthogonal matrices containing the left and right singular vectors of \(\mathbf {M}\) as their columns, respectively, and \(\boldsymbol{\Sigma }\) is a diagonal matrix containing the singular values of \(\mathbf {M}\). That is, PCA is an alternative way of factoring \(\mathbf {M}\) w.r.t. Eq. 7. The dimensionality reduction is achieved by ordering the components by decreasing singular value and truncating the matrices. The optimal rank-\(r\) approximation of \(\mathbf {M}\) is thus given by \(\mathbf {U}_{r}\boldsymbol{\Sigma }_{r}\), which only accounts for the \(r\) largest singular values and their corresponding \(r\) left singular vectors.
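A minimal numpy sketch of this truncation (scikit-learn's TruncatedSVD/PCA offer equivalent functionality; the function name is an illustrative assumption):

```python
import numpy as np

def truncated_svd(M, r):
    """Rank-r reduction of M via truncated SVD; returns U_r @ Sigma_r,
    i.e., one r-dimensional row per row of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(s) Vt
    return U[:, :r] * s[:r]                            # keep the r largest singular values

# e.g., reducing a |V| x m matrix of word-class scores to r dimensions:
# reduced = truncated_svd(wce_matrix, r=50)
```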

  8. Note that the columns of \(\mathbf {Y}\) are binary, indicating the presence or absence of each label for each document. It is instructive to view \(\mathbf {Y}\)’s binary columns as indicator functions that select which elements of the rows of \(\mathbf {X}_{1}^\top \) contribute to the summation in the dot product.
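A toy example of this product (the numbers are made up), using a sparse bag-of-words matrix as discussed in footnote 6:

```python
import numpy as np
from scipy.sparse import csr_matrix

# X1: (n_docs, |V|) bag-of-words matrix (typically very sparse)
# Y:  (n_docs, m) binary label matrix
X1 = csr_matrix(np.array([[2, 0, 1],
                          [0, 1, 1],
                          [1, 0, 0]]))
Y = np.array([[1, 0],
              [1, 1],
              [0, 1]])

# Column j of Y selects the documents labelled with c_j, so column j of the
# product sums the corresponding rows of X1; sparse formats keep this cheap.
WC = np.asarray(X1.T @ Y)    # shape (|V|, m)
print(WC)
```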

  9. Since we undertake a stochastic optimization, this actually applies to batches of data.

  10. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

  11. http://qwone.com/~jason/20Newsgroups/. Note that this version of 20Newsgroups is indeed single-label: while a previous version contained a small set of documents with more than one label (corresponding to posts that had been cross-posted to more than one newsgroup), that set is not present in the version we use.

  12. While some previous papers [e.g., Tang et al. (2015)] have reported substantially higher scores for this dataset, it is worth noting that we use a harder, more realistic version of the dataset than has been used in those papers. Following Moreo et al. (2020), in our version we remove all headers, footers, and quotes, since these fields contain words that are highly correlated with the target labels, thus making the classification task unrealistically easy; see http://scikit-learn.org/stable/datasets/twenty_newsgroups.html for further details. Our results are indeed consistent with those of other papers following the same policy.
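For reference, this “harder” version can be obtained with scikit-learn via the remove argument documented at the page linked above:

```python
from sklearn.datasets import fetch_20newsgroups

# Strip the metadata fields that leak the target label.
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
print(len(train.data), len(train.target_names))   # ~11,314 training documents, 20 classes
```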

  13. http://disi.unitn.it/moschitti/corpora.htm.

  14. Available from http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.

  15. https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis.

  16. https://www.wipo.int/classifications/ipc/en/ITsupport/Categorization/dataset/.

  17. http://scikit-learn.org/.

  18. Note that these deep models are not meant to be used here as baselines, but rather as vehicles on which to test WCEs. In other words, the actual baseline for any model equipped with WCEs is the same model without WCEs.

  19. http://nlp.stanford.edu/data/glove.840B.300d.zip.

  20. We generate the validation set by randomly sampling 20% of the training set, with a maximum of 20,000 documents; the rest is taken to be the training set proper. We keep the training/validation split consistent across all methods.
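A sketch of this split, assuming scikit-learn's train_test_split; the helper name and the fixed seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

def split_train_validation(docs, labels, val_prop=0.2, max_val=20000, seed=42):
    """Hold out 20% of the training documents (capped at 20,000) as the
    validation set; a fixed seed keeps the split consistent across methods."""
    n_val = min(int(len(docs) * val_prop), max_val)
    # returns docs_train, docs_val, labels_train, labels_val
    return train_test_split(docs, labels, test_size=n_val, random_state=seed)
```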

  21. Note that, consistently with (Cortes and Vapnik 1995; Morik et al. 1999), in this formulation we assume the class labels \(y_{k}\) to be in \(\{-1,+1\}\), while in Sect. 3 we had assumed them to be in \(\{0,1\}\); the difference is, of course, unproblematic.

  22. In scikit-learn this is achieved by setting \(J_+=n/(mP)\) and \(J_-=n/(mN)\), and corresponds to setting the parameter class_weight to “balanced”.
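In code, with scikit-learn's LinearSVC (the value of C is illustrative; see also footnote 24 on how it is optimized):

```python
from sklearn.svm import LinearSVC

# class_weight='balanced' reweights each class inversely to its frequency,
# which corresponds to the J+ = n/(mP), J- = n/(mN) setting described above.
svm = LinearSVC(class_weight='balanced', C=1.0)
# svm.fit(Xtr, ytr)
```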

  23. Somewhat surprisingly, though, several relevant related works in which SVMs are used as baselines [see, e.g., (Grave et al. 2017; Jiang et al. 2018; Zhang et al. 2015)] do not report the details of how, if at all, they tune the SVM hyperparameters.

  24. Using k-fold cross-validation (k-FCV) on the full set of labelled documents is a more expensive, but stronger, way of doing parameter optimization than using a single split between a training set and a validation set, because k-FCV performs k such splits. We use k-FCV for SVMs and single-split optimization for the deep-learning architectures because it is realistic to do so, i.e., because SVMs are computationally cheap enough that we can afford k-FCV, while neural architectures are not.
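A sketch of how such a k-FCV optimization can be run with scikit-learn's GridSearchCV; the parameter grid, the number of folds (5), and the scoring function shown here are illustrative assumptions, and multilabel datasets would additionally require a one-vs-rest wrapper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}   # illustrative grid
search = GridSearchCV(LinearSVC(class_weight='balanced'), param_grid,
                      cv=5, scoring='f1_macro', n_jobs=-1)
# search.fit(Xtr, ytr)
# best_svm = search.best_estimator_
```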

  25. This implementation relies on liblinear. See https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html for further details.

  26. The STW functions we have considered include chi-square, information gain, gain ratio, pointwise mutual information (Debole and Sebastiani 2003), ConfWeight (Soucy and Mineau 2005), and relevance frequency (Lan et al. 2009).

  27. Given a word w, a codeframe \(\mathcal {C}=\{c_{1},\ldots ,c_m\}\), and an STW function f that generates a list of scores \(S=(f(w,c_{1}),\ldots ,f(w,c_m))\), we consider the following aggregation functions: averaging \(\left( \frac{1}{m}\sum _{c\in \mathcal {C}}f(w,c)\right) \), averaging weighted by class prevalence \(\left( \frac{\sum _{c\in \mathcal {C}}f(w,c)p(c)}{\sum _{c\in \mathcal {C}}p(c)}\right) \), where p(c) is the prevalence of class c, and max-pooling \(\left( \max _{c\in \mathcal {C}}\{f(w,c)\}\right) \).
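The three aggregation functions in numpy (the helper name and the keyword values are illustrative):

```python
import numpy as np

def aggregate_stw(S, prevalences, how='max'):
    """Collapse the per-class STW scores S = (f(w,c_1), ..., f(w,c_m))
    of a word into a single score."""
    S = np.asarray(S, dtype=float)
    p = np.asarray(prevalences, dtype=float)
    if how == 'mean':                      # plain averaging
        return S.mean()
    if how == 'wmean':                     # averaging weighted by class prevalence
        return (S * p).sum() / p.sum()
    return S.max()                         # max-pooling
```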

  28. https://github.com/ThilinaRajapakse/simpletransformers.

  29. https://huggingface.co/bert-base-uncased.

  30. https://github.com/guoyinwang/LEAM.

  31. Note that by fastText we here mean its “supervised” mode, that is, fastText as a classifier. The embeddings that fastText produces when working in “unsupervised” mode are used and discussed later, in Sect. 4.9, along with other sets of embeddings.
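For reference, a minimal supervised fastText run with the official Python bindings; the hyperparameter values are illustrative, and the training-file name is an assumption (the expected format, one labelled document per line, is shown in the comment).

```python
import fasttext

# Training file: one document per line, prefixed with its label, e.g.
#   __label__earn stocks rallied after the quarterly report ...
model = fasttext.train_supervised(input='train.txt', epoch=25, lr=0.5, wordNgrams=2)
labels, probs = model.predict("oil prices fell sharply on monday")
```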

  32. https://fasttext.cc/.

  33. We modified the official implementation of https://github.com/guoyinwang/LEAM to use early stopping.

  34. Though most traditional functions used for feature selection can only handle presence/absence, other metrics exist that work with weighted scores, e.g., the Fisher score. In initial experiments not described in this paper we did try the Fisher score, but eventually abandoned it because (a) its computation is very slow, and (b) the classification accuracy we observed is not much different from what can be obtained with the other functions mentioned above, and is often intermediate between the best and the worst recorded values.

  35. Available at https://code.google.com/archive/p/word2vec/.

  36. Available at https://fasttext.cc/docs/en/english-vectors.html.

  37. We used the Huggingface’s implementation available at https://github.com/huggingface/transformers.

  38. See https://fasttext.cc/docs/en/unsupervised-tutorial.html.

  39. More often than not, BERT is used by fine-tuning the entire model to the task at hand. In this set of experiments we prefer to reproduce a simpler scenario, in which the practitioner simply uses the pre-trained models as made available by the developers of BERT. Fine-tuning models such as BERT requires a considerable amount of computational power, which might not be within everyone’s reach. Experiments showcasing how a properly fine-tuned BERT performs (with and without WCEs) on our datasets are illustrated in Sect. 4.4.
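A sketch of this “frozen” usage with the Hugging Face transformers library: pre-trained BERT serves as a fixed feature extractor, with no gradient updates (the example texts are made up).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')
bert.eval()                                    # no fine-tuning: weights stay frozen

texts = ["wheat futures rose", "the court rejected the appeal"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    out = bert(**batch)
features = out.last_hidden_state[:, 0, :]      # [CLS] vectors, shape (batch, 768)
```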

  40. https://projector.tensorflow.org/.

  41. Another technique for solving this problem is Latent Semantic Imputation (Yao et al. 2019). This method fills in the missing representation in a vector space (in our case, the space of WCEs) by analyzing the neighborhood of the word representation in another vector space (in our case, the space of unsupervised embeddings), via techniques inspired by manifold learning.

  42. The fact that WCEs are not suitable for codeframes containing just a few classes is the reason why all the datasets we have chosen for our experiments are for classification by topic (CBT). While WCEs are not inherently about CBT, it is a matter of fact that large enough codeframes are mostly to be found in CBT (e.g., when classifying text according to domain-specific taxonomies/thesauri). Other classification tasks of a non-topical nature are often characterized by codeframes consisting of two or three classes; examples of this are classification by sensitivity (Sensitive versus NonSensitive) (Berardi et al. 2015), sentiment classification (Positive versus Neutral versus Negative) (Pang and Lee 2008), or classification by subjectivity (Subjective versus Objective) (Riloff et al. 2005).

  43. While in this paper we have focused on classification, we should note that WCEs are straightforwardly applicable to regression tasks too. One reason why we exclusively concentrate on classification is that, in the realm of text, classification is a far more popular task than regression; in other words, there are many more applications of text classification than of text regression, which also means that there are fewer publicly available datasets for experimenting on text regression. A second reason why we have focused on classification is that most text regression tasks are not multiclass, i.e., there is a single class (or “concept”) of interest and the regressor must label a document with a real-valued score for that concept. “Single-class regression” is the regression equivalent of binary classification, and in Sect. 5.3 we have argued that WCEs are not suitable for binary classification; for the very same reasons they are not suitable for “single-class regression”. For all these reasons, in this paper we restrict our interest to (multiclass) classification.

References

  • Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853

  • Baker D, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR 1998), Melbourne, AU, pp 96–103. https://doi.org/10.1145/290941.290970

  • Baldi P (2011) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of the ICML 2011 workshop on unsupervised and transfer learning, Bellevue, US, pp 37–49

  • Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (ACL 2014), Baltimore, US, pp 238–247. https://doi.org/10.3115/v1/p14-1023

  • Bekkerman R, El-Yaniv R, Tishby N, Winter Y (2003) Distributional word clusters vs. words for text categorization. J Mach Learn Res 3:1183–1208

  • Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155

  • Berardi G, Esuli A, Macdonald C, Ounis I, Sebastiani F (2015) Semi-automated text classification for sensitivity identification. In: Proceedings of the 24th ACM international conference on information and knowledge management (CIKM 2015), Melbourne, AU, pp 1711–1714. https://doi.org/10.1145/2806416.2806597

  • Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 730–738

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 4th conference on empirical methods in natural language processing (EMNLP 2006), Sydney, AU, pp 120–128. https://doi.org/10.3115/1610075.1610094

  • Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146. https://doi.org/10.1162/tacl_a_00051

  • Bullinaria JA, Levy JP (2007) Extracting semantic representations from word co-occurrence statistics: a computational study. Behav Res Methods 39(3):510–526. https://doi.org/10.3758/bf03193020

  • Camacho-Collados J, Pilehvar MT (2018) From word to sense embeddings: a survey on vector representations of meaning. J Artif Intell Res 63:743–788. https://doi.org/10.1613/jair.1.11259

  • Caruana R (1993) Multitask learning: A knowledge-based source of inductive bias. In: Proceedings of the 10th international conference on machine learning (ICML 1993), Amherst, US, pp 41–48. https://doi.org/10.1016/b978-1-55860-307-3.50012-5

  • Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537

  • Cortes C, Vapnik V (1995) Support vector networks. Mach Learn 20(3):273–297

  • Daumé H (2007) Frustratingly easy domain adaptation. In: Proceedings of the 45th annual meeting of the association for computational linguistics (ACL 2007), Prague, CZ, pp 256–263

  • Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the 18th ACM symposium on applied computing (SAC 2003), Melbourne, US, pp 784–788. https://doi.org/10.1145/952532.952688

  • Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

  • Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (NAACL 2019), Minneapolis, US, pp 4171–4186

  • Dong Y, Liu P, Zhu Z, Wang Q, Zhang Q (2020) A fusion model-based label embedding and self-interaction attention for text classification. IEEE Access 8:30548–30559. https://doi.org/10.1109/access.2019.2954985

  • Dumais ST, Platt J, Heckerman D, Sahami M (1998) Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM 1998), Bethesda, US, pp 148–155. https://doi.org/10.1145/288627.288651

  • Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11:625–660

  • Forman G (2004) A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of the 21st international conference on machine learning (ICML 2004), Banff, CA, pp 38–45. https://doi.org/10.1145/1015330.1015356

  • Garneau N, Leboeuf J, Lamontagne L (2019) Contextual generation of word embeddings for out-of-vocabulary words in downstream tasks. In: Proceedings of the 32nd Canadian conference on artificial intelligence (Canadian AI), Kingston, CA, pp 563–569. https://doi.org/10.1007/978-3-030-18305-9_60

  • Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics (AISTATS 2010), Chia Laguna, Italy, pp 249–256

  • González P, Castaño A, Chawla NV, del Coz JJ (2017) A review on quantification learning. ACM Comput Surv 50(5):74:1–74:40. https://doi.org/10.1145/3117807

  • Grave E, Mikolov T, Joulin A, Bojanowski P (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics (EACL 2017), Valencia, ES, pp 427–431. https://doi.org/10.18653/v1/e17-2068

  • Gupta S, Kanchinadam T, Conathan D, Fung G (2019) Task-optimized word embeddings for text classification representations. Front Appl Math Stat 5:67

  • Harris ZS (1954) Distributional structure. Word 10(2–3):146–162. https://doi.org/10.1007/978-94-017-6059-1_36

  • Hersh W, Buckley C, Leone T, Hickman D (1994) OHSUMED: an interactive retrieval evaluation and new large text collection for research. In: Proceedings of the 17th ACM international conference on research and development in information retrieval (SIGIR 1994), Dublin, IE, pp 192–201. https://doi.org/10.1007/978-1-4471-2099-5_20

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  • Hsu DJ, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: Proceedings of the 23rd annual conference on neural information processing systems (NIPS 2009), Vancouver, CA, pp 772–780

  • Jiang M, Liang Y, Feng X, Fan X, Pei Z, Xue Y, Guan R (2018) Text classification based on deep belief network and softmax regression. Neural Comput Appl 29(1):61–70. https://doi.org/10.1007/s00521-016-2401-x

  • Jin P, Zhang Y, Chen X, Xia Y (2016) Bag-of-embeddings for text classification. In: Proceedings of the 26th international joint conference on artificial intelligence (IJCAI 2016), New York, US, pp 2824–2830

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning (ECML 1998), Chemnitz, DE, pp 137–142. https://doi.org/10.1007/bfb0026683

  • Joachims T (2001) A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th ACM conference on research and development in information retrieval (SIGIR 2001), New Orleans, US, pp 128–136. https://doi.org/10.1145/383952.383974

  • Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1746–1751

  • Kim Y, Jernite Y, Sontag D, Rush AM (2016) Character-aware neural language models. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI 2016), Phoenix, US, pp 2741–2749

  • Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, US

  • Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Proceedings of the 29th AAAI conference on artificial intelligence (AAAI 2015), Austin, US, pp 2267–2273

  • Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735

  • Le HT, Cerisara C, Denis A (2018) Do convolutional networks need to be deep for text classification?. In: Proceedings of the AAAI 2018 workshop on affective content analysis, New Orleans, US, pp 29–36

  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  • Lei X, Cai Y, Xu J, Ren D, Li Q, Leung HF (2019) Incorporating task-oriented representation in text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 401–415

  • Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguist 3:211–225

  • Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Proceedings of the 28th annual conference on neural information processing systems (NIPS 2014), Montreal, CA, pp 2177–2185

  • Lewis DD (1992) An evaluation of phrasal and clustered representations on a text categorization task. In: Proceedings of the 15th ACM international conference on research and development in information retrieval (SIGIR 1992), Kobenhavn, DK, pp 37–50

  • Lin J (2019) The neural hype and comparisons against weak baselines. SIGIR Forum 52(1):40–51

  • Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP 2015), Lisbon, PT, pp 1412–1421

  • McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 6294–6305

  • Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. In: Workshop track proceedings of the 1st international conference on learning representations (ICLR 2013), Scottsdale, US

  • Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the 11th international conference on language resources and evaluation (LREC 2018), Miyazaki, JP

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 3111–3119

  • Mnih A, Kavukcuoglu K (2013) Learning word embeddings efficiently with noise-contrastive estimation. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 2265–2273

  • Moreo A, Esuli A, Sebastiani F (2016) Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. J Artif Intell Res 55:131–163

  • Moreo A, Esuli A, Sebastiani F (2020) Learning to weight for text classification. IEEE Trans Knowl Data Eng 32(2):302–316. https://doi.org/10.1109/TKDE.2018.2883446

  • Moreo A, Pedrotti A, Sebastiani F (2021) Heterogeneous document embeddings for cross-lingual text classification. In: Proceedings of the 36th ACM symposium on applied computing (SAC 2021), Gwangju, KR. https://doi.org/10.1145/3412841.3442093 (forthcoming)

  • Morik K, Brockhausen P, Joachims T (1999) Combining statistical learning with a knowledge-based approach. A case study in intensive care monitoring. In: Proceedings of the 16th international conference on machine learning (ICML 1999), Bled, SL, pp 268–277

  • Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1/2):1–135

  • Pappas N, Henderson J (2019) Gile: a generalized input-label embedding for text classification. Trans Assoc Comput Linguist 7:139–155

  • Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1532–1543

  • Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics (NAACL 2018), New Orleans, US, pp 2227–2237

  • Ren H, Zeng Z, Cai Y, Du Q, Li Q, Xie H (2019) A weighted word embedding model for text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 419–434

  • Riloff E, Wiebe J, Phillips W (2005) Exploiting subjectivity classification to improve information extraction. In: Proceedings of the 12th conference of the american association for artificial intelligence (AAAI 2005), Pittsburgh, US, pp 1106–1111

  • Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536. https://doi.org/10.1038/323533a0

  • Saerens M, Latinne P, Decaestecker C (2002) Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput 14(1):21–41. https://doi.org/10.1162/089976602753284446

  • Sahlgren M (2005) An introduction to random indexing. In: Proceedings of the TKE workshop on methods and applications of semantic indexing, Copenhagen, DK

  • Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP 2013), Seattle, US, pp 1631–1642

  • Soucy P, Mineau GW (2005) Beyond TFIDF weighting for text categorization in the vector space model. In: Proceedings of the 19th international joint conference on artificial intelligence (IJCAI 2005), Edinburgh, UK, pp 1130–1135

  • Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958

  • Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th international conference on language resources and evaluation (LREC 2006), Genova, IT, pp 2142–2147

  • Tang J, Qu M, Mei Q (2015) PTE: Predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM international conference on knowledge discovery and data mining (KDD 2015), Sydney, AU, pp 1165–1174

  • van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 5998–6008

  • Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, Henao R, Carin L (2018) Joint embedding of words and labels for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (ACL 2018), Melbourne, AU, pp 2321–2331

  • Wang S, Manning CD (2012) Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics (ACL 2012), Jeju Island, KR, pp 90–94

  • Yang Y, Chute CG (1994) An example-based mapping method for text categorization and retrieval. ACM Trans Inf Syst 12(3):252–277

  • Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019b) XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd annual conference on neural information processing systems (NeurIPS 2019), Vancouver, CA, pp 5754–5764

  • Yang W, Lu K, Yang P, Lin J (2019a) Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In: Proceedings of the 42nd ACM conference on research and development in information retrieval (SIGIR 2019), Paris, FR, pp 1129–1132. https://doi.org/10.1145/3331184.3331340

  • Yao S, Yu D, Xiao K (2019) Enhancing domain word embedding via latent semantic imputation. In: Proceedings of the 25th ACM conference on knowledge discovery and data mining (KDD 2019), Anchorage, US, pp 557–565. https://doi.org/10.1145/3292500.3330926

  • Yu HF, Jain P, Kar P, Dhillon I (2014) Large-scale multi-label learning with missing labels. In: Proceedings of the 31st international conference on machine learning (ICML 2014), Beijing, CN, pp 593–601

  • Zhang L, Wang S, Liu B (2018) Deep learning for sentiment analysis: a survey. Wiley Interdiscip Rev Data Min Knowl Discov 8(4):e1253. https://doi.org/10.1002/widm.1253

  • Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 649–657

Acknowledgements

The present work has been supported by the ARIADNEplus project, funded by the European Commission (Grant 823914) under the H2020 Programme INFRAIA-2018-1, by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, and by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020. The authors’ opinions do not necessarily reflect those of the European Commission. Thanks to NVidia for donating the two Titan GPUs on which many of the experiments discussed in this paper were run. We thank the anonymous reviewers for valuable feedback that allowed us to improve the quality of this paper.

Author information

Corresponding author

Correspondence to Alejandro Moreo.

Additional information

Responsible editor: Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Moreo, A., Esuli, A. & Sebastiani, F. Word-class embeddings for multiclass text classification. Data Min Knowl Disc 35, 911–963 (2021). https://doi.org/10.1007/s10618-020-00735-3
