Word-class embeddings for multiclass text classification

Abstract

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
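For orientation, the following is a minimal sketch of the idea described above, not the authors' exact implementation (which is available in the linked repository). The helper name, the co-occurrence-based correlation, and the L1 row normalization (cf. footnote 5) are illustrative assumptions.

```python
import numpy as np

def word_class_embeddings(X, Y):
    """Illustrative sketch (not the official implementation): represent each
    word by its correlation with the m target classes.
    X: (n_docs, |V|) document-term matrix; Y: (n_docs, m) binary label matrix.
    Returns a (|V|, m) matrix of word-class scores, L1-normalized by row."""
    W = X.T @ Y                               # word-class co-occurrence scores
    row_sums = W.sum(axis=1, keepdims=True)
    return W / np.maximum(row_sums, 1e-12)    # each word spends a "budget" of mass 1

# Concatenation with pre-trained embeddings (e.g., a GloVe matrix aligned to the
# same vocabulary, assumed available as `glove` of shape (|V|, 300)):
# wce = word_class_embeddings(X_train, Y_train)
# embedding_matrix = np.hstack([glove, wce])  # (|V|, 300 + m)
```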


Notes

  1. Given a set of classes (a.k.a. a codeframe) \(\mathcal {C}=\{c_{1},\ldots ,c_{m}\}\), a classification problem is said to be multiclass if \(m>2\); it is said to be single-label if each item always belongs to exactly one class; it is said to be multilabel if each item can belong to any number (i.e., 0, 1, or more than 1) of classes in \(\mathcal {C}\).

  2. fastText can consider not only unigrams but also n-grams and subwords as the surface forms of input.

  3. Pointwise Mutual Information (PMI) is defined as \(\mathrm {PMI}(w_{i},c_{j})=\log \frac{\Pr (w_{i},c_{j})}{\Pr (w_{i})\Pr (c_{j})}\), where \(\Pr (w_{i},c_{j})\) is the joint probability of word \(w_{i}\) and context \(c_{j}\), and \(\Pr (w_{i})\) and \(\Pr (c_{j})\) are the marginal probabilities of the word and context, respectively. PPMI takes the positive part of PMI, i.e., \(\mathrm {PPMI}(w_{i},c_{j})=\max \{0,\mathrm {PMI}(w_{i},c_{j})\}\).
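The definitions above translate directly into code; the following numpy sketch (an illustration, with zero-count cells mapped to a PMI of 0) computes PPMI from a word-context co-occurrence count matrix:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a word-context (or word-class) co-occurrence count matrix.
    counts[i, j] = number of co-occurrences of word w_i with context/class c_j."""
    total = counts.sum()
    p_wc = counts / total                          # joint probabilities Pr(w_i, c_j)
    p_w = p_wc.sum(axis=1, keepdims=True)          # marginals Pr(w_i)
    p_c = p_wc.sum(axis=0, keepdims=True)          # marginals Pr(c_j)
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                   # zero counts: treat PMI as 0
    return np.maximum(pmi, 0.0)                    # PPMI = max{0, PMI}
```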

  4. The compatibility between a label embedding matrix \(E\) and a document embedding \(h\) is defined to be proportional to \(\sigma (EU+b_u)\sigma (Vh+b_v)\); this contrasts with previous related literature, which customarily relied on bilinear models of the form \(EWh\) for the same purpose (\(U,b_u,V,b_v,W\) are learnable parameters).

  5. Put another way, L1 normalization fixes a “budget” of mass 1 for the score a term can deliver across the classes, irrespective of its prevalence in language or in the corpus.
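A toy numpy illustration of this “budget” effect (the values are made up): two terms with the same class profile but very different frequencies end up with identical L1-normalized rows.

```python
import numpy as np

# Each row holds the (non-negative) scores a term receives for the m classes.
scores = np.array([[3.0, 1.0, 0.0],     # frequent term
                   [0.3, 0.1, 0.0]])    # rare term with the same class profile
l1 = scores / scores.sum(axis=1, keepdims=True)
print(l1)   # both rows become [0.75, 0.25, 0.0]: a total mass of 1 per term
```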

  6. It is worth recalling that the bag-of-words model tends to produce matrices that are highly sparse. Many software packages take advantage of this sparsity in order to compute matrix multiplication efficiently, at a cost that, in practice, falls far below the asymptotic bound \(O(|\mathcal {V}|nm)\). We discuss empirical computational complexity issues in Sect. 4.7.

  7. PCA is based on (truncated) Singular Value Decomposition (SVD). The SVD of a matrix \(\mathbf {M}\) is a factorization of the form \(\mathbf {U}\boldsymbol{\Sigma }\mathbf {V}^\top \), in which \(\mathbf {U}\) and \(\mathbf {V}\) are orthogonal matrices containing the left and right singular vectors of \(\mathbf {M}\) as their columns, respectively, and \(\boldsymbol{\Sigma }\) is a diagonal matrix containing the singular values of \(\mathbf {M}\). That is, PCA is an alternative way of factoring \(\mathbf {M}\) w.r.t. Eq. 7. The dimensionality reduction is achieved by ordering the components by decreasing singular value and truncating the matrices. The optimal rank-\(r\) approximation of \(\mathbf {M}\) is thus given by \(\mathbf {U}_{r}\boldsymbol{\Sigma }_{r}\), which only accounts for the \(r\) largest singular values and their corresponding \(r\) left singular vectors.
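A minimal numpy sketch of this truncation (scikit-learn's TruncatedSVD/PCA offer equivalent functionality; the function name is an illustrative assumption):

```python
import numpy as np

def truncated_svd(M, r):
    """Rank-r reduction of M via truncated SVD; returns U_r @ Sigma_r,
    i.e., one r-dimensional row per row of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(s) Vt
    return U[:, :r] * s[:r]                            # keep the r largest singular values

# e.g., reducing a |V| x m matrix of word-class scores to r dimensions:
# reduced = truncated_svd(wce_matrix, r=50)
```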

  8. Note that the columns of \(\mathbf {Y}\) are binary, indicating the presence or absence of each label for each document. It is instructive to view \(\mathbf {Y}\)’s binary columns as indicator functions that select which elements of the rows of \(\mathbf {X}_{1}^\top \) contribute to the summation in the dot product.
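A toy example of this product (the numbers are made up), using a sparse bag-of-words matrix as discussed in footnote 6:

```python
import numpy as np
from scipy.sparse import csr_matrix

# X1: (n_docs, |V|) bag-of-words matrix (typically very sparse)
# Y:  (n_docs, m) binary label matrix
X1 = csr_matrix(np.array([[2, 0, 1],
                          [0, 1, 1],
                          [1, 0, 0]]))
Y = np.array([[1, 0],
              [1, 1],
              [0, 1]])

# Column j of Y selects the documents labelled with c_j, so column j of the
# product sums the corresponding rows of X1; sparse formats keep this cheap.
WC = np.asarray(X1.T @ Y)    # shape (|V|, m)
print(WC)
```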

  9. Since we undertake a stochastic optimization, this actually applies to batches of data.

  10. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

  11. http://qwone.com/~jason/20Newsgroups/. Note that this version of 20Newsgroups is indeed single-label: while a previous version contained a small set of documents with more than one label (corresponding to posts that had been cross-posted to more than one newsgroup), that set is not present in the version we use.

  12. While some previous papers [e.g., Tang et al. (2015)] have reported substantially higher scores for this dataset, it is worth noting that we use a harder, more realistic version of the dataset than has been used in those papers. Following Moreo et al. (2020), in our version we remove all headers, footers, and quotes, since these fields contain words that are highly correlated with the target labels, thus making the classification task unrealistically easy; see http://scikit-learn.org/stable/datasets/twenty_newsgroups.html for further details. Our results are indeed consistent with those of other papers following the same policy.
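For reference, this “harder” version can be obtained with scikit-learn via the remove argument documented at the page linked above:

```python
from sklearn.datasets import fetch_20newsgroups

# Strip the metadata fields that leak the target label.
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
print(len(train.data), len(train.target_names))   # ~11,314 training documents, 20 classes
```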

  13. http://disi.unitn.it/moschitti/corpora.htm.

  14. Available from http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.

  15. https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis.

  16. https://www.wipo.int/classifications/ipc/en/ITsupport/Categorization/dataset/.

  17. http://scikit-learn.org/.

  18. Note that these deep models are not meant to be used here as baselines, but rather as vehicles on which to test WCEs. In other words, the actual baseline for any model equipped with WCEs is the same model without WCEs.

  19. http://nlp.stanford.edu/data/glove.840B.300d.zip.

  20. We generate the validation set by randomly sampling 20% of the training set, with a maximum of 20,000 documents; the rest is taken to be the training set proper. We keep the training/validation split consistent across all methods.
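A sketch of this split, assuming scikit-learn's train_test_split; the helper name and the fixed seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

def split_train_validation(docs, labels, val_prop=0.2, max_val=20000, seed=42):
    """Hold out 20% of the training documents (capped at 20,000) as the
    validation set; a fixed seed keeps the split consistent across methods."""
    n_val = min(int(len(docs) * val_prop), max_val)
    # returns docs_train, docs_val, labels_train, labels_val
    return train_test_split(docs, labels, test_size=n_val, random_state=seed)
```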

  21. Note that, consistently with (Cortes and Vapnik 1995; Morik et al. 1999), in this formulation we assume the class labels \(y_{k}\) to be in \(\{-1,+1\}\), while in Sect. 3 we had assumed them to be in \(\{0,1\}\); the difference is, of course, unproblematic.

  22. In scikit-learn this is achieved by setting \(J_+=n/(mP)\) and \(J_-=n/(mN)\), and corresponds to setting the parameter class_weight to “balanced”.
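In code, with scikit-learn's LinearSVC (the value of C is illustrative; see also footnote 24 on how it is optimized):

```python
from sklearn.svm import LinearSVC

# class_weight='balanced' reweights each class inversely to its frequency,
# which corresponds to the J+ = n/(mP), J- = n/(mN) setting described above.
svm = LinearSVC(class_weight='balanced', C=1.0)
# svm.fit(Xtr, ytr)
```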

  23. Somewhat surprisingly, though, several relevant related works in which SVMs are used as baselines [see, e.g., (Grave et al. 2017; Jiang et al. 2018; Zhang et al. 2015)] do not report the details of how, if at all, they tune the SVM hyperparameters.

  24. Using k-fold cross-validation (k-FCV) on the full set of labelled documents is a more expensive, but stronger, way of doing parameter optimization than using a single split between a training set and a validation set, because k-FCV performs k such splits. We use k-FCV for SVMs and single-split optimization for the deep-learning architectures because it is realistic to do so, i.e., because SVMs are computationally cheap enough that we can afford k-FCV, while neural architectures are not.
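A sketch of how such a k-FCV optimization can be run with scikit-learn's GridSearchCV; the parameter grid, the number of folds (5), and the scoring function shown here are illustrative assumptions, and multilabel datasets would additionally require a one-vs-rest wrapper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}   # illustrative grid
search = GridSearchCV(LinearSVC(class_weight='balanced'), param_grid,
                      cv=5, scoring='f1_macro', n_jobs=-1)
# search.fit(Xtr, ytr)
# best_svm = search.best_estimator_
```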

  25. This implementation relies on liblinear. See https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html for further details.

  26. The STW functions we have considered include chi-square, information gain, gain ratio, pointwise mutual information (Debole and Sebastiani 2003), ConfWeight (Soucy and Mineau 2005), and relevance frequency (Lan et al. 2009).

  27. Given a word w, a codeframe \(\mathcal {C}=\{c_{1},\ldots ,c_m\}\), and an STW function f that generates a list of scores \(S=(f(w,c_{1}),\ldots ,f(w,c_m))\), we consider the following aggregation functions: averaging \(\left( \frac{1}{m}\sum _{c\in \mathcal {C}}f(w,c)\right) \), averaging weighted by class prevalence \(\left( \frac{\sum _{c\in \mathcal {C}}f(w,c)p(c)}{\sum _{c\in \mathcal {C}}p(c)}\right) \), where p(c) is the prevalence of class c, and max-pooling \(\left( \max _{c\in \mathcal {C}}\{f(w,c)\}\right) \).
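The three aggregation functions in numpy (the helper name and the keyword values are illustrative):

```python
import numpy as np

def aggregate_stw(S, prevalences, how='max'):
    """Collapse the per-class STW scores S = (f(w,c_1), ..., f(w,c_m))
    of a word into a single score."""
    S = np.asarray(S, dtype=float)
    p = np.asarray(prevalences, dtype=float)
    if how == 'mean':                      # plain averaging
        return S.mean()
    if how == 'wmean':                     # averaging weighted by class prevalence
        return (S * p).sum() / p.sum()
    return S.max()                         # max-pooling
```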

  28. https://github.com/ThilinaRajapakse/simpletransformers.

  29. https://huggingface.co/bert-base-uncased.

  30. https://github.com/guoyinwang/LEAM.

  31. Note that by fastText we here mean its “supervised” mode, that is, fastText as a classifier. The embeddings that fastText produces when working in “unsupervised” mode are used and discussed later, in Sect. 4.9, along with other sets of embeddings.
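For reference, a minimal supervised fastText run with the official Python bindings; the hyperparameter values are illustrative, and the training-file name is an assumption (the expected format, one labelled document per line, is shown in the comment).

```python
import fasttext

# Training file: one document per line, prefixed with its label, e.g.
#   __label__earn stocks rallied after the quarterly report ...
model = fasttext.train_supervised(input='train.txt', epoch=25, lr=0.5, wordNgrams=2)
labels, probs = model.predict("oil prices fell sharply on monday")
```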

  32. https://fasttext.cc/.

  33. We modified the official implementation of https://github.com/guoyinwang/LEAM to use early stopping.

  34. Though most traditional functions used for feature selection can only handle presence/absence, other metrics exist that work with weighted scores, e.g., the Fisher score. In initial experiments not described in this paper we did try the Fisher score, but eventually abandoned it because (a) its computation is very slow, and (b) the classification accuracy we observed is not much different from what can be obtained with the other functions mentioned above, and is often intermediate between the best and the worst recorded values.

  35. Available at https://code.google.com/archive/p/word2vec/.

  36. Available at https://fasttext.cc/docs/en/english-vectors.html.

  37. We used the Huggingface’s implementation available at https://github.com/huggingface/transformers.

  38. See https://fasttext.cc/docs/en/unsupervised-tutorial.html.

  39. More often than not, BERT is used by fine-tuning the entire model to the task at hand. In this set of experiments we prefer to reproduce a simpler scenario, in which the practitioner simply uses the pre-trained models as made available by the developers of BERT. Fine-tuning models such as BERT requires a considerable amount of computational power, which might not be within everyone’s reach. Experiments showcasing how a properly fine-tuned BERT performs (with and without WCEs) on our datasets are illustrated in Sect. 4.4.
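A sketch of this “frozen” usage with the Hugging Face transformers library: pre-trained BERT serves as a fixed feature extractor, with no gradient updates (the example texts are made up).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')
bert.eval()                                    # no fine-tuning: weights stay frozen

texts = ["wheat futures rose", "the court rejected the appeal"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    out = bert(**batch)
features = out.last_hidden_state[:, 0, :]      # [CLS] vectors, shape (batch, 768)
```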

  40. https://projector.tensorflow.org/.

  41. Another technique for solving this problem is Latent Semantic Imputation (Yao et al. 2019). This method fills in the missing representation in a vector space (in our case, the space of WCEs) by analyzing the neighborhood of the word representation in another vector space (in our case, the space of unsupervised embeddings), via techniques inspired by manifold learning.

  42. The fact that WCEs are not suitable for codeframes containing just a few classes is the reason why all the datasets we have chosen for our experiments are for classification by topic (CBT). While WCEs are not inherently about CBT, it is a matter of fact that large enough codeframes are mostly to be found in CBT (e.g., when classifying text according to domain-specific taxonomies/thesauri). Other classification tasks of a non-topical nature are often characterized by codeframes consisting of two or three classes; examples of this are classification by sensitivity (Sensitive versus NonSensitive) (Berardi et al. 2015), sentiment classification (Positive versus Neutral versus Negative) (Pang and Lee 2008), or classification by subjectivity (Subjective versus Objective) (Riloff et al. 2005).

  43. While in this paper we have focused on classification, we should note that WCEs are straightforwardly applicable to regression tasks too. One reason why we exclusively concentrate on classification is that, in the realm of text, classification is a far more popular task than regression; in other words, there are many more applications of text classification than of text regression, which also means that there are fewer publicly available datasets for experimenting on text regression. A second reason why we have focused on classification is that most text regression tasks are not multiclass, i.e., there is a single class (or “concept”) of interest and the regressor must label a document with a real-valued score for that concept. “Single-class regression” is the regression equivalent of binary classification, and in Sect. 5.3 we have argued that WCEs are not suitable for binary classification; for the very same reasons they are not suitable for “single-class regression”. For all these reasons, in this paper we restrict our interest to (multiclass) classification.

References

  • Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853

  • Baker D, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR 1998), Melbourne, AU, pp 96–103. https://doi.org/10.1145/290941.290970

  • Baldi P (2011) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of the ICML 2011 workshop on unsupervised and transfer learning, Bellevue, US, pp 37–49

  • Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (ACL 2014), Baltimore, US, pp 238–247. https://doi.org/10.3115/v1/p14-1023

  • Bekkerman R, El-Yaniv R, Tishby N, Winter Y (2003) Distributional word clusters vs. words for text categorization. J Mach Learn Res 3:1183–1208

  • Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155

  • Berardi G, Esuli A, Macdonald C, Ounis I, Sebastiani F (2015) Semi-automated text classification for sensitivity identification. In: Proceedings of the 24th ACM international conference on information and knowledge management (CIKM 2015), Melbourne, AU, pp 1711–1714. https://doi.org/10.1145/2806416.2806597

  • Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 730–738

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 4th conference on empirical methods in natural language processing (EMNLP 2006), Sydney, AU, pp 120–128. https://doi.org/10.3115/1610075.1610094

  • Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146. https://doi.org/10.1162/tacl_a_00051

  • Bullinaria JA, Levy JP (2007) Extracting semantic representations from word co-occurrence statistics: a computational study. Behav Res Methods 39(3):510–526. https://doi.org/10.3758/bf03193020

  • Camacho-Collados J, Pilehvar MT (2018) From word to sense embeddings: a survey on vector representations of meaning. J Artif Intell Res 63:743–788. https://doi.org/10.1613/jair.1.11259

  • Caruana R (1993) Multitask learning: A knowledge-based source of inductive bias. In: Proceedings of the 10th international conference on machine learning (ICML 1993), Amherst, US, pp 41–48. https://doi.org/10.1016/b978-1-55860-307-3.50012-5

  • Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537

  • Cortes C, Vapnik V (1995) Support vector networks. Mach Learn 20(3):273–297

  • Daumé H (2007) Frustratingly easy domain adaptation. In: Proceedings of the 45th annual meeting of the association for computational linguistics (ACL 2007), Prague, CZ, pp 256–263

  • Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the 18th ACM symposium on applied computing (SAC 2003), Melbourne, US, pp 784–788. https://doi.org/10.1145/952532.952688

  • Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

  • Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (NAACL 2019), Minneapolis, US, pp 4171–4186

  • Dong Y, Liu P, Zhu Z, Wang Q, Zhang Q (2020) A fusion model-based label embedding and self-interaction attention for text classification. IEEE Access 8:30548–30559. https://doi.org/10.1109/access.2019.2954985

  • Dumais ST, Platt J, Heckerman D, Sahami M (1998) Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM 1998), Bethesda, US, pp 148–155. https://doi.org/10.1145/288627.288651

  • Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11:625–660

  • Forman G (2004) A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of the 21st international conference on machine learning (ICML 2004), Banff, CA, pp 38–45. https://doi.org/10.1145/1015330.1015356

  • Garneau N, Leboeuf J, Lamontagne L (2019) Contextual generation of word embeddings for out-of-vocabulary words in downstream tasks. In: Proceedings of the 32nd Canadian conference on artificial intelligence (Canadian AI), Kingston, CA, pp 563–569. https://doi.org/10.1007/978-3-030-18305-9_60

  • Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics (AISTATS 2010), Chia Laguna, Italy, pp 249–256

  • González P, Castaño A, Chawla NV, del Coz JJ (2017) A review on quantification learning. ACM Comput Surv 50(5):74:1–74:40. https://doi.org/10.1145/3117807

  • Grave E, Mikolov T, Joulin A, Bojanowski P (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics (EACL 2017), Valencia, ES, pp 427–431. https://doi.org/10.18653/v1/e17-2068

  • Gupta S, Kanchinadam T, Conathan D, Fung G (2019) Task-optimized word embeddings for text classification representations. Front Appl Math Stat 5:67

  • Harris ZS (1954) Distributional structure. Word 10(2–3):146–162. https://doi.org/10.1007/978-94-017-6059-1_36

  • Hersh W, Buckley C, Leone T, Hickman D (1994) OHSUMED: an interactive retrieval evaluation and new large text collection for research. In: Proceedings of the 17th ACM international conference on research and development in information retrieval (SIGIR 1994), Dublin, IE, pp 192–201. https://doi.org/10.1007/978-1-4471-2099-5_20

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  • Hsu DJ, Kakade SM, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: Proceedings of the 23rd annual conference on neural information processing systems (NIPS 2009), Vancouver, CA, pp 772–780

  • Jiang M, Liang Y, Feng X, Fan X, Pei Z, Xue Y, Guan R (2018) Text classification based on deep belief network and softmax regression. Neural Comput Appl 29(1):61–70. https://doi.org/10.1007/s00521-016-2401-x

  • Jin P, Zhang Y, Chen X, Xia Y (2016) Bag-of-embeddings for text classification. In: Proceedings of the 26th international joint conference on artificial intelligence (IJCAI 2016), New York, US, pp 2824–2830

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning (ECML 1998), Chemnitz, DE, pp 137–142. https://doi.org/10.1007/bfb0026683

  • Joachims T (2001) A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th ACM conference on research and development in information retrieval (SIGIR 2001), New Orleans, US, pp 128–136. https://doi.org/10.1145/383952.383974

  • Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1746–1751

  • Kim Y, Jernite Y, Sontag D, Rush AM (2016) Character-aware neural language models. In: Proceedings of the 30th AAAI conference on artificial intelligence (AAAI 2016), Phoenix, US, pp 2741–2749

  • Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, US

  • Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: Proceedings of the 29th AAAI conference on artificial intelligence (AAAI 2015), Austin, US, pp 2267–2273

  • Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735

  • Le HT, Cerisara C, Denis A (2018) Do convolutional networks need to be deep for text classification?. In: Proceedings of the AAAI 2018 workshop on affective content analysis, New Orleans, US, pp 29–36

  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  • Lei X, Cai Y, Xu J, Ren D, Li Q, Leung HF (2019) Incorporating task-oriented representation in text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 401–415

  • Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word embeddings. Trans Assoc Comput Linguist 3:211–225

  • Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Proceedings of the 28th annual conference on neural information processing systems (NIPS 2014), Montreal, CA, pp 2177–2185

  • Lewis DD (1992) An evaluation of phrasal and clustered representations on a text categorization task. In: Proceedings of the 15th ACM international conference on research and development in information retrieval (SIGIR 1992), Kobenhavn, DK, pp 37–50

  • Lin J (2019) The neural hype and comparisons against weak baselines. SIGIR Forum 52(1):40–51

  • Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP 2015), Lisbon, PT, pp 1412–1421

  • McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 6294–6305

  • Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. In: Workshop track proceedings of the 1st international conference on learning representations (ICLR 2013), Scottsdale, US

  • Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the 11th international conference on language resources and evaluation (LREC 2018), Miyazaki, JP

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 3111–3119

  • Mnih A, Kavukcuoglu K (2013) Learning word embeddings efficiently with noise-contrastive estimation. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS 2013), Lake Tahoe, US, pp 2265–2273

  • Moreo A, Esuli A, Sebastiani F (2016) Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. J Artif Intell Res 55:131–163

  • Moreo A, Esuli A, Sebastiani F (2020) Learning to weight for text classification. IEEE Trans Knowl Data Eng 32(2):302–316. https://doi.org/10.1109/TKDE.2018.2883446

  • Moreo A, Pedrotti A, Sebastiani F (2021) Heterogeneous document embeddings for cross-lingual text classification. In: Proceedings of the 36th ACM symposium on applied computing (SAC 2021), Gwangju, KR. https://doi.org/10.1145/3412841.3442093 (forthcoming)

  • Morik K, Brockhausen P, Joachims T (1999) Combining statistical learning with a knowledge-based approach. A case study in intensive care monitoring. In: Proceedings of the 16th international conference on machine learning (ICML 1999), Bled, SL, pp 268–277

  • Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1/2):1–135

  • Pappas N, Henderson J (2019) Gile: a generalized input-label embedding for text classification. Trans Assoc Comput Linguist 7:139–155

  • Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), Doha, QA, pp 1532–1543

  • Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics (NAACL 2018), New Orleans, US, pp 2227–2237

  • Ren H, Zeng Z, Cai Y, Du Q, Li Q, Xie H (2019) A weighted word embedding model for text classification. In: Proceedings of the 24th international conference on database systems for advanced applications (DASFAA 2019), Chiang Mai, TH, pp 419–434

  • Riloff E, Wiebe J, Phillips W (2005) Exploiting subjectivity classification to improve information extraction. In: Proceedings of the 12th conference of the american association for artificial intelligence (AAAI 2005), Pittsburgh, US, pp 1106–1111

  • Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536. https://doi.org/10.1038/323533a0

  • Saerens M, Latinne P, Decaestecker C (2002) Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Comput 14(1):21–41. https://doi.org/10.1162/089976602753284446

  • Sahlgren M (2005) An introduction to random indexing. In: Proceedings of the TKE workshop on methods and applications of semantic indexing, Copenhagen, DK

  • Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP 2013), Seattle, US, pp 1631–1642

  • Soucy P, Mineau GW (2005) Beyond TFIDF weighting for text categorization in the vector space model. In: Proceedings of the 19th international joint conference on artificial intelligence (IJCAI 2005), Edinburgh, UK, pp 1130–1135

  • Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958

  • Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th international conference on language resources and evaluation (LREC 2006), Genova, IT, pp 2142–2147

  • Tang J, Qu M, Mei Q (2015) PTE: Predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the 21st ACM international conference on knowledge discovery and data mining (KDD 2015), Sydney, AU, pp 1165–1174

  • van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st annual conference on neural information processing systems (NIPS 2017), Long Beach, US, pp 5998–6008

  • Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, Henao R, Carin L (2018) Joint embedding of words and labels for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (ACL 2018), Melbourne, AU, pp 2321–2331

  • Wang S, Manning CD (2012) Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the association for computational linguistics (ACL 2012), Jeju Island, KR, pp 90–94

  • Yang Y, Chute CG (1994) An example-based mapping method for text categorization and retrieval. ACM Trans Inf Syst 12(3):252–277

  • Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019b) XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd annual conference on neural information processing systems (NeurIPS 2019), Vancouver, CA, pp 5754–5764

  • Yang W, Lu K, Yang P, Lin J (2019a) Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In: Proceedings of the 42nd ACM conference on research and development in information retrieval (SIGIR 2019), Paris, FR, pp 1129–1132. https://doi.org/10.1145/3331184.3331340

  • Yao S, Yu D, Xiao K (2019) Enhancing domain word embedding via latent semantic imputation. In: Proceedings of the 25th ACM conference on knowledge discovery and data mining (KDD 2019), Anchorage, US, pp 557–565. https://doi.org/10.1145/3292500.3330926

  • Yu HF, Jain P, Kar P, Dhillon I (2014) Large-scale multi-label learning with missing labels. In: Proceedings of the 31st international conference on machine learning (ICML 2014), Beijing, CN, pp 593–601

  • Zhang L, Wang S, Liu B (2018) Deep learning for sentiment analysis: a survey. Wiley Interdiscip Rev Data Min Knowl Discov 8(4):e1253. https://doi.org/10.1002/widm.1253

  • Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 29th annual conference on neural information processing systems (NIPS 2015), Montreal, CA, pp 649–657

Acknowledgements

The present work has been supported by the ARIADNEplus project, funded by the European Commission (Grant 823914) under the H2020 Programme INFRAIA-2018-1, by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, and by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020. The authors’ opinions do not necessarily reflect those of the European Commission. Thanks to NVidia for donating the two Titan GPUs on which many of the experiments discussed in this paper were run. We thank the anonymous reviewers for valuable feedback that allowed us to improve the quality of this paper.

Author information

Corresponding author

Correspondence to Alejandro Moreo.

Additional information

Responsible editor: Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Moreo, A., Esuli, A. & Sebastiani, F. Word-class embeddings for multiclass text classification. Data Min Knowl Disc 35, 911–963 (2021). https://doi.org/10.1007/s10618-020-00735-3
