Active neural learners for text with dual supervision

  • Original Article

Neural Computing and Applications

Abstract

Dual supervision for text classification and information retrieval, which involves training the machine with class labels augmented with text annotations that are indicative of the class, has been shown to provide significant improvements both in and beyond active learning (AL) settings. Annotations in their simplest form are highlighted portions of the text that are indicative of the class. In this work, we aim to identify and realize the full potential of unsupervised pretrained word embeddings for text-related tasks in AL settings by training neural nets, specifically convolutional and recurrent neural nets, through dual supervision. We propose an architecture-independent algorithm for training neural networks with human rationales for class assignments and show how unsupervised embeddings can be better leveraged in active learning settings using this algorithm. The proposed solution uses gradient-based feature attributions to constrain the machine to follow the user annotations; further, we discuss methods for overcoming the architecture-specific challenges in the optimization. Our results on the sentiment classification task show that one annotated and labeled document can be worth up to seven labeled documents, giving accuracies of up to 70% for as few as ten labeled and annotated documents, and show promise in significantly reducing user effort for the total-recall information retrieval task in systematic literature reviews.


Notes

  1. Code for replicating the experiments: https://github.com/chandramouli-sastry/dual-AL

  2. https://code.google.com/archive/p/word2vec/

  3. Available at https://zenodo.org/record/1162952#.W3aQFNgzbRY

References

  1. Abdi A, Shamsuddin SM, Hasan S, Piran J (2019) Deep learning-based sentiment classification of evaluative text based on multi-feature fusion. Inf Process Manag 56(4):1245–1259. https://doi.org/10.1016/j.ipm.2019.02.018

  2. Ali F, Kwak D, Khan P, El-Sappagh S, Ali A, Ullah S, Kim KH, Kwak KS (2019) Transportation sentiment analysis using word embedding and ontology-based topic modeling. Knowl Based Syst 174:27–42

  3. Ancona M, Ceolini E, Öztireli C, Gross M (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In: Proceedings of the 6th international conference on learning representations (ICLR), Vancouver, BC, Canada, pp 1–16. https://openreview.net/forum?id=Sy21R9JAW

  4. Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd international conference on learning representations (ICLR), San Diego, CA, USA, pp 1–15

  5. Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. In: Proceedings of the 32nd international conference on machine learning, vol 37, JMLR.org, Lille, France, ICML’15, pp 1613–1622. http://dl.acm.org/citation.cfm?id=3045118.3045290

  6. Chegini M, Bernard J, Berger P, Sourin A, Andrews K, Schreck T (2019) Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning. Vis Inform 3(1):9–17

  7. Chen F, Huang Y (2019) Knowledge-enhanced neural networks for sentiment analysis of Chinese reviews. Neurocomputing 368:51–58

  8. Cherman EA, Papanikolaou Y, Tsoumakas G, Monard MC (2019) Multi-label active learning: key issues and a novel query strategy. Evol Syst 10(1):63–78

  9. Cormack GV, Grossman MR (2014) Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In: Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, ACM, Gold Coast, Queensland, Australia, SIGIR’14, pp 153–162. https://doi.org/10.1145/2600428.2609601

  10. Cour T, Sapp B, Taskar B (2011) Learning from partial labels. J Mach Learn Res 12:1501–1536

  11. Dong X, de Melo G (2018) A helping hand: transfer learning for deep sentiment analysis. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Vol 1: long papers), Association for Computational Linguistics, Melbourne, Australia, pp 2524–2534. https://www.aclweb.org/anthology/P18-1235

  12. Druck G, Mann G, McCallum A (2008) Learning from labeled features using generalized expectation criteria. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, ACM, Singapore, Singapore, SIGIR’08, pp 595–602. https://doi.org/10.1145/1390334.1390436

  13. Feng Y, Fan L (2019) Ontology semantic integration based on convolutional neural network. Neural Comput Appl 31:8253–8266. https://doi.org/10.1007/s00521-019-04043-w

  14. Fung G, Mangasarian OL, Shavlik JW (2002) Knowledge-based support vector machine classifiers. In: Advances in neural information processing systems 15, Vancouver, British Columbia, Canada, pp 521–528. http://papers.nips.cc/paper/2222-knowledge-based-support-vector-machine-classifiers

  15. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Proceedings of the 30th international conference on neural information processing systems, Curran Associates Inc., Barcelona, Spain, NIPS’16, pp 1027–1035. http://dl.acm.org/citation.cfm?id=3157096.3157211

  16. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of the 33rd international conference on machine learning, vol 48, JMLR.org, New York, NY, USA, ICML’16, pp 1050–1059. http://dl.acm.org/citation.cfm?id=3045390.3045502

  17. Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. In: Proceedings of the 34th international conference on machine learning, Vol 70, JMLR.org, Sydney, NSW, Australia, ICML’17, pp 1183–1192. http://dl.acm.org/citation.cfm?id=3305381.3305504

  18. Guyon I, Cawley GC, Dror G, Lemaire V (2011) Results of the active learning challenge. In: Active learning and experimental design workshop, in conjunction with the international conference on artificial intelligence and statistics (AISTATS), Sardinia, Italy, pp 19–45. http://jmlr.org/proceedings/papers/v16/guyon11a/guyon11a.pdf

  19. Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304. https://doi.org/10.1109/TSE.2011.103

  20. Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics, ACL vol 1: long papers, Melbourne, Australia, pp 328–339. https://aclanthology.info/papers/P18-1031/p18-1031

  21. Hu P, Lipton Z, Anandkumar A, Ramanan D (2019) Active learning with partial feedback. In: Proceedings of the 7th international conference on learning representations, New Orleans, USA, pp 1–15. https://openreview.net/forum?id=HJfSEnRqKQ

  22. Jain S, Wallace BC (2019) Attention is not explanation. In: Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  23. Kampffmeyer M, Salberg AB, Jenssen R (2016) Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, Las Vegas, Nevada, USA, pp 1–9

  24. Kendall A, Gal Y (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st international conference on neural information processing systems, Curran Associates Inc., Long Beach, California, USA, NIPS’17, pp 5580–5590. http://dl.acm.org/citation.cfm?id=3295222.3295309

  25. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1746–1751. https://doi.org/10.3115/v1/D14-1181

  26. Kitchenham B, Brereton P (2013) A systematic review of systematic review process research in software engineering. Inf Softw Technol 55(12):2049–2075. https://doi.org/10.1016/j.infsof.2013.07.010

  27. Konyushkova K, Sznitman R, Fua P (2019) Geometry in active learning for binary and multi-class image segmentation. Comput Vis Image Underst 182:1–16

  28. Kumar R, Pannu HS, Malhi AK (2019) Aspect-based sentiment analysis using deep networks and stochastic optimization. Neural Comput Appl. https://doi.org/10.1007/s00521-019-04105-z

  29. Li J, Hu R, Liu X et al (2019) A distant supervision method based on paradigmatic relations for learning word embeddings. Neural Comput Appl. https://doi.org/10.1007/s00521-019-04071-6

  30. Liu J, Wu F, Wu C, Huang Y, Xie X (2019) Neural Chinese word segmentation with dictionary. Neurocomputing 338:46–54

  31. Liu JN, He YL, Lim EH, Wang XZ (2014) Domain ontology graph model and its application in Chinese text classification. Neural Comput Appl 24(3–4):779–798

  32. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: The 49th annual meeting of the association for computational linguistics: human language technologies, proceedings of the conference, Portland, Oregon, USA, pp 142–150. http://www.aclweb.org/anthology/P11-1015

  33. Melville P, Sindhwani V (2009) Active dual supervision: reducing the cost of annotating examples and features. In: Proceedings of the NAACL HLT 2009 workshop on active learning for natural language processing, Association for Computational Linguistics, Boulder, Colorado, pp 49–57. https://www.aclweb.org/anthology/W09-1907

  34. Melville P, Gryc W, Lawrence RD (2009) Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, Paris, France, pp 1275–1284. https://doi.org/10.1145/1557019.1557156

  35. Merity S, McCann B, Socher R (2017) Revisiting activation regularization for language RNNs. In: Proceedings of the 1st workshop on learning to generate natural language at the 34th international conference on machine learning, pp 1–6

  36. Min F, Liu FL, Wen LY, Zhang ZH (2019) Tri-partition cost-sensitive active learning through knn. Soft Comput 23(5):1557–1572

  37. Nguyen N, Caruana R (2008) Classification with partial labels. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Las Vegas, Nevada, USA, KDD’08, pp 551–559. https://doi.org/10.1145/1401890.1401958

  38. Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd annual meeting of the association for computational linguistics, Barcelona, Spain, pp 271–278. http://aclweb.org/anthology/P/P04/P04-1035.pdf

  39. Plaza-del Arco FM, Martín-Valdivia MT, Ureña-López LA, Mitkov R (2019) Improved emotion recognition in Spanish social media through incorporation of lexical knowledge. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2019.09.034

  40. Radjenović D, Heričko M, Torkar R, Živkovič A (2013) Software fault prediction metrics. Inf Softw Technol 55(8):1397–1418. https://doi.org/10.1016/j.infsof.2013.02.009

  41. Ross AS, Hughes MC, Doshi-Velez F (2017) Right for the right reasons: training differentiable models by constraining their explanations. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17, pp 2662–2670. https://doi.org/10.24963/ijcai.2017/371

  42. Segarra J, Sumba X, Ortiz J, Gualán R, Espinoza-Mejia M, Saquicela V (2019) Author-topic classification based on semantic knowledge. In: Iberoamerican knowledge graphs and semantic web conference. Springer, pp 56–71

  43. Sener O, Savarese S (2018) Active learning for convolutional neural networks: a core-set approach. In: Proceedings of the 6th international conference on learning representations, Vancouver, BC, Canada, pp 1–13. https://openreview.net/forum?id=H1aIuk-RW

  44. Sharma M, Zhuang D, Bilgic M (2015) Active learning with rationales for text classification. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Denver, Colorado, pp 441–451, https://doi.org/10.3115/v1/N15-1047

  45. Shen Y, Yun H, Lipton ZC, Kronrod Y, Anandkumar A (2018) Deep active learning for named entity recognition. In: Proceedings of the 6th international conference on learning representations, Vancouver, BC, Canada, https://openreview.net/forum?id=ry018WZAZ

  46. Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: Proceedings of the 34th international conference on machine learning, vol 70, JMLR.org, Sydney, NSW, Australia, ICML’17, pp 3145–3153

  47. Siddhant A, Lipton ZC (2018) Deep Bayesian active learning for natural language processing: results of a large-scale empirical study. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Association for Computational Linguistics, Brussels, Belgium, pp 2904–2909. https://www.aclweb.org/anthology/D18-1318

  48. Sinoara RA, Camacho-Collados J, Rossi RG, Navigli R, Rezende SO (2019) Knowledge-enhanced document embeddings for text classification. Knowl Based Syst 163:955–971

  49. Small K, Wallace BC, Brodley CE, Trikalinos TA (2011) The constrained weight space SVM: learning with ranked features. In: Proceedings of the 28th international conference on machine learning ICML, Bellevue, Washington, USA, pp 865–872. https://icml.cc/2011/papers/465_icmlpaper.pdf

  50. Song M, Park H, Shin K (2019) Attention-based long short-term memory network using sentiment lexicon embedding for aspect-level sentiment analysis in Korean. Inf Process Manag 56(3):637–653. https://doi.org/10.1016/j.ipm.2018.12.005

  51. Sun Q, De Jong G (2005) Explanation-augmented SVM: an approach to incorporating domain knowledge into SVM learning. In: Proceedings of the 22nd international conference on machine learning, ACM, Bonn, Germany, ICML’05, pp 864–871. https://doi.org/10.1145/1102351.1102460

  52. Tsou YL, Lin HT (2019) Annotation cost-sensitive active learning by tree sampling. Mach Learn 108(5):785–807

  53. Wahono RS (2015) A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks. J Softw Eng 1(1):1–16

  54. Wang K, Zhang D, Li Y, Zhang R, Lin L (2017) Cost-effective active learning for deep image classification. IEEE Trans Circuit Syst Video Technol 27(12):2591–2600. https://doi.org/10.1109/TCSVT.2016.2589879

  55. Wang M, Fu K, Min F, Jia X (2019) Active learning through label error statistical methods. Knowl Based Syst. https://doi.org/10.1016/j.knosys.2019.105140

  56. Wang M, Lin Y, Min F, Liu D (2019) Cost-sensitive active learning through statistical methods. Inf Sci 501:460–482

  57. Wu D, Lin CT, Huang J (2019) Active learning for regression using greedy sampling. Inf Sci 474:90–105

  58. Wu YX, Min XY, Min F, Wang M (2019) Cost-sensitive active learning with a label uniform distribution model. Int J Approx Reason 105:49–65

  59. Xing FZ, Pallucchini F, Cambria E (2019) Cognitive-inspired domain adaptation of sentiment lexicons. Inf Process Manag 56(3):554–564

  60. Xiong L, Jiao L, Mao S, Zhang L (2012) Active learning based on coupled knn pseudo pruning. Neural Comput Appl 21(7):1669–1686

  61. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, pp 1480–1489. https://doi.org/10.18653/v1/N16-1174

  62. Yu M, Guo X, Yi J, Chang S, Potdar S, Cheng Y, Tesauro G, Wang H, Zhou B (2018) Diverse few-shot text classification with multiple metrics. In: Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, vol 1 (long papers), New Orleans, Louisiana, USA, pp 1206–1215. https://aclanthology.info/papers/N18-1109/n18-1109

  63. Yu Z, Menzies T (2019) FAST2: an intelligent assistant for finding relevant papers. Expert Syst Appl 120:57–71. https://doi.org/10.1016/j.eswa.2018.11.021

  64. Yu Z, Kraft NA, Menzies T (2018) Finding better active learners for faster literature reviews. Empir Softw Eng 23(6):3161–3186. https://doi.org/10.1007/s10664-017-9587-0

  65. Zaidan O, Eisner J (2008) Modeling annotators: a generative approach to learning from annotator rationales. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Association for Computational Linguistics, Honolulu, Hawaii, pp 31–40. https://www.aclweb.org/anthology/D08-1004

  66. Zaidan O, Eisner J, Piatko CD (2007) Using “annotator rationales” to improve machine learning for text categorization. In: Human language technology conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, Rochester, New York, USA, pp 260–267. http://www.aclweb.org/anthology/N07-1033

  67. Zhang J, Liu Y, Luan H, Xu J, Sun M (2017) Prior knowledge integration for neural machine translation using posterior regularization. In: Proceedings of the 55th annual meeting of the association for computational linguistics, ACL vol 1: long papers, Vancouver, Canada, pp 1514–1523. https://doi.org/10.18653/v1/P17-1139

  68. Zhang Y, Wallace B (2017) A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In: Proceedings of the eighth international joint conference on natural language processing, vol 1: long papers, Asian Federation of Natural Language Processing, Taipei, Taiwan, pp 253–263. https://www.aclweb.org/anthology/I17-1026

  69. Zhang Y, Marshall IJ, Wallace BC (2016) Rationale-augmented convolutional neural networks for text classification. In: Proceedings of the 2016 conference on empirical methods in natural language processing, EMNLP, Austin, Texas, USA, pp 795–804. http://aclweb.org/anthology/D/D16/D16-1076.pdf

  70. Zhang Y, Lease M, Wallace BC (2017) Active discriminative text representation learning. In: Proceedings of the thirty-first AAAI conference on artificial intelligence. AAAI Press, San Francisco, California, USA, AAAI’17, pp 3386–3392. http://dl.acm.org/citation.cfm?id=3298023.3298060

Acknowledgements

The research was funded by the Natural Sciences and Engineering Research Council of Canada, the Boeing Company and Compute Canada.

Author information

Corresponding author

Correspondence to Chandramouli Shama Sastry.

Ethics declarations

Conflicts of interest

We declare that this manuscript is our original work and that it has not been published anywhere, nor is it under review anywhere else at this point. We further declare that none of the authors have any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: CNN and RNN architectures

1.1 CNN for text classification

We use the CNN architecture proposed in Kim [25] and investigated in detail by Zhang and Wallace [68]. A simplified view of the architecture is shown in Fig. 8 (adapted from Zhang and Wallace [68] and Kim [25]).

Fig. 8 CNN architecture: An input sentence is converted into a sentence matrix M of dimension \(N \times d\), where N is the number of words and d is the dimension of the word embeddings. The activated feature maps are computed by applying 1-D convolution over M using k filters of sizes \(r_i \times d\), where the region size \(r_i\) indicates the number of adjacent words jointly considered. The feature vector \(v_s\) is constructed by applying 1-max pooling to each of the activated feature maps; the dimension of \(v_s\) is therefore k. \(v_s\) is used as the feature set for the softmax classifier

We used 50 filters for each of the region sizes 3, 4 and 5, as suggested by Zhang et al. [70] and Zhang and Wallace [68]; the filter weights and biases, together with the softmax weights and biases, are the trainable parameters. However, we trained the machine with static embeddings, as suggested by Zhang and Wallace [68] (unlike Zhang et al. [70]). We applied dropout regularization to the penultimate layer (\(v_s\)) with a dropout probability of 0.3.
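
For concreteness, here is a minimal PyTorch sketch of this architecture, assuming the hyperparameters above (static pretrained embeddings, 50 filters per region size 3, 4 and 5, 1-max pooling, dropout of 0.3 on \(v_s\)); the class and variable names are ours, not the authors' code, and `embeddings` is a float tensor of pretrained word vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim-style CNN for text classification (a sketch, not the authors' code)."""

    def __init__(self, embeddings, num_classes=2, num_filters=50,
                 region_sizes=(3, 4, 5), dropout=0.3):
        super().__init__()
        # Static (frozen) pretrained word embeddings, as suggested by [68].
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=True)
        d = embeddings.size(1)
        # One 1-D convolution per region size r_i; each filter spans r_i words.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, num_filters, r) for r in region_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(region_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, N)
        M = self.embedding(token_ids).transpose(1, 2)  # sentence matrix, (batch, d, N)
        # Activated feature maps followed by 1-max pooling over positions.
        pooled = [F.relu(conv(M)).max(dim=2).values for conv in self.convs]
        v_s = torch.cat(pooled, dim=1)                 # k = num_filters * len(region_sizes)
        return self.fc(self.dropout(v_s))              # class logits
```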

1.2 RNN for text classification

We use an adaptation of the hierarchical attention network of Yang et al. [61] for text classification; specifically, we use a single unidirectional GRU (gated recurrent unit) instead of a bidirectional GRU, and we ignore the hierarchy, treating the document as one long sentence. The architecture is shown in Fig. 9.

Fig. 9 RNN architecture: The words of the input sequence \(w_i\) are represented by their embeddings, and the hidden states \(h_i\) are computed according to the standard GRU formulation. The attentions \(\alpha _i\) are derived using the context vector u (Eq. 11), and the feature vector \(v_s\) is computed as the weighted mean of the hidden activations (Eq. 12). \(v_s\) is used as the feature set for the softmax classifier

The hidden states \(h_i\) are derived using the standard GRU formulation. The attentions \(\alpha _i\) are derived using the context vector u (Bahdanau et al. [4]) as:

$$\begin{aligned} \alpha _i = \frac{\exp (u_i^{T}u)}{\sum _j \exp (u_j^{T}u)} \end{aligned}$$
(11)

where \(u_i = \tanh (W_A h_i + b_A)\). The feature vector \(v_s\) is then obtained as:

$$\begin{aligned} v_s = \sum _i \alpha _i h_i \end{aligned}$$
(12)

The trainable parameters include the weights and biases used for computing \(h_i\), the attention parameters (\(W_A\) and \(b_A\)), and the context vector u. We used hidden units of size 100 and a context vector of size 50. We found that the RNNs needed more regularization than the CNNs and used all of the following: variational dropout between recurrent states (Gal and Ghahramani [15]), input embedding dropout, activity regularization of recurrent states (Merity et al. [35]), and L2 regularization.
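
A minimal PyTorch sketch of this model under our reading of Eqs. 11 and 12, with the stated hidden size of 100 and context vector size of 50; the regularizers discussed above are omitted for brevity, and the names are ours.

```python
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    """Single-GRU attention classifier (a sketch, not the authors' code)."""

    def __init__(self, embeddings, num_classes=2, hidden_size=100, ctx_size=50):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.gru = nn.GRU(embeddings.size(1), hidden_size, batch_first=True)
        self.W_A = nn.Linear(hidden_size, ctx_size)   # u_i = tanh(W_A h_i + b_A)
        self.u = nn.Parameter(torch.randn(ctx_size))  # trainable context vector
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):                     # (batch, N)
        h, _ = self.gru(self.embedding(token_ids))    # hidden states h_i, (batch, N, hidden)
        u_i = torch.tanh(self.W_A(h))                 # (batch, N, ctx)
        alpha = torch.softmax(u_i @ self.u, dim=1)    # attentions alpha_i (Eq. 11)
        v_s = (alpha.unsqueeze(-1) * h).sum(dim=1)    # weighted mean of h_i (Eq. 12)
        return self.fc(v_s)                           # class logits
```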

Appendix 2: Architecture-specific adaptations of misattribution error

2.1 CNNs

In the CNN architecture considered, the misattribution error is hard to optimize because the gradients flowing through the 1-max pooling contain zeros and are not informative enough: the machine must first suppress the activations of the nonuseful features and then wait until the right values are selected by the pooling, which increases training time. A further challenge is that even if the machine does end up selecting most of the user-suggested features, the zero gradients to the nonselected features do not guarantee that the activations of the nonuser-suggested features are sufficiently suppressed.

Let N denote the CNN defined in Fig. 8. We propose computing the attributions for the misattribution error through a proxy network P instantiated with the exact same weights as N but with 1-max pooling replaced by sum pooling; i.e., the sentence features are computed by summing across each feature map instead of taking the max. This enables the machine to adjust the attributions of all words in parallel. Let \(\tilde{A}^{P}\) be the attributions computed using the network P.

$$\begin{aligned} Misattribution_{Error} = \mu _{c}(A^{U},\tilde{A}^{P}) \end{aligned}$$
(13)
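
A hedged sketch of computing \(\tilde{A}^{P}\) with the TextCNN sketch above, assuming a gradient-times-input attribution; the normalization and comparison performed by \(\mu _c\) are left out, and dropout is skipped as at evaluation time.

```python
import torch
import torch.nn.functional as F

def proxy_attributions(model, token_ids, target_class):
    """Word attributions under the sum-pooled proxy P of a TextCNN `model`."""
    # Leaf copy of the embedded input so gradients can be read off it.
    emb = model.embedding(token_ids).detach().requires_grad_(True)
    M = emb.transpose(1, 2)                            # (batch, d, N)
    # Sum pooling in place of 1-max pooling: every position now receives a
    # nonzero gradient, so all word attributions can be adjusted in parallel.
    feats = [F.relu(conv(M)).sum(dim=2) for conv in model.convs]
    logit = model.fc(torch.cat(feats, dim=1))[:, target_class].sum()
    logit.backward()
    # Gradient-times-input, reduced over the embedding dimension.
    return (emb.grad * emb).sum(dim=-1)                # (batch, N)
```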

However, there is one problem with this approach: the machine receives the same reward whether it minimizes the objective function using several filters or just a few. To illustrate, consider the hypothetical feature maps shown in Tables 4 and 5; the shaded columns correspond to the activation values from the user-suggested features. Observe that both satisfy the objective function in Eq. 13, although the misattribution error for the true network N would be truly minimized only if the feature maps looked like Table 2. A further side effect of behaving as in Table 1 is that the machine underutilizes the filters and reduces its capacity. Note that this is just one of many ways the optimization can fail to produce the desired attribution distribution.

Table 4 Hypothetical feature maps after training with Eq. 1 when three filters A, B and C of the same region size are used
Table 5 Hypothetical feature maps after training with Eq. 1 when three filters A, B and C are used

The fix is to introduce an additional term in the objective function that encourages the machine to optimize Eq. 13 using features across different filters rather than features within the same filter. Note that we are primarily concerned with maximizing the number of filters that select user-suggested features when 1-max pooling is applied. Let us first consider the nonzero attributions to the user-suggested features from each of the filters, such that each filter is allowed to transfer information from only one set of input words; these can be extracted greedily by applying 1-max pooling only to the activations corresponding to user-suggested features (e.g., along A3, B2 and C2 for the feature maps in Table 1). It is easy to see that pushing these attribution scores higher encourages the proxy network P to use features across many filters. Let \(A_i^{B,c}\) be the attribution assigned to the \(i^{th}\) user-suggested feature along the best-possible path leading to the logit of class c. We choose to maximize the sum of these attributions, \(\sigma _c = \sum _{i}A^{B,c}_i\), as we do not know the exact distribution of attribution among the user-defined features; however, since we also do not know the required magnitude of these attributions, we must maximize the sum relative to a reference. For simplicity, we maximize the attributions leading to the expected class E relative to those of the other classes. We quantify the total attribution from user-specified features leading up to the expected class E, relative to the other classes, using a softmax as \(\gamma _E\):

$$\begin{aligned} \gamma _E = \frac{\exp (\sigma _E)}{\sum _{c}\exp (\sigma _c)} \end{aligned}$$
(14)

Finally, we can define the misattribution error using cross-entropy loss as:

$$\begin{aligned} Misattribution_{Error} = \mu _{c}(A^{U},\tilde{A}^{P}) - \log (\gamma _E) \end{aligned}$$
(15)

Interestingly, if the activation used is ReLU or Leaky-ReLU, the second term is equivalent to adding a new example containing only the user-labeled features. Thus, masking features to factor in user suggestions, as in traditional methods, is applicable in certain cases such as this one and is a special case of attribution; however, unlike in traditional models, it amounts to adding a new example rather than modifying an existing data point.

In the text classification task, if \(\mu _{c}(A^{U},\tilde{A}^{P})\) is used alone, the accuracies with training pools smaller than 50 examples remain unchanged but saturate much earlier, as the machine starts using fewer filters and has reduced capacity, as illustrated above. Conversely, if the machine is trained with just \(-\log (\gamma _E)\), the accuracies with larger numbers (\(\ge 100\)) of training examples remain relatively unchanged, but those with fewer examples are lower; this is because the machine is not automatically tuned to ignore nonannotated words when there are few examples. In the information retrieval task, however, we did not notice any advantage of the second term (even for higher numbers of training examples), as the number of user-suggested features was very small.
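
A rough sketch of the second term with the TextCNN sketch above: 1-max pooling is restricted to annotated positions, each filter's linear contribution to the class logits stands in for the best-path attribution \(\sigma _c\), and a cross-entropy toward the expected class E yields \(-\log (\gamma _E)\). The window masking and the linear attribution are simplifications we assume, not the authors' exact bookkeeping.

```python
import torch
import torch.nn.functional as F

def filter_diversity_loss(model, token_ids, ann_mask, expected_class):
    """-log(gamma_E) of Eqs. 14-15.

    ann_mask: (batch, N) float, 1.0 at annotated word positions.
    expected_class: (batch,) long tensor of expected class indices E.
    """
    M = model.embedding(token_ids).transpose(1, 2)      # (batch, d, N)
    best = []
    for conv in model.convs:
        fmap = F.relu(conv(M))                          # (batch, k, N - r + 1)
        # Greedy 1-max pooling over windows starting at an annotated word
        # (a crude stand-in for "activations from user-suggested features").
        m = ann_mask[:, : fmap.size(2)].unsqueeze(1)
        best.append((fmap * m).max(dim=2).values)       # (batch, k)
    v_best = torch.cat(best, dim=1)                     # one value per filter
    # sigma_c: each filter's (linear) attribution to the logit of class c.
    sigma = v_best @ model.fc.weight.t()                # (batch, num_classes)
    # gamma_E = softmax(sigma)[E]; cross-entropy gives -log(gamma_E).
    return F.cross_entropy(sigma, expected_class)
```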

2.2 RNNs

For the same reasons as with max pooling, optimizing through the attention is hard because the gradients for the nonattended features are too small. Just as we chose to compute attributions by setting F to the class logits in CNNs, we choose to optimize using the feature vector \(v_s\) (Eq. 12) in RNNs. Following the same idea as in CNNs, we propose computing attributions using the hidden states without attention (i.e., \(h_i\) instead of \(\alpha _ih_i\)) and independently optimizing the \(\alpha _i\) to select the user-suggested features. To optimize the attentions, we can write the misattribution error directly by equating the normalized attributions \(\tilde{A}_i\) with the attentions \(\alpha _i\):

$$\begin{aligned} Misattribution_{Error} = \mu _f\left( A^{U},\alpha _i \right) \end{aligned}$$
(16)
Fig. 10 RNN (without attentions and softmax) with an example sentence: The annotated words (shaded) can be interpreted as the words worth accumulating in the hidden state. The thickened arrows indicate the terms the machine might remember

We now need to factor the user suggestions into the computation of the hidden features. To do so, we can compute the attributions by setting F to the sequence of hidden states; in other words, the gradients from all hidden states at time steps \(\ge t\) are considered when computing the attribution of word \(w_t\), as shown in Eq. 17 (the derivatives are evaluated at \(\mathbf {V_0=w_0, V_1=w_1, \ldots , V_n=w_n}\)).

$$\begin{aligned} A_i = \mathbf {w_i \cdot } \left( \frac{\partial \mathbf {h_i}}{\partial \mathbf {V_i}} + \frac{\partial \mathbf {h_{i+1}}}{\partial \mathbf {V_{i}}} + \dots + \frac{\partial \mathbf {h_{n}}}{\partial \mathbf {V_{i}}} \right) \end{aligned}$$
(17)

Using this, we can write out the complete misattribution error as:

$$\begin{aligned} Misattribution_{Error} = \mu _f\left( A^{U},\alpha \right) + \mu _f\left( A^{U},\tilde{A}\right) \end{aligned}$$
(18)

However, this is computationally expensive, and we propose approximating it by considering only the gradients from the hidden state at time \(t=i\). This approximation works quite well not only when all the hidden states are used but also when only the last hidden state is used. Mathematically, if \(A_i\) is large, \(\mathbf {w_i \cdot }\frac{\partial \mathbf {h_i}}{\partial \mathbf {V_i}}\) should be large as well: if the gradient of the hidden state at the \(i^{th}\) time step with respect to \(w_i\) is small, the gradients of the hidden states at later time steps are smaller still, as they are functions of \(h_i\).
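
One way to realize this approximation, assuming a gradient-times-input attribution over a single nn.GRUCell: detaching the carried hidden state at each step removes all cross-step gradient paths, so a single backward pass leaves exactly the \(\partial \mathbf {h_i}/\partial \mathbf {V_i}\) term for each word. A sketch under our assumptions:

```python
import torch
import torch.nn as nn

def stepwise_attributions(gru_cell: nn.GRUCell, emb: torch.Tensor):
    """emb: (batch, N, d) embedded words; returns (batch, N) attributions."""
    emb = emb.detach().requires_grad_(True)
    batch, N, _ = emb.shape
    h = emb.new_zeros(batch, gru_cell.hidden_size)
    total = 0.0
    for i in range(N):
        # Detaching the previous state stops gradients to earlier words,
        # keeping only the d h_i / d V_i term of Eq. 17 for word i.
        h = gru_cell(emb[:, i, :], h.detach())
        total = total + h.sum()
    total.backward()
    return (emb.grad * emb).sum(dim=-1)    # gradient-times-input per word

# Example usage: gru = nn.GRUCell(300, 100); A = stepwise_attributions(gru, emb)
```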

Gated recurrent neural networks can be interpreted as machines that read through a sequence of inputs and accumulate useful information in the hidden states from the inputs they find important (Fig. 10). From the figure, it is easy to see how the approximation satisfies attribution with respect to both choices of feature vector: the (weighted) mean of all hidden states and the last hidden state. Note that this is just an approximation to speed up training for a single recurrent layer.

Appendix 3: Additional text classification results

In this section, we illustrate the advantages of combining annotations with unsupervised word embeddings by benchmarking against Sharma et al. [44]. We simulated the human labeler by extracting features as described in that paper: we choose only the positive (negative) features that have the highest \(\chi ^2\) (chi-squared) statistic and occur in at least 5% of the positive (negative) documents. Further, the artificial labeler returns any one such word, rather than the top word or all the words, as the rationale; a sketch of this simulation follows the dataset descriptions below. The datasets are:

  • WvsH This task consists of distinguishing comp.os.ms-windows.misc from comp.sys.ibm.pc.hardware in the 20-newsgroups corpus.

  • Nova This is a binary classification task derived from the 20-newsgroups corpus. More details can be found in Guyon et al. [18].

  • SRAA This is a binary classification task consisting of 48K documents that discuss either auto or aviation.
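
A sketch of the simulated labeler's feature extraction using scikit-learn's chi2; the exact ranking and thresholds in Sharma et al. [44] may differ, and top_k is our assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def rationale_vocabulary(docs, labels, top_k=100, min_doc_frac=0.05):
    """Class-indicative words the simulated labeler may return as rationales."""
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(docs)                    # (n_docs, n_words), 0/1
    y = np.asarray(labels)
    scores, _ = chi2(X, y)
    words = vec.get_feature_names_out()
    keep = []
    for j in np.argsort(scores)[::-1][:top_k]:     # highest chi^2 first
        present = X[:, j].toarray().ravel().astype(bool)
        # Assign the word to the class in which it occurs more often.
        cls = int(present[y == 1].mean() >= present[y == 0].mean())
        # Require occurrence in at least 5% of that class's documents.
        if present[y == cls].mean() >= min_doc_frac:
            keep.append((words[j], cls))
    return keep
```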

The improvement in the AUCs is illustrated in Table 6. We derived the AUCs by taking the mean of the AUCs obtained by RNNs and CNNs trained on ten randomly chosen documents (five from each class). The AUCs for Sharma et al. [44] are taken from that paper.

Table 6 Text classification

Additionally, we empirically evaluated the gains from one-shot text classification with annotations using the Yahoo! Answers classification dataset. In this ten-way classification task, we randomly chose one document from each class and manually annotated it for the purposes of the experiment. The traditional machine learning method obtained an accuracy of about 14% with annotations, while we obtained 28% with RNNs and 31% with CNNs.

3.1 RNN attribution distribution

The attribution distribution obtained for RNNs is shown in Fig. 11.

Fig. 11 Attribution distribution for RNN: The graphs show the distribution of shares allocated to class-indicating words in predicting the class as training progresses. For RNNs trained with as few as 40 labeled and annotated documents, 50% of the unlabeled documents assign a share of at least 0.5 to the class-indicating words

Appendix 4: Results of information retrieval

The details of the experiments are presented in Table 7. The experimental results for the other three datasets are presented in Fig. 12.

Table 7 Although the query is a subset of the keywords in all of the datasets, and the keywords significantly overlap with the search strings used to collect the candidate studies, we observe improvements in recall of up to 5%
Fig. 12 Recall curves for Kitchenham, Wahono and Radjenovic: The graphs show the percentage of relevant documents retrieved over five review rounds. The first ten documents are retrieved using BM25 scoring based on the query terms (Table 3). In subsequent rounds, the ten most relevant documents in the unlabeled pool, as predicted by the machine, are shown to the user. The curves for CNN-A and RNN-A are obtained by training the machine with the corresponding keywords (see Table 3) as corpus-level annotations. Interestingly, the machines with annotations retrieve about 5% more relevant documents even though every document in the corpus contains at least one of these terms
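
The review protocol of Fig. 12, sketched under our assumptions: the rank_bm25 package provides the BM25 seeding, while score_fn, retrain_fn and oracle stand in for the machine's relevance scoring, its (re)training (e.g., CNN-A with corpus-level annotations), and the human reviewer; all names are ours.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def review_rounds(docs_tokens, query_tokens, oracle, retrain_fn, score_fn,
                  rounds=5, k=10):
    """Total-recall loop: a BM25 seed round, then k machine-picked docs per round."""
    bm25 = BM25Okapi(docs_tokens)
    ranked = np.argsort(bm25.get_scores(query_tokens))[::-1]
    batch = [int(i) for i in ranked[:k]]                # round 1: top-k by BM25
    labeled = {}
    for _ in range(rounds):
        labeled.update({i: oracle(i) for i in batch})   # user reviews the batch
        retrain_fn(labeled)                             # e.g., update CNN-A / RNN-A
        pool = [i for i in range(len(docs_tokens)) if i not in labeled]
        scores = score_fn(pool)                         # predicted relevance
        batch = [pool[j] for j in np.argsort(scores)[::-1][:k]]
    return labeled
```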


About this article

Cite this article

Shama Sastry, C., Milios, E.E. Active neural learners for text with dual supervision. Neural Comput & Applic 32, 13343–13362 (2020). https://doi.org/10.1007/s00521-019-04681-0
