Active neural learners for text with dual supervision

  • Original Article

Neural Computing and Applications

Abstract

Dual supervision for text classification and information retrieval, which involves training the machine with class labels augmented with text annotations that are indicative of the class, has been shown to provide significant improvements both in and beyond active learning (AL) settings. Annotations in their simplest form are highlighted portions of the text that are indicative of the class. In this work, we aim to identify and realize the full potential of unsupervised pretrained word embeddings for text-related tasks in AL settings by training neural nets, specifically convolutional and recurrent neural nets, through dual supervision. We propose an architecture-independent algorithm for training neural networks with human rationales for class assignments and show how unsupervised embeddings can be better leveraged in active learning settings using this algorithm. The proposed solution uses gradient-based feature attributions to constrain the machine to follow the user annotations; further, we discuss methods for overcoming the architecture-specific challenges in the optimization. Our results on the sentiment classification task show that one annotated and labeled document can be worth up to seven labeled documents, giving accuracies of up to 70% for as few as ten labeled and annotated documents, and show promise in significantly reducing user effort for the total-recall information retrieval task in systematic literature reviews.


Notes

  1. Code for replicating the experiments: https://github.com/chandramouli-sastry/dual-AL

  2. https://code.google.com/archive/p/word2vec/

  3. Available at https://zenodo.org/record/1162952#.W3aQFNgzbRY

References

  1. Abdi A, Shamsuddin SM, Hasan S, Piran J (2019) Deep learning-based sentiment classification of evaluative text based on multi-feature fusion. Inf Process Manag 56(4):1245–1259. https://doi.org/10.1016/j.ipm.2019.02.018

  2. Ali F, Kwak D, Khan P, El-Sappagh S, Ali A, Ullah S, Kim KH, Kwak KS (2019) Transportation sentiment analysis using word embedding and ontology-based topic modeling. Knowl Based Syst 174:27–42

  3. Ancona M, Ceolini E, Öztireli C, Gross M (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In: Proceedings of the 6th international conference on learning representations (ICLR), Vancouver, BC, Canada, pp 1–16. https://openreview.net/forum?id=Sy21R9JAW

  4. Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd international conference on learning representations (ICLR), San Diego, CA, USA, pp 1–15

  5. Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. In: Proceedings of the 32nd international conference on machine learning, vol 37, JMLR.org, Lille, France, ICML’15, pp 1613–1622. http://dl.acm.org/citation.cfm?id=3045118.3045290

  6. Chegini M, Bernard J, Berger P, Sourin A, Andrews K, Schreck T (2019) Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning. Vis Inform 3(1):9–17

  7. Chen F, Huang Y (2019) Knowledge-enhanced neural networks for sentiment analysis of Chinese reviews. Neurocomputing 368:51–58

  8. Cherman EA, Papanikolaou Y, Tsoumakas G, Monard MC (2019) Multi-label active learning: key issues and a novel query strategy. Evol Syst 10(1):63–78

  9. Cormack GV, Grossman MR (2014) Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In: Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, ACM, Gold Coast, Queensland, Australia, SIGIR’14, pp 153–162. https://doi.org/10.1145/2600428.2609601

  10. Cour T, Sapp B, Taskar B (2011) Learning from partial labels. J Mach Learn Res 12:1501–1536

  11. Dong X, de Melo G (2018) A helping hand: transfer learning for deep sentiment analysis. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Vol 1: long papers), Association for Computational Linguistics, Melbourne, Australia, pp 2524–2534. https://www.aclweb.org/anthology/P18-1235

  12. Druck G, Mann G, McCallum A (2008) Learning from labeled features using generalized expectation criteria. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, ACM, Singapore, Singapore, SIGIR’08, pp 595–602. https://doi.org/10.1145/1390334.1390436

  13. Feng Y, Fan L (2019) Ontology semantic integration based on convolutional neural network. Neural Comput Appl 31:8253–8266. https://doi.org/10.1007/s00521-019-04043-w

  14. Fung G, Mangasarian OL, Shavlik JW (2002) Knowledge-based support vector machine classifiers. In: Advances in neural information processing systems 15, Vancouver, British Columbia, Canada, pp 521–528. http://papers.nips.cc/paper/2222-knowledge-based-support-vector-machine-classifiers

  15. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Proceedings of the 30th international conference on neural information processing systems, Curran Associates Inc., Barcelona, Spain, NIPS’16, pp 1027–1035. http://dl.acm.org/citation.cfm?id=3157096.3157211

  16. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of the 33rd international conference on machine learning, vol 48, JMLR.org, New York, NY, USA, ICML’16, pp 1050–1059. http://dl.acm.org/citation.cfm?id=3045390.3045502

  17. Gal Y, Islam R, Ghahramani Z (2017) Deep Bayesian active learning with image data. In: Proceedings of the 34th international conference on machine learning, Vol 70, JMLR.org, Sydney, NSW, Australia, ICML’17, pp 1183–1192. http://dl.acm.org/citation.cfm?id=3305381.3305504

  18. Guyon I, Cawley GC, Dror G, Lemaire V (2011) Results of the active learning challenge. In: Active learning and experimental design workshop, in conjunction with the international conference on artificial intelligence and statistics (AISTATS), Sardinia, Italy, pp 19–45. http://jmlr.org/proceedings/papers/v16/guyon11a/guyon11a.pdf

  19. Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304. https://doi.org/10.1109/TSE.2011.103

  20. Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics, ACL vol 1: long papers, Melbourne, Australia, pp 328–339. https://aclanthology.info/papers/P18-1031/p18-1031

  21. Hu P, Lipton Z, Anandkumar A, Ramanan D (2019) Active learning with partial feedback. In: Proceedings of the 7th international conference on learning representations, New Orleans, USA, pp 1–15. https://openreview.net/forum?id=HJfSEnRqKQ

  22. Jain S, Wallace BC (2019) Attention is not explanation. In: Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  23. Kampffmeyer M, Salberg AB, Jenssen R (2016) Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, Las Vegas, Nevada, USA, pp 1–9

  24. Kendall A, Gal Y (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st international conference on neural information processing systems, Curran Associates Inc., Long Beach, California, USA, NIPS’17, pp 5580–5590. http://dl.acm.org/citation.cfm?id=3295222.3295309

  25. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1746–1751. https://doi.org/10.3115/v1/D14-1181

  26. Kitchenham B, Brereton P (2013) A systematic review of systematic review process research in software engineering. Inf Softw Technol 55(12):2049–2075. https://doi.org/10.1016/j.infsof.2013.07.010

  27. Konyushkova K, Sznitman R, Fua P (2019) Geometry in active learning for binary and multi-class image segmentation. Comput Vis Image Underst 182:1–16

  28. Kumar R, Pannu HS, Malhi AK (2019) Aspect-based sentiment analysis using deep networks and stochastic optimization. Neural Comput Appl. https://doi.org/10.1007/s00521-019-04105-z

  29. Li J, Hu R, Liu X et al (2019) A distant supervision method based on paradigmatic relations for learning word embeddings. Neural Comput Appl. https://doi.org/10.1007/s00521-019-04071-6

  30. Liu J, Wu F, Wu C, Huang Y, Xie X (2019) Neural Chinese word segmentation with dictionary. Neurocomputing 338:46–54

  31. Liu JN, He YL, Lim EH, Wang XZ (2014) Domain ontology graph model and its application in Chinese text classification. Neural Comput Appl 24(3–4):779–798

  32. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: The 49th annual meeting of the association for computational linguistics: human language technologies, proceedings of the conference, Portland, Oregon, USA, pp 142–150. http://www.aclweb.org/anthology/P11-1015

  33. Melville P, Sindhwani V (2009) Active dual supervision: reducing the cost of annotating examples and features. In: Proceedings of the NAACL HLT 2009 workshop on active learning for natural language processing, Association for Computational Linguistics, Boulder, Colorado, pp 49–57. https://www.aclweb.org/anthology/W09-1907

  34. Melville P, Gryc W, Lawrence RD (2009) Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, Paris, France, pp 1275–1284. https://doi.org/10.1145/1557019.1557156

  35. Merity S, McCann B, Socher R (2017) Revisiting activation regularization for language RNNs. In: Proceedings of the 1st workshop on learning to generate natural language at the 34th international conference on machine learning, pp 1–6

  36. Min F, Liu FL, Wen LY, Zhang ZH (2019) Tri-partition cost-sensitive active learning through knn. Soft Comput 23(5):1557–1572

  37. Nguyen N, Caruana R (2008) Classification with partial labels. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Las Vegas, Nevada, USA, KDD’08, pp 551–559. https://doi.org/10.1145/1401890.1401958

  38. Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd annual meeting of the association for computational linguistics, Barcelona, Spain, pp 271–278. http://aclweb.org/anthology/P/P04/P04-1035.pdf

  39. Plaza-del Arco FM, Martín-Valdivia MT, Ureña-López LA, Mitkov R (2019) Improved emotion recognition in Spanish social media through incorporation of lexical knowledge. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2019.09.034

  40. Radjenović D, Heričko M, Torkar R, Živkovič A (2013) Software fault prediction metrics. Inf Softw Technol 55(8):1397–1418. https://doi.org/10.1016/j.infsof.2013.02.009

  41. Ross AS, Hughes MC, Doshi-Velez F (2017) Right for the right reasons: training differentiable models by constraining their explanations. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17, pp 2662–2670. https://doi.org/10.24963/ijcai.2017/371

  42. Segarra J, Sumba X, Ortiz J, Gualán R, Espinoza-Mejia M, Saquicela V (2019) Author-topic classification based on semantic knowledge. In: Iberoamerican knowledge graphs and semantic web conference. Springer, pp 56–71

  43. Sener O, Savarese S (2018) Active learning for convolutional neural networks: a core-set approach. In: Proceedings of the 6th international conference on learning representations, Vancouver, BC, Canada, pp 1–13. https://openreview.net/forum?id=H1aIuk-RW

  44. Sharma M, Zhuang D, Bilgic M (2015) Active learning with rationales for text classification. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Denver, Colorado, pp 441–451, https://doi.org/10.3115/v1/N15-1047

  45. Shen Y, Yun H, Lipton ZC, Kronrod Y, Anandkumar A (2018) Deep active learning for named entity recognition. In: Proceedings of the 6th international conference on learning representations, Vancouver, BC, Canada, https://openreview.net/forum?id=ry018WZAZ

  46. Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: Proceedings of the 34th international conference on machine learning, vol 70, JMLR.org, Sydney, NSW, Australia, ICML’17, pp 3145–3153

  47. Siddhant A, Lipton ZC (2018) Deep Bayesian active learning for natural language processing: results of a large-scale empirical study. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Association for Computational Linguistics, Brussels, Belgium, pp 2904–2909. https://www.aclweb.org/anthology/D18-1318

  48. Sinoara RA, Camacho-Collados J, Rossi RG, Navigli R, Rezende SO (2019) Knowledge-enhanced document embeddings for text classification. Knowl Based Syst 163:955–971

  49. Small K, Wallace BC, Brodley CE, Trikalinos TA (2011) The constrained weight space SVM: learning with ranked features. In: Proceedings of the 28th international conference on machine learning ICML, Bellevue, Washington, USA, pp 865–872. https://icml.cc/2011/papers/465_icmlpaper.pdf

  50. Song M, Park H, Shin K (2019) Attention-based long short-term memory network using sentiment lexicon embedding for aspect-level sentiment analysis in Korean. Inf Process Manag 56(3):637–653. https://doi.org/10.1016/j.ipm.2018.12.005

  51. Sun Q, De Jong G (2005) Explanation-augmented SVM: an approach to incorporating domain knowledge into SVM learning. In: Proceedings of the 22nd international conference on machine learning, ACM, Bonn, Germany, ICML’05, pp 864–871. https://doi.org/10.1145/1102351.1102460

  52. Tsou YL, Lin HT (2019) Annotation cost-sensitive active learning by tree sampling. Mach Learn 108(5):785–807

  53. Wahono RS (2015) A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks. J Softw Eng 1(1):1–16

  54. Wang K, Zhang D, Li Y, Zhang R, Lin L (2017) Cost-effective active learning for deep image classification. IEEE Trans Circuit Syst Video Technol 27(12):2591–2600. https://doi.org/10.1109/TCSVT.2016.2589879

  55. Wang M, Fu K, Min F, Jia X (2019) Active learning through label error statistical methods. Knowl Based Syst. https://doi.org/10.1016/j.knosys.2019.105140

  56. Wang M, Lin Y, Min F, Liu D (2019) Cost-sensitive active learning through statistical methods. Inf Sci 501:460–482

  57. Wu D, Lin CT, Huang J (2019) Active learning for regression using greedy sampling. Inf Sci 474:90–105

  58. Wu YX, Min XY, Min F, Wang M (2019) Cost-sensitive active learning with a label uniform distribution model. Int J Approx Reason 105:49–65

  59. Xing FZ, Pallucchini F, Cambria E (2019) Cognitive-inspired domain adaptation of sentiment lexicons. Inf Process Manag 56(3):554–564

  60. Xiong L, Jiao L, Mao S, Zhang L (2012) Active learning based on coupled knn pseudo pruning. Neural Comput Appl 21(7):1669–1686

  61. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, pp 1480–1489. https://doi.org/10.18653/v1/N16-1174

  62. Yu M, Guo X, Yi J, Chang S, Potdar S, Cheng Y, Tesauro G, Wang H, Zhou B (2018) Diverse few-shot text classification with multiple metrics. In: Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, vol 1 (long papers), New Orleans, Louisiana, USA, pp 1206–1215. https://aclanthology.info/papers/N18-1109/n18-1109

  63. Yu Z, Menzies T (2019) FAST2: an intelligent assistant for finding relevant papers. Expert Syst Appl 120:57–71. https://doi.org/10.1016/j.eswa.2018.11.021

  64. Yu Z, Kraft NA, Menzies T (2018) Finding better active learners for faster literature reviews. Empir Softw Eng 23(6):3161–3186. https://doi.org/10.1007/s10664-017-9587-0

  65. Zaidan O, Eisner J (2008) Modeling annotators: a generative approach to learning from annotator rationales. In: Proceedings of the 2008 conference on empirical methods in natural language processing, Association for Computational Linguistics, Honolulu, Hawaii, pp 31–40. https://www.aclweb.org/anthology/D08-1004

  66. Zaidan O, Eisner J, Piatko CD (2007) Using “annotator rationales” to improve machine learning for text categorization. In: Human language technology conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, Rochester, New York, USA, pp 260–267. http://www.aclweb.org/anthology/N07-1033

  67. Zhang J, Liu Y, Luan H, Xu J, Sun M (2017) Prior knowledge integration for neural machine translation using posterior regularization. In: Proceedings of the 55th annual meeting of the association for computational linguistics, ACL vol 1: long papers, Vancouver, Canada, pp 1514–1523. https://doi.org/10.18653/v1/P17-1139

  68. Zhang Y, Wallace B (2017) A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In: Proceedings of the eighth international joint conference on natural language processing, vol 1: long papers, Asian Federation of Natural Language Processing, Taipei, Taiwan, pp 253–263. https://www.aclweb.org/anthology/I17-1026

  69. Zhang Y, Marshall IJ, Wallace BC (2016) Rationale-augmented convolutional neural networks for text classification. In: Proceedings of the 2016 conference on empirical methods in natural language processing, EMNLP, Austin, Texas, USA, pp 795–804. http://aclweb.org/anthology/D/D16/D16-1076.pdf

  70. Zhang Y, Lease M, Wallace BC (2017) Active discriminative text representation learning. In: Proceedings of the thirty-first AAAI conference on artificial intelligence. AAAI Press, San Francisco, California, USA, AAAI’17, pp 3386–3392. http://dl.acm.org/citation.cfm?id=3298023.3298060

Acknowledgements

The research was funded by the Natural Sciences and Engineering Research Council of Canada, the Boeing Company and Compute Canada.

Author information

Corresponding author

Correspondence to Chandramouli Shama Sastry.

Ethics declarations

Conflicts of interest

We declare that this manuscript is our original work and that it has not been published anywhere, nor is it under review anywhere else at this point. We further declare that none of the authors have any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: CNN and RNN architectures

1.1 CNN for text classification

We use the CNN architecture proposed in Kim [25] and investigated in detail by Zhang and Wallace [68]. A simplified view of the architecture is shown in Fig. 8 (adapted from Zhang and Wallace [68] and Kim [25]).

Fig. 8 CNN architecture: An input sentence is converted into a sentence matrix M of dimension \(N \times d\), where N is the number of words and d is the dimension of the word embeddings. The activated feature maps are computed by applying 1-D convolution over M using k filters of sizes \(r_i \times d\), where the region size \(r_i\) indicates the number of adjacent words jointly considered. The feature vector \(v_s\) is constructed by applying 1-max pooling to each of the activated feature maps; the dimension of \(v_s\) is therefore k. \(v_s\) is used as the feature set for the softmax classifier

We used 50 filters for each of the region sizes 3, 4 and 5, as suggested by Zhang et al. [70] and Zhang and Wallace [68]; the filter weights and biases, together with the softmax weights and biases, are the trainable parameters. However, we trained the machine with static embeddings, as suggested by Zhang and Wallace [68] (unlike Zhang et al. [70]). We applied dropout regularization to the penultimate layer (\(v_s\)) with a dropout probability of 0.3.
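
For concreteness, here is a minimal PyTorch sketch of this architecture, assuming the hyperparameters above (static pretrained embeddings, 50 filters per region size 3, 4 and 5, 1-max pooling, dropout of 0.3 on \(v_s\)); the class and variable names are ours, not the authors' code, and `embeddings` is a float tensor of pretrained word vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim-style CNN for text classification (a sketch, not the authors' code)."""

    def __init__(self, embeddings, num_classes=2, num_filters=50,
                 region_sizes=(3, 4, 5), dropout=0.3):
        super().__init__()
        # Static (frozen) pretrained word embeddings, as suggested by [68].
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=True)
        d = embeddings.size(1)
        # One 1-D convolution per region size r_i; each filter spans r_i words.
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, num_filters, r) for r in region_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(region_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, N)
        M = self.embedding(token_ids).transpose(1, 2)  # sentence matrix, (batch, d, N)
        # Activated feature maps followed by 1-max pooling over positions.
        pooled = [F.relu(conv(M)).max(dim=2).values for conv in self.convs]
        v_s = torch.cat(pooled, dim=1)                 # k = num_filters * len(region_sizes)
        return self.fc(self.dropout(v_s))              # class logits
```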

1.2 RNN for text classification

We use an adaptation of the hierarchical attention network of Yang et al. [61] for text classification; specifically, we use a single unidirectional GRU (gated recurrent unit) instead of a bidirectional GRU, and we ignore the hierarchy, treating the document as one long sentence. The architecture is shown in Fig. 9.

Fig. 9 RNN architecture: The words of the input sequence \(w_i\) are represented by their embeddings, and the hidden states \(h_i\) are computed according to the standard GRU formulation. The attentions \(\alpha _i\) are derived using the context vector u (Eq. 11), and the feature vector \(v_s\) is computed as the weighted mean of the hidden activations (Eq. 12). \(v_s\) is used as the feature set for the softmax classifier

The hidden states \(h_i\) are derived using the standard GRU formulation. The attentions \(\alpha _i\) are derived using the context vector u (Bahdanau et al. [4]) as:

$$\begin{aligned} \alpha _i = \frac{\exp (u_i^{T}u)}{\sum _j \exp (u_j^{T}u)} \end{aligned}$$
(11)

where \(u_i = \tanh (W_A h_i + b_A)\). The feature vector \(v_s\) is then obtained as:

$$\begin{aligned} v_s = \sum _i \alpha _i h_i \end{aligned}$$
(12)

The trainable parameters include the weights and biases used for computing \(h_i\), the attention parameters (\(W_A\) and \(b_A\)), and the context vector u. We used hidden units of size 100 and a context vector of size 50. We found that the RNNs needed more regularization than the CNNs and used all of the following: variational dropout between recurrent states (Gal and Ghahramani [15]), input embedding dropout, activity regularization of recurrent states (Merity et al. [35]), and L2 regularization.
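
A minimal PyTorch sketch of this model under our reading of Eqs. 11 and 12, with the stated hidden size of 100 and context vector size of 50; the regularizers discussed above are omitted for brevity, and the names are ours.

```python
import torch
import torch.nn as nn

class AttentionGRU(nn.Module):
    """Single-GRU attention classifier (a sketch, not the authors' code)."""

    def __init__(self, embeddings, num_classes=2, hidden_size=100, ctx_size=50):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.gru = nn.GRU(embeddings.size(1), hidden_size, batch_first=True)
        self.W_A = nn.Linear(hidden_size, ctx_size)   # u_i = tanh(W_A h_i + b_A)
        self.u = nn.Parameter(torch.randn(ctx_size))  # trainable context vector
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):                     # (batch, N)
        h, _ = self.gru(self.embedding(token_ids))    # hidden states h_i, (batch, N, hidden)
        u_i = torch.tanh(self.W_A(h))                 # (batch, N, ctx)
        alpha = torch.softmax(u_i @ self.u, dim=1)    # attentions alpha_i (Eq. 11)
        v_s = (alpha.unsqueeze(-1) * h).sum(dim=1)    # weighted mean of h_i (Eq. 12)
        return self.fc(v_s)                           # class logits
```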

Appendix 2: Architecture-specific adaptations of misattribution error

2.1 CNNs

In the CNN architecture considered, the misattribution error is hard to optimize because the gradients flowing through the 1-max pooling contain zeros and are not informative enough: the machine must first suppress the activations of the nonuseful features and then wait until the right values are selected by the pooling, which increases training time. A further challenge is that even if the machine does end up selecting most of the user-suggested features, the zero gradients to the nonselected features do not guarantee that the activations of the nonuser-suggested features are sufficiently suppressed.

Let N denote the CNN defined in Fig. 8. We propose computing the attributions for the misattribution error through a proxy network P instantiated with the exact same weights as N but with 1-max pooling replaced by sum pooling; i.e., the sentence features are computed by summing across each feature map instead of taking the max. This enables the machine to adjust the attributions of all words in parallel. Let \(\tilde{A}^{P}\) be the attributions computed using the network P.

$$\begin{aligned} Misattribution_{Error} = \mu _{c}(A^{U},\tilde{A}^{P}) \end{aligned}$$
(13)
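
A hedged sketch of computing \(\tilde{A}^{P}\) with the TextCNN sketch above, assuming a gradient-times-input attribution; the normalization and comparison performed by \(\mu _c\) are left out, and dropout is skipped as at evaluation time.

```python
import torch
import torch.nn.functional as F

def proxy_attributions(model, token_ids, target_class):
    """Word attributions under the sum-pooled proxy P of a TextCNN `model`."""
    # Leaf copy of the embedded input so gradients can be read off it.
    emb = model.embedding(token_ids).detach().requires_grad_(True)
    M = emb.transpose(1, 2)                            # (batch, d, N)
    # Sum pooling in place of 1-max pooling: every position now receives a
    # nonzero gradient, so all word attributions can be adjusted in parallel.
    feats = [F.relu(conv(M)).sum(dim=2) for conv in model.convs]
    logit = model.fc(torch.cat(feats, dim=1))[:, target_class].sum()
    logit.backward()
    # Gradient-times-input, reduced over the embedding dimension.
    return (emb.grad * emb).sum(dim=-1)                # (batch, N)
```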

However, there is one problem with this approach: the machine receives the same reward whether it minimizes the objective function using several filters or just a few. To illustrate, consider the hypothetical feature maps shown in Tables 4 and 5; the shaded columns correspond to the activation values from the user-suggested features. Observe that both satisfy the objective function in Eq. 13, although the misattribution error for the true network N would be truly minimized only if the feature maps looked like Table 2. A further side effect of behaving as in Table 1 is that the machine underutilizes the filters and reduces its capacity. Note that this is just one of many ways the optimization can fail to produce the desired attribution distribution.

Table 4 Hypothetical feature maps after training with Eq. 1 when three filters A, B and C of the same region size are used
Table 5 Hypothetical feature maps after training with Eq. 1 when three filters A, B and C are used

The fix is to introduce an additional term in the objective function that encourages the machine to optimize Eq. 13 using features across different filters rather than features within the same filter. Note that we are primarily concerned with maximizing the number of filters that select user-suggested features when 1-max pooling is applied. Let us first consider the nonzero attributions to the user-suggested features from each of the filters, such that each filter is allowed to transfer information from only one set of input words; these can be extracted greedily by applying 1-max pooling only to the activations corresponding to user-suggested features (e.g., along A3, B2 and C2 for the feature maps in Table 1). It is easy to see that pushing these attribution scores higher encourages the proxy network P to use features across many filters. Let \(A_i^{B,c}\) be the attribution assigned to the \(i^{th}\) user-suggested feature along the best-possible path leading to the logit of class c. We choose to maximize the sum of these attributions, \(\sigma _c = \sum _{i}A^{B,c}_i\), as we do not know the exact distribution of attribution among the user-defined features; however, since we also do not know the required magnitude of these attributions, we must maximize the sum relative to a reference. For simplicity, we maximize the attributions leading to the expected class E relative to those of the other classes. We quantify the total attribution from user-specified features leading up to the expected class E, relative to the other classes, using a softmax as \(\gamma _E\):

$$\begin{aligned} \gamma _E = \frac{\exp (\sigma _E)}{\sum _{c}\exp (\sigma _c)} \end{aligned}$$
(14)

Finally, we can define the misattribution error using cross-entropy loss as:

$$\begin{aligned} Misattribution_{Error} = \mu _{c}(A^{U},\tilde{A}^{P}) - \log (\gamma _E) \end{aligned}$$
(15)

Interestingly, if the activation used is ReLU or Leaky-ReLU, the second term is equivalent to adding a new example containing only the user-labeled features. Thus, masking features to factor in user suggestions, as in traditional methods, is applicable in certain cases such as this one and is a special case of attribution; however, unlike in traditional models, it amounts to adding a new example rather than modifying an existing data point.

In the text classification task, if \(\mu _{c}(A^{U},\tilde{A}^{P})\) is used alone, the accuracies with training pools smaller than 50 examples remain unchanged but saturate much earlier, as the machine starts using fewer filters and has reduced capacity, as illustrated above. Conversely, if the machine is trained with just \(-\log (\gamma _E)\), the accuracies with larger numbers (\(\ge 100\)) of training examples remain relatively unchanged, but those with fewer examples are lower; this is because the machine is not automatically tuned to ignore nonannotated words when there are few examples. In the information retrieval task, however, we did not notice any advantage of the second term (even for higher numbers of training examples), as the number of user-suggested features was very small.
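
A rough sketch of the second term with the TextCNN sketch above: 1-max pooling is restricted to annotated positions, each filter's linear contribution to the class logits stands in for the best-path attribution \(\sigma _c\), and a cross-entropy toward the expected class E yields \(-\log (\gamma _E)\). The window masking and the linear attribution are simplifications we assume, not the authors' exact bookkeeping.

```python
import torch
import torch.nn.functional as F

def filter_diversity_loss(model, token_ids, ann_mask, expected_class):
    """-log(gamma_E) of Eqs. 14-15.

    ann_mask: (batch, N) float, 1.0 at annotated word positions.
    expected_class: (batch,) long tensor of expected class indices E.
    """
    M = model.embedding(token_ids).transpose(1, 2)      # (batch, d, N)
    best = []
    for conv in model.convs:
        fmap = F.relu(conv(M))                          # (batch, k, N - r + 1)
        # Greedy 1-max pooling over windows starting at an annotated word
        # (a crude stand-in for "activations from user-suggested features").
        m = ann_mask[:, : fmap.size(2)].unsqueeze(1)
        best.append((fmap * m).max(dim=2).values)       # (batch, k)
    v_best = torch.cat(best, dim=1)                     # one value per filter
    # sigma_c: each filter's (linear) attribution to the logit of class c.
    sigma = v_best @ model.fc.weight.t()                # (batch, num_classes)
    # gamma_E = softmax(sigma)[E]; cross-entropy gives -log(gamma_E).
    return F.cross_entropy(sigma, expected_class)
```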

2.2 RNNs

For the same reasons as with max pooling, optimizing through the attention is hard because the gradients for the nonattended features are too small. Just as we chose to compute attributions by setting F to the class logits in CNNs, we choose to optimize using the feature vector \(v_s\) (Eq. 12) in RNNs. Following the same idea as in CNNs, we propose computing attributions using the hidden states without attention (i.e., \(h_i\) instead of \(\alpha _ih_i\)) and independently optimizing the \(\alpha _i\) to select the user-suggested features. To optimize the attentions, we can write the misattribution error directly by equating the normalized attributions \(\tilde{A}_i\) with the attentions \(\alpha _i\):

$$\begin{aligned} Misattribution_{Error} = \mu _f\left( A^{U},\alpha _i \right) \end{aligned}$$
(16)
Fig. 10 RNN (without attentions and softmax) with an example sentence: The annotated words (shaded) can be interpreted as the words worth accumulating in the hidden state. The thickened arrows indicate the terms the machine might remember

We now need to factor the user suggestions into the computation of the hidden features. To do so, we can compute the attributions by setting F to the sequence of hidden states; in other words, the gradients from all hidden states at time steps \(\ge t\) are considered when computing the attribution of word \(w_t\), as shown in Eq. 17 (the derivatives are evaluated at \(\mathbf {V_0=w_0, V_1=w_1, \ldots , V_n=w_n}\)).

$$\begin{aligned} A_i = \mathbf {w_i \cdot } \left( \frac{\partial \mathbf {h_i}}{\partial \mathbf {V_i}} + \frac{\partial \mathbf {h_{i+1}}}{\partial \mathbf {V_{i}}} + \dots + \frac{\partial \mathbf {h_{n}}}{\partial \mathbf {V_{i}}} \right) \end{aligned}$$
(17)

Using this, we can write out the complete misattribution error as:

$$\begin{aligned} Misattribution_{Error} = \mu _f\left( A^{U},\alpha \right) + \mu _f\left( A^{U},\tilde{A}\right) \end{aligned}$$
(18)

However, this is computationally expensive, and we propose approximating it by considering only the gradients from the hidden state at time \(t=i\). This approximation works quite well not only when all the hidden states are used but also when only the last hidden state is used. Mathematically, if \(A_i\) is large, \(\mathbf {w_i \cdot }\frac{\partial \mathbf {h_i}}{\partial \mathbf {V_i}}\) should be large as well: if the gradient of the hidden state at the \(i^{th}\) time step with respect to \(w_i\) is small, the gradients of the hidden states at later time steps are smaller still, as they are functions of \(h_i\).
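
One way to realize this approximation, assuming a gradient-times-input attribution over a single nn.GRUCell: detaching the carried hidden state at each step removes all cross-step gradient paths, so a single backward pass leaves exactly the \(\partial \mathbf {h_i}/\partial \mathbf {V_i}\) term for each word. A sketch under our assumptions:

```python
import torch
import torch.nn as nn

def stepwise_attributions(gru_cell: nn.GRUCell, emb: torch.Tensor):
    """emb: (batch, N, d) embedded words; returns (batch, N) attributions."""
    emb = emb.detach().requires_grad_(True)
    batch, N, _ = emb.shape
    h = emb.new_zeros(batch, gru_cell.hidden_size)
    total = 0.0
    for i in range(N):
        # Detaching the previous state stops gradients to earlier words,
        # keeping only the d h_i / d V_i term of Eq. 17 for word i.
        h = gru_cell(emb[:, i, :], h.detach())
        total = total + h.sum()
    total.backward()
    return (emb.grad * emb).sum(dim=-1)    # gradient-times-input per word

# Example usage: gru = nn.GRUCell(300, 100); A = stepwise_attributions(gru, emb)
```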

Gated recurrent neural networks can be interpreted as machines that read through a sequence of inputs and accumulate useful information in the hidden states from the inputs they find important (Fig. 10). From the figure, it is easy to see how the approximation satisfies attribution with respect to both choices of feature vector: the (weighted) mean of all hidden states and the last hidden state. Note that this is just an approximation to speed up training for a single recurrent layer.

Appendix 3: Additional text classification results

In this section, we illustrate the advantages of combining annotations with unsupervised word embeddings by benchmarking against Sharma et al. [44]. We simulated the human labeler by extracting features as described in that paper: we choose only the positive (negative) features that have the highest \(\chi ^2\) (chi-squared) statistic and occur in at least 5% of the positive (negative) documents. Further, the artificial labeler returns any one such word, rather than the top word or all the words, as the rationale; a sketch of this simulation follows the dataset descriptions below. The datasets are:

  • WvsH This task consists of distinguishing comp.os.ms-windows.misc from comp.sys.ibm.pc.hardware in the 20-newsgroups corpus.

  • Nova This is a binary classification task derived from the 20-newsgroups corpus. More details can be found in Guyon et al. [18].

  • SRAA This is a binary classification task consisting of 48K documents that discuss either auto or aviation.
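
A sketch of the simulated labeler's feature extraction using scikit-learn's chi2; the exact ranking and thresholds in Sharma et al. [44] may differ, and top_k is our assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def rationale_vocabulary(docs, labels, top_k=100, min_doc_frac=0.05):
    """Class-indicative words the simulated labeler may return as rationales."""
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(docs)                    # (n_docs, n_words), 0/1
    y = np.asarray(labels)
    scores, _ = chi2(X, y)
    words = vec.get_feature_names_out()
    keep = []
    for j in np.argsort(scores)[::-1][:top_k]:     # highest chi^2 first
        present = X[:, j].toarray().ravel().astype(bool)
        # Assign the word to the class in which it occurs more often.
        cls = int(present[y == 1].mean() >= present[y == 0].mean())
        # Require occurrence in at least 5% of that class's documents.
        if present[y == cls].mean() >= min_doc_frac:
            keep.append((words[j], cls))
    return keep
```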

The improvement in the AUCs is illustrated in Table 6. We derived the AUCs by taking the mean of the AUCs obtained by RNNs and CNNs trained on ten randomly chosen documents (five from each class). The AUCs for Sharma et al. [44] are taken from that paper.

Table 6 Text classification

Additionally, we empirically evaluated the gains from one-shot text classification with annotations using the Yahoo! Answers classification dataset. In this ten-way classification task, we randomly chose one document from each class and manually annotated it for the purposes of the experiment. The traditional machine learning method obtained an accuracy of about 14% with annotations, while we obtained 28% with RNNs and 31% with CNNs.

3.1 RNN attribution distribution

The attribution distribution obtained for RNNs is shown in Fig. 11.

Fig. 11 Attribution distribution for RNN: The graphs show the distribution of shares allocated to class-indicating words in predicting the class as training progresses. For RNNs trained with as few as 40 labeled and annotated documents, 50% of the unlabeled documents assign a share of at least 0.5 to the class-indicating words

Appendix 4: Results of information retrieval

The details of the experiments are presented in Table 7. The experimental results for the other three datasets are presented in Fig. 12.

Table 7 Although the query is a subset of the keywords in all of the datasets, and the keywords significantly overlap with the search strings used to collect the candidate studies, we observe improvements in recall of up to 5%
Fig. 12 Recall curves for Kitchenham, Wahono and Radjenovic: The graphs show the percentage of relevant documents retrieved over five review rounds. The first ten documents are retrieved using BM25 scoring based on the query terms (Table 3). In subsequent rounds, the ten most relevant documents in the unlabeled pool, as predicted by the machine, are shown to the user. The curves for CNN-A and RNN-A are obtained by training the machine with the corresponding keywords (see Table 3) as corpus-level annotations. Interestingly, the machines with annotations retrieve about 5% more relevant documents even though every document in the corpus contains at least one of these terms
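
The review protocol of Fig. 12, sketched under our assumptions: the rank_bm25 package provides the BM25 seeding, while score_fn, retrain_fn and oracle stand in for the machine's relevance scoring, its (re)training (e.g., CNN-A with corpus-level annotations), and the human reviewer; all names are ours.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def review_rounds(docs_tokens, query_tokens, oracle, retrain_fn, score_fn,
                  rounds=5, k=10):
    """Total-recall loop: a BM25 seed round, then k machine-picked docs per round."""
    bm25 = BM25Okapi(docs_tokens)
    ranked = np.argsort(bm25.get_scores(query_tokens))[::-1]
    batch = [int(i) for i in ranked[:k]]                # round 1: top-k by BM25
    labeled = {}
    for _ in range(rounds):
        labeled.update({i: oracle(i) for i in batch})   # user reviews the batch
        retrain_fn(labeled)                             # e.g., update CNN-A / RNN-A
        pool = [i for i in range(len(docs_tokens)) if i not in labeled]
        scores = score_fn(pool)                         # predicted relevance
        batch = [pool[j] for j in np.argsort(scores)[::-1][:k]]
    return labeled
```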


About this article

Cite this article

Shama Sastry, C., Milios, E.E. Active neural learners for text with dual supervision. Neural Comput & Applic 32, 13343–13362 (2020). https://doi.org/10.1007/s00521-019-04681-0
