Abstract
Supervised machine learning methods depend heavily on the quality of the training dataset and the underlying model. In particular, neural network models, which have shown great success on natural language problems, require large datasets to learn their vast number of parameters. However, it is not always easy to build a large labelled dataset. For example, due to the complex nature of tweets and the manual labour involved, it is hard to create a large Twitter dataset labelled for misogyny. In this paper, we propose to regularise a long short-term memory (LSTM) classifier with a pretrained LSTM-based language model (LM) to build an accurate classification model from a small training set. We give transfer learning (TL) a Bayesian interpretation and show that TL can be viewed as an uncertainty regularisation technique in Bayesian inference. We show that an LM pretrained on a sequence of datasets, from a general domain to the task-specific domain, can effectively regularise an LSTM classifier when only a small training dataset is available. Empirical analysis with two small Twitter datasets reveals that an LSTM model trained in this way can outperform state-of-the-art classification models.
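To make the idea concrete, the sketch below fine-tunes an LSTM classifier initialised from a pretrained LM and penalises deviation from the LM weights, which is one way to realise the Gaussian-prior (uncertainty regularisation) view of transfer learning described above. It is a minimal illustration assuming PyTorch; the checkpoint name, key layout, and hyperparameters are invented for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's code): an LSTM classifier whose
# embedding/LSTM weights are initialised from a pretrained language
# model and then regularised towards those LM weights during
# fine-tuning. The L2 penalty acts like a Gaussian prior centred on
# the LM parameters, matching the Bayesian view of transfer learning.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB_DIM, HID_DIM, N_CLASSES = 50_000, 300, 512, 2  # illustrative sizes

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)
        self.head = nn.Linear(HID_DIM, N_CLASSES)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        out, _ = self.lstm(self.embed(tokens))
        return self.head(out[:, -1])            # last hidden state -> logits

model = LSTMClassifier()

# Transfer: copy the shared weights from a pretrained LM checkpoint
# ("pretrained_lm.pt" and its key names are hypothetical).
lm = torch.load("pretrained_lm.pt")
model.embed.weight.data.copy_(lm["embed.weight"])
for name, p in model.lstm.named_parameters():
    p.data.copy_(lm["lstm." + name])

# Snapshot the transferred weights to serve as the prior mean.
prior = {n: p.detach().clone() for n, p in model.named_parameters()
         if not n.startswith("head")}

def loss_fn(logits, labels, lam=1e-3):
    """Cross-entropy plus an L2 pull towards the LM weights."""
    reg = sum(((p - prior[n]) ** 2).sum()
              for n, p in model.named_parameters() if n in prior)
    return F.cross_entropy(logits, labels) + lam * reg
```

Setting `lam` to zero recovers plain fine-tuning; larger values keep the classifier closer to the LM, which matters most when the labelled set is small.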
References
1. Ahluwalia R, Soni H, Callow E, Nascimento A, De Cock M (2018) Detecting hate speech against women in English tweets. EVALITA Eval NLP Speech Tools Ital 12:194
2. Amnesty International (2018) Toxic Twitter—a toxic place for women. https://bit.ly/2FZYQhV
3. Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion. International World Wide Web Conferences Steering Committee, pp 759–760
4. Bartlett J, Norrie R, Patel S, Rumpel R, Wibberley S (2014) Misogyny on Twitter. Demos. Retrieved from Analysis and Policy Observatory website https://apo.org.au/node/39610
5. Bashar MA, Nayak R, Suzor N, Weir B (2018) Misogynistic tweet detection: modelling CNN with small datasets. In: The 16th Australasian data mining conference
6. Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural network. In: International conference on machine learning, pp 1613–1622
7. Bouchard G, Triggs B (2004) The tradeoff between generative and discriminative classifiers. In: 16th IASC international symposium on computational statistics (COMPSTAT'04), pp 721–728
8. Bradbury J, Merity S, Xiong C, Socher R (2016) Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576
9. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 785–794
10. Dai AM, Le QV (2015) Semi-supervised sequence learning. In: Advances in neural information processing systems, pp 3079–3087
11. Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009
12. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
13. Downey A (2012) Think Bayes: Bayesian statistics made simple. Green Tea Press, Needham
14. Dragiewicz M, Burgess J, Matamoros-Fernández A, Salter M, Suzor NP, Woodlock D, Harris B (2018) Technology facilitated coercive control: domestic violence and the competing roles of digital media platforms. Fem Media Stud 18:1–17
15. Fadaee M, Bisazza A, Monz C (2017) Data augmentation for low-resource neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 2: short papers), vol 2, pp 567–573
16. Fersini E, Nozza D, Rosso P (2018) Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In: Proceedings of the 6th evaluation campaign of natural language processing and speech tools for Italian (EVALITA'18), Turin, Italy
17. Gal Y (2016) Uncertainty in deep learning. PhD thesis, University of Cambridge, Cambridge
18. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in neural information processing systems, pp 1019–1027
19. Gitari ND, Zuping Z, Damien H, Long J (2015) A lexicon-based approach for hate speech detection. Int J Multimed Ubiquitous Eng 10(4):215–230
20. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
21. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Appl 13(4):18–28
22. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
23. Hoerl AE, Kennard RW (1970) Ridge regression: applications to nonorthogonal problems. Technometrics 12(1):69–82
24. Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), vol 1, pp 328–339
25. Jozefowicz R, Vinyals O, Schuster M, Shazeer N, Wu Y (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410
26. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1746–1751
27. Kwok I, Wang Y (2013) Locate the hate: detecting tweets against blacks. In: Twenty-seventh AAAI conference on artificial intelligence
28. Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems, pp 556–562
29. Lewis DD (1998) Naive (Bayes) at forty: the independence assumption in information retrieval. In: European conference on machine learning. Springer, pp 4–15
30. Li Y, Algarni A, Zhong N (2010) Mining positive and negative patterns for relevance feature discovery. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Washington, pp 753–762
31. Li Y, Gal Y (2017) Dropout inference in Bayesian neural networks with alpha-divergences. In: Proceedings of the 34th international conference on machine learning, JMLR.org, vol 70, pp 2052–2061
32. Liaw A, Wiener M et al (2002) Classification and regression by randomForest. R News 2(3):18–22
33. Liu P, Li W, Zou L (2019) NULI at SemEval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In: Proceedings of the 13th international workshop on semantic evaluation, pp 87–91
34. Logeswaran L, Lee H (2018) An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893
35. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. Association for Computational Linguistics, vol 1, pp 142–150
36. MacKay DJ (1992) A practical Bayesian framework for backpropagation networks. Neural Comput 4(3):448–472
37. McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med 22(3):276–282
38. Melis G, Dyer C, Blunsom P (2017) On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589
39. Merity S, Keskar NS, Socher R (2017) Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182
40. Merity S, Xiong C, Bradbury J, Socher R (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843
41. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication association
42. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
43. Molina-González MD, Plaza-del Arco FM, Martín-Valdivia M, Ureña López L (2019) Ensemble learning to detect aggressiveness in Mexican Spanish tweets. In: Proceedings of the first workshop for Iberian languages evaluation forum (IberLEF 2019), CEUR WS proceedings
44. Pitsilis GK, Ramampiaro H, Langseth H (2018) Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433
45. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding with unsupervised learning. Technical report, OpenAI
46. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533
47. Sharif Razavian A, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 806–813
48. Silva L, Mondal M, Correa D, Benevenuto F, Weber I (2016) Analyzing the targets of hate in online social media. In: Tenth international AAAI conference on web and social media
49. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
50. Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision, pp 843–852
51. Sundermeyer M, Schlüter R, Ney H (2012) LSTM neural networks for language modeling. In: Thirteenth annual conference of the international speech communication association
52. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112
53. Suzor N, Van Geelen T, Myers West S (2018) Evaluating the legitimacy of platform governance: a review of research and a shared research agenda. Int Commun Gaz 80(4):385–400
54. The Online Hate Index: Innovation Brief (2018) Technical report, the Anti-Defamation League's Center for Technology and Society. https://www.adl.org/media/10894/download
55. Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 384–394
56. Wang B, Wang A, Chen F, Wang Y, Kuo C-CJ (2019) Evaluating word embedding models: methods and experimental results. APSIPA Trans Signal Inf Process 8:e19
57. Wang D, Nyberg E (2015) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: short papers), vol 2, pp 707–712
58. Wang W, Chen L, Thirunarayan K, Sheth AP (2014) Cursing in English on Twitter. In: Proceedings of the 17th ACM conference on computer supported cooperative work & social computing. ACM, pp 415–425
59. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10(Feb):207–244
60. Xiang G, Fan B, Wang L, Hong J, Rose C (2012) Detecting offensive tweets via topical feature discovery over a large scale Twitter corpus. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 1980–1984
61. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer, pp 818–833
62. Zhang KW, Bowman SR (2018) Language modeling teaches you more syntax than translation does: lessons learned through auxiliary task analysis. arXiv preprint arXiv:1809.10040
63. Zhang Z, Luo L (2019) Hate speech detection: a solved problem? The challenging case of long tail on Twitter. Semant Web 10(5):925–945
Acknowledgements
This research was partially supported by the QUT IFE Catapult fund. Suzor is the recipient of an Australian Research Council DECRA Fellowship (project number DE160101542).
Appendix A: Description of evaluation measures
- True Positive (TP): True positives are instances classified as positive by the model that actually are positive.
- True Negative (TN): True negatives are instances the model classifies as negative that actually are negative.
- False Positive (FP): False positives are instances the model classifies as positive that actually are negative.
- False Negative (FN): False negatives are instances the model classifies as negative that actually are positive.
- Accuracy (Ac): The percentage of correctly classified instances, calculated as \(\frac{\hbox {TP} + \hbox {TN}}{\hbox {TP} + \hbox {TN} + \hbox {FP} + \hbox {FN}}\).
- Precision (Pr): Measures a model's ability to return only relevant instances. It is calculated as \(\frac{\hbox {TP}}{\hbox {TP} + \hbox {FP}}\).
- Recall (Re): Measures a model's ability to identify all relevant instances. It is calculated as \(\frac{\hbox {TP}}{\hbox {TP} + \hbox {FN}}\).
- \(F_1\) Score (\(F_1\)): A single metric that combines precision and recall via their harmonic mean. It is calculated as \(2 \times \frac{\hbox {precision} \times \hbox {recall}}{\hbox {precision} + \hbox {recall}}\).
- Cohen Kappa (CK): Cohen's kappa score measures inter-rater and intra-rater reliability for categorical items [37]. It is calculated as \(\frac{\hbox {OA}-\hbox {AC}}{1-\hbox {AC}}\), where OA is the relative observed agreement between predicted and actual labels and AC is the probability of agreement by chance.
- Area Under Curve (AUC): The area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the model's threshold for classifying a positive varies. AUC summarises the overall performance of a classification model. A worked example computing all of these measures follows this list.
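As a quick illustration of the measures above, the sketch below computes them for a toy binary problem. The counts and label vectors are invented for illustration, and scikit-learn (an assumption, not something the paper specifies) supplies Cohen's kappa and AUC.

```python
# Worked example of the evaluation measures for a binary classifier.
# All numbers here are toy values, not results from the paper.
from sklearn.metrics import cohen_kappa_score, roc_auc_score

tp, tn, fp, fn = 40, 45, 5, 10          # illustrative confusion-matrix counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Ac={accuracy:.3f} Pr={precision:.3f} Re={recall:.3f} F1={f1:.3f}")

# Kappa and AUC need per-instance labels/scores rather than counts:
y_true  = [1, 1, 0, 0, 1, 0]              # toy ground-truth labels
y_pred  = [1, 0, 0, 0, 1, 1]              # toy predicted labels
y_score = [0.9, 0.4, 0.2, 0.1, 0.8, 0.6]  # toy positive-class scores
print("CK  =", cohen_kappa_score(y_true, y_pred))
print("AUC =", roc_auc_score(y_true, y_score))
```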
Cite this article
Bashar, M.A., Nayak, R. & Suzor, N. Regularising LSTM classifier by transfer learning for detecting misogynistic tweets with small training set. Knowl Inf Syst 62, 4029–4054 (2020). https://doi.org/10.1007/s10115-020-01481-0
Keywords
- Misogynistic tweet
- Transfer learning
- LSTM
- Small dataset
- Overfitting