Abstract
While social media has taken a fixed place in our daily life, its steadily growing prominence also exacerbates the problem of hostile content and hate-speech. These destructive phenomena call for automatic hate-speech detection, which, however, faces two major challenges: i) the dynamic nature of online content, which causes significant data-drift over time, and ii) a high class-skew, as hate-speech represents a relatively small fraction of the overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by querying human domain-experts to annotate meticulously selected batches of data instances. However, little prior work exists on batch mode active learning under high class-skew, and in particular on the problem of hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework follows the so-called screening approach, which pre-selects a subset of the most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven skew-specialized cluster representation, with a higher potential to “cherry pick” minority classes. In extensive experiments on multiple datasets with highly imbalanced class ratios, we demonstrate substantial improvements in terms of G-Means and F1 measure over several baseline approaches.
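The two-stage screening idea described above can be illustrated with a minimal sketch. It is not the authors' PS3 implementation: the uncertainty score, the function names (`uncertainty`, `screen`, `select_batch`), and the parameters `pool_size` and `batch_size` are all assumptions made for illustration, and the representativeness step uses a simple greedy farthest-point rule as a stand-in for the skew-specialized cluster representation, which the abstract does not specify.

```python
import math


def uncertainty(p_pos):
    """Assumed uncertainty score: 1.0 at p = 0.5, falling to 0.0 at p = 0 or 1."""
    return 1.0 - abs(2.0 * p_pos - 1.0)


def screen(probs, pool_size):
    """Stage 1 (screening): indices of the pool_size most uncertain items."""
    ranked = sorted(range(len(probs)),
                    key=lambda i: uncertainty(probs[i]),
                    reverse=True)
    return ranked[:pool_size]


def select_batch(features, probs, pool_size, batch_size):
    """Stage 2: pick a spread-out batch from the screened pool.

    Stand-in for a cluster-based representation: greedy farthest-point
    selection, so that the batch covers the uncertainty space rather than
    concentrating on near-duplicates.
    """
    pool = screen(probs, pool_size)
    if not pool or batch_size <= 0:
        return []
    batch = [pool[0]]  # start from the single most uncertain item
    while len(batch) < min(batch_size, len(pool)):
        # Choose the pool item whose nearest already-selected item is farthest.
        best = max(
            (i for i in pool if i not in batch),
            key=lambda i: min(math.dist(features[i], features[j])
                              for j in batch),
        )
        batch.append(best)
    return batch
```

Here `probs` would hold a classifier's predicted probability of the hate-speech class for each unlabeled item and `features` the corresponding feature vectors; the returned indices are the items that would be sent to annotators in one round.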
Keywords
- Batch-mode active learning
- Imbalanced data
- Hate-speech recognition
Notes
- 1.
- 2. Annotators had to label “hate-speech” (\(y=1\)) vs. “no hate-speech” (\(y=0\)). Hate-speech was defined as a statement expressing hate or extreme bias towards a particular group, in particular defined via religion, race, gender, or sexual orientation. Offensive or hateful expressions directed towards individuals, without reference to a group-defining property, did not count as hate-speech.
- 3.
Acknowledgement
This work was supported by grants from the Indonesia Endowment Fund for Education (LPDP) and the Ministry of Research, Technology and Higher Education of the Republic of Indonesia (BUDI-LN Scholarship). The authors would also like to thank the research programme Commit2Data, specifically the RATE-Analytics project NWO628 003 001, (partly) financed by the Dutch Research Council.
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Fajri, R.M., Khoshrou, S., Peharz, R., Pechenizkiy, M. (2021). PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data. In: Dong, Y., Ifrim, G., Mladenić, D., Saunders, C., Van Hoecke, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12461. Springer, Cham. https://doi.org/10.1007/978-3-030-67670-4_5
DOI: https://doi.org/10.1007/978-3-030-67670-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67669-8
Online ISBN: 978-3-030-67670-4
eBook Packages: Computer Science, Computer Science (R0)
Published in cooperation with
http://www.ecmlpkdd.org/