
PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data

Part of the Lecture Notes in Computer Science book series (LNAI, volume 12461)

Abstract

While social media has become a fixture of daily life, its steadily growing prominence also exacerbates the problem of hostile content and hate-speech. These destructive phenomena call for automatic hate-speech detection, which faces two major challenges: i) the dynamic nature of online content, which causes significant data drift over time, and ii) high class skew, as hate-speech represents a relatively small fraction of overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by querying human domain experts to annotate carefully selected batches of data instances. However, little prior work exists on batch mode active learning under high class skew, particularly for hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework follows the so-called screening approach, which pre-selects a subset of the most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven skew-specialized cluster representation with a higher potential to “cherry pick” minority classes. In extensive experiments, we demonstrate substantial improvements in G-Means and F1 measure over several baseline approaches on multiple datasets with highly imbalanced class ratios.
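The two-stage screening idea described in the abstract can be illustrated with a minimal sketch. This is not the PS3 method itself: the skew-specialized cluster representation is not reproduced here, and plain k-means stands in for the partitioning step. The function names (`uncertainty`, `kmeans`, `select_batch`), the parameters `screen_size` and `batch_size`, and the toy data are all assumptions made for illustration.

```python
def uncertainty(p):
    """Binary uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic init (first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_batch(features, probs, screen_size, batch_size):
    """Screen by uncertainty, then pick one representative per cluster."""
    screen_size = min(screen_size, len(probs))
    batch_size = min(batch_size, screen_size)
    # Stage 1 (screening): keep the screen_size most uncertain items.
    ranked = sorted(range(len(probs)),
                    key=lambda i: uncertainty(probs[i]), reverse=True)
    screened = ranked[:screen_size]
    # Stage 2 (representativeness): cluster the screened items and take
    # the not-yet-chosen item closest to each centroid.
    centroids = kmeans([features[i] for i in screened], batch_size)
    batch = []
    for c in centroids:
        remaining = [i for i in screened if i not in batch]
        batch.append(min(remaining, key=lambda i: sq_dist(features[i], c)))
    return batch


if __name__ == "__main__":
    # Toy data (hypothetical): 2-D features and predicted P(y = 1).
    features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                [5.1, 5.0], [9.0, 0.0], [2.0, 2.0]]
    probs = [0.5, 0.45, 0.55, 0.52, 0.99, 0.02]
    print(select_batch(features, probs, screen_size=4, batch_size=2))
```

In this toy run the two confidently classified items are screened out first, and the batch then takes one item from each of the two remaining groups rather than two near-duplicates from the same group. A skew-aware variant would change how clusters are represented or weighted in stage 2, which is the part this sketch leaves out.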

Keywords

  • Batch-mode active learning
  • Imbalanced data
  • Hate-speech recognition


Notes

  1. http://theconversation.com/why-ai-cant-solve-everything-97022

  2. Annotators had to label “hate-speech” (\(y=1\)) vs. “no hate-speech” (\(y=0\)). Hate-speech was defined as a statement expressing hate or extreme bias towards a particular group, in particular defined via religion, race, gender, or sexual orientation. Offensive or hateful expressions directed towards individuals, without reference to a group-defining property, did not count as hate-speech.

  3. nltk.org

Acknowledgement

This work was supported by grants from the Indonesia Endowment Fund for Education (LPDP) and the Ministry of Research, Technology and Higher Education of the Republic of Indonesia (BUDI-LN Scholarship). The authors would also like to thank the research programme Commit2Data, specifically the RATE-Analytics project NWO628 003 001, (partly) financed by the Dutch Research Council.

Author information

Corresponding author

Correspondence to Ricky Maulana Fajri.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Fajri, R.M., Khoshrou, S., Peharz, R., Pechenizkiy, M. (2021). PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data. In: Dong, Y., Ifrim, G., Mladenić, D., Saunders, C., Van Hoecke, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12461. Springer, Cham. https://doi.org/10.1007/978-3-030-67670-4_5

  • DOI: https://doi.org/10.1007/978-3-030-67670-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67669-8

  • Online ISBN: 978-3-030-67670-4

  • eBook Packages: Computer Science, Computer Science (R0)