
Student Class Behavior Dataset: a video dataset for recognizing, detecting, and captioning students’ behaviors in classroom scenes

  • Original Article
  • Published in Neural Computing and Applications

Abstract

The massive growth of classroom video data makes it possible to use artificial intelligence technology to automatically recognize, detect, and caption students' behaviors, which benefits related research fields such as pedagogy and educational psychology. However, the lack of a dataset specifically designed for students' classroom behaviors may hinder such studies. This paper presents a comprehensive dataset for recognizing, detecting, and captioning students' behaviors in the classroom. We collected videos of 128 classes across different disciplines in 11 classrooms. The dataset consists of a detection part, a recognition part, and a captioning part: the detection part includes a temporal detection module with 4542 samples and an action detection module with 3343 samples, the recognition part contains 4276 samples, and the captioning part contains 4296 samples. Because the students' behaviors are spontaneous in real classes, the dataset is representative and realistic. We analyze the special characteristics of the classroom scene and the technical difficulties of each task module, and verify them by experiments. Owing to the particular nature of classroom scenes, our dataset places higher demands on existing methods. We also provide a baseline for each task module and compare our dataset with current mainstream datasets; the results show that the dataset is viable and reliable. Additionally, we present a thorough performance analysis of each baseline model to provide a comprehensive reference for models using the presented dataset. The dataset and code are available online: https://github.com/BNU-Wu/Student-Class-Behavior-Dataset/tree/master.
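For orientation, the sketch below shows one way the four task modules and their reported sample counts could be represented when loading annotations in Python. The directory layout, file extensions, and the load_module helper are assumptions made for illustration, not the authors' released format; consult the GitHub repository above for the actual organization.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

# Sample counts for each task module, as reported in the abstract.
MODULE_SIZES = {
    "temporal_detection": 4542,
    "action_detection": 3343,
    "recognition": 4276,
    "captioning": 4296,
}

@dataclass
class BehaviorSample:
    video: Path   # path to the classroom video clip
    task: str     # one of the MODULE_SIZES keys
    label: str    # behavior class, temporal/spatial annotation, or caption text

def load_module(root: Path, task: str) -> List[BehaviorSample]:
    """Illustrative loader: assumes one plain-text annotation file per clip under <root>/<task>/."""
    samples = []
    for ann in sorted((root / task).glob("*.txt")):
        samples.append(
            BehaviorSample(
                video=ann.with_suffix(".mp4"),
                task=task,
                label=ann.read_text(encoding="utf-8").strip(),
            )
        )
    return samples

if __name__ == "__main__":
    for task, expected in MODULE_SIZES.items():
        print(f"{task}: {expected} samples reported in the paper")
```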





Author information

Corresponding author

Correspondence to Jun He.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with regard to this work and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Sun, B., Wu, Y., Zhao, K. et al. Student Class Behavior Dataset: a video dataset for recognizing, detecting, and captioning students’ behaviors in classroom scenes. Neural Comput & Applic 33, 8335–8354 (2021). https://doi.org/10.1007/s00521-020-05587-y

