
AHMN: A multi-modal network for long MOOC videos chapter segmentation

  • Part of the collection: Robust Enhancement, Understanding and Assessment of Low-quality Multimedia Data
  • Published in Multimedia Tools and Applications

Abstract

This paper proposes a task named MOOC Videos Chapter Segmentation (MVCS), a significant problem in the field of video understanding. To address this problem, we first introduce a dataset called MOOC Videos Understanding (MVU), which consists of approximately 10k annotated chapters organized from 120k snippets across 400 MOOC videos; chapters and snippets are the two levels of video unit proposed in this paper for hierarchical representation of videos. We then design the Attention-based Hierarchical bi-LSTM Multi-modal Network (AHMN) around three core ideas: (1) we exploit features from multi-modal semantic elements, including video, audio, and text, together with an attention-based multi-modal fusion module, to extract video information comprehensively; (2) we focus on chapter boundaries rather than recognizing the content of the chapters themselves, and develop the Boundary Predict Network (BPN) to label boundaries between chapters; (3) we exploit the semantic consistency between snippets and develop Consistency Modeling as an auxiliary task to improve the performance of the BPN. Our experiments demonstrate that the proposed AHMN solves MVCS precisely, outperforming previous methods on all evaluation metrics.
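
To make the formulation concrete, the sketch below illustrates the snippet/chapter hierarchy and the boundary-labeling view taken by the BPN: a video is a sequence of snippets, a chapter is a run of consecutive snippets, and segmentation reduces to one binary label per gap between adjacent snippets. The function names and the list-based representation are illustrative only and are not taken from the paper's implementation.

```python
# Illustration of the MVCS formulation: a video is a sequence of snippets,
# a chapter is a run of consecutive snippets, so chapter segmentation reduces
# to predicting a binary label for each gap between adjacent snippets.
# Names are illustrative, not taken from the authors' code.
def chapters_to_boundary_labels(chapter_lengths):
    """chapter_lengths: snippets per chapter, e.g. [3, 2, 4] -> 9 snippets.
    Returns one label per gap between consecutive snippets (1 = chapter ends)."""
    labels = []
    for length in chapter_lengths:
        labels += [0] * (length - 1) + [1]
    return labels[:-1]  # the gap after the last snippet does not exist


def boundary_labels_to_chapters(labels):
    """Inverse mapping: group snippet indices back into chapters."""
    chapters, current = [], [0]
    for i, is_boundary in enumerate(labels, start=1):
        if is_boundary:
            chapters.append(current)
            current = []
        current.append(i)
    chapters.append(current)
    return chapters


# chapters_to_boundary_labels([3, 2, 4]) -> [0, 0, 1, 0, 1, 0, 0, 0]
# boundary_labels_to_chapters([0, 0, 1, 0, 1, 0, 0, 0])
#   -> [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
```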

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. http://videolectures.net

  2. https://www.readbeyond.it/aeneas

  3. https://www.selenium.dev

  4. http://videolectures.net

References

  1. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732

  2. Wang L, Li W, Li W, Van Gool L (2018) Appearance-and-relation networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1430–1439

  3. Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1227–1236

  4. Xiao Y, Yuan Q, Jiang K, Jin X, He J, Zhang L, Lin C-w (2023) Local-global temporal difference learning for satellite video super-resolution. IEEE Trans Circ Syst Vid Technol 1–16. https://doi.org/10.1109/TCSVT.2023.3312321

  5. Xiao Y, Su X, Yuan Q, Liu D, Shen H, Zhang L (2022) Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection. IEEE Trans Geosci Remote Sens 60:1–19. https://doi.org/10.1109/TGRS.2021.3107352

  6. Mukherjee A, Tiwari S, Chowdhury T, Chakraborty T (2019) Automatic curation of content tables for educational videos. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp 1329–1332

  7. Bendraou Y (2017) Video shot boundary detection and key-frame extraction using mathematical models. Université du littoral côte d’opale

  8. Lu Z-M, Shi Y (2013) Fast video shot boundary detection based on svd and pattern matching. IEEE Trans Image Process 22(12):5136–5145

  9. Shao H, Qu Y, Cui W (2015) Shot boundary detection algorithm based on hsv histogram and hog feature. In: 5th International conference on advanced engineering materials and technology, pp 951–957

  10. Shou Z, Wang D, Chang S-F (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1049–1058

  11. Lin T, Liu X, Li X, Ding E, Wen S (2019) Bmn: Boundary-matching network for temporal action proposal generation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3889–3898

  12. Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: Proceedings of the 25th ACM international conference on multimedia, pp 988–996

  13. Tapaswi M, Bauml M, Stiefelhagen R (2014) Storygraphs: visualizing character interactions as a timeline. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 827–834

  14. Rao A, Xu L, Xiong Y, Xu G, Huang Q, Zhou B, Lin D (2020) A local-to-global approach to multi-modal movie scene segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10146–10155

  15. Ma D, Zhang X, Ouyang X, Agam G (2017) Lecture video indexing using boosted margin maximizing neural networks. In: 2017 16th IEEE international conference on machine learning and applications (ICMLA), pp 221–227. IEEE

  16. Basu S, Yu Y, Singh VK, Zimmermann R (2016) Videopedia: Lecture video recommendation for educational blogs using topic modeling. In: International conference on multimedia modeling, pp 238–250. Springer

  17. Lin M, Chau M, Cao J, Nunamaker JF Jr (2005) Automated video segmentation for lecture videos: A linguistics-based approach. Int J Technol Hum Interact (IJTHI) 1(2):27–45

  18. Shah RR, Yu Y, Shaikh AD, Tang S, Zimmermann R (2014) Atlas: automatic temporal segmentation and annotation of lecture videos based on modelling transition time. In: Proceedings of the 22nd ACM international conference on multimedia, pp 209–212

  19. Che X, Yang H, Meinel C (2013) Lecture video segmentation by automatically analyzing the synchronized slides. In: Proceedings of the 21st ACM international conference on multimedia, pp 345–348

  20. Soares ER, Barrére E (2019) An optimization model for temporal video lecture segmentation using word2vec and acoustic features. In: Proceedings of the 25th Brazilian symposium on multimedia and the web, pp 513–520

  21. Zhao B, Lin S, Luo X, Xu S, Wang R (2017) A novel system for visual navigation of educational videos using multimodal cues. In: Proceedings of the 25th ACM international conference on multimedia, pp 1680–1688

  22. Gupta R, Roy A, Christensen C, Kim S, Gerard S, Cincebeaux M, Divakaran A, Grindal T, Shah M (2023) Class prototypes based contrastive learning for classifying multi-label and fine-grained educational videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19923–19933

  23. Croitoru I, Bogolin S-V, Albanie S, Liu Y, Wang Z, Yoon S, Dernoncourt F, Jin H, Bui T (2023) Moment detection in long tutorial videos. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2594–2604

  24. Lee DW, Ahuja C, Liang PP, Natu S, Morency L-P (2022) Multimodal lecture presentations dataset: Understanding multimodality in educational slides. arXiv:2208.08080

  25. Ghauri JA, Hakimov S, Ewerth R (2020) Classification of important segments in educational videos using multimodal features. arXiv:2010.13626

  26. Zhong Y, Ji W, Xiao J, Li Y, Deng W, Chua T-S (2022) Video question answering: datasets, algorithms and challenges. arXiv:2203.01225

  27. Ge Y, Ge Y, Liu X, Li D, Shan Y, Qie X, Luo P (2022) Bridging video-text retrieval with multiple choice questions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16167–16176

  28. Yang A, Miech A, Sivic J, Laptev I, Schmid C (2022) Tubedetr: Spatio-temporal video grounding with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16442–16453

  29. Awad G, Butt AA, Fiscus J, Joy D, Delgado A, Mcclinton W, Michel M, Smeaton AF, Graham Y, Kraaij W (2017) Trecvid 2017: evaluating ad-hoc and instance video search, events detection, video captioning, and hyperlinking. In: TREC video retrieval evaluation (TRECVID)

  30. Singer U, Polyak A, Hayes T, Yin X, An J, Zhang S, Hu Q, Yang H, Ashual O, Gafni O, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv:2209.14792

  31. Liu Z, Zhao F, Zhang M (2022) Multi-modal transformer for video retrieval using improved sentence embeddings. In: Fourteenth international conference on digital image processing (ICDIP 2022), vol. 12342, pp 601–607. SPIE

  32. Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Trans Multimed 19(9):2045–2055

  33. Le TM, Le V, Venkatesh S, Tran T (2020) Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9972–9981

  34. Zhang S, Peng H, Fu J, Lu Y, Luo J (2021) Multi-scale 2d temporal adjacency networks for moment localization with natural language. IEEE Trans Pattern Anal Mach Intell 44(12):9073–9087

  35. Miech A, Alayrac J-B, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9879–9889

  36. Lei J, Li L, Zhou L, Gan Z, Berg TL, Bansal M, Liu J (2021) Less is more: Clipbert for video-and-language learning via sparse sampling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7331–7341

  37. Mei X, Liu X, Sun J, Plumbley MD, Wang W (2022) On metric learning for audio-text cross-modal retrieval. arXiv:2203.15537

  38. Liu X, Mei X, Huang Q, Sun J, Zhao J, Liu H, Plumbley MD, Kilic V, Wang W (2022) Leveraging pre-trained bert for audio captioning. In: 2022 30th European Signal Processing Conference (EUSIPCO), pp 1145–1149. IEEE

  39. Aldausari N, Sowmya A, Marcus N, Mohammadi G (2022) Cascaded siamese self-supervised audio to video gan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4691–4700

  40. Iashin V, Rahtu E (2021) Taming visually guided sound generation. arXiv:2110.08791

  41. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  42. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781

  43. Koshorek O, Cohen A, Mor N, Rotman M, Berant J (2018) Text segmentation as a supervised learning task. arXiv:1803.09337

  44. Berlage O, Lux K-M, Graus D (2020) Improving automated segmentation of radio shows with audio embeddings. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 751–755. IEEE

  45. Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B (2017) Cnn architectures for large-scale audio classification. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 131–135. IEEE

  46. Gemmeke JF, Ellis DP, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 776–780. IEEE

  47. Chen Y (2015) Convolutional neural network for sentence classification. Master’s thesis, University of Waterloo

  48. Munro J, Damen D (2020) Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 122–132

  49. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  50. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  51. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078

  52. Katharopoulos A, Vyas A, Pappas N, Fleuret F (2020) Transformers are rnns: Fast autoregressive transformers with linear attention. In: International conference on machine learning, pp 5156–5165. PMLR

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (Nos. 2021ZD0113202, 2022YFE0116700), in part by the National Natural Science Foundation of China under Grants 62171125, 61876037, and in part by the Industry-University-Research cooperation project of Jiangsu Province under grant BY2022564.

Author information

Corresponding author

Correspondence to Yu Sun.

Ethics declarations

Conflicts of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A.1 Dataset Collection

The MOOC Videos Understanding (MVU) dataset is introduced in Section 3 of the paper. Here we provide additional details about the data source. To build a reliable and easily scalable MOOC video dataset, we use Selenium (Note 3) to collect a total of 400 videos from 20 categories on videolectures (Note 4). On videolectures, a webpage typically contains the following elements: a video player, a slide player, and a timeline on which any slide page can be clicked to jump to its corresponding video clip. The site regularly uploads the latest MOOC videos, so the data source is reliable, and the fully automated collection pipeline also makes it easy to expand the MVU.
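
As an illustration of this collection pipeline, the sketch below shows how such a lecture page could be scraped with Selenium. The CSS selectors, attribute names, and URL are hypothetical placeholders for illustration rather than the exact structure of videolectures pages or of our crawler.

```python
# Minimal sketch of the automated collection step described above.
# The selectors and attribute names are illustrative assumptions only.
from selenium import webdriver
from selenium.webdriver.common.by import By


def collect_lecture_metadata(lecture_url):
    """Open a lecture page and read its slide timeline, which maps each
    slide page to the start time of the corresponding video clip."""
    driver = webdriver.Chrome()
    try:
        driver.get(lecture_url)
        # Hypothetical selectors: the real page structure differs.
        video_src = driver.find_element(By.TAG_NAME, "video").get_attribute("src")
        entries = driver.find_elements(By.CSS_SELECTOR, ".slide-timeline .entry")
        timeline = [
            {
                "slide_title": e.text.strip(),
                "start_seconds": float(e.get_attribute("data-start")),
            }
            for e in entries
        ]
        return {"video_src": video_src, "timeline": timeline}
    finally:
        driver.quit()


# Usage (hypothetical URL):
# meta = collect_lecture_metadata("http://videolectures.net/some_lecture/")
```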

Fig. 6  Overview of our text network. The text network consists of two branches: the LSTM-based network on the left produces the snippet's word2vec sentence vector \(\partial _{i}^{w2v}\), and the BERT-based branch on the right produces the sentence vector \(\partial _{i}^{BERT}\)

A.2 Text Network

The text network is introduced in Section 4.1.1 of this paper. We propose a two-stage feature extraction framework for text information, as shown in Fig. 6; here we provide more implementation details on this framework. Inspired by [43], we develop a bi-directional Long Short-Term Memory (bi-LSTM) network to encode the words. For each snippet, the input to the network is the sequence of words \([w_{1}^{i},w_{2}^{i},\cdot \cdot \cdot ,w_{k}^{i}]\) and the output is the 512-dimensional sentence representation \(\partial _{i}^{w2v}\), obtained by max-pooling the bi-LSTM output layer. To further enhance the feature representation of the text modality, we then use the pre-trained BERT [41] model: for each text sentence, we insert [CLS] at the beginning and [SEP] at the end, and feed the sentence with these marker symbols into BERT. The resulting 768-dimensional [CLS] vector of each sentence, projected into a 512-dimensional vector by a linear layer, represents the BERT feature \(\partial _{i}^{BERT}\) of the snippet.
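
A minimal sketch of this two-branch text network in PyTorch (with Hugging Face transformers) is given below. The hidden sizes are chosen so that both branches output 512-dimensional vectors as described above; the word2vec dimension (300) and other configuration details are assumptions for illustration rather than our exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class SnippetTextEncoder(nn.Module):
    def __init__(self, w2v_dim=300, lstm_hidden=256, out_dim=512):
        super().__init__()
        # Branch 1: bi-LSTM over the snippet's word2vec embeddings,
        # max-pooled over time; 2 * 256 hidden units give a 512-d vector.
        self.bilstm = nn.LSTM(w2v_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Branch 2: pre-trained BERT; its 768-d [CLS] vector is projected to 512.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, out_dim)

    def forward(self, w2v_words, input_ids, attention_mask):
        # w2v_words: (B, K, 300) word2vec vectors of the snippet's K words.
        h, _ = self.bilstm(w2v_words)               # (B, K, 512)
        feat_w2v = h.max(dim=1).values              # (B, 512), max-pooled over time
        # input_ids already contain [CLS] and [SEP] added by the tokenizer.
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        feat_bert = self.proj(cls)                  # (B, 512)
        return feat_w2v, feat_bert


# Usage sketch: the tokenizer inserts [CLS]/[SEP] automatically.
# tok = BertTokenizer.from_pretrained("bert-base-uncased")
# enc = tok(["a snippet transcript ..."], return_tensors="pt", padding=True)
# model = SnippetTextEncoder()
# f_w2v, f_bert = model(torch.randn(1, 12, 300),
#                       enc["input_ids"], enc["attention_mask"])
```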

Additional qualitative results

Fig. 7  The role of multi-modal features. The cosine similarity of each semantic element is represented by the length of the corresponding bar

The effectiveness of multi-modal semantic information is shown in Section 5.3 of this paper, where we select typical MOOC video examples to illustrate how different modalities help the AHMN group several semantically similar snippets into one chapter and predict the chapter boundaries. To further clarify the role of multi-modal features, we present more visualization results of the AHMN on the MVCS task in Fig. 7. As shown in Fig. 7(a), the speaker is chatting with the audience on this slide, so the text from the speaker and the audience is not sufficiently similar across these three snippets, and the highly similar video features help the AHMN confirm the boundary. Figure 7(b) shows a typical lecture scene in which the camera moves slightly, causing the video features to change considerably across these snippets, so the AHMN determines this chapter boundary from the similar text and audio features. Similar to Fig. 7(b), the snippets in Fig. 7(c) do not have highly similar video features, and because of the noisy audience discussion, the AHMN can only determine the boundaries from the highly similar text features. Figure 7(d) shows a large-scale public class scene in which the camera moves frequently between the speakers and the screen, so the AHMN relies on the text and audio information from the speaker. The speaker in Fig. 7(e) is playing a video clip with music that contains little similar text information, so the AHMN uses the audio features of this clip to recognize that these snippets belong to the same chapter.
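
The bar lengths in Fig. 7 correspond to per-modality cosine similarities between adjacent snippets. A minimal sketch of this computation is given below; the tensor shapes and names are illustrative, and the actual boundary decision in the AHMN is made by the learned BPN rather than by this simple comparison.

```python
# Per-modality cosine similarity between adjacent snippets, as visualized
# in Fig. 7. Shapes and names are illustrative placeholders.
import torch
import torch.nn.functional as F


def modality_similarities(video, audio, text):
    """Each input is a (T, D) tensor of per-snippet features for one modality.
    Returns a dict of (T-1,) cosine similarities between adjacent snippets."""
    sims = {}
    for name, feats in {"video": video, "audio": audio, "text": text}.items():
        sims[name] = F.cosine_similarity(feats[:-1], feats[1:], dim=-1)
    return sims


# A modality whose adjacent-snippet similarity stays high suggests the
# snippets belong to the same chapter, even if another modality changes.
# sims = modality_similarities(torch.randn(5, 512), torch.randn(5, 512),
#                              torch.randn(5, 512))
```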

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, J., Sun, Y., Kong, Y. et al. AHMN: A multi-modal network for long MOOC videos chapter segmentation. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17654-2
