Abstract
This paper proposes a task named MOOC Videos Chapter Segmentation (MVCS), a significant problem in the field of video understanding. To address this problem, we first introduce a dataset called MOOC Videos Understanding (MVU), which consists of approximately 10k annotated chapters organized from 120k snippets across 400 MOOC videos; chapters and snippets are the two levels of video unit proposed in this paper for hierarchical representation of videos. We then design the Attention-based Hierarchical bi-LSTM Multi-modal Network (AHMN) around three core ideas: (1) we take advantage of the features of multi-modal semantic elements, including video, audio, and text, along with an attention-based multi-modal fusion module, to extract video information in a comprehensive way; (2) we focus on chapter boundaries rather than recognizing the content of the chapters themselves, so we develop the Boundary Predict Network (BPN) to label boundaries between chapters; (3) we exploit the semantic consistency between snippets and develop Consistency Modeling as an auxiliary task to improve the performance of the BPN. Our experiments demonstrate that the proposed AHMN solves the MVCS task precisely, outperforming previous methods on all evaluation metrics.
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732
Wang L, Li W, Li W, Van Gool L (2018) Appearance-and-relation networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1430–1439
Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1227–1236
Xiao Y, Yuan Q, Jiang K, Jin X, He J, Zhang L, Lin C-w (2023) Local-global temporal difference learning for satellite video super-resolution. IEEE Trans Circ Syst Vid Technol 1–16. https://doi.org/10.1109/TCSVT.2023.3312321
Xiao Y, Su X, Yuan Q, Liu D, Shen H, Zhang L (2022) Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection. IEEE Trans Geosci Remote Sens 60:1–19. https://doi.org/10.1109/TGRS.2021.3107352
Mukherjee A, Tiwari S, Chowdhury T, Chakraborty T (2019) Automatic curation of content tables for educational videos. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp 1329–1332
Bendraou Y (2017) Video shot boundary detection and key-frame extraction using mathematical models. Université du littoral côte d’opale
Lu Z-M, Shi Y (2013) Fast video shot boundary detection based on svd and pattern matching. IEEE Trans Image Process 22(12):5136–5145
Shao H, Qu Y, Cui W (2015) Shot boundary detection algorithm based on hsv histogram and hog feature. In: 5th International conference on advanced engineering materials and technology, pp 951–957
Shou Z, Wang D, Chang S-F (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1049–1058
Lin T, Liu X, Li X, Ding E, Wen S (2019) Bmn: Boundary-matching network for temporal action proposal generation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3889–3898
Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: Proceedings of the 25th ACM international conference on multimedia, pp 988–996
Tapaswi M, Bauml M, Stiefelhagen R (2014) Storygraphs: visualizing character interactions as a timeline. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 827–834
Rao A, Xu L, Xiong Y, Xu G, Huang Q, Zhou B, Lin D (2020) A local-to-global approach to multi-modal movie scene segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10146–10155
Ma D, Zhang X, Ouyang X, Agam G (2017) Lecture video indexing using boosted margin maximizing neural networks. In: 2017 16th IEEE international conference on machine learning and applications (ICMLA), pp 221–227. IEEE
Basu S, Yu Y, Singh VK, Zimmermann R (2016) Videopedia: Lecture video recommendation for educational blogs using topic modeling. In: International conference on multimedia modeling, pp 238–250. Springer
Lin M, Chau M, Cao J, Nunamaker JF Jr (2005) Automated video segmentation for lecture videos: A linguistics-based approach. Int J Technol Hum Interact (IJTHI) 1(2):27–45
Shah RR, Yu Y, Shaikh AD, Tang S, Zimmermann R (2014) Atlas: automatic temporal segmentation and annotation of lecture videos based on modelling transition time. In: Proceedings of the 22nd ACM international conference on multimedia, pp 209–212
Che X, Yang H, Meinel C (2013) Lecture video segmentation by automatically analyzing the synchronized slides. In: Proceedings of the 21st ACM international conference on multimedia, pp 345–348
Soares ER, Barrére E (2019) An optimization model for temporal video lecture segmentation using word2vec and acoustic features. In: Proceedings of the 25th Brazilian symposium on multimedia and the web, pp 513–520
Zhao B, Lin S, Luo X, Xu S, Wang R (2017) A novel system for visual navigation of educational videos using multimodal cues. In: Proceedings of the 25th ACM international conference on multimedia, pp 1680–1688
Gupta R, Roy A, Christensen C, Kim S, Gerard S, Cincebeaux M, Divakaran A, Grindal T, Shah M (2023) Class prototypes based contrastive learning for classifying multi-label and fine-grained educational videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19923–19933
Croitoru I, Bogolin S-V, Albanie S, Liu Y, Wang Z, Yoon S, Dernoncourt F, Jin H, Bui T (2023) Moment detection in long tutorial videos. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2594–2604
Lee DW, Ahuja C, Liang PP, Natu S, Morency L-P (2022) Multimodal lecture presentations dataset: Understanding multimodality in educational slides. arXiv:2208.08080
Ghauri JA, Hakimov S, Ewerth R (2020) Classification of important segments in educational videos using multimodal features. arXiv:2010.13626
Zhong Y, Ji W, Xiao J, Li Y, Deng W, Chua T-S (2022) Video question answering: datasets, algorithms and challenges. arXiv:2203.01225
Ge Y, Ge Y, Liu X, Li D, Shan Y, Qie X, Luo P (2022) Bridging video-text retrieval with multiple choice questions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16167–16176
Yang A, Miech A, Sivic J, Laptev I, Schmid C (2022) Tubedetr: Spatio-temporal video grounding with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16442–16453
Awad G, Butt AA, Fiscus J, Joy D, Delgado A, Mcclinton W, Michel M, Smeaton AF, Graham Y, Kraaij W (2017) Trecvid 2017: evaluating ad-hoc and instance video search, events detection, video captioning, and hyperlinking. In: TREC video retrieval evaluation (TRECVID)
Singer U, Polyak A, Hayes T, Yin X, An J, Zhang S, Hu Q, Yang H, Ashual O, Gafni O, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv:2209.14792
Liu Z, Zhao F, Zhang M (2022) Multi-modal transformer for video retrieval using improved sentence embeddings. In: Fourteenth international conference on digital image processing (ICDIP 2022), vol. 12342, pp 601–607. SPIE
Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Trans Multimed 19(9):2045–2055
Le TM, Le V, Venkatesh S, Tran T (2020) Hierarchical conditional relation networks for video question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9972–9981
Zhang S, Peng H, Fu J, Lu Y, Luo J (2021) Multi-scale 2d temporal adjacency networks for moment localization with natural language. IEEE Trans Pattern Anal Mach Intell 44(12):9073–9087
Miech A, Alayrac J-B, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9879–9889
Lei J, Li L, Zhou L, Gan Z, Berg TL, Bansal M, Liu J (2021) Less is more: Clipbert for video-and-language learning via sparse sampling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7331–7341
Mei X, Liu X, Sun J, Plumbley MD, Wang W (2022) On metric learning for audio-text cross-modal retrieval. arXiv:2203.15537
Liu X, Mei X, Huang Q, Sun J, Zhao J, Liu H, Plumbley MD, Kilic V, Wang W (2022) Leveraging pre-trained bert for audio captioning. In: 2022 30th European Signal Processing Conference (EUSIPCO), pp 1145–1149. IEEE
Aldausari N, Sowmya A, Marcus N, Mohammadi G (2022) Cascaded siamese self-supervised audio to video gan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4691–4700
Iashin V, Rahtu E (2021) Taming visually guided sound generation. arXiv:2110.08791
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781
Koshorek O, Cohen A, Mor N, Rotman M, Berant J (2018) Text segmentation as a supervised learning task. arXiv:1803.09337
Berlage O, Lux K-M, Graus D (2020) Improving automated segmentation of radio shows with audio embeddings. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 751–755. IEEE
Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B (2017) Cnn architectures for large-scale audio classification. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 131–135. IEEE
Gemmeke JF, Ellis DP, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 776–780. IEEE
Chen Y (2015) Convolutional neural network for sentence classification. Master’s thesis, University of Waterloo
Munro J, Damen D (2020) Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 122–132
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078
Katharopoulos A, Vyas A, Pappas N, Fleuret F (2020) Transformers are rnns: Fast autoregressive transformers with linear attention. In: International conference on machine learning, pp 5156–5165. PMLR
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China (Nos. 2021ZD0113202, 2022YFE0116700), in part by the National Natural Science Foundation of China under Grants 62171125 and 61876037, and in part by the Industry-University-Research cooperation project of Jiangsu Province under Grant BY2022564.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A.1 Dataset Collection
The MOOC Videos Understanding (MVU) dataset is introduced in Section 3 of the paper. Here we provide additional details about the data source. To build a reliable and easily scalable MOOC video dataset, we use Selenium to collect a total of 400 videos spanning 20 categories from the videolectures website. On videolectures, a webpage usually contains the following elements: a video player, a slide player, and a timeline on which one can click any slide page to jump to its corresponding video clip. The site regularly uploads the latest MOOC videos, so the data source is reliable, and the fully automated collection pipeline also facilitates easy expansion of the MVU.
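As a concrete illustration, the sketch below shows how such a Selenium-based collector might look. The CSS selectors, attribute names, and URL pattern are hypothetical placeholders, since the paper does not specify the exact page structure of the videolectures site.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_video_page(url):
    """Collect slide/timeline metadata from a single lecture page (sketch)."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Hypothetical selector: one entry per slide page on the timeline,
        # each carrying the jump-to time of its corresponding video clip.
        slides = driver.find_elements(By.CSS_SELECTOR, ".timeline .slide")
        return [
            {
                "title": s.get_attribute("title"),
                "start_time": s.get_attribute("data-time"),  # seconds into the video
            }
            for s in slides
        ]
    finally:
        driver.quit()

# Usage (URL pattern assumed): iterate over the 20 category listing pages
# and call collect_video_page(...) for each of the 400 lecture pages.
```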
A.2 Text Network
The text network is introduced in Section 4.1.1 of this paper. We propose a two-stage feature extraction framework for text information, as shown in Fig. 6. Here we provide more implementation details of this framework. Inspired by [43], we develop a bi-directional Long Short-Term Memory (bi-LSTM) network to encode the words. For each snippet, the input to the network is the sequence of words \([w_{1}^{i},w_{2}^{i},\cdots ,w_{k}^{i}]\) and the output is the sentence representation \(\partial _{i}^{w2v}\), a 512-dimensional vector obtained by max-pooling the bi-LSTM output layer. Then, to further enhance the feature representation of the text modality, we use the pre-trained BERT [41] model: for each text sentence, we insert [CLS] at the beginning and [SEP] at the end, and feed the sentence with these marker symbols into BERT. The resulting 768-dimensional [CLS] vector of each sentence, projected into a 512-dimensional vector by a linear layer, represents the BERT feature \(\partial _{i}^{BERT}\) of the snippet.
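For concreteness, the following PyTorch sketch outlines the two-stage extraction described above. The word-embedding dimension and the LSTM hidden size (256 per direction, giving 512-dimensional outputs) are assumptions consistent with the stated 512-dimensional sentence representation; with the HuggingFace `transformers` library, the BERT tokenizer inserts [CLS] and [SEP] automatically.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextFeatureExtractor(nn.Module):
    """Two-stage text features: bi-LSTM sentence encoding plus BERT [CLS]."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM: 2 * hidden_dim = 512-dimensional outputs.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(768, 512)  # project BERT [CLS] to 512 dims

    def forward(self, word_ids, bert_ids, bert_mask):
        # Stage 1: w2v-style representation via bi-LSTM + max-pooling over time.
        out, _ = self.bilstm(self.embedding(word_ids))  # (B, k, 512)
        feat_w2v = out.max(dim=1).values                # (B, 512)
        # Stage 2: BERT [CLS] vector, projected to the same dimension.
        cls = self.bert(input_ids=bert_ids,
                        attention_mask=bert_mask).last_hidden_state[:, 0]
        feat_bert = self.proj(cls)                      # (B, 512)
        return feat_w2v, feat_bert
```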
A.3 Additional Qualitative Results
The effectiveness of multi-modal semantic information is shown in Section 5.3 of this paper, where we select typical MOOC video examples to illustrate how different modalities help the AHMN group several semantically similar snippets into one chapter and predict the chapter boundaries. To further clarify the role of multi-modal features, we present more visualization results of the AHMN on the MVCS task in Fig. 7. As shown in Fig. 7(a), the speaker is chatting with his audience on this slide, so the text from the speaker and the audience is not similar enough across these three snippets, and the highly similar video features help the AHMN confirm the boundary. Figure 7(b) shows a typical lecture scene in which the camera moves slightly, causing the video features to change considerably across these snippets, so the AHMN determines this chapter boundary based on similar text and audio features. Similar to Fig. 7(b), the snippets in Fig. 7(c) do not have highly similar video features, but due to noisy audience discussion, the AHMN can only determine the boundaries based on highly similar text features. Figure 7(d) shows a large-scale public class scene, where the camera moves frequently between the speakers and the screen, so the AHMN relies on the text and audio information from the speaker. The speaker in Fig. 7(e) is playing a video clip with music that contains little similar text information, so the AHMN uses the audio features of this clip to recognize that these snippets belong to the same chapter.
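As an illustrative diagnostic for reading these visualizations (not the AHMN fusion module itself), one can quantify which modality supports a boundary decision by measuring per-modality similarity between consecutive snippet features:

```python
import torch
import torch.nn.functional as F

def adjacent_snippet_similarity(feats):
    """Cosine similarity between consecutive snippet features, per modality.

    feats maps a modality name ("video", "audio", "text") to a
    (num_snippets, dim) feature tensor.
    """
    return {
        modality: F.cosine_similarity(x[:-1], x[1:], dim=-1)  # (num_snippets - 1,)
        for modality, x in feats.items()
    }
```

Under this reading, a position with low video similarity but high text and audio similarity corresponds to a boundary decision resting on the text and audio cues, as in Fig. 7(b).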
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, J., Sun, Y., Kong, Y. et al. AHMN: A multi-modal network for long MOOC videos chapter segmentation. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17654-2