Generalized framework for summarization of fixed-camera lecture videos by detecting and binarizing handwritten content

Abstract

We propose a framework to extract and binarize handwritten content in lecture videos. The extracted content could be used to index video collections, enabling content-based search and navigation within lecture videos for students and educators across the world. A deep learning pipeline detects handwritten text, formulae, and sketches, and then binarizes the detected content. We exploit the spatio-temporal structure of the binarized detections to compute associativity information for content across all video frames, and this information is then used to segment the video. Experiments compare key components of our framework with existing methods, both in isolation and in terms of their impact on overall performance. We evaluate the framework on the publicly available AccessMath lecture video dataset, obtaining an F-measure of \(94.32\%\) for binary connected components. Code for the framework (including trained weights) and summarization will be released.
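As a rough illustration of two ideas mentioned in the abstract, the sketch below links content regions across frames by spatial overlap and computes an F-measure from match counts. It is not the authors' implementation: the bounding-box representation, the greedy linking rule, the `threshold` value, and the function names `iou`, `link_components`, and `f_measure` are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), max edges exclusive.
Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_components(frames: List[List[Box]], threshold: float = 0.5) -> List[List[int]]:
    """Give each box a group id; a box inherits the id of the best-overlapping
    box in the previous frame when their IoU reaches `threshold` (assumed value),
    otherwise it starts a new group."""
    next_id = 0
    all_ids: List[List[int]] = []
    prev_boxes: List[Box] = []
    prev_ids: List[int] = []
    for boxes in frames:
        cur_ids: List[int] = []
        for box in boxes:
            best_id, best_score = None, threshold
            for prev_box, prev_id in zip(prev_boxes, prev_ids):
                score = iou(box, prev_box)
                if score >= best_score:
                    best_id, best_score = prev_id, score
            if best_id is None:
                best_id = next_id
                next_id += 1
            cur_ids.append(best_id)
        all_ids.append(cur_ids)
        prev_boxes, prev_ids = boxes, cur_ids
    return all_ids


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy input: one region persists across frames 0-1, another across frames 1-2.
toy_frames = [
    [(0, 0, 10, 10)],
    [(1, 0, 11, 10), (50, 50, 60, 60)],
    [(50, 50, 60, 60)],
]
toy_ids = link_components(toy_frames)
```

In this toy setting, the frames in which a group id first appears and last appears give a simple notion of when a piece of content was written and erased, which is the kind of signal a temporal segmentation could build on.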

Notes

  1. https://github.com/bhargavaurala/accessmath-ijdar

Acknowledgements

This material was partially supported by the National Science Foundation under Grant No. 1640867 (OAC/DMR).

Author information

Corresponding author

Correspondence to Bhargava Urala Kota.

Cite this article

Urala Kota, B., Davila, K., Stone, A. et al. Generalized framework for summarization of fixed-camera lecture videos by detecting and binarizing handwritten content. IJDAR 22, 221–233 (2019). https://doi.org/10.1007/s10032-019-00327-y

Keywords

  • Lecture video summarization
  • Handwritten text detection
  • Binarization
  • Deep learning