Multimedia Tools and Applications, Volume 76, Issue 21, pp 22569–22597

Content-based unsupervised segmentation of recurrent TV programs using grammatical inference

  • Bingqing Qu
  • Félicien Vallet
  • Jean Carrive
  • Guillaume Gravier


TV program segmentation has emerged over the last decade as a major topic for high-quality indexing of multimedia content. Earlier approaches are either highly supervised (e.g., event detection) or too specific to a certain type of program (e.g., cluster-based methods), which makes them impractical for indexing tasks because they do not generalize across program types. In this paper, we address the problem of unsupervised TV program segmentation by leveraging grammatical inference, i.e., discovering a common structural model shared by a collection of episodes of a recurrent TV program by finding an optimal alignment of structural elements across episodes. A structural element is a video segment with a particular syntactic meaning with respect to the video structure. Representing structural elements symbolically makes grammatical inference applicable to TV program modeling and allows segmentation to rely on only minimal domain knowledge. The proposed approach operates in two phases. The first phase obtains a symbolic representation of each episode, in which the elements relevant to the structure are discovered through recurrence mining. The second phase performs grammatical inference on the symbolic representations of the episodes. We investigate two inference techniques, one based on multiple sequence alignment and one relying on uniform resampling, to infer structural grammars for TV programs. A model of the structure is derived from the structural grammars and used to predict the structure of new episodes. A comparative evaluation of the two grammar inference approaches demonstrates that the resulting models reflect the structure of the program and can predict the structure of unseen episodes, which is the main industrial application of the proposed approach, i.e., assisting librarians with segmentation tasks.
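
The two inference ideas named in the abstract can be illustrated with a minimal Python sketch. It assumes each episode has already been reduced to a string over a hypothetical alphabet of structural elements, one letter per element. The function names, the scoring values, and the majority-vote consensus are illustrative assumptions, not the authors' implementation, and a pairwise Needleman-Wunsch alignment stands in here for full multiple sequence alignment.

```python
from collections import Counter

GAP = "-"  # placeholder symbol for "no element at this position" (assumption)


def align_pair(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two symbol strings with Needleman-Wunsch.

    The scores are illustrative defaults, not values from the paper.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] faces a gap
                score[i][j - 1] + gap,      # b[j-1] faces a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def resample(seq, length):
    """Uniformly resample a non-empty symbol string to a fixed length."""
    return "".join(seq[(k * len(seq)) // length] for k in range(length))


def consensus(rows):
    """Majority symbol per column over equal-length symbol strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))
```

For example, `align_pair("ABCD", "ABD")` returns `("ABCD", "AB-D")`, exposing the element that one episode has and the other lacks. Resampling every episode to a common length with `resample` and then calling `consensus` gives a crude shared structure without any explicit alignment.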


Multimedia mining; Video segmentation; Grammatical inference; Multiple sequence alignment; Uniform resampling; Hierarchical clustering; Experimental evaluations; Practical applications



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Bingqing Qu (1)
  • Félicien Vallet (2)
  • Jean Carrive (3)
  • Guillaume Gravier (4)

  1. Université de Rennes 1, Rennes, France
  2. Commission Nationale de l’Informatique et des Libertés (CNIL), Paris, France
  3. Institut National de l’audiovisuel (INA), Paris, France
  4. Centre National de la Recherche Scientifique (CNRS), Paris, France
