Multimedia Tools and Applications, Volume 72, Issue 3, pp 2787–2832

Multimodal concept detection in broadcast media: KavTan

  • Medeni Soysal
  • K. Berker Loğoğlu
  • Mashar Tekin
  • Ersin Esen
  • Ahmet Saracoğlu
  • Banu Oskay Acar
  • Ezgi Can Ozan
  • Tuğrul K. Ateş
  • Hakan Sevimli
  • Müge Sevinç
  • İlkay Atıl
  • Savaş Özkan
  • Mehmet Ali Arabacı
  • Seda Tankız
  • Talha Karadeniz
  • Duygu Önür
  • Sezin Selçuk
  • A. Aydın Alatan
  • Tolga Çiloğlu


Concept detection is an important problem for efficient indexing and retrieval in large video archives. This work presents the KavTan System, which performs high-level semantic classification in one of the largest TV archives of Turkey. In this system, concept detection is performed using generalized visual and audio concept detection modules that are supported by video text detection, audio keyword spotting and specialized audio-visual semantic detection components. The performance of the presented framework was assessed objectively over a wide range of semantic concepts (5 high-level, 14 visual, 9 audio, 2 supplementary) using a significant amount of precisely labeled ground truth data. The KavTan System achieves successful high-level concept detection performance in unconstrained TV broadcast by efficiently utilizing multimodal information that is systematically extracted from both the spatial and temporal extent of multimedia data.


Intelligent multimedia systems; Concept detection; Broadcast video indexing; Multimodal semantic indexing



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. TUBITAK - UZAY, Ankara, Turkey (all authors)
