Multimedia Tools and Applications, Volume 30, Issue 3, pp 313–330

Audio indexing: primary components retrieval

Robust classification in audio documents
  • Julien Pinquier
  • Régine André-Obrecht


This work addresses soundtrack indexing of multimedia documents. Our purpose is to detect and locate sound units in order to structure the audio data flow of broadcast programs (reports). We present two audio classification tools that we have developed. The first, a speech/music classification tool, is based on three original features: entropy modulation, stationary segment duration (obtained with a Forward–Backward Divergence algorithm) and number of segments. These are merged with the classical 4 Hz modulation energy. The tool is divided into two classifications (speech/non-speech and music/non-music) and achieves more than 90% accuracy for speech detection and 89% for music detection. The other system, a jingle identification tool, uses a Euclidean distance in the spectral domain to index the audio data flow. Results show that it is efficient: of 132 jingles to recognize, we detected 130. The systems are tested on TV and radio corpora (more than 10 h). They are simple, robust and can be improved on every corpus without training or adaptation.
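The jingle identification idea (a Euclidean distance in the spectral domain between a reference jingle and the audio stream) can be illustrated with a minimal sketch. The abstract does not give the actual spectral features, window sizes or decision threshold, so everything below is an assumption made for illustration: the names `log_spectrogram` and `find_jingle` are hypothetical, the features are plain log-magnitude FFT spectra, and the threshold is left to the caller.

```python
import numpy as np


def log_spectrogram(signal, frame_len=512, hop=256):
    """Log-magnitude spectra of Hann-windowed frames.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    Assumes len(signal) >= frame_len. Frame length and hop are
    illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


def find_jingle(stream, jingle, threshold, frame_len=512, hop=256):
    """Slide the reference jingle over the stream, one frame at a time.

    At each frame offset, compute the Euclidean distance between the
    jingle's spectra and the stream's spectra, normalised by the number
    of jingle frames. Returns (best_offset_in_frames, best_distance,
    detected), where `detected` is True if the best distance falls
    below the caller-supplied (assumed) threshold.
    """
    ref = log_spectrogram(jingle, frame_len, hop)
    obs = log_spectrogram(stream, frame_len, hop)
    n = len(ref)
    dists = np.array(
        [np.linalg.norm(obs[t : t + n] - ref) / n for t in range(len(obs) - n + 1)]
    )
    best = int(np.argmin(dists))
    return best, float(dists[best]), bool(dists[best] < threshold)
```

Multiplying the returned frame offset by `hop` and dividing by the sampling rate gives the jingle position in seconds. A real system would need a threshold tuned on held-out data and features that tolerate channel and coding distortions; this sketch only shows the distance-and-minimum search itself.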


Keywords: Classification, Indexing, Audio documents, Jingle, Segmentation, Duration, Entropy, Energy, Spectral feature



  • Gaussian Mixture Models
  • power density function
  • Forward–Backward Divergence
  • Fast Fourier Transform
  • Finite Impulse Response


Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS INP UPS, Toulouse, France
