Challenges in Audio Processing of Terrorist-Related Data

  • Jodie Gauvain
  • Lori Lamel
  • Viet Bac Le
  • Julien Despres
  • Jean-Luc Gauvain
  • Abdel Messaoudi
  • Bianca Vieru
  • Waad Ben Kheder
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


Much information in multimedia data related to terrorist activity can be extracted from the audio content. Our work in ongoing projects aims to provide a complete description of the audio portion of multimedia documents. This description draws on diarization, classification of acoustic events, language and speaker segmentation and clustering, and automatic transcription of the speech portions. An important consideration is ensuring that the audio processing technologies are well suited to the types of data of interest to law enforcement agencies. While language identification and speech recognition may be considered "mature technologies", our experience is that even state-of-the-art systems require customisation and enhancements to address the challenges of terrorist-related audio documents.


  • Automatic speech recognition
  • Acoustic event detection
  • Language identification
  • Code switching



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jodie Gauvain (1)
  • Lori Lamel (2)
  • Viet Bac Le (1)
  • Julien Despres (1)
  • Jean-Luc Gauvain (2)
  • Abdel Messaoudi (1)
  • Bianca Vieru (1)
  • Waad Ben Kheder (2)

  1. Vocapia Research, Orsay, France
  2. CNRS-LIMSI, TLP, Orsay, France
