Journal on Multimodal User Interfaces, Volume 9, Issue 3, pp 223–229

Synchronizing multimodal recordings using audio-to-audio alignment

An application of acoustic fingerprinting to facilitate music interaction research
  • Joren Six
  • Marc Leman
Original Paper


Research on the interaction between movement and music often involves analysis of multi-track audio, video streams and sensor data. To facilitate such research, a framework is presented here that allows synchronization of multimodal data. A low-cost approach is proposed to synchronize streams by embedding ambient audio into each data stream. This effectively reduces the synchronization problem to audio-to-audio alignment. As part of the framework, a robust, computationally efficient audio-to-audio alignment algorithm is presented for reliable synchronization of embedded audio streams of varying quality. The algorithm uses audio fingerprinting techniques to measure offsets. It also identifies drift and dropped samples, which makes it possible to find a synchronization solution under such circumstances as well. The framework is evaluated with synthetic signals and a case study, showing millisecond-accurate synchronization.
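The offset measurement mentioned above can be illustrated with a minimal sketch. It assumes that fingerprints have already been extracted from each embedded audio stream as (hash, frame) pairs, and it lets every matching hash vote for a time difference, in the spirit of landmark-based fingerprinting. This is not the authors' implementation: the function name `estimate_offset`, the data layout and the example values are hypothetical, and fingerprint extraction, drift detection and dropped-sample handling are left out.

```python
from collections import Counter, defaultdict


def estimate_offset(reference, other):
    """Return (offset, votes): the most common frame difference between
    matching fingerprint hashes in two streams.

    Each stream is a list of (hash, frame) pairs. How the hashes are
    extracted from audio is outside this sketch (assumed to be given).
    """
    # Index the reference stream by hash for fast lookup.
    frames_by_hash = defaultdict(list)
    for h, frame in reference:
        frames_by_hash[h].append(frame)

    # Every hash match votes for one candidate offset (other - reference).
    votes = Counter()
    for h, frame in other:
        for ref_frame in frames_by_hash.get(h, ()):
            votes[frame - ref_frame] += 1

    if not votes:
        return None, 0
    offset, count = votes.most_common(1)[0]
    return offset, count


# Hypothetical data: the second stream lags the first by 40 frames.
reference = [(0xA1, 10), (0xB2, 25), (0xC3, 40), (0xD4, 90), (0xA1, 120)]
other = [(h, f + 40) for h, f in reference] + [(0xEE, 5)]  # one spurious hash
offset, count = estimate_offset(reference, other)
print(offset, count)  # prints: 40 5
```

True matches all agree on the same difference, while accidental matches (here the repeated hash `0xA1`) scatter over other values, so the histogram peak is a reasonable offset estimate. Multiplying the frame offset by the frame duration would give the offset in seconds.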


Multimodal data synchronization · Audio fingerprinting · Audio-to-audio alignment · Music performance research · Digital signal processing

Supplementary material

Supplementary material 1: 12193_2015_196_MOESM1_ESM.7z (7z, 76651 KB)



Copyright information

© OpenInterface Association 2015

Authors and Affiliations

  1. Department of Musicology, Institute for Psychoacoustics and Electronic Music (IPEM), Ghent University, Ghent, Belgium
