Deep-Sync: A novel deep learning-based tool for semantic-aware subtitling synchronisation

Abstract

Subtitles are a key element in making media content accessible to people with hearing impairments and to elderly people, and they are also useful when watching TV in a noisy environment or learning a new language. Most of the time, subtitles are produced manually in advance as a verbatim, synchronised transcription of the audio. In live TV broadcasts, however, captions are created in real time by a re-speaker with the help of voice recognition software, which inevitably leads to delays and a lack of synchronisation. In this paper, we present Deep-Sync, a tool for aligning subtitles with audio-visual content. The architecture integrates a deep language representation model and real-time voice recognition software to build a semantic-aware alignment tool that successfully aligns most subtitles even when there is no direct correspondence between the re-speaker's output and the audio content. To avoid any kind of censorship, Deep-Sync can be deployed directly on users' TVs, which introduces a small delay to perform the alignment but avoids delaying the signal at the broadcaster's station. Deep-Sync was compared with another subtitle alignment tool, and our proposal improved synchronisation in all tested cases.
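
The alignment idea described above can be illustrated with a minimal sketch. It is not the Deep-Sync implementation: the function names (`embed`, `cosine`, `align`), the similarity threshold and the sample data are all hypothetical, and the bag-of-words vector is only a lexical stand-in for the deep language representation model, so it does not capture semantics. The sketch assumes the voice recognition step has already produced timestamped text segments, and it assigns each re-spoken subtitle the start time of the most similar segment.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    Assumption: a real system would use a language-model encoder here.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def align(subtitles, asr_segments, threshold=0.3):
    """Pair each subtitle with the start time of its most similar ASR segment.

    asr_segments is a list of (start_time_seconds, recognised_text) tuples.
    A subtitle whose best similarity does not exceed the (hypothetical)
    threshold is left unaligned and paired with None.
    """
    aligned = []
    for subtitle in subtitles:
        subtitle_vec = embed(subtitle)
        best_time, best_score = None, threshold
        for start, text in asr_segments:
            score = cosine(subtitle_vec, embed(text))
            if score > best_score:
                best_time, best_score = start, score
        aligned.append((subtitle, best_time))
    return aligned


# Hypothetical sample data: recognised audio segments and re-spoken subtitles.
asr_segments = [
    (0.0, "good evening and welcome to the news"),
    (4.2, "the storm hit the coast this morning"),
    (9.5, "thousands of homes are without power"),
]
subtitles = [
    "welcome to the news",
    "a storm hit the coast",
    "sports results coming next",
]

result = align(subtitles, asr_segments)
for subtitle, start in result:
    print(subtitle, "->", start)
```

Replacing `embed` with a sentence encoder would let paraphrased subtitles match their audio segment even when few words are shared, which is the case a purely lexical comparison cannot handle.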

Notes

  1. The source code is available at: https://github.com/alexMyG/deep-sync.

  2. In case Deep-Sync is applied to a different language, this value should be tuned properly.

  3. https://cloud.google.com/speech-to-text.

  4. https://blog.google/products/search/search-language-understanding-bert/.

Acknowledgements

This work has been supported by the Spanish Ministry of Science and Education under TIN2017-85727-C4-3-P grant (DeepBio) and Comunidad Autónoma de Madrid under S2018/TCS-4566 grant (CYNAMON). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

Author information

Corresponding author

Correspondence to Alejandro Martín.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Code availability

The Deep-Sync code and the instructions for running it are available at: https://github.com/alexMyG/deep-sync.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Martín, A., González-Carrasco, I., Rodriguez-Fernandez, V. et al. Deep-Sync: A novel deep learning-based tool for semantic-aware subtitling synchronisation. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-05751-y

Keywords

  • TV Broadcasting
  • Synchronisation
  • Language model
  • Deep neural networks
  • Machine learning