Deep-Sync: A novel deep learning-based tool for semantic-aware subtitling synchronisation


Subtitles are a key element in making media content accessible to people with hearing impairments and to elderly viewers, and they are also useful when watching TV in a noisy environment or learning a new language. Most of the time, subtitles are generated manually in advance, as a verbatim, synchronised transcription of the audio. In live TV broadcasts, however, captions are created in real time by a re-speaker with the help of voice recognition software, which inevitably leads to delays and a lack of synchronisation. In this paper, we present Deep-Sync, a tool for aligning subtitles with the audio-visual content. The architecture integrates a deep language representation model and real-time voice recognition software to build a semantic-aware alignment tool that successfully aligns most subtitles even when there is no direct correspondence between the re-speaker and the audio content. To avoid any kind of censorship, Deep-Sync can be deployed directly on users' TVs: this introduces a small delay to perform the alignment but avoids delaying the signal at the broadcaster station. Deep-Sync was compared with other subtitle alignment tools, showing that our proposal improves synchronisation in all tested cases.
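To illustrate the semantic-aware alignment idea described above, the sketch below re-times each subtitle to the ASR segment whose text is semantically closest. This is a minimal, hypothetical illustration, not Deep-Sync's actual implementation: a bag-of-words cosine similarity stands in for the deep language representation model, and all names (`align_subtitles`, the `threshold` parameter, the dictionary layout) are assumptions made for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; Deep-Sync itself relies on a deep
    # language representation model to produce sentence embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align_subtitles(subtitles, asr_segments, threshold=0.3):
    """Re-time each subtitle to the ASR segment whose text is most
    similar; matches below `threshold` keep their original timing."""
    aligned = []
    for sub in subtitles:
        best = max(asr_segments,
                   key=lambda seg: cosine(embed(sub["text"]), embed(seg["text"])))
        score = cosine(embed(sub["text"]), embed(best["text"]))
        if score >= threshold:
            aligned.append({"text": sub["text"], "start": best["start"]})
        else:
            aligned.append(dict(sub))  # no confident match: leave timing as-is
    return aligned

# Delayed subtitles vs. what the live ASR actually heard (and when):
subs = [{"text": "good evening and welcome", "start": 5.0},
        {"text": "heavy rain expected tomorrow", "start": 9.0}]
asr = [{"text": "good evening welcome to the news", "start": 2.1},
       {"text": "rain will be heavy tomorrow", "start": 6.4}]
print(align_subtitles(subs, asr))
```

Note that the similarity match tolerates paraphrasing between the subtitle and the audio, which is the scenario the re-speaker workflow produces; the `threshold` is the kind of language-dependent value footnote 2 refers to.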




  1.

    The source code is available at:

  2.

    If Deep-Sync is applied to a different language, this value should be tuned accordingly.




This work has been supported by the Spanish Ministry of Science and Education under TIN2017-85727-C4-3-P grant (DeepBio) and Comunidad Autónoma de Madrid under S2018/TCS-4566 grant (CYNAMON). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

Author information



Corresponding author

Correspondence to Alejandro Martín.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Code availability

The code of Deep-Sync, together with instructions for running it, is available at:

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Martín, A., González-Carrasco, I., Rodriguez-Fernandez, V. et al. Deep-Sync: A novel deep learning-based tool for semantic-aware subtitling synchronisation. Neural Comput & Applic (2021).



  • TV Broadcasting
  • Synchronisation
  • Language model
  • Deep neural networks
  • Machine learning