Subtitles are a key element in making media content accessible to people with hearing impairments and to elderly viewers, and they are also useful when watching TV in a noisy environment or when learning a new language. Most of the time, subtitles are generated manually in advance, building a verbatim, synchronised transcription of the audio. In live TV broadcasts, however, captions are created in real time by a re-speaker with the help of voice recognition software, which inevitably leads to delays and a lack of synchronisation. In this paper, we present Deep-Sync, a tool for aligning subtitles with audio-visual content. The architecture integrates a deep language representation model and real-time voice recognition software to build a semantic-aware alignment tool that successfully aligns most subtitles even when there is no direct correspondence between the re-speaker's words and the audio content. To avoid any kind of censorship, Deep-Sync can be deployed directly on users' TVs: this introduces a small delay to perform the alignment but avoids delaying the signal at the broadcaster's station. Deep-Sync was compared with another subtitle alignment tool, showing that our proposal improves synchronisation in all tested cases.
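As an illustrative sketch only (not the authors' exact pipeline), semantic-aware alignment can be framed as matching the embedding of each subtitle against embeddings of recognised speech segments and picking the most semantically similar one; the `threshold` parameter below is hypothetical, standing in for the tunable value mentioned later, and the toy vectors stand in for sentence embeddings produced by a deep language representation model:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_subtitle(sub_vec, segment_vecs, threshold=0.5):
    """Return the index of the speech segment whose embedding is most
    similar to the subtitle embedding, or None if no segment reaches the
    similarity threshold (i.e. no semantic match was found)."""
    sims = [cosine(sub_vec, v) for v in segment_vecs]
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None

# Toy 2-D vectors standing in for sentence embeddings of ASR segments.
segments = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
subtitle = np.array([0.9, 0.1])
print(align_subtitle(subtitle, segments))  # segment 0 is the closest match
```

Matching on embeddings rather than exact words is what lets such a scheme cope with re-spoken captions that paraphrase, rather than transcribe, the original audio.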
The source code is available at: https://github.com/alexMyG/deep-sync.
If Deep-Sync is applied to a different language, this value should be tuned accordingly.
This work has been supported by the Spanish Ministry of Science and Education under TIN2017-85727-C4-3-P grant (DeepBio) and Comunidad Autónoma de Madrid under S2018/TCS-4566 grant (CYNAMON). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.
Conflict of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.
The code of Deep-Sync, together with instructions for running it, is available at: https://github.com/alexMyG/deep-sync.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Martín, A., González-Carrasco, I., Rodriguez-Fernandez, V. et al. Deep-Sync: A novel deep learning-based tool for semantic-aware subtitling synchronisation. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-05751-y
- TV broadcasting
- Language model
- Deep neural networks
- Machine learning