1 Introduction

Prosody is an important aspect of human spoken communication (Fujisaki, 1997): it serves essential functions ranging from highlighting relevant information to expressing emotions and structuring discourse. However, despite its crucial role in speech, prosody remains an under-explored phenomenon in systems that process spoken language, mainly because of the additional complexity it introduces. Prosodic features have continuous representations, span multiple segments in speech (hence the term suprasegmental features), and their interpretation depends on context (Crystal, 2003; Fujisaki, 1997). For this reason, such systems currently tend to carry out linguistic modeling only after the spoken input has been converted to its written form. Although conversion to discrete linguistic units facilitates computational processing, modeling only the lexical information in speech leaves out every important aspect of it that is carried through prosody.

In this paper, we deal with the compilation of speech corpora annotated with prosodic features and adapted for machine learning-based applications. We propose a minimal spoken language data structure that facilitates joint processing of prosodic and other linguistic features in speech, and we present a set of tools for harvesting, compiling and visualizing this type of data. Our strategy for speech data collection is based on exploiting readily available recorded material, usually referred to as found data. The use of found data requires some pre-processing to convert raw speech material into sampled and annotated speech. Our methodologies, built on our previous work (Öktem, Farrús, & Bonafonte, 2018), facilitate the collection of both monolingual and parallel spontaneous speech data in a scalable manner.

The proposed corpus compilation methodology is put into use in the collection of two datasets: an English conference speech corpus and an English–Spanish dubbed movie speech corpus. All the developed methodologies and corpora are made publicly available as open source software librariesFootnote 1 and through the Pompeu Fabra University (UPF) Digital RepositoryFootnote 2, respectively.

The contributions of this paper, and their corresponding sections, are listed in Table 1. Section 5 presents a brief evaluation of both corpora as source data for speech recognition and translation tasks, and finally Sect. 6 concludes the paper.

Table 1 Paper contributions

2 Related work

Computational applications that deal with prosody necessitate a standard for representing the data structure, i.e. the structure of speech with its orthography and prosody together. One of the most popular of these conventions is the TextGrid format, which is used by Praat (Boersma & Weenink, 2019). A TextGrid file stores any number of tiers that can be used to label acoustic and linguistic information time-aligned with speech. Although very useful for visualization in Praat, this format is not designed to be viewed or manipulated computationally on its own. Each tier defines which event occurs at what time independently of the others, and it is difficult to associate events that occur in parallel in different tiers. Moreover, Praat uses separate file formats to store raw acoustic features. As a result of this design, a complete prosodic–acoustic representation of even a short utterance ends up scattered across a clutter of files. Other tools (Huang, Chen, & Harper, 2006; Xu, 2013) are also based on Praat and can only be run through its interface. Others, such as AuToBI (Rosenberg, 2010) and ANALOR (Avanzi, Lacheret-Dujour, & Victorri, 2008), provide automatic or semi-automatic annotation of English and French prosodic structures, respectively, while ELAN (Sloetjes & Wittenburg, 2008) serves as an annotation tool for audio and video that is compatible with the Praat TextGrid format.

Regarding the compilation of spoken parallel corpora, several efforts have been carried out, for instance: the European Parliament Interpretation Corpus (EPIC) (Bendazzoli & Sandrelli, 2005), the EMIME (Effective Multilingual Interaction in Mobile Environments) bilingual prompted speech corpus (Wester, 2010), the Microsoft Speech Language Translation (MSLT) corpus (Federmann & Lewis, 2016), and the Multi Dialect Arabic (MDA) parallel prompted speech corpora (Almeman, Lee, & Almiman, 2013). The 300 Languages Project, part of the Rosetta Project,Footnote 3 aims at collecting parallel audio and texts in the world's 300 most widely spoken languages. More recently, the CPJD corpus (Takamichi & Saruwatari, 2018) provides crowd-sourced parallel speech data of Japanese dialects. In addition, the MaSS corpus is a multilingual parallel speech dataset based on recorded readings of the Bible, with speech-to-text and speech-to-speech alignments (Zanon Boito et al., 2020).

The need for monolingual spoken data is growing steadily as automatic speech recognition and text-to-speech research and development aim for wider linguistic coverage. Examples include the Switchboard corpus (Godfrey & Holliman, 1993) of English telephone conversational speech; the CALLHOME speech corpora of telephone conversations in several languages (Canavan, Graff, & Zipperlen, 1997); the English Boston University Radio Speech Corpus (Ostendorf, Price, & Shattuck-Hufnagel, 1996); Rhapsodie (Lacheret et al., 2014), a French speech corpus with prosodic, syntactic and orthographic annotations; DEMoS (Parada-Cabaleiro et al., 2019), an Italian emotional speech corpus; RSC (Georgescu et al., 2020), a Romanian read speech corpus for automatic speech recognition; and TV3Parla (Külebi & Öktem, 2018) and ParlamentParla (Külebi et al., 2020), parliamentary and television speech corpora for Catalan.

The vast amount of natural language data residing on the web has been an invaluable source for both linguistic research and language technology development. The OPUS collection (Tiedemann, 2012), for instance, is a publicly available corpus of parallel text from the web; among others, it includes the OpenSubtitles collection (Lison & Tiedemann, 2016; Lison et al., 2018) of translated movie subtitles. Although not created for research purposes and available only monolingually, TED talksFootnote 4 have become a large and freely available resource for research in spoken language, and a considerable number of works have already used them (Cettolo, Girardi, & Federico, 2012; Pappas & Popescu-Belis, 2013; Farrús, Lai, & Moore, 2016). However, since TED talks were not designed with computational linguistics research objectives in mind, they require significant pre-processing. Such processing was carried out for the TED-LIUM corpus (Rousseau, Deléglise, & Estève, 2012), an English speech recognition training corpus built from TED talks, as well as for text-based tasks such as document classification (Hermann & Blunsom, 2014) and machine translation (Cettolo et al., 2012). One of the aims of this paper is to present a prepared dataset based on TED talks for prosody modeling in speech applications.

3 Toolkit for speech corpus creation

This section describes the main methodology and toolkit employed in the creation of our corpora:

  (i) Proscript for joint lexical–prosodic data handling,
  (ii) Prosograph for visualization of large speech corpora, and
  (iii) movie2parallelDB for obtaining parallel speech corpora from dubbed media.

These tools were used throughout the development and creation of our corpora to adapt existing formats, align text with speech content, and visualize prosodic phenomena.

3.1 Linguistically-aligned prosody extraction with Proscript

Prosody is conveyed through three different elements: intonation, stress, and rhythm. These elements are perceived by listeners as changes in fundamental frequency (F0), intensity and sound duration over time, respectively (Adami et al., 2003). At the same time, F0, intensity and duration can be extracted in their acoustic form by means of speech processing tools. For the current work, we propose a minimal spoken language data structure that facilitates: (1) the creation and storage of spoken language data represented by these measurable features of prosody, from now on referred to as annotated data, and (2) its processing with machine learning applications.

The idea behind our lexical–prosodic information structuring is to represent relevant prosodic phenomena aligned with their respective linguistic context. Figure 1 illustrates this idea. In this example, the relevant features are pausing, intonation, intensity and inter-word prosodic changes. Instead of capturing the complete movement of F0 or energy levels, average values are computed over the period in which a word is uttered. The same framework could be extended to represent contours as vectors. The aim is to represent the utterance as a linear sequence in time, the kind of input that recurrent neural networks are designed to process. Each segment is represented as a vector of linguistic and prosodic features. An example of a word-level segment where only mean values are calculated would be:

$$segment_{i} = \langle word_{i},\ mean\_f0_{i},\ range\_f0_{i},\ mean\_int_{i},\ range\_int_{i},\ pause\_after_{i} \rangle.$$
(1)
Fig. 1 Word-level prosodic feature labelling demonstrated on Praat

To address our general idea of representing linguistically aligned prosodic information, we have developed the Proscript framework. This framework provides a spoken language data representation format and a library for the creation, manipulation, reading and writing of this sort of data.

Proscript represents speech as features occurring in parallel at discrete bounded intervals, referred to as segments. Segments can be defined within, for instance, a word, a prosodic phrase, a sentence or a group of sentences. Any type of linguistic, prosodic or morphosyntactic features can be stored within these boundaries. Table 2 shows an example of parallel features stored in a Proscript file. Here, the linguistic units are words, and the set of features is determined by the application.

Table 2 An example list of features used in a Proscript format file with word-level segmentation
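For illustration, a word-level segment of this kind can be thought of as a record of parallel lexical and prosodic fields. The following sketch is purely illustrative: the field names and the tab-separated storage shown here are assumptions, not the exact Proscript file format.

    # A hypothetical, minimal sketch of a word-level Proscript-style segment;
    # field names are illustrative and may not match the actual Proscript format.
    from dataclasses import dataclass, asdict
    import csv

    @dataclass
    class Segment:
        word: str            # orthographic token
        start_time: float    # word start, in seconds
        end_time: float      # word end, in seconds
        pause_after: float   # silence following the word, in seconds
        mean_f0: float       # mean F0 over the word
        range_f0: float      # F0 range over the word
        mean_int: float      # mean intensity over the word
        range_int: float     # intensity range over the word
        pos_tag: str = ""    # optional morphosyntactic annotation

    def write_segments(segments, path):
        """Store word-level segments as one tab-separated row per word."""
        fields = list(asdict(segments[0]).keys())
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
            writer.writeheader()
            for seg in segments:
                writer.writerow(asdict(seg))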

The Proscript Python libraryFootnote 5 was developed to facilitate the creation, manipulation and annotation of Proscript files. It can be imported as a Python library to batch-process transcribed speech files, annotate them with the desired features and write them out as Proscript files. Both the word-alignment and the prosodic–acoustic tagging software (explained in the following subsections) are accessible through the library. For word alignment, we used the open-source Montreal Forced Aligner (McAuliffe et al. 2013) because of the availability of English and Spanish models. The forced-alignment process is built on an automatic speech recognition system and requires its own acoustic models and a pronunciation dictionary. A Spanish pronunciation dictionary was created for the purpose of aligning Spanish speech data, using an open-source vocabularyFootnote 6 and the TransDic phonetic transcription software (Garrido, Codina, & Fodge, 2018).Footnote 7 For prosodic–acoustic feature annotation we used ProsodyTagger (Domínguez, Farrús, & Wanner, 2016a; Domínguez et al., 2016b), a Python wrapper around the feature extraction scripts of Praat.
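As a rough illustration of how word-level acoustic values can be gathered once word boundaries are available from the forced aligner, the sketch below uses the parselmouth library (a Python interface to Praat) rather than the ProsodyTagger scripts used in our pipeline; the function and variable names are ours.

    # Illustrative word-level F0 extraction with parselmouth, not the
    # ProsodyTagger implementation; word boundaries come from the aligner.
    import parselmouth

    def word_level_f0(wav_path, word_intervals, time_step=0.01):
        """Compute mean and range of F0 for each (word, start, end) interval."""
        snd = parselmouth.Sound(wav_path)
        pitch = snd.to_pitch(time_step=time_step)
        f0 = pitch.selected_array["frequency"]   # 0.0 at unvoiced frames
        times = pitch.xs()                       # frame centre times in seconds
        features = []
        for word, start, end in word_intervals:
            voiced = f0[(times >= start) & (times <= end) & (f0 > 0)]
            features.append({
                "word": word,
                "mean_f0": float(voiced.mean()) if voiced.size else 0.0,
                "range_f0": float(voiced.max() - voiced.min()) if voiced.size else 0.0,
            })
        return features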

In our deep learning-based applications, we opted for a log-scale representation of contour-style features relative to the mean of the utterance. Pitch range was computed at the word level, since the word is the most common textual linguistic unit; although speech is continuous, silences between words can abruptly change the intonation range, which further supports the use of words as segments. Units that did not yield any measurement were simply labeled 0.0, representing the speaker's average. Our recordings are sampled at 16 kHz, and the corresponding acoustic features were extracted with a time step of 10 ms.
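A minimal sketch of this normalization step is given below; it assumes raw word-level F0 means in Hz, expresses them in semitones relative to the utterance mean, and falls back to 0.0 when no measurement is available. The exact scale and fallback used in our pipeline may differ.

    import math

    def normalize_f0(word_f0_means_hz, utterance_mean_hz):
        """Express word-level F0 means relative to the utterance mean.

        Words without a valid measurement (fully unvoiced or silent) get 0.0,
        i.e. they are placed at the speaker/utterance average.
        """
        normalized = []
        for f0 in word_f0_means_hz:
            if not f0 or f0 <= 0 or utterance_mean_hz <= 0:
                normalized.append(0.0)
            else:
                # 12 * log2(ratio) converts a frequency ratio into semitones
                normalized.append(12.0 * math.log2(f0 / utterance_mean_hz))
        return normalized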

3.2 Aiding study of large speech corpora with Prosograph

Being able to clearly visualize the different elements involved in prosody—intonation, rhythm, and stress—is often needed in computational prosody research. Several speech analysis tools (e.g. Praat), together with derived scripts and tools (Boersma & Weenink, 2019; Xu, 2013; Mertens, 2004; Domínguez et al., 2016b), partially cover these needs by helping to visualize quantifiable speech features such as fundamental frequency (F0) and intensity contours, word stress marking, or prosodic labeling. These tools work well for showing detailed analyses of the data and for visualizing a single utterance at a time, as in Fig. 1, but they fall short when it comes to visualizing generalized, word-averaged speech features of many utterances at once, e.g., a whole discourse or a collection of speech samples.

We developed Prosograph for visualizing acoustic and prosodic information of long speech segments together with their transcript. One of its principal motivations is to enable observing the relationship between prosodic features and punctuation in text. Moreover, its interactive interface makes it easy to listen to any portion of the displayed speech to accommodate auditory analysis (Öktem, Farrús, & Wanner, 2017b).

An overview can be seen in Fig. 2. Similar to sheet-music notation, prosodic feature values are plotted on the vertical axis against a horizontal time axis. Words are laid out in order together with pauses and punctuation, and the prosodic features are drawn under each corresponding word.

Fig. 2 An example of a visualization frame of segments from a conference talk with Prosograph. Two utterances are represented in the same view with readable transcription and traceable F0 contour

Prosograph can be customized easily, as it is written in Processing,Footnote 8 a highly visual and simple programming language. To demonstrate this, we created a bilingual mode of Prosograph for analyzing parallel speech corpora. As illustrated in Fig. 3, the bilingual mode makes it possible to visualize aligned parallel corpora: aligned samples are displayed side by side to facilitate, e.g., prosodic comparison. Both the originalFootnote 9 and the bilingual modeFootnote 10 of Prosograph are made openly available as open-source software under the GNU General Public License.Footnote 11

Fig. 3 Visualization of parallel samples from an episode of the Heroes corpus with the bilingual mode of Prosograph

3.3 Automated speech corpus creation with Movie2ParallelDB

This section explains movie2parallelDB, our methodology for building parallel speech corpora from dubbed movies (Öktem et al., 2018). It uses only raw data: the original and dubbed audio tracks of a piece of multimedia, such as a movie or television series, together with their subtitles. Unlike previous work (Tsiartas et al., 2011), it does not require any model training. The main advantages of our methodology are the following:

  (1) It can easily be extended to obtain monolingual or parallel corpora,
  (2) it is language-independent, requiring only the availability of forced-alignment models,
  (3) it can handle any domain and speech style,
  (4) it delivers a spoken language corpus with prosodic feature annotations, and
  (5) it does not violate the fair-use principles that apply to copyrighted material.

We publish movie2parallelDB as open-source software, together with usage instructions, at http://github.com/alpoktem/movie2parallelDB.

3.3.1 Methodology

Figure 4 illustrates the overall process of mining speech samples from dubbed multimedia. As raw data, we need the audio tracks and subtitles in the original and dubbed languages, and a script that contains speaker information. In the first stage, we mine speech segments at the sentence level using the audio–subtitle pair for each language. We then extract word-level timing information and obtain speaker labels for each segment from the script. The monolingual extraction stage is completed once we annotate the prosodic features using the Proscript library. Finally, we align the segments extracted in each language to obtain the parallel corpus segments.

Fig. 4 Overall corpus extraction pipeline. The monolingual stage, which involves segmentation, word alignment, speaker labelling and prosodic feature annotation, is performed separately for both audio channels (L1 and L2). The alignment stage aligns the segments of each language and outputs them as parallel segments

For a detailed demonstration of how parallel segments are obtained from a portion of a movie, refer to Fig. 5. Subtitles are the source of both (1) the audio transcriptions and (2) the timing information of the utterances. This information is contained in standard srt-formatFootnote 12 subtitle files, entry by entry, where each subtitle entry consists of an index, time cues and the script being spoken at that time in the movie. A subtitle entry can contain a single sentence, multiple sentences (#3), or an incomplete sentence (#1), and it can contain speech from a single speaker (#1, 2) or from multiple speakers (#3). Thus, using only the time cues does not suffice to extract audio segments containing complete sentences from a single speaker. To achieve this, word boundary information is combined with punctuation mark positions to split and merge segments as needed: two entries are merged if the first does not end with a sentence-ending punctuation mark and the second starts with a lowercase letter, and multi-speaker segments are split at the words following speech dashes [–]. This process is marked with the label "1" in Fig. 5.
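The merging and splitting heuristics can be sketched roughly as follows. This is a simplified illustration rather than the exact movie2parallelDB implementation; the entry dictionaries and the regular expression for speech dashes are assumptions.

    import re

    SENTENCE_END = (".", "!", "?")

    def merge_incomplete_entries(entries):
        """Merge consecutive subtitle entries that split a sentence in two.

        Each entry is a dict with 'text', 'start' and 'end' keys. Two entries
        are merged when the first does not end with sentence-final punctuation
        and the second starts with a lowercase letter.
        """
        merged = []
        for entry in entries:
            previous = merged[-1] if merged else None
            if (previous
                    and not previous["text"].rstrip().endswith(SENTENCE_END)
                    and entry["text"].lstrip()[:1].islower()):
                previous["text"] = previous["text"].rstrip() + " " + entry["text"].lstrip()
                previous["end"] = entry["end"]
            else:
                merged.append(dict(entry))
        return merged

    def split_multispeaker(text):
        """Split a multi-speaker subtitle into turns at line-initial speech dashes."""
        turns = re.split(r"(?:^|\n)\s*[–-]\s*", text)
        return [turn.strip() for turn in turns if turn.strip()]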

Fig. 5 Processes 1, 2 and 3 of the methodology illustrated on a portion of a movie: (1) segment extraction using subtitle cues, (2) speaker annotation from the movie script, (3) parallel segment extraction, where aligned segments have matching colors. (Color figure online)

Movie scripts, which contain dialogue and scene information, are a valuable source for determining segment speaker labels. A heuristic approach based on word matching is followed to align the subtitles with the script. This process is marked with the label "2" in Fig. 5.
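A simple word-overlap matcher conveys the idea; the exact heuristic used in movie2parallelDB may differ, and the function below is only an illustrative stand-in.

    def best_matching_speaker(segment_text, script_lines):
        """Assign a speaker to a subtitle segment by simple word overlap.

        `script_lines` is a list of (speaker, dialogue) pairs taken from the
        movie script; the segment gets the speaker of the most similar line.
        """
        seg_words = set(segment_text.lower().split())
        best_speaker, best_score = None, 0.0
        for speaker, dialogue in script_lines:
            line_words = set(dialogue.lower().split())
            if not seg_words or not line_words:
                continue
            score = len(seg_words & line_words) / len(seg_words | line_words)
            if score > best_score:
                best_speaker, best_score = speaker, score
        return best_speaker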

Afterwards, each word in the extracted segments is automatically annotated with the acoustic features needed for the task at hand. In our case, for example, we were especially interested in obtaining the pauses between words for their use in machine translation (see Sect. 5.2).

In a later phase, the speech segments are aligned to create the parallel segment pairs. Since the subtitles and the number of extracted segments can differ between languages, the segment alignments can be one-to-one, one-to-many, many-to-one or many-to-many, depending on how the subtitles are split into sentences. To solve this, a metric is defined that measures the time-correlation percentage between two sets of ordered segments, which are then mapped using a heuristic approach. This process is marked with the label "3" in Fig. 5.
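As an illustration, a plain time-overlap ratio can serve as a stand-in for the time-correlation metric; the greedy mapping below is a simplified sketch, not the exact alignment procedure of movie2parallelDB.

    def time_overlap_ratio(seg_a, seg_b):
        """Fraction of the shorter segment covered by the temporal overlap.

        Segments are (start, end) tuples in seconds.
        """
        overlap = min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0])
        shorter = min(seg_a[1] - seg_a[0], seg_b[1] - seg_b[0])
        return max(0.0, overlap) / shorter if shorter > 0 else 0.0

    def align_segments(src_segments, tgt_segments, threshold=0.5):
        """Greedily map each source segment to overlapping target segments.

        Returns a list of (source_index, [target_indices]) pairs, covering
        one-to-one as well as one-to-many alignments.
        """
        alignments = []
        for i, src in enumerate(src_segments):
            matches = [j for j, tgt in enumerate(tgt_segments)
                       if time_overlap_ratio(src, tgt) >= threshold]
            if matches:
                alignments.append((i, matches))
        return alignments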

3.3.2 Fair-use principles of copyrighted media

Movies and TV shows are protected by copyright laws that limit how they can be used. Usage is governed by the principles of fair use, which permit the use of copyrighted material for transformative and non-commercial purposes. The boundaries of what counts as transformative are not rigidly defined, but are shaped by guidelines and court decisions. The term "fair use" originates in United States lawFootnote 13 and has influenced legislation in other countries. The United Kingdom, for example, allows non-commercial research on any material as long as it is lawfully accessed.Footnote 14

Our methodology complies with these principles, since the small audio portions obtained with it cannot be reconstructed into the original form of the movie. The copyright of the original source of the segments has to be stated both in any publication describing the work and at the point of access.

4 Building prosody corpora

This section describes the methodologies used to build two different corpora for prosody-related applications: first, the processing of TED talks to build a prosody-specific corpus, and second, the compilation of a parallel corpus in the movie domain.

4.1 Prosodically annotated TED talks (PANTED) corpus

TED (Technology, Entertainment, Design) talks are a set of conference talks lasting 15 min each on average and held worldwide in more than 100 languages. TED talks include a large variety of topics, from technology and design to science, culture and academia. The corresponding transcripts, as well as the audio and video files, are openly available on TED's website.Footnote 15 Owing to this public availability, TED talks have been the source of many corpora for linguistic analysis and machine learning-based applications. One example is the corpus-based study of paragraph-based prosodic cues by Farrús, Lai and Moore (2016), who compiled a corpus of 1365 talks together with their punctuated transcriptions and extracted various F0- and intensity-based features at the word and sentence levels.

The Prosodically Annotated TED talks (PANTED) corpus is the result of reprocessing the corpus used in Farrús et al. (2016) to support joint processing of prosodic and other linguistic features in speech. Table 3 shows the main statistics of the corpus. The PANTED corpus is made publicly available through the UPF e-repositoryFootnote 16 under the Attribution 4.0 International (CC BY 4.0) license.Footnote 17 The source code used for the corpus reprocessing is also accessible online.Footnote 18

Table 3 PANTED corpus statistics

4.2 The Heroes corpus

The movie2parallelDB methodology presented in the previous section has been put into practice by compiling a corpus from the popular 2000s science-fiction TV series Heroes.Footnote 19 Originating in the United States, Heroes ran on TV channels worldwide between 2006 and 2010. The whole series consists of 4 seasons and 77 episodes and has been dubbed into many languages, including Spanish, Portuguese, French and Catalan. Each episode runs for 42 min on average.

Twenty-one episodes from seasons 2 and 3 were processed using our methodology to obtain 7000 parallel audio segments together with their transcriptions and Proscript annotations. The total duration of the audio content is about 9.5 h; see Table 4 for further details. Counts of several linguistic units (words, tokens, sentences) in the final parallel corpus are presented in Table 5, and a summary of how much of the content of an average episode ended up in the dataset is given in Table 6.

Table 4 Heroes corpus duration information
Table 5 Word, token, sentence counts and average word count for parallel English and Spanish segments
Table 6 Average numbers for each episode

The dataset is published online as the Heroes CorpusFootnote 20 and is accessible under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.Footnote 21 In what follows, we explain the additional processes that were carried out during the collection of the dataset.

4.2.1 Raw data acquisition

The DVD’s of the series were obtained from the Pompeu Fabra University Library. Videos of the episodes were extracted using the Handbrake video transcoder softwareFootnote 22 and were saved as Matroska format (mkv) files. To run movie2parallelDB scripts, we needed to extract audio and subtitle pairs for both languages residing in them. We used MKVToolNixFootnote 23 for extracting the audio. As subtitles were embedded as bitmap images in the DVD, we used an optical character recognition (OCR) softwareFootnote 24 to convert them to srt format subtitles. In total, 21 episodes were processed to obtain 25 h English and Spanish audio with their corresponding subtitles. The episode scripts were obtained from a fan web page.Footnote 25

4.2.2 Manual subtitle correction work

We sourced the transcriptions of the audio segments from the subtitles. Although subtitles are a highly reliable source of transcriptions in the original language of a movie, this is not always the case for the dubbed languages, because the dubbing script needs to satisfy visual constraints such as lip movements, whereas subtitles do not; subtitles are also often more concise to facilitate easy reading. In our case, we observed that the Spanish subtitles matched the Spanish audio in approximately 80% of the cases. As we needed exact correspondence between audio and transcriptions, we had to carry out a manual correction process: both the subtitle transcripts and the time stamps were corrected to match exactly what is spoken in the dubbed audio and when.

A manual pass over the segments also gave us the opportunity to filter out noisy audio portions that would otherwise have ended up in the corpus. Segments containing noise and music, overlapping or unintelligible speech, or speech in other languages were marked. The spell checking and the correction of time stamps and transcripts for the 21 episodes were carried out by two native Spanish-speaking annotators and took 60 h in total.

5 Corpora evaluation

In this section we demonstrate the use of joint prosodic–lexical processing on two practical tasks using the corpora presented above: automatic speech transcription and spoken language translation (Öktem, 2019). In the former case, we report experiments that use our data modeling methodology to restore punctuation marks in raw automatic speech recognition output. In the latter, we enhance spoken language translation through the incorporation of pause features.

5.1 Punctuation restoration using prosodic and lexical cues

The introduction of punctuation marks into the output of automatic speech recognition (ASR) is an important task in applications such as automatic transcription/subtitling, speech-to-speech translation and language analysis. Punctuation is essential for grammaticality, understandability, and the functionality of several downstream tasks (Öktem et al., 2017a). For example, correct sentence segmentation and punctuation of recognized speech improves the quality of machine translation (Matusov, Mauser, & Ney, 2006; Peitz et al., 2011; Cho, Niehues, & Waibel, 2017; Lu & Ng, 2010), and missing periods and commas in machine-generated text degrade the performance of information extraction from speech (Favre et al., 2008; Hillard et al., 2006). Also, most data-driven parsing models require segmentation of recognized text into sentence-like units and use punctuation as features (Jones, 1994; Spitkovsky, Alshawi, & Jurafsky, 2011; Ma, Zhang, & Zhu, 2014).

During the manual transcription of an audio recording, both modalities, syntax and prosody, are used to determine the phrasing structure and punctuation. Analogously, an automatic system for restoring punctuation in ASR output can take both of these modalities into account. Work on automatic punctuation restoration in ASR output has shown that prosodic features are highly indicative of phrase boundaries as well as of punctuation placement (Batista et al., 2012; Tilk & Alumäe, 2016; Xu, Xie, & Xiao, 2017).

In Öktem, Farrús and Wanner (2017a), we proposed a framework that combines lexical and prosodic information for restoring punctuation in raw speech transcripts. The deep learning-based framework processes textual and prosodic information in parallel, reflecting the data structure used in our datasets.Footnote 26

The recurrent neural network (RNN)-based architecture treats the problem as a sequence prediction task in which a punctuation class is predicted at each interval of a sequence of words, as illustrated in Fig. 6. The proposed model makes it possible to integrate any desired feature (lexical, syntactic or prosodic) at each interval and allows us to test which features influence punctuation placement accuracy and to what extent. Our feature set includes word embeddings and part-of-speech (POS) tags as syntactic features, and fundamental frequency (F0), intensity, pauses and speech rate as prosodic features.

Fig. 6 Modeling punctuation as a classification problem at each word interval (quote by Lao Tze)

The PANTED corpus was used as the training, development and testing dataset. The data was sampled into sequences of 50 words, reflecting the average number of words per paragraph in the corpus. Each sample starts with a new sentence. A total of 51,311 samples were extracted, containing 2.6 sentences on average. The complete set was split into training, validation and test sets with a 70–15–15 ratio.
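The following PyTorch sketch illustrates this kind of sequence model, predicting a punctuation class at every word interval from word embeddings concatenated with word-level prosodic features. Dimensions, class inventory and layer choices are illustrative assumptions, not the exact architecture used in our experiments.

    import torch
    import torch.nn as nn

    class PunctuationTagger(nn.Module):
        """Predict a punctuation class after every word from lexical and prosodic input."""

        def __init__(self, vocab_size, emb_dim=128, prosody_dim=4,
                     hidden_dim=128, num_classes=4):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            # Bidirectional GRU over the concatenation of the word embedding and
            # word-level prosodic features (e.g. pause, F0, intensity, speech rate).
            self.rnn = nn.GRU(emb_dim + prosody_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, word_ids, prosody):
            # word_ids: (batch, seq_len); prosody: (batch, seq_len, prosody_dim)
            x = torch.cat([self.embedding(word_ids), prosody], dim=-1)
            out, _ = self.rnn(x)
            return self.classifier(out)   # (batch, seq_len, num_classes) logits

    # Example: sequences of 50 words, classes = {NONE, COMMA, PERIOD, QUESTION}
    model = PunctuationTagger(vocab_size=30000)
    logits = model(torch.randint(0, 30000, (8, 50)), torch.zeros(8, 50, 4))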

Table 7 shows the results for generating full stops, commas and question marks with different feature settings on the TED test set. Features are denoted by the symbols p (pause duration), f0 (intonation), i (intensity) and sr (speech rate). All reported results except the last row use prosodic features as continuous values in order to ensure comparability with the baseline architecture, which is replicated from Tilk and Alumäe (2016) and uses only lexical information (w). For the final setup, we determined discrete levels for the f0 and i features and used them instead of the real values. This resulted in an improvement in terms of \(F_{1}\) score, reaching 70.3% averaged over all three punctuation marks.

Table 7 Punctuation generation results for each punctuation mark as \(F_{1}\) score in percentage (%). Bold indicates highest results

We can observe that all feature combinations outperform the baseline architecture. The combination of words, POS tags, pause duration and mean F0 is especially useful for detecting commas and for the three punctuation marks overall. The same combination plus mean intensity values improves full-stop detection in particular, while words, POS tags and pauses alone give the best \(F_{1}\) score for question marks.

5.2 Enhancing spoken language translation with prosody

In this section, we explore the introduction of prosodic features directly into a deep learning-based machine translation system. We focus in particular on the inclusion of inter-lexical silent pauses as an additional feature on the encoder side of a sequence-to-sequence MT architecture (Sutskever, Vinyals, & Le, 2014; Öktem, 2019).

The enhanced encoder architecture is illustrated in Fig. 7. The text encoder (inner box) encodes the words by passing them as embedding vectors to a bidirectional recurrent layer. Meanwhile, the pause features follow a separate encoding path in which they are gradually transformed so that they can be concatenated with the word vectors and fed into the same recurrent layer. The output vectors at each time step are then passed to an attentional decoder (Bahdanau, Cho, & Bengio, 2014), which outputs only words.Footnote 27

Fig. 7 Sequence-to-sequence translation encoder with prosodic information
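A minimal PyTorch sketch of such a prosody-aware encoder is given below. The projection of the scalar pause feature and all dimensions are illustrative assumptions; the actual implementation may differ.

    import torch
    import torch.nn as nn

    class ProsodyAwareEncoder(nn.Module):
        """Bidirectional encoder that concatenates a projected pause feature
        with every word embedding before the recurrent layer."""

        def __init__(self, vocab_size, emb_dim=256, pause_dim=16, hidden_dim=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            # Project the scalar pause-after-word duration to a small vector so
            # it can be concatenated with the word embedding.
            self.pause_proj = nn.Sequential(nn.Linear(1, pause_dim), nn.Tanh())
            self.rnn = nn.GRU(emb_dim + pause_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

        def forward(self, word_ids, pauses):
            # word_ids: (batch, seq_len); pauses: (batch, seq_len), in seconds
            pause_vec = self.pause_proj(pauses.unsqueeze(-1))
            x = torch.cat([self.embedding(word_ids), pause_vec], dim=-1)
            outputs, hidden = self.rnn(x)
            return outputs, hidden   # outputs feed the attentional decoder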

Training is performed in two stages with two types of data: (1) a parallel text corpus and (2) a prosodically annotated parallel spoken-audio corpus. In the first stage, only the parameters of the text encoder and the decoder are updated. In the second stage, training is performed with the joint text-and-prosody encoder. As base parallel text data, we used 5 million English–Spanish sentence pairs from the OpenSubtitles corpus (Lison & Tiedemann, 2016; Lison et al., 2018). For the second-stage training, we used a version of the Heroes corpus consisting of 7225 parallel segments, split into training, test and validation sets of 6141, 542 and 541 segments, respectively.

Table 8 shows the difference in translation quality between the baseline (text) model and the model enhanced with pause information (text + pauses) on two versions of the Heroes test set: one with the original punctuation and one with automatically restored punctuation, simulating an automatic transcription scenario. To compare the systems, we used BLEU (Papineni et al., 2002), a widely accepted automatic MT evaluation metric known to correlate with human judgements. With manually annotated punctuation on the input sentences, there is an improvement of 1.31% in BLEU score; with punctuation-recovery preprocessing on the raw transcripts, translation quality still increases by 1.07%. These improvements indicate the potential of joint lexical–prosodic encoding to increase neural machine translation quality.

Table 8 BLEU scores (%) on the Heroes corpus testing set with and without pause encoding

6 Conclusion

The motivation to enhance speech processing applications with prosodic information comes at a cost, which mostly resides in the labour of data harvesting. In this paper, we addressed the need for a tool set to obtain speech data with prosodic annotations and presented the two corpora that resulted from this effort. We introduced a complete pipeline for the collection, handling, storage and visualization of this type of data. The Proscript library serves for handling linguistically aligned prosodic data. Prosograph enables the manual examination of such data by visualizing speech-related characteristics through a programmable interface; it can be used in many areas of research, such as language learning and acquisition, comparative studies across languages, tone languages, and audiovisual prosody, among others. Finally, the movie2parallelDB framework was built to create structured parallel speech data from dubbed movies.

The two data resources prepared and packaged using these tools are also presented: (1) PANTED, consisting of automatically generated lexical–prosodic annotations of TED conference talks, and (2) the Heroes corpus, consisting of prosodically annotated parallel TV/movie-domain speech segments. Using the former, we were able to demonstrate an increase in the accuracy of automatic punctuation restoration through the use of prosodic features. The latter resource is, to our knowledge, the first example of a parallel movie speech corpus, paving the way for research in dubbing translation (Öktem et al., 2019; Nayak et al., 2020). We furthermore outlined a deep learning-based machine translation framework in which such corpora are useful for integrating prosodic features into neural machine translation.

All the developed resources, corpora and evaluation frameworks are published openly. We hope that both the toolkit and the corpora serve the speech technology and prosody research community.