1 Introduction

Search on speech (SoS) has become an active research area due to the huge amount of information stored in audio and video repositories. SoS focuses on retrieving speech content that matches user queries from audio repositories, for which the development of efficient methods is highly necessary [1]. Significant research has been carried out in SoS for spoken document retrieval (SDR) [2–7], keyword spotting (KWS) [8–13], spoken term detection (STD) [14–19], and query-by-example (QbE) STD and SDR [20–25]. STD is important due to the following factors: (1) it offers the possibility of retrieving any speech file that contains any term (a sequence of one or more words) from its textual representation, allowing efficient search for any term in a large index; (2) this technology can be accessed from any device with text input capabilities; and (3) it is suitable for building open-vocabulary SoS systems.

1.1 Spoken term detection overview

STD has received sustained interest for years from leading companies and research institutes such as IBM [14, 26–30], BBN [31–33], SRI & OGI [34–36], BUT [17, 37, 38], Microsoft [39], QUT [40, 41], JHU [16, 42–44], Fraunhofer IAIS/NTNU/TUD [15], NTU [45, 46], and IDIAP [47], among others. STD systems are composed of two main stages: (1) indexing, which is usually done with an automatic speech recognition (ASR) subsystem, and (2) search, carried out by a detection subsystem, as depicted in Fig. 1. The ASR subsystem generates word/subword lattices from the input speech signal and stores them as an index. The detection subsystem integrates a term detector and a decision maker. The term detector searches the index for putative detections of the terms, and the decision maker decides whether each putative detection is a hit or a false alarm (FA) based on certain confidence measures.

Fig. 1 Architecture and evaluation of a standard STD system

For the ASR stage, word-based speech recognition has been widely used [35, 48–54], since it typically yields better performance than subword-based ASR [55–62] due to the lexical and language model (LM) information employed by the word-based ASR. However, one of the main drawbacks of word-based ASR is that it can only detect in-vocabulary (INV) terms. The subword-based approach, on the other hand, has the unique advantage that it can detect terms containing words that are not in the vocabulary of the recognizer, i.e., out-of-vocabulary (OOV) terms. The combination of these two approaches has been proposed in order to exploit the relative advantages of the word- and subword-based strategies [17, 32, 33, 36, 44, 63–70].

Recently, end-to-end, ASR-free approaches to STD have also been proposed, which aim to avoid the need to collect and handle the large amounts of data required to build word- and phone-based STD systems [28–30].

The availability of ASR tools, e.g., the Hidden Markov Model Toolkit (HTK) [71], Sphinx [72], and Kaldi [44, 73], among others, facilitates the development of STD systems, since these mitigate the issue of constructing an ASR system from scratch. Among these, Kaldi is especially suitable for building STD systems, since it integrates an ASR subsystem, a term detector, and a decision maker [73–75]. The Kaldi STD system employs a word-based approach for term detection and a method based on proxy words (i.e., replacing each OOV word with the most acoustically similar in-vocabulary word or word sequence) to detect OOV terms [76].

1.2 Methods

Research carried out in a given area (speech recognition, speaker recognition, or speaker diarization, to cite some examples) can be difficult to compare in the absence of a common evaluation framework. STD research also suffers from this issue, since published systems typically employ different acoustic databases and different term lists, which makes system comparison impossible. In this context, international evaluations provide a unique framework to measure the progress of a technology such as STD.

ALBAYZIN evaluation campaigns comprise an internationally open set of evaluations supported by the Spanish Thematic Network on Speech Technologies (RTTH) and the ISCA Special Interest Group on Iberian Languages (SIG-IL), which have been held biennially since 2006. These evaluation campaigns provide an objective mechanism to compare different systems and are a powerful way to promote research on different speech technologies [77–86].

Spanish is a major world language, and significant research has been conducted on it for ASR, KWS, and STD tasks [87–93]. The increasing interest in SoS around the world and the lack of SoS evaluations dealing with the Spanish language encouraged us to organize a series of STD evaluations, starting in 2012 and held biennially until 2018, aiming to evaluate the progress of this technology for Spanish. Each evaluation has extended the previous one with new challenges. The main novelty of the fourth ALBAYZIN STD evaluation is the addition of a new data domain, namely broadcast news, with programs from the Spanish public television Radio Televisión Española (RTVE). In addition, a novel conversational speech database has been used to assess the validity of the submitted systems in an unseen domain. Moreover, the terms used in one of the databases of the ALBAYZIN 2016 STD evaluation were kept, to enable a straightforward comparison of the systems submitted to both evaluations.

The main objectives of this evaluation can be summarized as follows:

  • Organize the first Spanish multi-domain STD evaluation, in which systems are ranked across different databases and domains

  • Provide an evaluation benchmark with search terms of increasing complexity compared to the previous ALBAYZIN STD evaluations

This evaluation is open to research groups and companies that work in speech recognition.

This paper is organized as follows: Section 2 presents the evaluation (including databases, evaluation metrics, and participants) and a comparison with other STD evaluations. Section 3 presents the different systems submitted to the evaluation. Evaluation results, along with a discussion, are presented in Section 4, which includes paired t tests [94] as a statistical significance measure for system comparison. Section 5 presents a post-evaluation analysis based on several search term properties and on the fusion of the primary systems submitted to the evaluation. The last section outlines the main conclusions of the paper.

2 Spoken term detection evaluation

2.1 STD evaluation overview

This evaluation involves searching for a list of terms (given in written form) within speech data and indicating the audio file and timestamp of each detected occurrence.

The evaluation consists of searching different term lists within different sets of speech data. The speech data comprise three domains (workshop talks, broadcast news, and two-person conversations), for which individual datasets are given. Each domain contains training, development, and test data, except the two-person conversation dataset, which only contains test data. The evaluation ranking is based on the average system performance over the three test datasets. Participants can use the training data for system training and the development data for system tuning, but additional data may also be employed for both training and development.

Two different types of terms are defined in this evaluation, namely in-vocabulary terms and out-of-vocabulary terms. The OOV term set was defined to simulate the out-of-vocabulary words of a large vocabulary continuous speech recognition (LVCSR) system. If participants employ an LVCSR system for processing the audio, these OOV terms must first be removed from the system dictionary, and hence other methods have to be used to search for them. The INV terms, on the other hand, may appear in the LVCSR system dictionary if participants so choose.

Participants could submit a primary system and up to four contrastive systems. No manual intervention was allowed in generating the final output file, and hence all systems had to be fully automatic. Listening to the test data, or any other human interaction with the test data, was forbidden before the evaluation results had been sent back to the participants. The output file with the term detections followed the standard Extensible Markup Language (XML)-based format accepted by the National Institute of Standards and Technology (NIST) evaluation tool [95]. Ground-truth labels for the test data were given to participants once the organizers had sent back the evaluation results.

2.2 Evaluation metric

In STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit; otherwise, it is a false alarm. If an actual occurrence is not detected, it is called a miss. The Actual Term-Weighted Value (ATWV) metric proposed by NIST [95] has been used as the main metric for the evaluation. This metric integrates the hit rate and the false alarm rate of each term into a single value and then averages over all terms:

$$ \text{ATWV}=\frac{1}{|\Delta|}\sum_{K \in \Delta}{\left(\frac{N^{K}_{\text{hit}}}{N^{K}_{\text{true}}} - \beta \frac{N^{K}_{\text{FA}}}{T-N^{K}_{\text{true}}}\right)}, $$
(1)

where Δ denotes the set of terms and |Δ| is the number of terms in this set. \(N^{K}_{\text {hit}}\) and \(N^{K}_{\text {FA}}\) represent the numbers of hits and false alarms of term K, respectively, and \(N^{K}_{\text {true}}\) is the number of actual occurrences of K in the audio. T denotes the audio length in seconds, and β is a weight factor set to 999.9, as in the ATWV proposed by NIST [31]. This weight factor places an emphasis on recall over precision at a ratio of 10:1.

ATWV represents the Term-Weighted Value (TWV) for the threshold set by the STD system (usually tuned on development data). An additional metric, called Maximum Term-Weighted Value (MTWV) [95], can also be used to evaluate the performance of an STD system. MTWV is the maximum TWV obtained by the STD system over all possible thresholds and hence does not depend on the tuned threshold. MTWV therefore represents an upper bound on the performance obtainable by the STD system. Results based on this metric are also presented to evaluate system performance regardless of the decision threshold.

The p(Miss) and p(FA) values, which represent the probabilities of miss and FA of the STD system, respectively, are also reported. They are defined as follows:

$$ \mathrm{p(Miss)}=1-\frac{N_{\text{hit}}}{N_{\text{true}}} $$
(2)
$$ \mathrm{p(FA)}=\frac{N_{\text{FA}}}{T-N_{\text{true}}}, $$
(3)

where Nhit represents the number of hits of the STD system, Ntrue is the number of occurrences of the terms in the audio, NFA represents the number of FAs of the STD system, and T denotes the audio length in seconds. These values provide a quantitative way to measure the STD system performance in terms of misses (or equivalently, hits) and false alarms.
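To make these metrics concrete, the following Python sketch computes ATWV (Eq. 1), p(Miss) (Eq. 2), and p(FA) (Eq. 3) from per-term detection counts. It is a minimal illustration, not the official NIST tool; the data layout and function names are our own, and terms without reference occurrences are simply excluded from the ATWV average.

```python
# Minimal sketch of the STD evaluation metrics (Eqs. 1-3).

def atwv(counts, T, beta=999.9):
    """counts: dict term -> (n_hit, n_fa, n_true); T: audio length in seconds."""
    terms = [k for k, (_, _, n_true) in counts.items() if n_true > 0]
    total = 0.0
    for k in terms:
        n_hit, n_fa, n_true = counts[k]
        # Per-term TWV contribution: hit rate minus weighted FA rate.
        total += n_hit / n_true - beta * n_fa / (T - n_true)
    return total / len(terms)

def p_miss_p_fa(counts, T):
    """Global probability of miss (Eq. 2) and of false alarm (Eq. 3)."""
    n_hit = sum(c[0] for c in counts.values())
    n_fa = sum(c[1] for c in counts.values())
    n_true = sum(c[2] for c in counts.values())
    return 1.0 - n_hit / n_true, n_fa / (T - n_true)

# Example: two terms searched in one hour of audio.
counts = {"madrid": (8, 2, 10), "parlamento": (3, 0, 5)}
print(atwv(counts, T=3600.0))          # term-weighted average
print(p_miss_p_fa(counts, T=3600.0))   # (p(Miss), p(FA))
```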

In addition to ATWV, MTWV, p(Miss), and p(FA), NIST also proposed a detection error tradeoff (DET) curve [96] to evaluate the performance of an STD system working at various miss/FA ratios. Although DET curves were not used for the evaluation itself, they are also presented in this paper for system comparison.

In this work, the NIST STD evaluation tool [97] was employed to compute MTWV, ATWV, p(Miss), p(FA), and DET curves.

2.3 Databases

Three databases that comprise different acoustic conditions and domains have been employed for the evaluation: (1) for comparison purposes, the same MAVIR database employed in the previous ALBAYZIN STD evaluations in 2012, 2014, and 2016; (2) the RTVE database, which consists of different programs recorded from the Spanish public television (Radio Televisión Española) and covers different broadcast news domains; and (3) the COREMAH database, which contains conversational speech with two speakers per recording. For the MAVIR and RTVE databases, three separate datasets (for training, development, and test) were provided to participants. For the COREMAH database, only test data were provided, which allowed measuring the generalization capability of the systems in an unseen domain. Tables 1, 2, and 3 include database features such as the division into training, development, and test sets; the number of word occurrences; the duration; and the average mean opinion score (MOS) [98], which gives an idea of the quality of each speech file.

Table 1 Characteristics of the MAVIR database: number of word occurrences (#occ.), duration (dur.) in minutes (min), number of speakers (#spk.), and average MOS (Ave. MOS)
Table 2 Characteristics of the RTVE database: number of word occurrences (#occ.), duration (dur.) in minutes (min.), number of speakers (#spk.), and average MOS (Ave. MOS)
Table 3 Characteristics of the COREMAH database: number of word occurrences (#occ.), duration (dur.) in minutes (min.), number of speakers (#spk.), and average MOS (Ave. MOS)

2.3.1 MAVIR

The MAVIR database consists of a set of Spanish talks extracted from the MAVIR workshops held in 2006, 2007, and 2008, featuring speakers from Spain and Latin America.

The MAVIR Spanish data consist of spontaneous speech files from different speakers, amounting to about 7 h of speech. These data were divided for the purpose of this evaluation into training, development, and test sets. The data were manually annotated in orthographic form, but timestamps were only set at phrase boundaries. To prepare the data for the evaluation, the organizers manually added timestamps for the roughly 3000 occurrences of the terms used in the development and test sets. The training data were made available to the participants with the orthographic transcription and the timestamps for phrase boundaries.

The speech data were originally recorded in several audio formats (pulse-code modulation (PCM) mono and stereo, MP3, 22.05 kHz, and 48 kHz, among others). The recordings were converted to PCM, 16 kHz, single channel, 16 bits per sample using the SoX tool. All the recordings (except one) were made with the same equipment, a TASCAM DA-P1 digital DAT recorder. Different microphones were used, mainly tabletop or floor-standing microphones, though in one case a lavalier microphone was used. The distance from the speaker's mouth to the microphone varies and was not controlled, but in most cases it was smaller than 50 cm. The recordings were made in large conference rooms with capacity for over a hundred people, often with a large audience present. This poses additional challenges, including background noise (particularly babble noise) and reverberation. The realistic settings and the variety of phenomena in the spontaneous speech make this database appealing and challenging enough for the evaluation.

2.3.2 RTVE

The RTVE database belongs to the broadcast news domain and contains speech from different television (TV) programs recorded from 2015 to 2018 (e.g., Millenium, La tarde en 24H, Comando actualidad, and España en comunidad, to name a few). These comprise about 570 h in total, which were divided into training, development, and test sets for the purpose of this evaluation. To prepare the data for the evaluation, the organizers manually added timestamps for the roughly 2700 occurrences of the terms used in the development and test sets. The training data were made available to participants with the corresponding subtitles of the speech data (though these may contain inaccurate word transcriptions), and the development data were further divided into two sets: the dev1 dataset consists of about 60 h of speech with human-revised word transcriptions without time alignment, and the dev2 dataset, which was employed as the actual development data for the STD evaluation, consists of 15 h of speech. The recordings were provided in Advanced Audio Coding (AAC) format, stereo, 44.1 kHz, with variable bit rate. As far as we know, this is the largest speech database employed in any Spanish SoS evaluation. More information about the RTVE database can be found in [99].

2.3.3 COREMAH

The COREMAH database contains conversations about different topics, such as rejection, compliment, and apology, which were recorded in 2014 and 2015 in a university environment. It contains Spanish recordings from speakers with different levels of Spanish (native, advanced C1, and intermediate B1). Since the main purpose of this database is to evaluate the submitted systems in an unseen domain, only the recordings of native Spanish speakers are employed in the evaluation, to recreate the conditions of the other databases. The speech data amount to about 45 min. To prepare the data for the evaluation, the organizers manually added timestamps for the roughly 1000 occurrences of the terms used in the test set.

The original recordings are videos in the Moving Picture Experts Group (MPEG) format. The audio was extracted and converted to PCM, 16 kHz, single channel, 16 bits per sample using the ffmpeg tool. It is worth mentioning that this database contains a high degree of overlapped speech, which makes it quite challenging.

2.3.4 Term list selection

The selection of terms for the development and test sets aimed to build a realistic STD scenario by including high-occurrence terms, low-occurrence terms, in-language (INL) (i.e., Spanish) terms, out-of-language (OOL) (i.e., foreign) terms, single-word and multi-word terms, in-vocabulary and out-of-vocabulary terms, and terms of different lengths. A term may have no occurrences, or may appear one or more times, in the speech data. Table 4 includes features of the development and test term lists, such as the number of INL and OOL terms, the number of single-word and multi-word terms, and the number of INV and OOV terms, along with the number of occurrences of each set in the corresponding speech database. Note that a multi-word term is considered OOV if any of the words that form the term is OOV.

Table 4 Development and test term list characteristics for MAVIR, RTVE, and COREMAH databases

2.4 Comparison to other STD international evaluations

Spoken term detection evaluations have been organized for more than a decade. In 2006, NIST launched the first NIST STD evaluation [95], with English, Mandarin Chinese, and Modern Standard and Levantine Arabic as target languages. The speech included conversational telephone speech (CTS), broadcast news (BNews) speech, and speech recorded in roundtable meeting rooms (RTMeet) with distantly placed microphones (this last type was used for English only). NIST publicly released the results of this evaluation, which are summarized in Table 5.

Table 5 Best performance (in terms of Actual Term-Weighted Value, ATWV) obtained in the NIST STD 2006 evaluation for the different conditions: "CTS" stands for Conversational Telephone Speech, "BNews" for Broadcast News, and "RTMeet" for speech recorded in roundtable meeting rooms

A significant amount of STD research has been carried out in the framework of the IARPA BABEL program and the NIST Open Keyword Search (OpenKWS) evaluation series [19, 28, 30, 32, 33, 44, 52, 56, 58, 63, 65–67, 70, 101–107]. The BABEL program was born in 2011, aiming to develop fully automatic and noise-robust speech recognition systems in a limited time (e.g., one week) and with a limited amount of transcribed training data. This program supports research in low-resource languages such as Cantonese, Pashto, Tagalog, Turkish, Vietnamese, Swahili, and Tamil, among others. From 2013 to 2016, NIST organized an annual STD evaluation called OpenKWS, which was included within the BABEL program but open to research groups beyond the BABEL participants [108–111]. This evaluation was quite similar to the former NIST STD 2006 evaluation and included CTS and microphone speech data in a surprise language announced only a few (four or fewer) weeks before the evaluation. The main results of the OpenKWS evaluations are shown in Table 6. In 2017, NIST also launched the biennial Open Speech Analytics Technologies (OpenSAT) evaluation series, which includes keyword search among its tasks. The goal of this series is "to provide broad support for the advancement of speech analytic technologies by including multiple speech analytic tasks and multiple data domains." Like the previous OpenKWS evaluations, it focused on low-resource languages, and the speech data comprised conversational telephone speech.

Table 6 Best performance (in terms of Actual Term-Weighted Value, ATWV) obtained in the different editions (2013, 2014, 2015, and 2016) of the OpenKWS evaluations under the full language pack condition

In the ALBAYZIN 2018 STD evaluation, the audio comprises diverse recording conditions: (1) real talks given at real workshops held in large conference rooms with an audience, (2) conversational speech, and (3) broadcast news speech. In the workshop recordings, the microphones, conference rooms, and even recording conditions change from one recording to another, and tabletop and floor-standing microphones were typically employed. In addition, our evaluation explicitly defines different in-vocabulary and out-of-vocabulary term sets. These differences in evaluation conditions mean that our evaluation poses different challenges, and they make it difficult to compare our results with those of the previous NIST STD/OpenKWS/OpenSAT evaluations.

STD evaluations have also been held in the framework of the NTCIR conferences from 2011 to 2016 [112–115]. The data used in these evaluations are spontaneous speech in Japanese provided by the National Institute for Japanese Language, and spontaneous speech recorded during seven editions of the Spoken Document Processing Workshop. In these evaluations, the organizers provided the participants with manual transcriptions of the speech data and the output of an LVCSR system. Table 7 presents the best result obtained in each evaluation, where the F-measure was used as the evaluation metric. Although the MAVIR data employed in our evaluation may be similar in nature to the speech of these NTCIR STD evaluations (speech recorded at real workshops), our evaluation uses a different language, employs a larger list of terms along with three different databases (each covering a different domain), and defines disjoint development and test term lists to measure the generalization capability of the systems. Moreover, the evaluation metric is different. All these differences make system comparison very difficult.

Table 7 Best performance (in terms of F-measure) obtained in the different editions of the NTCIR STD evaluations

2.5 Comparison to previous ALBAYZIN search on speech evaluations

Since 2012, the ALBAYZIN STD evaluation has been integrated within the framework of the ALBAYZIN SoS evaluation, which includes two different tasks, namely STD and QbE STD. In 2012, participants focused on the QbE STD task, whereas in 2014, systems were mainly submitted to the STD task. In 2016 and 2018, both tasks received equal attention from participants. Specifically, since 2014, the ALBAYZIN STD evaluation has evolved in the following aspects:

  • Evaluation domains. In the evaluation held in 2014, a single domain (spontaneous speech from workshop talks) was chosen. In 2016, a novel Spanish database (Spanish European Parliament sessions) was added to that domain, which allowed measuring system performance on an additional dataset for which neither training nor development data were provided; the system ranking was based on the performance obtained in the workshop talk domain. The evaluation held in 2018 employs three different domains: spontaneous speech from workshop talks, broadcast news (which contains the largest database used in any Spanish STD evaluation), and spontaneous speech from two-person conversations (for which neither training nor development data were provided). This makes the evaluation more attractive for participants, since they can evaluate their systems in different domains and conditions. Moreover, since the evaluation ranking is based on the average performance over the different datasets, participants were encouraged to build multi-domain STD systems. This is the most important difference with respect to the evaluation held in 2016.

  • Number and complexity of search terms. The number of search terms has increased with each evaluation: 548 terms in 2014, 780 in 2016, and 1560 in 2018. The complexity of the search terms has also increased from one evaluation to the next, with more out-of-language, multi-word, and OOV terms.

  • Evaluation sets. In 2014, the evaluation organizers provided two datasets: a combined training/development dataset and a test dataset. To avoid the bias toward the training data that arises when systems are evaluated on development data, the 2016 and 2018 evaluations provided three separate datasets: training, development, and test.

2.6 Participants

Three different teams submitted 11 systems to the ALBAYZIN 2018 Spoken Term Detection evaluation, as listed in Table 8. About 3 months were given to the participants for system development; the STD evaluation therefore focuses on building STD systems in a limited period of time. The training, development, and test data were released to the participants at different times: training and development data on June 30, 2018, and test data on September 24, 2018. The final system submission was due on October 21, 2018, and the final results were discussed at the IberSPEECH 2018 conference on November 21, 2018.

Table 8 Participants in the ALBAYZIN 2018 STD evaluation along with the systems submitted

3 Systems

In this section, the systems submitted to the evaluation are described (see Appendix). These systems can be divided into three categories, as presented in Table 8: (1) LVCSR-based approaches, (2) subword approaches based on phone units, and (3) approaches that generate a spoken query from the written form of the term using speech synthesis and employ dynamic time warping (DTW)-based search in a QbE-STD framework.

3.1 Combined Kaldi-based STD system (Combined Kaldi)

This system combines a word-based STD system and a phone-based STD system, as depicted in Fig. 2. Both systems are described next.

Fig. 2 Architecture of the Combined Kaldi STD system

3.1.1 Word-based STD system

The ASR subsystem is based on the Kaldi open-source toolkit [73] and employs deep neural network (DNN)-based acoustic models. Specifically, a DNN-based context-dependent speech recognizer is trained following the DNN training approach presented in [116]. Forty-dimensional Mel-frequency cepstral coefficients (MFCCs), augmented with three pitch- and voicing-related features [117] and appended with their delta and acceleration coefficients, are first extracted for each speech frame. The DNN has 6 hidden layers with 2048 neurons each. Each speech frame is spliced across ± 5 frames to produce 1419-dimensional input vectors (43 features × 3 for the static, delta, and acceleration coefficients × 11 frames) for the first layer, whereas the output layer is a soft-max layer representing the log-posteriors of the context-dependent hidden Markov model (HMM) states. The Kaldi LVCSR decoder generates word lattices [118] using these DNN-based acoustic models.

The data used to train the acoustic models of this Kaldi-based LVCSR system are extracted from the Spanish training material of the 2006 TC-STAR automatic speech recognition evaluation campaign, which amounts to about 99 h of speech, and the Galician broadcast news database Transcrigal [119], which amounts to about 26 h of speech. All the non-speech parts, as well as the speech parts corresponding to transcriptions with pronunciation errors, incomplete sentences, and short speech utterances, were discarded, so the final acoustic training material consists of approximately 104.5 h.

The language model employed in the LVCSR system is constructed using a text database of 150 million word occurrences composed of material from several sources (transcriptions of the European and Spanish Parliaments from the TC-STAR database, subtitles, books, newspapers, on-line courses, and the transcriptions of the MAVIR sessions included in the development set provided by the evaluation organizers [120]). Specifically, the LM is obtained by static interpolation of two 4-gram language models trained on these different text databases. Both LMs are built with the Kneser-Ney discounting strategy using the SRILM toolkit [121], and the final LM is obtained using the SRILM static n-gram interpolation functionality. One of the 4-gram LMs is trained on the subtitles provided by the evaluation organizers within the RTVE training data, and the other is built from the rest of the text corpora. Both LMs contain 15 million 4-grams, 10 million 3-grams, 780K 2-grams, and 300K 1-grams. The LM vocabulary is limited to the 300K most frequent words and, for each evaluation dataset, the OOV terms are removed from the language model. Grapheme-to-phoneme conversion is carried out with the Cotovia software [122].

The STD subsystem integrates the Kaldi term detector [73–75], which searches for the input terms within the word lattices obtained in the previous step. To do so, these lattices are processed using the lattice indexing technique described in [123], so that the lattices of all the utterances in the search collection are converted from individual weighted finite state transducers (WFSTs) into a single generalized factor transducer structure in which the start time, end time, and lattice posterior probability of each word token are stored as 3-dimensional costs. This factor transducer is effectively an inverted index of all word sequences seen in the lattices. Thus, given a list of terms, a simple finite state machine is created that accepts each term and is composed with the factor transducer to obtain all occurrences of the terms in the search collection. The Kaldi decision maker makes a YES/NO decision for each detection based on the term-specific threshold (TST) approach presented in [49]: a detection is assigned a YES decision when its confidence score satisfies

$$ p > \frac{N_{\text{conf}}}{\frac{T}{\beta}+\frac{\beta-1}{\beta}N_{\text{conf}}}, $$
(4)

where p is the confidence score of the detection, Nconf is the sum of the confidence scores of all the detections of the given term, β is set to 999.9 (as in Eq. 1), and T is the length of the audio in seconds.
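For illustration, the following sketch applies the TST rule of Eq. 4 to a single detection; the function name and example values are ours, not Kaldi's.

```python
# Minimal sketch of the term-specific threshold (TST) decision rule (Eq. 4):
# a detection is accepted when its lattice posterior p exceeds a threshold
# that depends on the total evidence N_conf accumulated for the term.

def tst_decision(p, n_conf, T, beta=999.9):
    """p: confidence score of one detection; n_conf: sum of the scores of
    all detections of the term; T: audio length in seconds."""
    threshold = n_conf / (T / beta + (beta - 1.0) / beta * n_conf)
    return "YES" if p > threshold else "NO"

# Example: a 0.7-confidence detection of a term with 2.1 total evidence
# in one hour of audio (threshold comes out at about 0.37 here).
print(tst_decision(0.7, 2.1, T=3600.0))
```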

The proxy-words strategy of the Kaldi open-source toolkit [76] is employed for OOV term detection. This strategy consists in substituting each OOV word of the search term with acoustically similar INV proxy words, so that the search for OOV terms can be carried out using the resulting INV term or terms.

3.1.2 Phone-based STD system

The phone-based STD system is applied for INV and OOV term detection and follows a probabilistic retrieval model for information retrieval. This model consists of the following stages:

  • Indexing. The lattice-to-phone-lattice tool of the Kaldi toolkit [73] is employed to produce phone lattices from the word lattices output by the LVCSR system described above. Then, n-best lists (with n = 40) are created from the phone lattices and indexed in terms of phone n-grams of different sizes [124, 125]. The minimum and maximum sizes of the n-grams are set to 1 and 5, respectively, following [125]. In the probabilistic retrieval model used in this system, each spoken document is represented by means of a language model [126]. In this case, given that the phone transcriptions contain errors, several hypotheses per transcription are used to improve the quality of the language model. The start time and duration of each phone are also stored in the index.

  • Search. A phonetic transcription of the term is first obtained using the grapheme-to-phoneme model for Spanish included in the Cotovia software [122]. Then, the term is searched within the different indices, and a score for each spoken document is computed following the query likelihood retrieval model [127]. This model sorts the spoken documents according to how likely they are to contain the term, but the start and end times of each match are also required in this task. To obtain these times, the phone transcription of the term T is aligned to that of the spoken document D by computing the minimum edit distance MED(T,D). This allows the recovery of the start and end times, since they are stored in the index. In addition, the MED is used to penalize the score returned by the query likelihood retrieval model (Lopez-Otero et al.: Probabilistic information retrieval models for query-by-example spoken document retrieval, submitted) (i.e., scoreLM(T,D)), as follows:

    $$ \mathrm{score(T,D)} = \mathrm{score_{LM}}(T,D)\cdot \mathrm{score_{MED}}(T,D), $$
    (5)

    where scoreMED(T,D) is a score between 0 and 1 derived from MED(T,D) and computed as

    $$ \mathrm{score_{MED}}(T,D) = \frac{n_{\mathrm{T}}-\text{MED}(T,D)}{K}, $$
    (6)

    where nT is the number of phonemes of the term, and K is the length of the best alignment path.

Indexing and search are performed using Lucene.
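The following sketch illustrates how the MED-penalized score of Eqs. 5 and 6 could be computed. It assumes the query likelihood score is already available and that the term is aligned against the phone transcription of the matched document segment; all names are illustrative, and the submitted system may differ in details such as the exact edit costs.

```python
# Minimal sketch of the MED-penalized retrieval score (Eqs. 5 and 6).

def med_and_path_length(t, d):
    """Minimum edit distance between phone strings t and d, plus the length
    K of the best alignment path (edit operations, including matches)."""
    n, m = len(t), len(d)
    dist = [[0] * (m + 1) for _ in range(n + 1)]  # dist[i][j]: MED of t[:i], d[:j]
    path = [[0] * (m + 1) for _ in range(n + 1)]  # path[i][j]: best path length
    for i in range(1, n + 1):
        dist[i][0], path[i][0] = i, i
    for j in range(1, m + 1):
        dist[0][j], path[0][j] = j, j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (t[i - 1] != d[j - 1])
            best = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
            dist[i][j] = best
            # Track the path length of one optimal predecessor.
            if best == sub:
                path[i][j] = path[i - 1][j - 1] + 1
            elif best == dist[i - 1][j] + 1:
                path[i][j] = path[i - 1][j] + 1
            else:
                path[i][j] = path[i][j - 1] + 1
    return dist[n][m], path[n][m]

def combined_score(score_lm, term_phones, segment_phones):
    """score_lm: query likelihood score; segment_phones: phone transcription
    of the matched document segment (an assumption for this sketch)."""
    med, k = med_and_path_length(term_phones, segment_phones)
    score_med = (len(term_phones) - med) / k   # Eq. 6
    return score_lm * score_med                # Eq. 5
```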

3.1.3 Fusion

Discriminative calibration and fusion [128] are applied to combine the outputs of the word- and phone-based STD systems described above. The global minimum score produced by the system over all terms is used to hypothesize the missing scores. After normalization, the calibration and fusion parameters are estimated by logistic regression on the development dataset to obtain discriminative, well-calibrated scores [129]. Calibration and fusion training are performed using the Bosaris toolkit [130].

The decision threshold, the weight of the LM in the word-based system, and the number of n-best lists in the phone-based system are tuned for the MAVIR and RTVE development data on each individual development dataset. However, for all the test data (i.e., MAVIR, RTVE, and COREMAH), these parameters are tuned on the combined ground-truth labels of the MAVIR and RTVE development data, aiming to avoid overfitting. The rest of the parameters are set based on preliminary experiments.

3.2 Kaldi+proxy words-based STD system (Proxy Kaldi)

This system is the word-based STD system described in Section 3.1.

3.3 Phone-based STD system (Phone-based)

This system is the phone-based STD system explained in Section 3.1.

3.4 Kaldi-based DNN system (Kaldi-DNN)

This system, whose architecture is presented in Fig. 3, is based on an LVCSR system constructed with the open-source Kaldi toolkit [73]. Specifically, the design of the system relies on the s5 Wall Street Journal (WSJ) recipe in Kaldi. The acoustic features are 13 MFCCs with cepstral mean and variance normalization (CMVN) to reduce channel effects. Linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT), and feature-space maximum likelihood linear regression (fMLLR) are also applied to obtain more robust features. The training of the acoustic models begins with a flat initialization of context-independent phone HMMs. Then, several re-training and alignment passes are performed to obtain context-dependent phone HMMs, following the standard procedures of the Kaldi s5 WSJ recipe.

Fig. 3 Architecture of the Kaldi-DNN STD system

These phone models consist of three HMM states, each in a tied-pdf cross-word triphone context with Gaussian mixture models (GMMs). The GMM-HMM model is then speaker-adapted by means of a subspace Gaussian mixture model (SGMM), as described in [131], using fMLLR features and sharing the same Gaussian model. The GMM-HMM also produces the alignments for training the DNN-based acoustic model (DNN-HMM). The DNN contains 2 hidden layers with 300 nodes each. Nine frames are spliced to produce 360-dimensional input vectors for the first layer. The output layer is a soft-max layer representing the log-posteriors of the context-dependent states.

The data used to train the acoustic models comprise the TC-STAR data recorded from 2005 to 2007, which contain more than 26 h of speech; a subset of the dev1 set of the RTVE data, which amounts to about 14 h of speech; and the MAVIR training data, which amount to more than 4 h of speech, giving about 45 h of speech material in total. Overlapped speech is removed from the dev1 set of RTVE, so eventually 44 h of speech are used for acoustic model training.

The data used for language model training include the text transcriptions of the data used for acoustic model training, which contain 425K word occurrences. These text transcriptions are given to the SRILM toolkit [121] to create a trigram-based LM, which consists of 38K trigrams, 155K bigrams, and 23K unigrams. The system vocabulary consists of the distinct words in the training data which, after removing the OOV words, amount to 23K words. A multilingual G2P transcriber is employed to obtain the phone transcription of each word.

The Kaldi decoder generates word lattices using the DNN-HMM acoustic models. The STD subsystem, which takes the word lattices as input, includes the Kaldi term detector and the Kaldi decision maker explained for the Combined Kaldi system.

The proxy words strategy in the Kaldi open-source toolkit [76] is employed for OOV term detection.

All the system parameters are selected based on preliminary experiments, and no additional tuning from development data is carried out.

3.5 Kaldi-based SGMM system (Kaldi-SGMM)

This system is the same as the Kaldi-DNN system but SGMMs are employed for acoustic modeling in the Kaldi-based LVCSR system.

3.6 Kaldi-based GMM system (Kaldi-GMM)

This system is the same as the Kaldi-DNN system but GMMs are employed for acoustic modeling in the Kaldi-based LVCSR system.

3.7 Combined synthetic-speech DTW system (Combined synt-DTW)

This system, whose architecture is shown in Fig. 4, aims to completely overcome the OOV word issue of text-based approaches. To do so, the written form of the term is synthesized to generate a spoken query that is then given to a QbE-STD system to hypothesize detections.

Fig. 4 Architecture of the Combined Synt-DTW STD system

3.7.1 Generation of multiple spoken queries

Two different text-to-speech (TTS) tools are used for spoken query generation: the Google TTS (gTTS) Python library and command-line interface (CLI) tool [132], which provides two different female voices (es-ES and es-US), and the Cocoa TTS interface in macOS [133], which provides five different voices (three male, two female) covering both European and American Spanish. In this way, seven spoken queries q1,q2,…,q7 are synthesized for each textual form of the term.
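As an illustration of this step, the following sketch synthesizes several spoken queries per term using the gTTS package and the macOS `say` command. The specific voice names and accent settings are assumptions for the sketch, not the exact configuration used by the participants.

```python
# Minimal sketch of multi-voice spoken query generation.
import subprocess
from gtts import gTTS

def synthesize_queries(term, out_stem):
    paths = []
    # Google TTS: two Spanish accents selected via the `tld` argument
    # (accent choices here are illustrative).
    for i, tld in enumerate(["es", "com.mx"]):
        path = f"{out_stem}_gtts{i}.mp3"
        gTTS(text=term, lang="es", tld=tld).save(path)
        paths.append(path)
    # macOS TTS via the `say` CLI: one file per Spanish voice
    # (voice names are examples and depend on the installed voices).
    for voice in ["Monica", "Jorge", "Paulina"]:
        path = f"{out_stem}_{voice}.aiff"
        subprocess.run(["say", "-v", voice, "-o", path, term], check=True)
        paths.append(path)
    return paths

queries = synthesize_queries("cambio climático", "query01")
```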

3.7.2 Voice activity detection (VAD)

The synthesized spoken queries and the audio documents are passed through a VAD system. Specifically, the Python interface to the VAD module developed by Google for the WebRTC project [134] is employed. This VAD strategy is based on Gaussian distributions of speech and non-speech features.
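The following sketch shows how the WebRTC VAD can be applied frame by frame through the py-webrtcvad Python bindings; the frame length and aggressiveness mode are illustrative choices.

```python
# Minimal sketch of frame-level VAD with py-webrtcvad; the audio is
# assumed to be 16-bit mono PCM.
import webrtcvad

def speech_frames(pcm_bytes, sample_rate=16000, frame_ms=30, mode=2):
    """Yield (start_time_s, frame) for frames classified as speech.
    WebRTC VAD accepts 10, 20, or 30 ms frames at 8/16/32/48 kHz."""
    vad = webrtcvad.Vad(mode)  # 0 (permissive) .. 3 (aggressive)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes/sample
    for offset in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[offset:offset + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            yield offset / (2 * sample_rate), frame
```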

3.7.3 Feature extraction

The feature extraction consists in stacked bottleneck feature (sBNF) computation following the BUT/Phonexia approach [135], both for the synthesized spoken queries and for the audio documents. Three different neural networks are applied, each trained to classify a different set of acoustic units and later optimized for language recognition tasks. The first network is trained on telephone speech from the English Fisher corpus [136] with 120 monophone state targets and will be referred to as FisherMono. The second is also trained on the Fisher corpus, but with 2423 triphone tied-state targets, and will be referred to as FisherTri. The third network is trained on telephone speech from 17 languages included in the IARPA BABEL program [137], with 3096 stacked monophone state targets (BabelMulti for short). Given that the sBNF extractors are trained on 8 kHz speech signals, the documents and the synthesized spoken queries are downsampled to 8 kHz.

The architecture of the sBNF networks consists of two stages. The first stage is a standard bottleneck network fed with low-level acoustic features spanning 10 frames (100 ms), producing an 80-dimensional bottleneck feature vector. The second stage takes five equally spaced bottleneck feature vectors from the first stage as input and is trained on the same targets, producing bottleneck features of the same size (80). The bottleneck features extracted from the second stage are known as stacked bottleneck features and constitute the output of the feature extraction module. Alternatively, the extractor can output target posteriors instead of sBNFs.

The BUT/Phonexia sBNF extractors require an external VAD module (the WebRTC VAD in our case) providing speech/non-speech information; if none is provided, a simple energy-based VAD is computed internally. This system employs the WebRTC VAD module.

The initial plan for the feature extraction stage was to employ the BUT/Phonexia posteriors, but the huge number of FisherTri (2423) and BabelMulti (3096) targets would require some kind of selection, clustering, or dimensionality reduction. Therefore, given that, at least theoretically, the same information is conveyed by sBNFs of a suitably low dimensionality (80 in this case), sBNFs are employed. This choice, however, may come at a price. Posteriors have a clear meaning, they can be linearly combined, and their values fall within the range [0,1], which makes the − log cos(α) distance well defined and non-negative, where α is the angle between two vectors of posteriors. Bottleneck layer activations, on the other hand, have no clear meaning, it is not known whether they can be linearly combined (e.g., for computing an average query from multiple query instances), and their values are unbounded, so the − log cos(α) distance no longer applies.

3.7.4 Average query construction

From the synthesized spoken queries, the longest query is taken as the reference and optimally aligned to each of the other queries by means of a standard DTW procedure. Let ql be the sequence of ml VAD-filtered sBNF vectors of the reference query, and let qi be the sequence of mi vectors of another synthesized query. The alignment starts at [1,1], ends at [ml,mi], and involves L alignments, such that each feature vector of ql is aligned to a sequence of vectors of qi. This is repeated for all the synthesized queries, and a set of feature vectors Sj is obtained from the alignments with each feature vector ql[j], j=1,2,…,ml. Then, each ql[j] is averaged with the feature vectors in Sj to obtain a single average query, as follows:

$$ q_{\text{avg}}[j] = \frac{1}{1+|S_{j}|} \left(q_{l}[j] + \sum_{v \in S_{j}} v\right) \quad j=1, 2, \ldots,m_{l}. $$
(7)

Finally, the average query qavg is used to search for occurrences in the audio documents using the DTW-based approach explained next.
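A minimal numpy sketch of this average query construction (Eq. 7) is given below, assuming a full DTW alignment and a pluggable frame-level distance (e.g., the cosine-based distance of Eq. 8). Function names are ours, and the implementation favors clarity over speed.

```python
# Minimal sketch of average query construction via DTW alignment (Eq. 7).
import numpy as np

def dtw_path(q_ref, q_other, dist):
    """Full DTW alignment from [0, 0] to [-1, -1]; returns the list of
    (i, j) frame alignments on the optimal path. `dist` is a callable on
    two feature vectors."""
    m, n = len(q_ref), len(q_other)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = dist(q_ref[i - 1], q_other[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

def average_query(queries, dist):
    ref = max(queries, key=len)              # longest query as reference
    aligned = [[] for _ in range(len(ref))]  # S_j for each reference frame
    for q in queries:
        if q is ref:
            continue
        for i, j in dtw_path(ref, q, dist):
            aligned[i].append(q[j])
    # Eq. 7: average each reference frame with all frames aligned to it.
    return np.array([(ref[j] + sum(s)) / (1 + len(s))
                     for j, s in enumerate(aligned)])
```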

3.7.5 Dynamic time warping-based search

To search for the spoken queries in the audio documents, the system follows the DTW-based approach presented in [138]. Given two sequences of sBNFs corresponding to a spoken query and an audio document, a VAD system is used to discard non-speech frames, while keeping the timestamp of each frame. To avoid memory issues, audio documents are split into chunks of 5 min with 5-s overlap and processed independently. This chunking process is key to the speed and feasibility of the search procedure. Let q=(q[1],q[2],…,q[m]) be the VAD-filtered sequence corresponding to a query of length m and x=(x[1],x[2],…,x[n]) that of an audio document of length n. Since sBNFs (theoretically) range from −∞ to +∞, the distance between any pair of vectors, q[i] and x[j], is defined as follows:

$$ d(q[i],x[j]) = -\log \left(1 + \frac{q[i] \cdot x[j]}{|q[i]| \cdot |x[j]|} \right) + \log 2. $$
(8)

Note that d(v,w)≥0, with d(v,w)=0 if and only if v and w are aligned and pointing in the same direction, and d(v,w)=+∞ if and only if v and w are aligned and pointing in opposite directions.

The distance matrix computed according to Eq. 8 is normalized with respect to the audio document x, as follows:

$$ d_{\text{norm}}\left(q[i],x[j]\right)=\frac{d(q[i],x[j])-d_{\text{min}}(i)}{d_{\text{max}}(i)-d_{\text{min}}(i)}, $$
(9)

where:

$$\begin{array}{@{}rcl@{}} d_{\text{min}}(i)=\min\limits_{j=1,\ldots,n} d(q[i],x[j]) \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} d_{\text{max}}(i)=\max\limits_{j=1,\ldots,n} d(q[i],x[j]). \end{array} $$
(11)

In this way, matrix values lie in the range [0,1], and a perfect match would produce a quasi-diagonal sequence of zeroes. This can be seen as a test normalization since, given a query q, the distance matrices take values in the same range (and with the same relative meaning) regardless of the acoustic conditions, the speaker, or other factors of the audio document x.
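The distance of Eq. 8 and the normalization of Eqs. 9–11 can be computed for a whole query/document pair in a few lines of numpy, as sketched below; this is an illustration under the stated definitions, not the participants' code.

```python
# Minimal sketch of the cosine-based distance (Eq. 8) and the per-query-frame
# normalization (Eqs. 9-11).
import numpy as np

def normalized_distance_matrix(q, x):
    """q: (m, d) query sBNFs; x: (n, d) document sBNFs (VAD-filtered)."""
    # Cosine similarity between every query frame and every document frame.
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = qn @ xn.T                           # in [-1, 1]
    d = -np.log(1.0 + cos) + np.log(2.0)      # Eq. 8: 0 for parallel vectors
    # Normalize each row (query frame i) to [0, 1] w.r.t. this document.
    d_min = d.min(axis=1, keepdims=True)      # Eq. 10
    d_max = d.max(axis=1, keepdims=True)      # Eq. 11
    return (d - d_min) / (d_max - d_min)      # Eq. 9
```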

It must be noted that the chunking process described above makes the normalization procedure differ from that applied in [138], since dmin(i) and dmax(i) are not computed over the whole audio document but over each chunk independently. On the other hand, considering chunks of 5 min might be beneficial, since the normalization is performed in a more local fashion, that is, better suited to the speaker(s) and acoustic conditions of each particular chunk.

The best match of a query q of length m in an audio document x of length n is defined as the one that minimizes the average distance along a crossing path of the matrix dnorm. A crossing path starts at any frame of x, k1∈[1,n], traverses a region of x that is optimally aligned to q (involving L vector alignments), and ends at a frame k2∈[k1,n]. The average distance along this crossing path is:

$$ d_{\text{avg}}(q,x) = \frac{1}{L} \sum_{l=1}^{L} d_{\text{norm}}(q[i_{l}],x[j_{l}]), $$
(12)

where il and jl are the indices of the vectors of q and x in alignment l, for l=1,2,…,L. Note that i1=1, iL=m, j1=k1, and jL=k2. The optimization procedure is O(n·m·d) in time (where d is the size of the feature vectors) and O(n·m) in space. Readers are referred to [138] for more details.

The detection score is computed as 1−davg(q,x), thus ranging from 0 to 1 and reaching 1 only for a perfect match. The starting time and the duration of each detection are obtained by retrieving the time offsets corresponding to frames k1 and k2 in the VAD-filtered audio document.
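The crossing-path search can be implemented as a subsequence DTW over the normalized distance matrix, as sketched below. One simplification should be noted: the dynamic program minimizes the accumulated distance and only divides by the path length L at the end, whereas Eq. 12 defines the optimum directly over average distances. Names are illustrative, and the next sketch reuses this best_match() function.

```python
# Minimal sketch of the crossing-path search (Eq. 12) as a subsequence DTW.
import numpy as np

def best_match(dn):
    """dn: (m, n) normalized distance matrix for one query/document pair.
    Returns (score, k1, k2) for the best crossing path."""
    m, n = dn.shape
    acc = np.full((m, n), np.inf)    # accumulated distance
    length = np.zeros((m, n), int)   # number of alignments L on the path
    start = np.zeros((m, n), int)    # starting document frame k1
    acc[0], length[0], start[0] = dn[0], 1, np.arange(n)  # start anywhere in x
    for i in range(1, m):
        for j in range(n):
            cands = [(acc[i - 1, j], i - 1, j)]
            if j > 0:
                cands += [(acc[i - 1, j - 1], i - 1, j - 1),
                          (acc[i, j - 1], i, j - 1)]
            prev, pi, pj = min(cands)
            acc[i, j] = dn[i, j] + prev
            length[i, j] = length[pi, pj] + 1
            start[i, j] = start[pi, pj]
    d_avg = acc[m - 1] / length[m - 1]        # average distance, last query frame
    k2 = int(np.argmin(d_avg))                # best ending document frame
    return 1.0 - d_avg[k2], int(start[m - 1, k2]), k2
```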

This procedure is applied iteratively to find not only the best match, but also less likely matches in the same audio document. To that end, a queue of search intervals is defined and initialized with [1,n]. Given an interval [a,b] in which the best match is found at [a∗,b∗], the intervals [a,a∗−1] and [b∗+1,b] are added to the queue (for further processing) only if the following conditions are satisfied: (1) the score of the current match is greater than a given threshold T (T=0.85); (2) the interval is long enough (at least half the query length, m/2); and (3) the number of matches (those already found plus those waiting in the queue) is less than a given threshold M (M=7). An example is shown in Fig. 5. Finally, the list of matches for each query is ranked according to the scores and truncated to the N highest scores (N=1000, though this effectively applied only in a few cases).
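The iterative procedure can then be sketched as an interval queue around the best_match() function from the previous sketch, using the values quoted above (T=0.85, intervals of at least m/2 frames, at most M=7 matches); this is an illustration, not the submitted implementation.

```python
# Minimal sketch of the iterative DTW search of Fig. 5.
from collections import deque

def iterative_search(dn, t_score=0.85, max_matches=7):
    m, n = dn.shape
    matches, queue = [], deque([(0, n - 1)])
    while queue:
        a, b = queue.popleft()
        score, k1, k2 = best_match(dn[:, a:b + 1])
        k1, k2 = k1 + a, k2 + a                  # back to document coordinates
        matches.append((score, k1, k2))
        if score > t_score:                      # condition (1): keep splitting
            for lo, hi in ((a, k1 - 1), (k2 + 1, b)):
                if (hi - lo + 1 >= m // 2        # condition (2): long enough
                        and len(matches) + len(queue) < max_matches):  # cond. (3)
                    queue.append((lo, hi))
    return sorted(matches, reverse=True)         # ranked by score
```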

Fig. 5 Example of the iterative DTW procedure: (1) the best match of q in x[1,n] is located at x[k1,k2]; (2) since the score is greater than the established threshold T, the search continues in the surrounding segments x[1,k1−1] and x[k2+1,n]; (3) x[k2+1,n] is not searched, because it is too short; (4) the best match of q in x[1,k1−1] is located at x[k3,k4]; (5) its score is lower than T, so the surrounding segments x[1,k3−1] and x[k4+1,k1−1] are not searched. The search procedure outputs the segments x[k1,k2] and x[k3,k4]

Four different DTW-based searches are carried out. Three of them employ the three sBNF sets computed in the feature extraction module (FisherMono, FisherTri, and BabelMulti). The fourth employs the concatenation of the three sBNF sets (which leads to 240-dimensional sBNFs). Each DTW search produces different term detections, which are then combined in the fusion stage.

3.7.6 Calibration and fusion

The scores produced by the different searches are transformed according to a discriminative calibration/fusion approach commonly applied in speaker and language recognition [139].

First, the so-called q-norm (query normalization) is applied, so that zero-mean and unit-variance scores are obtained per query. Then, if n different systems are fused, the detections are aligned so that only those supported by k or more systems (1≤k≤n) are retained for further processing (k=2 in this system). To build the full set of trials (potential detections), a rate of 1 trial per second is chosen (which is consistent with the evaluation script provided by the organizers). Given a detection of a query q supported by at least k systems and a system A that did not provide a score for it, there are different ways to fill this gap; here, the minimum score that A output for query q in other trials is selected. In fact, the minimum score for query q is hypothesized for all target and non-target trials of q for which system A did not output a detection score. When a single system is considered (n=1), the majority voting scheme and the filling of missing scores are skipped. In this way, a complete set of scores is prepared which, together with the ground truth (target/non-target labels) for a development set of queries, can be used to discriminatively estimate a linear transformation that will hopefully produce well-calibrated scores.
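As an illustration, the q-norm step reduces to a per-query standardization of the detection scores; the data layout below is assumed for clarity.

```python
# Minimal sketch of q-norm: zero-mean, unit-variance scores per query.
import numpy as np

def q_norm(scores_per_query):
    """scores_per_query: dict query -> list of (trial_id, score)."""
    normed = {}
    for query, trials in scores_per_query.items():
        s = np.array([score for _, score in trials])
        mu, sigma = s.mean(), s.std() or 1.0   # guard against constant scores
        normed[query] = [(trial, (score - mu) / sigma)
                         for trial, score in trials]
    return normed
```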

The calibration/fusion model is learned on the development set and then applied to both the development and test sets using the Bosaris toolkit [130]. Under this approach, and given the effective prior (in this evaluation, \(\hat {P}_{\text {target}} = C_{\text {miss}} P_{\text {target}} / (C_{\text {miss}} P_{\text {target}} + C_{\text {fa}} (1-P_{\text {target}})) = 0.001\)), the Bayes optimal threshold is applied and, at least theoretically, no further tuning would be necessary. In practice, however, if a system yields a small number of detections, hypothesized scores are used for most of the trials; as a result, the calibration/fusion model would be poorly learned and the Bayes optimal threshold would not produce good results.

The calibration/fusion parameters and the optimal decision threshold are obtained from the corresponding development set of each database (MAVIR, and dev2 for RTVE). For the COREMAH database, the optimal calibration/fusion parameters tuned on MAVIR data are employed, since the evaluation organizers did not provide any development data for that database, and the decision threshold is chosen so that the 15% of detections with the highest scores are assigned a YES decision. The parameters involved in the feature extraction and search procedures are set based on preliminary experiments.

3.8 Super-bottleneck feature-based synthetic-speech DTW system (Super-BNF synt-DTW)

This system is the same as the Combined Synt-DTW system, except that a single DTW-based search over the concatenation of the three sBNF sets is used to hypothesize term detections.

3.9 Multilingual bottleneck feature-based synthetic-speech DTW system (Multilingual-BNF synt-DTW)

This system is the same as the Super-BNF Synt-DTW system, except that DTW-based search on the BabelMulti sBNF set is used for term detection.

3.10 Monophone bottleneck feature-based synthetic-speech DTW system (Monoph.-BNF synt-DTW)

This system is the same as the Super-BNF Synt-DTW system, except that DTW-based search on the FisherMono sBNF set is used for term detection.

3.11 Triphone bottleneck feature-based synthetic-speech DTW system (Triph.-BNF synt-DTW)

This system is the same as the Super-BNF Synt-DTW system, except that DTW-based search on the FisherTri sBNF set is used for term detection.

4 Evaluation results and discussion

4.1 Overall results

The overall evaluation results are presented in Figs. 6 and 7 for development and test data, respectively. They show that the best performance in terms of the MTWV and ATWV metrics corresponds to the Combined Kaldi system. Development and test data largely yield the same system ranking. However, as explained next, this does not mean that the best system on development data is also the best system on test data for every database; threshold calibration issues account for these discrepancies in the overall results.

Fig. 6 Overall system results of the ALBAYZIN 2018 STD evaluation on development data

Fig. 7 Overall system results of the ALBAYZIN 2018 STD evaluation on test data

4.2 Development data

4.2.1 MAVIR

System results for the MAVIR development data are presented in Fig. 8. The best performance is obtained with the Kaldi-DNN system, for which the small performance gap between MTWV and ATWV suggests that the threshold has been well calibrated. This best performance is statistically significant in a paired t test (p<0.01) with respect to the Phone-based system and the systems that employ the QbE-STD approach (i.e., Combined Synt-DTW, Super-BNF Synt-DTW, Multilingual-BNF Synt-DTW, Monoph.-BNF Synt-DTW, and Triph.-BNF Synt-DTW), and weakly significant (p<0.03) with respect to the Combined Kaldi system. Among the systems that employ a text-based STD approach, the Phone-based system degrades STD performance compared with the other text-based STD systems. Although this system is based on word ASR to produce word lattices, these are then converted to phone n-grams for search, so the word information is lost; this can be critical for the highly spontaneous, low-quality speech of the MAVIR data. Nevertheless, phone-based systems typically offer fast search and indexing, and the possibility of detecting OOV terms with no additional system development. The text-based STD systems that employ word ASR and word lattices for search do not present statistically significant differences among themselves in a paired t test, and hence they should be considered equivalent from an STD perspective. This indicates that the small difference in OOV rate in the development data according to Table 9 (5.5%−0.3%=5.2%) is not statistically significant. The systems that employ a QbE-STD approach obtained a remarkably low performance, which may be due to the following factors: (1) an acoustic mismatch between the synthesized queries and the test audio, which might lead to low scores and block the iterative DTW detection procedure; (2) the use of bottleneck layer activations as frame-level acoustic representation, which might be incompatible with the query averaging procedure (which worked well with phone posteriors); and (3) the absence of lexical information, since no ASR system is employed.

Fig. 8 System results of the ALBAYZIN 2018 STD evaluation on MAVIR development data

Table 9 Percentage of MAVIR INV terms that do not appear in the LVCSR system vocabulary (only for word-based STD systems)

4.2.2 RTVE

System results for the RTVE development data are presented in Fig. 9. The best performance is obtained with the Combined Kaldi system, for which the very small performance gap between MTWV and ATWV suggests that the threshold has been optimally calibrated. This best performance is statistically significant in a paired t test (p<0.01) compared with the rest of the systems, except for the Phone-based system, for which the improvement is weakly significant (p<0.02), and the Proxy Kaldi system, for which the performance gap is not significant. This ranking, different from that of the MAVIR development data, is due to the large difference in OOV rate (52.6%−5.1%=47.5%) between the Kaldi-DNN and Combined Kaldi systems, as presented in Table 10. It is worth mentioning that there is no significant difference in a paired t test between the Proxy Kaldi and the Phone-based systems. This suggests that phone-based systems are able to perform similarly to word-based systems on high-quality, well-pronounced speech such as that of the RTVE data. The systems that employ the QbE-STD approach obtain the worst results, probably due to the same causes mentioned in the previous section.

Fig. 9 System results of the ALBAYZIN 2018 STD evaluation on RTVE development data

Table 10 Percentage of RTVE terms that do not appear in the LVCSR system vocabulary (only for word-based STD systems)

4.3 Test data

4.3.1 MAVIR

System results for the MAVIR test data are presented in Fig. 10. The best performance is obtained with the Proxy Kaldi system, for which the small gap between MTWV and ATWV suggests that the threshold calibration works well. This best performance is statistically significant for a paired t test (p<0.01) compared with all the systems except the Kaldi-DNN and Kaldi-SGMM systems. On the one hand, the low performance of the Combined Kaldi system indicates some calibration issues in the fusion stage. This is confirmed by the low performance of the Phone-based system, which indicates that the parameter tuning on MAVIR development data does not generalize well to unseen data. On the other hand, the Proxy Kaldi system incorporates INV and OOV term detection in a common framework, and hence it is more robust against calibration issues. The differences in OOV rate shown in Table 9 between the Proxy Kaldi, Kaldi-DNN, and Kaldi-SGMM systems (20.3%−5.2%=15.1%) do not produce a statistically significant reduction in ATWV, which suggests that robust acoustic models along with effective OOV term detection can mitigate the OOV issue in low-quality and highly spontaneous speech domains. Again, the systems based on a QbE-STD approach obtained much lower performance.
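Kaldi's proxy-word strategy is implemented with WFST composition over the lexicon and a phone confusion model; the following much-simplified sketch (the lexicon format and function names are illustrative assumptions) conveys the core idea by mapping an OOV pronunciation to the closest in-vocabulary pronunciation under phone edit distance:

def phone_edit_distance(a, b):
    # Levenshtein distance between two phone sequences, single-row DP.
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

def best_proxy(oov_phones, lexicon):
    # lexicon: {word: phone sequence}; returns the closest INV word.
    return min(lexicon, key=lambda w: phone_edit_distance(oov_phones, lexicon[w]))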

Fig. 10 System results of the ALBAYZIN 2018 STD evaluation on MAVIR test data

4.3.2 RTVE

System results for the RTVE test data are presented in Fig. 11. The best performance corresponds to the Proxy Kaldi system, for which MTWV and ATWV are very close, indicating an almost perfect threshold calibration. This best performance is statistically significant for a paired t test (p<0.01) with respect to the rest of the systems, except the Combined Kaldi and Phone-based systems. Similar findings to those on the RTVE development data arise: (1) The large difference in OOV rate shown in Table 10 for test data (66.8%−6.6%=62.2%) produces large differences in ATWV among the word-based STD systems. (2) The results of the Phone-based system can be considered statistically equivalent to those obtained with the Proxy Kaldi system, which highlights the usefulness of phone n-grams for term detection in an open-vocabulary STD system.

Fig. 11 System results of the ALBAYZIN 2018 STD evaluation on RTVE test data

4.3.3 COREMAH

System results for the COREMAH test data are presented in Fig. 12. The best performance is obtained with the Monoph.-BNF Synt-DTW system, although all of the systems obtained very low MTWV/ATWV results. This best performance is statistically significant for a paired t test (p<0.01) compared with the Combined Kaldi, Kaldi-SGMM, and Kaldi-GMM systems, and weakly significant compared with the Proxy Kaldi (p<0.04), Phone-based (p<0.02), and Kaldi-DNN (p<0.03) systems. The low performance obtained on these data may be due to the following factors: (1) these data contain overlapping speech, which significantly reduces ASR performance, and (2) no training/development data belonging to this domain were available, which prevents the systems from being properly tuned to these data. This is especially critical for systems based on word speech recognition, which typically need a larger dataset for system construction than systems based on QbE-STD, since the latter rely only on template matching of features. For systems based on word ASR, the threshold calibration issue is more important, as can be seen from the gap between MTWV and ATWV. Table 11 shows the OOV rate of the word-based systems. In this case, the OOV rate is not as critical to STD performance as the change in the data domain.

Fig. 12 System results of the ALBAYZIN 2018 STD evaluation on COREMAH test data

Table 11 Percentage of COREMAH test terms that do not appear in the LVCSR system vocabulary (only for word-based STD systems)

4.4 Development and test data DET curves

DET curves of the systems submitted to the evaluation are presented in Figs. 13 and 14 for MAVIR and RTVE development data, respectively, and in Figs. 15, 16, and 17 for MAVIR, RTVE, and COREMAH test data, respectively.

Fig. 13 The DET curves of the STD systems for MAVIR development data

Fig. 14 The DET curves of the STD systems for RTVE development data

Fig. 15 The DET curves of the STD systems for MAVIR test data

Fig. 16 The DET curves of the STD systems for RTVE test data

Fig. 17 The DET curves of the STD systems for COREMAH test data

On MAVIR development data, the Proxy Kaldi system performs the best at low and high FA rates, and the Kaldi-DNN and Kaldi-SGMM systems perform the best at moderate FA rates. On RTVE development data, the Proxy Kaldi system performs the best at low and high FA rates, and the Combined Kaldi system performs the best at moderate FA rates. According to the MTWV/ATWV results (see Figs. 8 and 9), this means that the best operating point lies at moderate FA rates for both datasets.

On MAVIR test data, the Proxy Kaldi system performs the best at all the operating points, as expected from the MTWV/ATWV results (see Fig. 10). On RTVE test data, the Proxy Kaldi system performs the best at low FA rates, and the Combined Kaldi system performs the best at low miss rates. According to the MTWV/ATWV results (see Fig. 11), this means that the best operating point lies at low FA rates. On COREMAH test data, the Combined Kaldi system performs the best at low FA rates, and the Proxy Kaldi system performs the best at low miss rates. This differs from the best ATWV, which is obtained with the Monoph.-BNF Synt-DTW system (see Fig. 12). However, the Monoph.-BNF Synt-DTW system outputs only one detection as a hit (the one with the highest score) and no FAs. As a result, any other system working at different miss/FA ratios can exhibit a better DET curve whenever there are FAs with higher scores than the hits.
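For reference, a DET curve such as those above can be traced by sweeping the decision threshold over the detection scores. A minimal sketch, pooling detections over all terms for simplicity (the official metrics average per term), is:

import numpy as np

def det_curve(scores, labels, n_true, n_nontarget):
    # scores: detection scores; labels: 1 for hits, 0 for false alarms;
    # n_true: total reference occurrences; n_nontarget: non-target trials.
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    order = np.argsort(-scores)          # accept highest-scoring detections first
    sorted_labels = labels[order]
    hits = np.cumsum(sorted_labels)      # hits accepted as the threshold lowers
    fas = np.cumsum(1 - sorted_labels)   # false alarms accepted
    p_miss = 1.0 - hits / n_true
    p_fa = fas / n_nontarget
    return p_miss, p_fa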

5 Post-evaluation analysis

After the evaluation period, an analysis based on several term properties, together with a fusion of the primary systems submitted by the different participants, has been carried out. This section presents the results of this analysis.

5.1 Performance analysis of STD systems for in-language and out-of-language terms

An analysis of the STD performance has been carried out for in-language (INL) and out-of-language (OOL) terms, and results are presented in Fig. 18 for MAVIR, RTVE, and COREMAH test data. As expected, a large STD performance degradation is found from INL to OOL terms for all the databases in the systems that employ a text-based STD approach. Some discrepancies appear on the COREMAH test data (i.e., the Kaldi-DNN, Kaldi-SGMM, and Kaldi-GMM systems), although the STD performance is so low that any conclusion is unreliable. However, for the systems that employ the QbE-STD approach, the differences between INL and OOL terms are not so clear; especially on MAVIR data, systems generally perform better for OOL term detection than for INL term detection. This may be because Spanish was not employed in the feature extraction (English and other IARPA Babel program languages were), and the OOL terms in this database are mainly English terms.

Fig. 18 System results of the ALBAYZIN 2018 STD evaluation on test data for in-language (INL) and out-of-language (OOL) terms

5.2 Performance analysis of STD systems for single and multi-word terms

An analysis of the STD performance has been carried out for single-word and multi-word terms, and results are presented in Fig. 19 for MAVIR, RTVE, and COREMAH test data. The results show some differences depending on the database.

Fig. 19 System results of the ALBAYZIN 2018 STD evaluation on test data for single-word (Single) and multi-word (Multi) terms

On MAVIR test data, the Kaldi-DNN, Kaldi-SGMM, Kaldi-GMM, and Proxy Kaldi systems perform better for single-word term detection than for multi-word term detection. This probably happens because multi-word term detection is intrinsically more difficult for word-based ASR, since more words must be detected. However, on RTVE test data, the Kaldi-DNN, Kaldi-SGMM, and Kaldi-GMM systems perform better for multi-word term detection than for single-word term detection. This might be caused by the fact that there are many more OOV single-word terms (202) than OOV multi-word terms (9), which leads to a dramatic degradation in the final performance for single-word terms.

On the other hand, the Combined Kaldi and Phone-based systems perform better for multi-word term detection on MAVIR and RTVE test data. The Phone-based system relies on a subword unit approach, and multi-word terms are typically longer than single-word terms. Short terms tend to produce many FAs in phone-based systems, and the opposite holds for longer terms. Therefore, phone-based systems may obtain better performance for multi-word term detection. The Combined Kaldi system's performance for multi-word terms seems to be highly influenced by the Phone-based system.

The systems that employ a QbE-STD approach obtain better performance for multi-word term detection than for single-word term detection on MAVIR and RTVE test data, since multi-word terms are typically longer than single-word terms, so fewer false alarms are produced by the DTW search. In addition, these systems typically perform better for multi-word term detection than the Kaldi-DNN, Kaldi-SGMM, and Kaldi-GMM systems. This indicates that QbE-STD approaches can be effectively employed for detecting long terms in the absence of robust word-based LVCSR systems.
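The DTW search referred to above locates the best-matching region of each utterance for a synthesized query. A minimal subsequence-DTW sketch, assuming frame-level feature matrices (e.g., bottleneck features), cosine distance, and a length-normalized path cost (all assumptions here), is:

import numpy as np

def subsequence_dtw(query, utterance):
    # query: (Q, D) and utterance: (U, D) frame-level feature matrices.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T                          # cosine distance matrix (Q, U)
    acc = np.empty_like(dist)
    acc[0] = dist[0]                              # the match may start anywhere
    for i in range(1, len(q)):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, len(u)):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # vertical step
                                         acc[i, j - 1],      # horizontal step
                                         acc[i - 1, j - 1])  # diagonal step
    end = int(np.argmin(acc[-1]))                 # best end frame of the match
    return acc[-1, end] / len(q), end             # length-normalized cost

Normalizing the path cost by the query length is one reason longer (multi-word) queries can score more reliably: spurious low-cost matches are less likely to persist over many frames.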

On COREMAH test data, the systems obtained better performance for multi-word term detection than for single-word term detection. However, the ATWVs are, in general, so low that no reliable conclusion can be drawn.

5.3 Performance analysis of STD systems for INV and OOV terms

Figure 20 shows a similar analysis for in-vocabulary and out-of-vocabulary terms. The text-based STD systems perform, in general, better for INV term detection than for OOV term detection. The Kaldi-DNN, Kaldi-SGMM, and Kaldi-GMM systems on MAVIR test data are the only exceptions. This could be due to the moderate OOV term rate (20.3%) in this dataset, along with the amount of training data used to train the INV language model. However, when the OOV term rate increases (66.8% for the RTVE test data), the proxy-word strategy of Kaldi for OOV term detection is less powerful. System performance is so low on COREMAH test data that no reliable conclusion can be derived.

Fig. 20 System results of the ALBAYZIN 2018 STD evaluation on test data for in-vocabulary (INV) and out-of-vocabulary (OOV) terms

5.4 System fusion

After the evaluation, we combined all the primary systems developed by the participants by fusing the scores they produced. System fusion consists of two stages: (1) pre-processing and (2) calibration and fusion. These are explained next.

5.4.1 Pre-processing

First, scores for each query and system are normalized to mean 0 and variance 1.

All the detections given by the fused systems are taken into account to generate the output of the fusion system. Given a detection output by some system A, if another fused system B does not detect it (and hence provides no score for it), the score assigned to that detection is the global minimum score over all the terms generated by system B.
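A minimal sketch of this pre-processing, assuming each system's output is represented as a dictionary mapping (query, detection) pairs to raw scores, is:

import numpy as np
from collections import defaultdict

def znorm_per_query(scores):
    # scores: {(query, detection_id): raw_score} for one system;
    # normalize to zero mean and unit variance per query.
    by_query = defaultdict(list)
    for (query, _), score in scores.items():
        by_query[query].append(score)
    stats = {q: (np.mean(v), np.std(v) if np.std(v) > 0 else 1.0)
             for q, v in by_query.items()}
    return {(q, d): (s - stats[q][0]) / stats[q][1]
            for (q, d), s in scores.items()}

def fill_missing(normalized, pooled_detections):
    # Detections this system missed get its global minimum score.
    floor = min(normalized.values())
    return {det: normalized.get(det, floor) for det in pooled_detections}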

5.4.2 Calibration and fusion

Calibration and fusion are carried out with the Bosaris toolkit [130]. A linear model based on logistic regression, trained on the development detection scores, is employed. The MAVIR and RTVE fusion parameters are optimized independently on their corresponding development sets and then applied to the corresponding test sets. For COREMAH data, the model trained on MAVIR data is employed.

Fusion is employed to combine the three primary systems of the participants in the evaluation (i.e., Combined Kaldi, Combined Synt-DTW, and Kaldi-DNN systems).
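As an illustration of the underlying linear logistic-regression model (using scikit-learn as a stand-in for the Bosaris toolkit, whose actual API differs), a hedged sketch could be:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed shapes: X_dev has one row per development detection with the
# (normalized) scores of the three primary systems as columns; y_dev is
# 1 for hits and 0 for false alarms.
def train_fusion(X_dev, y_dev):
    model = LogisticRegression()
    model.fit(X_dev, y_dev)
    return model

def fuse_scores(model, X_test):
    # The fused score is the calibrated log-odds of a detection being a hit.
    return model.decision_function(X_test)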

5.4.3 Fusion results

Results are presented in Figs. 21 and 22 for MAVIR and RTVE development data, respectively, and in Figs. 23, 24, and 25 for MAVIR, RTVE, and COREMAH test data, respectively. They show that system fusion plays an important role on RTVE data, for which the fusion improves on the best individual system for both development and test data. A paired t test shows that the Fusion system obtains a statistically significant improvement (p<0.01) on both RTVE sets. However, on MAVIR and COREMAH data, the fusion does not outperform the best individual system. RTVE data contain higher-quality and better-pronounced speech than MAVIR data, and much more data were available for RTVE; fusion benefits more under these conditions. On COREMAH data, for which no data are available for fine-tuning, fusion also yields worse results.

Fig. 21 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on MAVIR development data

Fig. 22 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on RTVE development data

Fig. 23 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on MAVIR test data

Fig. 24 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on RTVE test data

Fig. 25 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on COREMAH test data

DET curves of the primary systems and the fusion systems are presented in Figs. 26 and 27 for MAVIR and RTVE development data, respectively, and in Figs. 28, 29, and 30 for MAVIR, RTVE, and COREMAH test data, respectively. On MAVIR development data, the Combined Kaldi system performs the best at low FA rates, the Kaldi-DNN system performs the best at moderate FA rates, and the Fusion system performs the best at low miss rates. This means that the fusion may be suitable for scenarios in which misses are more costly than false alarms. On MAVIR test data, the Kaldi-DNN system performs the best at low FA rates and the Fusion system performs the best at low miss rates. This confirms that system fusion is suitable for low miss rates on MAVIR data. On RTVE development and test data, the Fusion system performs the best at almost all the operating points, which is consistent with the MTWV/ATWV results (see Figs. 22 and 24). On COREMAH test data, the Combined Kaldi system performs the best at almost all the operating points. On these data, the Fusion system may still play a role at low miss rates, regardless of the high FA rates.

Fig. 26 The DET curves of the fusion and primary STD systems for MAVIR development data

Fig. 27 The DET curves of the fusion and primary STD systems for RTVE development data

Fig. 28 The DET curves of the fusion and primary STD systems for MAVIR test data

Fig. 29 The DET curves of the fusion and primary STD systems for RTVE test data

Fig. 30 The DET curves of the fusion and primary STD systems for COREMAH test data

5.5 Comparison to the ALBAYZIN 2016 STD evaluation

Given that the MAVIR test data are the same for the ALBAYZIN 2016 and 2018 STD evaluations, these data can be used to compare the performance of the systems submitted to both evaluations (see Fig. 31). The highest performance in 2018 (ATWV=0.4699) is lower than that obtained in 2016 (ATWV=0.5724). In the 2018 evaluation, the decision threshold was tuned on MAVIR and RTVE data simultaneously, which produced this performance degradation. In 2016, however, the decision threshold was tuned only on MAVIR data, which produced a better threshold calibration and hence better performance.

Fig. 31 Best system performances obtained in the ALBAYZIN STD 2016 and 2018 evaluations on MAVIR test data

Therefore, it can be said that building multi-domain STD systems still represents a research challenge, since it can lead to reduced performance on specific domains. However, multi-domain systems also present a great advantage: a single system is able to search speech from different domains.

6 Conclusions

This paper has presented a multi-domain spoken term detection international evaluation for search on speech in Spanish. The number of systems submitted to the evaluation has made it possible to compare the progress of this technology under a common framework. Three different research groups took part in the evaluation, and eleven different systems were submitted in total. Most of the systems are based on the standard text-based STD approach (with state-of-the-art DNN-based ASR systems), hypothesizing detections from word ASR; the remaining systems are based on a QbE-STD framework. Among the submitted systems, the Combined Kaldi and Phone-based systems, which include a probabilistic retrieval model for information retrieval and a query likelihood retrieval model, and the Combined Synt-DTW, Super-BNF Synt-DTW, Multilingual-BNF Synt-DTW, Monoph.-BNF Synt-DTW, and Triph.-BNF Synt-DTW systems, which employ speech synthesis to generate queries from the term list followed by a QbE-STD search, can be considered novel from an STD perspective.

The most important conclusion from this evaluation is that multi-domain STD is still a challenge in STD research, since results have shown high variability under domain mismatch. On the one hand, the submitted systems obtained the best performance on RTVE data, for which more data are available for system construction and whose speech is high-quality and well pronounced. On the other hand, the systems obtained the worst performance on COREMAH data, for which only test data were provided and whose speech is very spontaneous with a high degree of overlap. This indicates that domain change is quite challenging in STD tasks. Finally, on MAVIR data, which contain highly spontaneous speech, the systems obtained performance between that on RTVE and COREMAH data.

We have also shown that OOL term detection remains an important challenge in STD, since systems obtained low performance on those terms (ATWV=0.1392 on MAVIR data and ATWV=0.3496 on RTVE data). For OOV term detection, which is crucial for open-vocabulary STD, systems obtained best performances of ATWV=0.5284 on MAVIR data and ATWV=0.3600 on RTVE data (i.e., for the domains in which training/development data were provided). Regarding multi-word term detection, systems obtained best performances of ATWV=0.4595 on MAVIR data and ATWV=0.5967 on RTVE data.

Given the best overall result obtained in the evaluation (ATWV=0.2250), which comes from averaging over the three domains, there is still ample room for improvement. Specifically, the performance of STD systems degrades dramatically when applied to unseen data. This encourages us to maintain the STD evaluation in the coming years, focusing on multi-domain STD and the applicability of this technology to unseen, challenging domains. Specifically, in the coming months we will be launching the ALBAYZIN 2020 STD evaluation, to be held in November 2020 within the IberSPEECH conference. This new evaluation edition aims to provide new domains, more challenging data (i.e., more difficult search terms), and more challenging evaluation conditions (i.e., ranking the submitted systems by weighting system performance toward the most challenging domain).

7 Appendix

This appendix shows the full result tables for the systems submitted to the ALBAYZIN 2018 STD evaluation for development and test data.

Table 12 Overall system results of the ALBAYZIN 2018 STD evaluation on development and test data
Table 13 System results of the ALBAYZIN 2018 STD evaluation on MAVIR development data
Table 14 System results of the ALBAYZIN 2018 STD evaluation on RTVE development data
Table 15 System results of the ALBAYZIN 2018 STD evaluation on MAVIR test data
Table 16 System results of the ALBAYZIN 2018 STD evaluation on RTVE test data
Table 17 System results of the ALBAYZIN 2018 STD evaluation on COREMAH test data
Table 18 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on development data
Table 19 Fusion and primary system results of the ALBAYZIN 2018 STD evaluation on test data