Language Resources and Evaluation, Volume 44, Issue 4, pp 371–386

Remote-based text-to-speech modules’ evaluation framework: the RES framework


  • Matej Rojc
    • Faculty of Electrical Engineering and Computer Science, University of Maribor
  • Harald Höge
    • IC 5, Siemens AG, Corporate Technology
  • Zdravko Kačič
    • Faculty of Electrical Engineering and Computer Science, University of Maribor

DOI: 10.1007/s10579-009-9110-3

Cite this article as:
Rojc, M., Höge, H. & Kačič, Z. Lang Resources & Evaluation (2010) 44: 371. doi:10.1007/s10579-009-9110-3


The ECESS consortium (European Center of Excellence in Speech Synthesis) aims to speed up progress in speech synthesis technology by providing an appropriate evaluation framework. The key element of the evaluation framework is the partition of a text-to-speech (TTS) synthesis system into distributed TTS modules. Currently, a text processing, a prosody generation, and an acoustic synthesis module have been specified. A split into modules has the advantage that the developers of an institution active in ECESS can concentrate their efforts on a single module and test its performance in a complete system, using the missing modules from the developers of other institutions. In this way, complete TTS systems can be built from high-performance modules of different institutions. In order to evaluate the modules and to connect them efficiently, a remote evaluation platform, the Remote Evaluation System (RES), based on the existing internet infrastructure, has been developed within ECESS. The RES is based on a client–server architecture. It consists of RES module servers, which encapsulate the modules of the developers; a RES client, which sends data to and receives data from the RES module servers; and a RES server, which connects the RES module servers and organizes the flow of information. Developers can use RES to select, from the internet, a RES module server containing a missing TTS module that they need in order to test and improve the performance of their own modules. Finally, the RES allows for the evaluation of TTS modules running at different institutions worldwide. Using the RES client, the institution performing the evaluation is able to set up and perform various evaluation tasks by sending test data via the RES client and receiving results from the RES module servers. Currently, ELDA is setting up an evaluation using the RES client, which will then be extended to an evaluation client specializing in the envisaged evaluation tasks.


Remote text-to-speech synthesis evaluation; Text-to-speech synthesis modules; ECESS consortium

1 Introduction

As demonstrated in many EU-funded and DARPA projects, constant evaluation as a constituent part of research activities has proven to be a successful approach for enhancing progress in almost all areas of speech technology, such as speech recognition, speech synthesis, or speech translation, especially when organized in the form of evaluation campaigns (e.g. TC-STAR, Blizzard, etc.).

An evaluation framework has been established for speech synthesis technology within the ECESS consortium (European Center of Excellence in Speech Synthesis) in cooperation with the EU-funded project TC-STAR. The ECESS consortium is an open, non-funded consortium for institutions active in speech synthesis and related topics. The key element of the evaluation framework is the specification of several modules (e.g. text processing, prosody generation, and acoustic synthesis modules) that together build a complete text-to-speech system. The functionality and interfaces of the TTS modules are described in (Perez et al. 2006). This split into modules has the advantage that the developers of an institution can concentrate their efforts on a single module and test its performance in a complete system, using the missing modules from the developers of other institutions. In this way, high-performance systems can be built from the high-performance modules of different institutions. A common evaluation methodology has been developed to assess the performance of the modules (Bonafonte et al. 2006). The methodology is based on the common use of module-specific evaluation criteria and of the module-specific language resources needed for training and testing the modules.

Two evaluation campaigns were performed within the TC-STAR project, in order to evaluate the TTS modules and complete TTS systems. These evaluations were done in a ‘traditional’ way. The institution responsible for evaluation (in this case ELDA) sent out an evaluation kit (test data, evaluation scripts) and those institutions whose modules or systems were evaluated, sent back the evaluation results. This type of traditional evaluation has highlighted two main drawbacks:
  • The evaluation is not performed ‘on-line’, because the transport of test data and results has to be treated manually. Furthermore, the test data are not ‘secret’.

  • Connecting the modules of different developers cannot be handled without an exchange of software to be integrated locally.

The web-based distributed system—RES (Remote Evaluation System)—has been developed in order to avoid these drawbacks. The RES is designed not only to evaluate TTS modules but also to support the developers of TTS modules. Thus, RES is dedicated to two user groups:
  • Developers/researchers, who use RES in a test/development mode in order to improve the performance of their TTS module(s). In the following, this user group is called ‘developers’.

  • Evaluators, who use RES in an evaluation mode to measure the performance of selected TTS modules.

The architecture of RES consists of three RES components: the RES module server, which encapsulates the TTS modules; the RES client, which sends data to and receives data from the RES module servers; and a RES server (managing unit, MU), which connects the RES clients and RES module servers and organizes the flow of information. Using the RES framework, each developer places his/her TTS module (or several modules), embedded in a RES module server, on the internet. The developers of any ECESS research group and the evaluators can access these TTS modules via a locally installed RES client. Based on this architecture, evaluation can be done remotely without the need to install these modules locally, and without manual intervention in the transport of test data and results. Furthermore, each developer can combine his/her TTS module with other available TTS modules in order to test its performance within a complete TTS system.

Developers focusing on speech synthesis research do not want to spend too much time integrating their module into a RES module server. The Unforma tool has been developed in order to ease the embedding of a module into the RES module server. It simplifies the construction of the data format parsers that convert proprietary data formats into the RES system data format. An additional tool, ProtocolGen, enables the generation of the numerous RES task configurations needed for evaluating or testing various TTS modules and systems from different institutions. The RES will be further extended and modified according to the experience of the developers active in ECESS. The architecture of RES also allows for the evaluation of arbitrary software components. A testbed for this idea will be the evaluation of the tools needed to support the generation of TTS systems (e.g. pitch-marking, VAD, etc.).

The remainder of this paper is organized as follows. Section 2 describes how the remote evaluation system RES is used and presents the main functionalities of the system. Section 3 describes the integration of new TTS modules into the RES. The implementation of new evaluation or testing tasks for the RES system is described in Sect. 4. Section 5 presents the current ECESS evaluation/testing platforms, based on the RES system, that are used in evaluation campaigns, and the last section draws some conclusions.

2 Functionalities, use and installation of RES

2.1 Functionalities of the RES components

The functional architecture of RES is shown in Fig. 1. RES consists of several RES clients, the RES server (managing unit, MU), and RES module servers encapsulating the TTS modules. The core of RES is the RES server, which is responsible for interconnecting the RES clients and the RES module servers and is maintained by an administrator. All RES components are connected via the internet by TCP/IP and UDP connections. In this way, all TTS modules are accessible via the TCP/IP network. Developers have to install their TTS module locally, embedded in a RES module server. Users of RES (developers and evaluators) have to install RES clients locally. The RES client is optionally equipped with an RTP player for testing audio signals. The RES server has to be installed by the administrating institution. It can interact with an arbitrary number of RES module servers and RES clients, and it communicates with several RES clients simultaneously, thus allowing the RES module servers to serve several RES clients at the same time. When performing evaluation or testing, developers and evaluators simply select the desired TTS modules via RES clients and provide the corresponding input for the selected task. The input is then automatically transferred within the RES to the selected ECESS TTS modules, and their generated output is returned to the RES client.
Fig. 1 Functional architecture of the RES

RES users are able to perform numerous evaluation and testing tasks using the RES. Different tasks can require different architectures of RES components and different behavioural specifications. Each task performed with the RES system starts with the selection of a configuration of developers’ TTS modules suitable for the desired task (the input/output data exchange must make sense), followed by the execution of the specific actions to be carried out by each TTS module. The RES system architectures and the behaviour of the RES components are described in XML format.

All the communication protocols used (as specified by the ECESS consortium) are shown in Fig. 2. RES clients open RTSP (Real Time Streaming Protocol) sessions with the RES server, which are closed once the requested task performed by the selected developer’s RES module server is finished. The RTSP protocol runs over TCP/IP, a reliable, connection-oriented protocol (Burke 2007). Therefore, there is no need for the RES client or the RES server to implement any additional error-correction mechanisms. RTSP is used as a carrier for MRCP (Media Resource Control Protocol) (Burke 2007). Within the RES system, RTSP defines the packet content and the packet exchange sequences between RES clients and the RES server. These packets also contain the MRCP content that has to be exchanged between RES clients and the RES server according to the MRCP protocol. MRCP provides the means for a client device requiring audio streams to control stream processing resources within the network. It is used to control speech synthesisers and recognizers in order to provide speech synthesis and recognition, and to stream audio from a common location to a user. It is a rapidly growing standard, gaining wide support in the speech and telecommunication markets of today. The RTSP/MRCP protocols are only used between RES clients and the RES server.

After the connection between the RES server and a RES client is established, the RES server dedicates a separate thread to the session and establishes connection(s) with the selected RES module server(s). By using an efficient thread mechanism, the RES server is able to handle many users simultaneously, even when they perform different tasks. The RES server’s connections to the selected RES module servers remain active until the task requested by the user (via the RES client) is finished and the results are obtained. Results from developers’ TTS modules are always sent back to the users (to the RES clients), and the RES server has the role of mediator in any data exchange between the RES module server(s) and RES clients. The ECESS XML-based protocol is used between the RES server and RES module server(s) for exchanging input/output data with developers’ TTS modules. TTS modules for text processing and prosody generation only exchange text data, whereas acoustic processing modules exchange text and audio data. In the latter case, the RTP protocol is used for transmitting audio data (Burke 2007). Audio data are transferred from the selected RES module server via the RES server to the RES client, where the RTP player is used at the end of the transmission. All text data exchanged in the RES system are written in the ECESS data format, which is compatible with the TC-STAR data format (Bonafonte et al. 2006).
Fig. 2 RES system and related protocols
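
The session handling described above, with one dedicated thread per client session and the RES server acting as mediator, can be illustrated with a small in-memory sketch. It is not the RES implementation: `ModuleServer`, `ResServerSketch`, `register`, and `handleSession` are hypothetical names, and the RTSP/MRCP and ECESS XML exchanges are replaced by plain method calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a remote RES module server. */
interface ModuleServer {
    String process(String input);
}

/** Sketch of a mediator that serves each client session on its own thread. */
final class ResServerSketch implements AutoCloseable {
    private final Map<String, ModuleServer> modules = new ConcurrentHashMap<>();
    private final ExecutorService sessions = Executors.newCachedThreadPool();

    /** Adds a module server to the list of available modules. */
    void register(String name, ModuleServer module) {
        modules.put(name, module);
    }

    /** Runs one session: forwards the input to the selected module and returns its output. */
    Future<String> handleSession(String moduleName, String input) {
        return sessions.submit(() -> {
            ModuleServer module = modules.get(moduleName);
            if (module == null) {
                throw new IllegalArgumentException("Unknown module: " + moduleName);
            }
            // The link to the module server is only used for the duration of this task.
            return module.process(input);
        });
    }

    @Override
    public void close() {
        sessions.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (ResServerSketch server = new ResServerSketch()) {
            server.register("text-processing", text -> text.toLowerCase());
            server.register("prosody", text -> text + " [prosody]");
            // Two sessions run concurrently on separate threads.
            Future<String> first = server.handleSession("text-processing", "Hello World");
            Future<String> second = server.handleSession("prosody", "hello world");
            System.out.println(first.get());
            System.out.println(second.get());
        }
    }
}
```

A cached thread pool is used here only to keep the sketch short; how the RES server actually manages its threads is not specified in the text.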

2.2 Use and installation of the RES components

Different configurations of the RES system are shown in Fig. 3. Currently, three configurations of the RES system allow evaluation or test/development tasks to be performed. The configuration marked as “Partner I” is dedicated to evaluators, who only need a RES client locally. This configuration is suitable for evaluating the various TTS modules available on the internet via RES module servers. The configuration marked as “Partner II” includes a RES client and a RES module server. This configuration is intended for developers who would like to test and improve their TTS modules by comparing their results with those obtained by the TTS modules of other developers. Another benefit of this configuration is that developers are able to use the TTS module of another developer in order to test their own TTS module (e.g. using the text processing module of another developer for testing their own prosody processing module). The configuration “Partner III” has only a RES module server. It is intended for developers who want to participate in evaluation campaigns and make their TTS modules available to other developers, but who have no intention of doing any testing or of running other developers’ TTS modules.
Fig. 3 Configurations to use the RES system

A RES server, as the central managing unit (MU), is installed by only one institution, which is also responsible for administrating the RES system. This administrating institution also maintains the list of all RES module servers made available by developers (IP/port access). This list is automatically sent to RES clients so that RES users are able to select among the available TTS modules. RES users can install a RES client and their RES module servers on any platform (Linux or Windows), since all components of RES are pure Java applications. Each RES client additionally contains sets of XML protocol scenario files and an XML configuration file. The RES client’s access to all other RES components, regarding TCP and UDP traffic, is set up in the XML configuration file. The sets of XML protocol scenario files are designed for performing different tasks using the RES system, as explained in more detail in Sect. 4. In this way, by using a RES client, RES users are able to select any RES module server within the RES that runs a specific TTS module. When they want to use a RES module server running an acoustic processing module from another developer, they also have to configure the IP/UDP port for the RTP protocol in the RES client XML configuration file.

Installation of the RES client is simple, since RES users just have to copy the software package into a selected directory. After starting the RES client, they have to select the desired RES task from the ‘task list’ (depending on the evaluation task). Each entry in the RES ‘task list’ identifies a set of XML files describing the required behaviour of the RES modules. Finally, they have to enter input data as specified within the evaluation campaign. The input given to the RES client is transferred via the RES server to the specified RES module server, where it is stored in a predefined file. Next, the RES module server runs the TTS module or the script specified in the XML configuration file. Finally, the TTS module or script stores the output results in a predefined file (also specified in the XML configuration file), and the RES module server then transfers its content via the RES server back to the RES client.

Some developers only want to make their TTS modules available via the internet. Encapsulation of developers’ TTS modules into the RES can be accomplished by simply using the RES module server. Each TTS module needs a dedicated RES module server, so developers who would like to encapsulate several TTS modules within the RES need several RES module servers. Encapsulation of TTS modules into the RES via a RES module server has to be done by the developers themselves. Developers just have to specify, in the XML configuration file, the name of the corresponding TTS module or of the script to be executed by the RES module server. Additionally, they have to register their TTS module within the RES system. In order to do so, the developers have to communicate the IP/port configuration information to the administrating institution that maintains the list of all available TTS modules within the RES.
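
The file-based cycle of a RES module server (store the input in a predefined file, run the TTS module or script, read the result from a predefined output file) might look roughly as follows. This is a sketch under stated assumptions: `ModuleRunner` and `ModuleServerCycle` are hypothetical names, the file names are arbitrary, and a Java lambda stands in for the external TTS module.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical hook that runs a TTS module or script on an input file. */
interface ModuleRunner {
    void run(Path inputFile, Path outputFile) throws IOException;
}

final class ModuleServerCycle {
    /** Stores the input, runs the module, and returns the content of the output file. */
    static String execute(String input, Path inputFile, Path outputFile, ModuleRunner runner)
            throws IOException {
        Files.write(inputFile, input.getBytes(StandardCharsets.UTF_8));
        runner.run(inputFile, outputFile);
        return new String(Files.readAllBytes(outputFile), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("res-module");
        Path in = dir.resolve("input.txt");
        Path out = dir.resolve("output.txt");
        // Stand-in for a real TTS module: writes the input back in upper case.
        ModuleRunner fakeModule = (i, o) -> {
            String text = new String(Files.readAllBytes(i), StandardCharsets.UTF_8);
            Files.write(o, text.toUpperCase().getBytes(StandardCharsets.UTF_8));
        };
        System.out.println(execute("hello", in, out, fakeModule));
    }
}
```

In a real deployment the runner would presumably start the external module or script, for example through `java.lang.ProcessBuilder`; that part is left out so that the sketch runs on any platform.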

3 Embedding a TTS module into a RES module server

TTS modules developed by different developers generate and use different input/output data formats. Usually, these formats differ from the ECESS data format specified in the RES system. Adapting these formats to the ECESS data format can involve great effort and can be quite time consuming. Therefore, a solution had to be found in order to speed up the evaluation process and other activities inside the ECESS consortium. Otherwise, many developers might be unable to provide resources or TTS modules for a specific evaluation task, or might not be motivated enough to do the extra work needed to participate in the evaluation campaigns. The proposed idea is that data conversion between proprietary data formats and the ECESS data format should be done automatically by the RES. For each developer’s TTS module to be executed by the RES module server, two Java parsers have to be written by the administrator: one for converting the developer’s proprietary data format into the ECESS data format, and the other for converting the ECESS data format into the developer’s proprietary data format. The JavaCC framework is used for the development of these Java parsers (Copeland 2007). Once developed, these parsers (generated as Java classes) can easily be included in the RES module server’s directory structure and specified in the XML configuration file as I/O data format conversion classes. Only these last two tasks (the RES module server configuration step) have to be carried out by the developers themselves. When the RES module server is started, the Java parsers are automatically loaded into the system and used for the run-time I/O data conversion process. When a developer’s TTS module already supports the ECESS data format, the proposed procedure is, of course, not needed.

The Unforma tool was developed in order to make the development of Java parsers using JavaCC as easy as possible. The functional architecture of this tool and its usage are illustrated in Fig. 4. The Unforma tool is composed of several compilers. The developer first specifies the parser’s name, which is usually composed of the names of the data formats involved in the conversion. The administrator then writes a parser script (*.jj), which is a “description” of how the data format conversion should be performed. The JavaCC compiler is used to generate the corresponding Java classes (*.java). For conversion in the opposite direction, a new parser script has to be written and additional Java classes have to be generated. After running the JavaCC compiler (which checks that the parser scripts are written without errors), the administrator has to compile the generated Java classes into binary class files (*.class) by using a general Java compiler (javac). Java allows binary class files to be loaded and run even within already running applications. Therefore, the generated Java parsers can be tested immediately after the compilation finishes without errors, and the administrator can check whether the generated parsers perform the conversion correctly. If there are problems, the administrator has to correct the parser scripts, repeat the compilation with the JavaCC and Java compilers, and test the parsers again, until the I/O data conversion corresponds to the given specifications. The generated parser can then be included into the developer’s RES module server. This step does not demand any re-compilation of the RES module server; only additional entries are needed in the XML configuration file. In this way, it is unnecessary to deploy a new version of the RES module server; only the generated parser classes have to be provided to the developers.
Fig. 4 Unforma tool: functional architecture
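
The run-time loading of conversion classes named in a configuration file can be sketched with standard Java reflection. The interface `FormatConverter`, the class `ConverterLoader`, and the example converter are hypothetical; the real conversion classes are generated by JavaCC and their interface is not given in the text.

```java
/** Hypothetical contract for an I/O data format conversion class. */
interface FormatConverter {
    String convert(String input);
}

final class ConverterLoader {
    /** Example converter; a real one would be a JavaCC-generated parser. */
    public static final class TrimConverter implements FormatConverter {
        @Override
        public String convert(String input) {
            return input.trim();
        }
    }

    /** Loads a converter by class name, as it might be listed in a configuration file. */
    static FormatConverter load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        // asSubclass fails fast if the named class does not implement the contract.
        return cls.asSubclass(FormatConverter.class).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        FormatConverter converter = load(TrimConverter.class.getName());
        System.out.println("[" + converter.convert("  some text  ") + "]");
    }
}
```

Because the class is resolved by name at run time, adding a new converter only requires placing its class file on the class path and naming it in the configuration, which matches the statement that no re-compilation of the RES module server is needed.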

4 Implementation of new tasks by the RES system

An important RES implementation issue is that the RES components have to cover many different task scenarios using different RES system architectures, since ECESS activities and evaluation campaigns are, and will remain, very diverse. Therefore, many different scenarios are possible for RES clients, the RES server, and RES module servers. Hard-coded implementations of the RES modules’ scenarios (behaviour) would be rigid and inefficient in this context, and could quickly turn into a ‘nightmare’ for the developer of the RES components. Additionally, the deployment of numerous new versions could lead to confusion and problems at the developers’ sites. In order to avoid such situations, all RES components have been implemented as finite-state engines using the UniMod framework (Weyns et al. 2007; Shalyto 2001).

Each RES module performs specific actions and, in general, specific sequences of these actions. Any specific sequence of actions is determined by the protocols used in the RES, by the task-specific RES architecture, and by the tasks themselves. All this can be flexibly described in the form of a finite-state machine graph, as presented in Fig. 5. Each task can be described by a set of states. Transitions between states are triggered by events, and each transition specifies the actions that have to be performed. Additional ‘guard’ functions can be used to check whether all conditions are met before a specific transition is performed. When some conditions are not met (e.g. the connection is closed) or an error occurs, the finite-state machine goes to the ‘error’ state and then returns to the ‘start’ state (s0). Such graphs can be drawn off-line, rewritten in XML in a proprietary data format, and added as new XML protocol scenario files to the RES modules. The graph traversal of the FSM (finite-state machine) engine in the RES modules is triggered by a series of randomly occurring events, such as received packets, transmitted packets, etc. This approach ensures flexible and fast configuration of all RES modules, and even remote specification of their behaviour, for many different tasks. In this way, no task or RES module behaviour is hard-coded. Instead, they are described by human-readable XML protocol scenario files. The development of new XML protocol scenario files for RES modules is expected to be performed by the RES administrating institution. From the RES users’ point of view, there would be almost no noticeable difference. They would just see additional list items in the RES client GUI, identifying a new set of XML protocol scenario files used for running a new task with the RES system.
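
The state, event, guard, and action model described above can be made concrete with a small table-driven finite-state machine. This is an illustrative sketch only: it does not use the UniMod framework, and the class `ScenarioFsm` and the event names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Minimal event-driven finite-state machine with guards, actions and an error path. */
final class ScenarioFsm {
    static final String START = "s0";
    static final String ERROR = "error";

    /** In state 'from', on 'event', if 'guard' holds: run 'action' and move to 'to'. */
    private static final class Transition {
        final String from, event, to;
        final BooleanSupplier guard;
        final Runnable action;

        Transition(String from, String event, String to, BooleanSupplier guard, Runnable action) {
            this.from = from;
            this.event = event;
            this.to = to;
            this.guard = guard;
            this.action = action;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private String state = START;
    private int errorCount = 0;

    void add(String from, String event, String to, BooleanSupplier guard, Runnable action) {
        transitions.add(new Transition(from, event, to, guard, action));
    }

    String state() { return state; }
    int errorCount() { return errorCount; }

    /** Handles one event; a missing transition or a failed guard takes the error path. */
    void fire(String event) {
        for (Transition t : transitions) {
            if (t.from.equals(state) && t.event.equals(event)) {
                if (!t.guard.getAsBoolean()) {
                    break;              // guard failed: fall through to the error path
                }
                t.action.run();
                state = t.to;
                return;
            }
        }
        state = ERROR;                  // a real engine would report the failure here
        errorCount++;
        state = START;                  // then return to the start state s0
    }

    public static void main(String[] args) {
        ScenarioFsm fsm = new ScenarioFsm();
        fsm.add("s0", "requestReceived", "s1", () -> true, () -> System.out.println("forward input"));
        fsm.add("s1", "resultReceived", "s0", () -> true, () -> System.out.println("return result"));
        fsm.fire("requestReceived");
        fsm.fire("resultReceived");
        System.out.println(fsm.state());
    }
}
```

Since the transitions are plain data, they could be read from a scenario file instead of being added in code, which is the property the text relies on to avoid hard-coded behaviour.
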
In order to write XML protocol scenario files for different tasks (for all RES components) as easily and as quickly as possible, a flexible and efficient tool called ProtocolGen has been developed. The functional architecture of this tool and its usage are shown in Fig. 6. The first step is to draw a graphical representation of the new evaluation task, taking into account the protocols used and the architecture of the RES system, which can be composed of RES clients, the RES server, and RES module servers. These graphs are finite-state machines composed of states, transitions, and the events that trigger graph traversal during RES module execution. The graphical representations must then be rewritten as XML descriptions and stored as XML protocol scenario files. The UniMod framework itself already supports an XML data format, but its finite-state machine description is quite difficult to read and to generate manually from graphical representations. Therefore, a proprietary XML format has been defined. This format, used for describing a finite-state machine, can easily be written manually from graphical representations of the desired evaluation task. Within the same tool, the proprietary XML format is then automatically converted into the UniMod XML format. For this step, the corresponding Java parser needs to be generated using the JavaCC and javac compilers, in the same way as already described for the Unforma tool. Once the UniMod XML format is generated, the developer can test the generated XML protocol scenario file for the given evaluation task. When no errors are found, the developed XML protocol scenario files can be included in the RES client. Again, from the developers’ or evaluators’ point of view, nothing changes. After they add the additional XML protocol scenario files to the corresponding RES client protocol directory, they can use the RES client and the RES module server(s) in the same way as before.

In general, the RES framework enables the construction of any desired configuration of RES modules, using any number of RES clients and RES module servers. New modules for different purposes can be added with only one line in the configuration file at the partner’s side. On the RES server side, only one line of information (IP/port) has to be added for each new module. All logic around RES system tasks is specified by XML descriptions. The use of Java additionally makes it easier for different partners to use the RES system.
Fig. 5 Task description for a RES module in the form of a finite-state machine graph

Fig. 6 ProtocolGen tool for generating new XML protocol scenario files
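
To give a feeling for what a human-readable scenario file might contain, the sketch below reads transitions from a small XML description using the standard Java DOM parser. The element and attribute names (`fsm`, `transition`, `from`, `event`, `to`) are invented for illustration; neither the proprietary RES format nor the UniMod XML format is reproduced here, and the conversion between the two is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ScenarioReader {
    /** Returns each transition of the scenario as "from --event--> to". */
    static List<String> readTransitions(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("transition");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element t = (Element) nodes.item(i);
            result.add(t.getAttribute("from") + " --" + t.getAttribute("event")
                    + "--> " + t.getAttribute("to"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical scenario: wait for a request, forward it, return to the start state.
        String xml = "<fsm start=\"s0\">"
                + "<transition from=\"s0\" event=\"requestReceived\" to=\"s1\"/>"
                + "<transition from=\"s1\" event=\"resultReceived\" to=\"s0\"/>"
                + "</fsm>";
        for (String line : readTransitions(xml)) {
            System.out.println(line);
        }
    }
}
```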

5 ECESS evaluation/testing platforms

The RES components are implemented on a standard Java-based platform. Therefore, developers and evaluators can use them on Windows or Linux platforms without any problems. This aspect is very important, since developers develop their TTS modules on different platforms. The first evaluation platform based on the RES system is shown in Fig. 7. This platform is used for remote ECESS evaluation campaigns dealing with the evaluation of text processing modules for the tasks defined within TC-STAR: normalization of non-standard words (NSWs), end-of-sentence detection, POS tagging, and grapheme-to-phoneme conversion (G2P). The novelty of this ECESS evaluation is the use of the RES framework presented in this paper. In these experiments, ELDA is responsible for setting up and running evaluations using the RES client. By using it, the evaluator is able to send the input text corpus (test set) to the RES server (MU), which distributes the text to the text processing modules run by RES module servers at the developers’ sites. Once it gets the modules’ output back, the RES server (MU) sends it to the RES evaluation client, which performs the evaluation tasks mentioned above. These evaluation campaigns are currently performed for TTS modules handling UK English as the target language. In order to evaluate the algorithmic performance of the TTS modules of different developers, the developers’ TTS modules are trained with the same language resources. The training LR package for UK English consists of a UK English phonetic lexicon (specified according to LC-STAR, 50,000 common words), UK English text of about 90,000 running words with POS tags specified according to the LC-STAR format, and about 10 h of annotated recordings of a female native speaker of UK English. In addition to the recordings, a phonetic lexicon is delivered containing all the pronunciation variants realised by this speaker. As was done for the TC-STAR evaluation campaigns, ELDA packages a TTS evaluation suite for the ECESS campaigns. At the time of writing, the campaign focuses on the evaluation of the grapheme-to-phoneme conversion (G2P) module for UK English.
Fig. 7 RES system configuration for remote ECESS evaluation campaigns
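
The evaluation flow of this platform, in which the same test set is sent to every participating module and the returned outputs are scored by the evaluator, can be sketched as follows. The scoring used here (exact-match word accuracy against a reference transcription) is an assumption chosen for brevity and is not the ECESS evaluation metric; `G2pEvaluationSketch`, the module names, and the toy transcriptions are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class G2pEvaluationSketch {
    /** Fraction of test words whose output exactly matches the reference transcription. */
    static double accuracy(UnaryOperator<String> module, Map<String, String> reference) {
        if (reference.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Map.Entry<String, String> entry : reference.entrySet()) {
            if (entry.getValue().equals(module.apply(entry.getKey()))) {
                correct++;
            }
        }
        return (double) correct / reference.size();
    }

    /** Sends the same test words to every module and scores each one. */
    static Map<String, Double> evaluateAll(Map<String, UnaryOperator<String>> modules,
                                           Map<String, String> reference) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, UnaryOperator<String>> m : modules.entrySet()) {
            scores.put(m.getKey(), accuracy(m.getValue(), reference));
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("cat", "k a t");   // toy transcriptions, not a real lexicon
        reference.put("dog", "d o g");

        // Stand-ins for remote G2P modules reached through RES module servers.
        Map<String, UnaryOperator<String>> modules = new LinkedHashMap<>();
        modules.put("moduleA", reference::get);
        modules.put("moduleB", word -> "k a t");
        System.out.println(evaluateAll(modules, reference));
    }
}
```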

As can be seen from Fig. 7, three institutions are involved in these evaluation campaigns (Siemens AG, IPS of the University of Munich, and the University of Maribor). The RES client runs at ELDA and the RES server runs at the University of Maribor. Both of these RES components run on the Windows platform. Since the developers’ text processing modules run on Linux, all three RES module servers run on Linux platforms. An advantage of this evaluation platform is that all three developers run their modules locally. The same is true for the language resources used. Furthermore, developers do not need to be involved in the evaluation procedure or in the preparation and running of the evaluation package, and they know nothing about the evaluation data. On the other hand, the evaluator does not need to contact developers who would like to have their text processing tools evaluated, does not need to prepare evaluation packages for them, and does not need to offer them any support. Whenever the evaluator decides to perform an evaluation, he/she just runs the RES client. The RES client, which obtains this information from the RES server, informs him/her about the available text processing modules and the corresponding developers who would like to be involved in an ongoing evaluation campaign. The evaluator then just selects the available ECESS text processing modules one after another, enters the test data, and evaluates the returned results.

The second evaluation/testing platform is based on the RES architecture shown in Fig. 8. This platform is used for the remote ECESS evaluation/testing of a complete TTS system. In this configuration, three TTS modules of the TTS system are involved: text processing, prosody processing, and acoustic processing. When evaluating a complete TTS system, the evaluator has to select one text processing, one prosody processing, and one acoustic processing RES module server (any of the available ones). The evaluator only has to take care that all the selected modules are compatible regarding their input/output parameters. As far as the format of the input/output data is concerned, all the input/output data formats of the modules integrated in RES are compatible with the ECESS data format. The evaluator has to run the RES client and select the architecture for the evaluation platform, as shown in Fig. 8. After the evaluator selects all three modules for the TTS system and sends test data to the RES server, the RES server automatically sends the data to the selected text processing RES module server first. It then sends the results obtained from the text processing RES module server to the selected prosody processing RES module server. The results obtained from the prosody processing RES module server are finally sent to the acoustic processing RES module server. The generated audio data are transferred at the end via the RES server back to the RES client, using the RTP protocol. By using this RES system configuration, developers are also able to test their own module or algorithm: via the RES client, they select their module encapsulated in a RES module server and use the other required modules from other developers to compose the complete TTS system. In this way they can compare their results with those obtained with other developers’ modules, and further improve their modules and algorithms.
Fig. 8

RES system configuration for remote ECESS evaluation/testing of a complete TTS system
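
The forwarding order used in the Fig. 8 configuration can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the three module functions and `run_tts_chain` are hypothetical names, plain function calls stand in for the network exchange between the RES server and the RES module servers, and strings stand in for ECESS-format data and for the audio returned over RTP.

```python
from typing import Callable, List

# Hypothetical stand-in for a remote RES module server: takes input data,
# returns output data. Real modules exchange ECESS-format data over the network.
ModuleServer = Callable[[str], str]


def text_processing(data: str) -> str:
    return f"text({data})"


def prosody_processing(data: str) -> str:
    return f"prosody({data})"


def acoustic_processing(data: str) -> str:
    return f"audio({data})"


def run_tts_chain(test_data: str, selected: List[ModuleServer]) -> str:
    """Pass the test data through the selected modules in order,
    feeding each module's output to the next one."""
    result = test_data
    for module in selected:
        result = module(result)
    return result


# The evaluator selects one module of each type; the order is fixed.
chain = [text_processing, prosody_processing, acoustic_processing]
print(run_tts_chain("Hello world", chain))  # audio(prosody(text(Hello world)))
```

Because every stage has the same call shape in this sketch, replacing one developer's module with another's only changes an entry in `chain`, which mirrors the compatibility requirement on input/output parameters.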

The partners’ experience of using the RES system has been very positive. Although there were some initial difficulties with configuring the necessary communication infrastructure at partners’ sites, all modules within the RES worked without problems during the test period and the evaluation campaigns. According to the partners, the RES system enables fast and easy integration of developers’ modules regardless of the operating systems used, and the possibility of automatic input/output data format conversion proved to be very helpful. Some partners had problems configuring firewalls, which required some extra knowledge about setting the proper IP addresses and incoming/outgoing ports. Currently, the IP settings for the different RES modules still have to be configured manually; to make this easier for users, a GUI for setting all the necessary configuration data should be developed in the future.

Additionally, the selected IP settings should be checked before they are saved and used by the RES system, so that developers can be sure that their module works within RES without problems and is accessible to all interested partners that use the RES framework.
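
A check of this kind could look roughly like the following Python sketch. It is not part of the RES system: the function name `validate_endpoint` and the idea of returning a list of error messages are assumptions made for illustration, and the sketch only checks that the values are well formed, not that the endpoint is actually reachable through a firewall.

```python
import ipaddress
from typing import List


def validate_endpoint(host: str, port: int) -> List[str]:
    """Return a list of problems found in the settings (empty if none).

    Hypothetical helper: only checks syntax and value ranges.
    """
    errors = []
    try:
        # Accepts IPv4 and IPv6 literals; raises ValueError otherwise.
        ipaddress.ip_address(host)
    except ValueError:
        errors.append(f"invalid IP address: {host!r}")
    if not 1 <= port <= 65535:
        errors.append(f"port out of range: {port}")
    return errors


print(validate_endpoint("192.168.1.10", 5060))  # []
print(validate_endpoint("999.1.1.1", 70000))    # two error messages
```

Settings would be saved only when the returned list is empty; a reachability test from another partner's site would still be needed to detect firewall problems.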

If one or more RES module servers encounter problems running users’ modules integrated in the RES, the RES system as a whole does not fail. In the worst case, the affected RES module server is disconnected after a specified timeout and no output is returned to the RES client (the user is notified with an appropriate error message). Meanwhile, other users can continue to use the RES system and perform different tasks simultaneously using the other available modules.
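
The timeout behaviour can be sketched in Python as follows. This is a minimal illustration, not the RES implementation: `call_with_timeout`, `healthy_module` and `stalled_module` are hypothetical names, a local function call stands in for the remote module server, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(module, data, timeout_s):
    """Call a module; return (ok, payload).

    On timeout or failure, payload is an error message for the client.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(module, data)
    try:
        return True, future.result(timeout=timeout_s)
    except FutureTimeout:
        return False, f"module did not answer within {timeout_s} s"
    except Exception as exc:  # the module itself raised an error
        return False, f"module failed: {exc}"
    finally:
        # Do not block on a stalled worker; it is left to finish in the background.
        executor.shutdown(wait=False)


def healthy_module(data):
    return data.upper()


def stalled_module(data):
    time.sleep(0.5)  # simulates a module that responds too late
    return data


print(call_with_timeout(stalled_module, "abc", 0.1))
print(call_with_timeout(healthy_module, "abc", 0.1))
```

The stalled call yields an error message instead of a result, while the call to the healthy module still succeeds, which is the isolation property described above.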

The main purpose of the presented RES system is to open new possibilities for collaborative work between partners and to enable the development, testing, and evaluation of TTS modules and systems developed by different partners. This paper gives only a few possible examples of the use of the RES system. The RES system is capable of integrating processing tasks of any desired granularity that can be defined within TTS systems, and even other processing tasks within the speech technology field. It is therefore equally possible to integrate a whole TTS system as a single module or to sub-divide, for example, the text processing module further into a tokenizer, a POS tagger, text normalization, etc. For example, the RES system is currently running modules for tokenizing several European languages, POS tagging, etc. The test/evaluation scenarios and platforms described here, which have been developed within ECESS, are therefore only a few of the many that can be developed for various tasks.

6 Conclusion

The paper presents a web-based distributed framework for the evaluation and development of TTS modules. It is a client/server architecture composed of several components, running as finite-state engines. The proposed architecture is flexible, reliable, easily re-configurable and maintainable. It can be used for numerous evaluation and testing tasks regarding TTS modules. Using the RES system is very easy from the user’s point of view. The RES is based on protocol standards that have gained wide support in the speech and telecommunication areas today. All RES components use the same flexible architecture (finite-state engines), all input/output formats are standardized and compatible (TC-STAR compatible), and the structure is modular. By using the proposed remote evaluation RES framework, researchers are able to concentrate their efforts on the development of a single TTS module or algorithm, and test its performance. The RES can be easily configured for a particular evaluation campaign. In this way, the performance of single modules (e.g. text processing modules) from different developers, or of whole TTS systems composed of modules from different developers, can be evaluated.


EU project TC-STAR (Technology and Corpora for Speech to Speech Translation)


The Blizzard challenge:

3 The ECESS consortium has been, from its beginning, an open, non-funded consortium for institutions active in speech synthesis and related topics.


Copyright information

© Springer Science+Business Media B.V. 2009