Remote-based text-to-speech modules’ evaluation framework: the RES framework
- Cite this article as:
- Rojc, M., Höge, H. & Kačič, Z. Lang Resources & Evaluation (2010) 44: 371. doi:10.1007/s10579-009-9110-3
The ECESS consortium (European Center of Excellence in Speech Synthesis) aims to speed up progress in speech synthesis technology by providing an appropriate evaluation framework. The key element of the evaluation framework is the partition of a text-to-speech (TTS) synthesis system into distributed TTS modules. Text processing, prosody generation, and acoustic synthesis modules have been specified so far. The split into modules has the advantage that the developers of an institution active in ECESS can concentrate their efforts on a single module and test its performance in a complete system, using the missing modules from the developers of other institutions. In this way, complete TTS systems can be built using high-performance modules from different institutions. In order to evaluate the modules and to connect them efficiently, a remote evaluation platform, the Remote Evaluation System (RES), based on the existing internet infrastructure, has been developed within ECESS. The RES is based on a client–server architecture. It consists of RES module servers, which encapsulate the modules of the developers; a RES client, which sends data to and receives data from the RES module servers; and a RES server, which connects the RES module servers and organizes the flow of information. Developers can use the RES to select, from the internet, a RES module server containing a missing TTS module that they need in order to test and improve the performance of their own modules. Finally, the RES allows for the evaluation of TTS modules running at different institutions worldwide. Using the RES client, the institution performing the evaluation is able to set up and perform various evaluation tasks by sending test data via the RES client and receiving results from the RES module servers. Currently ELDA (www.elda.org) is setting up an evaluation using the RES client, which will then be extended to an evaluation client specializing in the envisaged evaluation tasks.
Keywords: Remote text-to-speech synthesis evaluation; Text-to-speech synthesis modules; ECESS consortium
As demonstrated in many EU-funded and DARPA projects, constant evaluation as an integral part of research activities has proven to be a successful approach for enhancing progress in almost all areas of speech technology, such as speech recognition, speech synthesis, or speech translation, especially when organized in the form of evaluation campaigns (e.g. TC-STAR, Blizzard, etc.).
An evaluation framework has been established for speech synthesis technology within the ECESS consortium (European Center of Excellence in Speech Synthesis) in cooperation with the EU-funded project TC-STAR. The ECESS consortium is an open, non-funded consortium for institutions active in speech synthesis and related topics. The key element of the evaluation framework is the specification of several modules (e.g. text processing, prosody generation, and acoustic synthesis modules) that together build a complete text-to-speech system. The functionality and interfaces of the TTS modules are described in (Perez et al. 2006). This split into modules has the advantage that the developers of an institution can concentrate their efforts on a single module and test its performance in a complete system, using the missing modules from the developers of other institutions. In this way, high-performance systems can be built from the high-performance modules of different institutions. A common evaluation methodology has been developed to assess the performance of the modules (Bonafonte et al. 2006). The methodology is based on the common use of module-specific evaluation criteria and of the module-specific language resources needed for training and testing the modules.
The evaluation is not performed ‘on-line’, because the transport of test data and results has to be handled manually. Furthermore, the test data are not ‘secret’.
Modules from different developers cannot be connected without exchanging software that then has to be integrated locally.
Developers/researchers, who use RES in a test/development mode in order to improve the performance of their TTS module(s). In the following, this user group is called ‘developers’.
Evaluators, who use RES in an evaluation mode to measure the performance of selected TTS modules.
The architecture of RES consists of three RES components: the RES module server, which encapsulates the TTS modules; the RES client, which sends data to and receives data from the RES module servers; and a RES server (managing unit, MU), which connects the RES clients and RES module servers and organizes the flow of information. Using the RES framework, each developer places their TTS module (or several modules), embedded in a RES module server, on the internet. The developers of any ECESS research group, as well as evaluators, can access these TTS modules via a locally installed RES client. Based on this architecture, evaluation can be done remotely without the need to install these modules locally, and without manual intervention in the transport of test data and results. Furthermore, each developer can combine their TTS module with other available TTS modules in order to test the performance of that module within a complete TTS system.
Developers focusing on speech synthesis research do not want to spend too much time integrating their module into a RES module server. The Unforma RES tool has been developed in order to ease the embedding of a module into the RES module server: it simplifies the construction of the data format parsers that convert proprietary data formats into the RES system data format. An additional ProtocolGen tool enables the generation of the numerous RES task configurations needed for evaluating or testing various TTS modules and systems from different institutions.
Depending on the experiences of developers active in ECESS, the RES will be further extended and modified. The architecture of RES also allows for the evaluation of arbitrary software components. A testbed for this idea will be the evaluation of the tools needed to support the generation of TTS systems (e.g. pitch-marking, VAD, etc.).
The remainder of this paper is organized as follows.
Section 2 describes how the remote evaluation system RES is used, and exposes the main functionalities of the system. Section 3 then describes the integration of new TTS modules into the RES. Implementation of new evaluation or testing tasks for the RES system is described in Sect. 4. The paper ends with a presentation of current ECESS evaluation/testing platforms, based on the RES system, that are used in evaluation campaigns, and the last section draws some conclusions.
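The three-component architecture introduced above can be illustrated with a minimal in-memory sketch. It is not the RES implementation (the RES components are Java applications communicating over the network); the class names `ModuleServer`, `ResServer` and `ResClient`, the module names, and the toy string transformations are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModuleServer:
    """Hypothetical stand-in for a RES module server wrapping one TTS module."""
    name: str
    process: Callable[[str], str]  # the encapsulated (toy) TTS module


@dataclass
class ResServer:
    """Hypothetical managing unit: keeps the module list and routes data."""
    registry: Dict[str, ModuleServer] = field(default_factory=dict)

    def register(self, module: ModuleServer) -> None:
        self.registry[module.name] = module

    def available(self) -> List[str]:
        # The list a client would receive in order to choose modules.
        return sorted(self.registry)

    def run(self, chain: List[str], data: str) -> str:
        # Pass the data through the selected module servers in order.
        for name in chain:
            data = self.registry[name].process(data)
        return data


class ResClient:
    """Hypothetical client: it only talks to the RES server."""

    def __init__(self, server: ResServer) -> None:
        self.server = server

    def submit(self, chain: List[str], text: str) -> str:
        return self.server.run(chain, text)


server = ResServer()
# Toy modules that merely wrap their input so the routing order is visible.
server.register(ModuleServer("text_processing", lambda s: f"tokens({s})"))
server.register(ModuleServer("prosody", lambda s: f"prosody({s})"))
server.register(ModuleServer("acoustic", lambda s: f"audio({s})"))

client = ResClient(server)
result = client.submit(["text_processing", "prosody", "acoustic"], "Hello")
print(result)
```

In this sketch the client never contacts a module server directly, which mirrors the role of the RES server as the component that connects clients and module servers.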
2 Functionalities, use and installation of RES
2.1 Functionalities of the RES components
RES users are able to perform numerous evaluation and testing tasks using the RES. These tasks are realized through different architectures of RES components and their behavioural specifications. Each task starts with the selection of a configuration of developers’ TTS modules suitable for the desired task (the input/output data exchange between the modules must make sense), followed by the execution of the work to be done by each TTS module. The architectures of the RES system and the behaviour of the RES components are described in XML format.
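The requirement that the input/output data exchange of a selected configuration "must make sense" can be sketched as a simple compatibility check over a chain of modules. The data-type labels (`text`, `tokens`, `prosody`, `audio`) and the `ModuleSpec` structure are hypothetical; the text does not specify how the RES itself performs or enforces this check.

```python
from typing import List, NamedTuple


class ModuleSpec(NamedTuple):
    """Hypothetical description of what a TTS module consumes and produces."""
    name: str
    consumes: str
    produces: str


def chain_is_valid(chain: List[ModuleSpec], source: str = "text") -> bool:
    """Return True if every module accepts what the previous step produced."""
    current = source
    for module in chain:
        if module.consumes != current:
            return False
        current = module.produces
    return True


# Assumed data-type labels, chosen only for this illustration.
TEXT = ModuleSpec("text_processing", "text", "tokens")
PROSODY = ModuleSpec("prosody_generation", "tokens", "prosody")
ACOUSTIC = ModuleSpec("acoustic_synthesis", "prosody", "audio")

print(chain_is_valid([TEXT, PROSODY, ACOUSTIC]))  # full TTS chain
print(chain_is_valid([TEXT, ACOUSTIC]))           # prosody step missing
```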
2.2 Use and installation of the RES components
A RES server, acting as the central managing unit (MU), is installed by only one institution, which is also responsible for administrating the RES system. This administrating institution also maintains the list of all RES module servers made available by developers (IP/port access). This list is automatically sent to RES clients so that RES users are able to select among the available TTS modules. RES users can install a RES client and their RES module servers on any platform (Linux or Windows), since all components of RES are pure Java applications. Each RES client additionally contains sets of XML protocol scenario files and an XML configuration file. The RES client’s access to all other RES components, in terms of TCP and UDP traffic, is set up in the XML configuration file. The sets of XML protocol scenario files are designed for performing different tasks using the RES system, as explained in more detail in Sect. 4. In this way, by using a RES client, RES users are able to select any RES module server within the RES that runs a specific TTS module. When they want to use a RES module server running an acoustic processing module from another developer, they also have to configure the IP address and UDP port for the RTP protocol in the RES client’s XML configuration file. Installation of the RES client is simple, since RES users just have to copy the software package into a selected directory. After starting the RES client, they have to select the desired RES task from the ‘task list’ (depending on the evaluation task). The given RES ‘task list’ identifies the sets of XML files describing the required behaviour of the RES components. Finally, they have to enter input data as specified within the evaluation campaign. The input given by the RES client is transferred via the RES server to the specified RES module server, where it is stored in a predefined file. Next, the RES module server runs the TTS module or the script specified in the XML configuration file. Finally, the TTS module or script stores the output results in a predefined file (also specified in the XML configuration file), and the RES module server then transfers its content via the RES server back to the RES client.
Some developers only want to make their TTS modules available via the internet. Encapsulation of developers’ TTS modules into the RES can be accomplished simply by using the RES module server. Each TTS module needs its own dedicated RES module server, so developers who want to encapsulate several TTS modules need several RES module servers. Encapsulation of TTS modules into the RES via a RES module server has to be done by the developers themselves. Developers just have to specify, in the XML configuration file, the name of the corresponding TTS module or script to be executed by the RES module server. Additionally, they have to register their TTS module within the RES system. In order to do so, the developers have to communicate the IP/port configuration information to the administrating institution that maintains the list of all available TTS modules within the RES.
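The request cycle of a RES module server described above (store the input in a predefined file, run the module or script named in the XML configuration, read the result from a predefined output file) can be sketched as follows. The XML element names (`script`, `input_file`, `output_file`), the toy script, and the function `handle_request` are assumptions for illustration; the actual RES configuration schema is not given in the text.

```python
import subprocess
import sys
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical configuration layout, not the real RES schema.
CONFIG_XML = """
<module_server>
  <script>toy_module.py</script>
  <input_file>input.txt</input_file>
  <output_file>output.txt</output_file>
</module_server>
"""

# Toy "TTS module": copies the input file to the output file in upper case.
TOY_SCRIPT = """\
import pathlib, sys
text = pathlib.Path(sys.argv[1]).read_text(encoding="utf-8")
pathlib.Path(sys.argv[2]).write_text(text.upper(), encoding="utf-8")
"""


def handle_request(config_xml: str, payload: str, workdir: Path) -> str:
    """Store the input, run the configured script, return the stored output."""
    config = ET.fromstring(config_xml)
    script_path = workdir / config.findtext("script")
    in_path = workdir / config.findtext("input_file")
    out_path = workdir / config.findtext("output_file")

    # In a real deployment the developer's module would already be installed;
    # the toy script is written here only to keep the sketch self-contained.
    script_path.write_text(TOY_SCRIPT, encoding="utf-8")
    in_path.write_text(payload, encoding="utf-8")

    subprocess.run(
        [sys.executable, str(script_path), str(in_path), str(out_path)],
        check=True,
        timeout=30,
    )
    return out_path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    result = handle_request(CONFIG_XML, "hello res", Path(tmp))
print(result)
```

Because the module is started as an external process and exchanges data only through files, the wrapped module can be written in any language, which is consistent with the text's statement that developers only need to name the module or script in the configuration file.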
3 Embedding a TTS module into a RES module server
4 Implementation of new tasks by the RES system
An important RES implementation issue is that the RES components must cover many different task scenarios using different RES system architectures, since ECESS activities and evaluation campaigns are, and will remain, very diverse. Therefore, many different scenarios are possible for RES clients, the RES server and RES module servers. Hard-coded implementations of the RES components’ scenarios (behaviour) would be rigid and inefficient in this context, and could quickly turn into a ‘nightmare’ for the developer of the RES components. Additionally, deploying numerous new versions could lead to confusion and problems at the developers’ sites. In order to avoid such situations, all RES components have been implemented as finite-state engines using the UniMod framework (Weyns et al. 2007; Shalyto 2001).
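The idea of driving a component's behaviour from a scenario description rather than from hard-coded logic can be sketched with a small table-driven finite-state engine. This sketch does not use UniMod; it is a plain Python stand-in, and the XML layout, state names and event names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical scenario description; the real RES protocol files differ.
SCENARIO_XML = """
<scenario initial="idle">
  <transition from="idle" event="task_selected" to="waiting_for_input"/>
  <transition from="waiting_for_input" event="input_received" to="processing"/>
  <transition from="processing" event="output_ready" to="idle"/>
  <transition from="processing" event="timeout" to="error"/>
</scenario>
"""


class ScenarioEngine:
    """Finite-state engine whose transition table is loaded from XML."""

    def __init__(self, xml_text: str) -> None:
        root = ET.fromstring(xml_text)
        self.state = root.get("initial")
        self.table = {
            (t.get("from"), t.get("event")): t.get("to")
            for t in root.iter("transition")
        }

    def fire(self, event: str) -> str:
        """Apply one event and return the new state."""
        key = (self.state, event)
        if key not in self.table:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.table[key]
        return self.state


engine = ScenarioEngine(SCENARIO_XML)
trace = [engine.fire(e) for e in ("task_selected", "input_received", "output_ready")]
print(trace)
```

Supporting a new scenario in this sketch means supplying a different XML description, not changing the engine code, which is the property the text attributes to the finite-state approach.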
5 ECESS evaluation/testing platforms
Partners’ experience of using the RES system has been very positive. Although there were some initial difficulties with the proper configuration of the necessary communication infrastructure at partners’ sites, all modules within the RES worked without problems during the test period and the evaluation campaigns. According to the partners, the RES system enables fast and easy integration of developers’ modules regardless of the operating system used, and the possibility of automatic input/output data format conversion proved to be very helpful. Some partners had problems configuring firewalls, where some extra knowledge about setting the proper IP addresses and incoming/outgoing ports was needed. Currently, manual configuration of the IP settings for the different RES modules is still needed; to make this easier for users, a GUI for setting all the necessary configuration data should be developed in the future.
Additionally, the selected IP settings should be checked before they are saved and used by the RES system, so that developers can be sure that their module works within the RES without problems and is accessible to all interested partners that use the RES framework.
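A check of this kind could start with a purely syntactic validation of the address and port before the settings are stored. The sketch below is one possible form, using Python's standard `ipaddress` module; the function name `validate_endpoint` is hypothetical, and the check does not test whether the endpoint is actually reachable through a firewall.

```python
import ipaddress
from typing import List


def validate_endpoint(host: str, port: int) -> List[str]:
    """Return a list of problems found; an empty list means the pair looks valid."""
    problems = []
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
    except ValueError:
        problems.append(f"'{host}' is not a valid IP address")
    if not 1 <= port <= 65535:
        problems.append(f"port {port} is outside the range 1-65535")
    return problems


print(validate_endpoint("192.0.2.10", 5004))
print(validate_endpoint("300.1.1.1", 70000))
```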
If one or more RES module servers encounter problems while running users’ modules integrated in the RES, the RES system as a whole does not fail. In the worst case, the specific RES module server is simply disconnected after a specified timeout, and no output is returned to the RES client (the user is notified of this with an appropriate error message). At the same time, other users can continue to use the RES system and perform different tasks simultaneously using the other available modules.
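The timeout behaviour described above can be sketched as a wrapper that waits for a module's answer for a limited time and returns an error message in place of output when the limit is exceeded. The function `call_with_timeout`, the result dictionary layout and the toy modules are assumptions; the text does not describe how the RES implements its timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(module, payload: str, timeout_s: float) -> dict:
    """Run a module call, giving up (with an error message) after timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(module, payload)
    try:
        return {"ok": True, "output": future.result(timeout=timeout_s)}
    except FutureTimeout:
        return {"ok": False,
                "error": f"module server did not answer within {timeout_s} s"}
    except Exception as exc:  # the module itself failed
        return {"ok": False, "error": f"module failed: {exc}"}
    finally:
        # Do not block the caller on a stuck worker.
        pool.shutdown(wait=False)


def fast_module(text: str) -> str:
    return text.upper()


def slow_module(text: str) -> str:
    time.sleep(0.5)  # simulates a module server that hangs
    return text


good = call_with_timeout(fast_module, "abc", timeout_s=1.0)
bad = call_with_timeout(slow_module, "abc", timeout_s=0.05)
print(good)
print(bad)
```

Each call uses its own worker, so a stuck module in this sketch delays only the request that addressed it, in line with the statement that other users can keep working with the remaining modules.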
The main purpose of the presented RES system is to open new possibilities for collaborative work among different partners and to enable the development, testing, and evaluation of TTS modules and systems developed by different partners. This paper gives only a few possible examples of the use of the RES system. The RES system is capable of integrating any processing task, at any desired granularity, that can be defined within TTS systems, and even other processing tasks within the speech technology field. It is therefore possible either to integrate a whole TTS system or to further subdivide, for example, the text processing module into a tokenizer, a POS tagger, text normalization, etc. For example, the RES system is currently running modules for tokenizing several European languages, for POS tagging, etc. The described test/evaluation scenarios and platforms are therefore only a few of those developed within ECESS, out of the many that could be developed for various tasks.
The paper presents a web-based distributed framework for the evaluation and development of TTS modules. It has a client/server architecture composed of several components running as finite-state engines. The proposed architecture is flexible, reliable, easily reconfigurable and maintainable. It can be used for numerous evaluation and testing tasks involving TTS modules. Using the RES system is very easy from the user’s point of view. The RES is based on protocol standards that have gained wide support in the speech and telecommunication areas. All RES components use the same flexible architecture (finite-state engines), all input/output formats are standardized and TC-STAR compatible, and the structure is modular. By using the proposed remote evaluation framework, researchers are able to concentrate their efforts on the development of a single TTS module or algorithm and test its performance. The RES can be easily configured for a particular evaluation campaign. In this way, the performance of single modules (e.g. text processing modules) from different developers, or of whole TTS systems composed of modules from different developers, can be evaluated.