The Universal Speech Translation Advanced Research Consortium (U-STAR) is an international research collaboration formed to develop network-based speech-to-speech translation (S2ST), with the aim of breaking language barriers around the world and enabling spoken communication between different languages. In 2010, as an initiative of U-STAR, international communication protocols were standardized and approved by the ITU-T as Recommendations F.745 [4] and H.625 [5], enabling speech-to-speech translation (S2ST) modules to be connected across the globe over networks. Below are excerpts from these recommendations:
Network-based speech-to-speech translation services
This Recommendation specifies the service description and the requirements for speech-to-speech translation (S2ST) accomplished by connecting distributed S2ST modules all over the world through a network. This service provides S2ST that recognizes the speech in one language, translates the recognized speech into another language, and then synthesizes the translation into speech. People who speak different languages can communicate using this service.
The applications and services using network-based S2ST technologies are characterized by the following components:
- S2ST client
- S2ST servers:
  - Speech recognition: speech is recognized and transcribed;
  - Machine translation: text in the source language is translated into text in the target language;
  - Speech synthesis: a speech signal is created from text.
- Communication protocol
In order to extend the network-based S2ST to other modalities (e.g., sign language), a communication protocol is incorporated for modality conversion (MC), which converts single/multiple modality information to different single/multiple modality information. The communication protocol for MC needs to have an expandable structure.
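The expandable structure described above can be sketched as a generic request envelope that names the source and target modalities and carries an open-ended extension field. This is a minimal illustration only; the field names and converter registry are assumptions, not the actual ITU-T H.625 wire format.

```python
from dataclasses import dataclass, field

@dataclass
class MCRequest:
    """Hypothetical modality-conversion request envelope.

    Field names are illustrative assumptions, not the real
    ITU-T H.625 message format.
    """
    source_modality: str   # e.g. "speech", "text", "sign-language-video"
    target_modality: str
    language: str          # language tag, e.g. "ja-JP"
    payload: bytes         # modality data (audio samples, text, ...)
    extensions: dict = field(default_factory=dict)  # expandable structure

def convert(request: MCRequest) -> MCRequest:
    # A registry maps (source, target) modality pairs to converter
    # functions, so new modalities can be added without changing the
    # envelope itself.
    key = (request.source_modality, request.target_modality)
    converter = CONVERTERS[key]
    return converter(request)

# Example registry entry: speech -> text is handled by an ASR module.
def asr_stub(req: MCRequest) -> MCRequest:
    text = b"recognized text"  # placeholder for a real ASR result
    return MCRequest("text", "text", req.language, text, req.extensions)

CONVERTERS = {("speech", "text"): asr_stub}
```

Registering converters by modality pair keeps the protocol open-ended: adding, say, sign-language video only requires a new registry entry, not a new message schema.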
Modality conversion markup language (MCML)
Leveraging S2ST technologies in a practical manner, which has long been one of humankind's dreams, may have a significant impact on tourism, social services, safety, and security. To construct S2ST systems, automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS) must be built for the source and target languages by collecting speech and language data, such as audio data, manual transcriptions, pronunciation lexica for each word, parallel corpora for translation, and so on.
It is very difficult for individual organizations to build S2ST systems covering all topics and languages. However, by interconnecting ASR, MT and TTS modules developed by separate organizations and distributed globally through a network, one can create S2ST systems that break the world’s language barriers.
The consortium is working collaboratively to collect language corpora, create common speech recognition and translation dictionaries, develop Web service speech translation modules for the various Asian languages, and standardize interfaces and data formats that facilitate the international interaction among the different speech translation modules from different countries.
The system is being designed to translate common spoken utterances of travel conversations from a certain source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different languages. Each consortium member contributes one or more of the following spoken language technologies through Web servers: automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). Currently, the system covers 23 languages, aiming at 23 × 22 / 2 = 253 S2ST language pairs.
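The pair count quoted above follows from simple combinatorics: each unordered pair of distinct languages counts once.

```python
# Number of unordered language pairs among 23 languages:
languages = 23
pairs = languages * (languages - 1) // 2  # 23 * 22 / 2
print(pairs)  # 253
```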
Figure 1 illustrates an example where a spoken Source Language utterance is recognized and converted into Source Language text; this is then translated into Target Language text, which is synthesized into Target Language speech [1].
In a loosely coupled system, these three components are connected sequentially: ASR converts the user's speech to text in the source language, and then MT translates the source text into the target language. Finally, TTS creates the synthesized speech from the target text.
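The loosely coupled, sequential composition described above can be sketched as follows. The three stage functions are stand-ins for networked modules; their names and signatures are assumptions for illustration, not an actual U-STAR API.

```python
def speech_to_speech(audio: bytes, src_lang: str, tgt_lang: str) -> bytes:
    """Loosely coupled S2ST pipeline: ASR -> MT -> TTS."""
    source_text = asr(audio, src_lang)                 # speech -> source text
    target_text = mt(source_text, src_lang, tgt_lang)  # text -> text
    return tts(target_text, tgt_lang)                  # text -> speech

# Minimal stand-in stages so the pipeline runs end to end.
def asr(audio: bytes, lang: str) -> str:
    return "hello"                               # placeholder transcription

def mt(text: str, src: str, tgt: str) -> str:
    return {"hello": "bonjour"}.get(text, text)  # toy one-entry dictionary

def tts(text: str, lang: str) -> bytes:
    return text.encode("utf-8")                  # placeholder waveform
```

Because each stage only consumes the previous stage's output, any one module can be swapped for another organization's implementation behind a Web service interface, which is the point of the loosely coupled design.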