
Speech Mashups

  • Giuseppe Di Fabbrizio
  • Thomas Okken
  • Jay Wilpon

Abstract

Speech mashups are an emerging type of mashup that exposes cloud-based speech and language processing technologies as web services. They allow researchers, practitioners, and developers to access commercial-grade speech recognition and text-to-speech systems without having to install, configure, or manage speech processing software or equipment. This approach significantly lowers the barrier to building speech applications by making all the necessary components and tools available in the network. Compared to traditional mashups, speech mashups introduce a number of new concepts, such as audio capture, audio playback, streaming media across the network, and resource configuration management.
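To make the web-service model concrete, the sketch below shows how a client might submit a recorded utterance to a cloud-based recognizer over plain HTTP and read back a JSON result. The endpoint URL, the grammar query parameter, and the response schema are hypothetical placeholders chosen for illustration, not the actual speech mashup API.

    # Minimal client sketch: POST a WAV file to a cloud ASR web service
    # and return the 1-best transcript. The URL, the "grammar" parameter,
    # and the JSON response layout are assumptions, not a documented API.
    import json
    import urllib.request

    ASR_URL = "https://speech.example.com/asr"  # hypothetical endpoint

    def recognize(wav_path, grammar="generic"):
        """Send audio to the recognizer and return the top transcript."""
        with open(wav_path, "rb") as f:
            audio = f.read()
        req = urllib.request.Request(
            ASR_URL + "?grammar=" + grammar,
            data=audio,  # request body; urllib sends this as a POST
            headers={"Content-Type": "audio/wav",
                     "Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        # Assumed schema: {"hypotheses": [{"transcript": "..."}]}
        return result["hypotheses"][0]["transcript"]

    if __name__ == "__main__":
        print(recognize("utterance.wav"))

A real deployment would also need to negotiate audio codecs, stream audio incrementally rather than uploading a whole file, and select recognition resources (grammars, language models) per request, which is where the resource configuration management mentioned above comes in.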

Keywords

Speech Recognition · Language Model · Automatic Speech Recognition · Mobile Client · Stream Control Transmission Protocol


Acknowledgements

We would like to thank Linda Crane and Amanda Stent for their contributions and continuous unconditional support.


Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Giuseppe Di Fabbrizio 1
  • Thomas Okken 1
  • Jay Wilpon 1

  1. AT&T Labs Research, Florham Park, NJ, USA
