Survey of Speech Research around the World
by Pisit P, November 1999, PsTNLP Laboratory 

I. America

1. Bell Labs Text-to-Speech System

    The Bell Labs Text-to-Speech system (TTS) has various applications, including reading electronic mail messages, generating spoken prompts in voice response systems, and serving as an interface to an order-verification system for salespeople in the field.
    TTS is implemented entirely in software and only standard audio capability is required. At present, it contains several components, each of which handles a different task. For example, the text analysis capabilities of the system detect the ends of sentences, perform some rudimentary syntactic
analysis, expand digit sequences into words, and disambiguate and expand abbreviations into normally spelled words which can then be analyzed by the dictionary-based pronunciation module. The following sentences illustrate the system's text-analysis capabilities.

 Lumber will cost $3.95 for 7 ft. on Sat.
 That fossil from NM is 165,256,011 yr. old.
 Dr. Smith lives on Oak Dr., but St. John lives on 71st St.
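
    To make the abbreviation disambiguation concrete, here is a minimal sketch in Python; the rules and word lists below are invented for illustration and are not the actual Bell Labs module:

    # Context-sensitive abbreviation expansion: the same token expands
    # differently depending on what follows it. (Illustrative rules only.)
    AMBIGUOUS = {
        "Dr.": ("Doctor", "Drive"),   # title before a name vs. street type
        "St.": ("Saint", "Street"),
    }
    UNAMBIGUOUS = {"ft.": "feet", "yr.": "years", "Sat.": "Saturday"}

    def looks_like_name(token):
        """Crude test: a capitalized, purely alphabetic word such as 'Smith'."""
        return token[:1].isupper() and token.rstrip(".,").isalpha()

    def expand(tokens):
        out = []
        for i, tok in enumerate(tokens):
            if tok in AMBIGUOUS:
                name_reading, street_reading = AMBIGUOUS[tok]
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
                # Title/Saint reading if a proper name follows, else street.
                out.append(name_reading if looks_like_name(nxt) else street_reading)
            else:
                out.append(UNAMBIGUOUS.get(tok, tok))
        return " ".join(out)

    # The comma is spaced off to keep the whitespace tokenizer trivial.
    print(expand("Dr. Smith lives on Oak Dr. , but St. John lives on 71st St.".split()))
    # -> Doctor Smith lives on Oak Drive , but Saint John lives on 71st Street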

    The pronunciation module provides pronunciations for most ordinary words, and morphological derivatives thereof, as well as proper names; default strategies exist for pronouncing words not recognized by the dictionary-based methods. Other components handle prosodic phrasing,
word accentuation, sentence intonation, and the actual speech synthesis. We believe that the word pronunciation and intelligibility of our American English TTS system are the best available. However, we are continuously working to improve its naturalness. We are also expanding the set of
languages that TTS supports; these include German, Chinese (Mandarin and Taiwanese), Russian, French, Romanian, Italian, Spanish (Latin American and Castilian), Japanese, and Navajo.
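
    The lookup cascade just described (dictionary hit, then morphological derivative, then a default letter-to-sound strategy) can be sketched as follows; the toy lexicon, suffix rules, and phone symbols are illustrative stand-ins, not the system's actual data:

    # Three-stage pronunciation lookup. (Toy data; illustrative only.)
    LEXICON = {"walk": "W AO K", "cat": "K AE T"}
    SUFFIXES = (("ing", "IH NG"), ("ed", "D"), ("s", "Z"))
    LETTER_TO_SOUND = {"b": "B", "o": "OW", "t": "T"}  # tiny stand-in rule table

    def pronounce(word):
        word = word.lower()
        if word in LEXICON:                      # 1. direct dictionary lookup
            return LEXICON[word]
        for suffix, phones in SUFFIXES:          # 2. morphological derivative
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and stem in LEXICON:
                return LEXICON[stem] + " " + phones
        # 3. default strategy for unknown words: one phone per letter
        return " ".join(LETTER_TO_SOUND.get(ch, ch.upper()) for ch in word)

    print(pronounce("walking"))  # W AO K IH NG  (derived from "walk")
    print(pronounce("bot"))      # B OW T        (letter-to-sound fallback)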

2. DECtalk™ Text-to-Speech Synthesis Software

    Converts standard text into highly intelligible and human-like speech.

DECtalk Software features:
- Converts standard ASCII text into natural, highly intelligible speech
- Speech output through any audio device supported by Microsoft Video for Windows or Multimedia Services for Digital UNIX
- API gives developers direct access to text-to-speech functions
- Choice of nine voice personalities: four female, four male, one child
- Punctuation and tonal control
- Unlimited vocabulary
- Supports customized pronunciation of trade jargon and acronyms
- Bundled applets and sample code enable you to get started quickly and productively
- Provides human voice quality and advanced phonetic and linguistic pronunciation
- Common programming interface works with both Alpha and Intel platforms
- Enables users to proofread text, such as spreadsheets, far more easily against written source material
- Mobile professionals can access electronic mail and other documents, significantly increasing on-the-road productivity
- Great for form filling and word processing, too!

Description

Digital, an industry leader in text-to-speech synthesis technology, proudly offers DECtalk, advanced desktop software that lets you create and employ applications that electronically speak to users. With DECtalk, your applications can convert standard text into highly intelligible, human-like speech through a sound card on your system. Use DECtalk to provide important information in situations where reading text would be inconvenient or impossible, such as mobile computing, telephone-based data access
services for consumers, and even electronic mail. DECtalk is versatile, too -- you can incorporate it into any application running on Windows NT or Digital UNIX. And it offers a choice of nine rich voice personalities, intonation and speed control, and built-in phonetic, linguistic, and pronunciation rules.

3. Speech research at Carnegie Mellon University

     CMU maintains on-line resources for speech technology research, development, and deployment. These include groups, projects, and research centers; papers; a pronouncing dictionary; a survey of the state of the art in human language technology; and links to speech labs and other resources.

4. Speech at Apple Computer

    You don’t have to be a Star Trek fan to know that the computer of the future will talk, listen and
understand. That computer of the future is the Apple Macintosh of today. Apple’s Speech
Recognition and Speech Synthesis Technologies now give speech-savvy applications the power to
carry out your voice commands, and even speak back to you in plain English, and now, Spanish.
The technology includes both text-to-speech and speech recognition.

5. Center for Language and Speech Processing at Johns Hopkins University

    The center conducts research in areas including automatic speech recognition, natural language processing, speech and VLSI, and cognitive science.

6. Spoken Language Systems Group at MIT

    In order to make it possible for humans to speak to computers, a conversational interface is needed. A conversational interface enables humans to converse with machines (in much the same way we communicate with one another) in order to create, access, and manage information and to solve problems. It is what Hollywood and every "vision of the future" tell us we must have. Since 1989, getting computers to communicate the way people do -- by speaking and listening -- has been the objective of the Spoken Language Systems (SLS) Group at MIT's Laboratory for Computer Science.

7. Machine Listening Group at the MIT Media Lab

    The Machine Listening Group is working towards bridging the gap between the current generation
of audio technologies and those that will be needed for future interactive media applications.
Our research includes new description-based representations for audio that enable controllable,
compact, and computationally efficient sound and music rendering and presentation.
We are also concerned with the application of research from psycho-acoustics and auditory
perception and cognition to engineering solutions for new audio applications. These include 3D
spatialization, virtual acoustics and computational analysis of complex real-world auditory scenes
such as multi-instrument music recordings and so-called "cocktail party" problems.

8. Microsoft Speech Research Group

    The Speech Technology Group engages in research and development of spoken language technologies. We are interested not only in creating state-of-the-art spoken language components, but also in how these disparate components can come together with other modes of human-computer interaction to form a
unified, consistent computing environment. We are pursuing several projects to help us reach our vision of a fully speech-enabled computer. Projects at the group include:
- Whisper: speech recognition
- Whistler: speech synthesis (text-to-speech)
- Dr. Who: a multimodal conversational user interface
- Speech Application Programming Interface Development Toolkit
 

II. Asia

1. Advanced Telecommunications Research Institute International (ATR) in Japan

    This research institute in Japan conducts research that includes speech information processing. Areas of work include speech databases, a dialogue database, a hearing school, CHATR (speech synthesis), an intelligent sound monitoring system, etc.

2. PsTNLP Laboratory

    The talking computer is an innovative technology in today's wired world. Computers can already talk in various languages, such as English, French, Spanish, and Mandarin Chinese, and they are learning to talk in Thai as well. Computers still do not talk quite as well as humans do; naturalness and intelligibility are the key issues in enhancing the talking computer.
    Thai and some other Asian languages have no explicit word boundaries in written sentences. Consider the example 'ifyoucanreadthis'. Processing these languages therefore has word segmentation as a prerequisite. Thai word segmentation is not straightforward because of ambiguity: there are many possible ways to segment the words of a single sentence. For example, the contiguous string 'Importantproductsofregion' may be incorrectly segmented as 'Im-port-ant-product-so-fregion' or correctly segmented as 'Important-products-of-region'.
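
    Where the ambiguity comes from can be seen with a small dictionary-based enumeration; the toy English dictionary below stands in for a Thai lexicon, and deliberately contains the junk entries 'so' and 'fregion' so that the wrong reading above is reproduced:

    # Enumerate every way to split a boundary-free string into
    # dictionary words. (Toy dictionary; illustrative only.)
    DICT = {"im", "port", "ant", "important", "product", "products",
            "so", "of", "region", "fregion"}

    def segmentations(text):
        if not text:
            yield []
            return
        for end in range(1, len(text) + 1):
            word = text[:end]
            if word in DICT:
                for rest in segmentations(text[end:]):
                    yield [word] + rest

    for seg in segmentations("importantproductsofregion"):
        print("-".join(seg))
    # Prints the wrong reading im-port-ant-product-so-fregion as well as
    # the intended important-products-of-region, among others.

    A simple heuristic, maximal matching, prefers segmentations with fewer words, but as the toy dictionary shows it can tie or fail; practical systems score the alternatives with richer models such as a language model.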

3. Thai Text-to-Speech Project at NECTEC

    The aim of our software is to create synthesized Thai speech with natural-sounding accent and rhythm. A speech-unit inventory based on a demi-syllable approach has been adopted in the first phase; in total, 1,399 speech units have been recorded. The system is designed to analyze Thai text into appropriate words and phrases using an intelligent word segmentation algorithm and a shallow parser. The speech output is based on the concatenation of these speech units. Our team aims to investigate the following main areas of research: acoustic phonetics, corpus-based synthesis, discourse and speech synthesis, innovative speech signal processing techniques, input annotation and text processing, prosody processing, and utterance generation. Once fully developed, this software will be combined with Thai OCR software to "read" books directly for the blind.
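
    As an illustration of the concatenation step (a sketch only, not NECTEC's implementation; the unit names and waveforms below are synthetic stand-ins for recorded demi-syllables), units can be joined with a short cross-fade at each boundary:

    import numpy as np

    # Each demi-syllable name indexes a recorded waveform; here random
    # noise stands in for the 1,399 recorded units. 16 kHz assumed.
    RATE = 16000
    UNITS = {name: np.random.randn(RATE // 10) for name in ("k-a", "a-p")}

    def synthesize(unit_names, fade=160):
        """Concatenate unit waveforms, cross-fading `fade` samples per join."""
        out = UNITS[unit_names[0]].copy()
        ramp = np.linspace(0.0, 1.0, fade)
        for name in unit_names[1:]:
            nxt = UNITS[name]
            out[-fade:] = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
            out = np.concatenate([out, nxt[fade:]])
        return out

    # e.g. one syllable built from an initial and a final demi-syllable:
    wave = synthesize(["k-a", "a-p"])  # crude stand-in for a syllable like /kap/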

III. Australia

IV. Europe

1. Laureate at British Telecom

    Laureate is a text-to-speech system developed at BT Laboratories. It was designed to generate high-quality, natural synthetic speech from typed text while maintaining high intelligibility.
Laureate differs from other text-to-speech systems in that it speaks in a voice that is far more natural than the robotic-sounding synthesisers of the past. This makes it much easier for a listener to concentrate on the information in a synthesised message, not just the voice. Text-to-speech systems offer the ability to convert unpredictable or wide-ranging information into synthetic speech, with a flexibility that recorded-message systems cannot match. Speech is a natural information medium for telecommunications and hands-free environments, and is particularly useful for remote mobile access to centrally stored information.

2. The Festival Speech Synthesis System at the University of Edinburgh

    Festival is a general multi-lingual speech synthesis system developed at CSTR. It offers a full text-to-speech system with various APIs, as well as an environment for the development and research of speech synthesis techniques. It is written in C++ with a Scheme-based command interpreter for general control.
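
    As a usage note: the interpreter accepts Scheme commands such as (SayText "..."), and the festival binary can also be driven from other programs through its --tts mode, which speaks plain text read from standard input. A minimal Python sketch, assuming a standard Festival installation on the PATH:

    import subprocess

    # Pipe plain text to Festival's text-to-speech mode; Festival reads
    # the text from stdin and plays the synthesized speech.
    subprocess.run(["festival", "--tts"], input=b"Hello from Festival.", check=True)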

3. Spoken Language Working Group of the European Advisory Groups on Language Engineering Standards (EAGLES)

    A brief overview is provided of the EAGLES project [1], specifically with reference to progress achieved within the Spoken Language Working Group (SLWG). The goals, working structures, methods, and achievements are first briefly summarised [2], [4]. We then outline the major achievement of the project, the handbook of spoken language working practices and guidelines, with some discussion of important liaisons developing with other projects and bodies. The paper concludes with an overview of current plans and prospects for further extension and development of these activities.
    The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multi-modal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (including, for example, translation systems). The definition of de facto standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools.
    In these areas the de facto standards are derived from the consensus within the spoken language community previously established in a number of European [3] and national projects, with reference to important initiatives in the US and Japan. Primary among these have been the SAM projects (centred on component technology assessment and corpus creation), SQALE (for large-vocabulary systems assessment), and both SUNDIAL and SUNSTAR (for multi-modal systems). Past and present projects with significant outputs in the domain of assessment and resources include ARS, RELATOR, ONOMASTICA, and SPEECHDAT, as well as major national projects and programmes of research such as VERBMOBIL in Germany. This has led to an initial documentation of existing practice which is relatively comprehensive but in many respects heterogeneous and widely dispersed. The Spoken Language Working Group of the EAGLES project has addressed the task of collecting and unifying this existing body of material to provide an up-to-date baseline reference documentation to serve current and immediate future needs.

4. HMM-Based Trainable Speech Synthesis by Dr. Robert Donovan from Cambridge University's Engineering Department

    A speech synthesis system has been developed which uses a set of decision-tree state-clustered Hidden Markov Models to automatically select and segment a set of HMM-state sized sub-word units from a one hour single-speaker continuous-speech database for use in a concatenation synthesiser. Duration and energy parameters are also estimated automatically. The selected segments are concatenated using a TD-PSOLA synthesiser. The system can synthesise highly intelligible, fluent speech from a
word string of known phonetic pronunciation. It can be retrained on a new voice in less than 48 hours, and has been trained on four different voices. The segmental intelligibility of the speech has been measured using large-scale modified rhyme tests, and an error rate of 5.0% was obtained.
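
    The TD-PSOLA operation mentioned above can be sketched in a few lines; the version below is deliberately simplified (it re-spaces every analysis frame, so duration changes along with pitch, whereas full TD-PSOLA duplicates or drops frames to hold duration fixed), and the signal and pitch marks are synthetic:

    import numpy as np

    def psola(signal, marks, factor):
        """Shift pitch by `factor` (>1 raises it) via pitch-synchronous
        overlap-add of two-period Hanning-windowed frames."""
        period = int(np.median(np.diff(marks)))
        new_period = int(period / factor)
        out = np.zeros(int(len(signal) / factor) + 2 * period)
        pos = period
        for m in marks:
            lo, hi = m - period, m + period
            if lo < 0 or hi > len(signal):
                continue  # skip marks whose frame falls off the signal
            frame = signal[lo:hi] * np.hanning(2 * period)
            out[pos - period : pos + period] += frame  # overlap-add at new spacing
            pos += new_period
        return out[:pos]

    # Synthetic test: a 100 Hz sinusoid at 16 kHz, pitch raised by 20%.
    sr, f0 = 16000, 100
    t = np.arange(sr) / sr
    sig = np.sin(2 * np.pi * f0 * t)
    marks = np.arange(0, sr, sr // f0)
    shifted = psola(sig, marks, 1.2)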

5. The Speech Technology Group at the Technical University of Madrid

    The Speech Technology Group (Grupo de Tecnología del Habla, GTH) is part of the Department of Electronic Engineering (Departamento de Ingeniería Electrónica, DIE), which belongs to the Technical University of Madrid (Universidad Politécnica de Madrid, UPM), at the High Technical School of Telecommunication Engineering (Escuela Técnica Superior de Ingenieros de Telecomunicación, ETSIT). We carry out research and development in areas related to Speech Technology: recognition,
synthesis, understanding, etc.