Survey of Speech Research around the World
by Pisit P, November 1999, PsTNLP Laboratory
I. America

1. Bell Labs Text-to-Speech System
The Bell Labs Text-to-Speech system (TTS) has various applications, including reading electronic mail messages, generating spoken prompts in voice response systems, and serving as an interface to an order-verification system for salespeople in the field. TTS is implemented entirely in software, and only standard audio capability is required. At present, it contains several components, each of which handles a different task. For example, the text analysis capabilities of the system detect the ends of sentences, perform some rudimentary syntactic analysis, expand digit sequences into words, and disambiguate and expand abbreviations into normally spelled words, which can then be analyzed by the dictionary-based pronunciation module. The following sentences illustrate the system's text-analysis capabilities.
Lumber will cost $3.95 for 7 ft. on Sat.
That fossil from NM is 165,256,011 yr. old.
Dr. Smith lives on Oak Dr., but St. John lives on 71st St.
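These examples exercise the kind of text normalization sketched below. The sketch is only a minimal illustration, not the Bell Labs implementation; the abbreviation table and the money-expansion rule are invented for the example.

```python
import re

# Illustrative abbreviation table; a real system disambiguates by context
# (e.g. "Dr." -> "Doctor" before a name, but "Drive" after a street name).
ABBREVIATIONS = {
    "ft.": "feet", "Sat.": "Saturday", "yr.": "year",
    "St.": "Street",  # or "Saint", depending on context
    "Dr.": "Doctor",  # or "Drive", depending on context
}

def expand_money(match: re.Match) -> str:
    """Expand a simple $X.YY amount into words (digits left as digits here)."""
    dollars, cents = match.group(1), match.group(2)
    return f"{dollars} dollars and {cents} cents"

def normalize(text: str) -> str:
    # Expand currency amounts like $3.95.
    text = re.sub(r"\$(\d+)\.(\d{2})", expand_money, text)
    # Expand known abbreviations (no contextual disambiguation in this sketch).
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

print(normalize("Lumber will cost $3.95 for 7 ft. on Sat."))
# -> "Lumber will cost 3 dollars and 95 cents for 7 feet on Saturday."
```

Note that the sketch does no contextual disambiguation, while the third example above needs it: "Dr." must become "Doctor" before a name but "Drive" after a street name.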
The pronunciation module provides pronunciations for most ordinary words and morphological derivatives thereof, as well as proper names; default strategies exist for pronouncing words not recognized by the dictionary-based methods. Other components handle prosodic phrasing, word accentuation, sentence intonation, and the actual speech synthesis. We believe that the word pronunciation and intelligibility of our American English TTS system are the best available. However, we are continuously working to improve its naturalness. We are also expanding the set of languages that TTS can support, including German, Chinese (Mandarin and Taiwanese), Russian, French, Romanian, Italian, Spanish (Latin American and Castilian), Japanese, and Navajo.
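As an illustration of dictionary lookup with a default letter-to-sound fallback of the kind described above, here is a minimal sketch; the dictionary entries and fallback rules are toy data, not the system's:

```python
# Toy pronunciation module: dictionary lookup with a naive
# letter-to-sound fallback for out-of-vocabulary words.
# Phone symbols are ARPAbet-like; all data here is illustrative.
DICTIONARY = {
    "lumber": ["L", "AH", "M", "B", "ER"],
    "street": ["S", "T", "R", "IY", "T"],
}

# One-phone-per-letter fallback; real systems learn context-dependent
# letter-to-sound rules, often trained from the dictionary itself.
FALLBACK = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K",
    "y": "Y", "z": "Z",
}

def pronounce(word: str) -> list[str]:
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    # Default strategy: fall back to letter-by-letter phone assignment.
    return [FALLBACK[ch] for ch in word if ch in FALLBACK]

print(pronounce("lumber"))   # dictionary hit
print(pronounce("frobnix"))  # out-of-vocabulary: letter-to-sound fallback
```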
2. DECtalk™ Text-to-Speech Synthesis Software
Converts standard text into highly intelligible and human-like speech.

DECtalk Software Features:
- Converts standard ASCII text into natural, highly intelligible speech
- Speech output through any audio device is supported by Microsoft Video for Windows or Multimedia Services for Digital UNIX
- API gives developers direct access to text-to-speech functions
- Choose from nine voice personalities -- four female, four male, one child
- Provides punctuation and tonal control
- Unlimited vocabulary
- Supports customized pronunciation of trade jargon and acronyms
- Bundled applets and sample code enable you to get started quickly and productively
- Provides human voice quality and advanced phonetic and linguistic pronunciation
- Common programming interface works with both Alpha and Intel platforms
- Enables users to proofread text, such as spreadsheets, far more easily against written source material
- Mobile professionals can access electronic mail and other documents, significantly increasing on-the-road productivity
- Great for form filling and word processing, too!
Description
Digital, an industry leader in text-to-speech
synthesis technology, proudly offers DECtalk, advanced desktop software
that lets you create and employ applications that electronically speak
to users. With DECtalk, your applications can convert standard text into
highly intelligible, human-like speech through a sound card on your system.
Use DECtalk to provide important information in situations where reading
text would be inconvenient or impossible, such as mobile computing, telephone-based
data access
services for consumers, and even electronic
mail. DECtalk is versatile, too -- you can incorporate it into any application
running on Windows NT or Digital UNIX. And it offers you a choice of nine rich voice personalities, intonation and speed control, and built-in phonetic, linguistic, and pronunciation rules.
3. Speech Research at Carnegie Mellon University
On-line resources for speech technology research, development, and deployment. These include groups, projects, and research centers; papers; a dictionary; a survey of the state of the art in human language technology; and links to speech labs and other resources.
4. Speech at Apple Computer
You don't have to be a Star Trek fan to know that the computer of the future will talk, listen, and understand. That computer of the future is the Apple Macintosh of today. Apple's Speech Recognition and Speech Synthesis Technologies now give speech-savvy applications the power to carry out your voice commands, and even speak back to you in plain English, and now, Spanish. The offering includes both text-to-speech and speech recognition.
5. Center for Spoken Language Understanding at Johns Hopkins University
The research center conducts research in the areas of Automatic Speech Recognition, Natural Language Processing, Speech and VLSI, Cognitive Science, etc.
6. Spoken Language Systems Laboratory at MIT
In order to make it possible for humans to speak to computers, a conversational interface is needed. A conversational interface enables humans to converse with machines (in much the same way we communicate with one another) in order to create, access, and manage information and to solve problems. It is what Hollywood and every "vision of the future" tells us we must have. Since 1989, getting computers to communicate the way people do -- by speaking and listening -- has been the objective of the Spoken Language Systems (SLS) Group at MIT's Laboratory for Computer Science.
7. Machine Listening Group at the MIT Media Lab
The Machine Listening
Group is working towards bridging the gap between the current generation
of audio technologies and those that will
be needed for future interactive media applications.
Our research includes new description-based
representations for audio that enable controllable,
compact, and computationally efficient sound
and music rendering and presentation.
We are also concerned with the application
of research from psycho-acoustics and auditory
perception and cognition to engineering solutions
for new audio applications. These include 3D
spatialization, virtual acoustics and computational
analysis of complex real-world auditory scenes
such as multi-instrument music recordings
and so-called "cocktail party" problems.
8. Microsoft Speech Research Group
The Speech Technology
Group engages in research and development of spoken language technologies.
We are interested not only in creating state-of-the-art spoken language
components, but also in how these disparate components can come together
with other modes of human-computer interaction to form a
unified, consistent computing environment.
We are pursuing several projects to help us reach our vision of a fully speech-enabled computer. Projects at the group include:
- Whisper (Speech Recognition)
- Whistler (Speech Synthesis / Text-to-Speech)
- Dr. Who (Multimodal Conversational User Interface)
- Speech Application Programming Interface Development Toolkit
II. Asia
1. Advanced Telecommunications Research Institute International (ATR) in Japan
This research institute in Japan conducts research that includes speech information processing. Its areas of research include speech databases, a dialogue database, the hearing school, CHATR, an intelligent sound monitoring system, etc.
2. PsTNLP Laboratory
The talking computer is an innovative technology in today's wired world. Computers can already talk in various languages, for example English, French, Spanish, and Mandarin Chinese, and they are learning to talk Thai as well. Computers still do not talk quite as well as humans do; naturalness and intelligibility are the key issues in enhancing the talking computer.

Thai and some other Asian languages have no explicit word boundaries in written sentences; consider the example 'ifyoucanreadthis'. Thus, processing these languages requires word segmentation as a prerequisite. Thai word segmentation is not obvious because of ambiguity: there are many possible ways to segment one sentence into words. For example, the contiguous sentence 'Importantproductsofregion' may be incorrectly segmented as 'Im-port-ant-product-so-fregion' or correctly segmented as 'Important-products-of-region'.
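To make dictionary-based segmentation concrete, here is a minimal greedy longest-matching sketch over the example above, using a toy English lexicon in place of a Thai one; it is not the laboratory's actual algorithm, which must also resolve ambiguity that greedy matching cannot:

```python
# Minimal greedy longest-matching word segmentation.
# The dictionary is a toy English stand-in for a Thai lexicon.
DICTIONARY = {"important", "im", "port", "ant", "products",
              "product", "so", "of", "region"}

def segment(text: str, max_word_len: int = 10) -> list[str]:
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
        else:
            # Unknown character: emit it as-is and move on.
            words.append(text[i])
            i += 1
    return words

print(segment("importantproductsofregion"))
# -> ['important', 'products', 'of', 'region']
```

Greedy longest matching still fails on genuinely ambiguous strings, which is why practical Thai segmenters add statistical or grammar-based disambiguation.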
3. Thai Text-to-Speech Project at NECTEC
The aim of our software is to create synthesized Thai speech with natural-sounding accent and rhythm. A speech unit based on the demi-syllable approach has been adopted in the first phase; in total, 1,399 speech units have been recorded. The system is designed to analyze Thai text into appropriate words and phrases by using an intelligent word segmentation algorithm and a shallow parser. The speech output is based on the concatenation of these speech units. Our team aims to investigate the following main areas of research: acoustic phonetics, corpus-based synthesis, discourse and speech synthesis, innovative speech signal processing techniques, input annotations and text processing, prosody processing, and utterance generation. Once fully developed, Thai OCR software and this Thai text-to-speech software will be combined to "read" books directly for the blind.
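To illustrate the concatenation step, here is a bare-bones sketch that joins prerecorded unit waveforms with a short linear crossfade; the unit names and the load_unit helper are hypothetical, and the real system's prosodic smoothing is omitted:

```python
import numpy as np

def crossfade_concat(units: list[np.ndarray], overlap: int = 80) -> np.ndarray:
    """Concatenate speech-unit waveforms with a linear crossfade of
    `overlap` samples to soften the joins (real systems also apply
    prosodic smoothing, e.g. pitch and duration modification)."""
    out = units[0].astype(np.float32)
    fade_in = np.linspace(0.0, 1.0, overlap)
    for unit in units[1:]:
        unit = unit.astype(np.float32)
        # Blend the tail of `out` with the head of the next unit.
        out[-overlap:] = out[-overlap:] * (1.0 - fade_in) + unit[:overlap] * fade_in
        out = np.concatenate([out, unit[overlap:]])
    return out

# Hypothetical usage with two demi-syllable units loaded elsewhere:
# wave = crossfade_concat([load_unit("kh-a_left.wav"), load_unit("a-n_right.wav")])
```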
III. Australia
IV. Europe
1. Laureate at British Telecom
Laureate is a text-to-speech
system developed at BT Laboratories. It was designed to generate high quality
natural synthetic speech from typed text, while maintaining high intelligibility.
Laureate is different from other text-to-speech
systems because it speaks in a voice that is far more natural than the
robotic sounding synthesisers of the past. This makes it much easier for
a listener to concentrate on the information in a synthesised message,
not just the voice. Text-to-speech systems offer the ability to convert unpredictable or wide-ranging information into synthetic speech, with a
flexibility that recorded message systems cannot match. Speech is a natural
information medium for telecommunications and hands-free environments,
and is particularly useful for remote mobile access to centrally stored
information.
2. The Festival Speech Synthesis System at the University of Edinburgh
Festival is a general multi-lingual speech synthesis system developed at CSTR. It offers a full text-to-speech system with various APIs, as well as an environment for development and research of speech synthesis techniques. It is written in C++ with a Scheme-based command interpreter for general control.
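As a usage sketch, Festival can also be driven from another program by piping text to its batch text-to-speech mode, assuming the festival binary is installed and on the PATH:

```python
import subprocess

def say(text: str) -> None:
    # Pipe text to Festival's batch text-to-speech mode (`festival --tts`).
    # Assumes the `festival` binary is installed and on the PATH.
    subprocess.run(["festival", "--tts"], input=text.encode("utf-8"), check=True)

say("Hello from the Festival speech synthesis system.")
```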
3. Spoken Language Working Group of the Expert Advisory Group on Language Engineering Standards (EAGLES)
A brief overview is provided of the EAGLES project ([1]), specifically with reference to progress achieved within the Spoken Language Working Group (SLWG). The goals, working structures,
methods, and achievements are first briefly summarised ([2], [4]). We then
outline the major achievement of the project, the handbook of Spoken Language
working practices and guidelines, with some discussion of important liaisons
developing with other projects and bodies. The paper concludes with an
overview of current plans and prospects for further extension and development
of these activities. The domain of spoken language technologies ranges
from speech input and output systems to complex understanding and generation
systems, including multi- modal systems of widely differing complexity
(such as automatic dictation machines) and multilingual systems (including
for example translation systems). The definition of de facto standards
and evaluation methodologies for such systems involves the specification
and development of highly specific spoken language corpus and lexicon resources,
and measurement and evaluation tools. In these areas the de facto
standards are derived from the consensus within the spoken language community
previously established in a number of European ([3]) and national projects,
with reference to important initiatives in the US and Japan. Primary among
these have been the SAM projects (centred on component technology assessment
and corpus creation), SQALE (for large vocabulary systems assessment) and
both SUNDIAL and SUNSTAR (for multi-modal systems). Past and present projects
with significant outputs in the domain of assessment and resources include
ARS, RELATOR, ONOMASTICA and SPEECHDAT, as well as major national projects
and programmes of research such as VERBMOBIL in Germany. This has led to
an initial documentation of existing practice which is relatively comprehensive
but in many respects heterogeneous and widely dispersed. The Spoken Language
Working Group of the EAGLES project has addressed the task of collecting
and unifying this existing body of material to provide an up-to-date baseline
reference documentation to serve current and immediate future needs.
4. HMM-Based Trainable Speech Synthesis by Dr. Robert Donovan, Cambridge University Engineering Department
A speech synthesis system
has been developed which uses a set of decision-tree state-clustered Hidden
Markov Models to automatically select and segment a set of HMM-state sized
sub-word units from a one hour single-speaker continuous-speech database
for use in a concatenation synthesiser. Duration and energy parameters
are also estimated automatically. The selected segments are concatenated
using a TD-PSOLA synthesiser. The system can synthesise highly intelligible,
fluent speech from a
word string of known phonetic pronunciation.
It can be retrained on a new voice in less than 48 hours, and has been trained on four different voices. The segmental intelligibility of the speech has been measured using large-scale modified rhyme tests, and an error rate of 5.0% was obtained.
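As a rough illustration of the selection step, the toy sketch below gives each target HMM state a cluster of candidate database segments and greedily picks candidates that minimize a join cost between segment edges; the features and cost here are invented for the example and are not the paper's actual criteria:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_candidates(n=5, seglen=120):
    # Each candidate: (waveform stand-in, left-edge features, right-edge features).
    return [(rng.standard_normal(seglen),
             rng.standard_normal(4),
             rng.standard_normal(4))
            for _ in range(n)]

# Toy inventory keyed by clustered HMM state; in the real system these
# segments come from a state-aligned single-speaker speech database.
inventory = {"s1": fake_candidates(), "s2": fake_candidates(), "s3": fake_candidates()}

def select_units(state_sequence):
    """Greedily pick, for each state, the candidate whose left edge best
    matches the right edge of the previously chosen segment (a join cost)."""
    chosen = [inventory[state_sequence[0]][0]]
    for state in state_sequence[1:]:
        prev_right = chosen[-1][2]
        best = min(inventory[state],
                   key=lambda c: float(np.sum((c[1] - prev_right) ** 2)))
        chosen.append(best)
    return [c[0] for c in chosen]

segments = select_units(["s1", "s2", "s3"])
print(len(segments), "segments selected")
```

The selected segments would then be concatenated with a TD-PSOLA synthesiser, which also lets the system impose the estimated durations and energies.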
5. The Speech Technology Group
The Speech Technology
Group (Grupo de Tecnología del Habla, GTH) is part of the Department
of Electronic Engineering (Departamento de Ingeniería Electrónica,
DIE), which belongs to the Technical University of Madrid (Universidad
Politécnica de Madrid, UPM), at the Higher Technical School of Telecommunication
Engineering (Escuela Técnica Superior de Ingenieros de Telecomunicación,
ETSIT). We carry out research and development in areas related to Speech
Technology: recognition,
synthesis, understanding, etc.