The word-level segmentation problem is to determine the location of word boundaries, given a continuous speech stream of words that have previously been learned in isolation. Since there are no standard inter-word silences in continuous speech, and since several syllables may be needed to disambiguate a segmentation (e.g., "myself", "my selfish" and "I sell fish"), the on-line segmentation problem is challenging. The Segmentation ART network, developed to address this problem, employs a hierarchy of classification modules to recognize increasingly large chunks of input (phonemes, syllables and words), starting with segmented phonemes. The network employs fast learning, top-down expectation, and a spatial representation of temporal order.
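One common way to realize a spatial representation of temporal order, which the paragraph above mentions, is a recency gradient: each new item decays the existing activations and enters at full strength, so relative order is recoverable from relative activation. The sketch below is illustrative only (the function name and decay parameter are assumptions, not the dissertation's actual network equations):

```python
def encode_sequence(items, n_items, rho=0.7):
    """Encode a temporally ordered stream as a spatial activity pattern.

    Each presentation multiplies existing activations by a decay factor
    rho < 1 and sets the newest item's activation to 1, so temporal
    order is preserved spatially as a recency gradient (the most recent
    item is the most active). Assumes each item appears at most once.
    """
    x = [0.0] * n_items
    for item in items:
        x = [a * rho for a in x]  # decay all previous items
        x[item] = 1.0             # newest item enters at full strength
    return x

# Items presented in the order 0, 2, 1: activations now rank 1 > 2 > 0.
pattern = encode_sequence([0, 2, 1], n_items=4)
```

Note that with such a gradient, two streams containing the same items in different orders map to distinct spatial patterns, but patterns for similar orderings remain close, which motivates the noise-sensitivity problem discussed next.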
When a temporally ordered stream of inputs is represented as a spatial pattern, similar sequences are hard to distinguish in a noisy environment. More generally, small differences between similar inputs are often important for correct classification in a supervised learning setting. A new class of neural networks, the ARTMAP priming networks, is developed to address this problem by focusing attention on the differences between ambiguous categories while ignoring their common features. Several architectural variants within this class are compared on different problems and noise profiles, and improvements over existing supervised neural network classifiers such as Fuzzy ARTMAP are demonstrated.
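The core idea of priming, focusing attention on discriminative features while ignoring shared ones, can be sketched in a few lines. This is a deliberately simplified illustration (the function names and the distance rule are assumptions for exposition, not the ARTMAP priming equations): when two category prototypes are ambiguous for an input, weight each feature by how much the prototypes disagree on it before comparing.

```python
def discriminative_weights(w_a, w_b):
    """Attention weights proportional to the disagreement between two
    candidate category prototypes; features the prototypes share get
    zero weight, so only discriminative features influence the choice."""
    diff = [abs(a - b) for a, b in zip(w_a, w_b)]
    total = sum(diff)
    if total == 0:
        return [1.0 / len(diff)] * len(diff)  # identical prototypes
    return [d / total for d in diff]

def choose_category(x, w_a, w_b):
    """Pick the prototype closer to input x under the primed weighting."""
    w = discriminative_weights(w_a, w_b)
    d_a = sum(wi * abs(xi - ai) for wi, xi, ai in zip(w, x, w_a))
    d_b = sum(wi * abs(xi - bi) for wi, xi, bi in zip(w, x, w_b))
    return 'A' if d_a <= d_b else 'B'

# Prototypes agree on the first two features; only the third decides.
label = choose_category([0.6, 0.9, 0.2], [1, 1, 0], [1, 1, 1])
```

In this toy example, noise on the two shared features is ignored entirely, which is the behavior the priming networks aim for in a learned, adaptive form.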
A final problem considered in my dissertation is to convert a speech signal into a sequence of phonemes that form the input to the word-level segmentation network. Noise and other sources of variability in the signal make this a difficult task. The approach begins with preprocessing that divides the digitized speech signal into a series of fixed-length speech frames and computes a feature vector from the information in each frame. A publicly available off-the-shelf preprocessor developed by Dr. Tony Robinson is used. Neural network models that learn to predict the phoneme uttered during each frame, taking the preprocessor's feature vector as input, are explored.
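The framing step described above can be sketched as follows. This is a minimal illustration of dividing a signal into fixed-length frames and computing a toy feature vector per frame; the frame length, hop size, and the two features shown (log energy and zero-crossing count) are assumptions for exposition, as the actual preprocessor used in the dissertation computes a much richer representation:

```python
import math

def frame_signal(signal, frame_len, hop):
    """Divide a digitized signal into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

def frame_features(frame):
    """Toy per-frame feature vector: log energy and zero-crossing count."""
    energy = sum(s * s for s in frame)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return [math.log(energy + 1e-9), zero_crossings]

# Each frame's feature vector would then be fed to the phoneme classifier.
features = [frame_features(f) for f in frame_signal([1, -1] * 8, 8, 4)]
```

A network trained on such per-frame vectors predicts one phoneme label per frame; collapsing runs of identical labels then yields the phoneme sequence passed to the segmentation network.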