PhD Thesis from 2002


Remco Teunen
September 2002


Despite the considerable progress made in recent years, automatic speech recognition is far from being a solved problem. In particular, the accuracy of a speech recognizer degrades dramatically when the training conditions do not match the real usage conditions.

State-of-the-art speech recognizers use hidden Markov models (HMMs) and Gaussian mixture models (GMMs) with millions of parameters to model speech. The collection of all these models is called the acoustic model set of the speech recognizer. The parameters are trained with speech from thousands of different speakers to capture the variability of speech. However, the current acoustic model set overgeneralizes and cannot capture certain constraints in speech that are relevant for recognition. For example, the acoustic model set does not take into account that the gender of a speaker cannot change within an utterance. Furthermore, experiments have shown that the acoustic model set is often unable to take advantage of the rapidly growing amount of training data that is now available through commercial applications.
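
As a rough illustration of what a single GMM in such an acoustic model set computes, the sketch below scores one feature vector against a mixture of diagonal-covariance Gaussians. The function names and the diagonal-covariance assumption are illustrative choices, not details taken from the thesis.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log likelihood of one feature vector under a Gaussian mixture model."""
    # Per-component log(weight) + log density.
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum numerically stable for very small densities.
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))
```

In an HMM/GMM recognizer, a score of this kind would be evaluated for every HMM state and every speech frame, which is where the large parameter counts mentioned above come from.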

In this work, a novel technique for deriving discriminative Gaussian networks (GNs) from training data is presented. The Gaussian networks can be viewed as HMM/GMM models that have complex HMM structures and simple, single-Gaussian GMMs. The models are iteratively grown in complexity by splitting HMM states into two states. In each iteration, the algorithm splits the states that are expected to give the most significant error rate reduction. The model parameters are discriminatively trained as well, using an improved version of the maximum mutual information (MMI) training algorithm. Evaluations on the Aurora 2 industry-standard benchmark and on a small-vocabulary recognition task show that GN acoustic models are both more accurate and more robust than comparable HMM/GMM acoustic models.
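
For readers unfamiliar with MMI training, the sketch below computes the standard MMI criterion: the sum over training utterances of the log posterior of the correct transcription given the acoustics. It shows only the textbook objective, not the improved variant developed in this work, and the data layout (a dictionary of joint log scores per utterance) is an assumption made for the example.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def mmi_objective(utterances):
    """Standard MMI criterion over a set of training utterances.

    Each utterance is a pair (scores, correct), where scores maps every
    competing hypothesis w to log p(O | M_w) + log P(w), and correct is
    the key of the reference transcription (hypothetical layout).
    """
    total = 0.0
    for scores, correct in utterances:
        # Log posterior of the reference: numerator minus log of the
        # sum over all competing hypotheses.
        total += scores[correct] - log_sum_exp(list(scores.values()))
    return total
```

The criterion is at most zero and approaches zero as the models assign nearly all posterior mass to the correct transcriptions. Maximum likelihood training, by contrast, raises only the numerator term and ignores the competing hypotheses.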