ACOUSTIC MODELING FOR AUTOMATIC SPEECH RECOGNITION:
DERIVING DISCRIMINATIVE GAUSSIAN NETWORKS
Remco Teunen
September 2002

Abstract

Despite the considerable progress made in recent years, automatic speech recognition is
far from being a solved problem. In particular, the accuracy of a speech recognizer
degrades dramatically when there is a mismatch between the training conditions and real usage conditions.
State-of-the-art speech recognizers use hidden Markov models (HMMs) and Gaussian
mixture models (GMMs) with millions of parameters to model speech. The collection
of all these models is called the acoustic model set of the speech recognizer. The parameters
are trained with speech from thousands of different speakers to capture the variabilities
of speech. However, the current acoustic model set overgeneralizes and is not able to
capture certain constraints in speech that are relevant for recognition. For example, the
acoustic model set does not take into account that the gender of a speaker cannot change
within an utterance. Furthermore, experiments have shown that the acoustic model set is
often not able to take advantage of the rapidly growing amount of training data that is
now available from commercial applications.
In this work, a novel technique for deriving discriminative Gaussian networks (GNs)
from training data is presented. The Gaussian networks can be viewed as HMM/GMM
models that have complex HMM structures and simple, single-Gaussian GMMs. The
models are iteratively grown in complexity by splitting HMM states into two states. In
each iteration, the algorithm splits the states that are expected to give the most significant
error rate reduction. The model parameters are discriminatively trained as well, using an
improved version of the maximum mutual information (MMI) training algorithm.
Evaluations on the Aurora 2 industry-standard benchmark and on a small-vocabulary
recognition task show that GN acoustic models are both more accurate and more robust
than comparable HMM/GMM acoustic models.
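
For reference, the standard MMI criterion on which the discriminative training mentioned above builds can be written as follows. This is the textbook form of the objective, not the improved version developed in this work. The notation is assumed here for illustration: \lambda denotes the acoustic model parameters, O_r the observation sequence of training utterance r, W_r its reference transcription, W' any competing word sequence, and P(\cdot) the language model probability.

```latex
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(O_r \mid W_r)\, P(W_r)}
         {\sum_{W'} p_{\lambda}(O_r \mid W')\, P(W')}
```

Maximizing this quantity raises the posterior probability of the reference transcription relative to all competing hypotheses, whereas maximum likelihood training considers only the numerator.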