Speech Recognition using Neural Networks



while temporal variability covers different speaking rates. These two dimensions are not
completely independent -- when a person speaks quickly, his acoustical patterns become
distorted as well -- but it's a useful simplification to treat them independently.

Of these two dimensions, temporal variability is easier to handle. An early approach to
temporal variability was to linearly stretch or shrink ("warp") an unknown utterance to the
duration of a known template. Linear warping proved inadequate, however, because utter-
ances can accelerate or decelerate at any time; instead, nonlinear warping was obviously
required. Soon an efficient algorithm known as Dynamic Time Warping was proposed as a
solution to this problem. This algorithm (in some form) is now used in virtually every
speech recognition system, and the problem of temporal variability is considered to be
largely solved

1
.
Acoustic variability is more difficult to model, partly because it is so heterogeneous in
nature. Consequently, research in speech recognition has largely focused on efforts to
model acoustic variability. Past approaches to speech recognition have fallen into three
main categories:

1.
Template-based approaches, in which unknown speech is compared against a set
of prerecorded words (templates), in order to find the best match. This has the
advantage of using perfectly accurate word models; but it also has the disadvan-
tage that the prerecorded templates are fixed, so variations in speech can only be
modeled by using many templates per word, which eventually becomes impracti-
cal.

2.
Knowledge-based approaches, in which "expert" knowledge about variations in
speech is hand-coded into a system. This has the advantage of explicitly modeling
variations in speech; but unfortunately such expert knowledge is difficult to obtain
and use successfully, so this approach was judged to be impractical, and automatic
learning procedures were sought instead.

3.
Statistical-based approaches, in which variations in speech are modeled statisti-
cally (e.g., by Hidden Markov Models, or HMMs), using automatic learning proce-
dures. This approach represents the current state of the art. The main disadvantage
of statistical models is that they must make a priori modeling assumptions, which
are liable to be inaccurate, handicapping the system's performance. We will see
that neural networks help to avoid this problem.

1.2. Neural Networks
Connectionism, or the study of artificial neural networks, was initially inspired by neuro-
biology, but it has since become a very interdisciplinary field, spanning computer science,
electrical engineering, mathematics, physics, psychology, and linguistics as well. Some
researchers are still studying the neurophysiology of the human brain, but much attention is