Understanding Hidden Markov Model for Speech Recognition

Hidden Markov Model:

A Hidden Markov Model (HMM) is a statistical model with a finite set of states, in which the states themselves are hidden (unobservable) and only the outputs they produce can be observed. The current state depends only on the immediately preceding state. The observer cannot see the hidden states, but can see the observations, whose probabilities depend on the hidden states.



A Hidden Markov Model explains the probability of an observed sequence in terms of the hidden states that generated it. Traditional speech recognition systems use an HMM as the acoustic model: the system recognizes speech and produces text as output by working with phonemes. In speech recognition, the hidden states are phonemes, whereas the observations are the speech or audio signal.
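To make the hidden/observed split concrete, here is a minimal sketch of an HMM as three tables: start probabilities, transition probabilities, and emission probabilities. The phoneme labels, the symbol names, and every number are made up for illustration. Real acoustic models emit continuous feature vectors rather than a handful of discrete symbols.

```python
import random

# Hypothetical phoneme states and quantized acoustic symbols (illustration only).
STATES = ["/k/", "/ae/", "/t/"]
SYMBOLS = ["burst", "vowel", "silence"]

START = [1.0, 0.0, 0.0]      # START[i] = P(first state is i)
TRANS = [                    # TRANS[i][j] = P(next state is j | current state is i)
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
EMIT = [                     # EMIT[i][k] = P(symbol k is observed | state is i)
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.1, 0.4],
]


def sample(length, rng):
    """Generate a hidden state sequence and the observations it emits."""
    state = rng.choices(range(len(STATES)), weights=START)[0]
    hidden, observed = [], []
    for _ in range(length):
        hidden.append(STATES[state])
        symbol = rng.choices(range(len(SYMBOLS)), weights=EMIT[state])[0]
        observed.append(SYMBOLS[symbol])
        # The next state depends only on the current state.
        state = rng.choices(range(len(STATES)), weights=TRANS[state])[0]
    return hidden, observed


hidden, observed = sample(8, random.Random(0))
print("hidden  :", hidden)     # what the recognizer never sees directly
print("observed:", observed)   # what the recognizer actually receives
```

The transition table here only allows staying in a phoneme or moving forward to the next one, which is an assumption chosen to mimic how a spoken word unfolds in time.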



Hidden Markov Models are widely used in fields where hidden variables control the observable variables. Applications include speech recognition, image recognition, gesture recognition, handwriting recognition, part-of-speech tagging, and time series analysis.


Speech recognition systems are commonly classified as:

1. Speaker Dependent

2. Speaker Independent

3. Single Word Recognizer

4. Continuous Word Recognizer



Speech recognition with an HMM involves three steps:

1. Feature Extraction

2. Feature Matching

3. Word-Phoneme Pairing


Feature Extraction:

The input is a speech or audio signal in analog form, which the system cannot process directly because it only works with digital data. The signal is therefore digitized, and the digital samples are then reduced to a compact numeric representation. This process is known as Feature Extraction.

In Feature Extraction, the signal is first divided into short segments of, say, 10 ms each. Each segment of the speech signal is called a Frame. Each Frame is converted from the time domain to the frequency domain using a Fourier transform. The output of Feature Extraction is a sequence of Feature Vectors, one per Frame, that describes the spectral content of the audio signal in digital form.

Three methods are commonly used for Feature Extraction:

  1. Mel Frequency Cepstral Coefficients (MFCC)
  2. Linear Predictive Coding (LPC)
  3. Short Time Fourier Transform (STFT)


Mel Frequency Cepstral Coefficients (MFCC):

The Fourier spectrum of each frame is mapped onto the Mel scale (a nonlinear frequency scale that approximates how humans perceive pitch) using a bank of overlapping filters. The logarithm of the energy in each Mel band is taken, and a discrete cosine transform (DCT) is then applied to these log energies. The resulting coefficients give a compact description of the spectral shape of the frame. MFCC is the most widely used method for feature extraction.
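The MFCC steps can be sketched for a single frame using only the Python standard library. This is a simplified illustration, not a production front end: the Mel formula `2595 * log10(1 + f / 700)` is one common convention, the filter count and coefficient count are arbitrary defaults, and steps such as pre-emphasis and windowing are left out.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power of each non-negative frequency bin of a plain DFT."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        powers.append(abs(coeff) ** 2 / n)
    return powers


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for f in range(1, num_filters + 1):
        left, centre, right = bins[f - 1], bins[f], bins[f + 1]
        weights = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):          # rising edge
            weights[k] = (k - left) / (centre - left)
        for k in range(centre, right):         # falling edge
            weights[k] = (right - k) / (right - centre)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12):
    powers = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log energy per Mel band; the floor avoids log(0) for empty bands.
    log_energy = [math.log(max(sum(w * p for w, p in zip(weights, powers)), 1e-10))
                  for weights in bank]
    # DCT-II of the log energies.
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energy))
            for i in range(num_coeffs)]


# A 440 Hz test tone, 256 samples at 8 kHz (values chosen for illustration).
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
print([round(c, 2) for c in mfcc(tone, 8000)])
```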


Linear Predictive Coding (LPC):

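LPC divides the audio signal into frames, typically 30 to 50 frames per second, and extracts a feature vector from each one. Within a frame, each sample is modeled as a linear combination of the samples that precede it. The resulting predictor coefficients describe the envelope of the power spectrum.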
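One common way to obtain LPC coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that approach (the article does not say which variant it has in mind), and the function names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum over n of frame[n] * frame[n - lag]."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Return (coeffs, error) such that frame[n] is approximated by
    coeffs[0] * frame[n-1] + coeffs[1] * frame[n-2] + ..."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)      # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:         # silent or perfectly predicted frame
            break
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


# A decaying signal where every sample is 0.9 times the previous one.
frame = [0.9 ** n for n in range(200)]
coeffs, error = lpc_coefficients(frame, order=2)
print(coeffs)   # the first coefficient should be close to 0.9, the second close to 0
```

The coefficient list returned for each frame is what would serve as that frame's feature vector.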


Short Time Fourier Transform (STFT):

The Short Time Fourier Transform divides the speech signal into short segments and computes the Fourier transform of each one, which gives the Fourier spectrum of every segment of the signal. The spectrum of each segment serves as the feature vector for that part of the audio.
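A small sketch of the idea, using a plain DFT so that it needs nothing beyond the standard library. The Hann window, the 10 ms frame length, and the 8 kHz sample rate are assumptions made for the example. A real system would use an FFT routine instead of this slow direct sum.

```python
import cmath
import math


def hann(n):
    """Hann window, which tapers the edges of each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitudes(signal, frame_len, hop):
    """Return one magnitude spectrum (list of floats) per frame."""
    window = hann(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / frame_len)
                        for t, x in enumerate(frame))
            spectrum.append(abs(coeff))
        spectra.append(spectrum)
    return spectra


SAMPLE_RATE = 8000                 # samples per second (assumed)
FRAME_LEN = SAMPLE_RATE // 100     # 10 ms frames, i.e. 80 samples
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]

spectra = stft_magnitudes(tone, FRAME_LEN, hop=FRAME_LEN)
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
print(len(spectra), "frames; strongest bin is",
      peak_bin * SAMPLE_RATE / FRAME_LEN, "Hz")
```

Each bin is `SAMPLE_RATE / FRAME_LEN` Hz wide, so shorter frames give better time resolution at the cost of coarser frequency resolution.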


Feature Matching:

Feature Matching is the process of training the acoustic model on the extracted feature vectors. It establishes the relationship between feature vectors and phonemes. Phonemes are the distinct units of sound that distinguish one word from another. The Gaussian Mixture Model is the technique most often used for Feature Matching.


Gaussian Mixture Model:

A Gaussian Mixture Model (GMM) is used as a classifier that compares the extracted feature vectors with the stored templates. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm.
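The weighted-sum density and the EM updates can be sketched for one-dimensional data as follows. Real speech features are multi-dimensional vectors, so this is only an illustration of the mechanics. The initialization (means spread between the smallest and largest value) and the variance floor are choices made for this example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Weighted sum of Gaussian component densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def fit_gmm_em(data, num_components=2, iterations=50):
    """Fit a 1-D GMM with EM. Requires num_components >= 2."""
    lo, hi = min(data), max(data)
    means = [lo + (hi - lo) * k / (num_components - 1)
             for k in range(num_components)]
    centre = sum(data) / len(data)
    spread = sum((x - centre) ** 2 for x in data) / len(data)
    variances = [spread] * num_components
    weights = [1.0 / num_components] * num_components

    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weights, means, and variances.
        for k in range(num_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)   # floor keeps a component from collapsing
    return weights, means, variances


# Synthetic data from two clusters (illustration only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + \
       [rng.gauss(5.0, 1.0) for _ in range(200)]
weights, means, variances = fit_gmm_em(data)
print("weights:", weights)
print("means  :", means)
```

In an acoustic model, each phoneme state would have its own mixture, and `gmm_pdf` would supply the probability of a feature vector given that state.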


Word-Phoneme Pairing:

During training, the Gaussian Mixture Model learns which phonemes correspond to which feature vectors. During testing, the procedure runs in reverse: each feature vector is matched against the previously trained phonemes, the sequence of phonemes is recognized, and the recognized word is output in text format.


Three basic problems for HMM:

Problem 1: Evaluation Problem

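Given a model and an observation sequence, compute the probability that the sequence was produced by the model. When several candidate models are compared, the one with the maximum probability is the best match. The Forward algorithm solves the Evaluation problem efficiently. The Viterbi algorithm is sometimes used as an approximation, because it gives the probability of the single best state path rather than the sum over all paths.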
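A minimal sketch of the Forward algorithm for a discrete HMM. The two-state model and its numbers are a toy example invented for illustration, with observations given as indices of quantized acoustic symbols.

```python
# Toy model: two hypothetical phoneme states, three observation symbols.
START = [0.6, 0.4]                 # START[i]    = P(first state is i)
TRANS = [[0.7, 0.3],               # TRANS[i][j] = P(state j follows state i)
         [0.4, 0.6]]
EMIT = [[0.5, 0.4, 0.1],           # EMIT[i][k]  = P(symbol k | state i)
        [0.1, 0.3, 0.6]]


def forward_probability(obs, start, trans, emit):
    """P(obs | model), summed over every possible hidden state path."""
    n = len(start)
    # alpha[i] = P(observations so far, current state is i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                 for j in range(n)]
    return sum(alpha)


print(forward_probability([0, 1, 2], START, TRANS, EMIT))
```

The recursion costs on the order of N squared times T operations for N states and T observations, instead of enumerating all N to the power T paths. For long utterances the probabilities become very small, so practical implementations rescale `alpha` at each step or work with logarithms.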


Problem 2: Hidden State Determination (Decoding)

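Given the model and an observation sequence, choose the hidden state sequence that best explains the observations. There is no single correct solution, because the hidden states can never be observed directly, so the best that can be done is to pick an optimality criterion and find the sequence that satisfies it. The most common criterion is the single most probable state sequence, and the Viterbi algorithm finds it.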
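A minimal sketch of Viterbi decoding for a discrete HMM. The toy model is invented for illustration. States 0 and 1 stand in for two hypothetical phonemes, and observations are indices of quantized acoustic symbols.

```python
START = [0.6, 0.4]                 # START[i]    = P(first state is i)
TRANS = [[0.7, 0.3],               # TRANS[i][j] = P(state j follows state i)
         [0.4, 0.6]]
EMIT = [[0.5, 0.4, 0.1],           # EMIT[i][k]  = P(symbol k | state i)
        [0.1, 0.3, 0.6]]


def viterbi(obs, start, trans, emit):
    """Return (most probable state path, probability of that path)."""
    n = len(start)
    # prob[i] = probability of the best path so far that ends in state i
    prob = [start[i] * emit[i][obs[0]] for i in range(n)]
    backpointers = []
    for symbol in obs[1:]:
        step_prob, step_back = [], []
        for j in range(n):
            best_prev = max(range(n), key=lambda i: prob[i] * trans[i][j])
            step_prob.append(prob[best_prev] * trans[best_prev][j] * emit[j][symbol])
            step_back.append(best_prev)
        prob = step_prob
        backpointers.append(step_back)
    # Trace the best path backwards from the most probable final state.
    last = max(range(n), key=lambda i: prob[i])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, prob[last]


path, probability = viterbi([0, 1, 2], START, TRANS, EMIT)
print(path, probability)
```

The structure mirrors the Forward algorithm, with the sum over previous states replaced by a maximum and a backpointer stored at each step. Practical decoders usually work with log probabilities to avoid underflow on long sequences.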


Problem 3: Learning

Adjust the model parameters to maximize the probability of a given observation sequence. The observation sequence is used as training data for the HMM, and training adjusts the model parameters to create the best model for that sequence. This problem is solved with the Baum-Welch algorithm, an Expectation-Maximization procedure built on the Forward-Backward algorithm.
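One re-estimation step of Baum-Welch for a discrete HMM can be sketched as below. It is written for a single short observation sequence and uses no scaling, so it is only suitable for toy inputs. The starting parameters and the training sequence are invented for illustration.

```python
def forward(obs, start, trans, emit):
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    n = len(trans)
    beta = [[1.0] * n]
    for symbol in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][symbol] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, start, trans, emit):
    """Return re-estimated (start, trans, emit) and the likelihood of obs
    under the parameters that were passed in."""
    n, m, T = len(start), len(emit[0]), len(obs)
    alpha, beta = forward(obs, start, trans, emit), backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # gamma[t][i] = P(state i at time t | obs)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j] = P(state i at time t and state j at time t + 1 | obs)
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_start = gamma[0][:]
    new_trans = [[sum(xi[t][i][j] for t in range(T - 1)) /
                  sum(gamma[t][i] for t in range(T - 1))
                  for j in range(n)] for i in range(n)]
    new_emit = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                 sum(gamma[t][i] for t in range(T))
                 for k in range(m)] for i in range(n)]
    return new_start, new_trans, new_emit, likelihood


# Toy starting point and training sequence (illustration only).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
training = [0, 1, 2, 2, 1, 0, 0, 1, 2, 2]

for iteration in range(5):
    start, trans, emit, likelihood = baum_welch_step(training, start, trans, emit)
    print(iteration, likelihood)
```

Each step should leave the likelihood of the training sequence unchanged or higher, which is the property that lets the loop be repeated until the improvement becomes negligible. The result is a local optimum that depends on the starting parameters.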



The Hidden Markov Model is an important statistical tool for modeling data with sequential correlations between neighboring samples, such as time series data. It is one of the most successfully applied models in natural language processing (NLP).