Speech Recognition using KALDI
For people who are new to speech recognition models and searching for a place to start, KALDI is a great open-source toolkit to learn. It deals with speech data and is mainly used for speech recognition, speaker diarisation and speaker recognition.
KALDI is mainly written in C/C++ and is wrapped with Bash and Python scripts. These wrapper scripts spare you from having to dig into the deep C++ source code.
Before this, we should know the available open-source speech recognition toolkits and their accuracy. They are:
3. DeepSpeech
These toolkits are ordered according to their efficiency and accuracy.
In this article, we are going to understand the training process of the KALDI toolkit and some of the theoretical concepts behind that process. We won't go through the actual KALDI scripts; for those, you can follow this link http://kaldi-asr.org/doc/kaldi_for_dummies.html.
Preprocessing and Feature Extraction:
Nowadays, most models that deal with audio data work with something like a pixel-based representation of it rather than the raw waveform. To extract that representation, we have to consider two things:
1. Identifying the sounds of human speech
2. Discarding any unnecessary noise
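Before any features can be computed, the audio has to be loaded as raw samples. As a concrete starting point, here is a minimal sketch (Python standard library only, not part of Kaldi itself) of reading a mono 16-bit PCM WAV file:

```python
import wave
import struct

def read_wav(path):
    """Read a mono 16-bit PCM WAV file into (sample_rate, list of int samples)."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, "expected mono audio"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = list(struct.unpack("<%dh" % (len(frames) // 2), frames))
    return rate, samples
```

Every feature extraction step discussed below starts from a sample stream like this one.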
To build those features, MFCCs (Mel-Frequency Cepstral Coefficients) are widely used in industry today.
The MFCC is based on the different frequencies that can be captured by the human ear.
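The usual MFCC pipeline runs: framing and windowing, FFT, mel filterbank, log, then DCT. The following is a simplified NumPy illustration of those stages, not Kaldi's actual implementation (Kaldi's `compute-mfcc-feats` tool has many more options):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, rate=16000, n_fft=512, n_mels=23, n_ceps=13,
         frame_len=0.025, frame_shift=0.010):
    """Bare-bones MFCCs: frame -> window -> FFT -> mel filterbank -> log -> DCT."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis boosts the higher frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    flen, fshift = int(rate * frame_len), int(rate * frame_shift)
    n_frames = 1 + max(0, (len(signal) - flen) // fshift)
    frames = np.stack([signal[i * fshift:i * fshift + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T
```

The output is one 13-dimensional feature vector per 10 ms frame, which is the typical shape of the features Kaldi works with.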
In KALDI we use two more feature types:
CMVN (Cepstral Mean and Variance Normalization) is used to normalize the MFCCs.
I-vectors are mainly used to better capture the variances inside the domain. They are very similar to joint factor analysis, for example when creating a speaker-dependent representation, and they are well suited to modelling channel and speaker variances.
MFCC and CMVN are mainly used to represent the content of each audio sentence, while i-vectors represent the style of each audio sentence or speaker.
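Per-utterance CMVN itself is simple enough to sketch: subtract the mean and divide by the standard deviation of each coefficient across the utterance. This is a simplified stand-in for what Kaldi's `compute-cmvn-stats` and `apply-cmvn` tools do:

```python
import numpy as np

def cmvn(feats):
    """Per-utterance cepstral mean and variance normalization.

    feats: (num_frames, num_coeffs) matrix of MFCCs.  Returns features with
    zero mean and unit variance in each coefficient dimension.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    # Guard against all-constant coefficients to avoid division by zero.
    return (feats - mean) / np.maximum(std, 1e-10)
```

Normalizing this way removes per-recording offsets (e.g. channel effects), so the acoustic model sees features on a consistent scale.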
The math behind KALDI is based on linear algebra libraries such as BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).
BLAS is a set of subroutine declarations covering simple, low-level matrix-vector operations, while LAPACK is a set of routines for higher-level matrix operations such as matrix inversion and SVD.
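As a quick illustration, NumPy exposes the same BLAS/LAPACK machinery from Python, so we can mimic the kinds of operations Kaldi's matrix library performs (NumPy here is only a stand-in for Kaldi's own C++ matrix wrappers):

```python
import numpy as np

a = np.array([[4.0, 1.0],
              [2.0, 3.0]])

a_inv = np.linalg.inv(a)       # LAPACK-backed: matrix inversion
u, s, vt = np.linalg.svd(a)    # LAPACK-backed: singular value decomposition
prod = a @ a_inv               # BLAS-backed: matrix-matrix multiply
```

`prod` comes out as the identity matrix, and `u @ diag(s) @ vt` reconstructs `a`, which is exactly the kind of sanity check these libraries make cheap.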
The KALDI model is mainly divided into two components: the acoustic model and the decoding graph.
The acoustic model used GMMs (Gaussian Mixture Models) in the background, but these have now been replaced by deep neural networks. These neural networks transcribe the audio features into sequences of context-dependent phonemes.
In the decoding graph, the phonemes are converted into lattices. A lattice is a representation of the alternative sentences that are likely for a given piece of audio. Lattices come in two types, Lattice and CompactLattice, which represent the same information in different ways. This is the output you get from the speech recognition system.
It is worth noticing that this is a simplified view of how the model works. In the training process you need to keep your transcribed audio data in a specific order. After that, you need to map each word to its phonemes; this mapping is called the "dictionary". Using this dictionary we can get the output targets for the acoustic model. For instance,
eight -> ey t
nine -> n ay n
four -> f ao r
five -> f ay v
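Parsing entries like the ones above into a lookup table is straightforward. Note that Kaldi's actual lexicon.txt file is just whitespace-separated (word followed by its phones, no arrow); the arrow notation here simply mirrors the examples in the text:

```python
def load_lexicon(lines):
    """Parse 'word -> phone phone ...' entries into a word -> phone-list dict."""
    lexicon = {}
    for line in lines:
        word, phones = line.split("->")
        lexicon[word.strip()] = phones.split()
    return lexicon

entries = [
    "eight -> ey t",
    "nine -> n ay n",
    "four -> f ao r",
    "five -> f ay v",
]
lexicon = load_lexicon(entries)
```

With this dictionary in hand, any transcript can be expanded into the phone sequence the acoustic model is trained to produce.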
Using both of these, we can start the training process. KALDI ships with different training recipes; the most used one is the WSJ (Wall Street Journal) recipe.
For every recipe we first have to align the phonemes to the audio with a Gaussian mixture model. This basic step, i.e. alignment, tells us what sequence we want our deep neural network to spit out later.
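The alignment step can be sketched as a tiny Viterbi forced aligner: given per-frame log-likelihoods for each phone (which the GMM would supply) and the known phone sequence of the transcript, find the best monotonic mapping of frames to phones. This is a toy version of the idea, not Kaldi's HMM-based implementation:

```python
def forced_align(loglikes, phone_seq):
    """Toy forced alignment via Viterbi.

    loglikes: list of dicts, one per frame, mapping phone -> log-likelihood.
    phone_seq: the transcript's phones in order; each must cover >= 1 frame.
    Returns one phone label per frame.
    """
    T, N = len(loglikes), len(phone_seq)
    NEG = float("-inf")
    # score[t][j]: best log-prob of aligning frames 0..t, ending in phone j.
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = loglikes[0][phone_seq[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]                      # remain in phone j
            move = score[t - 1][j - 1] if j > 0 else NEG  # advance from j-1
            back[t][j] = j if stay >= move else j - 1
            score[t][j] = max(stay, move) + loglikes[t][phone_seq[j]]
    # Trace back from the final phone to recover the frame labels.
    j, labels = N - 1, []
    for t in range(T - 1, -1, -1):
        labels.append(phone_seq[j])
        j = back[t][j]
    return labels[::-1]
```

The resulting frame-level phone labels are exactly the kind of supervision the DNN acoustic model is trained against in the next step.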
After aligning the audio data, we create the deep neural network that forms the acoustic model and train it to match the alignment output. Once the acoustic model is built, we can train a WFST (Weighted Finite-State Transducer) to transform the DNN output into the required lattices.
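Structurally, the DNN acoustic model is just a network mapping each feature frame to a posterior distribution over phone classes. The sketch below uses random, untrained weights purely to show the shape of the computation; in Kaldi the weights are trained to match the GMM-derived alignments, and the network is much deeper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: one hidden layer mapping a 13-dim MFCC frame
# to a posterior over 4 (hypothetical) phone classes.
W1, b1 = rng.standard_normal((13, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 4)), np.zeros(4)

def phone_posteriors(frame):
    """Forward pass: MFCC frame -> softmax distribution over phone classes."""
    h = np.maximum(frame @ W1 + b1, 0.0)   # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()
```

Running this per frame yields the posterior sequence that the WFST decoder then combines with the lexicon and language model to produce lattices.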