Feature Extraction
The first step of our ASR system is to preprocess the data and feed the processed data into a neural network. We use two kinds of acoustic feature representations: spectrograms and MFCCs.
A spectrogram is a visual representation of the spectrum of frequencies of a sound signal as it varies over time. We transform the raw audio into a two-dimensional representation where the x-axis denotes time and the y-axis denotes frequency. A sample spectrogram created from an audio file is shown in figure 1(a).
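As a rough sketch of this step (assuming the audio is loaded with librosa; the file name and the STFT parameters below are placeholders, not the exact values used in our pipeline):

```python
import numpy as np
import librosa

# Load the raw audio (assumed 16 kHz mono; adjust to the dataset's sample rate).
audio, sr = librosa.load("sample.wav", sr=16000)

# Short-time Fourier transform -> magnitude spectrogram.
stft = librosa.stft(audio, n_fft=512, hop_length=160, win_length=400)
spectrogram = np.abs(stft)  # shape: (n_fft // 2 + 1, num_frames)

# Log-scale the magnitudes so the network sees a more compressed dynamic range.
log_spectrogram = np.log(spectrogram + 1e-10)
```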
MFCC stands for Mel-Frequency Cepstral Coefficients. MFCCs are calculated from a cepstral representation of the audio clip. Because MFCCs are lower-dimensional than a spectrogram, using them as the feature representation can help avoid overfitting.
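A minimal sketch of MFCC extraction, again assuming librosa (the number of coefficients and frame parameters are illustrative choices, not necessarily the ones used in our experiments):

```python
import librosa

audio, sr = librosa.load("sample.wav", sr=16000)

# 13 coefficients per frame is a common choice; the exact number is a tunable hyperparameter.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)
# mfcc.shape == (13, num_frames): far fewer rows per frame than the spectrogram above.
```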
We also use Connectionist Temporal Classification (CTC) to calculate the training loss and validation loss. CTC uses a softmax layer to define a separate output distribution Pr(k|t) at every step t along the input sequence, where k ranges over the possible output symbols. In the original formulation these symbols are phonemes; in our system they are the characters in our vocabulary, which contains the 26 English letters, punctuation, and a blank symbol. Taken together, these per-step decisions define a distribution over alignments between the input and target sequences; CTC then sums over all possible alignments to obtain the normalized probability of the target sequence (Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649. IEEE. https://arxiv.org/abs/1303.5778).
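To make the loss computation concrete, here is a small sketch using PyTorch's built-in CTC loss; the tensor sizes, the vocabulary size, and the convention that the blank symbol sits at index 0 are assumptions for illustration, not a description of our exact model:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: T input frames, a batch of N utterances, C output symbols
# (letters + punctuation + blank). The values below are placeholders.
T, N, C = 200, 4, 30

# Per-frame log-probabilities Pr(k|t), e.g. the output of a log-softmax layer.
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Target character indices (1..C-1, since index 0 is reserved for blank) and lengths.
targets = torch.randint(1, C, (N, 50), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 50, dtype=torch.long)

# CTC sums over all alignments between the input frames and the target sequence.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```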