Features
For the experiments in this project, the following features were considered, but the set of features is not limited to them alone. Any new audio feature can be added as a method in the class audio_features.
- Spectrogram:
This feature is based on the STFT (short-time Fourier transform). For audio at an 8 kHz sampling rate, a window length of 30 ms was used in the original experiments; this can be adjusted via the WindowLength property of the class audio_features. A 30 ms window implies 240 samples per window frame. An overlap of 60 samples between adjacent windows was also used to reduce the effect of windowing.
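A minimal sketch of this computation with NumPy/SciPy (illustrative only; the project's audio_features class is the actual implementation, and the test signal here is a stand-in):

```python
import numpy as np
from scipy.signal import stft

fs = 8000                     # 8 kHz sampling rate
win_len = 240                 # 30 ms window -> 0.030 * 8000 = 240 samples
overlap = 60                  # 60-sample overlap between adjacent windows

# Placeholder test signal: a 440 Hz tone in light white noise
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)

f, frames, Zxx = stft(x, fs=fs, window='hamming',
                      nperseg=win_len, noverlap=overlap)
spectrogram = np.abs(Zxx) ** 2   # power spectrogram, shape (freq_bins, n_frames)
```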
- GFCC:
GFCC (Gammatone Frequency Cepstral Coefficients) are gammatone-domain cepstral audio features. Being cepstral, they are a spectral representation of the spectrum of a time-domain audio signal that has been filtered with a gammatone filterbank. The features are calculated at the subband level for frames of length 30 ms. For audio sampled at 8 kHz, this implies 240 samples per frame; to reduce the effect of windowing, an overlap of 60 samples between adjacent frames is used. In all, 13 gammatone frequency cepstral coefficients plus GFCC delta and GFCC delta-delta features were considered.
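A rough sketch of this pipeline, assuming a frequency-domain approximation of the gammatone filterbank (Slaney-style ERB spacing) and simple gradient-based deltas; the project's own implementation may differ in detail:

```python
import numpy as np
from scipy.signal import stft
from scipy.fftpack import dct

fs, win_len, overlap = 8000, 240, 60   # 8 kHz, 30 ms frames, 60-sample overlap

def erb_space(low, high, n):
    """n ERB-spaced gammatone center frequencies (Slaney's formulation)."""
    ear_q, min_bw = 9.26449, 24.7
    return -(ear_q * min_bw) + np.exp(
        np.arange(1, n + 1)
        * (np.log(low + ear_q * min_bw) - np.log(high + ear_q * min_bw)) / n
    ) * (high + ear_q * min_bw)

def gammatone_weights(n_filters, n_fft, fs):
    """Approximate 4th-order gammatone magnitude responses on the FFT bin grid."""
    freqs = np.linspace(0, fs / 2, n_fft // 2 + 1)
    cfs = erb_space(50, fs / 2, n_filters)
    b = 1.019 * 24.7 * (4.37 * cfs / 1000 + 1)   # bandwidth from ERB(cf)
    return (1 + ((freqs[None, :] - cfs[:, None]) / b[:, None]) ** 2) ** -2

x = np.random.randn(fs)                # placeholder: 1 s of audio
_, _, Zxx = stft(x, fs=fs, nperseg=win_len, noverlap=overlap)
power = np.abs(Zxx) ** 2               # (freq_bins, n_frames) power spectrogram

fbank = gammatone_weights(64, win_len, fs) @ power            # gammatone energies
gfcc = dct(np.log(fbank + 1e-10), axis=0, norm='ortho')[:13]  # 13 GFCCs
delta = np.gradient(gfcc, axis=1)      # simple delta approximation
delta2 = np.gradient(delta, axis=1)    # delta-delta
features = np.vstack([gfcc, delta, delta2])                   # (39, n_frames)
```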
- MFCC:
MFCC (Mel-Frequency Cepstral Coefficients) are Mel-domain cepstral audio features: a spectral representation of the spectrum of a time-domain audio signal that has been filtered with a Mel-frequency filterbank. The MFCCs are calculated by splitting the signal into overlapping segments at the subband level, using frames of length 30 ms. For audio sampled at 8 kHz, this implies 240 samples per frame; to reduce the effect of windowing, an overlap of 60 samples between adjacent frames is used. With this methodology, the Mel-frequency cepstral coefficients, log energy values, cepstral delta, and cepstral delta-delta values are calculated for each segment. In all, 13 Mel-frequency cepstral coefficients were considered.
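For illustration, the same pipeline can be sketched with librosa (an assumption for this example; the project computes MFCCs through its audio_features class, and the 40-filter Mel bank is an assumed choice):

```python
import numpy as np
import librosa

fs, win_len, overlap = 8000, 240, 60
hop = win_len - overlap                        # 180-sample hop = 60-sample overlap

y = np.random.randn(fs)                        # placeholder: 1 s of audio at 8 kHz
mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13,
                            n_fft=win_len, win_length=win_len,
                            hop_length=hop, n_mels=40)

# Per-frame log energy over the same framing
frames = librosa.util.frame(y, frame_length=win_len, hop_length=hop)
log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)

delta = librosa.feature.delta(mfcc)            # cepstral delta
delta2 = librosa.feature.delta(mfcc, order=2)  # cepstral delta-delta
features = np.vstack([mfcc, delta, delta2])    # (39, n_frames)
```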
- Pitch:
Pitch estimates the fundamental frequency of an audio signal. Pitch values are likewise estimated at the subband level for frames of length 30 ms. For audio sampled at 8 kHz, this implies 240 samples per frame; to reduce the effect of windowing, an overlap of 60 samples between adjacent frames is used.
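A hedged sketch of frame-wise pitch estimation using the autocorrelation method; the estimator and the 50-400 Hz search range here are assumptions, not necessarily what the project uses:

```python
import numpy as np

fs, win_len, overlap = 8000, 240, 60
hop = win_len - overlap
f_min, f_max = 50, 400                   # assumed speech pitch search range

def frame_pitch(frame, fs):
    """Fundamental frequency of one frame from the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[frame.size - 1:]
    lo, hi = int(fs / f_max), min(int(fs / f_min), ac.size - 1)
    lag = lo + np.argmax(ac[lo:hi])      # strongest periodicity in range
    return fs / lag

x = np.sin(2 * np.pi * 120 * np.arange(fs) / fs)    # 120 Hz test tone
pitches = [frame_pitch(x[s:s + win_len], fs)
           for s in range(0, x.size - win_len + 1, hop)]
print(np.median(pitches))                # close to 120 Hz
```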
- Cochleagram:
A cochleagram is a representation of a time-domain audio waveform in the t-f domain, but it differs from the spectrogram. It is computed using an array of bandpass filters, each of which models the frequency selectivity and nerve response of a single hair cell. Accordingly, a 64-channel gammatone filterbank is used as the array of bandpass filters to estimate the cochleagram. The evaluation uses the ERB (Equivalent Rectangular Bandwidth), a psychoacoustic measure that approximates the frequency-dependent bandwidth of the filters in human hearing: the bandwidth of the rectangular bandpass filter is chosen so that it has the same peak response and passes the same amount of power for a white-noise input.
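A self-contained sketch of this estimate, assuming a frequency-domain approximation of the 64-channel gammatone filterbank and cube-root compression (a common, but here assumed, post-processing choice):

```python
import numpy as np
from scipy.signal import stft

fs, win_len, overlap, n_ch = 8000, 240, 60, 64
x = np.random.randn(fs)                               # placeholder 1 s signal

# ERB-spaced center frequencies (Slaney's formulation), 50 Hz .. fs/2
ear_q, min_bw = 9.26449, 24.7
cfs = -(ear_q * min_bw) + np.exp(
    np.arange(1, n_ch + 1)
    * (np.log(50 + ear_q * min_bw) - np.log(fs / 2 + ear_q * min_bw)) / n_ch
) * (fs / 2 + ear_q * min_bw)

_, _, Zxx = stft(x, fs=fs, nperseg=win_len, noverlap=overlap)
power = np.abs(Zxx) ** 2                              # (freq_bins, n_frames)
freqs = np.linspace(0, fs / 2, power.shape[0])

# Approximate 4th-order gammatone magnitude responses on the FFT grid;
# ERB(cf) = 24.7 * (4.37 * cf / 1000 + 1) gives the frequency-dependent bandwidth.
b = 1.019 * 24.7 * (4.37 * cfs / 1000 + 1)
weights = (1 + ((freqs[None, :] - cfs[:, None]) / b[:, None]) ** 2) ** -2

cochleagram = weights @ power                         # (64, n_frames)
cochleagram **= 1 / 3                                 # cube-root loudness compression
```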
For the training phase, either an individual feature or a combination of these features can be used. The idea is to use the features together with a training target as training data: the features form the input, and the training target forms the corresponding output. The deep learning model then learns a function that predicts the t-f training target during the prediction/testing phase.
The IRM (Ideal Ratio Mask) is calculated using the spectrogram/cochleagram and is defined as:

IRM(t, f) = ( S²(t, f) / ( S²(t, f) + N²(t, f) ) )^β

where S²(t, f) and N²(t, f) denote the speech energy and the noise energy within a T-F unit, respectively. The tunable parameter β scales the mask and is commonly chosen as 0.5. With the square root (β = 0.5), the IRM preserves the speech energy within each T-F unit, under the assumption that S(t, f) and N(t, f) are uncorrelated. Without the square root, the IRM is similar to the classical Wiener filter, which is the optimal estimator of target speech in the power spectrum.
IRM based on a spectrogram:
IRM based on a cochleagram:
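A minimal sketch of the IRM computation from the definition above, using a spectrogram representation and placeholder speech and noise signals (a cochleagram-based IRM would substitute cochleagram energies for the STFT energies):

```python
import numpy as np
from scipy.signal import stft

fs, win_len, overlap = 8000, 240, 60
speech = np.random.randn(fs)                  # placeholder clean-speech signal
noise = 0.5 * np.random.randn(fs)             # placeholder noise signal

_, _, S = stft(speech, fs=fs, nperseg=win_len, noverlap=overlap)
_, _, N = stft(noise, fs=fs, nperseg=win_len, noverlap=overlap)
S2, N2 = np.abs(S) ** 2, np.abs(N) ** 2       # speech / noise energy per T-F unit

beta = 0.5                                    # square-root mask
irm = (S2 / (S2 + N2 + 1e-10)) ** beta        # values in [0, 1]
```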