This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and preprocessing, and in Part II we analyzed the performance of several neural network classifiers.
The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could probably be improved by adding more data. One very common way to get additional data is to synthesize it by performing various transformations on the available dataset.
In this article, we will consider five popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.
The tutorial notebook can be found here.
For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. This is the sentence "The burning fire had been extinguished."
import librosa as lr
import IPython
import numpy as np

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast')  # load signal
IPython.display.Audio(signal, rate=sr)
Adding noise is the simplest audio augmentation. The amount of noise is characterized by the signal-to-noise ratio (SNR), here defined as the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, defined by their SNR, and see how they change the signal.
SNRs = (5, 10, 100, 1000)  # signal-to-noise ratio: max amplitude over noise std
noisy_signal = {}
for snr in SNRs:
    noise_std = max(abs(signal))/snr  # get the noise std
    noise = noise_std*np.random.randn(len(signal))  # generate noise with the given std
    noisy_signal[snr] = signal + noise

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))
So, SNR=1000 sounds almost like the unperturbed audio, whereas at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.
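In a training pipeline one would typically draw the SNR at random for each example rather than fix it. A minimal sketch, where the helper name and the SNR range are illustrative assumptions:

def add_random_noise(x, snr_low=5.0, snr_high=100.0):
    # hypothetical helper: draw an SNR uniformly at random and add Gaussian
    # noise whose std matches that SNR, as in the loop above
    snr = np.random.uniform(snr_low, snr_high)
    noise_std = max(abs(x))/snr
    return x + noise_std*np.random.randn(len(x))

augmented = add_random_noise(signal)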
The simplest way to change the speed is just to pretend that the signal has a different sample rate. However, this will also change the pitch (how low or high in frequency the audio sounds). Increasing the sampling rate makes the voice sound higher. To illustrate this, we will "increase" the sampling rate of our example by a factor of 1.5:
IPython.display.Audio(signal, rate=sr*1.5)
Changing the speed without affecting the pitch is more challenging. One needs to use the phase vocoder (PV) algorithm. In brief, the input signal is first split into overlapping frames. Then, the spectrum within each frame is computed by applying the Fast Fourier Transform (FFT). The playing speed is modified by resynthesizing the frames at a different rate. Since the frequency content of each frame is not affected, the pitch remains the same. The PV interpolates between frames and uses the phase information to achieve smoothness.
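To make these steps concrete, here is a minimal sketch of PV time stretching. It is not the implementation used below: it assumes a Hann window at 75% overlap, linear magnitude interpolation, and skips the amplitude normalization of the overlap-add.

def pv_time_stretch(x, factor, n_fft=2048, hop=512):
    # split the signal into overlapping windowed frames and take their FFTs
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.array([np.fft.rfft(window * x[i*hop : i*hop + n_fft])
                       for i in range(n_frames)])
    # expected phase advance per hop for each frequency bin
    omega = 2*np.pi*hop*np.arange(n_fft//2 + 1)/n_fft
    phase = np.angle(frames[0])
    out = []
    for t in np.arange(0, n_frames - 1, factor):  # resample the frame axis
        i, frac = int(t), t - int(t)
        # interpolate magnitudes between the two neighboring analysis frames
        mag = (1 - frac)*np.abs(frames[i]) + frac*np.abs(frames[i + 1])
        out.append(mag*np.exp(1j*phase))
        # propagate phase: expected advance plus the wrapped measured deviation
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - omega
        dphi -= 2*np.pi*np.round(dphi/(2*np.pi))
        phase = phase + omega + dphi
    # inverse FFT and overlap-add at the original hop to resynthesize audio
    y = np.zeros(n_fft + hop*(len(out) - 1))
    for k, spec in enumerate(out):
        y[k*hop : k*hop + n_fft] += window*np.fft.irfft(spec, n_fft)
    return y

Playing pv_time_stretch(signal, 1.3) at the original rate should sound about 1.3 times faster at the same pitch.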
For our experiments, we will use the stretch_wo_loop time-stretching function from this PV implementation.
stretching_factor = 1.3
signal_stretched = stretch_wo_loop(signal, stretching_factor)

IPython.display.Audio(signal_stretched, rate=sr)
So, the duration of the signal decreased since we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is substantial, the phase interpolation between frames may not work well. As a result, echo artifacts may appear in the transformed audio.
To alter the pitch without affecting the speed, we can use the same PV time stretch but pretend that the signal has a different sampling rate, chosen such that the total duration of the signal stays the same:
IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)
Why bother with the PV at all when librosa already provides time_stretch and pitch_shift functions? Well, these functions transform the signal back to the time domain. If you then need to compute embeddings, you will lose time on redundant Fourier transforms. On the other hand, it is easy to modify the stretch_wo_loop function so that it yields the Fourier output without taking the inverse transform. One could probably also dig into the librosa code to achieve similar results.
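In the sketch above, for instance, this would amount to returning the complex frames collected in out right before the synthesis loop, instead of running the inverse FFT and overlap-add.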
These two transformations were originally proposed in the frequency domain (Park et al. 2019). The idea was to save time on FFTs by applying augmentations to precomputed spectra. For simplicity, we will demonstrate how these transformations work in the time domain; the operations can easily be transferred to the frequency domain by replacing the time axis with frame indices, as shown in the sketch after the time-masking example.
Time masking
The idea of time masking is to cover up a random region of the signal. The neural network then has fewer opportunities to learn signal-specific temporal variations that do not generalize.
max_mask_length = 0.3  # maximum mask duration, as a proportion of the signal length

L = len(signal)
mask_length = int(L*np.random.rand()*max_mask_length)  # randomly choose the mask length
mask_start = int((L-mask_length)*np.random.rand())  # randomly choose the mask position
masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0

IPython.display.Audio(masked_signal, rate=sr)
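The same masking transfers directly to the frequency domain: instead of zeroing samples, zero out whole spectrogram frames. A minimal sketch, assuming a magnitude spectrogram computed with librosa's stft and reusing max_mask_length from above:

spec = np.abs(lr.stft(signal))  # magnitude spectrogram, shape (freq_bins, n_frames)
n_frames = spec.shape[1]
mask_frames = int(n_frames*np.random.rand()*max_mask_length)  # mask length in frames
mask_start = int((n_frames - mask_frames)*np.random.rand())  # mask position
spec_masked = spec.copy()
spec_masked[:, mask_start:mask_start+mask_frames] = 0  # zero out entire frames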
Cut & splice
The idea is to replace a randomly chosen region of the signal with a random fragment taken from another signal that has the same label. The implementation is almost the same as for time masking, except that a piece of another signal is placed where the mask would be.
other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast')  # load the second signal

max_fragment_length = 0.3  # maximum fragment duration, as a proportion of the signal length
L = min(len(signal), len(other_signal))
mask_length = int(L*np.random.rand()*max_fragment_length)  # randomly choose the fragment length
mask_start = int((L-mask_length)*np.random.rand())  # randomly choose the fragment position
synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]

IPython.display.Audio(synth_signal, rate=sr)