At this step, it is crucial to analyse some random audio samples from the collected data. This procedure will help us better understand our data by visualizing the audio files with various techniques. Furthermore, this analysis aims to extract audio features that will be used later when building our model.
Audio feature extraction
For audio feature extraction we will use some methods and functions from the torchaudio Python library. We will explore the relationship between basic audio representations, such as the waveform, and the more advanced features that torchaudio's API offers. For presentation purposes we will show two different examples from the collected data, not yet labeled, each approximately 4 seconds long.
Audio
Jan23_21-00-33-hip-hop-228614956.wav
Stats
- File size: 1008080 bytes
- AudioMetaData(sample_rate=44100, num_frames=168000, num_channels=2, bits_per_sample=24, encoding=PCM_S)
- Sample Rate: 44100
- Shape: (2, 168000)
- Dtype: torch.float32
- Max: 0.330
- Min: -0.258
- Mean: -0.000
- Std Dev: 0.049
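The figures above are mutually consistent, which can be checked with some quick arithmetic. The sketch below uses only the values reported in the stats. The function name `pcm_summary` is ours, and reading the leftover bytes as container header and metadata is an assumption based on how WAV files are usually laid out.

```python
def pcm_summary(num_frames, num_channels, bits_per_sample, sample_rate):
    """Return (duration in seconds, size of the raw PCM payload in bytes)."""
    duration_s = num_frames / sample_rate
    payload_bytes = num_frames * num_channels * (bits_per_sample // 8)
    return duration_s, payload_bytes


# Values copied from the AudioMetaData line above.
duration_s, payload_bytes = pcm_summary(
    num_frames=168000, num_channels=2, bits_per_sample=24, sample_rate=44100
)
file_size = 1008080  # bytes, as reported in the stats

print(f"duration: {duration_s:.2f} s")                   # duration: 3.81 s
print(f"PCM payload: {payload_bytes} bytes")             # PCM payload: 1008000 bytes
print(f"non-sample bytes: {file_size - payload_bytes}")  # non-sample bytes: 80
```

The duration of roughly 3.8 seconds matches the "approximately 4 seconds" clip length, and the 24-bit stereo payload accounts for all but 80 bytes of the file.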
Waveform
import matplotlib.pyplot as plt
import torch
import torchaudio


# Shown here as a plain function; it is called below through the project's
# Visualization class.
def waveform(waveform, sample_rate, title=None, save=False):
    waveform = waveform.numpy()
    num_channels, num_frames = waveform.shape
    # Time axis in seconds: frame index divided by the sample rate.
    time_axis = torch.arange(0, num_frames) / sample_rate
    figure, axes = plt.subplots(num_channels, 1)
    # plt.subplots returns a single Axes object for mono audio.
    if num_channels == 1:
        axes = [axes]
    for c in range(num_channels):
        axes[c].plot(time_axis, waveform[c], linewidth=1)
        axes[c].grid(True)
        if num_channels > 1:
            axes[c].set_ylabel(f"Channel {c + 1}")
    figure.suptitle("waveform")
    plt.show(block=False)
    if save:
        # Util.create_plot_path is a project helper that builds the output path.
        figure.savefig(Util.create_plot_path(title=title))


waveform, sample_rate = torchaudio.load(AUDIO_SAMPLE_FILE)
Visualization.waveform(waveform, sample_rate=sample_rate, title=AUDIO_FILE_TITLE, save=True)
Spectrogram
Spectrogram of the 2 channels at a 44100 Hz sample rate
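To make clear what a spectrogram contains, here is a minimal pure-Python sketch: the signal is cut into overlapping frames, each frame is multiplied by a Hann window, and the squared magnitude of its discrete Fourier transform is kept. This is an illustration only and not how the plot above was produced. The function name, the test tone and the small `n_fft` are hypothetical, and unlike torchaudio's default behaviour the signal is not padded at its edges.

```python
import cmath
import math


def power_spectrogram(signal, n_fft, hop_length):
    """Naive power spectrogram: Hann-windowed frames -> |DFT|^2.

    Returns a list of frames, each holding n_fft // 2 + 1 power values.
    The signal is NOT padded, so frames start at 0, hop_length, 2 * hop_length, ...
    """
    # Periodic Hann window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        powers = []
        # Only the non-negative frequencies are kept (one-sided spectrum).
        for k in range(n_fft // 2 + 1):
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            powers.append(abs(coeff) ** 2)
        frames.append(powers)
    return frames


# Hypothetical test tone: a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = power_spectrogram(tone, n_fft=64, hop_length=32)

peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]))            # 7 33  (time frames, frequency bins)
print(peak_bin * sample_rate / 64, "Hz")  # 1000.0 Hz
```

Each row of the result is one time frame and each column one frequency bin, which is the grid that the spectrogram image colours by intensity. In practice an FFT-based implementation such as the one in torchaudio is used, since the direct DFT above is far too slow for real audio.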
Torchaudio feature extraction
Mel Spectrogram
MelSpectrogram for a raw audio signal
Mel spectrogram parameters
n_fft = 1024
win_length = None
hop_length = 512
n_mels = 128
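These parameters determine the time and frequency resolution of the result. The sketch below works out the expected output size for the example file. It assumes the framing convention that torchaudio uses by default (the signal is padded by `n_fft // 2` on both sides, and `win_length = None` means the window spans the full `n_fft`). The helper name `stft_output_shape` is ours.

```python
def stft_output_shape(num_frames, n_fft, hop_length, center=True):
    """Return (frequency bins, time frames) of a one-sided STFT."""
    n_freq_bins = n_fft // 2 + 1
    if center:
        # Assumed convention: the signal is padded by n_fft // 2 on both sides.
        n_time_frames = 1 + num_frames // hop_length
    else:
        n_time_frames = 1 + (num_frames - n_fft) // hop_length
    return n_freq_bins, n_time_frames


sample_rate, num_frames = 44100, 168000  # from the stats of the example file
n_fft, hop_length, n_mels = 1024, 512, 128

n_freq_bins, n_time_frames = stft_output_shape(num_frames, n_fft, hop_length)
print(n_freq_bins, n_time_frames)                        # 513 329
print("mel shape per channel:", (n_mels, n_time_frames))
print(f"bin width: {sample_rate / n_fft:.1f} Hz")        # bin width: 43.1 Hz
print(f"hop: {1000 * hop_length / sample_rate:.1f} ms")  # hop: 11.6 ms
```

Under these assumptions the 513 linear frequency bins are pooled into 128 mel bands, so each channel yields a 128 x 329 mel spectrogram with one column roughly every 11.6 ms.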
MFCC
Mel-frequency cepstrum
MFCC parameters
n_fft = 2048
win_length = None
hop_length = 512
n_mels = 256
n_mfcc = 256
MFCCs are a widely used technique for extracting features from an audio signal and are often used for speech recognition.
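Conceptually, the MFCCs of a frame are obtained by applying a discrete cosine transform (DCT) to the log-scaled mel spectrum of that frame and keeping the first `n_mfcc` coefficients. The sketch below illustrates only that last step with an orthonormal DCT-II, which we understand to be torchaudio's default. The function name and the input values are hypothetical and were not taken from the example file.

```python
import math


def dct_ii_ortho(values, n_coeffs):
    """First n_coeffs coefficients of the orthonormal DCT-II of values."""
    n = len(values)
    coeffs = []
    for k in range(n_coeffs):
        total = sum(
            v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values)
        )
        # Orthonormal scaling: the k = 0 term is weighted differently.
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coeffs.append(scale * total)
    return coeffs


# Hypothetical log-mel energies (in dB) of one frame with 8 mel bands.
log_mel_frame = [-20.0, -18.0, -25.0, -30.0, -42.0, -50.0, -55.0, -60.0]
mfcc_frame = dct_ii_ortho(log_mel_frame, n_coeffs=4)
print([round(c, 2) for c in mfcc_frame])
```

The first coefficient is proportional to the average log energy of the frame, while higher coefficients describe progressively finer detail of the spectral envelope. With `n_mfcc = n_mels = 256`, as in the parameters above, no coefficients are discarded.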
LFCC
LFCC parameters
n_fft = 2048
win_length = None
hop_length = 512
n_lfcc = 256
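The main difference between LFCC and MFCC is the filter bank: LFCC spaces its filters evenly in Hz, whereas MFCC spaces them evenly on the mel scale, which packs more filters into the low frequencies. The sketch below compares the filter centre frequencies for a small, hypothetical bank of 8 filters. It assumes the HTK mel formula, and the helper names are ours.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_centres(n_filters, f_min, f_max, scale):
    """Centre frequencies (Hz) of n_filters evenly spaced triangular filters."""
    if scale == "linear":
        lo, hi, back = f_min, f_max, (lambda f: f)
    else:
        lo, hi, back = hz_to_mel(f_min), hz_to_mel(f_max), mel_to_hz
    step = (hi - lo) / (n_filters + 1)
    return [back(lo + step * (i + 1)) for i in range(n_filters)]


f_max = 44100 / 2  # Nyquist frequency of the example file
linear = filter_centres(8, 0.0, f_max, scale="linear")
mel = filter_centres(8, 0.0, f_max, scale="mel")
print([round(f) for f in linear])  # evenly spaced in Hz
print([round(f) for f in mel])     # dense at low, sparse at high frequencies
```

Because the linear bank keeps the same resolution across the whole band, LFCCs retain more detail in the high frequencies than MFCCs do.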
Pitch
Kaldi pitch with NCCF
The Kaldi pitch feature [1] is a pitch detection mechanism tuned for automatic speech recognition (ASR) applications. It is a beta feature in torchaudio, available as torchaudio.functional.compute_kaldi_pitch().
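The second value plotted alongside the pitch is the normalized cross-correlation function (NCCF), which measures how well a frame matches a delayed copy of itself. The sketch below is a heavily simplified illustration of that idea in pure Python and is not the Kaldi algorithm: it picks the best lag for a single frame and omits the smoothing and post-processing of the real method. All function names, default values and the test tone are hypothetical.

```python
import math


def nccf(frame, lag, window):
    """Normalised cross-correlation of frame[:window] with its lagged copy."""
    a = frame[:window]
    b = frame[lag:lag + window]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def estimate_pitch(frame, sample_rate, f_min=50.0, f_max=400.0, window=400):
    """Return (pitch in Hz, NCCF value) for the best lag between f_max and f_min."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    scores = {lag: nccf(frame, lag, window) for lag in range(min_lag, max_lag + 1)}
    best = max(scores.values())
    # Multiples of the true period score equally well; prefer the shortest lag.
    lag = min(l for l, s in scores.items() if s >= best - 1e-6)
    return sample_rate / lag, scores[lag]


# Hypothetical voiced frame: a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
pitch_hz, score = estimate_pitch(frame, sample_rate)
print(f"{pitch_hz:.1f} Hz, NCCF = {score:.3f}")  # 200.0 Hz, NCCF = 1.000
```

An NCCF close to 1 indicates a strongly periodic (voiced) frame, while values near 0 suggest noise or silence, which is why the NCCF is useful as a confidence measure next to the pitch estimate.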