Research
Feature Extraction

At this step, it is crucial to analyse some random audio samples from the collected data. This procedure will help us better understand our data by visualizing the audio files with various techniques. Furthermore, this analysis aims to extract audio features that will be used later when building our model.

Audio feature extraction

For audio feature extraction we will use methods and functions from the torchaudio Python library. We will explore the relationship between basic audio representations, such as the waveform, and the more advanced features that torchaudio's API offers. For presentation purposes, we show two examples from the collected data, not yet labeled, each approximately 4 seconds long.

Audio

Jan23_21-00-33-hip-hop-228614956.wav

Stats

  • File size: 1008080 bytes
  • AudioMetaData(sample_rate=44100, num_frames=168000, num_channels=2, bits_per_sample=24, encoding=PCM_S)
  • Sample Rate: 44100
  • Shape: (2, 168000)
  • Dtype: torch.float32
  • Max: 0.330
  • Min: -0.258
  • Mean: -0.000
  • Std Dev: 0.049
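
As a quick sanity check on the stats above, the clip duration and the raw PCM payload can be derived from the metadata alone. The following is a minimal sketch in plain Python. The variable names are illustrative, and the values are copied from the file size and AudioMetaData entries listed above.

```python
# Values copied from the stats listed above.
sample_rate = 44100      # frames per second
num_frames = 168000      # frames per channel
num_channels = 2
bits_per_sample = 24
file_size = 1008080      # bytes on disk

# Duration in seconds: frames divided by frames-per-second.
duration_s = num_frames / sample_rate

# Raw PCM payload: every frame stores one sample per channel.
pcm_bytes = num_frames * num_channels * (bits_per_sample // 8)

# Whatever is left over is container overhead (header and other non-audio data).
overhead_bytes = file_size - pcm_bytes

print(f"duration: {duration_s:.2f} s")
print(f"PCM payload: {pcm_bytes} bytes")
print(f"container overhead: {overhead_bytes} bytes")
```

The duration comes out at about 3.81 seconds, which matches the roughly 4-second clips described earlier, and the audio payload accounts for all but 80 bytes of the file.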

Waveform

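import matplotlib.pyplot as plt
import torch
import torchaudio


class Visualization:
    @staticmethod
    def waveform(waveform, sample_rate, title=None, save=False):
        """Plot each channel of a (channels, frames) tensor against time in seconds."""
        waveform = waveform.numpy()

        num_channels, num_frames = waveform.shape
        time_axis = torch.arange(0, num_frames) / sample_rate

        # One subplot per channel; wrap a single Axes in a list so indexing works.
        figure, axes = plt.subplots(num_channels, 1)
        if num_channels == 1:
            axes = [axes]
        for c in range(num_channels):
            axes[c].plot(time_axis, waveform[c], linewidth=1)
            axes[c].grid(True)
            if num_channels > 1:
                axes[c].set_ylabel(f"Channel {c + 1}")
        figure.suptitle("waveform")
        plt.show(block=False)
        if save:
            # Util.create_plot_path is a project helper (not shown here).
            figure.savefig(Util.create_plot_path(title=title))


# AUDIO_SAMPLE_FILE and AUDIO_FILE_TITLE are project constants (not shown here).
waveform, sample_rate = torchaudio.load(AUDIO_SAMPLE_FILE)
Visualization.waveform(waveform, sample_rate=sample_rate, title=AUDIO_FILE_TITLE, save=True)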

Spectrogram

Spectrogram

Spectrogram of 2 channels at a 44100 Hz sample rate

Torchaudio feature extraction

Mel Spectrogram

MelSpectrogram for a raw audio signal

Spectrogram and Mel spectrogram

Mel spectrogram parameters

n_fft = 1024
win_length = None
hop_length = 512
n_mels = 128

MFCC

Mel-frequency cepstrum

MFCC Spectrogram

MFCC spectrogram parameters

n_fft = 2048
win_length = None
hop_length = 512
n_mels = 256
n_mfcc = 256

MFCC is a widely used technique for extracting features from an audio signal, and it is often used for speech recognition.

LFCC

MFCC Spectrogram

LFCC spectrogram parameters

n_fft = 2048
win_length = None
hop_length = 512
n_lfcc = 256

Pitch

Pitch waveform

Kaldi pitch with NCCF

Pitch waveform

The Kaldi pitch feature [1] is a pitch detection mechanism tuned for automatic speech recognition (ASR) applications. It is a beta feature in torchaudio, available as torchaudio.functional.compute_kaldi_pitch().

Sample #2

Pitch waveform