Notes about speech processing, 13.8.2021
6dB per octave
A classic observation is that the magnitude spectrum of speech decays with frequency at a rate of approximately 6dB per octave. The decibel scale is logarithmic, and the per-octave measure treats frequency on a logarithmic scale as well. The observation thus translates to a claim that magnitude spectra are linear in the log-log domain.
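As a quick arithmetic check, note that one octave spans log10(2) units on a log10-frequency axis, so a 6dB-per-octave decay is a constant slope of -6/log10(2) ≈ -19.9 dB per decade. A minimal sketch in plain NumPy, independent of the experiment below:

import numpy as np

# One octave spans log10(2) units on a log10-frequency axis,
# so -6 dB per octave is a constant slope in the log-log domain.
slope = -6/np.log10(2)
print(slope)                                    # about -19.93 dB per decade
print(slope*(np.log10(2000) - np.log10(1000)))  # doubling the frequency costs -6.0 dB

If the rule holds, the regression slope fitted below should land near this value.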
Phenomena linear in the log-log domain are described by the Pareto distribution or, perhaps more famously for discrete distributions, by Zipf's law. Equivalently, we can say that such phenomena follow a power law.
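To make the connection concrete, here is a minimal synthetic illustration (the exponent a=2 is an arbitrary choice for demonstration): a power law y = x^(-a) turns into a straight line of slope -a in the log-log domain.

import numpy as np

# A power law y = x**(-a) satisfies log10(y) = -a*log10(x),
# so a straight-line fit in the log-log domain recovers the exponent.
a = 2.0
x = np.logspace(0, 3, 50)
y = x**(-a)
print(np.polyfit(np.log10(x), np.log10(y), 1)[0])  # -2.0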
Below is a brief implementation to demonstrate and experimentally validate the claim. It calculates the average magnitude spectrum over the LibriSpeech corpus (the train-clean-100 subset: 100 hours of speech sampled at 16 kHz). From the figure, we can see that the 6dB-per-octave rule applies from approximately 500 or 1000 Hz upwards, so this example includes a linear regression line fitted on the range 500 to 8000 Hz. Informally, we have observed that the result remains essentially unchanged for window lengths in the range 20 to 128 ms.
We find that, indeed, the magnitude spectrum is linear in the log-log domain and the 6dB-per-octave rule is remarkably accurate. The regression line is somewhat off near 500 Hz, though, indicating that a slightly steeper slope could also be considered. Recall, on the other hand, that a step of -6dB corresponds approximately to halving the signal amplitude; half amplitude is exactly 20 log10(0.5) ≈ -6.02 dB, which is a slightly steeper decline and thus consistent with the observed steepness. Why the amplitude should halve with each octave, however, remains an open question.
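Before the full listing, the half-amplitude figure is easy to verify numerically:

import numpy as np

# Halving the amplitude, expressed in decibels; the power drops to one quarter.
print(20*np.log10(0.5))   # about -6.02 dB
print(10*np.log10(0.25))  # same value via the power ratio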
import torch
import torchaudio
import numpy as np
import matplotlib.pyplot as plt
# LibriSpeech train-clean-100 (the torchaudio default subset): 100 hours of read speech at 16 kHz
dataset = torchaudio.datasets.LIBRISPEECH('/l/sounds/', download=True)
sample_rate = 16000
winlen_ms = 20
winlen = sample_rate*winlen_ms//1000

# Half-sine analysis window; cast to float32 to match the waveform dtype
window = torch.from_numpy(np.sin(np.pi*np.arange(.5,winlen,1)/winlen)).float()

# Accumulator for the average power spectrum over the whole corpus
mX = 0
for dataitem in range(len(dataset)):
    (waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id) = dataset[dataitem]
    X = torch.stft(waveform, winlen, winlen//2, window=window, return_complex=True).squeeze()
    mX += X.abs().square().mean(dim=1)/len(dataset)

# Frequency axis of the spectrum (0 to Nyquist) and the fitting range
fvector = torch.linspace(0, 8000, 1+winlen//2, dtype=torch.double)
ix = fvector.log10() > np.log10(500.)

# Least-squares fit of a line to the dB spectrum in the log-frequency domain
A = torch.ones([int(ix.sum()), 2], dtype=torch.double)
A[:,1] = fvector[ix].log10()
c = torch.matmul(torch.linalg.pinv(A), 10*mX[ix].log10().double()).numpy()

# Reference line with a slope of exactly -6 dB per octave, intercept matched to the data
c6dB = [0, -6/(np.log10(2000)-np.log10(1000))]
c6dB[0] = (10*mX[ix].log10()).mean()-(c6dB[1]*fvector[ix].log10()).mean()
plt.plot(fvector.log10(), 10*mX.log10().numpy(), label='Energy')
plt.plot(fvector[ix].log10(), c[0]+c[1]*fvector[ix].log10(), '--', label='Regression')
plt.plot(fvector[ix].log10(), c6dB[0]+c6dB[1]*fvector[ix].log10(), label='6dB/octave')
plt.legend()
ax = [1.8, 4.0, -1+10*mX.log10().min().item(), 5+10*mX[ix].log10().max().item()]
plt.axis(ax)
xt = [100, 200, 300, 500, 1000, 2000, 3000, 5000, 8000]
plt.xticks(ticks=np.log10(xt), labels=[str(t) for t in xt])
plt.xlabel('Frequency (Hz)')
plt.ylabel('Magnitude (dB)')
plt.show()