Notes about speech processing, 16.8.2021
Log-normal magnitudes
The magnitudes of the time-frequency components of speech signals are often said to be normally distributed in the log-domain (such as in decibels). Others claim that the components themselves follow the Laplace distribution. In my own previous work, I found that the distribution of the components depends to a high degree on the normalization used: components normalized by the spectral envelope typically follow the generalized normal distribution, which covers the Laplace distribution as a special case. In this short note, I discuss the motivations why the magnitudes should follow the log-normal distribution.
Gibrat’s law and connection to central limit theorem
According to the central limit theorem, if we take a collection of independent random variables of arbitrary distributions, then the distribution of their sum approaches the normal distribution as the number of random variables increases. Gibrat's law is a consequence of this theorem: if we take a collection of positive random variables, then the distribution of their product approaches the log-normal distribution. This relation is straightforward to see, since the logarithm of a product is a sum of logarithms, and hence we can apply the central limit theorem to the logarithms of the components. It follows that the sum of logarithms approaches the normal distribution, and thus the product follows the log-normal distribution.
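This is easy to check with a small simulation; below is a minimal sketch where the factor distribution, the number of factors, and the sample size are all arbitrary illustrative choices. If Gibrat's law holds, the logarithm of the product should be approximately normal, so its skewness and excess kurtosis should both be near zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Product of many positive i.i.d. random variables (here: uniform on (0.5, 1.5)).
# By Gibrat's law the product is approximately log-normal, i.e. its logarithm
# is approximately normal.
n_factors = 200
n_samples = 10000
factors = rng.uniform(0.5, 1.5, size=(n_samples, n_factors))
log_product = np.log(factors).sum(axis=1)  # log of product = sum of logs

# Standardize and compute skewness and excess kurtosis; both are zero for a
# normal distribution.
z = (log_product - log_product.mean()) / log_product.std()
skewness = (z**3).mean()
excess_kurtosis = (z**4).mean() - 3.0
```

With 200 factors, both statistics come out close to zero, consistent with the log-product being nearly Gaussian.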
Speech production
Speech is produced by oscillations of the vocal folds and turbulence in the vocal tract, both acoustically shaped by the vocal tract. In speech processing applications, the classical models for these components are filters (viz. convolutions). Each part of the speech production system can be approximated by a filter, and the collection can be convolved into one large filter. In discrete representations we can also proceed in the reverse direction and decompose the system model into a large number of constituent first-order filters.
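The decomposition into first-order parts can be demonstrated numerically: the roots of the filter polynomial give its first-order factors, and multiplying them back together recovers the original filter. The coefficients below are made-up example values:

```python
import numpy as np

# Hypothetical FIR filter coefficients (a polynomial in z^-1); any real
# coefficients work the same way.
b = np.array([1.0, -0.9, 0.4, -0.1])

# The roots r_k factor the filter as b(z) = b[0] * prod_k (1 - r_k z^-1),
# i.e. a cascade of first-order filters.
roots = np.roots(b)

# Multiplying the first-order factors back together recovers the filter.
b_reconstructed = b[0] * np.poly(roots)
```

Complex roots appear in conjugate pairs, so the re-multiplied coefficients are real up to numerical precision.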
When looking at the system in the frequency domain, filtering operations become multiplications. The whole speech production system is relatively complex, so the model can be decomposed into a product of relatively many components, each of a random character. According to Gibrat's law, the logarithm of that product will therefore approach a normal distribution. Consequently, we can expect the magnitudes of spectral components to follow the log-normal distribution.
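This can also be illustrated numerically: the log-magnitude response of a cascade of many random first-order filters, evaluated at a single frequency, is close to normally distributed. The coefficient distribution, cascade length, and trial count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Cascade of K random first-order FIR filters H_k(z) = 1 - a_k z^-1,
# evaluated at a single frequency w. The cascade magnitude is the product
# of the individual magnitudes, so its logarithm is a sum of random terms.
K = 100
n_trials = 5000
w = 0.3 * np.pi
a = rng.uniform(-0.9, 0.9, size=(n_trials, K))  # random filter coefficients
H = 1.0 - a * np.exp(-1j * w)                   # first-order responses at w
log_magnitude = np.log(np.abs(H)).sum(axis=1)   # log-magnitude of the cascade

# Standardize and check skewness, which is zero for a normal distribution.
z = (log_magnitude - log_magnitude.mean()) / log_magnitude.std()
skewness = (z**3).mean()
```

The skewness of the log-magnitude comes out near zero, as expected for an approximately normal distribution.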
A caveat here is that speech is not uniformly distributed over time. In particular, we can easily categorize speech signals into segments of speech and non-speech. We might therefore see multiple peaks in the distribution, corresponding at least to speech and silence. Further peaks corresponding to other discrete categories, such as voiced and unvoiced phonemes, could perhaps also be recognized. These categories occur sequentially over time, such that when we analyze speech in windows, each window belongs to one of the categories or to a transition between them. The categories are therefore not in a multiplicative relation to each other, but appear in the distribution as additive components. In other words, the overall distribution is a mixture distribution over the different categories, where each category would typically follow a log-normal distribution.
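As a sketch of what such a mixture looks like, we can draw dB-magnitudes from two made-up categories, "speech" and "silence", each normal in the dB domain (i.e. log-normal in magnitude). All means, deviations, and weights below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-category mixture: each frame is either "speech" or "silence", and each
# category has normally distributed dB-magnitudes. The numbers are made up.
p_speech = 0.6
n = 20000
is_speech = rng.random(n) < p_speech
magnitude_dB = np.where(is_speech,
                        rng.normal(-25.0, 8.0, size=n),   # "speech" frames
                        rng.normal(-70.0, 5.0, size=n))   # "silence" frames

# The histogram of the mixture shows two separate humps rather than a single
# Gaussian bell.
counts, edges = np.histogram(magnitude_dB, bins=80, range=(-100, 10))
```

Between the two humps the histogram dips, which is the signature of an additive mixture rather than a single log-normal component.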
Room responses
In addition to speech production itself, room responses are also convolutive in character and thus contribute to the log-normal effect.
Verification
To experimentally verify whether the above arguments are accurate, I ran the experiment below, where I calculate the histogram of one frequency bin over the LibriSpeech train-clean-100 corpus (100 hours of speech). This is a very large corpus and the recording conditions likely vary, so this could be an overly easy task, in the sense that if we take a very large amount of almost anything, it is likely to look normally distributed. In any case, in the figure below you can see that the data indeed follows the normal distribution.
import torch
import torchaudio
import numpy as np
import matplotlib.pyplot as plt

dataset = torchaudio.datasets.LIBRISPEECH('/l/sounds/', download=True)
(_, sample_rate, _, _, _, _) = dataset[0]

# Analysis window: 20 ms Hann window (squared half-cycle sine), cast to
# float32 to match the waveform dtype expected by torch.stft.
winlen_ms = 20
winlen = sample_rate*winlen_ms//1000
window = torch.from_numpy(np.sin(np.pi*np.arange(.5, winlen, 1)/winlen)).float()**2

# Pick the spectrum bin closest to 1000 Hz.
target_frequency_Hz = 1000
spectrum_length = winlen//2 + 1
nyquist = sample_rate//2
target_frequency_index = target_frequency_Hz*spectrum_length//nyquist

# Accumulate a histogram of the log-magnitude over the whole corpus,
# with 1 dB bins from -120 dB to +40 dB.
histogram_range = [-120, 40]
dB_index = torch.arange(histogram_range[0], histogram_range[1] + 1, 1)
histogram = torch.zeros(dB_index.shape[0])
for dataitem in range(len(dataset)):
    (waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id) = dataset[dataitem]
    # Power spectrum of the target frequency bin over all frames of the utterance.
    X = torch.stft(waveform, winlen, winlen//2, window=window,
                   return_complex=True).squeeze().abs().square()[target_frequency_index, :]
    histogram += torch.histc(10*X.log10(), len(dB_index),
                             histogram_range[0], histogram_range[1])

# Normalize to a probability distribution and fit a Gaussian by moments.
p = histogram/histogram.sum()
mu_dB = (dB_index*p).sum()
var_dB = (((dB_index - mu_dB)**2)*p).sum()
std_dB = var_dB.sqrt()
print('Mean: ' + str(mu_dB.item()) + ' dB')
print('Standard deviation: ' + str(std_dB.item()) + ' dB')

# Gaussian density with the same mean and standard deviation as the histogram.
f = (1/(std_dB*np.sqrt(2*np.pi)))*torch.exp(-0.5*((dB_index - mu_dB)/std_dB)**2)
plt.plot(dB_index, (p*f.mean()/p.mean()).numpy(), label='Histogram')  # scaled to match the density
plt.plot(dB_index, f.numpy(), label='Gaussian model')
plt.xlabel('Magnitude (dB)')
plt.legend()
plt.show()

Mean: -27.661785125732422 dB
Standard deviation: 17.668245315551758 dB
In the figure above we see the histogram of the log-magnitude of one frequency component, as well as a Gaussian distribution fitted to that histogram. The match is reasonably good as per informal visual evaluation. The histogram is slightly skewed to the left and there is a small hump around -80 to -70 dB, but otherwise it really looks like a Gaussian.