Upscaling 8kHz voice audio to 16kHz with Deep Neural Networks
2Hz is committed to developing technologies which improve Voice Audio Quality in Real Time Communications.
One contributor to poor voice quality is legacy infrastructure powered by the G.711 codec, which samples audio at 8kHz. While most of our phones can capture wideband audio (up to 48kHz), the codecs used by cellular networks downsample audio to 8kHz (lowband audio).
8kHz sampled audio can capture the frequency range to which the human ear is most sensitive; however, our voice still sounds like it is "coming from a tunnel" and is not pleasant enough. This is because the higher frequencies of our voice are absent from the audio.
Artificial Bandwidth Expansion (we call it HD Voice Playback) refers to the idea of upsampling lowband audio to wideband audio in a way that improves voice quality. This technique has been around for many years. For example, you can use the open source tool ffmpeg to perform artificial expansion. ffmpeg upsamples the audio to 16kHz, however it doesn't enrich it; the end result still sounds like it is coming from a tunnel.
In this article we describe a Deep Learning based HD Voice Playback. We call our DNN krispNet.
There are two primary use cases for HD Voice Playback known to us.
Imagine a conference call with 3 participants. One of the callers is using their phone (via a cellular network) and the other two are using VoIP on their laptops. The VoIP users (assuming there is good connectivity) will have wideband audio (>=16kHz). The phone user will sound like they are in a tunnel (lowband).
If the Conferencing Service enriched the lowband audio (8kHz) before sending it to the laptop users, they would hear higher quality audio (16kHz) instead of a voice coming from a tunnel.
Conferencing Service makes lowband audio sound better
Imagine you are having a direct call with another person, and the calling Mobile App you are using adjusts the audio sampling rate to your network conditions, so it sometimes has to switch to 8kHz audio for reliability reasons. If the receiving device had a way to upsample the audio before playing it, the end user wouldn't hear any voice quality degradation.
Audio is transmitted with low bandwidth but then upsampled to HD on the device
Another reason to do this is simply to save network bandwidth by sending fewer bytes and doing more work on the edge.
Let's listen to some samples before digging deeper into the techniques we used.
original lowband - 8kHz lowband audio from a regular call
ffmpeg wideband - 8kHz lowband audio converted to 16kHz via
ffmpeg -i lowband.wav -ar 16000 wideband.wav
krispNet wideband - 8kHz lowband audio converted to 16kHz via krispNet (2Hz)
Chinese Speech in Noisy Street (note that background noise isn't extrapolated)
Spectrogram of a sample audio.
ffmpeg processed audio is on the left and krispNet processed audio is on the right.
As we can see, the higher frequencies (above 4000Hz) are expanded by krispNet.
The first two seconds are not human speech (noise) and therefore they are not expanded by krispNet.
We call our Deep Neural Network krispNet.
As in any other machine learning problem, the key to success lies in finding the balance between the training data, the DNN architecture, and domain-specific techniques. We performed many experiments: finding, converting, and tailoring data for our specific problem, trying out different DNN architectures, and extracting different voice features through Digital Signal Processing techniques, until we reached the final results.
We have trained krispNet with 520 hours of human speech audio.
Below is how the entire training process works.
We take an 8kHz wav file and calculate the power spectrum and phase of overlapping frames. Overlapping frames preserve the correlation between neighboring frames. The minimal length of an audio signal that the human ear can differentiate lies between 20ms and 40ms; we take a 32ms frame length, which is close to the center of this range. We feed the network with the resulting vectors of Fourier coefficients.
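The framing and feature extraction described above can be sketched as follows. This is an illustrative NumPy implementation, not 2Hz's actual pipeline: the 50% hop, the Hann window, and the function name are assumptions; only the 32ms frame length, the power spectrum, and the retained phase come from the article.

```python
import numpy as np

def frame_power_spectrum(signal, sr=8000, frame_ms=32, hop_ms=16):
    """Split audio into overlapping frames and compute each frame's
    power spectrum and phase. Hop size and windowing are assumptions."""
    frame_len = int(sr * frame_ms / 1000)   # 256 samples at 8kHz
    hop = int(sr * hop_ms / 1000)           # 50% overlap keeps neighboring frames correlated
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra, phases = [], []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        fft = np.fft.rfft(frame)            # 129 complex Fourier coefficients
        spectra.append(np.abs(fft) ** 2)    # power spectrum fed to the DNN
        phases.append(np.angle(fft))        # phase kept for later reconstruction
    return np.array(spectra), np.array(phases)
```

One second of 8kHz audio yields 61 frames of 129 coefficients each with these settings.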
krispNet is a dense network with several hidden layers.
We use Dropout for regularization, Batch Normalization, and other best practices from Machine Learning. We use ReLU as the activation function between hidden layers. We use the Adam optimizer with a learning rate of 1e-4, decayed via the common technique called exponential decay. On the output, krispNet predicts the upper band of the signal.
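A minimal Keras sketch of such a dense upper-band predictor might look like the following. The layer sizes, dropout rate, number of hidden layers, decay schedule parameters, and input/output dimensions are illustrative assumptions; only the dense layers, ReLU, Batch Normalization, Dropout, and Adam at 1e-4 with exponential decay come from the article.

```python
import tensorflow as tf

N_LOW = 129   # Fourier coefficients of an 8kHz input frame (assumed)
N_HIGH = 128  # predicted upper-band coefficients (assumed)

def build_model(hidden_units=(512, 512, 512), dropout=0.2):
    """Dense network with batch norm, ReLU, and dropout per hidden layer.
    Sizes are placeholders, not krispNet's actual hyperparameters."""
    inputs = tf.keras.Input(shape=(N_LOW,))
    x = inputs
    for units in hidden_units:
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    outputs = tf.keras.layers.Dense(N_HIGH)(x)  # upper-band prediction
    model = tf.keras.Model(inputs, outputs)
    # Adam at 1e-4 with exponential learning-rate decay, per the article;
    # decay_steps and decay_rate are assumptions.
    lr = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-4, decay_steps=10000, decay_rate=0.96)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model
```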
The results are based on training on Nvidia 1080 Ti GPU for 30 epochs using Tensorflow.
We construct wideband audio signal in the following way.
We take the narrowband signal which we fed to krispNet and upsample it using traditional methods. In parallel, we predict the upper band of the same signal with krispNet and then compute an IDFT to bring the signal back to the time domain. When doing the IDFT we use the phases of the original narrowband signal to reconstruct the phase of the wideband signal. Then we simply add the two and get a 16kHz bandwidth-expanded audio signal as a result.
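The reconstruction step can be sketched like this. The frame and hop sizes, the Hann synthesis window, and the choice to copy the narrowband phases directly into the upper-band bins are all assumptions for illustration; only the overall scheme (IDFT of the predicted upper band with reused narrowband phases, summed with a conventionally upsampled signal) comes from the article.

```python
import numpy as np

def reconstruct_wideband(upsampled, upper_power, low_phase,
                         frame_len=512, hop=256):
    """Synthesize the predicted upper band frame by frame via IDFT,
    overlap-add the frames, and sum with the conventionally upsampled
    16kHz narrowband signal. Frame sizes and phase mapping are assumptions."""
    n_frames, n_upper = upper_power.shape
    out = np.zeros(len(upsampled))
    window = np.hanning(frame_len)
    for i in range(n_frames):
        # Wideband spectrum: zeros in the lower band (already covered by
        # the upsampled signal); predicted magnitudes with reused
        # narrowband phases in the upper band.
        spec = np.zeros(frame_len // 2 + 1, dtype=complex)
        mags = np.sqrt(upper_power[i])                     # power -> magnitude
        spec[-n_upper:] = mags * np.exp(1j * low_phase[i][:n_upper])
        frame = np.fft.irfft(spec, n=frame_len) * window   # back to time domain
        start = i * hop
        end = min(start + frame_len, len(out))
        out[start:end] += frame[: end - start]             # overlap-add
    return out + upsampled
```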
Process of the real time wideband audio reconstruction
krispNet is able to process 20 concurrent 8kHz audio streams in real time per CPU on an average AWS instance.
krispNet is able to process 2000 concurrent 8kHz audio streams in real time on an average GPU-enabled AWS instance.
If you are a Conferencing Service provider we would love to integrate this technology into your infrastructure and improve the experience of your end users with enhanced audio.
krispNet can be integrated into Media Servers (e.g. Asterisk, FreeSWITCH) as an audio plugin.
Melik Karapetyan, Ashot Baghdasaryan and Vazgen Mikayelyan are Software Engineers at 2Hz.