Fixing Voice Breakups with Deep Learning

Applying Deep Neural Networks to PLC in Voice Communications

At 2Hz, we are continuously rethinking traditional approaches to known problems in voice processing and disrupting them by applying deep learning.

Our last article discussed the problem of bandwidth expansion in voice audio.

This time it's PLC's turn.

PLC (Packet Loss Concealment) addresses a well-known problem in voice communications, one familiar to virtually every telecommunication user in the world. Anyone who has used a VoIP app or a cellular phone has experienced "chopped voice". When network conditions are bad, our voice cuts out and sounds annoying or funny. Early Skype users remember this very well. We sound like "he e ey how aaaaaare yoo ooo?"

In this article we demonstrate how our Deep Neural Network (DNN) powered PLC algorithm (krispNet-PLC) compares to existing state-of-the-art PLC technologies.

Packet Loss Concealment (PLC)


To transmit audio over the internet, the audio signal is first sampled and divided into frames. These frames are then encoded by a codec and grouped into network packets, which are sent over the internet. On the receiving side, the opposite procedure is carried out and the resulting audio data is played out.

The network packets travel through different routes on the internet. On the receiving side, some of these packets arrive with significant delay or never arrive at all. These are the lost packets, and the receiver has to cope with them as well as it can.
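The framing and packetization steps can be sketched as follows. This is a minimal illustration, not the pipeline of any particular codec: the 20 ms frame length, the two frames per packet, and the function names are assumptions made for the example, and the codec encoding step is left out entirely.

```python
def split_into_frames(samples, sample_rate=8000, frame_ms=20):
    """Split PCM samples into fixed-size frames, zero-padding the last one.

    The 20 ms default is an assumed, illustrative frame length.
    """
    frame_len = sample_rate * frame_ms // 1000
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        # Pad the final frame so every frame has the same length.
        frame += [0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def group_into_packets(frames, frames_per_packet=2):
    """Group consecutive frames into packets (encoding is omitted here)."""
    return [frames[i:i + frames_per_packet]
            for i in range(0, len(frames), frames_per_packet)]


samples = list(range(400))            # stand-in for real PCM samples
frames = split_into_frames(samples)   # 160 samples per frame at 8000 Hz
packets = group_into_packets(frames)
```

With more than one frame per packet, a single lost packet removes several consecutive frames at once, which is why the loss is harder to hide.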

Each lost packet results in a number of lost audio frames.

The role of a PLC algorithm is to predict and fill in these frames so the audio quality does not degrade.

State-of-the-art PLC


The simplest algorithms are zero fill, repeating the last frame, and variations of the two.

The audio generated by the zero fill algorithm sounds chopped, while the audio generated by the frame-repeating algorithm sounds robotic.

Below are examples of audio recovered with Zero Fill and Repeat.
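Both baselines fit in a few lines. The sketch below is illustrative only: the `conceal` function and its `received` mask (one boolean per frame) are hypothetical names introduced for this example.

```python
def conceal(frames, received, method="zero"):
    """Fill lost frames with silence ("zero") or the last good frame ("repeat").

    frames:   list of equally sized frames (lists of samples)
    received: list of booleans, False where the frame was lost
    """
    frame_len = len(frames[0]) if frames else 0
    last_good = [0] * frame_len  # used if the very first frame is lost
    output = []
    for frame, ok in zip(frames, received):
        if ok:
            last_good = list(frame)
            output.append(list(frame))
        elif method == "zero":
            output.append([0] * frame_len)
        elif method == "repeat":
            output.append(list(last_good))
        else:
            raise ValueError(f"unknown method: {method}")
    return output


frames = [[1, 2], [3, 4], [5, 6]]
received = [True, False, True]
zero_filled = conceal(frames, received, method="zero")
repeated = conceal(frames, received, method="repeat")
```

Zero fill inserts abrupt silence, which is heard as chopping. Repeat keeps the energy but replays identical samples, which is heard as a robotic buzz during longer losses.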

30% packet loss

Zero Fill PLC

Repeat PLC


Opus PLC


An emerging audio codec for general-purpose voice and music over the internet is the Opus codec. Opus, standardized by the IETF as RFC 6716, is unmatched for interactive speech and music transmission.

Opus incorporates technology from Skype's SILK codec and Xiph.Org's CELT codec. It also includes a PLC algorithm with good loss robustness and concealment.

The Opus-PLC algorithm either uses LPC extrapolation from the previous frame, or finds a periodicity in the decoded signal and repeats the windowed waveform using the pitch offset.
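The pitch-repetition idea can be illustrated with a simplified sketch. This is not the Opus implementation: it estimates the pitch period with a plain normalized autocorrelation and copies the last period forward, leaving out LPC extrapolation, windowing, overlap-add and attenuation. The function names and lag bounds are assumptions made for the example.

```python
import math


def estimate_pitch_period(history, min_lag, max_lag):
    """Return the lag with the highest normalized autocorrelation.

    history must be longer than max_lag samples.
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        a = history[lag:]    # signal
        b = history[:-lag]   # same signal, delayed by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def conceal_by_pitch_repetition(history, frame_len, min_lag=20, max_lag=160):
    """Fill one lost frame by repeating the last pitch period of history."""
    period = estimate_pitch_period(history, min_lag, max_lag)
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(frame_len)]


# A synthetic "voiced" signal with a 40-sample period (200 Hz at 8000 Hz).
history = [math.sin(2 * math.pi * n / 40) for n in range(400)]
lost_frame = conceal_by_pitch_repetition(history, 80, min_lag=20, max_lag=60)
```

For a perfectly periodic signal this continuation is exact. Real speech drifts in pitch and amplitude, so simple repetition degrades quickly as the loss gets longer.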

Below is an example of Opus-PLC recovered audio.

30% packet loss

Opus PLC

KrispNet - Deep Learning Based Algorithm


Deep Neural Network (DNN) methods have achieved tremendous success in tackling difficult real-world problems such as image, audio and video processing. The key difficulties in using DNN methods are incorporating the network so that it plays the central role in the algorithm, choosing the right data, features, preprocessing and network structure, and training the network properly.

The DNN-powered krispNet-PLC algorithm extracts features from the frames neighboring a missing frame and generates a prediction of the missing frame that is good enough to achieve a smooth transition between frames. The features are the log spectrum, including the phases, which are fed to the network after appropriate preprocessing.

Our data is 960 hours of voice, including both clean and noisy audio. This corresponds to about 51.5 GByte of 16-bit voice data sampled at 8000 Hz.
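To make "log spectrum, including the phases" concrete, here is a minimal sketch of such a feature computation for a single frame. It is not the krispNet-PLC feature pipeline, whose frame size, windowing and preprocessing are not described here. The naive DFT, the `eps` floor and the function name are assumptions chosen to keep the example dependency-free.

```python
import cmath
import math


def log_spectrum_features(frame, eps=1e-9):
    """Return (log-magnitudes, phases) for the non-negative frequency bins.

    Uses a naive O(n^2) DFT; a real system would use an FFT library.
    """
    n = len(frame)
    log_mag, phase = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        # eps avoids log(0) for empty bins (an assumed, arbitrary floor).
        log_mag.append(math.log(abs(coeff) + eps))
        phase.append(cmath.phase(coeff))
    return log_mag, phase


# A 32-sample frame holding exactly 4 cycles of a cosine.
frame = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
log_mag, phase = log_spectrum_features(frame)
```

Keeping the phase alongside the log magnitude matters for concealment, because the predicted frame has to line up with its neighbors in the time domain, not just have a plausible spectrum.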
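The dataset size can be checked with a few lines of arithmetic. The sketch assumes mono, uncompressed PCM, which the figures above imply but do not state.

```python
hours = 960
sample_rate_hz = 8000
bytes_per_sample = 2  # 16-bit samples, assuming a single (mono) channel

total_bytes = hours * 3600 * sample_rate_hz * bytes_per_sample
size_binary_gb = total_bytes / 2**30   # gigabytes counted as 2^30 bytes
size_decimal_gb = total_bytes / 10**9  # gigabytes counted as 10^9 bytes

print(f"{total_bytes} bytes")
print(f"{size_binary_gb:.1f} GB (binary), {size_decimal_gb:.1f} GB (decimal)")
```

The 51.5 figure matches the binary convention (2^30 bytes per gigabyte). In decimal gigabytes the same data is about 55.3 GB.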

Opus PLC vs krispNet-PLC


Let's hear samples and compare them before digging into diagrams.

In these samples we used the standard Opus tools to simulate packet loss at different rates on a given audio clip, and then reconstructed the audio using the Opus-PLC and krispNet-PLC algorithms.
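For readers who want to reproduce a similar experiment without the Opus tools, packet loss at a given rate can be approximated as follows. This is an assumption-laden sketch, not what the Opus tools do internally: it drops each packet independently, whereas losses on real networks tend to come in bursts. The function name and the fixed seed are illustrative.

```python
import random


def simulate_packet_loss(num_packets, loss_rate, seed=0):
    """Return one boolean per packet: True if received, False if lost.

    Each packet is dropped independently with probability loss_rate.
    """
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    return [rng.random() >= loss_rate for _ in range(num_packets)]


received = simulate_packet_loss(10000, 0.30, seed=1)
observed_loss = received.count(False) / len(received)
```

The resulting mask marks which packets a PLC algorithm has to conceal, so the same loss pattern can be replayed against every algorithm being compared.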

10% packet loss

Opus PLC

krispNet-PLC

Reference (no packet loss)


20% packet loss

Opus PLC

krispNet-PLC

Reference (no packet loss)


30% packet loss

Opus PLC

krispNet-PLC

Reference (no packet loss)


40% packet loss

Opus PLC

krispNet-PLC

Reference (no packet loss)


As you can hear, the superiority of krispNet-PLC is apparent. Our recovered voice audio is intelligible even at a 40% loss rate. This is remarkable.

The following diagram depicts the waveform of a sample audio signal alongside the waveforms produced by the different algorithms.

The following figure compares the PESQ (Perceptual Evaluation of Speech Quality) scores of the original audio and of the audio generated by Opus-PLC and krispNet-PLC at different packet loss rates.

What's Next


The first step in deploying krispNet-PLC in real life is integration into Krisp for Mac. Simply by having the Krisp app on both ends, "voice chopping" will be significantly reduced out of the box, regardless of which conferencing app the users are on (Zoom, Skype, Hangouts, you name it).

The next step is integration into mobile devices.

Authors


Karen Yeressian is a machine learning researcher at 2Hz. Arto Minasyan is the CTO of 2Hz.