Universal Speech Enhancement With Score-based Diffusion

This is the companion page of UNIVERSE, the universal speech enhancer described in the paper “Universal Speech Enhancement With Score-based Diffusion” by Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, and Davide Scaini. To access the paper, click here.

On this page you will find basic information about the paper, three sets of speech enhancement examples, and a link to some instances of our validation set.

Info

Abstract

Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
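As a toy illustration of the few-step sampling regime mentioned in the abstract, the snippet below runs a generic probability-flow sampler for a score-based diffusion model on 1-D Gaussian data. This is a minimal sketch of the sampling principle only: the closed-form score, the geometric sigma schedule, and the Heun integrator are illustrative choices, not taken from the UNIVERSE architecture (which uses a learned score model and a multi-resolution conditioning network).

```python
import numpy as np

def score(x, sigma):
    # For toy data x0 ~ N(0, 1), the noisy marginal at noise level sigma is
    # N(0, 1 + sigma^2), so the score (gradient of the log density) is
    # available in closed form; a real system learns this with a network.
    return -x / (1.0 + sigma**2)

def sample(n_steps=8, n_samples=5000, sigma_max=10.0, sigma_min=0.01, seed=0):
    # Probability-flow ODE dx/dsigma = -sigma * score(x, sigma),
    # integrated from sigma_max down to sigma_min with Heun's method.
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)
    x = rng.normal(0.0, np.sqrt(1.0 + sigma_max**2), n_samples)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d_cur = -s_cur * score(x, s_cur)           # derivative at s_cur
        x_pred = x + (s_next - s_cur) * d_cur      # Euler predictor
        d_next = -s_next * score(x_pred, s_next)   # derivative at s_next
        x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)  # Heun corrector
    return x

samples = sample(n_steps=8)
print(samples.std())  # approaches 1.0 (the data std) as n_steps grows
```

Even with 8 steps the sample statistics land near those of the toy data distribution, which is the sense in which a handful of diffusion steps can already be competitive.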

Core idea

The next video highlights the core idea of the project.

Reference

Universal Speech Enhancement With Score-based Diffusion
J. Serrà, S. Pascual, J. Pons, R. O. Araz, & D. Scaini.
arXiv: 2206.03065. June 2022.

Examples

Here we provide a number of examples of enhancements performed by UNIVERSE. There are three sections: (1) examples enhancing the speech in real-world videos, (2) examples enhancing real-world speech recordings, and (3) examples from our validation set highlighting the removal of different and often simultaneous distortions.

Real-world speech from video

[Sources: Internet Archive and YouTube | License: Creative Commons (see videos)]

Documentaries — In old documentaries, the speech is usually band-limited and dynamically compressed. In addition, recordings can contain background noise or codec artifacts due to suboptimal recording or digitization, respectively. Clipping or reverb may also be present and not fully under control.

Talks — Talks given in class or conference rooms typically contain a large amount of reverberation. In addition, they feature inconsistent loudness levels and problematic equalization depending on the distance and angle between the speaker and the microphone. Some background noise can also be present.

Cooking — Cooking is a good example of a noisy environment where a distant microphone sometimes records a mixture of noise and speech. In many cases, post-production background music is also added, and some amount of natural or synthetic reverberation can be present.

Other — Other situations where speech recordings may contain distortion include room recordings with a distant microphone, exterior recordings featuring background noise, or recordings made with non-professional means. In addition, audio may be encoded with an aggressive compression factor, yielding codec artifacts.

Real-world speech recordings

[Source: Freesound | License: Creative Commons CC0 1.0]

Background noise — Here we remove background noise under different SNRs.

Input:
Enhanced:
Input:
Enhanced:
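For reference, "SNR" here is the power ratio between speech and noise, expressed in decibels. The sketch below shows how a noisy input at a given SNR can be constructed; the function and signal names are illustrative, not from the paper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db,
    # then add it to the clean speech.
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: a 440 Hz tone standing in for speech, white noise mixed at 5 dB SNR.
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 440.0 * t)
noise = np.random.default_rng(0).normal(size=sr)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Lower SNR values put more noise power relative to the speech, which is why the examples above get progressively harder to enhance.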

Reverb — Here we show examples where room reverb is removed.

Input:
Enhanced:
Input:
Enhanced:

Clipping/dynamics — This is an example of controlling the dynamic range and removing some clipping.

Input:
Enhanced:
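For context, hard clipping simply truncates the waveform at a fixed amplitude threshold; the flattened peaks introduce the harsh harmonics that the enhancer removes above. A minimal sketch of the distortion itself (the threshold value is illustrative):

```python
import numpy as np

def hard_clip(x, threshold=0.3):
    # Truncate every sample to [-threshold, +threshold]. Anything the
    # original waveform carried beyond that range is lost.
    return np.clip(x, -threshold, threshold)

# A 220 Hz tone with 0.9 peak amplitude, clipped well below its peaks.
t = np.arange(16000) / 16000.0
clean = 0.9 * np.sin(2 * np.pi * 220.0 * t)
clipped = hard_clip(clean, threshold=0.3)
```

Undoing this is a generative problem: the clipped regions must be plausibly reconstructed, not just filtered.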

Expressivity — Here is an example to show that, despite being a generative model, UNIVERSE does not lose expressivity in the enhancement process.

Input:
Enhanced:

De-essing — This example shows that the model also controls strong sibilant sounds.

Input:
Enhanced:

Validation set utterances

[Sources: Publicly-available data sets (see paper) | License: Creative Commons CC-BY-4.0]

Silent gaps

Input:
Enhanced:

Codec + Dynamics + EQ + Noise + Spectral manipulation

Input:
Enhanced:

Noise + Reverb

Input:
Enhanced:

Noise + High-pass

Input:
Enhanced:

Codec + Clipping

Input:
Enhanced:

Low-pass

Input:
Enhanced:

Telephonic speech + Dynamics compressor

Input:
Enhanced:

Noise + Codec

Input:
Enhanced:

Noise + Silent gaps

Input:
Enhanced:

EQ + Reverb

Input:
Enhanced:

Validation subset download

To foster further subjective evaluation, we provide the first (random) 100 utterances of our validation data here. Input, target, and enhanced files are included, together with a small description of speech/noise sources and applied distortions.



[Last edit: June 2022]