Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation
The dominant speech separation models are based on complex recurrent or convolution
neural network that model speech sequences indirectly conditioning on context, such as …
Past review, current progress, and challenges ahead on the cocktail party problem
The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker
when multiple speakers talk simultaneously, is one of the critical problems yet to be solved …
Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique.
uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker …
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
We propose a novel deep learning training criterion, named permutation invariant training
(PIT), for speaker independent multi-talker speech separation, commonly known as the …
Deep clustering: Discriminative embeddings for segmentation and separation
We address the problem of "cocktail-party" source separation in a deep learning framework
called deep clustering. Previous deep network approaches to separation have shown …
Unsupervised sound separation using mixture invariant training
In recent years, rapid progress has been made on the problem of single-channel sound
separation using supervised training of deep neural networks. In such supervised …
Spex: Multi-scale time domain speaker extraction network
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target
speaker's voice from a multi-talker environment. It is common to perform the extraction in …
Energy disaggregation via discriminative sparse coding
Energy disaggregation is the task of taking a whole-home energy signal and separating it
into its component appliances. Studies have shown that having device-level energy …
Paralinguistics in speech and language—state-of-the-art and the challenge
Paralinguistic analysis is increasingly turning into a mainstream topic in speech and
language processing. This article aims to provide a broad overview of the constantly …