Layer-wise analysis of a self-supervised speech representation model
Recently proposed self-supervised learning approaches have been successful for pre-
training speech representation models. The utility of these learned representations has been …
TorchAudio: Building blocks for audio and speech processing
This document describes version 0.10 of TorchAudio: building blocks for machine learning
applications in the audio and speech processing domain. The objective of TorchAudio is to …
A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding
Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary
progress in Automatic Speech Recognition (ASR). However, they have not been totally …
Wespeaker: A research and production oriented speaker embedding learning toolkit
Speaker modeling is essential for many related tasks, such as speaker recognition and
speaker diarization. The dominant modeling approach is fixed-dimensional vector …
An efficient encoder-decoder architecture with top-down attention for speech separation
Deep neural networks have shown excellent prospects in speech separation tasks.
However, obtaining good results while keeping a low model complexity remains challenging …
A first look into the carbon footprint of federated learning
Despite impressive results, deep learning-based technologies also raise severe privacy and
environmental concerns induced by the training procedure often conducted in data centers …
AdVerb: Visually guided audio dereverberation
We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues
in addition to the reverberant sound to estimate clean audio. Although audio-only …
Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation
in human-machine interaction applications. While the existing methods enable generating …
KENKU: Towards Efficient and Stealthy Black-box Adversarial Attacks against ASR Systems
Prior researchers show that existing automatic speech recognition (ASR) systems are
vulnerable to adversarial examples. Most existing adversarial attacks against ASR systems …
PaddleSpeech: An easy-to-use all-in-one speech toolkit
PaddleSpeech is an open-source all-in-one speech toolkit. It aims at facilitating the
development and research of speech processing technologies by providing an easy-to-use …