Layer-wise analysis of a self-supervised speech representation model

A Pasad, JC Chou, K Livescu - 2021 IEEE Automatic Speech …, 2021 - ieeexplore.ieee.org
Recently proposed self-supervised learning approaches have been successful for
pre-training speech representation models. The utility of these learned representations has been …

TorchAudio: Building blocks for audio and speech processing

YY Yang, M Hira, Z Ni, A Astafurov… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
This document describes version 0.10 of TorchAudio: building blocks for machine learning
applications in the audio and speech processing domain. The objective of TorchAudio is to …

A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding

Y Wang, A Boumadane, A Heba - arXiv preprint arXiv:2111.02735, 2021 - arxiv.org
Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary
progress in Automatic Speech Recognition (ASR). However, they have not been totally …

Wespeaker: A research and production oriented speaker embedding learning toolkit

H Wang, C Liang, S Wang, Z Chen… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Speaker modeling is essential for many related tasks, such as speaker recognition and
speaker diarization. The dominant modeling approach is fixed-dimensional vector …

An efficient encoder-decoder architecture with top-down attention for speech separation

K Li, R Yang, X Hu - arXiv preprint arXiv:2209.15200, 2022 - arxiv.org
Deep neural networks have shown excellent prospects in speech separation tasks.
However, obtaining good results while keeping a low model complexity remains challenging …

A first look into the carbon footprint of federated learning

X Qiu, T Parcollet, J Fernandez-Marques… - Journal of Machine …, 2023 - jmlr.org
Despite impressive results, deep learning-based technologies also raise severe privacy and
environmental concerns induced by the training procedure often conducted in data centers …

AdVerb: Visually guided audio dereverberation

S Chowdhury, S Ghosh, S Dasgupta… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues
in addition to the reverberant sound to estimate clean audio. Although audio-only …

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

X Qi, J Pan, P Li, R Yuan, X Chi, M Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation
in human-machine interaction applications. While the existing methods enable generating …

KENKU: Towards Efficient and Stealthy Black-box Adversarial Attacks against ASR Systems

X Wu, S Ma, C Shen, C Lin, Q Wang, Q Li… - 32nd USENIX Security …, 2023 - usenix.org
Prior researchers show that existing automatic speech recognition (ASR) systems are
vulnerable to adversarial examples. Most existing adversarial attacks against ASR systems …

PaddleSpeech: An easy-to-use all-in-one speech toolkit

H Zhang, T Yuan, J Chen, X Li, R Zheng… - arXiv preprint arXiv …, 2022 - arxiv.org
PaddleSpeech is an open-source all-in-one speech toolkit. It aims at facilitating the
development and research of speech processing technologies by providing an easy-to-use …