MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification

Y Zhang, Z Lv, H Wu, S Zhang, P Hu, Z Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we present Multi-scale Feature Aggregation Conformer (MFA-Conformer), an
easy-to-implement, simple but effective backbone for automatic speaker verification based …

Overview of speaker modeling and its applications: From the lens of deep speaker representation learning

S Wang, Z Chen, KA Lee, Y Qian… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Speaker individuality information is among the most critical elements within speech signals.
By thoroughly and accurately modeling this information, it can be utilized in various …

The Vicomtech audio deepfake detection system based on wav2vec2 for the 2022 ADD challenge

JM Martín-Doñas, A Álvarez - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
This paper describes our submitted systems to the 2022 ADD challenge within the tracks 1
and 2. Our approach is based on the combination of a pre-trained wav2vec2 feature …

FreeVC: Towards high-quality text-free one-shot voice conversion

J Li, W Tu, L Xiao - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
Voice conversion (VC) can be achieved by first extracting source content information and
target speaker information, and then reconstructing waveform with this information …

Computational language modeling and the promise of in silico experimentation

S Jain, VA Vo, L Wehbe, AG Huth - Neurobiology of Language, 2024 - direct.mit.edu
Language neuroscience currently relies on two major experimental paradigms:
controlled experiments using carefully hand-designed stimuli, and natural stimulus …

ZMM-TTS: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Self-supervised learning with cluster-aware-DINO for high-performance robust speaker verification

B Han, Z Chen, Y Qian - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
The automatic speaker verification task has achieved great success using deep learning
approaches with a large-scale, manually annotated dataset. However, collecting a …

Why does self-supervised learning for speech recognition benefit speaker recognition?

S Chen, Y Wu, C Wang, S Liu, Z Chen, P Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker
recognition, even if the pre-training objective is designed for speech recognition. In this …

Leveraging ASR pretrained Conformers for speaker verification through transfer learning and knowledge distillation

D Cai, M Li - IEEE/ACM Transactions on Audio, Speech, and …, 2024 - ieeexplore.ieee.org
This paper focuses on the application of Conformers in speaker verification. Conformers,
initially designed for Automatic Speech Recognition (ASR), excel at modeling both local and …