Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

T Park, I Medennikov, K Dhawan, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose Sortformer, a novel neural model for speaker diarization, trained with
unconventional objectives compared to existing end-to-end diarization models. The …

NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

H Huang, T Park, K Dhawan, I Medennikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning has been shown to benefit a wide range of speech processing
tasks, such as speech recognition/translation, speaker verification and diarization, etc …

Chain-of-Thought Prompting for Speech Translation

K Hu, Z Chen, CHH Yang, P Żelasko… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable advancements in language
understanding and generation. Building on the success of text-based LLMs, recent research …

EMMeTT: Efficient Multimodal Machine Translation Training

P Żelasko, Z Chen, M Wang, D Galvez… - arXiv preprint arXiv …, 2024 - arxiv.org
Rising interest in extending foundation language models to new modalities warrants
discussion of the most effective and efficient multimodal training approach. This work …

ASR Benchmarking: Need for a More Representative Conversational Dataset

G Maheshwari, D Ivanov, T Johannet… - arXiv preprint arXiv …, 2024 - arxiv.org
Automatic Speech Recognition (ASR) systems have achieved remarkable performance on
widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do …