UniAudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language models (LLM) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

Recent Advances in Speech Language Models: A Survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Moshi: a speech-text foundation model for real-time dialogue

A Défossez, L Mazaré, M Orsini, A Royer… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e. …

Decoder-only architecture for streaming end-to-end speech recognition

E Tsunoo, H Futami, Y Kashiwagi, S Arora… - arXiv preprint arXiv …, 2024 - arxiv.org
Decoder-only language models (LMs) have been successfully adopted for speech-
processing tasks including automatic speech recognition (ASR). The LMs have ample …

SpiRit-LM: Interleaved spoken and written language model

TA Nguyen, B Muller, B Yu, MR Costa-Jussa… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and
speech. Our model is based on a pretrained text language model that we extend to the …

dMel: Speech tokenization made simple

H Bai, T Likhomanenko, R Zhang, Z Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models have revolutionized natural language processing by leveraging self-
supervised pretraining on vast textual data. Inspired by this success, researchers have …

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

J Shi, X Ma, H Inaguma, A Sun, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
Speech discrete representation has proven effective in various downstream applications
due to its superior compression rate of the waveform, fast convergence during training, and …

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

X Chang, J Shi, J Tian, Y Wu, Y Tang, Y Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Representing speech and audio signals in discrete units has become a compelling
alternative to traditional high-dimensional feature vectors. Numerous studies have …

Multilingual visual speech recognition with a single model by learning with discrete visual speech units

M Kim, JH Yeo, J Choi, SJ Park, YM Ro - arXiv preprint arXiv:2401.09802, 2024 - arxiv.org
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …