UniAudio: An audio foundation model toward universal audio generation
Large language models (LLMs) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
Recent Advances in Speech Language Models: A Survey
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …
Moshi: a speech-text foundation model for real-time dialogue
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
Decoder-only architecture for streaming end-to-end speech recognition
Decoder-only language models (LMs) have been successfully adopted for speech-
processing tasks including automatic speech recognition (ASR). The LMs have ample …
SPIRIT-LM: Interleaved spoken and written language model
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and
speech. Our model is based on a pretrained text language model that we extend to the …
dMel: Speech tokenization made simple
Large language models have revolutionized natural language processing by leveraging self-
supervised pretraining on vast textual data. Inspired by this success, researchers have …
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
Speech discrete representation has proven effective in various downstream applications
due to its superior compression rate of the waveform, fast convergence during training, and …
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Representing speech and audio signals in discrete units has become a compelling
alternative to traditional high-dimensional feature vectors. Numerous studies have …
Multilingual visual speech recognition with a single model by learning with discrete visual speech units
This paper explores sentence-level Multilingual Visual Speech Recognition with a single
model for the first time. As the massive multilingual modeling of visual data requires huge …