Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

MK Baskar, A Rosenberg, B Ramabhadran… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we focus on addressing the constraints faced when applying LLMs to ASR.
Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs …

Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization

Y Ma, C Zhang, Q Chen, W Wang… - IEEE Signal Processing …, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) have proven successful across a spectrum of speech-
related tasks, such as speech recognition, text-to-speech, and spoken language …

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

G Yang, Z Ma, F Yu, Z Gao, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
As more and more information-rich data like video become available, utilizing multi-modal
auxiliary information to enhance audio tasks has sparked widespread research interest. The …

Pronunciation Assessment with Multi-modal Large Language Models

K Fu, L Peng, N Yang, S Zhou - arXiv preprint arXiv:2407.09209, 2024 - arxiv.org
Large language models (LLMs), renowned for their powerful conversational abilities, are
widely recognized as exceptional tools in the field of education, particularly in the context of …

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

X Geng, T Xu, K Wei, B Mu, H Xue, H Wang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models have demonstrated unparalleled effectiveness in various NLP
tasks, and integrating LLMs with automatic speech recognition is becoming a mainstream …

MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

J Xu, Z Liang, Y Liu, Y Hu, J Li, Y Zheng, M Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present MooER, an LLM-based large-scale automatic speech recognition
(ASR)/automatic speech translation (AST) model of Moore Threads. A 5000h pseudo …

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Z Wang, Y Chen, X Wang, L Xie, Y Wang - arXiv preprint arXiv:2408.02178, 2024 - arxiv.org
StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the
streaming domain. It uses a streamable language model (LM) with a context-aware …