Listen, think, and understand

Y Gong, H Luo, AH Liu, L Karlinsky, J Glass - arXiv preprint arXiv …, 2023 - arxiv.org
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …

HyPoradise: An open baseline for generative speech recognition with large language models

C Chen, Y Hu, CHH Yang… - Advances in …, 2024 - proceedings.neurips.cc
Advancements in deep neural networks have allowed automatic speech recognition (ASR)
systems to attain human parity on several publicly available clean speech datasets …

Towards audio language modeling - an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

Can ChatGPT detect intent? Evaluating large language models for spoken language understanding

M He, PN Garner - arXiv preprint arXiv:2305.13512, 2023 - arxiv.org
Recently, large pretrained language models have demonstrated strong language
understanding capabilities. This is particularly reflected in their zero-shot and in-context …

Whispering LLaMA: A cross-modal generative error correction framework for speech recognition

S Radhakrishnan, CHH Yang, SA Khan… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a new cross-modal fusion technique designed for generative error correction
in automatic speech recognition (ASR). Our methodology leverages both acoustic …

SpeechGen: Unlocking the generative power of speech language models with prompts

H Wu, KW Chang, YK Wu, H Lee - arXiv preprint arXiv:2306.02207, 2023 - arxiv.org
Large language models (LLMs) have gained considerable attention for Artificial Intelligence
Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct …

Joint audio and speech understanding

Y Gong, AH Liu, H Luo, L Karlinsky… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Humans are surrounded by audio signals that include both speech and non-speech sounds.
The recognition and understanding of speech and non-speech audio events, along with a …

Integrating pre-trained speech and language models for end-to-end speech recognition

Y Hono, K Mitsuda, T Zhao, K Mitsui… - Findings of the …, 2024 - aclanthology.org
Advances in machine learning have made it possible to perform various text and speech
processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) …

Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech

C Huang, KH Lu, SH Wang, CY Hsiao… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Text language models have shown remarkable zero-shot capability in generalizing to
unseen tasks when provided with well-formulated instructions. However, existing studies in …

UniverSLU: Universal spoken language understanding for diverse classification and sequence generation tasks with a single network

S Arora, H Futami, J Jung, Y Peng, R Sharma… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent studies have demonstrated promising outcomes by employing large language
models with multi-tasking capabilities. They utilize prompts to guide the model's behavior …