Listen, think, and understand
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …
HyPoradise: An open baseline for generative speech recognition with large language models
Advancements in deep neural networks have allowed automatic speech recognition (ASR)
systems to attain human parity on several publicly available clean speech datasets …
Towards audio language modeling - an overview
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …
Can ChatGPT detect intent? Evaluating large language models for spoken language understanding
Recently, large pretrained language models have demonstrated strong language
understanding capabilities. This is particularly reflected in their zero-shot and in-context …
Whispering LLaMA: A cross-modal generative error correction framework for speech recognition
We introduce a new cross-modal fusion technique designed for generative error correction
in automatic speech recognition (ASR). Our methodology leverages both acoustic …
SpeechGen: Unlocking the generative power of speech language models with prompts
Large language models (LLMs) have gained considerable attention for Artificial Intelligence
Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct …
Joint audio and speech understanding
Humans are surrounded by audio signals that include both speech and non-speech sounds.
The recognition and understanding of speech and non-speech audio events, along with a …
Integrating pre-trained speech and language models for end-to-end speech recognition
Advances in machine learning have made it possible to perform various text and speech
processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) …
Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech
Text language models have shown remarkable zero-shot capability in generalizing to
unseen tasks when provided with well-formulated instructions. However, existing studies in …
UniverSLU: Universal spoken language understanding for diverse classification and sequence generation tasks with a single network
Recent studies have demonstrated promising outcomes by employing large language
models with multi-tasking capabilities. They utilize prompts to guide the model's behavior …