SLM: Bridge the thin gap between speech and text foundation models

M Wang, W Han, I Shafran, Z Wu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-
modal model that takes advantage of pretrained foundational speech and language models …

Improving contextual biasing with text injection

TN Sainath, R Prabhavalkar, D Caseiro… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
In this work, we present a model-based approach to improving contextual biasing that
improves quality without drastically increasing model computation during inference …

Dual-mode NAM: Effective top-k context injection for end-to-end ASR

Z Wu, T Munkhdalai, P Rondon, G Pundak… - Proc …, 2023 - isca-archive.org
ASR systems in real applications must be adapted on the fly to correctly recognize task-
specific contextual terms, such as contacts, application names and media entities. However …

Contextual Spelling Correction with Large Language Models

G Song, Z Wu, G Pundak, A Chandorkar… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Contextual Spelling Correction (CSC) models are used to improve automatic speech
recognition (ASR) quality given user-specific context. Typically, context is modeled as a large …

Exploration of Adapter for Noise Robust Automatic Speech Recognition

H Shi, T Kawahara - arXiv preprint arXiv:2402.18275, 2024 - arxiv.org
Adapting a robust automatic speech recognition (ASR) system to tackle unseen noise
scenarios is crucial. Integrating adapters into neural networks has emerged as a potent …

SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings

A Antonova, E Bakhturina, B Ginsburg - arXiv preprint arXiv:2306.02317, 2023 - arxiv.org
Contextual spelling correction models are an alternative to shallow fusion to improve
automatic speech recognition (ASR) quality given user vocabulary. To deal with large user …

Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Z Wu, G Song, C Li, P Rondon, Z Meng, X Velez… - arXiv preprint arXiv …, 2024 - arxiv.org
Contextual biasing enables speech recognizers to transcribe important phrases in the
speaker's context, such as contact names, even if they are rare in, or absent from, the …

Contextual biasing with the Knuth-Morris-Pratt matching algorithm

W Wang, Z Wu, D Caseiro, T Munkhdalai… - arXiv preprint arXiv …, 2023 - arxiv.org
Contextual biasing refers to the problem of biasing the automatic speech recognition (ASR)
systems towards rare entities that are relevant to the specific user or application scenarios …

Speech-enriched Memory for Inference-time Adaptation of ASR Models to Word Dictionaries

A Mittal, S Sarawagi, P Jyothi, G Saon… - Proceedings of the …, 2023 - aclanthology.org
Despite the impressive performance of ASR models on mainstream benchmarks, their
performance on rare words is unsatisfactory. In enterprise settings, often a focused list of …

Optimizing Large-Scale Context Retrieval for End-to-End ASR

Z Huang, D Caseiro, K Joshi, C Li, P Rondon… - Proc. Interspeech …, 2024 - isca-archive.org
Contextual Automatic Speech Recognition (ASR) requires scalable and accurate
retrieval of content relevant to the user's context. This paper presents a comparative study of …