SSVMR: Saliency-based self-training for video-music retrieval
With the rise of short videos, the demand for selecting appropriate background music (BGM)
for a video has increased significantly, video-music retrieval (VMR) task gradually draws …
TrimTail: Low-latency streaming ASR with simple but effective spectrogram-level length penalty
In this paper, we present TrimTail, a simple but effective emission regularization method to
improve the latency of streaming ASR models. The core idea of TrimTail is to apply length …
Peak-first CTC: reducing the peak latency of CTC models by applying peak-first regularization
The CTC model has been widely applied to many application scenarios because of its
simple structure, excellent performance, and fast inference speed. There are many peaks in …
Delay-penalized CTC implemented based on Finite State Transducer
Connectionist Temporal Classification (CTC) suffers from the latency problem when applied
to streaming models. We argue that in CTC lattice, the alignments that can access more …
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Connectionist temporal classification (CTC) models are known to have peaky output
distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it …
Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models
This paper introduces a novel training framework called Focused Discriminative Training
(FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech …
Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and
transcribing overlapping speech. To address these challenges, this paper investigates the …
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
Automatic speech recognition (ASR) based on transducers is widely used. In training, a
transducer maximizes the summed posteriors of all paths. The path with the highest …
Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation
Transformer encoder with connectionist temporal classification (CTC) framework is widely
used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR …
Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
A Stooke, R Prabhavalkar, KC Sim… - The Thirty-eighth Annual … - openreview.net
Modern systems for automatic speech recognition, including the RNN-Transducer and
Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required …