SSVMR: Saliency-based self-training for video-music retrieval

X Cheng, Z Zhu, H Li, Y Li, Y Zou - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
With the rise of short videos, the demand for selecting appropriate background music (BGM)
for a video has increased significantly, video-music retrieval (VMR) task gradually draws …

TrimTail: Low-latency streaming ASR with simple but effective spectrogram-level length penalty

X Song, D Wu, Z Wu, B Zhang, Y Zhang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
In this paper, we present TrimTail, a simple but effective emission regularization method to
improve the latency of streaming ASR models. The core idea of TrimTail is to apply length …

Peak-first CTC: reducing the peak latency of CTC models by applying peak-first regularization

Z Tian, H Xiang, M Li, F Lin, K Ding… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
The CTC model has been widely applied to many application scenarios because of its
simple structure, excellent performance, and fast inference speed. There are many peaks in …

Delay-penalized CTC implemented based on Finite State Transducer

Z Yao, W Kang, F Kuang, L Guo, X Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Connectionist Temporal Classification (CTC) suffers from the latency problem when applied
to streaming models. We argue that in CTC lattice, the alignments that can access more …

Less Peaky and More Accurate CTC Forced Alignment by Label Priors

R Huang, X Zhang, Z Ni, L Sun, M Hira… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Connectionist temporal classification (CTC) models are known to have peaky output
distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it …

Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models

A Haider, X Na, E McDermott, T Ng, Z Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces a novel training framework called Focused Discriminative Training
(FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech …

Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

J Kang, L Meng, M Cui, Y Wang, X Wu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and
transcribing overlapping speech. To address these challenges, this paper investigates the …

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

J Tian, J Yu, H Chen, B Yan, C Weng, D Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Automatic speech recognition (ASR) based on transducers is widely used. In training, a
transducer maximizes the summed posteriors of all paths. The path with the highest …

Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation

E Kim, H Kim, K Lee - arXiv preprint arXiv:2406.07909, 2024 - arxiv.org
Transformer encoder with connectionist temporal classification (CTC) framework is widely
used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR …

Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

A Stooke, R Prabhavalkar, KC Sim… - The Thirty-eighth Annual … - openreview.net
Modern systems for automatic speech recognition, including the RNN-Transducer and
Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required …