ASR is all you need: Cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recognition without requiring
human annotated ground truth data. We achieve this by distilling from an Automatic Speech …
Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach
Text recognition methods are developing rapidly. Some advanced techniques, e.g.,
powerful modules, language models, and un- and semi-supervised learning schemes …
Distilling the Knowledge of BERT for CTC-based ASR
Connectionist temporal classification (CTC)-based models are attractive because of their
fast inference in automatic speech recognition (ASR). Language model (LM) integration …
Distilling attention weights for CTC-based ASR systems
We present a novel training approach for connectionist temporal classification (CTC)-based
automatic speech recognition (ASR) systems. CTC models are promising for building both a …
Swing distillation: A privacy-preserving knowledge distillation framework
Knowledge distillation (KD) has been widely used for model compression and knowledge
transfer. Typically, a big teacher model trained on sufficient data transfers knowledge to a …
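The teacher-to-student transfer described in this snippet is commonly implemented as a temperature-softened cross-distribution loss (Hinton-style distillation). Below is a minimal, framework-free sketch of that loss; the function names and the choice of temperature `T = 2.0` are illustrative assumptions, not details from the paper above.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T produces softer distributions,
    # exposing more of the teacher's "dark knowledge" over non-target classes.
    m = max(x / T for x in logits)
    exps = [math.exp(x / T - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradients keep a comparable magnitude across T.
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student's softened predictions
    return (T * T) * sum(pi * math.log(pi / qi)
                         for pi, qi in zip(p, q) if pi > 0)
```

In practice this term is mixed with the ordinary supervised loss (here, CTC for the ASR papers above); when student and teacher logits agree, the loss is zero.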
Improving knowledge distillation of CTC-trained acoustic models with alignment-consistent ensemble and target delay
Knowledge distillation (KD) has been widely used to improve the performance of a simpler
student model by imitating the outputs or intermediate representations of a more complex …
Audio-visual deep learning
T Afouras - 2021 - ora.ox.ac.uk
Human perception and learning are inherently multimodal: we interface with the world
through multiple sensory streams, including vision, audition, touch, olfaction and taste. By …