AKVSR: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip
movements. VSR is regarded as a challenging task because of the insufficient information …
Visually-aware audio captioning with adaptive audio-visual attention
Audio captioning aims to generate text descriptions of audio clips. In the real world, many
objects produce similar sounds. How to accurately recognize ambiguous sounds is a major …
Do VSR Models Generalize Beyond LRS3?
YAD Djilali, S Narayan, E LeBihan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of
intense research in visual speech recognition (VSR) during the last few years. As a result …
Data-Driven Advancements in Lip Motion Analysis: A Review
This work reviews the dataset-driven advancements that have occurred in the area of lip
motion analysis, particularly visual lip-reading and visual lip motion authentication, in the …
Public-private Attributes-based Variational Adversarial Network for Audio-Visual Cross-Modal Matching
A Zheng, F Yuan, H Zhang, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Existing audio-visual cross-modal matching methods focus on mitigating cross-modal
heterogeneity but ignore the impact of intra-class discrepancy of the same identity in …
Contrastive Learning from Synthetic Audio Doppelgangers
Learning robust audio representations currently demands extensive datasets of real-world
sound recordings. By applying artificial transformations to these recordings, models can …
BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition
Self-supervision has recently shown great promise for learning visual and auditory speech
representations from unlabelled data. In this work, we propose BRAVEn, an extension to the …
Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition
D Gimeno-Gómez, CD Martínez-Hinarejos - arXiv preprint arXiv …, 2024 - arxiv.org
Thanks to the rise of deep learning and the availability of large-scale audio-visual
databases, recent advances have been achieved in Visual Speech Recognition (VSR) …
Exploring the Impact of Synthetic Data for Aerial-view Human Detection
Aerial-view human detection has a large demand for large-scale data to capture more
diverse human appearances compared to ground-view human detection. Therefore …
AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies
JM Acosta-Triana, D Gimeno-Gómez… - arXiv preprint arXiv …, 2024 - arxiv.org
More than 7,000 known languages are spoken around the world. However, due to the lack
of annotated resources, only a small fraction of them are currently covered by speech …