AKVSR: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip
movements. VSR is regarded as a challenging task because of the insufficient information …
Visually-aware audio captioning with adaptive audio-visual attention
Audio captioning aims to generate text descriptions of audio clips. In the real world, many
objects produce similar sounds. How to accurately recognize ambiguous sounds is a major …
Do VSR Models Generalize Beyond LRS3?
YAD Djilali, S Narayan, E LeBihan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of
intense research in visual speech recognition (VSR) during the last few years. As a result …
Data-Driven Advancements in Lip Motion Analysis: A Review
This work reviews the dataset-driven advancements that have occurred in the area of lip
motion analysis, particularly visual lip-reading and visual lip motion authentication, in the …
Public-private Attributes-based Variational Adversarial Network for Audio-Visual Cross-Modal Matching
A Zheng, F Yuan, H Zhang, J Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Existing audio-visual cross-modal matching methods focus on mitigating cross-modal
heterogeneity but ignore the impact of intra-class discrepancy of the same identity in …
Contrastive Learning from Synthetic Audio Doppelgangers
Learning robust audio representations currently demands extensive datasets of real-world
sound recordings. By applying artificial transformations to these recordings, models can …
BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition
Self-supervision has recently shown great promise for learning visual and auditory speech
representations from unlabelled data. In this work, we propose BRAVEn, an extension to the …
Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition
D Gimeno-Gómez, CD Martínez-Hinarejos - arXiv preprint arXiv …, 2024 - arxiv.org
Thanks to the rise of deep learning and the availability of large-scale audio-visual
databases, recent advances have been achieved in Visual Speech Recognition (VSR) …
Exploring the Impact of Synthetic Data for Aerial-view Human Detection
Aerial-view human detection has a large demand for large-scale data to capture more
diverse human appearances compared to ground-view human detection. Therefore …
AnnoTheia: A Semi-Automatic Annotation Toolkit for Audio-Visual Speech Technologies
JM Acosta-Triana, D Gimeno-Gómez… - arXiv preprint arXiv …, 2024 - arxiv.org
More than 7,000 known languages are spoken around the world. However, due to the lack
of annotated resources, only a small fraction of them are currently covered by speech …