Audio-visual synchronisation in the wild

C Feng, Z Chen, A Owens - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com

Manipulated videos often contain subtle inconsistencies between their visual and audio
signals. We propose a video forensics method, based on anomaly detection, that can …

Cited by 57 Related articles All 6 versions

[PDF] thecvf.com

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com

Learning to classify video data from classes not included in the training data, i.e. video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Cited by 59 Related articles All 8 versions

[PDF] arxiv.org

Audio-synchronized visual animation

L Zhang, S Mo, Y Zhang, P Morgado - European Conference on Computer …, 2025 - Springer

Current visual generation methods can produce high-quality videos guided by text prompts.
However, effectively controlling object dynamics remains a challenge. This work explores …

Cited by 10 Related articles All 2 versions

[PDF] arxiv.org

Audio-visual segmentation with semantics

J Zhou, X Shen, J Wang, J Zhang, W Sun… - International Journal of …, 2024 - Springer

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to
output a pixel-level map of the object(s) that produce sound at the time of the image frame …

Cited by 27 Related articles All 2 versions

[PDF] arxiv.org

Masked generative video-to-audio transformers with enhanced synchronicity

S Pascual, C Yeh, I Tsiamas, J Serrà - European Conference on Computer …, 2025 - Springer

Video-to-audio (V2A) generation leverages visual-only video features to render
plausible sounds that match the scene. Importantly, the generated sound onsets should …

Cited by 5 Related articles All 5 versions

[PDF] arxiv.org

Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

Y Zhang, Y Gu, Y Zeng, Z Xing, Y Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org

We study Neural Foley, the automatic generation of high-quality sound effects synchronizing
with videos, enabling an immersive audio-visual experience. Despite its wide range of …

Cited by 17 Related articles All 4 versions

[PDF] thecvf.com

Reading to listen at the cocktail party: Multi-modal speech separation

A Rahimi, T Afouras… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

The goal of this paper is speech separation and enhancement in multi-speaker and noisy
environments using a combination of different modalities. Previous works have shown good …

Cited by 28 Related articles All 8 versions

[PDF] arxiv.org

Self-supervised audio-visual soundscape stylization

T Li, R Wang, PY Huang, A Owens… - … on Computer Vision, 2025 - Springer

Speech sounds convey a great deal of information about the scenes, resulting in a variety of
effects ranging from reverberation to additional ambient sounds. In this paper, we …

Cited by 2 Related articles All 8 versions

[PDF] arxiv.org

Vocalist: An audio-visual synchronisation model for lips and voices

VS Kadandale, JF Montesinos, G Haro - arXiv preprint arXiv:2204.02090, 2022 - arxiv.org

In this paper, we address the problem of lip-voice synchronisation in videos containing
human face and voice. Our approach is based on determining if the lips motion and the …

Cited by 28 Related articles All 10 versions

[PDF] arxiv.org

Sparse in space and time: Audio-visual synchronisation with trainable selectors

V Iashin, W Xie, E Rahtu, A Zisserman - arXiv preprint arXiv:2210.07055, 2022 - arxiv.org

The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For
such videos, the events that may be harnessed for synchronisation cues may be spatially …

Cited by 22 Related articles All 10 versions