Multimodal image synthesis and editing: A survey and taxonomy

F Zhan, Y Yu, R Wu, J Zhang, S Lu, L Liu… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
As information exists in various modalities in the real world, effective interaction and fusion among multimodal information play a key role in the creation and perception of multimodal …

An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …

A large-scale study on unsupervised spatiotemporal representation learning

C Feichtenhofer, H Fan, B Xiong… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …

Contrastive multiview coding

Y Tian, D Krishnan, P Isola - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right …

Vggsound: A large-scale audio-visual dataset

H Chen, W Xie, A Vedaldi… - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos 'in the wild' using computer vision techniques. The resulting dataset can be used for training …

Videobert: A joint model for video and language representation learning

C Sun, A Myers, C Vondrick… - Proceedings of the …, 2019 - openaccess.thecvf.com
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …

Space-time correspondence as a contrastive random walk

A Jabri, A Owens, A Efros - Advances in neural information …, 2020 - proceedings.neurips.cc
This paper proposes a simple self-supervised approach for learning a representation for
visual correspondence from raw video. We cast correspondence as prediction of links in a …

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Machine learning in acoustics: Theory and applications

MJ Bianco, P Gerstoft, J Traer, E Ozanich… - The Journal of the …, 2019 - pubs.aip.org
Acoustic data provide scientific and engineering insights in fields ranging from biology and
communications to ocean and Earth science. We survey the recent advances and …

Audio-visual scene analysis with self-supervised multisensory features

A Owens, AA Efros - Proceedings of the European …, 2018 - openaccess.thecvf.com
The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …