- Academic resource search

Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer

Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Cited by: 189 (12 versions)

Derin öğrenme ve görüntü analizinde kullanılan derin öğrenme modelleri [Deep learning and deep learning models used in image analysis]

Ö İnik, E Ülker - Gaziosmanpaşa Bilimsel Araştırma Dergisi, 2017 - dergipark.org.tr

To define a model or build a machine learning system with classical machine learning
techniques, the feature vector must first be extracted. The feature vector's …

Cited by: 243

End-to-end audio-visual speech recognition with conformers

P Ma, S Petridis, M Pantic - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …

Cited by: 255 (4 versions)

Deep audio-visual speech recognition

T Afouras, JS Chung, A Senior… - IEEE transactions on …, 2018 - ieeexplore.ieee.org

The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Cited by: 942 (15 versions)

Neural sign language translation

NC Camgoz, S Hadfield, O Koller… - Proceedings of the …, 2018 - openaccess.thecvf.com

Sign Language Recognition (SLR) has been an active research field for the last two
decades. However, most research to date has considered SLR as a naive gesture …

Cited by: 750 (16 versions)

Talking face generation by adversarially disentangled audio-visual representation

H Zhou, Y Liu, Z Liu, P Luo, X Wang - … of the AAAI conference on artificial …, 2019 - aaai.org

Talking face generation aims to synthesize a sequence of face images that correspond to a
clip of speech. This is a challenging task because face appearance variation and semantics …

Cited by: 482 (10 versions)

LRS3-TED: a large-scale dataset for visual speech recognition

T Afouras, JS Chung, A Zisserman - arXiv preprint arXiv:1809.00496, 2018 - arxiv.org

This paper introduces a new multi-modal dataset for visual and audio-visual speech
recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with …

Cited by: 478 (3 versions)

Lip reading sentences in the wild

J Son Chung, A Senior, O Vinyals… - Proceedings of the …, 2017 - openaccess.thecvf.com

The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Cited by: 1011 (20 versions)

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com

In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Cited by: 525 (11 versions)

Massively parallel amplitude-only Fourier neural network

M Miscuglio, Z Hu, S Li, JK George, R Capanna… - Optica, 2020 - opg.optica.org

Machine intelligence has become a driving factor in modern society. However, its demand
outpaces the underlying electronic technology due to limitations given by fundamental …

Cited by: 200 (5 versions)