Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Deep learning and deep learning models used in image analysis [Derin öğrenme ve görüntü analizinde kullanılan derin öğrenme modelleri]

Ö İnik, E Ülker - Gaziosmanpaşa Bilimsel Araştırma Dergisi, 2017 - dergipark.org.tr
To define a model or build a machine learning system with classical machine learning
techniques, the feature vector must first be extracted. The feature vector …

End-to-end audio-visual speech recognition with conformers

P Ma, S Petridis, M Pantic - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …

Deep audio-visual speech recognition

T Afouras, JS Chung, A Senior… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Neural sign language translation

NC Camgoz, S Hadfield, O Koller… - Proceedings of the …, 2018 - openaccess.thecvf.com
Sign Language Recognition (SLR) has been an active research field for the last two
decades. However, most research to date has considered SLR as a naive gesture …

Talking face generation by adversarially disentangled audio-visual representation

H Zhou, Y Liu, Z Liu, P Luo, X Wang - … of the AAAI conference on artificial …, 2019 - aaai.org
Talking face generation aims to synthesize a sequence of face images that correspond to a
clip of speech. This is a challenging task because face appearance variation and semantics …

LRS3-TED: a large-scale dataset for visual speech recognition

T Afouras, JS Chung, A Zisserman - arXiv preprint arXiv:1809.00496, 2018 - arxiv.org
This paper introduces a new multi-modal dataset for visual and audio-visual speech
recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with …

Lip reading sentences in the wild

J Son Chung, A Senior, O Vinyals… - Proceedings of the …, 2017 - openaccess.thecvf.com
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Audio-visual event localization in unconstrained videos

Y Tian, J Shi, B Li, Z Duan, C Xu - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …

Massively parallel amplitude-only Fourier neural network

M Miscuglio, Z Hu, S Li, JK George, R Capanna… - Optica, 2020 - opg.optica.org
Machine intelligence has become a driving factor in modern society. However, its demand
outpaces the underlying electronic technology due to limitations given by fundamental …