Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

An overview on data representation learning: From traditional feature learning to recent deep learning

G Zhong, LN Wang, X Ling, J Dong - The Journal of Finance and Data …, 2016 - Elsevier
Since about 100 years ago, to learn the intrinsic structure of data, many representation
learning approaches have been proposed, either linear or nonlinear, either supervised or …

Deep audio-visual speech recognition

T Afouras, JS Chung, A Senior… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Lipreading using temporal convolutional networks

B Martinez, P Ma, S Petridis… - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
Lip-reading has attracted a lot of research attention lately thanks to advances in deep
learning. The current state-of-the-art model for recognition of isolated words in-the-wild …

Lip reading in the wild

JS Chung, A Zisserman - Computer Vision–ACCV 2016: 13th Asian …, 2017 - Springer
Our aim is to recognise the words being spoken by a talking face, given only the video but
not the audio. Existing works in this area have focussed on trying to recognise a small …

Lip reading sentences in the wild

J Son Chung, A Senior, O Vinyals… - Proceedings of the …, 2017 - openaccess.thecvf.com
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …

Phased LSTM: Accelerating recurrent network training for long or event-based sequences

D Neil, M Pfeiffer, SC Liu - Advances in neural information …, 2016 - proceedings.neurips.cc
Recurrent Neural Networks (RNNs) have become the state-of-the-art choice for
extracting patterns from temporal sequences. Current RNN models are ill suited to process …

LipNet: End-to-end sentence-level lipreading

YM Assael, B Shillingford, S Whiteson… - arXiv preprint arXiv …, 2016 - arxiv.org
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional
approaches separated the problem into two stages: designing or learning visual features …

Audio-visual speech and gesture recognition by sensors of mobile devices

D Ryumin, D Ivanko, E Ryumina - Sensors, 2023 - mdpi.com
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …

End-to-end audiovisual speech recognition

S Petridis, T Stafylakis, P Ma, F Cai… - … on acoustics, speech …, 2018 - ieeexplore.ieee.org
Several end-to-end deep learning approaches have been recently presented which extract
either audio or visual features from the input images or audio signals and perform speech …