Deep audio-visual learning: A survey
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …
Deep learning and deep learning models used in image analysis
To define a model or build a machine learning system with classical machine learning
techniques, the feature vector must first be extracted. The feature vector's …
End-to-end audio-visual speech recognition with conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …
Deep audio-visual speech recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …
Neural sign language translation
Sign Language Recognition (SLR) has been an active research field for the last two
decades. However, most research to date has considered SLR as a naive gesture …
Talking face generation by adversarially disentangled audio-visual representation
Talking face generation aims to synthesize a sequence of face images that correspond to a
clip of speech. This is a challenging task because face appearance variation and semantics …
LRS3-TED: a large-scale dataset for visual speech recognition
This paper introduces a new multi-modal dataset for visual and audio-visual speech
recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with …
Lip reading sentences in the wild
The goal of this work is to recognise phrases and sentences being spoken by a talking face,
with or without the audio. Unlike previous works that have focussed on recognising a limited …
Audio-visual event localization in unconstrained videos
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …
Massively parallel amplitude-only Fourier neural network
Machine intelligence has become a driving factor in modern society. However, its demand
outpaces the underlying electronic technology due to limitations given by fundamental …