Multimodal intelligence: Representation learning, information fusion, and applications
Deep learning methods haverevolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …
natural language processing since 2010. Each of these tasks involves a single modality in …
Deep audio-visual learning: A survey
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …
modalities, has drawn considerable attention since deep learning started to be used …
Visual speech recognition for multiple languages in the wild
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …
movements, without relying on the audio stream. Advances in deep learning and the …
End-to-end audio-visual speech recognition with conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …
[HTML][HTML] Multibench: Multiscale benchmarks for multimodal representation learning
Learning multimodal representations involves integrating information from multiple
heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world …
heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world …
Lipreading using temporal convolutional networks
Lip-reading has attracted a lot of research attention lately thanks to advances in deep
learning. The current state-of-the-art model for recognition of isolated words in-the-wild …
learning. The current state-of-the-art model for recognition of isolated words in-the-wild …
Audio-visual speech and gesture recognition by sensors of mobile devices
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …
speech recognition, particularly when audio is corrupted by noise. Additional visual …
End-to-end audiovisual speech recognition
Several end-to-end deep learning approaches have been recently presented which extract
either audio or visual features from the input images or audio signals and perform speech …
either audio or visual features from the input images or audio signals and perform speech …
LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild
Large-scale datasets have successively proven their fundamental importance in several
research fields, especially for early progress in some emerging topics. In this paper, we …
research fields, especially for early progress in some emerging topics. In this paper, we …
Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos
In this work we present a new approach to the field of weakly supervised learning in the
video domain. Our method is relevant to sequence learning problems which can be split up …
video domain. Our method is relevant to sequence learning problems which can be split up …