Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Deep audio-visual learning: A survey

H Zhu, MD Luo, R Wang, AH Zheng, R He - International Journal of …, 2021 - Springer
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

End-to-end audio-visual speech recognition with conformers

P Ma, S Petridis, M Pantic - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …

Multibench: Multiscale benchmarks for multimodal representation learning

PP Liang, Y Lyu, X Fan, Z Wu, Y Cheng… - Advances in neural …, 2021 - ncbi.nlm.nih.gov
Learning multimodal representations involves integrating information from multiple
heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world …

Lipreading using temporal convolutional networks

B Martinez, P Ma, S Petridis… - ICASSP 2020-2020 IEEE …, 2020 - ieeexplore.ieee.org
Lip-reading has attracted a lot of research attention lately thanks to advances in deep
learning. The current state-of-the-art model for recognition of isolated words in-the-wild …

Audio-visual speech and gesture recognition by sensors of mobile devices

D Ryumin, D Ivanko, E Ryumina - Sensors, 2023 - mdpi.com
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …

End-to-end audiovisual speech recognition

S Petridis, T Stafylakis, P Ma, F Cai… - … on acoustics, speech …, 2018 - ieeexplore.ieee.org
Several end-to-end deep learning approaches have been recently presented which extract
either audio or visual features from the input images or audio signals and perform speech …

LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild

S Yang, Y Zhang, D Feng, M Yang… - 2019 14th IEEE …, 2019 - ieeexplore.ieee.org
Large-scale datasets have successively proven their fundamental importance in several
research fields, especially for early progress in some emerging topics. In this paper, we …

Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos

O Koller, NC Camgoz, H Ney… - IEEE transactions on …, 2019 - ieeexplore.ieee.org
In this work we present a new approach to the field of weakly supervised learning in the
video domain. Our method is relevant to sequence learning problems which can be split up …