Machine translation from signed to spoken languages: State of the art and challenges
Automatic translation from signed to spoken languages is an interdisciplinary research
domain at the intersection of computer vision, machine translation (MT), and linguistics …
Gloss attention for gloss-free sign language translation
Most sign language translation (SLT) methods to date require the use of gloss annotations to
provide additional supervision information; however, gloss annotations are not easy to acquire. To …
Exploring group video captioning with efficient relational approximation
Current video captioning efforts mostly focus on describing a single video, while the need for
captioning videos in groups has increased considerably. In this study, we propose a new …
From rule-based models to deep learning transformers architectures for natural language processing and sign language translation systems: survey, taxonomy and …
With the growing Deaf and Hard of Hearing population worldwide and the persistent
shortage of certified sign language interpreters, there is a pressing need for an efficient …
Multi-granularity relational attention network for audio-visual question answering
Recent methods for video question answering (VideoQA), aiming to generate answers
based on given questions and video content, have made significant progress in cross-modal …
Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts
In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has
predominantly concentrated on the training paradigms tailored for high-quality resources …
Contrastive token-wise meta-learning for unseen performer visual temporal-aligned translation
Visual temporal-aligned translation aims to transform a visual sequence into natural
language, with important applications such as lipreading and fingerspelling …
OpenSR: Open-modality speech recognition via maintaining multi-modality alignment
Speech recognition builds a bridge between multimedia streams (audio-only, visual-only,
or audio-visual) and the corresponding text transcription. However, when training the …
Rethinking Missing Modality Learning from a Decoding Perspective
The conventional pipeline of multimodal learning consists of three stages: encoding,
fusion, and decoding. Most existing methods under the missing-modality condition focus on the …
ASLRing: American Sign Language Recognition with Meta-Learning on Wearables
Sign language is widely used by over 500 million Deaf and hard of hearing (DHH)
individuals in their daily lives. While prior works have made notable efforts to show the feasibility …