Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
How does artificial intelligence empower EFL teaching and learning nowadays? A review on artificial intelligence in the EFL context
R Jiang - Frontiers in Psychology, 2022 - frontiersin.org
The booming Artificial Intelligence (AI) provides fertile ground for AI in education. So far, few
reviews have been deployed to explore how AI empowers English as Foreign Language …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
We present Ego-Exo4D, a diverse large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …
Video-mined task graphs for keystep recognition in instructional videos
K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …
Learning audio-video modalities from image captions
There has been a recent explosion of large-scale image-text datasets, as images with alt-
text captions can be easily obtained online. Obtaining large-scale, high quality data for video …
A clip-hitchhiker's guide to long video retrieval
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent
works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP …
Artificial intelligence foundation and pre-trained models: Fundamentals, applications, opportunities, and social impacts
A Kolides, A Nawaz, A Rathor, D Beeman… - … Modelling Practice and …, 2023 - Elsevier
With the emergence of foundation models (FMs) that are trained on large amounts of data at
scale and adaptable to a wide range of downstream applications, AI is experiencing a …
Temporal action segmentation: An analysis of modern techniques
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …