Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

How does artificial intelligence empower EFL teaching and learning nowadays? A review on artificial intelligence in the EFL context

R Jiang - Frontiers in Psychology, 2022 - frontiersin.org
The booming Artificial Intelligence (AI) provides fertile ground for AI in education. So far, few
reviews have been deployed to explore how AI empowers English as Foreign Language …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract We present Ego-Exo4D a diverse large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, eg, CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

Learning audio-video modalities from image captions

A Nagrani, PH Seo, B Seybold, A Hauth… - … on Computer Vision, 2022 - Springer
There has been a recent explosion of large-scale image-text datasets, as images with alt-
text captions can be easily obtained online. Obtaining large-scale, high quality data for video …

A clip-hitchhiker's guide to long video retrieval

M Bain, A Nagrani, G Varol, A Zisserman - arXiv preprint arXiv:2205.08508, 2022 - arxiv.org
Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent
works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP …

Artificial intelligence foundation and pre-trained models: Fundamentals, applications, opportunities, and social impacts

A Kolides, A Nawaz, A Rathor, D Beeman… - … Modelling Practice and …, 2023 - Elsevier
With the emergence of foundation models (FMs) that are trained on large amounts of data at
scale and adaptable to a wide range of downstream applications, AI is experiencing a …

Temporal action segmentation: An analysis of modern techniques

G Ding, F Sener, A Yao - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …