Everything at once - multi-modal fusion transformer for video retrieval

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 175 · Related articles · All 7 versions

[PDF] frontiersin.org

How does artificial intelligence empower EFL teaching and learning nowadays? A review on artificial intelligence in the EFL context

R Jiang - Frontiers in Psychology, 2022 - frontiersin.org

The booming Artificial Intelligence (AI) provides fertile ground for AI in education. So far, few
reviews have been deployed to explore how AI empowers English as Foreign Language …

Cited by 88 · Related articles · All 5 versions

[PDF] ieee.org

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 511 · Related articles · All 9 versions

[PDF] thecvf.com

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Ego-Exo4D, a diverse large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

Cited by 79 · Related articles · All 5 versions

[PDF] thecvf.com

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Cited by 60 · Related articles · All 6 versions

[PDF] neurips.cc

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc

Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

Cited by 17 · Related articles · All 6 versions

[PDF] arxiv.org

Learning audio-video modalities from image captions

A Nagrani, PH Seo, B Seybold, A Hauth… - … on Computer Vision, 2022 - Springer

There has been a recent explosion of large-scale image-text datasets, as images with alt-
text captions can be easily obtained online. Obtaining large-scale, high quality data for video …

Cited by 85 · Related articles · All 8 versions

[PDF] arxiv.org

A clip-hitchhiker's guide to long video retrieval

M Bain, A Nagrani, G Varol, A Zisserman - arXiv preprint arXiv:2205.08508, 2022 - arxiv.org

Our goal in this paper is the adaptation of image-text models for long video retrieval. Recent
works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP …

Cited by 66 · Related articles · All 2 versions

Artificial intelligence foundation and pre-trained models: Fundamentals, applications, opportunities, and social impacts

A Kolides, A Nawaz, A Rathor, D Beeman… - … Modelling Practice and …, 2023 - Elsevier

With the emergence of foundation models (FMs) that are trained on large amounts of data at
scale and adaptable to a wide range of downstream applications, AI is experiencing a …

Cited by 37 · Related articles

[PDF] arxiv.org

Temporal action segmentation: An analysis of modern techniques

G Ding, F Sener, A Yao - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …

Cited by 55 · Related articles · All 8 versions