Multimodal transformer networks for end-to-end video-grounded dialogue systems

J Ni, T Young, V Pandelea, F Xue… - Artificial intelligence review, 2023 - Springer

Dialogue systems are a popular natural language processing (NLP) task as it is promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …

Cited by 225 · Related articles · All 15 versions

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org

Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …

Cited by 1463 · Related articles · All 6 versions

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

Cited by 559 · Related articles · All 6 versions

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Cited by 1264 · Related articles · All 5 versions

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press

Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11, and 14. During training, we first resize the shorter side of the …

Cited by 1891 · Related articles · All 4 versions

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Cited by 30 · Related articles · All 8 versions

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However existing works rely heavily on modality …

Cited by 24 · Related articles · All 3 versions

LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org

We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Cited by 83 · Related articles · All 4 versions

Interventional video grounding with dual contrastive learning

G Nan, R Qiao, Y Xiao, J Liu, S Leng… - Proceedings of the …, 2021 - openaccess.thecvf.com

Video grounding aims to localize a moment from an untrimmed video for a given textual
query. Existing approaches focus more on the alignment of visual and language stimuli with …

Cited by 149 · Related articles · All 6 versions

Span-based localizing network for natural language video localization

H Zhang, A Sun, W Jing, JT Zhou - arXiv preprint arXiv:2004.13931, 2020 - arxiv.org

Given an untrimmed video and a text query, natural language video localization (NLVL) is to
locate a matching span from the video that semantically corresponds to the query. Existing …

Cited by 277 · Related articles · All 6 versions