Recent advances in deep learning based dialogue systems: A systematic survey

J Ni, T Young, V Pandelea, F Xue… - Artificial intelligence review, 2023 - Springer
Dialogue systems are a popular natural language processing (NLP) task as it is promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …

Flava: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

[PDF][PDF] Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11, and 14. During training, we first resize the shorter side of the …

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Onellm: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However existing works rely heavily on modality …

Lavis: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …

Interventional video grounding with dual contrastive learning

G Nan, R Qiao, Y Xiao, J Liu, S Leng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Video grounding aims to localize a moment from an untrimmed video for a given textual
query. Existing approaches focus more on the alignment of visual and language stimuli with …

Span-based localizing network for natural language video localization

H Zhang, A Sun, W Jing, JT Zhou - arXiv preprint arXiv:2004.13931, 2020 - arxiv.org
Given an untrimmed video and a text query, natural language video localization (NLVL) is to
locate a matching span from the video that semantically corresponds to the query. Existing …