Recent advances in deep learning based dialogue systems: A systematic survey
Dialogue systems are a popular natural language processing (NLP) task as it is promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …
A metaverse: Taxonomy, components, applications, and open challenges
SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …
FLAVA: A foundational language and vision alignment model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …
Multiscale vision transformers
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
Is space-time attention all you need for video understanding?
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11, and 14. During training, we first resize the shorter side of the …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
OneLLM: One framework to align all modalities with language
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However existing works rely heavily on modality …
LAVIS: A library for language-vision intelligence
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …
Interventional video grounding with dual contrastive learning
Video grounding aims to localize a moment from an untrimmed video for a given textual
query. Existing approaches focus more on the alignment of visual and language stimuli with …
Span-based localizing network for natural language video localization
Given an untrimmed video and a text query, natural language video localization (NLVL) is to
locate a matching span from the video that semantically corresponds to the query. Existing …