Cross interaction network for natural language guided video moment retrieval
Proceedings of the 44th International ACM SIGIR Conference on Research and …, 2021•dl.acm.org
Natural language query grounding in videos is a challenging task that requires a comprehensive understanding of the query and the video, as well as fusion of information across these modalities. Existing methods mostly emphasize one-way query-to-video interaction with a late fusion scheme and lack effective ways to capture the relationships within and between query and video in a fine-grained manner. Moreover, current methods are often overly complicated, resulting in long training times. We propose a self-attention mechanism combined with a cross-interaction multi-head attention mechanism in an early fusion scheme to capture video-query intra-dependencies as well as inter-relations in both directions (query-to-video and video-to-query). The cross-attention method can associate query words and video frames at any position and account for long-range dependencies in the video context. In addition, we propose a multi-task training objective that includes start/end prediction and moment segmentation. The moment segmentation task provides additional training signals that remedy the start/end prediction noise caused by annotator disagreement. Our simple yet effective architecture enables speedy training (within 1 hour on an AWS P3.2xlarge GPU instance) and instant inference. We show that the proposed method achieves superior performance compared to complex state-of-the-art methods, in particular surpassing the SOTA on the high-IoU metric (R@1, IoU=0.7) by 3.52% absolute (11.09% relative) on the Charades-STA dataset.
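The reported metric (R@1, IoU=0.7) can be illustrated with a minimal sketch. It assumes the usual reading of the metric: each query has one top-ranked predicted moment, and a prediction counts as a hit when its temporal intersection-over-union with the annotated moment is at least the threshold. The names `temporal_iou` and `recall_at_1` and the sample moments are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap is clipped at zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Moment], gts: List[Moment],
                iou_threshold: float = 0.7) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Illustrative values only, not results from the paper.
example_preds = [(2.0, 10.0), (0.0, 5.0)]
example_gts = [(2.0, 12.0), (4.0, 9.0)]
example_score = recall_at_1(example_preds, example_gts)
```

In this toy case the first prediction overlaps its annotation for 8 of 10 seconds and counts as a hit, while the second overlaps for only 1 of 9 seconds and does not. A high threshold such as 0.7 rewards tight boundary localization, which is the regime the abstract highlights.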
ACM Digital Library