Conditional Video Diffusion Network for Fine-grained Temporal Sentence Grounding
Temporal sentence grounding (TSG) aims to locate a semantically related segment of an
untrimmed video guided by a sentence query. Since the untrimmed videos are too long …
Towards Weakly Supervised Text-to-Audio Grounding
The text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events
described by natural language. This task can facilitate applications such as multimodal …
A dual reinforcement learning framework for weakly supervised phrase grounding
Weakly-supervised phrase grounding aims to localize a specific region in an image that
corresponds to the given textual phrase, where the mapping between noun phrases and …