Fine-grained scene graph generation with data transfer
Scene graph generation (SGG) is designed to extract (subject, predicate, object) triplets in
images. Recent works have made steady progress on SGG, and provide useful tools for …
Panoptic scene graph generation with semantics-prototype learning
Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships
(predicate) to connect human language and visual scenes. However, different language …
Constructing holistic spatio-temporal scene graph for video semantic role labeling
As one of the core video semantic understanding tasks, Video Semantic Role Labeling
(VidSRL) aims to detect the salient events from given videos, by recognizing the predict …
Gsrformer: Grounded situation recognition transformer with alternate semantic attention refinement
Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of
images for "human-like" event understanding. Specifically, GSR task not only detects the …
Open scene understanding: Grounded situation recognition meets segment anything for helping people with visual impairments
Grounded Situation Recognition (GSR) is capable of recognizing and interpreting
visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the …
Grounded video situation recognition
Z Khan, CV Jawahar… - Advances in Neural …, 2022 - proceedings.neurips.cc
Dense video understanding requires answering several questions such as who is doing
what to whom, with what, how, why, and where. Recently, Video Situation Recognition …
Training multimedia event extraction with generated images and captions
Contemporary news reporting increasingly features multimedia content, motivating research
on multimedia event extraction. However, the task lacks annotated multimodal training data …
Biased-predicate annotation identification via unbiased visual predicate representation
Panoptic Scene Graph Generation (PSG) translates visual scenes to structured linguistic
descriptions, i.e., mapping visual instances to subjects/objects, and their relationships to …
Ambiguous images with human judgments for robust visual event classification
Contemporary vision benchmarks predominantly consider tasks on which humans can
achieve near-perfect performance. However, humans are frequently presented with visual …
Video event extraction via tracking visual states of arguments
Video event extraction aims to detect salient events from a video and identify the arguments
for each event as well as their semantic roles. Existing methods focus on capturing the …