A review of generalized zero-shot learning methods
Generalized zero-shot learning (GZSL) aims to train a model for classifying data samples
under the condition that some output classes are unknown during supervised learning. To …
A survey on video-based human action recognition: recent updates, datasets, challenges, and applications
Human Action Recognition (HAR) involves human activity monitoring task in
different areas of medical, education, entertainment, visual surveillance, video retrieval, as …
Open-vocabulary object detection via vision and language knowledge distillation
We aim at advancing open-vocabulary object detection, which detects objects described by
arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly …
Aligning bag of regions for open-vocabulary object detection
Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …
Decoupling zero-shot semantic segmentation
Zero-shot semantic segmentation (ZS3) aims to segment the novel categories that have not
been seen in the training. Existing works formulate ZS3 as a pixel-level zero-shot …
Align and prompt: Video-and-language pre-training with entity prompts
Video-and-language pre-training has shown promising improvements on various
downstream tasks. Most previous methods capture cross-modal interactions with a …
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …
TN-ZSTAD: Transferable network for zero-shot temporal activity detection
An integral part of video analysis and surveillance is temporal activity detection, which
means to simultaneously recognize and localize activities in long untrimmed videos …
Combined scaling for zero-shot transfer learning
Recent developments in multimodal training methodologies, including CLIP and ALIGN,
obviate the necessity for individual data labeling. These approaches utilize pairs of data and …
Dualcoop: Fast adaptation to multi-label recognition with limited annotations
Solving multi-label recognition (MLR) for images in the low-label regime is a challenging
task with many real-world applications. Recent work learns an alignment between textual …