A review of generalized zero-shot learning methods

F Pourpanah, M Abdar, Y Luo, X Zhou… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Generalized zero-shot learning (GZSL) aims to train a model for classifying data samples
under the condition that some output classes are unknown during supervised learning. To …

A survey on video-based human action recognition: recent updates, datasets, challenges, and applications

P Pareek, A Thakkar - Artificial Intelligence Review, 2021 - Springer
Human Action Recognition (HAR) involves human activity monitoring task in
different areas of medical, education, entertainment, visual surveillance, video retrieval, as …

Open-vocabulary object detection via vision and language knowledge distillation

X Gu, TY Lin, W Kuo, Y Cui - arXiv preprint arXiv:2104.13921, 2021 - arxiv.org
We aim at advancing open-vocabulary object detection, which detects objects described by
arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly …

Aligning bag of regions for open-vocabulary object detection

S Wu, W Zhang, S Jin, W Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …

Decoupling zero-shot semantic segmentation

J Ding, N Xue, GS Xia, D Dai - Proceedings of the IEEE/CVF …, 2022 - openaccess.thecvf.com
Zero-shot semantic segmentation (ZS3) aims to segment the novel categories that have not
been seen in the training. Existing works formulate ZS3 as a pixel-level zero-shot …

Align and prompt: Video-and-language pre-training with entity prompts

D Li, J Li, H Li, JC Niebles… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video-and-language pre-training has shown promising improvements on various
downstream tasks. Most previous methods capture cross-modal interactions with a …

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

TN-ZSTAD: Transferable network for zero-shot temporal activity detection

L Zhang, X Chang, J Liu, M Luo, Z Li… - … on Pattern Analysis …, 2022 - ieeexplore.ieee.org
An integral part of video analysis and surveillance is temporal activity detection, which
means to simultaneously recognize and localize activities in long untrimmed videos …

Combined scaling for zero-shot transfer learning

H Pham, Z Dai, G Ghiasi, K Kawaguchi, H Liu, AW Yu… - Neurocomputing, 2023 - Elsevier
Recent developments in multimodal training methodologies, including CLIP and ALIGN,
obviate the necessity for individual data labeling. These approaches utilize pairs of data and …

DualCoOp: Fast adaptation to multi-label recognition with limited annotations

X Sun, P Hu, K Saenko - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Solving multi-label recognition (MLR) for images in the low-label regime is a challenging
task with many real-world applications. Recent work learns an alignment between textual …