A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications
Data scarcity is a major challenge when training deep learning (DL) models. DL demands a
large amount of data to achieve exceptional performance. Unfortunately, many applications …
UATVR: Uncertainty-adaptive text-video retrieval
With the explosive growth of web videos and emerging large-scale vision-language
pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted …
PIMNet: a parallel, iterative and mimicking network for scene text recognition
Nowadays, scene text recognition has attracted more and more attention due to its various
applications. Most state-of-the-art methods adopt an encoder-decoder framework with …
Dense semantic contrast for self-supervised visual representation learning
Self-supervised representation learning for visual pre-training has achieved remarkable
success with sample (instance or pixel) discrimination and semantics discovery of instance …
Beyond OCR + VQA: involving OCR into the flow for robust and accurate TextVQA
Text-based visual question answering (TextVQA) requires analyzing both the visual contents
and texts in an image to answer a question, which is more practical than general visual …
Mask is all you need: Rethinking Mask R-CNN for dense and arbitrary-shaped scene text detection
Due to the large success in object detection and instance segmentation, Mask R-CNN
attracts great attention and is widely adopted as a strong baseline for arbitrary-shaped …
STEPs: Self-supervised key step extraction and localization from unlabeled procedural videos
A Shah, B Lundell, H Sawhney… - Proceedings of the …, 2023 - openaccess.thecvf.com
We address the problem of extracting key steps from unlabeled procedural videos,
motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training …
FC2RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection
Accurate detection of multi-oriented text that accounts for a large proportion in real practice
is of great significance. The performance has improved rapidly on common benchmarks in …
Attentive spatial-temporal contrastive learning for self-supervised video representation
X Yang, S Xiong, K Wu, D Shan, Z Xie - Image and Vision Computing, 2023 - Elsevier
Most existing self-supervised works learn video representation by using a single pretext
task. A single pretext task, providing single supervision from unlabeled data, may neglect to …
AdaFocus: Towards end-to-end weakly supervised learning for long-video action understanding
Developing end-to-end models for long-video action understanding tasks presents
significant computational and memory challenges. Existing works generally build models on …