A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications

L Alzubaidi, J Bai, A Al-Sabaawi, J Santamaría… - Journal of Big Data, 2023 - Springer
Data scarcity is a major challenge when training deep learning (DL) models. DL demands a
large amount of data to achieve exceptional performance. Unfortunately, many applications …

UATVR: Uncertainty-adaptive text-video retrieval

B Fang, W Wu, C Liu, Y Zhou, Y Song… - Proceedings of the …, 2023 - openaccess.thecvf.com
With the explosive growth of web videos and emerging large-scale vision-language
pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted …

PIMNet: a parallel, iterative and mimicking network for scene text recognition

Z Qiao, Y Zhou, J Wei, W Wang, Y Zhang… - Proceedings of the 29th …, 2021 - dl.acm.org
Nowadays, scene text recognition has attracted more and more attention due to its various
applications. Most state-of-the-art methods adopt an encoder-decoder framework with …

Dense semantic contrast for self-supervised visual representation learning

X Li, Y Zhou, Y Zhang, A Zhang, W Wang… - Proceedings of the 29th …, 2021 - dl.acm.org
Self-supervised representation learning for visual pre-training has achieved remarkable
success with sample (instance or pixel) discrimination and semantics discovery of instance …

Beyond OCR + VQA: involving OCR into the flow for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang - Proceedings of the 29th ACM …, 2021 - dl.acm.org
Text-based visual question answering (TextVQA) requires analyzing both the visual contents
and texts in an image to answer a question, which is more practical than general visual …

Mask is all you need: Rethinking Mask R-CNN for dense and arbitrary-shaped scene text detection

X Qin, Y Zhou, Y Guo, D Wu, Z Tian, N Jiang… - Proceedings of the 29th …, 2021 - dl.acm.org
Due to the large success in object detection and instance segmentation, Mask R-CNN
attracts great attention and is widely adopted as a strong baseline for arbitrary-shaped …

STEPs: Self-supervised key step extraction and localization from unlabeled procedural videos

A Shah, B Lundell, H Sawhney… - Proceedings of the …, 2023 - openaccess.thecvf.com
We address the problem of extracting key steps from unlabeled procedural videos,
motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training …

FC2RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection

X Qin, Y Zhou, Y Guo, D Wu… - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
Accurate detection of multi-oriented text that accounts for a large proportion in real practice
is of great significance. The performance has improved rapidly on common benchmarks in …

Attentive spatial-temporal contrastive learning for self-supervised video representation

X Yang, S Xiong, K Wu, D Shan, Z Xie - Image and Vision Computing, 2023 - Elsevier
Most existing self-supervised works learn video representation by using a single pretext
task. A single pretext task, providing single supervision from unlabeled data, may neglect to …

AdaFocus: Towards end-to-end weakly supervised learning for long-video action understanding

J Zhou, H Li, KY Lin, J Liang - arXiv preprint arXiv:2311.17118, 2023 - arxiv.org
Developing end-to-end models for long-video action understanding tasks presents
significant computational and memory challenges. Existing works generally build models on …