Exploring relations in untrimmed videos for self-supervised learning

A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications

L Alzubaidi, J Bai, A Al-Sabaawi, J Santamaría… - Journal of Big Data, 2023 - Springer

Data scarcity is a major challenge when training deep learning (DL) models. DL demands a
large amount of data to achieve exceptional performance. Unfortunately, many applications …

Cited by: 308 (9 versions)

UATVR: Uncertainty-adaptive text-video retrieval

B Fang, W Wu, C Liu, Y Zhou, Y Song… - Proceedings of the …, 2023 - openaccess.thecvf.com

With the explosive growth of web videos and emerging large-scale vision-language
pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted …

Cited by: 41 (7 versions)

PIMNet: a parallel, iterative and mimicking network for scene text recognition

Z Qiao, Y Zhou, J Wei, W Wang, Y Zhang… - Proceedings of the 29th …, 2021 - dl.acm.org

Nowadays, scene text recognition has attracted more and more attention due to its various
applications. Most state-of-the-art methods adopt an encoder-decoder framework with …

Cited by: 70 (3 versions)

Dense semantic contrast for self-supervised visual representation learning

X Li, Y Zhou, Y Zhang, A Zhang, W Wang… - Proceedings of the 29th …, 2021 - dl.acm.org

Self-supervised representation learning for visual pre-training has achieved remarkable
success with sample (instance or pixel) discrimination and semantics discovery of instance …

Cited by: 43 (3 versions)

Beyond OCR + VQA: involving OCR into the flow for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang - Proceedings of the 29th ACM …, 2021 - dl.acm.org

Text-based visual question answering (TextVQA) requires analyzing both the visual contents
and texts in an image to answer a question, which is more practical than general visual …

Cited by: 39 (2 versions)

Mask is all you need: Rethinking Mask R-CNN for dense and arbitrary-shaped scene text detection

X Qin, Y Zhou, Y Guo, D Wu, Z Tian, N Jiang… - Proceedings of the 29th …, 2021 - dl.acm.org

Due to the large success in object detection and instance segmentation, Mask R-CNN
attracts great attention and is widely adopted as a strong baseline for arbitrary-shaped …

Cited by: 39 (3 versions)

STEPs: Self-supervised key step extraction and localization from unlabeled procedural videos

A Shah, B Lundell, H Sawhney… - Proceedings of the …, 2023 - openaccess.thecvf.com

We address the problem of extracting key steps from unlabeled procedural videos,
motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training …

Cited by: 5 (7 versions)

FC²RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection

X Qin, Y Zhou, Y Guo, D Wu… - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org

Accurate detection of multi-oriented text that accounts for a large proportion in real practice
is of great significance. The performance has improved rapidly on common benchmarks in …

Cited by: 25 (3 versions)

Attentive spatial-temporal contrastive learning for self-supervised video representation

X Yang, S Xiong, K Wu, D Shan, Z Xie - Image and Vision Computing, 2023 - Elsevier

Most existing self-supervised works learn video representation by using a single pretext
task. A single pretext task, providing single supervision from unlabeled data, may neglect to …

Cited by: 3 (2 versions)

AdaFocus: Towards end-to-end weakly supervised learning for long-video action understanding

J Zhou, H Li, KY Lin, J Liang - arXiv preprint arXiv:2311.17118, 2023 - arxiv.org

Developing end-to-end models for long-video action understanding tasks presents
significant computational and memory challenges. Existing works generally build models on …

Cited by: 2 (2 versions)