Temporal context aggregation for video retrieval with contrastive learning

A survey of transformers

T Lin, Y Wang, X Liu, X Qiu - AI Open, 2022 - Elsevier

Transformers have achieved great success in many artificial intelligence fields, such as
natural language processing, computer vision, and audio processing. Therefore, it is natural …

Cited by 1392 · Related articles · All 4 versions

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …

Cited by 2514 · Related articles · All 7 versions

A survey on visual transformer

K Han, Y Wang, H Chen, X Chen, J Guo, Z Liu… - arXiv preprint arXiv …, 2020 - arxiv.org

Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …

Cited by 386 · Related articles · All 3 versions

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Cited by 131 · Related articles · All 8 versions

PiDRo: Parallel isomeric attention with dynamic routing for text-video retrieval

P Guan, R Pei, B Shao, J Liu, W Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Text-video retrieval is a fundamental task with high practical value in multi-modal research.
Inspired by the great success of pre-trained image-text models with large-scale data, such …

Cited by 12 · Related articles · All 4 versions

RadFormer: Transformers with global–local attention for interpretable and accurate Gallbladder Cancer detection

S Basu, M Gupta, P Rana, P Gupta, C Arora - Medical Image Analysis, 2023 - Elsevier

We propose a novel deep neural network architecture to learn interpretable representation
for medical image analysis. Our architecture generates a global attention for region of …

Cited by 27 · Related articles · All 5 versions

OOD-CV: A benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images

B Zhao, S Yu, W Ma, M Yu, S Mei, A Wang, J He… - European conference on …, 2022 - Springer

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One
reason is that existing robustness benchmarks are limited, as they either rely on synthetic …

Cited by 44 · Related articles · All 7 versions

TransVCL: Attention-enhanced video copy localization network with flexible supervision

S He, Y He, M Lu, C Jiang, X Yang, F Qian… - Proceedings of the …, 2023 - ojs.aaai.org

Video copy localization aims to precisely localize all the copied segments within a pair of
untrimmed videos in video retrieval applications. Previous methods typically start from frame …

Cited by 25 · Related articles · All 4 versions

Spatial-temporal transformer for 3D point cloud sequences

Y Wei, H Liu, T Xie, Q Ke, Y Guo - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

Effective learning of spatial-temporal information within a point cloud sequence is highly
important for many down-stream tasks such as 4D semantic segmentation and 3D action …

Cited by 41 · Related articles · All 6 versions

LoFormer: Local frequency transformer for image deblurring

X Mao, J Wang, X Xie, Q Li, Y Wang - Proceedings of the 32nd ACM …, 2024 - dl.acm.org

Due to the computational complexity of self-attention (SA), prevalent techniques for image
deblurring often resort to either adopting localized SA or employing coarse-grained global …

Cited by 5 · Related articles · All 5 versions