Two-stream transformer architecture for long video understanding

A Ulhaq, N Akhtar, G Pogrebna, A Mian - arXiv preprint arXiv:2209.05700, 2022 - arxiv.org

Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have also proven the efficacy of transformers beyond the image domain …

Cited by 46 · Related articles · All 4 versions

k-NN attention-based video vision transformer for action recognition

W Sun, Y Ma, R Wang - Neurocomputing, 2024 - Elsevier

Action Recognition aims to understand human behavior and predict a label for each action.
Recently, Vision Transformer (ViT) has achieved remarkable performance on action …

Cited by 6 · Related articles · All 2 versions

Transformer in Touch: A Survey

J Gao, N Cheng, B Fang, W Han - arXiv preprint arXiv:2405.12779, 2024 - arxiv.org

The Transformer model, initially achieving significant success in the field of natural language
processing, has recently shown great potential in the application of tactile perception. This …

Cited by 1 · Related articles · All 2 versions

Towards Long Form Audio-visual Video Understanding

W Hou, G Li, Y Tian, D Hu - ACM Transactions on Multimedia Computing …, 2023 - dl.acm.org

We live in a world filled with never-ending streams of multimodal information. As a more
natural recording of the real scenario, long form audio-visual videos are expected as an …

Cited by 3 · Related articles · All 3 versions

Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos

H Wang, Y Hu, Y Zhu, J Qi, B Wu - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Social Relation Recognition is an important part of Video Understanding, providing insights
into the information that videos convey. Most previous works mainly focused on graph …

Cited by 2 · Related articles · All 2 versions

MMSF: A multimodal sentiment-fused method to recognize video speaking style

B Zhang, Y Fang, F Yu, J Bei, T Ren - Proceedings of the 2023 ACM …, 2023 - dl.acm.org

As talking takes a large proportion of human lives, it is necessary to perform deeper
understanding of human conversations. Speaking style recognition is aimed at recognizing …

Cited by 1 · Related articles

Progressive Complementation Network With Semantics and Details for Salient Object Detection in Optical Remote Sensing Images

R Zhao, P Zheng, C Zhang… - IEEE Journal of Selected …, 2024 - ieeexplore.ieee.org

The existing salient object detection in optical remote sensing images methods mostly
employ the same strategy to handle features at different levels without fully considering the …

Cited by 1 · Related articles · All 2 versions

Reproducibility Companion Paper of "MMSF: A Multimodal Sentiment-Fused Method to Recognize Video Speaking Style"

F Yu, B Zhang, Y Fang, J Bei, T Ren, J Li… - Proceedings of the 2024 …, 2024 - dl.acm.org

To support the replication of "MMSF: A Multimodal Sentiment-Fused Method to Recognize
Video Speaking Style", which was presented at ICMR'23, this companion paper provides the …

Real-Time Human Action Recognition on Embedded Platforms

R Wang, Z Wang, P Gao, M Li, J Jeong, Y Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

With advancements in computer vision and deep learning, video-based human action
recognition (HAR) has become practical. However, due to the complexity of the computation …

Crime Detection from Pre-crime Video Analysis with Augmented Pose and Emotion Information

S Kilic, M Tuceryan - 2024 IEEE Southwest Symposium on …, 2024 - ieeexplore.ieee.org

This study aims to detect pre-crime events in videos focusing on shoplifting. Our work
proposes a novel approach of augmenting human pose information and emotion information …

Cited by 1 · Related articles · All 2 versions