MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

OneTracker: Unifying visual object tracking with foundation models and efficient tuning

L Hong, S Yan, R Zhang, W Li, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual object tracking aims to localize the target object of each frame based on its initial
appearance in the first frame. Depending on the input modality, tracking tasks can be divided …

Reading relevant feature from global representation memory for visual object tracking

X Zhou, P Guo, L Hong, J Li, W Zhang… - Advances in …, 2024 - proceedings.neurips.cc
Reference features from a template or historical frames are crucial for visual object tracking.
Prior works utilize all features from a fixed template or memory for visual object tracking …

OpenVIS: Open-vocabulary video instance segmentation

P Guo, T Huang, P He, X Liu, T Xiao, Z Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Open-vocabulary Video Instance Segmentation (OpenVIS) can simultaneously detect,
segment, and track arbitrary object categories in a video, without being constrained to …

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

E Song, W Chai, T Ye, JN Hwang, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Video Visualization and Visual Analytics: A Task-Based and Application-Driven Investigation

W Xia, G Sun, T Li, B Chang, J Tang… - … on Circuits and …, 2024 - ieeexplore.ieee.org
Video data refers to digital information in the form of a series of frames or images
representing continuous motion captured by a video recording device. In various domains …

Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval

Z Li, L Zhang, K Zhang, Y Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Image-text retrieval is a fundamental task in bridging the semantics between vision and
language. The key challenge lies in accurately and efficiently learning the semantic …

SRRT: Exploring Search Region Regulation for Visual Object Tracking

J Zhu, X Chen, P Zhang, X Wang… - … on Circuits and …, 2024 - ieeexplore.ieee.org
The dominant trackers generate a fixed-size rectangular region based on the previous
prediction or initial bounding box as the model input, i.e., search region. While this manner …

Multi-step Temporal Modeling for UAV Tracking

X Yuan, T Xu, X Liu, Y Wang, H Qin… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In the realm of unmanned aerial vehicle (UAV) tracking, Siamese-based approaches have
gained traction due to their optimal balance between efficiency and precision. However …

LGTrack: Exploiting Local and Global Properties for Robust Visual Tracking

C Liu, J Zhao, C Bo, S Li, D Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Re-detection is a necessary capability for long-term tracking. Target candidate proposals in
the whole image can provide a chance of tracking reset when tracking fails due to tracking …