Motion-I2V: Consistent and controllable image-to-video generation with explicit motion modeling

X Shi, Z Huang, FY Wang, W Bian, D Li… - ACM SIGGRAPH 2024 …, 2024 - dl.acm.org
We introduce Motion-I2V, a novel framework for consistent and controllable text-guided
image-to-video generation (I2V). In contrast to previous methods that directly learn the …

OneTracker: Unifying visual object tracking with foundation models and efficient tuning

L Hong, S Yan, R Zhang, W Li, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
Visual object tracking aims to localize the target object of each frame based on its initial
appearance in the first frame. Depending on the input modality, tracking tasks can be divided …

Referred by multi-modality: A unified temporal transformer for video object segmentation

S Yan, R Zhang, Z Guo, W Chen, W Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language
and audio, has evoked increasing attention in both industry and academia. It is challenging …

PanoVOS: Bridging non-panoramic and panoramic views with transformer for video segmentation

S Yan, X Xu, R Zhang, L Hong, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Panoramic videos contain richer spatial information and have attracted tremendous amounts
of attention due to their exceptional experience in some fields such as autonomous driving …

General Compression Framework for Efficient Transformer Object Tracking

L Hong, J Li, X Zhou, S Yan, P Guo, K Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based trackers have established a dominant role in the field of visual object
tracking. While these trackers exhibit promising performance, their deployment on resource …

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

F Ma, Y Zhou, H Li, Z He, S Wu, F Rao, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the realm of multimodal research, numerous studies leverage substantial image-text pairs
to conduct modal alignment learning, transforming Large Language Models (LLMs) into …