InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

X Shi, Z Huang, FY Wang, W Bian, D Li… - ACM SIGGRAPH 2024 …, 2024 - dl.acm.org

We introduce Motion-I2V, a novel framework for consistent and controllable text-guided
image-to-video generation (I2V). In contrast to previous methods that directly learn the …

Cited by 26 · Related articles · All 2 versions

OneTracker: Unifying visual object tracking with foundation models and efficient tuning

L Hong, S Yan, R Zhang, W Li, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com

Visual object tracking aims to localize the target object of each frame based on its initial
appearance in the first frame. Depending on the input modality, tracking tasks can be divided …

Cited by 17 · Related articles · All 3 versions

Referred by multi-modality: A unified temporal transformer for video object segmentation

S Yan, R Zhang, Z Guo, W Chen, W Zhang… - Proceedings of the …, 2024 - ojs.aaai.org

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language
and audio, has evoked increasing attention in both industry and academia. It is challenging …

Cited by 18 · Related articles · All 3 versions

PanoVOS: Bridging non-panoramic and panoramic views with transformer for video segmentation

S Yan, X Xu, R Zhang, L Hong, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org

Panoramic videos contain richer spatial information and have attracted tremendous amounts
of attention due to their exceptional experience in some fields such as autonomous driving …

Cited by 3 · Related articles · All 2 versions

General Compression Framework for Efficient Transformer Object Tracking

L Hong, J Li, X Zhou, S Yan, P Guo, K Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org

Transformer-based trackers have established a dominant role in the field of visual object
tracking. While these trackers exhibit promising performance, their deployment on resource …

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

F Ma, Y Zhou, H Li, Z He, S Wu, F Rao, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

In the realm of multimodal research, numerous studies leverage substantial image-text pairs
to conduct modal alignment learning, transforming Large Language Models (LLMs) into …