Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

ClusterFormer: clustering as a universal visual learner

J Liang, Y Cui, Q Wang, T Geng… - Advances in neural …, 2024 - proceedings.neurips.cc
This paper presents ClusterFormer, a universal vision model that is based on the Clustering
paradigm with TransFormer. It comprises two novel designs: 1) recurrent cross-attention …

LongVU: Spatiotemporal adaptive compression for long video-language understanding

X Shen, Y Xiong, C Zhao, L Wu, J Chen, C Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos remains a …

DVLO: Deep visual-LiDAR odometry with local-to-global feature fusion and bi-directional structure alignment

J Liu, D Zhuo, Z Feng, S Zhu, C Peng, Z Liu… - European Conference on …, 2025 - Springer
Information inside visual and LiDAR data is well complementary derived from the
fine-grained texture of images and massive geometric information in point clouds. However …

Computation-efficient deep learning for computer vision: A survey

Y Wang, Y Han, C Wang, S Song… - Cybernetics and …, 2024 - ieeexplore.ieee.org
Over the past decade, deep learning models have exhibited considerable advancements,
reaching or even exceeding human-level performance in a range of visual perception tasks …

Context-aware interaction network for RGB-T semantic segmentation

Y Lv, Z Liu, G Li - IEEE Transactions on Multimedia, 2024 - ieeexplore.ieee.org
RGB-T semantic segmentation is a key technique for autonomous driving scenes
understanding. For the existing RGB-T semantic segmentation methods, however, the …

Neural clustering based visual representation learning

G Chen, X Li, Y Yang, W Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We investigate a fundamental aspect of machine vision: the measurement of features, by
revisiting clustering, one of the most classic approaches in machine learning and data …

Learning hierarchical image segmentation for recognition and by recognition

TW Ke, S Mo, SX Yu - The Twelfth International Conference on …, 2023 - openreview.net
Large vision and language models learned directly through image-text associations often
lack detailed visual substantiation, whereas image segmentation tasks are treated …

Improving scene graph generation with superpixel-based interaction learning

J Wang, C Zhang, J Huang, B Ren, Z Deng - Proceedings of the 31st …, 2023 - dl.acm.org
Recent advances in Scene Graph Generation (SGG) typically model the relationships
among entities utilizing box-level features from pre-defined detectors. We argue that an …

Another way to the top: Exploit contextual clustering in learned image coding

Y Zhang, Z Duan, M Lu, D Ding, F Zhu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
While convolution and self-attention are extensively used in learned image compression
(LIC) for transform coding, this paper proposes an alternative called Contextual Clustering …