Chat-univi: Unified visual representation empowers large language models with image and video understanding
P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …
range of open-ended tasks and have extended their utility to encompass multimodal …
Clusterfomer: clustering as a universal visual learner
This paper presents ClusterFormer, a universal vision model that is based on the Clustering
paradigm with TransFormer. It comprises two novel designs: 1) recurrent cross-attention …
paradigm with TransFormer. It comprises two novel designs: 1) recurrent cross-attention …
Longvu: Spatiotemporal adaptive compression for long video-language understanding
Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos remains a …
understanding and analyzing video content. However, processing long videos remains a …
Dvlo: Deep visual-lidar odometry with local-to-global feature fusion and bi-directional structure alignment
Abstract Information inside visual and LiDAR data is well complementary derived from the
fine-grained texture of images and massive geometric information in point clouds. However …
fine-grained texture of images and massive geometric information in point clouds. However …
Computation-efficient deep learning for computer vision: A survey
Over the past decade, deep learning models have exhibited considerable advancements,
reaching or even exceeding human-level performance in a range of visual perception tasks …
reaching or even exceeding human-level performance in a range of visual perception tasks …
Context-aware interaction network for rgb-t semantic segmentation
RGB-T semantic segmentation is a key technique for autonomous driving scenes
understanding. For the existing RGB-T semantic segmentation methods, however, the …
understanding. For the existing RGB-T semantic segmentation methods, however, the …
Neural clustering based visual representation learning
We investigate a fundamental aspect of machine vision: the measurement of features by
revisiting clustering one of the most classic approaches in machine learning and data …
revisiting clustering one of the most classic approaches in machine learning and data …
Learning hierarchical image segmentation for recognition and by recognition
Large vision and language models learned directly through image-text associations often
lack detailed visual substantiation, whereas image segmentation tasks are treated …
lack detailed visual substantiation, whereas image segmentation tasks are treated …
Improving scene graph generation with superpixel-based interaction learning
Recent advances in Scene Graph Generation (SGG) typically model the relationships
among entities utilizing box-level features from pre-defined detectors. We argue that an …
among entities utilizing box-level features from pre-defined detectors. We argue that an …
Another way to the top: Exploit contextual clustering in learned image coding
While convolution and self-attention are extensively used in learned image compression
(LIC) for transform coding, this paper proposes an alternative called Contextual Clustering …
(LIC) for transform coding, this paper proposes an alternative called Contextual Clustering …