CogVideoX: Text-to-video diffusion models with an expert transformer

Z Yang, J Teng, W Zheng, M Ding, S Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating
videos based on text prompts. To efficiently model video data, we propose to leverage a 3D …

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

F Meng, J Liao, X Tan, W Shao, Q Lu, K Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex
prompts, which is increasingly viewed as a promising path towards constructing the …

CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection

M Guo, Y Liu, Z Lin, P Peng, Y Tian - arXiv preprint arXiv:2410.05804, 2024 - arxiv.org
Incremental object detection (IOD) is challenged by background shift, where background
categories in sequential data may include previously learned or future classes. Inspired by …

IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

S Shao, Z Zhou, L Bai, H Xiong, Z Xie - arXiv preprint arXiv:2410.04171, 2024 - arxiv.org
The multi-step sampling mechanism, a key feature of visual diffusion models, has significant
potential to replicate the success of OpenAI's Strawberry in enhancing performance by …

OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

L Chen, Z Li, B Lin, B Zhu, Q Wang, S Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
Variational Autoencoder (VAE), compressing videos into latent representations, is a crucial
preceding component of Latent Video Diffusion Models (LVDMs). With the same …

BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

Y Xu, Y Lee, G Yi, B Liu, Y Chen, P Liu, J Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep neural networks (DNNs) are powerful for cognitive tasks such as image classification,
object detection, and scene segmentation. One drawback however is the significant high …