Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

Bigbench: A unified benchmark for social bias in text-to-image generative models based on multi-modal LLM

H Luo, H Huang, Z Deng, X Liu, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-Image (T2I) generative models are becoming increasingly crucial due to their ability
to generate high-quality images, which also raises concerns about the social biases in their …

A Comprehensive Survey of Recent Transformers in Image, Video and Diffusion Models.

DPC Le, D Wang, VT Le - Computers, Materials & Continua, 2024 - cdn.techscience.cn
Transformer models have emerged as dominant networks for various tasks in computer
vision compared to Convolutional Neural Networks (CNNs). The transformers demonstrate …

Efficient diffusion transformer with step-wise dynamic attention mediators

Y Pu, Z Xia, J Guo, D Han, Q Li, D Li, Y Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …

Investigating Deep Watermark Security: An Adversarial Transferability Perspective

B Qi, J Gao, Y Luo, J Liu, L Wu, B Zhou - arXiv preprint arXiv:2402.16397, 2024 - arxiv.org
The rise of generative neural networks has triggered an increased demand for intellectual
property (IP) protection in generated content. Deep watermarking techniques, recognized for …

Dynamic diffusion transformer

W Zhao, Y Han, J Tang, K Wang, Y Song… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion Transformer (DiT), an emerging diffusion model for image generation, has
demonstrated superior performance but suffers from substantial computational costs. Our …

Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions

A Taghipour, M Ghahremani, M Bennamoun… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates the role of CLIP image embeddings within the Stable Video Diffusion
(SVD) framework, focusing on their impact on video generation quality and computational …

EdgeFusion: On-Device Text-to-Image Generation

T Castells, HK Song, T Piao, S Choi, BK Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
The intensive computational burden of Stable Diffusion (SD) for text-to-image generation
poses a significant hurdle for its practical application. To tackle this challenge, recent …

LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

W Wu, K Zheng, S Ma, F Lu, Y Guo, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding long text is in great demand in practice but beyond the reach of most
language-image pre-training (LIP) models. In this work, we empirically confirm that the key …

CityCraft: A Real Crafter for 3D City Generation

J Deng, W Chai, J Huang, Z Zhao, Q Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
City scene generation has gained significant attention in autonomous driving, smart city
development, and traffic simulation. It helps enhance infrastructure planning and monitoring …