Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

Y Zhou, R Zhang, K Zheng, N Zhao, J Gu… - arXiv preprint arXiv:2406.09305, 2024 - arxiv.org
In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for a specific subject from an arbitrary test image in a zero-shot manner. They even outperform methods that require additional fine-tuning on test images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images for the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method to construct datasets for subject-driven editing and generation. Specifically, our dataset construction does not need any subject-level fine-tuning. After pre-training two generative models, we are able to generate an unlimited number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is five times the size of the previously largest dataset, yet our cost is tens of thousands of GPU hours lower. To test the proposed dataset, we also propose a model that is capable of both subject-driven image editing and generation. By simply training the model on our proposed dataset, it obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework.
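The cost argument above can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 0.1 GPU-hour-per-subject figure is an assumed value, not a number reported in the abstract, chosen simply to show how per-subject fine-tuning costs scale to "hundreds of thousands of GPU hours" at millions of subjects.

```python
# Rough cost model for per-subject fine-tuning during dataset construction.
# ASSUMPTION: hours_per_subject = 0.1 is an illustrative figure, not from the paper.

def finetune_cost_gpu_hours(num_subjects: int, hours_per_subject: float) -> float:
    """Total GPU hours if every subject needs its own fine-tuning run."""
    return num_subjects * hours_per_subject

total = finetune_cost_gpu_hours(num_subjects=5_000_000, hours_per_subject=0.1)
print(f"{total:,.0f} GPU hours")  # at this scale, hundreds of thousands of GPU hours
```

A method like Toffee that replaces the per-subject term with a fixed one-time pre-training cost turns this linear-in-subjects expense into a constant, which is why avoiding subject-level fine-tuning dominates at million-subject scale.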