In this tutorial, you will see how to generate images from text prompts using state-of-the-art diffusion models from Hugging Face. You'll learn about base diffusion models and how combining them with a refiner produces more detailed results. Diffusion models are powerful because they start from pure noise and iteratively denoise it into an image.
Advanced generative AI tools like Midjourney and OpenAI DALL·E 3 use diffusion models to generate photo-realistic images. However, these services charge fees for image generation. With the open diffusion models on Hugging Face, you can generate images at no cost beyond the compute you run them on. So, let's dive in!
Installing Required Libraries
To begin, let's install the libraries necessary for this project. Run the following commands in a notebook cell (drop the leading ! when running them in a regular terminal):
!pip install diffusers --upgrade
!pip install invisible_watermark transformers accelerate safetensors
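Before moving on, it can help to confirm that the installation succeeded. The snippet below is a minimal sketch that uses only the Python standard library. The REQUIRED list and the missing_packages helper are illustrative names, not part of any library, and torch is included on the assumption that it is needed because the later scripts import it.

```python
import importlib.util

# Import names (not pip names) of packages the later scripts rely on.
# This list is an assumption for illustration; adjust it to your setup.
REQUIRED = ["diffusers", "transformers", "accelerate", "safetensors", "torch"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

If the list is not empty, rerun the install commands above for the packages it names.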
Generating AI Images Using Base Diffusion Models
Some state-of-the-art text-to-image diffusion models, such as SDXL, consist of a base model and a refiner. We'll first generate an image using the base diffusion model. We will use the stabilityai/stable-diffusion-xl-base-1.0 (SDXL) model for image generation. SDXL employs an ensemble of expert models for latent diffusion. Initially, the base model generates (noisy) latent images, which are then refined by a specialized model during the final denoising stages. You can use any other text-to-image diffusion model from Hugging Face.
The following Python script initializes a Hugging Face pipeline for the diffusion model and sets it up for GPU acceleration.
from diffusers import DiffusionPipeline
import torch

# load the SDXL base model in half precision
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
# move the pipeline to the GPU
pipe.to("cuda")
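The script above assumes a CUDA-capable GPU is available. As a hedged sketch, you could pick the device and precision at runtime instead. The names device and dtype_name are illustrative, and the fallback to float32 on CPU is an assumption based on half precision being poorly suited to CPU inference.

```python
# torch is treated as optional here so the snippet also runs where it is
# not installed; in that case we simply report the CPU fallback.
try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

# Half precision on the GPU, full precision on the CPU (assumed fallback).
device = "cuda" if cuda_available else "cpu"
dtype_name = "float16" if cuda_available else "float32"
print(f"device={device}, dtype={dtype_name}")
```

You could then pass torch_dtype=getattr(torch, dtype_name) when loading the pipeline and call pipe.to(device). Expect generation on a CPU to be very slow for a model of this size.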
The next step is to pass a text prompt to the pipeline through its prompt parameter. As shown in the script below, the call returns an output object whose images list holds the generated images; the first element is the image for our prompt.
prompt = "A Texas ranger riding a white horse"
# the pipeline returns a list of images; take the first one
image = pipe(prompt=prompt).images[0]
image
Output:
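Look at the image generated above; isn't it cool? The SDXL license (CreativeML Open RAIL++-M) also permits commercial use, subject to its use restrictions, so review the license terms before using generated images commercially.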
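In a notebook the image is displayed inline, but you will often want to keep it on disk. The helper below is a minimal sketch, and generate_and_save is a hypothetical name introduced here. It assumes that pipe is a pipeline like the one defined earlier and that the returned images are PIL images, which provide a save method.

```python
def generate_and_save(pipe, prompt, path, **pipe_kwargs):
    """Run a text-to-image pipeline once and save the first image to `path`.

    Extra keyword arguments are forwarded to the pipeline call unchanged.
    """
    image = pipe(prompt=prompt, **pipe_kwargs).images[0]
    image.save(path)  # PIL infers the file format from the extension
    return image


# Example usage with the pipeline defined earlier (needs torch and a GPU):
#   generator = torch.Generator("cuda").manual_seed(0)
#   generate_and_save(pipe, "A texas ranger riding a white horse",
#                     "ranger.png", generator=generator)
```

Passing a seeded generator, as in the commented usage, is one way to make runs repeatable, because the initial noise is then drawn from a fixed seed.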
Generating Refined Images Using an Ensemble of Experts
Using an ensemble of experts, that is, a base model combined with a refiner, you can create more refined images. To do so, you first create the base model as you did before. Next, you create a refiner model that reuses the base model's second text encoder and VAE, so these components are loaded only once.
The refiner will build upon the latent image produced by the base model to deliver a more polished, detailed final output.
The script below creates our base model and refiner.
from diffusers import DiffusionPipeline
import torch
# load both base & refiner
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
base.to("cuda")

# the refiner reuses the base model's second text encoder and VAE
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")
In the following script, we specify that the ensemble of experts should take 40 steps in total to generate an image from noise. The base model handles the first 80% of the denoising schedule (32 steps, set through denoising_end), and the refiner handles the remaining 20% (8 steps, set through denoising_start).
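n_steps = 40            # total denoising steps across both experts
high_noise_frac = 0.8   # fraction of the schedule handled by the base model
prompt = "A panda sitting on a table having a drink in a wooden room"

# run both experts
# the base model denoises the first 80% of the schedule and returns latents
image = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images

# the refiner picks up at the same point and finishes the remaining 20%
image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]
image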
Output:
From the above output, you can see a cute panda having a drink in a wooden room. Excellent, isn't it?
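The 80/20 split between the base model and the refiner can be checked with a little arithmetic. The function below is only a sketch of that calculation. split_steps is a hypothetical helper, and the rounding the pipelines apply internally may differ by a step for other combinations of values.

```python
def split_steps(n_steps, high_noise_frac):
    """Approximate how many steps the base model and the refiner each run."""
    base_steps = round(n_steps * high_noise_frac)
    refiner_steps = n_steps - base_steps
    return base_steps, refiner_steps


# Values used in this tutorial: 40 steps with 80% given to the base model.
print(split_steps(40, 0.8))  # prints (32, 8)
```

Lowering high_noise_frac hands more of the schedule to the refiner, so you can experiment with the value to see how it affects the final image.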
Conclusion
Diffusion models allow you to create stunning AI images. You can use diffusion models from Hugging Face to generate AI images for free.
In this tutorial, we employed the SDXL model for image generation. The base model generates (noisy) latent images, which are then refined by a specialized model during the final denoising stages. The base model can also function independently as a standalone module.
I invite you to try these models and share what you generated.