Diffusion

The models

The models you use in ZTX37 are Stable Diffusion XL models. These are not LLMs like the ones you might know from e.g. ChatGPT; although both are types of generative AI, they serve different purposes and are built differently.


So while both are “large” and “models,” Stable Diffusion is built for text-to-image generation, not for language understanding or generation like an LLM.


ZTX37 is a tool that works with Stable Diffusion XL models. Using prompts and images, we generate new images, and we use a model to do so. Different models are trained differently and will give different results. It is the model that produces the images; ZTX37 is built around the model to get the most out of it.


Some models can be very focused on a subject and respond differently to a prompt than others. Sometimes using the right words triggers them to create beauty. Think of the models as a class full of artists: each needs a different approach, and some you might like more than others.


What is diffusion?

The forward process starts with real data (like dinosaur pixel art) and adds a little noise to it repeatedly, step by step, until it turns into pure noise. This process generates a sequence of noisy data that the model uses during training.
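
To make the forward process concrete, here is a minimal sketch in Python (using NumPy); the step count and noise schedule are made-up example values, not those of any particular model:

```python
import numpy as np

def forward_diffusion(image, num_steps=200, beta_start=1e-4, beta_end=0.02):
    """Gradually add Gaussian noise to an image, step by step."""
    betas = np.linspace(beta_start, beta_end, num_steps)  # example noise schedule
    noisy = image.astype(np.float64)
    trajectory = [noisy.copy()]
    for beta in betas:
        noise = np.random.randn(*noisy.shape)
        # Each step keeps a little less of the signal and mixes in a little more noise.
        noisy = np.sqrt(1.0 - beta) * noisy + np.sqrt(beta) * noise
        trajectory.append(noisy.copy())
    return trajectory  # the sequence of ever-noisier images used during training

# Example: a tiny stand-in "pixel art" image ends up as (almost) pure noise.
pixel_art = np.random.rand(8, 8, 3)
steps = forward_diffusion(pixel_art)
print(len(steps))  # 201: the original image plus one entry per noising step
```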

The backward process (this is where the magic happens!) involves training a deep learning model to reverse this noisy transformation, step by step. The model learns to predict how to remove the noise at each stage, eventually recovering something realistic. Sometimes, the model uses an extra input, like a "prompt," to guide what it generates. This is how the model learns to create realistic samples, even starting from pure noise!
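
A minimal sketch of the training idea behind the backward process, assuming a PyTorch-style setup; the tiny `denoiser` network below is a placeholder for illustration, not the real SDXL architecture (which is a large U-Net that also receives the timestep and the prompt embedding):

```python
import torch
import torch.nn as nn

# Placeholder denoiser for 8x8 RGB images; real models are vastly larger.
denoiser = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3 * 8 * 8))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(clean_images, alpha_bar):
    """One step of 'learn to predict the noise that was added'."""
    noise = torch.randn_like(clean_images)
    # Jump straight to a noisy version of the data at noise level alpha_bar.
    noisy = alpha_bar.sqrt() * clean_images + (1 - alpha_bar).sqrt() * noise
    predicted_noise = denoiser(noisy).view_as(noise)
    loss = nn.functional.mse_loss(predicted_noise, noise)  # how wrong was the guess?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

batch = torch.rand(16, 3, 8, 8)                 # stand-in training images
print(training_step(batch, torch.tensor(0.5)))  # loss shrinks over many such steps
```

At generation time the trained model runs this prediction in reverse: starting from pure noise, it repeatedly removes the noise it predicts until an image remains.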


What Stable Diffusion is

  • It is a latent diffusion model: it learns to generate images by denoising latent representations.
  • It uses CLIP (Contrastive Language–Image Pretraining) to understand the text prompt.
  • It is trained on massive datasets of image–caption pairs.
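
If you are curious how these pieces fit together in code, the sketch below uses Hugging Face's diffusers library; the model id and settings are just an example, and ZTX37 handles all of this for you behind the scenes:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load a Stable Diffusion XL checkpoint (example model id; other SDXL models work too).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# The CLIP-based text encoders turn the prompt into conditioning; the U-Net then
# denoises a latent representation step by step, and a decoder turns it into pixels.
image = pipe(
    prompt="pixel art of a friendly dinosaur, vibrant colors",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]

image.save("dinosaur.png")
```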


CLIP (Contrastive Language–Image Pretraining) is a powerful multimodal model developed by OpenAI that bridges the gap between natural language and visual understanding. It’s a foundational component in models like Stable Diffusion, enabling them to interpret text prompts and generate relevant images.


CLIP learns to associate images and text by training on 400 million image–text pairs. Instead of classifying images into fixed categories, it understands semantic relationships—like recognizing “a photo of a cat wearing sunglasses” even if it’s never seen that exact phrase before.
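
As a small illustration of that behavior, here is a sketch that asks a publicly available CLIP checkpoint, via the transformers library, how well an image matches a few captions; the checkpoint name and image file are just examples (SDXL's own text encoders are related but not identical):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example public CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat_with_sunglasses.png")  # any local image
texts = ["a photo of a cat wearing sunglasses", "pixel art of a dinosaur"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = CLIP thinks the caption and the image belong together.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```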


Stable Diffusion was developed by researchers at the CompVis Group at Ludwig Maximilian University of Munich and Heidelberg University. Unlike many other models, it is open source. The community has built further on Stable Diffusion, e.g. with tools such as ZTX37.



Do you need to know this?

No.

With your models you have a group of artificial artists at your fingertips, and you'll use ZTX37 to get out of them what you are looking for. Just remember: each one is different, and what works excellently for one might work poorly for another. How they work internally is not relevant for our task: creating beauty.

