Stable Diffusion encodes the entire visual world into a compressed latent space, then decompresses it back into images guided by text, one noise-removal step at a time.

Stable Diffusion is a family of open-weight latent diffusion models developed by Stability AI that generate images from text prompts. Unlike Midjourney and DALL-E 3, Stable Diffusion makes its model weights publicly available. You can run them locally on consumer hardware (an NVIDIA GPU or an Apple Silicon Mac; 6 GB of VRAM is enough for the older SD 1.5, while the SD 3.5 models need more), fine-tune them on custom image datasets with LoRA or DreamBooth, and integrate them into production systems via the Stability AI API or through open-source inference servers. The current generation is Stable Diffusion 3.5 (2024), which improves typography and prompt adherence over earlier versions.
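
To make "latent" concrete, here is a minimal sketch of how much smaller the tensor is that the diffusion process works on. It assumes the 8x spatial downsampling of the Stable Diffusion autoencoders, with 16 latent channels for SD 3.x and 4 for SD 1.5 and SDXL. The function names are illustrative and not part of any library.

```python
def latent_shape(height: int, width: int, channels: int = 16, downsample: int = 8):
    """Return (channels, latent_height, latent_width) for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return channels, height // downsample, width // downsample


def compression_ratio(height: int, width: int, channels: int = 16, downsample: int = 8) -> float:
    """Ratio of raw RGB values to latent values for one image."""
    c, h, w = latent_shape(height, width, channels, downsample)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))                    # (16, 128, 128)
print(compression_ratio(1024, 1024))               # 12.0 with a 16-channel latent
print(compression_ratio(1024, 1024, channels=4))   # 48.0 with a 4-channel latent
```

Denoising a 128x128 latent instead of a 1024x1024 pixel grid is what keeps generation feasible on a single consumer GPU.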

Models: SD 3.5 Large (8B), SD 3.5 Medium (2.5B), SDXL 1.0, SD 1.5 (legacy). SD 3.5 uses a Multimodal Diffusion Transformer (MMDiT) architecture.
Local inference: ComfyUI, Automatic1111 WebUI, Diffusers (Python), Ollama (SD 3.5).
API access: Stability AI API, AWS Bedrock, Replicate, Hugging Face Inference.
Fine-tuning: LoRA, DreamBooth, Textual Inversion. LoRA trains in 30-90 minutes on 10-30 images with a single A100.
Control methods: ControlNet (pose, depth, edge), IP-Adapter (image prompt), Inpainting, Outpainting.

Installation: Diffusers library

The Hugging Face diffusers library is the reference implementation. It runs on NVIDIA GPUs, Apple Silicon, and CPU (slow).

bash
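# sentencepiece and protobuf are required by the T5 tokenizer that the SD 3.x pipeline loads
pip install diffusers transformers accelerate torch sentencepiece protobuf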
python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")  # or "mps" for Apple Silicon

image = pipe(
    prompt="A dark industrial server room, red neon lights, deep shadows, editorial photography",
    negative_prompt="blurry, low quality, text, watermark",
    num_inference_steps=28,
    guidance_scale=4.5,
    height=1024,
    width=1024,
).images[0]

image.save("output.png")
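
The device string in the example above is hard-coded. As a small sketch, the helper below picks it at runtime instead. The name `pick_device` is hypothetical, and falling back to "cpu" when PyTorch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Assumption: fall back to CPU rather than fail when torch is missing.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in PyTorch builds with Apple Silicon support.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

You can then write `pipe = pipe.to(device)` and keep the same script for an NVIDIA workstation and a Mac.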

Stability AI API

For production use without local GPU infrastructure, the Stability AI REST API provides SD 3.5 access at per-image pricing.

bash
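pip install requests
python
import os
import requests

# Read the key from the environment instead of hard-coding it
# (the variable name STABILITY_API_KEY is this example's choice).
api_key = os.environ["STABILITY_API_KEY"]

response = requests.post(
    "https://api.stability.ai/v2beta/stable-image/generate/sd3",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Accept": "image/*",  # ask for raw image bytes rather than JSON
    },
    files={"none": ""},  # makes requests send multipart/form-data
    data={
        "prompt": "Austrian alpine landscape at dawn, golden hour, photorealistic, 4K",
        "negative_prompt": "blurry, oversaturated, text",
        "model": "sd3.5-medium",
        "aspect_ratio": "16:9",
        "output_format": "webp",
    },
    timeout=120,
)

# On failure the body is an error message, not an image, so stop before writing the file.
response.raise_for_status()

with open("landscape.webp", "wb") as f:
    f.write(response.content)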
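
Because every request is billed, it can help to validate the form fields before sending them. The sketch below builds the headers and form data for the request above. The helper name `build_sd3_request`, the environment variable name, and the list of accepted aspect ratios are assumptions. The list is based on my reading of the Stability AI API reference, so check it against the current documentation.

```python
import os

# Assumption: aspect ratios accepted by the endpoint; verify against the API reference.
ALLOWED_ASPECT_RATIOS = {"16:9", "1:1", "21:9", "2:3", "3:2", "4:5", "5:4", "9:16", "9:21"}


def build_sd3_request(prompt, negative_prompt="", model="sd3.5-medium",
                      aspect_ratio="1:1", output_format="png", api_key=None):
    """Return (headers, data) for a text-to-image request, or raise ValueError."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    key = api_key or os.environ.get("STABILITY_API_KEY", "")
    if not key:
        raise ValueError("no API key given and STABILITY_API_KEY is not set")
    headers = {"Authorization": f"Bearer {key}", "Accept": "image/*"}
    data = {
        "prompt": prompt,
        "model": model,
        "aspect_ratio": aspect_ratio,
        "output_format": output_format,
    }
    if negative_prompt:
        data["negative_prompt"] = negative_prompt
    return headers, data


headers, data = build_sd3_request(
    "Austrian alpine landscape at dawn",
    aspect_ratio="16:9",
    api_key="YOUR_STABILITY_API_KEY",  # placeholder, not a real key
)
print(data)
```

You would pass the result to `requests.post(url, headers=headers, files={"none": ""}, data=data)` as in the request above.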

LoRA fine-tuning for brand images

LoRA (Low-Rank Adaptation) adds a lightweight adapter, trained on your specific images, on top of the frozen base model. The result is a model that generates images in your brand style without retraining the full model.

A LoRA training run for a product brand takes 20-30 images and 30-90 minutes on a single A100.
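
To see why the adapter is lightweight: LoRA leaves a weight matrix W of shape d_out x d_in untouched and learns two small matrices, B (d_out x r) and A (r x d_in), whose product is added to W. The sketch below counts trainable parameters for one layer. The 2048 x 2048 layer size and rank 16 are illustrative values, not the real dimensions of any Stable Diffusion layer.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative layer size and rank (assumptions, not taken from a real checkpoint).
d_in, d_out, rank = 2048, 2048, 16

print(full_params(d_in, d_out))                                      # 4194304
print(lora_params(d_in, d_out, rank))                                # 65536
print(full_params(d_in, d_out) // lora_params(d_in, d_out, rank))    # 64
```

With these numbers the adapter trains 64 times fewer parameters for that layer, which is why a short run on a few dozen images is enough.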

python
import torch
from diffusers import StableDiffusion3Pipeline

# load_lora_weights needs the peft package installed (pip install peft),
# but nothing from it has to be imported here.

base_model = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16
)

# Load LoRA weights trained on your brand images
base_model.load_lora_weights("./your-brand-lora")
base_model = base_model.to("cuda")

image = base_model(
    prompt="product photo of a coffee mug, brand style, studio lighting",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]

image.save("brand-output.png")

Step 1: Choose access method. Local (full control, no per-image cost, requires GPU) or API (pay per image, no hardware). Local is better for iteration; API is better for production pipelines.
Step 2: Select model version. SD 3.5 Medium for most cases (2.5B parameters, fast). SD 3.5 Large for complex compositions and precise text in images. SDXL for maximum community LoRA availability.
Step 3: Write the prompt. SD 3.5 responds well to natural language. Describe subject, style, lighting, camera angle, and quality tags. Add a negative prompt for what to exclude.
Step 4: Fine-tune for consistency. If you need brand-consistent output across many images, train a LoRA on 20-30 reference images. Apply the LoRA adapter at inference time.
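
The prompt structure from Step 3 can be kept consistent across many images with a small helper. This is a minimal sketch. The function name `build_prompt` and the comma-separated layout are this example's choices, not a requirement of the model.

```python
def build_prompt(subject: str, style: str = "", lighting: str = "",
                 camera: str = "", quality_tags: tuple = ()) -> str:
    """Join the non-empty prompt components in a fixed order."""
    parts = [subject, style, lighting, camera, *quality_tags]
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    "a dark industrial server room",
    style="editorial photography",
    lighting="red neon lights, deep shadows",
)
print(prompt)
# a dark industrial server room, editorial photography, red neon lights, deep shadows
```

Keeping the order fixed makes it easier to compare outputs when you change only one component.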

Pricing (Stability AI API, as of June 2026)

| Model | Price per image |
| --- | --- |
| SD 3.5 Large | ~€0.065 |
| SD 3.5 Medium | ~€0.035 |
| SDXL 1.0 | ~€0.002 |
| Core (fast) | ~€0.003 |

Local inference has no per-image fee once the GPU is paid for, although electricity still costs money. An NVIDIA RTX 3080 (€500-700 used) generates 1,000+ images per day.
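
As a rough worked example, the break-even point between buying a used GPU and paying per image follows from the figures above. The sketch assumes the SD 3.5 Medium price of about €0.035 per image and ignores electricity and setup time. `Decimal` is used so the division is not affected by floating-point rounding.

```python
import math
from decimal import Decimal


def break_even_images(gpu_cost_eur: str, api_price_eur: str) -> int:
    """Number of images after which the GPU purchase costs less than the API."""
    return math.ceil(Decimal(gpu_cost_eur) / Decimal(api_price_eur))


print(break_even_images("500", "0.035"))   # 14286
print(break_even_images("700", "0.035"))   # 20000

# Time budget per image when generating 1,000 images in 24 hours.
print(86400 / 1000)                        # 86.4 seconds
```

At 1,000 images per day that is roughly two to three weeks of continuous use before the card pays for itself.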

Comparison with alternatives

| | Stable Diffusion 3.5 | DALL-E 3 | Midjourney v6 | Flux.1 |
| --- | --- | --- | --- | --- |
| Open weight | Yes | No | No | Yes (Flux.1 Dev) |
| Run locally | Yes | No | No | Yes |
| Fine-tunable | Yes (LoRA, DreamBooth) | No | No | Yes (LoRA) |
| Image quality | High | High | Very high | Very high |
| Text in images | Good (SD 3.5) | Excellent | Good | Excellent |
| ControlNet | Yes (extensive) | No | No | Partial |
| API pricing/image | ~€0.035 | ~€0.040 | N/A (subscription) | ~€0.003 (Replicate) |
| Best for | Custom pipelines, fine-tuning | GPT-4o integration | Aesthetic quality | Speed + quality |

ControlNet: spatial control over generation

ControlNet takes a reference image (pose skeleton, depth map, edge map, or semantic map) and uses it to constrain the layout of the generated image. This is essential for product photography consistency, character pose control, and architecture visualization.

python
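import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Note: this example uses the SD 1.5 Canny ControlNet, not SD 3.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    # The runwayml/stable-diffusion-v1-5 repository is no longer hosted; this is its mirror.
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Build the edge map from a reference image ("reference.png" is a placeholder path).
# Requires opencv-python; the Canny thresholds 100/200 are a common starting point.
reference = np.array(load_image("reference.png"))
edges = cv2.Canny(reference, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1 channel -> 3 channels

image = pipe(
    prompt="product photo, marble countertop, studio lighting, white background",
    image=edge_image,
    num_inference_steps=20,
).images[0]

image.save("controlnet-output.png")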

When not to use Stable Diffusion

Photorealistic faces with identity preservation: Without fine-tuning on a specific person’s face, SD 3.5 does not reproduce the same face consistently across generations. IP-Adapter or a person-specific LoRA can provide consistent character identity, but both add significant setup overhead.

Copyrighted style replication: Training a LoRA on copyrighted artwork to reproduce that style is a live legal question across multiple jurisdictions. The EU AI Act and emerging case law may make this an explicit risk by 2027.

Real-time generation at high resolution: SD 3.5 Medium generates a 1024x1024 image in 8-12 seconds on an A100. For sub-second generation at scale, Flux.1 Schnell or purpose-built inference APIs (Fireworks, Together AI) are faster.

Non-technical users who need a GUI: If the team does not write Python, ComfyUI and Automatic1111 provide browser-based GUIs that need no code, but they still have to be installed on a machine with a GPU. Midjourney or Adobe Firefly are the simpler choice for non-technical users.
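
For the real-time point above, the quoted 8-12 seconds per image translates into throughput as follows. This is a back-of-the-envelope sketch that assumes one image at a time per GPU, with no batching and no queueing overhead. The function names are illustrative.

```python
import math


def images_per_hour(seconds_per_image: float) -> float:
    """Throughput of one GPU generating images back to back."""
    return 3600 / seconds_per_image


def gpus_needed(target_images_per_second: int, seconds_per_image: int) -> int:
    """GPUs required to sustain a target rate when each image occupies one GPU."""
    return math.ceil(target_images_per_second * seconds_per_image)


print(images_per_hour(8))    # 450.0
print(images_per_hour(12))   # 300.0
print(gpus_needed(1, 8))     # 8
print(gpus_needed(1, 12))    # 12
```

Sustaining even one image per second would therefore take 8 to 12 A100s under these assumptions, which is the gap that faster models and hosted inference services aim to close.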

Further reading