April 30, 2026  ·  Updated June 2, 2026  ·  11 min read  ·  AI Training

Preparing Images for AI Training: Stable Diffusion & LoRA Tips

The short answer: use 15–30 varied photos resized to 1024×1024 for SDXL and Flux, or 512×512 for SD 1.5. Each image needs a matching caption file with a unique trigger word. The rest of this guide explains what “varied” actually means, what to exclude, and the specific mistakes that consistently produce weak models.

FastestDL's AI Prep tool processes a lot of training datasets — individual uploads, batch folders with dozens of images, everything from polished studio shoots to phone snapshots. The pattern of what produces a strong fine-tuned model versus a weak one is very consistent once you've processed enough of them. This guide documents what I've actually observed, not generic advice recycled from documentation pages.

Most people building their first training dataset focus on the wrong variable. They count images, worry about hitting a magic number, and spend time sourcing more photos when the dataset they already have is the real problem. Dataset quality and variety determine results far more than size. A 20-image dataset with genuine variation will usually outperform a 100-image dataset of near-duplicate shots, and understanding why explains most of what you need to know about good dataset preparation.

Resolution: Match the Model, Not the Image

The single most important technical decision is matching your output resolution to the base model you are fine-tuning. Models are trained at a specific native resolution and learn spatial relationships at that scale. Training at a mismatched resolution shows the model your subject at a different scale than the one it learned at, and output quality degrades as a result.

Base model                      | Native resolution | Training image size
SD 1.5 and 1.x derivatives      | 512 × 512         | 512 × 512 px
SD 2.x                          | 768 × 768         | 768 × 768 px
SDXL                            | 1024 × 1024       | 1024 × 1024 px
Flux (dev / schnell)            | 1024 × 1024       | 1024 × 1024 px
Most current SDXL-based models  | 1024 × 1024       | 1024 × 1024 px

Training a 1.5-based model on 1024px images does not produce better results — it wastes VRAM and significantly increases training time with no benefit. Conversely, training SDXL on 512px images forces the model to learn an upscaled, blurry version of your subject, which shows in the output. Check the model card of whatever base you're using before preparing your dataset.
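As a minimal sketch, the table above can be encoded as a lookup so a prep script fails early on a mismatch. The dictionary keys and both helper names are hypothetical, chosen for illustration; the values come from the table, and the model card of your base remains the authority.

```python
# Native training resolution per base-model family, taken from the table above.
# The keys are illustrative labels, not identifiers used by any trainer.
NATIVE_SIZE = {
    "sd1.5": 512,
    "sd2.x": 768,
    "sdxl": 1024,
    "flux": 1024,
}


def training_size(base_model: str) -> int:
    """Return the square edge length (px) to prepare images at."""
    key = base_model.strip().lower()
    if key not in NATIVE_SIZE:
        raise ValueError(f"Unknown base model {base_model!r}; check its model card")
    return NATIVE_SIZE[key]


def check_dataset_size(base_model: str, image_edge: int) -> str:
    """Describe how a prepared image size relates to the base model's native size."""
    native = training_size(base_model)
    if image_edge == native:
        return "ok"
    if image_edge > native:
        return "oversized: extra VRAM and training time for no benefit"
    return "undersized: the model would learn an upscaled, soft subject"
```

Calling `check_dataset_size("sd1.5", 1024)` flags the first mistake described above, and `check_dataset_size("sdxl", 512)` flags the second.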

Advanced trainers use multi-resolution bucket training, which groups images by aspect ratio and trains at multiple sizes simultaneously. If your training framework supports it, you can include non-square images and let the trainer handle bucketing. For most people starting out, square images at the correct native resolution is the simpler and more reliable approach.
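To make bucketing concrete, here is a minimal sketch of the assignment step, assuming a small hand-picked bucket list. The `BUCKETS` values and both function names are illustrative assumptions; each training framework builds its own bucket list and may resize or crop differently.

```python
# Illustrative buckets, each close to 1024 * 1024 pixels in area.
# Real trainers generate their own list; these values are an assumption.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def group_by_bucket(sizes, buckets=BUCKETS):
    """Group (width, height) pairs by the bucket they would be assigned to."""
    groups = {}
    for width, height in sizes:
        groups.setdefault(nearest_bucket(width, height, buckets), []).append((width, height))
    return groups
```

A 3000×2000 landscape photo lands in the 1216×832 bucket in this sketch, while a square image stays at 1024×1024. A bucket that ends up holding only one or two images is a hint that square crops may be the simpler route for that dataset.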

Cropping: What You Keep and What You Lose

Unless your trainer uses the aspect-ratio bucketing described above, images need to be square. Without bucketing, non-square images get stretched or letterboxed by the trainer, which teaches the model distorted proportions. There are two ways to make a rectangular image square, and the right choice depends on what the image contains:

Center crop takes a square from the middle of the image, discarding the sides or top and bottom equally. The subject stays at native resolution — no upscaling, no quality loss — but anything outside the crop is gone. This is the right choice when the subject is centred and the edges are background. For portraits and product shots where the subject fills the frame, center crop consistently produces clean results.

Padding keeps the full image and adds black or white bars to fill the square. The whole frame is preserved, but the model also trains on the solid-colour bars — which are not natural image content. For images where the subject is near the edges or where framing matters (a full-body shot, a wide scene), padding avoids cutting out important content. The downside is subtle: if many images in your dataset share the same padding colour and pattern, the model can weakly associate that visual with your subject.

Practical rule: Use center crop for portraits, headshots, and close-up product photography. Use padding when full-body or full-scene framing is important and center-cropping would cut off something meaningful. Mixing both in the same dataset is fine — the model will handle it.
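The geometry behind both modes is simple enough to sketch. The two functions below are hypothetical helpers that only compute coordinates and do not touch pixel data. The crop box uses the (left, upper, right, lower) order that image libraries such as Pillow accept in `Image.crop`, so it could be passed straight through, though that wiring is left out here.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) of the largest centred square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


def pad_layout(width, height, target):
    """Fit the whole image inside a target x target square, keeping its aspect ratio.

    Returns (new_width, new_height, x_offset, y_offset): the scaled size and
    the top-left position on the square canvas that centres the image, so the
    bars on opposite sides are equal (to within one pixel).
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return (new_w, new_h, (target - new_w) // 2, (target - new_h) // 2)
```

For a 1920×1080 frame, center crop keeps the middle 1080×1080 region and discards 420 px from each side. Padding to 1024 scales the frame to 1024×576 and leaves a 224 px bar above and below.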

Dataset Size: The Number That Matters Less Than You Think

The question I get most often: "How many images do I need?" The honest answer is that for a subject LoRA — a specific person, character, object, or product — 15 to 30 high-quality varied images is the right range for most fine-tuning workflows. Beyond 30, returns diminish rapidly unless the additional images genuinely add new variation that wasn't represented before.

Style LoRAs work differently. A style is more abstract than a subject, so the model needs to see the style applied across a wide range of content types before it generalises reliably. Style datasets typically benefit from 50 to 150 images that span different subjects, compositions, and lighting conditions, all rendered in the same style.

Subject LoRA (person, character, product)
Image count: 15–30  |  Priority: Maximum variety across angle, lighting, background, distance
Style LoRA (artistic style, rendering style)
Image count: 50–150  |  Priority: Wide range of subjects all treated in the target style
Dreambooth (full fine-tune, not LoRA)
Image count: 5–15  |  Priority: Very high quality, clean backgrounds preferred, strong subject visibility

Variety: The Variable That Actually Determines Results

This is the most important section in this guide. A model learns by finding what is consistent across your training images. Everything that varies between images gets treated as background noise. Everything that stays constant gets reinforced as a property of the subject.

This means if every photo in your dataset is a forward-facing portrait with the same lighting and a plain white background, the model learns: "this subject always faces forward, has this lighting, appears on a white background." When you try to generate the subject from a different angle or in a different environment, the model struggles — because it was never taught that those properties are variable.

The axes of variation that matter most for a subject LoRA:

The near-duplicate trap: Taking 30 photos in the same session, same room, same angle, slightly different expression is not 30 varied images — it is 1 scenario repeated 30 times. This is the most common dataset mistake I see. The resulting model is brittle: it generates the subject well in that one scenario and poorly everywhere else. Spreading shoots across different days, locations, and conditions produces far stronger models even at the same image count.
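One way to catch this before training is to label each image by hand and count how many distinct scenarios the dataset really contains. The sketch below assumes you tag every shot along the four axes named in the Dataset Size section (angle, lighting, background, distance). The dict layout and the `variety_report` name are illustrative, not part of any tool.

```python
from collections import Counter

# Axes taken from the subject LoRA priority list earlier in this guide.
AXES = ("angle", "lighting", "background", "distance")


def variety_report(shots):
    """Count distinct values per axis and distinct full scenarios.

    `shots` is a list of dicts holding one hand-assigned label per axis.
    """
    per_axis = {axis: len({shot[axis] for shot in shots}) for axis in AXES}
    scenarios = Counter(tuple(shot[axis] for axis in AXES) for shot in shots)
    return {
        "images": len(shots),
        "distinct_per_axis": per_axis,
        "distinct_scenarios": len(scenarios),
        # Size of the biggest group of images sharing every label.
        "largest_repeat": max(scenarios.values(), default=0),
    }
```

Thirty shots that all carry the same four labels report `distinct_scenarios` of 1 and `largest_repeat` of 30, which is the near-duplicate trap expressed as numbers.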

Source Photo Quality

The model can only learn what is visible in the training images. A blurry photo does not teach sharp detail — it teaches blur. A heavily compressed phone screenshot does not teach texture — it teaches compression artifacts. The quality ceiling of your output model is the quality floor of your training dataset.

Images worth including:

Images to exclude:

Caption Files: Format and Trigger Words

Each training image needs a matching caption file — a plain text file with the same filename as the image but a .txt extension. The caption tells the model what is in that image. Without captions, or with incorrect captions, the model cannot learn to associate your subject with a specific concept it can later be invoked with.

A caption has two parts: a trigger word and descriptors.

The trigger word is a unique, rare term that the model will learn to associate with your subject. It needs to be something that does not already carry meaning in the model's vocabulary — common English words like man, woman, or dog are already loaded with the model's prior knowledge of those concepts, which competes with what you're trying to teach. Made-up words or unusual letter combinations work best: ohwx, brfk, zxrt. After training, you invoke the subject by including the trigger word in your prompt.

The descriptors following the trigger word describe what is specific to that image, not what is always true of the subject. If the subject is always a person, don't include person in every caption. If a specific image shows them sitting outdoors in a blue jacket, the caption describes that: ohwx, sitting, outdoor, blue jacket, natural light. This per-image variation is what teaches the model the distinction between the invariant subject and the variable context.
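The caption format described above is easy to script. This is a minimal sketch using only the Python standard library; the function names and the set of image extensions are assumptions for illustration, and your trainer may expect a different caption extension or separator.

```python
from pathlib import Path

# Extensions treated as training images in this sketch (an assumption).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_caption(trigger, descriptors):
    """Join the trigger word and per-image descriptors into one comma-separated line."""
    parts = [trigger.strip()] + [d.strip() for d in descriptors if d.strip()]
    return ", ".join(parts)


def write_caption(image_path, trigger, descriptors):
    """Write the caption next to the image: same filename, .txt extension."""
    caption_path = Path(image_path).with_suffix(".txt")
    caption_path.write_text(build_caption(trigger, descriptors), encoding="utf-8")
    return caption_path


def images_missing_captions(folder):
    """List images in `folder` that have no matching .txt caption file."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES and not p.with_suffix(".txt").exists()
    )
```

Running `images_missing_captions` on the dataset folder right before training is a cheap way to confirm every image has its pair.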

Need to resize and prepare your training images to 512×512 or 1024×1024? FastestDL's free AI Image Prep tool handles single images and entire folders — center crop, pad, and caption files included. No signup, ZIP download.


Frequently Asked Questions

Can I train on AI-generated images instead of real photos?

It depends on what you're training for. For style LoRAs, using AI-generated images as training data is common and can work well — you're teaching the model a rendering style, and synthetically generated images in that style are valid examples of it.

For subject LoRAs (a real person, a specific product, a character), AI-generated images have a significant problem: they carry the artefact signature of the model that generated them. Every AI model has characteristic patterns in how it renders skin texture, hair strands, specular highlights, and fine detail. When you train on those images, you're teaching your LoRA those rendering characteristics, not just the subject. The result tends to generate the subject correctly but in a way that looks like it came from the source model, not from a neutral base.

If real photos are unavailable, the best approach is to use high-quality generations at the full native resolution of your base model (no upscaling), select only images with no visible artefacts or reconstruction errors, and keep the dataset small — 10 to 15 images. A small, clean synthetic dataset trains better than a large one full of the source model's rendering habits. Mixing in even 3 to 5 real photos alongside synthetic images noticeably improves generalisability.

My LoRA produces the subject but the quality looks soft or plastic. What went wrong?

This is almost always a dataset or resolution problem, not a training hyperparameter problem. Work through these causes in order:

  1. Wrong generation resolution. This is the most common cause and the easiest to miss. If you trained on an SDXL base but you're generating at 512px, the output will look soft regardless of dataset quality. Always generate at the native resolution of your base model.
  2. Low-resolution or compressed source images. The model can't learn sharpness from blurry photos. If your source images were phone screenshots, heavily compressed downloads, or anything below 800px before resizing, the model learned a soft look. Replace them with higher-quality sources.
  3. No lighting variety in the dataset. If every photo was shot in the same flat indoor light, the model learned that even lighting as part of the subject's appearance. Add images shot in directional natural light, overcast conditions, or artificial side lighting.
  4. Dataset dominated by shallow depth-of-field shots. When most images have a blurred background, the model learns blur as part of the scene. Include images where the background is sharp.
  5. Denoising strength or CFG too high at inference. If you're using a denoising strength above 0.7 for img2img, or high CFG values, the output can be over-processed into a plastic look. Try lowering the denoising strength and setting CFG to 6–7, then see if the quality improves.
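Cause 2 in the list above can be checked before training rather than diagnosed afterwards. The sketch below reads the pixel dimensions from a PNG file's header using only the standard library and flags anything whose shorter side is under 800 px. It handles PNG only; JPEG dimensions need marker parsing, so in practice an image library is the easier route. The function names are hypothetical.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Read (width, height) from a PNG's IHDR chunk without decoding the image."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def too_small(path, min_side=800):
    """True if the image's shorter side is below `min_side` pixels."""
    return min(png_size(path)) < min_side
```

Images flagged by `too_small` are candidates for replacement with a higher-quality source, not for upscaling.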

How do regularisation images work and do I need them?

Regularisation images are general images of the same class as your subject: photos of people in general if you're training a person LoRA, photos of dogs in general if you're training a specific dog. Their purpose is to prevent language drift, the problem where the model over-associates the entire class concept with your subject, so that your subject's features start appearing in outputs for generic class prompts where they shouldn't.

Whether you need them depends on the length and intensity of your training:

Most training UIs (kohya_ss, A1111 Dreambooth extension, ComfyUI training nodes) have a dedicated regularisation image directory field. Point it at your class images folder and the trainer handles the rest.

Does it matter which file format the training images are in?

Most trainers accept JPEG and PNG. PNG is lossless and avoids compression artefacts entirely, which matters for images with fine texture — hair, fabric, skin detail. For high-stakes training runs that you plan to share or publish, converting source images to PNG before training is worth the extra storage. For casual personal LoRAs, high-quality JPEG (quality 90+) produces results indistinguishable from PNG in practice.

The format to avoid: JPEG images that have been resaved multiple times at low quality. Each compression pass degrades the data — visible blocking artefacts in the source images will be reproduced in the model's output. If your source images are already compressed, use FastestDL's image compressor to check what quality level they're at before including them. For a full breakdown of how JPEG, PNG, WebP, and AVIF compare in quality and file size, see the image format comparison guide.
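A file's extension does not always match its contents, for example after a rename or a re-export. As a small sketch, the container format can be identified from the first bytes of the file. The `sniff_format` name is hypothetical, and the AVIF check only looks at the major brand, so some valid AVIF files would come back as unknown.

```python
def sniff_format(data):
    """Identify an image container from the leading bytes of a file."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    # ISO base media files carry b"ftyp" at offset 4, followed by the major brand.
    if data[4:8] == b"ftyp" and data[8:12] in (b"avif", b"avis"):
        return "avif"
    return "unknown"
```

Reading the first 12 bytes of each file is enough for this check, for example `sniff_format(open(path, "rb").read(12))`. Note that it says nothing about how many times a JPEG has been resaved; it only confirms what kind of file you are actually feeding the trainer.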

What is the difference between LoRA and Dreambooth training?

Both are fine-tuning techniques but they work at different levels of the model:

For most people preparing a training dataset today, LoRA is the correct technique. The dataset preparation is identical for both — the differences are in the training configuration, not the images.

How do I prepare my images step by step using FastestDL?

The full workflow using FastestDL's AI Image Prep tool:

  1. Collect your source images — aim for 15–30 varied photos for a subject LoRA. Check them against the quality criteria above and exclude blurry, compressed, or duplicate shots.
  2. Choose your output size — 1024×1024 for SDXL, Flux, or any current SDXL-based model. 512×512 for SD 1.5 or older derivatives.
  3. Choose your crop mode — Center Crop for portraits and close-ups where the subject fills the frame. Pad Black or Pad White when you need to preserve the full composition.
  4. Enable caption files — enter your trigger word. The tool generates a base caption file for each image. Edit the individual captions afterwards to add per-image descriptors (pose, setting, lighting, clothing).
  5. Upload as a folder or ZIP — the tool processes the full batch and delivers a single ZIP ready to extract into your training directory.

About this article: Written and maintained by Jesse Mola, the person behind FastestDL, a free online file processing tool. The guidance here is based on direct experience preparing and processing AI training datasets, combined with observations from the image datasets submitted to FastestDL's AI Prep tool. I update this guide as training frameworks and recommended practices evolve.

Published by FastestDL