Detailed guide on training embeddings on a person's likeness
via /r/StableDiffusion https://ift.tt/rqfARo9

This is a guide on how to train embeddings with textual inversion on a person's likeness. It assumes you are using the Automatic1111 Web UI to do your training, and that you know basic embedding-related terminology. This is not a step-by-step guide, but rather an explanation of what each setting does and how to fix common problems.

I've been practicing training embeddings for about a month now using these settings and have successfully made many embeddings, ranging from poor quality to very good quality. This is a collection of all the lessons I've learned, along with suggested settings to use when training an embedding to learn a person's likeness.

What is an embedding?

An embedding is a special word that you put into your prompt that will significantly change the output image. For example, if you train an embedding on Van Gogh paintings, it should learn that style and turn the output image into a Van Gogh painting. If you train an embedding on a single person, it should make all people look like that person.

Why do I want an embedding?

To keep it brief, there are 3 alternatives to using an embedding: models, hypernetworks, and LoRAs. Each has advantages and disadvantages. The main advantages of embeddings are their flexibility and small size.

A model is a 2GB+ file that can do basically anything. It takes a lot of VRAM to train and has a large file size.

A hypernetwork is an 80MB+ file that sits on top of a model and can learn new things not present in the base model. It is relatively easy to train, but is typically less flexible than an embedding when used with other models.

A LoRA (Low-Rank Adaptation) is a 9MB+ file and is functionally very similar to a hypernetwork.

An embedding is a 4KB+ file (yes, 4 kilobytes, it's very small) that can be applied to any model that shares the same base model, which is typically the base Stable Diffusion model.
It cannot learn new content; rather, it creates magical keywords behind the scenes that trick the model into creating what you want.

Preparing your starting images

Data set: your starting images are the most important thing!! If you start with bad images, you will end up with a bad embedding. Make sure your images are high quality (no motion blur, no graininess, not partially out of frame, etc.). Using more images means more flexibility and accuracy, at the expense of longer training times. Your images should have plenty of variation in them: location, lighting, clothes, expressions, activity, etc.

The embedding learns what is similar between all your images, so if the images are too similar to each other, the embedding will catch onto that and start learning mostly what's similar. I once had a data set with very similar backgrounds and it completely messed up the embedding, so make sure to use images with varied backgrounds.

When experimenting, I recommend using fewer than 10 images in order to reduce your training times, so that you can fail and iterate with different training settings more rapidly. You can create a somewhat functional embedding with as few as 1 image. You can get good results with 10, but the best answer on how many images to use is however many high-quality images you have access to. Remember: quality over quantity!!

I find that focusing on close-ups of the face produces the best results. Humans are very good at recognizing faces; the AI is not. We need to give the AI the best possible chance at recreating an accurate face, which is why we focus on face pics. I'd recommend that about half of the data set be high-quality close-ups of the face, with the rest being upper-body and full-body shots to capture things like clothing style, posture, and body shape. In the end, though, the types of images that you feed the AI are the types of images you will get back.
So if you completely focus on face pics, you'll mostly get face pic results. Curate your data set so that it represents what you want to use it for.

Do not use any images that contain more than 1 person. Just delete them; they'll only confuse the AI. You should also delete any that contain a lot of background text (like a big sign), any watermarks, and any pictures of the subject taking a selfie with their phone (the embedding will skew towards creating selfie pics if you don't remove those).

All your training images need to be the same resolution, preferably 512x512. I like to use 3 websites that help to crop the images semi-automatically:

BIRME - Bulk Image Resizing Made Easy 2.0
Bulk Image Crop
Bulk Resize Photos

No images are uploaded to these sites; the cropping is done locally.

Creating the embedding file

Initialization text: using the default of "*" is fine if you don't know what to use. Think of this as a word used in a prompt; the embedding will start out as that word. For example, if you set the initialization text to "woman" and used the embedding without any training, it should be equivalent to a prompt with the word "woman".

You can also start with a zero-value embedding. This starts with all 0's in the underlying data, meaning it has no explicit starting point. I've heard people say this gives good results, so give it a shot if you want to experiment. An update to A1111 in January enabled this functionality in the Web UI; just leave the text box blank.

In my opinion, the best initialization text to use is the word that most accurately describes your subject. For a man, use "man". For a woman, use "woman".

Number of vectors per token: a higher number means more data that your embedding can store. This is how many 'magical words' are used to describe your subject.
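Concretely, for Stable Diffusion 1.5 each vector is a 768-dimensional float tensor, which is where the file sizes come from. A rough size sketch (the 768 dimension is assumed from SD 1.5's CLIP text encoder; real .pt files add some serialization overhead on top of the raw tensor):

```python
# Rough size math for an SD 1.5 textual-inversion embedding.
# Assumptions: 768-dim token embeddings (SD 1.5's CLIP text encoder),
# stored as 32-bit floats.

EMBED_DIM = 768      # token embedding size assumed for SD 1.5
BYTES_PER_FLOAT = 4  # fp32

def embedding_payload_bytes(num_vectors: int) -> int:
    """Raw tensor size for an embedding with this many vectors per token."""
    return num_vectors * EMBED_DIM * BYTES_PER_FLOAT

# One vector is ~3 KB of raw data; with file overhead you land near
# the ~4 KB per vector mentioned in this guide.
print(embedding_payload_bytes(1))   # 3072
print(embedding_payload_bytes(10))  # 30720
```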
For a person's likeness I like to use 10, although 1 or 2 can work perfectly fine too. If prompting for something like "brad pitt" is enough to get Brad Pitt's likeness in Stable Diffusion 1.5, and it only uses 2 tokens (words), then it should be possible to capture another person's likeness with only 2 vectors per token. Each vector adds 4KB to the final size of the embedding file.

Preprocessing

Use BLIP for caption: check this. Captions are stored in .txt files with the same name as the image. After you generate them, it's a good idea (but not required) to go through them manually, edit any mistakes, and add things BLIP may have missed. The way the AI uses these captions in the learning process is complicated, so think of it this way:

1. The AI creates a sample image using the caption as the prompt.
2. It compares that sample to the actual picture in your data set and finds the differences.
3. It then tries to find magical prompt words to put into the embedding that reduce the differences.

Step 2 is the important part, because if your caption is insufficient and leaves out crucial details, then the AI will have a harder time learning the things you want it to learn. For example, if you have a picture of a woman wearing a fancy wedding dress in a church, and the caption says "a woman wearing a dress in a building", then the AI will try to learn how to turn a building into a church, and a normal dress into a wedding dress. A better caption would be "a woman wearing a white wedding dress standing in a church with a Jesus statue in the background".

To put it simply: add captions for things you want the AI to NOT learn.
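A toy way to see why step 2 matters: whatever the caption fails to mention is roughly what training is forced to pack into the embedding. A minimal illustration (pure word matching, not how training actually works; the function and example data are mine):

```python
def unexplained_details(image_contents: set, caption: str) -> set:
    """Everything in the image that the caption does not account for --
    loosely, what training will try to cram into the embedding."""
    caption_words = set(caption.lower().split())
    return {d for d in image_contents if d not in caption_words}

contents = {"woman", "church", "wedding", "dress", "statue"}

# A vague caption leaves "church", "wedding", and "statue" unexplained,
# so they leak into what the embedding learns:
print(sorted(unexplained_details(contents,
      "a woman wearing a dress in a building")))

# A thorough caption accounts for everything except the subject's
# actual likeness, which is what we want the embedding to capture:
print(sorted(unexplained_details(contents,
      "a woman wearing a white wedding dress standing in a church "
      "with a statue in the background")))
```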
It sounds counterintuitive, but basically describe everything except the person. In theory this should also mean that you should not include "a woman" in the captions, but in a test I did it did not make a difference.

Automatic1111 has an unofficial Smart Process extension that allows you to use a v2 CLIP model, which produces slightly more coherent captions than the default BLIP model. If you know how to check out specific branches of Automatic1111, you can check out this experimental branch, which includes a way to mask out everything but your subject, causing the embedding to learn only exactly what you want it to learn, which means you can skip captions altogether. If you're reading this in March or later, it may already be included in the base version.

Create flipped copies: don't check this if you are training on a person's likeness, since people are not 100% symmetrical.

Width/Height: match the width/height resolution of your training images. 512x512 is recommended, but I've used 512x640 many times and it works perfectly fine.

Don't use deepbooru for captions, since it creates anime tags in the captions, and your real-life person isn't an anime character.

Training

Learning rate: this is how fast the embedding evolves per training step. The higher the value, the faster it'll learn, but using too high a learning rate for too long can cause the embedding to become inflexible, or cause deformities and visual artifacts to start appearing in your images.

I like to think of it this way: a large learning rate is like using a sledgehammer to carve a stone statue from a large boulder. It's great for making rapid progress at the start by knocking off large pieces of stone, but eventually you need something smaller, like a hammer, for more precision, and finally a chisel to get the fine details you want.

In my experience, values around the default of 0.005 work best.
But we aren't limited to a static learning rate; we can have it change at set step intervals. This is the learning rate formula that I use:

0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005

This means that from step 1-10 it uses a learning rate of 0.05, which is pretty high. From 10-20 it's lowered to 0.02, from 20-60 to 0.01, etc. After step 3000 it'll train at 0.0005 until you interrupt it. This whole line of text can be plugged into the Embedding Learning Rate text box.

This formula tends to work well for me, but YOUR RESULTS WILL VARY depending on your data set. This, along with the number of training steps, will need to be experimented with depending on your data set.

The lower the learning rate goes, the more fine tuning ha...
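The rate:step syntax above can be sketched as a small parser, to make the schedule's behavior explicit (an illustration of how the schedule is interpreted, not A1111's actual code; the function names are mine):

```python
def parse_schedule(spec: str):
    """Parse an A1111-style schedule like "0.05:10, 0.02:20, 0.0005"
    into (rate, last_step) pairs. A final entry without ":step"
    applies until training is interrupted."""
    pairs = []
    for part in spec.split(","):
        part = part.strip()
        if ":" in part:
            rate, step = part.split(":")
            pairs.append((float(rate), int(step)))
        else:
            pairs.append((float(part), None))  # runs until interrupted
    return pairs

def rate_at(step: int, pairs) -> float:
    """Learning rate used at a given training step."""
    for rate, last in pairs:
        if last is None or step <= last:
            return rate
    return pairs[-1][0]

sched = parse_schedule("0.05:10, 0.02:20, 0.01:60, 0.005:200, "
                       "0.002:500, 0.001:3000, 0.0005")
print(rate_at(5, sched))     # 0.05  (sledgehammer phase)
print(rate_at(100, sched))   # 0.005
print(rate_at(5000, sched))  # 0.0005 (chisel phase, runs forever)
```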