Generative Deep Learning Repo

This repository documents different generative learning approaches using the Keras library and tutorials for synthetic data, Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play, and Hugging Face. This repo implements the following models:



Traversing Along Stable Diffusion's Latent Space

dogs drinking coffee in outer space overlooking earth

Gif created using LatentSpaceGifMaker using 1 text prompt of dogs drinking coffee in outer space overlooking earth with random walk and circular walk enabled using 12 random steps, step size of 0.005, cfg_scale of 7.5, batch size of 3 and num of diffusion steps of 25

dogs drinking coffee in outer space overlooking earth

Gif created using LatentSpaceGifMaker using 1 text prompt of dogs drinking coffee in outer space overlooking earth with circular walk enabled using 12 random steps, cfg_scale of 7.5, batch size of 3 and num of diffusion steps of 25

dogs drinking coffee in outer space overlooking earth

Gif created using LatentSpaceGifMaker using 1 text prompt of dogs drinking coffee in outer space overlooking earth with random walk enabled using 12 random steps, cfg_scale of 7.5, batch size of 3 and num of diffusion steps of 25
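The two walk modes named in the captions above can be sketched roughly as follows. This is a minimal illustration, not the code inside LatentSpaceGifMaker: the function names `circular_walk` and `random_walk` are hypothetical, the random walk is assumed to add Gaussian perturbations scaled by the step size, and a tiny 8-dimensional list stands in for the real diffusion noise tensor.

```python
import math
import random


def circular_walk(x, y, steps):
    """Return `steps` vectors tracing a circle in the plane spanned by x and y."""
    frames = []
    for i in range(steps):
        theta = 2.0 * math.pi * i / steps
        # cos/sin weights blend the two noise vectors; theta = 0 gives x back.
        frames.append([math.cos(theta) * a + math.sin(theta) * b for a, b in zip(x, y)])
    return frames


def random_walk(start, steps, step_size, rng):
    """Return `steps` vectors, each a small Gaussian perturbation of the previous one."""
    frames = [list(start)]
    for _ in range(steps - 1):
        prev = frames[-1]
        frames.append([v + step_size * rng.gauss(0.0, 1.0) for v in prev])
    return frames


rng = random.Random(0)
dim = 8  # assumption: toy size, the real latent is far larger
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
y = [rng.gauss(0.0, 1.0) for _ in range(dim)]

circle = circular_walk(x, y, steps=12)
walk = random_walk(x, steps=12, step_size=0.005, rng=rng)
```

Each vector in `circle` or `walk` would seed one frame of the gif. Because the circular walk returns to its starting point, it tends to loop smoothly, while the random walk drifts away from the start in small increments.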


Textual Inversion of Stable Diffusion's Embedding Space using Non-Style prompts

Input Images:

alt text

Generated Images and Prompts Used:

Prompt: an oil painting of {placeholder_token}

alt text

Generated Images created using StabeDiffusion-TextualInversion

Prompt(s): "man in fancy suit with {placeholder_token} walking in New York", "high quality, highly detailed, elegant, sharp focus", "character concepts, mystery, adventure"

alt text

alt text

Generated Images created using MyPersonalizedWeights


Combining Stable Diffusion's Textual Embedding Space with its Image Manifold through Textual Inversion and non-style prompts

My pre-trained weights can be found here and must be loaded into layer two of Stable Diffusion/CLIP's text encoder before generating images/gifs.

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 7.5; walk_steps = 60; batch_size = 3; noise_start = normal distribution; diffusion_noise = scaled cos/sin; num_of_Diffusion_steps = 25; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 7.9; walk_steps = 60; batch_size = 3; noise_start = normal distribution; diffusion_noise = scaled cos/sin; num_of_Diffusion_steps = 25; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 7.9; walk_steps = 60; batch_size = 3; noise_start = normal distribution; diffusion_noise = scaled cos/sin; num_of_Diffusion_steps = 50; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 8; walk_steps = 60; batch_size = 3; noise_start = normal distribution; diffusion_noise = scaled cos/sin; num_of_Diffusion_steps = 50; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 8; walk_steps = 60; batch_size = 3; noise_start = normal distribution; diffusion_noise = unscaled; num_of_Diffusion_steps = 50; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 8; walk_steps = 60; batch_size = 3; noise_start = (technically) None; diffusion_noise = (technically) None; num_of_Diffusion_steps = 50; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 8; walk_steps = 60; batch_size = 3; noise_start = (technically) None; diffusion_noise = (technically) None; num_of_Diffusion_steps = 50; negative_prompt = None; frame_per_seconds = 10

Prompt: man with {placeholder_token} in fancy suit in a red ferrari driving in Frankfurt high quality, highly detailed, elegant, sharp focus

Gif created using MyPersonalizedWeights with the following hyperparameters/configurations: cfg_scale = 8; walk_steps = 60; batch_size = 3; noise_start = normal distribution; diffusion_noise = scaled by min_freq 1, max freq 1000; num_of_Diffusion_steps = 50; negative_prompt = None; frame_per_seconds = 10
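For readers unfamiliar with cfg_scale, the sketch below shows the usual classifier-free guidance blend that this kind of parameter controls in Stable Diffusion samplers. It is an illustration under assumptions, not the code behind MyPersonalizedWeights: `apply_guidance` is a hypothetical helper, and short lists stand in for the noise predictions of the denoiser.

```python
def apply_guidance(uncond, cond, cfg_scale):
    """Blend unconditional and prompt-conditioned noise predictions.

    cfg_scale = 1 returns the conditioned prediction unchanged; larger values
    push the result further along the direction (cond - uncond).
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions (assumption: real predictions are latent-sized tensors).
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 1.0]
guided = apply_guidance(uncond_pred, cond_pred, cfg_scale=7.5)
```

Under this formula, moving cfg_scale from 7.5 to 8 changes the result only slightly, which may explain why those settings produce visually similar gifs.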



Stable Diffusion Image-to-Image Application

The left image is the input image; the right image is the newly generated image based on the prompt, negative prompt, strength, and guidance.

Prompt: wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k

alt text

Prompt: my face with afro hairstyle

alt text

Prompt: black king with crown sitting on throne holding sword, detailed, fantasy, dark, Pixar, Disney, 8k

alt text

Images created using StableDiffusion-Image2Image
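To give a feel for what the strength setting does, here is a small sketch of how strength is commonly turned into a number of denoising steps. It assumes the convention used by Hugging Face diffusers image-to-image pipelines; whether the StableDiffusion-Image2Image notebook follows it exactly is an assumption, and `img2img_schedule` is a hypothetical helper name.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (steps_skipped, steps_run) for a diffusers-style image-to-image pass.

    The input image is noised to an intermediate timestep, and only the
    remaining `steps_run` denoising steps are executed.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


schedule = img2img_schedule(num_inference_steps=50, strength=0.75)
```

With 50 steps and a strength of 0.75, 37 denoising steps run and 13 are skipped. A strength of 1.0 runs every step, so very little of the input image survives, while a low strength stays close to the input.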

Stable Video Diffusion

Original Image

Gif created using StableVideoDiffusion-Image2Video using the text prompt: "suba diver swimming in ocean next to sharks, detailed, photo-realistic, 8k"

Generated Gifs

Gif created using StableVideoDiffusion-Image2Video using the following hyperparameters: motion_bucket_id=100, noise_aug_strength=0.02, latents=None

Gif created using StableVideoDiffusion-Image2Video using the following hyperparameters: motion_bucket_id=127, noise_aug_strength=0.1, latents=None

Gif created using StableVideoDiffusion-Image2Video using the following hyperparameters: motion_bucket_id=200, noise_aug_strength=0.02, latents=None

Gif created with textual inversion of my face using Hugging Face's textual inversion tutorial as found in the notebook HuggingFace_textualInversion_Myweights

Gif created with textual inversion of my face using Hugging Face's textual inversion tutorial as found in the notebook HuggingFace_textualInversion_Myweights

Combining Keras Model Weights with PyTorch Model

The previous gifs were created using Hugging Face's textual inversion tutorial with the default parameters of the script and the model ID runwayml/stable-diffusion-v1-5. After training for 1 hour with my placeholder token, I was able to generate very basic images with the prompt(s): "man with {placeholder_token}" or "man with {placeholder_token} swimming." I later used the high-quality images in Hugging Face's implementation of Stable Video Diffusion to create the gifs seen above. However, I noticed that even with prompt weighting and different values of guidance_scale, I was not able to generate an accurate image using long prompts such as this: "man with {placeholder_token} in fancy suit driving ferrari on highway in Berlin, side view." There could be many reasons for this (probably something I am missing in the training script with regard to embedding longer prompts).

With that being said, I wanted to generate images of me driving a nice car in a fancy (bougie) suit, so I used my pretrained weights from Combining Stable Diffusion's Textual Embedding Space with its Image Manifold through Textual Inversion and non-style prompts to generate a decent image and fed that image into Hugging Face's implementation of Stable Video Diffusion. The end result was acceptable (except it was an old ferrari, but at least it gave me some cool glasses lol).

Image created with pretrained weights from the Keras tutorial on textual inversion, using the images of my face as found in the section Combining Stable Diffusion's Textual Embedding Space with its Image Manifold through Textual Inversion and non-style prompts, with the following parameters/prompts: negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy", prompt="man with {placeholder_token} in fancy suit driving ferrari on highway in Berlin, side view", unconditional_guidance_scale=12, num_steps=100

Gif created with SVD with the following parameters: motion = 100, augmentation = 0.02 and latent/pre-generated = None

Gif created with SVD with the following parameters: motion = 50, augmentation = 0.02 and latent/pre-generated = None

Gif created with SVD with the following parameters: motion = 50, augmentation = 0.02 and latent/pre-generated = torch.normal(0, 1, size=(1, 25, 4, 72, 128), generator=generator, dtype=torch.float16)
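The size (1, 25, 4, 72, 128) passed to torch.normal above can be read as (batch, frames, latent channels, latent height, latent width). The sketch below shows one way that shape could be derived. It assumes a 576x1024 output, 25 frames, 4 latent channels, and a VAE downscaling factor of 8; these values are inferred from the shape, not stated in this repo, and `svd_latent_shape` is a hypothetical helper.

```python
def svd_latent_shape(height, width, num_frames, batch_size=1,
                     latent_channels=4, vae_scale_factor=8):
    """Return the (batch, frames, channels, h, w) shape of pre-generated latents."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be divisible by the VAE scale factor")
    return (
        batch_size,
        num_frames,
        latent_channels,
        height // vae_scale_factor,  # spatial dims shrink by the VAE factor
        width // vae_scale_factor,
    )


# Assumed frame size and frame count; they reproduce the shape used above.
shape = svd_latent_shape(height=576, width=1024, num_frames=25)
```

Computing the shape from the frame size instead of hard-coding it makes it easier to keep the pre-generated latents consistent if the output resolution or frame count changes.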

Image created with Keras-SVD from the Keras tutorial on textual inversion, using the images of my face as found in the section Combining Stable Diffusion's Textual Embedding Space with its Image Manifold through Textual Inversion and non-style prompts, with the following parameters/prompts: negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy", prompt="man with {placeholder_token} in fancy suit dancing", unconditional_guidance_scale=12, num_steps=100

Gif created with Keras-SVD with the following parameters: motion = 50, augmentation = 0.02 and latent/pre-generated = None

Gif created with Keras-SVD with the following parameters: motion = 60, augmentation = 0.02 and latent/pre-generated = None

Image created with Keras-SVD from the Keras tutorial on textual inversion, using the images of my face as found in the section Combining Stable Diffusion's Textual Embedding Space with its Image Manifold through Textual Inversion and non-style prompts, with the following parameters/prompts: negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy", prompt="my {placeholder_token} on face of group of dancing monkeys", unconditional_guidance_scale=10, num_steps=100

Gif created with Keras-SVD with the following parameters: motion = 50, augmentation = 0.02 and latent/pre-generated = None