Making Visual Art With GANs

MUNIT.ipynb

MUNIT stands for Multimodal UNsupervised Image-to-image Translation. It can convert images of cats to images of dogs, or images of horses to images of zebras. Basically, it learns to convert one set of images to another set.

To get started, I selected 30 photos of sunflowers and 30 photos of cats and used them as testA and testB, respectively, to convert sunflowers into cats.

In the MUNIT Colab file, I set my folder path to /content/drive/MyDrive/cat2sunflower and put testA, testB, trainA, and trainB inside it.

In the more detailed data options, I set input_dim_a and input_dim_b to 3 because my images are all RGB, so there are only three color channels. I set num_workers to 6 because I was worried about running out of memory and didn't need that much parallel data loading. I set new_size to 1024 and crop_image_height and crop_image_width to 256 to keep the output images a uniform size and get results quickly.
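To make those data options concrete, here is a minimal sketch of how they could be set programmatically. The config filename is a placeholder for my copy of MUNIT's YAML config, and the sketch assumes PyYAML is available.

```python
# Sketch only: edit the MUNIT YAML config from Python.
# "configs/cat2sunflower.yaml" is a hypothetical name for my copy of the config file.
import yaml

CONFIG_PATH = "configs/cat2sunflower.yaml"

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

cfg["data_root"] = "/content/drive/MyDrive/cat2sunflower"  # folder holding trainA/trainB/testA/testB
cfg["input_dim_a"] = 3          # domain A images are RGB (3 channels)
cfg["input_dim_b"] = 3          # domain B images are RGB (3 channels)
cfg["num_workers"] = 6          # fewer data-loading workers to avoid running out of memory
cfg["new_size"] = 1024          # resize before cropping
cfg["crop_image_height"] = 256  # uniform crop height
cfg["crop_image_width"] = 256   # uniform crop width

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(cfg, f)
```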

These are the outputs translated from cat to sunflower:

These are the outputs translated from sunflower to cat:

pytti 5 beta.ipynb

The PyTTI-Tools Colab can generate images from text and iterate on them continuously. Along the way, the iterated images are combined into a video according to the parameters the user sets.

First, in scenes, I set the keywords I want to generate and use "|" to juxtapose prompts within a scene; if I want to add the image the scene should transition to next, I separate it with "||". In scene_prefix, I put the style I want my images to resemble. scene_suffix is mainly used to indicate what I want to avoid; for example, to keep too many map-like features out of my image, I penalize them with map:-1:-.95. steps_per_scene is set to 60100, which determines how many iterations run in total and therefore how many images are generated, and so affects the length of my final video. The last item is a link to images in the style I want mine to resemble.
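As a rough sketch, those prompt fields end up looking something like the assignments below. The variable names mirror the notebook's form fields, the prompt text and style link are placeholders rather than my exact settings, and the name of the image-link field is my assumption.

```python
# Sketch of the PyTTI prompt fields (placeholder text, not my exact prompts).
scenes = "wandering ghosts of medieval theaters | crimson stage || indescribable dolls"
scene_prefix = "artstation art by Nicolas Klug | "  # style every scene should lean toward
scene_suffix = " | map:-1:-.95"                     # penalize map-like features
steps_per_scene = 60100                             # total iterations; drives how many frames exist
direct_image_prompts = "https://example.com/style-reference.jpg"  # field name assumed; placeholder link
```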

For the video details, I set the resolution to 1280×720 and pixel_size to 1, because the image doesn't need to be enlarged any further; 1 is the ideal value at this stage. I kept the other values the same because they work well.

For the animation, I set steps_per_frame to 50 and frames_per_second to 30. This means that after 60,100 iterations I get 1,202 images, which make up a roughly 40-second video at 30 frames per second.
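A quick arithmetic check of how those settings turn into video length (plain Python, no PyTTI involved):

```python
# How the animation settings translate into frames and duration.
steps_per_scene = 60100
steps_per_frame = 50
frames_per_second = 30
width, height, pixel_size = 1280, 720, 1             # the video settings from above

total_frames = steps_per_scene // steps_per_frame     # 1202 frames
duration_seconds = total_frames / frames_per_second   # ~40 seconds
print(total_frames, round(duration_seconds, 1))       # 1202 40.1
```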

Here I changed translate_z_3d from (50+10*t) to (50+3*t) so that the transition doesn't move so fast that it gives the viewer a sense of vertigo. near_plane was originally 1 and far_plane was originally 10,000; I changed them to 200 and 8,000 out of habit, but it doesn't make much difference to the result.
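The 3D motion fields take expressions in t (time into the scene), so the change is easy to compare numerically. A small sketch:

```python
# Compare the default z-drift with my slower version at a few time points.
default_z = lambda t: 50 + 10 * t  # original translate_z_3d = (50+10*t)
slower_z = lambda t: 50 + 3 * t    # my setting   translate_z_3d = (50+3*t)

for t in (0, 5, 10):
    print(f"t={t}: default {default_z(t)}, slower {slower_z(t)}")

near_plane, far_plane = 200, 8000  # changed from 1 and 10,000 out of habit; little effect on the result
```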

Now we can run the code, and it outputs a line of JSON that looks like: {"scenes": "wandering ghosts of medieval theaters | crimson stage | indescribable dolls", "scene_prefix": "artstation art by Nicolas Klug | Concept & UI Artist…….}. Just copy that into the next cell, and after 17-23 hours, depending on the settings, we get the result:
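Before pasting that line into the next cell, it can be worth checking that it is valid JSON. A minimal sketch (the settings string below is truncated, for illustration only):

```python
# Sanity-check the emitted settings line before pasting it into the next cell.
import json

settings_line = (
    '{"scenes": "wandering ghosts of medieval theaters | crimson stage | indescribable dolls", '
    '"scene_prefix": "artstation art by Nicolas Klug | Concept & UI Artist"}'
)  # truncated example, not the full settings

params = json.loads(settings_line)  # raises an error if the paste got mangled
print(sorted(params.keys()))        # e.g. ['scene_prefix', 'scenes']
```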

Intro to ML and Colab

In the first two weeks, I learned about different types of GANs in class, drew inspiration from them, gained a deeper understanding of Colab, learned its basic operations, and learned how to read and use other people's models.

As an after-class experiment, I decided to start experimenting with three different models as foundational practice.

3D Photography using Context-aware Layered Depth Inpainting

This model proposes a method for converting a single RGB-D input image into a 3D photo, i.e., a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view. I first tried a photo of an astronaut because it is relatively monotone in color and has an uncomplicated background.

It can be seen that the effect is still excellent against a simple background. Next, I decided to use a photo with a more complex environment, whose colors are much richer than those of the monotone moon. I think this is challenging for this model.

Obvious smears can be seen here, but the overall performance is still excellent. I believe that in subsequent iterations, these light and shadow issues will be optimized.

Aphantasia

Next, I tried the Aphantasia model, which is slightly different from the last one: that model takes images and stylizes them into video, while this one generates massive, detailed imagery from text, à la Deep Dream.

Based on this, I let the machine run 300 iterations: that is enough to see apparent changes, and I also wanted to watch the image keep evolving across those 300 iterations. Below is the 1280×720 result after 300 iterations.

Generating Piano Music with Transformer

In the two attempts above, I tried image-to-video and text-to-image generation. For the last attempt, I decided to try generating music, and Generating Piano Music with Transformer can help me with that. The models used here were trained on over 10,000 hours of piano recordings from YouTube. Unlike the original Music Transformer paper, this notebook uses attention based on absolute instead of relative position.

First, I let it generate a piano performance from scratch, which serves as the basis for my next steps, and here is the generated result.

Next, I submitted the music I wanted it to build on. I chose Canon in D, firstly because it is a piano piece I like very much, and secondly because, compared with Canon in C, I think this piece may stand out more after synthesis. Here is the outcome:

Week 3: Playing with text-to-image and style transfer models

Text-To-Image

For text-to-image, I first tried Hypertron v2 because I wanted the generated image to stay under control. Unlike the text-to-image model I tried last week, this one can constrain the generation with a picture I upload, while rendering that image in the style of the text I input.

My input image is of some sheep in the mountains, and my input text is "Do Androids Dream of Electric Sheep?", one of my favorite sci-fi novels. I supposed the image would come out in a cyber style, but it also carries the vibe of a dream.

I find that more specific text prompts work better with CLIP: naming particular objects and concrete imagery helps the model better understand the content it should generate. To test this idea, I tried another model, IllusTrip: Text to Video 3D. It is an iterative model like Aphantasia and a more purely text-to-image model.
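To illustrate what "more specific prompts work better in CLIP" means, here is a minimal sketch that scores a vague prompt against a specific one for the same image using OpenAI's CLIP package; the image filename and prompts are placeholders.

```python
# Compare how strongly CLIP matches a vague vs. a specific prompt to the same image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("sheep.jpg")).unsqueeze(0).to(device)  # placeholder filename
prompts = ["a dream", "sheep grazing on a green mountainside at dusk"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for prompt, p in zip(prompts, probs):
    print(f"{p:.2f}  {prompt}")  # the more specific prompt usually matches the image more strongly
```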

This time I input more specific content and made the adjectives match the general mood, so the overall prompt reads more logically and has a complete storyline.

But unfortunately, it generates the image first and then applies the depth effect. It is conceivable that the underlying image would look better after more training, but much of the computing power goes to the depth effect instead.

Style Transfer

I think images with a particularly well-defined style and relatively simple structure work best in the style transfer model. I used the model to train five sets of pictures; here I only show the most significant results. I think what these two successful pictures have in common is that their style images have distinctive theme colors and a very uniform overall atmosphere, and the target image's colors are relatively simple, with uncomplicated areas to recolor.
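The exact notebook I used isn't shown here, but the same idea can be sketched with TensorFlow Hub's arbitrary image stylization model; the filenames below are placeholders.

```python
# Minimal style-transfer sketch using TF-Hub's arbitrary stylization model
# (not the exact notebook I used; filenames are placeholders).
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from PIL import Image

def load_image(path, max_dim=512):
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_dim, max_dim))
    arr = np.array(img, dtype=np.float32) / 255.0  # scale to [0, 1]
    return arr[np.newaxis, ...]                    # add a batch dimension

content = load_image("target.jpg")                 # the image to be restyled
style = load_image("style.jpg", max_dim=256)       # the style reference; 256 px works well for this model

model = hub.load("https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")
stylized = model(tf.constant(content), tf.constant(style))[0]

Image.fromarray(np.uint8(stylized[0].numpy() * 255)).save("stylized.jpg")
```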