
Casual GAN Papers

Channel address: @casual_gan
Categories: Technologies
Language: English
Subscribers: 1.25K
Description from channel

🔥 Popular deep learning & GAN papers explained casually!
📚 Main ideas & insights from papers to stay up to date with research trends
⭐️ New posts every Tue and Fri
⏰ Reading time <10 minutes
patreon.com/casual_gan
Admin/Ads:
@KirillDemochkin

Ratings & Reviews

3.00 average from 2 reviews (one 4-star, one 2-star).


The latest Messages

2022-05-04 00:26:58 #86.4: “High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al.

As always, here is the visual summary!

(All figures are taken from the original paper)

***
Tip Casual GAN Papers on KoFi to help this community grow!
2022-05-04 00:26:29 #86.0: “High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al.

Hi everyone!

We are almost all caught up with diffusion and text-to-image goodness!

Best,
-Kirill
2022-05-04 00:26:15 #86.3: “High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al.

Continued from the message above

Experiment insights / Key takeaways:
- Baselines: LSGM, ADM, StyleGAN, ProjectedGAN, DC-VAE
- Datasets: ImageNet, CelebA-HQ, LSUN-Churches, LSUN-Bedrooms, MS-COCO
- Metrics: FID, Precision-Recall
- Qualitative: x4-x8 compression is the sweet spot for ImageNet
- Quantitative: LDMs > LSGM, new SOTA FID of 5.11 on CelebA-HQ; all scores (with half the model size and a quarter of the compute) are better than those of other diffusion models, except on LSUN-Bedrooms, where ADM is better
- Additional: the model scales up to 1024x1024 images and can be used for inpainting, super-resolution, and semantic synthesis. There are a lot of details about the experiments, but that is the 5-minute gist.


Possible Improvements:
- LDMs are still much slower than GANs
- Pixel-perfect accuracy is a bottleneck for LDMs in certain tasks (Which ones?).

My Notes:
- (Naming: 3.5/5) The name “LDM” is as straightforward as the problem discussed in the paper. It is an easy-to-pronounce acronym, but not a word and definitely not a meme.
- (Reader Experience - RX: 3/5) Right away, kudos for explicitly listing all of the core contributions of the paper right where they belong - at the end of the introduction. I am going to dock a point for visually inconsistent figures; they are all over the place. Moreover, the small font size in the tables is very hard to read, especially with how packed the tables appear. Also, why are the images so tiny? Can you even make out what is on Figure 8? What is the purpose of putting in figures that you can’t read? It would probably be better to cut one or two out to make the rest more readable. Finally, the results table is very hard to read, because different baselines in a different order are used for different datasets.

- I can’t help but draw parallels between the Latent Diffusion and StyleNeRF papers - both sandwich an expensive operation (diffusion & ray marching) between a convolutional encoder and decoder to reduce computational costs and memory requirements by performing the operation in a spatially condensed latent space.
- Let’s think for a second: what other ideas from DNR & StyleNeRF could further improve diffusion models? One idea I can see being useful is the “NeRF path regularization”, which, in terms of DMs, means training a low-resolution DM alongside a high-resolution LDM and adding a loss that matches subsampled pixels of the LDM to the pixels in the DM (see the sketch after these notes).
- It should be possible to interpolate between codes in the learned latent space. Not sure how exactly this could be used, but it is probably worth looking into.
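
Purely as a back-of-the-envelope illustration of that speculative path-regularization-style idea (nothing here is from the paper; all names are hypothetical), the loss could look something like this:

```python
import torch
import torch.nn.functional as F

def path_reg_loss(hires_ldm_sample, lowres_dm_sample):
    # hires_ldm_sample: (B, 3, H, W) decoded output of the high-resolution LDM
    # lowres_dm_sample: (B, 3, h, w) output of a low-resolution pixel-space DM
    sub = F.interpolate(hires_ldm_sample, size=lowres_dm_sample.shape[-2:],
                        mode="bilinear", align_corners=False)   # subsample the LDM output
    return F.mse_loss(sub, lowres_dm_sample)                    # match it to the low-res DM
```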

Links:
Paper / Code

Thanks for reading! If you found this paper digest useful, subscribe and share this post to support Casual GAN Papers!

- Tip Casual GAN Papers on KoFi to help this community grow!
- Join telegram chat / discord
- Visit the CGP web blog!
- Follow on Twitter
- Visit the library

By: @casual_gan
P.S. DM me papers to cover!
@KirillDemochkin
2022-05-04 00:26:15 #86.2: “High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al.

Continued from the message above

Main Ideas:

1. Perceptual Image Compression:
Authors train an autoencoder that outputs a tensor of continuous latent codes. The latent embedding is regularized with vector quantization placed inside the decoder. This is a slight but important change from VQGAN: the underlying diffusion model works with continuous latent codes, and the quantization only happens afterwards, during decoding.
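
To make the “continuous latents, quantize later” point concrete, here is a toy PyTorch sketch (not the authors' code; module sizes and names are made up): the diffusion model only ever sees the continuous output of encode, and the nearest-codebook lookup happens inside decode.

```python
import torch
import torch.nn as nn

class PerceptualCompressor(nn.Module):
    def __init__(self, channels=3, latent_dim=4, codebook_size=8192):
        super().__init__()
        # Toy stand-ins for the convolutional encoder/decoder (x4 downsampling here)
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, channels, 4, stride=2, padding=1),
        )

    def encode(self, x):
        # Continuous latents: this is the space the diffusion model is trained in
        return self.encoder(x)

    def decode(self, z):
        # Quantization happens only here, inside the decoder:
        # snap each latent vector to its nearest codebook entry, then decode
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        return self.decoder(z_q)

ae = PerceptualCompressor()
z = ae.encode(torch.randn(1, 3, 64, 64))   # the LDM would add/remove noise on z
x_hat = ae.decode(z)
```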

2. Latent Diffusion Models:
As the second part of the two-stage training approach, a diffusion model is trained inside the learned latent space of the autoencoder. I won’t go into details about how the diffusion itself works, as I have covered it in a previous post. What you need to know here is that the denoising model is a UNet that predicts the noise that was added to the latent codes in the previous step of the diffusion process.
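
As a rough picture of what “diffusion in the latent space” means in practice, here is a hedged sketch of a single training step (a generic DDPM-style noise-prediction objective; `unet` and `encode` are placeholders, and the noise schedule is an arbitrary choice, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ldm_training_step(unet, encode, x0):
    z0 = encode(x0)                                  # pixels -> continuous latent codes
    t = torch.randint(0, T, (z0.shape[0],))          # random diffusion timestep per sample
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise     # forward (noising) process in latent space
    pred = unet(z_t, t)                              # the UNet predicts the added noise
    return F.mse_loss(pred, noise)                   # simple noise-prediction objective
```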

3. Conditioning Mechanisms:
Authors utilize domain-specific encoders and cross-attention layers to control the generative model with additional information. Conditions of various modalities, such as text, are passed through their own encoders, and the results get incorporated into the generative process via cross-attention with flattened features from the intermediate layers of the UNet.
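
A small sketch of how such cross-attention conditioning could be wired up (assumed shapes and a plain nn.MultiheadAttention stand-in, not the paper's exact implementation): queries come from the flattened UNet features, keys and values from the condition encoder.

```python
import torch
import torch.nn as nn

class CrossAttentionCond(nn.Module):
    def __init__(self, feat_dim, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, feats, cond):
        # feats: (B, C, H, W) intermediate UNet features; cond: (B, L, cond_dim) encoded condition
        b, c, h, w = feats.shape
        q = feats.flatten(2).transpose(1, 2)             # flatten to (B, H*W, C) queries
        out, _ = self.attn(q, cond, cond)                # keys/values come from the condition encoder
        return out.transpose(1, 2).reshape(b, c, h, w)   # back to a spatial feature map

feats = torch.randn(2, 64, 16, 16)
text = torch.randn(2, 77, 512)                           # e.g. 77 caption tokens from a text encoder
out = CrossAttentionCond(64, 512)(feats, text)
```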

Post continues in the next message

By: @casual_gan
P.S. DM me papers to cover!
@KirillDemochkin
2022-05-04 00:26:15 #86.1: “High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al.

Keywords:
#faster_diffusion_models #no_CLIP_reranking #classifier_free_guidance

At a glance:
One of the cleanest pitches for a paper I have seen: diffusion models are way too expensive to train in terms of memory, time, and compute, so let’s make them lighter, faster, and cheaper.

As for the details, let’s dive in, shall we?

Paper difficulty:

Prerequisites:
(Highly recommended reading to understand the core contributions of this paper):
1) Diffusion Models (ADM)
2) VQGAN

Motivation:
Diffusion models (DMs) have a more stable training phase than GANs and fewer parameters than autoregressive models, yet they are just really resource intensive. The most powerful DMs require up to 1000 V100-days to train (that’s a lot of $$$ for compute) and about a day per 1000 inference samples. The authors of Latent Diffusion Models (LDMs) pinpoint this problem to the high dimensionality of the pixel space, in which the diffusion process occurs, and propose to perform it in a more compact latent space instead. In short, they achieve this feat by pretraining an autoencoder model that learns an efficient, compact latent space that is perceptually equivalent to the pixel space. A DM sandwiched between the convolutional encoder and decoder is then trained inside the latent space in a more computationally efficient way.

In other words, this is a VQGAN with a DM instead of a transformer (and without a discriminator).

Post continues in the next message

By: @casual_gan
P.S. Want to promote your paper? Contact me!
@KirillDemochkin
2022-04-26 01:18:03 #85.4: “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al.

As always, here is the visual summary!

(All figures are taken from the original paper)

***
Tip Casual GAN Papers on KoFi to help this community grow!
2022-04-26 01:16:53 #85.3: “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al.

Continued from the message above

Experiment insights / Key takeaways:
- Baselines: DALL-E, LAFITE, XMC-GAN (second best), DF-GAN, DM-GAN, AttnGAN
- Datasets: MS-COCO
- Metrics: Human perception, CLIP score, FID, Precision-Recall
- Qualitative: Classifier-free guided samples look visually more appealing than CLIP-guided images. GLIDE has compositional and object-centric properties.
- Quantitative: Classifier-free guidance is nearly Pareto optimal in terms of FID vs IS, Precision vs Recall, and CLIP score vs FID. The takeaway is that CLIP-guidance finds adversarial samples for CLIP instead of the most realistic ones.

Possible Improvements:
- From the model card: “Despite the dataset filtering applied before training, GLIDE (filtered) continues to exhibit biases that extend beyond those found in images of people.”
- Unrealistic and out-of-distribution prompts are not handled well, meaning that GLIDE samples are limited by the concepts present in the training data.

My Notes:
- (Naming: 4/5) Memorable but not a meme.
- (Reader Experience - RX: 3/5) The samples are presented in a very clean and consistent manner (except for Figure 5, which does not fit on the screen - an issue because the models are arranged row-wise, so you need to scroll back and forth to compare samples across models), but the strange order and naming of the paper sections and the lack of an architecture overview figure threw me for a loop. Moreover, the structure of the paper is quite unorthodox: most of the information about the proposed method is actually hidden in the background section rather than the typical “The Proposed Method” section, which is simply called “Training” here and contains configuration details I would expect to see at the beginning of the “Experiments” section.

- Classifier-free guidance reminds me of the good ol’ truncation trick from StyleGAN
- Props to the authors for citing Katherine Crowson
- TBH I wonder, how the heck does 64x64 CLIP even work? I don’t think I could compare images to captions at that resolution with my own eyes, let alone a model
- Not sure how I feel about the whole “this model is not safe, hence we won’t release it” narrative that OpenAI is trying to spin, since they clearly intend to monetize these huge AI models.

Links:
Paper / Code

Thanks for reading! If you found this paper digest useful, subscribe and share this post to support Casual GAN Papers!

- Tip Casual GAN Papers on KoFi to help this community grow!
- Join telegram chat / discord
- Visit the CGP web blog!
- Follow on Twitter
- Visit the library

By: @casual_gan
P.S. DM me papers to cover!
@KirillDemochkin
2022-04-26 01:16:53 #85.2: “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al.

Continued from the message above

Main Ideas:

1. Model:
The authors take the ADM (the standard diffusion model) architecture and extend it with text conditioning information. This is done by passing the caption through a transformer model and using the encoded vector in place of the class embedding. Additionally, the last layer of token embeddings is projected to the dimensionality of every attention layer of the ADM model, and concatenated to the attention context of that layer.
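
Here is a hedged sketch of the “concatenate projected token embeddings to the attention context” part (an illustrative module, not the actual ADM/GLIDE code): the image tokens attend over both themselves and the projected caption tokens.

```python
import torch
import torch.nn as nn

class TextAugmentedAttention(nn.Module):
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.to_ctx = nn.Linear(text_dim, dim)   # project caption tokens to this layer's width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, text_tokens):
        # feats: (B, N, dim) flattened image features; text_tokens: (B, L, text_dim)
        ctx = torch.cat([feats, self.to_ctx(text_tokens)], dim=1)  # append text to the context
        out, _ = self.attn(feats, ctx, ctx)      # image tokens attend over image + text context
        return out

feats = torch.randn(2, 256, 512)                 # a flattened 16x16 feature map of width 512
tokens = torch.randn(2, 128, 768)                # caption token embeddings from the transformer
out = TextAugmentedAttention(512, 768)(feats, tokens)
```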

2. Classifier-free guidance:
After the initial training routine, the model is fine-tuned for classifier-free guidance with 20% of the captions replaced with empty sequences. This enables the model to synthesize both conditional and unconditional samples. Hence, during inference, two outputs are sampled from the model in parallel: one is conditioned on the text prompt, while the other is unconditional and gets extrapolated towards the conditional sample at each diffusion step with a predetermined magnitude.
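
In code, that guidance step amounts to something like this minimal sketch (the `model(x, t, emb)` signature is an assumption, not GLIDE's actual API; `guidance_scale` plays the role of the "predetermined magnitude"):

```python
def guided_eps(model, x_t, t, text_emb, empty_emb, guidance_scale=3.0):
    eps_cond = model(x_t, t, text_emb)       # prediction conditioned on the caption
    eps_uncond = model(x_t, t, empty_emb)    # prediction conditioned on the empty sequence
    # extrapolate the unconditional prediction towards the conditional one
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```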

3. Image Inpainting:
The simplest way to do inpainting with diffusion models is to replace the known region of the image with a noised version of the input after each sampling step. However, this approach is prone to edge artifacts, since the model cannot see the entire context during the sampling process. To combat this effect, the authors of GLIDE fine-tune the model by erasing random regions of the input images and concatenating the remaining regions with a mask channel.
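
For reference, a sketch of that “simplest way” baseline, assuming generic `denoise_step` and `add_noise` helpers (GLIDE's actual fine-tuned variant instead feeds the masked image and mask as extra input channels):

```python
def naive_inpaint_step(x_t, x_known, mask, t, denoise_step, add_noise):
    # mask == 1 where pixels are known (kept), 0 where they should be generated
    x_t = denoise_step(x_t, t)                   # one reverse-diffusion step on the whole image
    x_known_t = add_noise(x_known, t)            # the known pixels, noised to the same level t
    return mask * x_known_t + (1 - mask) * x_t   # overwrite the known region after every step
```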

4. Noised CLIP:
The authors of GLIDE noticed that CLIP-guided diffusion cannot handle intermediate samples very well, since they are noised and most likely fall outside the distribution of the pretrained, publicly available CLIP model. As a pretty simple fix, they train a new CLIP model on noised images to make the diffusion-guiding process more robust.
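
As a rough picture of how a noise-aware CLIP would then be used for guidance (the encoder interface and scaling are assumptions, not GLIDE's exact procedure): the gradient of the caption/image similarity with respect to the noisy sample steers each sampling step.

```python
import torch

def clip_guidance_grad(noised_clip_image_enc, text_feat, x_t, scale=100.0):
    x_t = x_t.detach().requires_grad_(True)
    img_feat = noised_clip_image_enc(x_t)            # image encoder trained on noised images
    sim = torch.cosine_similarity(img_feat, text_feat, dim=-1).sum()
    return scale * torch.autograd.grad(sim, x_t)[0]  # gradient that nudges x_t towards the caption
```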

Post continues in the next message

By: @casual_gan
P.S. DM me papers to cover!
@KirillDemochkin
2022-04-26 01:16:53 #85.1: “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al.

Keywords:
#faster_diffusion_models #no_CLIP_reranking #classifier_free_guidance

At a glance:
“Diffusion models beat GANs”. While true, the statement comes with several ifs and buts, not to mention that the math behind diffusion models is not for the faint of heart. Still, GLIDE, an OpenAI paper from last December, took a big step towards making it true in every sense. Specifically, it introduced a new guidance method for diffusion models that produces higher quality images than even DALL-E, which uses expensive CLIP reranking. And if that wasn’t impressive enough, GLIDE models can be fine-tuned for various downstream tasks such as inpainting and text-based editing.

As for the details, let’s dive in, shall we?

Paper difficulty:

Prerequisites:
(Highly recommended reading to understand the core contributions of this paper):
1) Diffusion Models (ADM)
2) CLIP

Motivation:
It used to be with diffusion models that you could boost quality at the cost of some diversity with the classifier guidance technique. However, vanilla classifier guidance requires a pretrained classifier that outputs class labels, which is not very suitable for text. Recently though, a new classifier-free guidance approach was introduced. It came with two advantages: the model uses its own knowledge for guidance instead of relying on an external classifier, and it greatly simplifies guidance when it isn’t possible to directly predict a label, which should sound familiar to fans of text-to-image models.

Post continues in the next message

By: @casual_gan
P.S. Want a post about your paper? Contact me!
@KirillDemochkin
2022-04-26 01:16:45 #85.0: “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al.

Hi everyone!

A bit off-schedule, but it is here. Enjoy my summary and review of GLIDE!

Best,
-Kirill