2022-05-04 00:26:15
#86.3:
"High-Resolution Image Synthesis with Latent Diffusion Models" by Robin Rombach, Andreas Blattmann et al.
Continued from the message above
Experiment insights / Key takeaways:
- Baselines: LSGM, ADM, StyleGAN, ProjectedGAN, DC-VAE
- Datasets: ImageNet, CelebA-HQ, LSUN-Churches, LSUN-Bedrooms, MS-COCO
- Metrics: FID, Perception-Recall
- Qualitative: x4-x8 compression is the sweet spot for ImageNet
- Quantitative: LDMs > LSGM; new SOTA FID of 5.11 on CelebA-HQ; all scores beat other diffusion models (at 1/2 the model size and 1/4 the compute) except on LSUN-Bedrooms, where ADM is better
- Additional: the model scales up to 1024x1024 and can be used for inpainting, super-resolution, and semantic synthesis. There are many more experimental details, but that is the 5-minute gist.
Possible Improvements:
- LDMs are still much slower than GANs at sampling time
- Pixel-perfect accuracy is a bottleneck for LDMs in certain tasks (which ones?).
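The pixel-accuracy point can be made concrete with a toy sketch. This is not the paper's actual autoencoder (which is a learned KL/VQ-regularized model); here an x8 average-pooling "encoder" and nearest-neighbour "decoder" stand in for it, just to show how a single-pixel detail gets smeared across a whole latent cell:

```python
import numpy as np

def encode(x, f=8):
    """Toy encoder: f-fold average pooling (stand-in for the LDM autoencoder)."""
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def decode(z, f=8):
    """Toy decoder: nearest-neighbour upsampling back to pixel space."""
    return np.repeat(np.repeat(z, f, axis=0), f, axis=1)

# A single bright pixel on a dark background: a "pixel-perfect" detail.
x = np.zeros((64, 64))
x[10, 10] = 1.0

x_rec = decode(encode(x))
# The detail is spread over an 8x8 block instead of one pixel
# (1/64 of the original intensity survives at the original location),
# so any task that needs exact pixel values pays a reconstruction penalty
# before the diffusion model even runs.
print(x_rec[10, 10])
```

A learned autoencoder reconstructs far better than this caricature, but the principle holds: whatever the encoder discards, the latent diffusion model can never put back.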
My Notes:
- (Naming: 3.5/5) The name "LDM" is as straightforward as the method the paper proposes. It is an easy-to-pronounce acronym, but not a word and definitely not a meme.
- (Reader Experience - RX: 3/5) Right away, kudos for explicitly listing the core contributions right where they belong: at the end of the introduction. I am going to dock a point for visually inconsistent figures; they are all over the place. Moreover, the small font size in the tables is very hard to read, especially with how packed they are. And why are the images so tiny? Can you even make out what is on Figure 8? What is the purpose of including figures that you can't read? It would probably be better to cut one or two so the rest can be larger. Finally, the results table is hard to parse because each dataset uses a different set of baselines in a different order.
- I can’t help but draw parallels between the Latent Diffusion and StyleNeRF papers: both sandwich an expensive operation (diffusion and ray marching, respectively) between a convolutional encoder and decoder, reducing compute and memory by performing that operation in a spatially condensed latent space.
- Let’s think for a second: what other ideas from DNR & StyleNeRF could further improve diffusion models? One idea I can see being useful is "NeRF path regularization", which, in DM terms, would mean training a low-resolution DM alongside a high-resolution LDM and adding a loss that matches subsampled pixels of the LDM output to the pixels of the DM output.
- It should be possible to interpolate between codes in the learned latent space. Not sure how exactly this could be used, but it is probably worth looking into
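The interpolation idea from the last note could look something like this. The latent shape and the choice of spherical interpolation (slerp, a common heuristic for Gaussian-like latents) are my assumptions for illustration, not details from the paper; the final step would be decoding each point along the path with the LDM decoder:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent codes.

    A common heuristic for roughly Gaussian latents; plain linear
    interpolation would also serve as a baseline.
    """
    z0f, z1f = z0.ravel(), z1.ravel()
    cos_omega = np.dot(z0f, z1f) / (np.linalg.norm(z0f) * np.linalg.norm(z1f))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(omega, 0.0):  # codes are (anti)parallel; fall back to z0
        return z0
    s = np.sin(omega)
    out = (np.sin((1 - t) * omega) / s) * z0f + (np.sin(t * omega) / s) * z1f
    return out.reshape(z0.shape)

rng = np.random.default_rng(0)
z_a = rng.standard_normal((4, 8, 8))  # hypothetical latent code of image A
z_b = rng.standard_normal((4, 8, 8))  # hypothetical latent code of image B

# Five points along the path, endpoints included.
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 5)]
# Decoding each point with the LDM decoder should yield a smooth
# semantic morph between the two images, analogous to GAN latent walks.
```

Whether the diffusion latent space is as smoothly traversable as a GAN's is exactly the open question worth looking into.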
Links:
Paper / Code
Thanks for reading! If you found this paper digest useful, subscribe and share this post to support Casual GAN Papers!
- Tip Casual GAN Papers on KoFi to help this community grow!
- Join telegram chat / discord
- Visit the CGP web blog!
- Follow on Twitter
- Visit the library
By: @casual_gan
P.S. DM me papers to cover!
@KirillDemochkin