

In January 2021, OpenAI presented a new ML model: a neural network called DALL·E that creates images from natural-language text captions. It has 12 billion parameters and is based on GPT-3. DALL·E was trained to generate images from text descriptions using a dataset of text–image pairs. It can create anthropomorphized versions of animals and objects, combine unrelated concepts in plausible ways, render text, and apply transformations to existing images.
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of up to 1280 tokens and is trained with maximum likelihood to generate all of the tokens, one after another. A token is any symbol from a discrete vocabulary; for example, each English letter is a token from a 26-letter alphabet. DALL·E's vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented by at most 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented by 1024 tokens with a vocabulary size of 8192.
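To make the token accounting concrete, here is a minimal sketch of how a caption and an image could be packed into that single 1280-token stream. The padding id, helper name, and tensor layout are assumptions for illustration, not OpenAI's actual implementation:

```python
# Hypothetical packing of a caption (<=256 BPE tokens, vocab 16384) and an
# image (1024 codes, vocab 8192) into one 1280-token training sequence.
import torch

TEXT_LEN, TEXT_VOCAB = 256, 16384     # BPE-encoded caption tokens
IMAGE_LEN, IMAGE_VOCAB = 1024, 8192   # 32x32 grid of discrete image codes

def build_stream(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate padded text tokens and image tokens into one 1280-token stream."""
    assert image_tokens.numel() == IMAGE_LEN
    # Pad (or truncate) the caption to exactly 256 positions (pad id 0 is an assumption).
    padded = torch.zeros(TEXT_LEN, dtype=torch.long)
    n = min(text_tokens.numel(), TEXT_LEN)
    padded[:n] = text_tokens[:n]
    # Offset image codes so text and image ids occupy disjoint ranges
    # of a shared vocabulary (16384 text entries + 8192 image entries).
    return torch.cat([padded, image_tokens + TEXT_VOCAB])  # shape: (1280,)

# During training, the transformer predicts every position of this stream
# autoregressively under a maximum-likelihood (cross-entropy) objective.
```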
The images are preprocessed to 256×256 resolution during training. Similarly to a VQ-VAE, each image is compressed to a 32×32 grid of discrete latent codes by a discrete VAE that is pretrained with a continuous relaxation, which obviates the need for an explicit codebook, an EMA loss, or dead-code revival. This trick also scales to large vocabulary sizes, and it allows DALL·E both to generate an image from scratch and to regenerate any rectangular region of an existing image that extends to the bottom-right corner, consistent with the text prompt.
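A rough sketch of that discrete-VAE idea is below: a convolutional encoder maps a 256×256 image to a 32×32 grid of logits over 8192 codes, and a Gumbel-softmax relaxation keeps the discrete choice differentiable during pretraining. The layer sizes and class name are illustrative assumptions, not the published architecture:

```python
# Toy discrete-VAE encoder: 256x256 image -> 32x32 grid of code logits,
# relaxed with Gumbel-softmax so no explicit codebook, EMA loss, or
# dead-code revival is needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiscreteVAEEncoder(nn.Module):
    def __init__(self, vocab_size: int = 8192):
        super().__init__()
        # Three stride-2 convolutions downsample 256x256 -> 32x32 (factor 8).
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, vocab_size, 4, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor, tau: float = 1.0, hard: bool = False):
        logits = self.net(image)                        # (B, 8192, 32, 32)
        # Continuous relaxation of the per-cell categorical code choice.
        soft_codes = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=1)
        token_ids = logits.argmax(dim=1)                # (B, 32, 32) discrete codes
        return soft_codes, token_ids

# Usage sketch:
# enc = ToyDiscreteVAEEncoder()
# soft, ids = enc(torch.randn(1, 3, 256, 256))
# ids.flatten(1).shape -> (1, 1024): the image tokens fed to the transformer.
```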
The attention mask at each of its 64 self-attention layers allows every image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens and sparse attention for the image tokens, with a row, column, or convolutional attention pattern depending on the layer. The embeddings are produced by an encoder pretrained with a contrastive loss, not unlike CLIP.
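The following is an assumption-level sketch of how one such mask could be built: standard causal masking over the 256 text positions, every image token free to attend to all text tokens, and a simplified "row" pattern among the image tokens. It illustrates the idea rather than reproducing OpenAI's exact masks:

```python
# Illustrative attention mask for one "row attention" layer.
# True = attention allowed from query position (row) to key position (column).
import torch

TEXT_LEN, GRID = 256, 32
IMAGE_LEN = GRID * GRID
TOTAL = TEXT_LEN + IMAGE_LEN  # 1280

def row_attention_mask() -> torch.Tensor:
    mask = torch.zeros(TOTAL, TOTAL, dtype=torch.bool)
    # Causal attention among the text tokens.
    mask[:TEXT_LEN, :TEXT_LEN] = torch.tril(torch.ones(TEXT_LEN, TEXT_LEN)).bool()
    # Every image token attends to all text tokens.
    mask[TEXT_LEN:, :TEXT_LEN] = True
    # Row pattern: each image token attends to earlier tokens in its own grid row.
    for q in range(IMAGE_LEN):
        row_start = (q // GRID) * GRID
        mask[TEXT_LEN + q, TEXT_LEN + row_start : TEXT_LEN + q + 1] = True
    return mask

# Column and convolutional patterns follow the same recipe with a different
# set of allowed image-to-image positions, varied from layer to layer.
```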
https://openai.com/blog/dall-e/