2021-10-22 16:27:11
Google AI's SimVLM: Pre-Training a Weakly Supervised Visual Language Model
Visual language modeling involves grounding language understanding in visual inputs, which is useful for developing products and tools. For example, an image captioning model generates natural language descriptions based on an understanding of what an image depicts. Over the past few years, significant progress has been made in visual language modeling thanks to the introduction of vision-language pre-training (VLP).
This approach aims to learn a single feature space from both visual and language inputs. For this purpose, VLP often uses an object detector such as Faster R-CNN, trained on labeled object-detection datasets, to extract regions of interest, and relies on task-specific approaches to jointly learn image and text representations. Such approaches require annotated datasets, or the time to label them, and are therefore hard to scale.
To solve this problem, Google AI researchers propose a minimalist and efficient VLP method called SimVLM (Simple Visual Language Model). SimVLM is trained end-to-end with a single objective, similar to language modeling, on a huge number of weakly aligned image-text pairs, i.e. the text paired with an image is not necessarily an accurate description of that image.
The simplicity of SimVLM enables efficient training on such a scalable dataset, helping the model achieve state-of-the-art performance across six vision-language benchmarks. In addition, SimVLM learns a unified multimodal representation that enables robust cross-modality transfer, with no fine-tuning or with fine-tuning on text-only data, for tasks including open-ended visual question answering, image captioning, and multimodal translation.
Unlike BERT-style VLP methods that apply masked pre-training procedures, SimVLM adopts a sequence-to-sequence framework and is trained with a single prefix language modeling objective (PrefixLM), which receives the leading part of a sequence (the prefix) as input and predicts its continuation. For example, the sequence "a dog is chasing after a yellow ball" might be randomly truncated to the prefix "a dog is chasing", and the model predicts its continuation. The concept of a prefix applies equally to images: an image is divided into a sequence of patches, a subset of which is fed into the model as input. For multimodal input (images and their captions), the prefix in SimVLM is the concatenation of the sequence of image patches and the text prefix, received by the encoder. The decoder then predicts the continuation of the text sequence.
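To make the PrefixLM setup concrete, here is a minimal Python sketch of how a training example might be constructed; the function and variable names are hypothetical placeholders for illustration, not the actual SimVLM code.

import random

def make_prefixlm_example(image_patches, caption_tokens):
    # Randomly truncate the caption: everything before the cut is the
    # text prefix, everything after it is the prediction target.
    cut = random.randint(1, len(caption_tokens) - 1)
    text_prefix, continuation = caption_tokens[:cut], caption_tokens[cut:]
    # Multimodal prefix for the encoder: image patch tokens followed by
    # the text prefix; the decoder must predict the continuation.
    encoder_input = list(image_patches) + text_prefix
    return encoder_input, continuation

tokens = "a dog is chasing after a yellow ball".split()
patches = ["<patch_%d>" % i for i in range(4)]  # stand-ins for patch embeddings
encoder_input, target = make_prefixlm_example(patches, tokens)
# encoder_input -> patches + e.g. ['a', 'dog', 'is', 'chasing']
# target        -> e.g. ['after', 'a', 'yellow', 'ball']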
This design maximizes the model's flexibility and versatility in adapting to different task settings. Built on the Transformer architecture proven in BERT and ViT, the model can accept raw images directly as input. It also applies a convolution stage consisting of the first three ResNet blocks to extract contextualized patches, which works better than the naive linear projection of the original ViT model, as the sketch below illustrates.
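As an illustration of that convolution stage, the following PyTorch sketch builds a patch extractor from the first three residual stages of a torchvision ResNet-50; the exact ResNet variant and tensor sizes are assumptions for the example, not details confirmed by the post.

import torch
from torchvision.models import resnet50

resnet = resnet50()
# Keep the stem plus the first three residual stages (layer1-layer3),
# dropping layer4, global pooling, and the classification head.
conv_stage = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)

images = torch.randn(2, 3, 224, 224)      # dummy batch of raw images
feature_map = conv_stage(images)          # shape: (2, 1024, 14, 14)
# Flatten the spatial grid into a sequence of contextualized "patch"
# embeddings for the Transformer, instead of ViT's naive linear
# projection of raw pixel patches.
patch_tokens = feature_map.flatten(2).transpose(1, 2)  # (2, 196, 1024)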
The model is pre-trained on large-scale image and text datasets. The ALIGN dataset, containing about 1.8 billion noisy image-text pairs, was used for image-text training; for text-only data, the Colossal Clean Crawled Corpus (C4) of 800GB of web documents was used. Testing has shown the model to be successful even without supervised fine-tuning: SimVLM achieved captioning quality close to that of fully supervised methods.
https://ai.googleblog.com/2021/10/simvlm-simple-visual-language-model-pre.html