
LAION-5B: open dataset for multi-modal ML for 5+ billion text-image pairs
On May 31, 2022, LAION, a non-profit organization of AI researchers, presented the largest open dataset of image-text pairs: 5.85 billion samples filtered using CLIP. LAION-5B is 14 times larger than its predecessor, LAION-400M, which was previously the world's largest open image-text dataset.
About 2.3 billion pairs have English captions, another 2.3 billion span more than 100 other languages, and the remaining 1.3 billion have captions whose language could not be detected. The release also includes several nearest-neighbor indices, an improved web interface for exploration and subsetting, and per-sample watermark and NSFW scores. The dataset has not been curated, so it is recommended for research purposes only.
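CLIP filtering keeps a pair when the cosine similarity between its image and text embeddings clears a threshold. A minimal sketch of that criterion on precomputed embeddings (the 0.28 default mirrors the threshold reported for the English subset; the embedding values below are made up for illustration):

```python
import numpy as np

def clip_keep(image_emb, text_emb, threshold=0.28):
    """Return True when the pair passes the CLIP similarity filter.

    Cosine similarity between the image and text embeddings; the 0.28
    default is an assumed threshold for the English subset, not read
    from the data itself.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    return cos >= threshold

# Toy embeddings (made up): a well-aligned pair and an orthogonal one.
print(clip_keep([0.9, 0.1, 0.4], [0.8, 0.2, 0.5]))   # aligned -> True
print(clip_keep([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal -> False
```

Real LAION filtering computes these embeddings with the ViT-B/32 CLIP model; here they are passed in precomputed to keep the sketch self-contained.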
The full 5.85-billion-pair dataset is split into three subsets (laion2B-en, laion2B-multi, and laion1B-nolang), each of which can be downloaded separately. All share the following column structure:
• URL - image URL
• TEXT - caption, in English for en, in the source language for multi and nolang
• WIDTH - image width
• HEIGHT - image height
• LANGUAGE - sample language, laion2B-multi only, detected with cld3
• similarity - cosine similarity between the text and image ViT-B/32 embeddings, computed with CLIP for en and with mCLIP for multi and nolang
• pwatermark - probability that the image is watermarked, computed with the LAION watermark detector
• punsafe - probability that the image is unsafe, computed with the LAION CLIP-based detector
pwatermark and punsafe are partly available as separate collections that must be joined with the metadata via a url+text hash.
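The join works because each pair is identified by hashing its url and text together. A minimal sketch with toy rows in the column layout above (all values are made up; MD5 of the raw concatenation is an assumption here, since the dataset description only says "url+text hash", not which function or separator):

```python
import hashlib

def pair_key(url, text):
    # Join key derived from url + text; MD5 of the plain concatenation
    # is an assumption, not the dataset's documented hash function.
    return hashlib.md5((url + text).encode("utf-8")).hexdigest()

# Toy metadata rows (values made up for illustration).
rows = [
    {"url": "https://example.com/a.jpg", "text": "a red bicycle",
     "width": 640, "height": 480, "similarity": 0.31},
    {"url": "https://example.com/b.jpg", "text": "stock photo",
     "width": 800, "height": 600, "similarity": 0.29},
]

# Separate pwatermark / punsafe collections keyed by the same hash.
pwatermark = {pair_key(r["url"], r["text"]): p for r, p in zip(rows, [0.05, 0.92])}
punsafe = {pair_key(r["url"], r["text"]): p for r, p in zip(rows, [0.01, 0.02])}

# Join and filter: drop samples that are likely watermarked or unsafe.
clean = [
    r for r in rows
    if pwatermark[pair_key(r["url"], r["text"])] < 0.8
    and punsafe[pair_key(r["url"], r["text"])] < 0.5
]
print([r["url"] for r in clean])  # only a.jpg survives the filter
```

At real scale the same join is done over parquet metadata files rather than in-memory dicts, but the keying logic is the same.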
Details and download links: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/