MDETR: Modulated Detection for End-to-End Multi-Modal Unders | Data Science by ODS.ai 🦜

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes.
The authors present an end-to-end approach to multi-modal reasoning systems, which works much better than using a separate pre-trained decoder.
They pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image.
Fine-tuning this model achieves new SOTA results on phrase grounding, referring expression comprehension, and segmentation tasks. The approach could be extended to visual question answering.
Furthermore, the model is capable of handling the long tail of object categories.

Paper: https://arxiv.org/abs/2104.12763
Code: https://github.com/ashkamath/mdetr

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-mdetr

#deeplearning #multimodalreasoning #transformer

Data Science by ODS.ai 🦜

👨‍🎨 51.80K
Technologies

First Telegram Data Science channel. Covering all technical and popular staff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of f...

Join
▲ Vote (1)

​​MDETR: Modulated Detection for End-to-End Multi-Modal Unders | Data Science by ODS.ai 🦜

Login

MDETR: Modulated Detection for End-to-End Multi-Modal Unders | Data Science by ODS.ai 🦜