Next-ViT: Next Generation Vision Transformer for Efficient D | Data Science by ODS.ai 🦜

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

While vision transformers demostrate high performance, they can't be deployed as efficiently as CNNs in realistic industrial deployment scenarios, e. g. TensorRT or CoreML.

The authors propose Next-ViT, which has a higher latency/accuracy trade-off than existing CNN and ViT models. They develop two new architecture blocks and a new paradigm to stack them. As a result, On TensorRT, Next-ViT surpasses ResNet by 5.4 mAP (from 40.4 to 45.8) on COCO detection and 8.2% mIoU (from 38.8% to 47.0%) on ADE20K segmentation. Also, it achieves comparable performance with CSWin, while the inference speed is accelerated by
3.6×. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.2% to 48.7%) on ADE20K segmentation under similar latency.

Paper: https://arxiv.org/abs/2207.05501

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-next-vit

#deeplearning #cv #transformer #computervision

Data Science by ODS.ai 🦜

💂 51.75K
Technologies

First Telegram Data Science channel. Covering all technical and popular staff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of f...

Join
▲ Vote (1)

​​Next-ViT: Next Generation Vision Transformer for Efficient D | Data Science by ODS.ai 🦜

Login

Next-ViT: Next Generation Vision Transformer for Efficient D | Data Science by ODS.ai 🦜