Daniel Kofler
Vision Transformer
Implementation of the Vision Transformer Architecture in PyTorch
An implementation of the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" in PyTorch.
Abstract
Since their introduction, Vision Transformers have come to play a significant role in deep learning.
Researchers have started to integrate the proposed vision capabilities into large language models (LLMs), which
are based on the Transformer architecture introduced by Vaswani et al. In this project, the concepts behind the
proposed Vision Transformer architecture were explored and the model was subsequently implemented. Since
the model is relatively new, only a few reference implementations exist and there are currently no
programming libraries offering "ready-to-use" modules. Therefore, in this project the model was implemented
"from scratch" in PyTorch and evaluated on a simple image classification task using the common CIFAR-10 dataset. All code is available
on GitHub.
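The core idea referenced in the paper's title is to treat an image as a sequence of 16x16 patch "words". As a minimal sketch (not the repository's actual code; the function name and shapes are illustrative), such a patching step could look like this in PyTorch:

```python
import torch

def image_to_patches(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches
    of shape (B, num_patches, C * patch_size**2).

    Illustrative helper, assuming H and W are divisible by patch_size.
    """
    # unfold extracts sliding blocks; with stride == kernel size they tile
    # the image into non-overlapping patches
    patches = torch.nn.functional.unfold(
        images, kernel_size=patch_size, stride=patch_size
    )
    # unfold returns (B, C * patch_size**2, num_patches);
    # transpose to the sequence-first layout a Transformer expects
    return patches.transpose(1, 2)

x = torch.randn(2, 3, 32, 32)        # CIFAR-10-sized images
tokens = image_to_patches(x)         # (2, 4, 768): four 16x16 patches per image
```

Each flattened patch is then linearly projected to the model dimension before being fed to the Transformer encoder, analogous to token embeddings in NLP.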
The Vision Transformer architecture introduced by Dosovitskiy et al.