Vision Transformers (ViTs) are a class of neural network architectures that have become very popular for vision tasks such as image classification, semantic segmentation, and object detection. The main difference between a Vision Transformer and the original Transformer is that the discrete tokens of text are replaced with continuous pixel values extracted from image patches. ViTs extract features from an image by attending to different regions of it and combining them to make predictions. However, despite their recent widespread adoption, little is known about the inductive biases or features that ViTs tend to learn. Feature visualization and image reconstruction have been successful in understanding the behavior of convolutional neural networks (CNNs), but these methods have been far less successful for ViTs, which remain difficult to visualize.
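To make the patch-token idea concrete, here is a minimal sketch (not code from the paper) of the standard ViT patch-embedding step, assuming a 224×224 input, 16×16 patches, and a 768-dimensional embedding:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each one to a token vector.
    This mirrors the standard ViT patch-embedding step; all sizes are illustrative."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768) -- one continuous token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

Each of the 196 patch tokens is a continuous vector, which then plays the role that a discrete word token plays in the original Transformer.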
A recent study by a group of researchers from the University of Maryland, College Park and New York University expands the ViT literature with a detailed study of ViT behavior and internal processing mechanisms. The authors established a visualization framework for synthesizing images that maximally activate neurons in a ViT model. In particular, the method starts from random noise and takes gradient steps to maximize feature activation, applying regularization techniques along the way, such as penalizing total variation or using augmentation ensembles, to improve the quality of the generated images.
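The paper's exact optimization recipe is not reproduced here, but the description above implies a familiar activation-maximization loop, sketched below under assumed choices of model, layer, channel, and hyperparameters (the timm model name, block index 6, and channel 42 are illustrative only):

```python
import torch
import timm

# Hypothetical setup: any pretrained ViT works; the layer/channel choice is arbitrary.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)

activations = {}
model.blocks[6].register_forward_hook(lambda m, i, o: activations.update(out=o))

def total_variation(img):
    # Penalize differences between neighboring pixels to discourage high-frequency noise.
    return (img[..., 1:, :] - img[..., :-1, :]).abs().mean() + \
           (img[..., :, 1:] - img[..., :, :-1]).abs().mean()

x = torch.randn(1, 3, 224, 224, requires_grad=True)    # start from random noise
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):                                 # gradient steps on the *input*
    opt.zero_grad()
    model(x)
    channel = activations["out"][0, 1:, 42]             # patch tokens, feature channel 42 (arbitrary)
    loss = -channel.mean() + 0.25 * total_variation(x)  # maximize activation, regularize with TV
    loss.backward()
    opt.step()
```

After enough steps, `x` becomes a synthetic image whose content hints at what that feature channel responds to.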
The analysis shows that ViT patch tokens preserve spatial information in all layers except the final attention block, which learns a token-mixing operation similar to the average pooling widely used in CNNs. The authors observed that representations remained spatially local even for individual channels in deep layers of the network.
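One informal way to probe the "last block acts like average pooling" claim (this is not the authors' measurement) is to hook the token representations before and after the final block of a pretrained ViT and check how close each patch token is to the global average of patch tokens. The block indices and the random placeholder input below are assumptions:

```python
import torch
import torch.nn.functional as F
import timm

# Illustrative probe: compare patch tokens to their global average before and
# after the final transformer block of a timm ViT.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

captured = {}
model.blocks[-2].register_forward_hook(lambda m, i, o: captured.update(before=o))
model.blocks[-1].register_forward_hook(lambda m, i, o: captured.update(after=o))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))   # placeholder; use a real preprocessed image in practice

def similarity_to_mean(tokens):
    patches = tokens[:, 1:, :]                    # drop the CLS token
    mean = patches.mean(dim=1, keepdim=True)      # global average of patch tokens
    return F.cosine_similarity(patches, mean, dim=-1).mean().item()

print("before last block:", similarity_to_mean(captured["before"]))
print("after last block: ", similarity_to_mean(captured["after"]))
```

If the last block really mixes tokens the way average pooling would, the similarity after that block should be noticeably higher than before it.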
In contrast, the CLS token seems to play a relatively minor role throughout the network and is not used for globalization until the last layer. The authors tested this hypothesis by running inference with the CLS token withheld from the earlier layers and injected only at layer 12, and comparing the result against the original model's accuracy of 84.20%.
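A sketch of that injection experiment is shown below. It assumes the internal layout of timm's VisionTransformer (patch_embed, cls_token, a pos_embed whose first position belongs to the CLS token, blocks, norm, head), which varies between library versions, so treat it as an illustration of the idea rather than the authors' code:

```python
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

@torch.no_grad()
def classify_with_late_cls(x):
    tokens = model.patch_embed(x)                         # (B, 196, 768) patch tokens only
    tokens = tokens + model.pos_embed[:, 1:, :]           # positional embeddings for the patches
    for block in model.blocks[:-1]:                       # layers 1..11: no CLS token present
        tokens = block(tokens)
    cls = model.cls_token + model.pos_embed[:, :1, :]     # CLS token injected only now
    cls = cls.expand(tokens.shape[0], -1, -1)
    tokens = torch.cat([cls, tokens], dim=1)
    tokens = model.blocks[-1](tokens)                     # layer 12: CLS mixes with patches once
    return model.head(model.norm(tokens)[:, 0])           # classify from the CLS token

logits = classify_with_late_cls(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```

If the CLS token truly matters only at the end, evaluating this modified forward pass on ImageNet should land close to the unmodified model's accuracy.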
Furthermore, both CNNs and ViTs exhibit a gradual specialization of features, with early layers recognizing basic image features such as colors and edges and deeper layers recognizing more complex structures. An important difference the authors discovered, however, concerns how ViTs and CNNs depend on background and foreground image features. The study observed that ViTs far outperform CNNs at using the background information in an image to identify the correct class, while also being less affected by background removal. Moreover, ViT predictions are more resilient to the removal of high-frequency texture information than those of ResNet models (results are presented in Table 2 of the paper).
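The kind of texture-removal test this refers to can be sketched as follows: low-pass filter an image so that high-frequency texture disappears and check whether each model's prediction survives. The model choices, blur strength, and single-image setup below are assumptions, not the paper's protocol from Table 2:

```python
import torch
import timm
from torchvision import transforms
from PIL import Image

# Illustrative texture-removal check, not the paper's exact experiment.
vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
resnet = timm.create_model("resnet50", pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
blur = transforms.GaussianBlur(kernel_size=21, sigma=5.0)   # suppresses high-frequency texture

img = Image.open("example.jpg").convert("RGB")              # hypothetical input image
clean = preprocess(img).unsqueeze(0)
blurred = blur(clean)                                       # GaussianBlur also accepts tensors

with torch.no_grad():
    for name, model in [("ViT", vit), ("ResNet-50", resnet)]:
        p_clean = model(clean).argmax(dim=1).item()
        p_blur = model(blurred).argmax(dim=1).item()
        print(f"{name}: prediction kept after blurring = {p_clean == p_blur}")
```

Repeating such a check over a validation set is what lets one compare how often each architecture keeps its prediction once texture is removed.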
Finally, the work also briefly analyzes the representations learned by ViT models trained with the Contrastive Language-Image Pretraining (CLIP) framework, which connects images and text. Interestingly, unlike ViTs trained as classifiers, CLIP-trained ViTs produce features in deeper layers that are activated by images belonging to clearly identifiable conceptual categories rather than by specific objects. This is reasonable, if somewhat surprising, because the text available on the Internet provides training targets for abstract and semantic concepts such as "prevalence" (an example is shown in Figure 11).
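For readers who want to poke at CLIP-trained ViT features themselves, a hedged sketch is given below: it ranks a few images by how strongly they activate one channel in a deep block of CLIP's visual transformer. The module path (visual.transformer.resblocks), the block and channel indices, and the image file names are assumptions about OpenAI's clip package layout, not details from the paper:

```python
import torch
import clip   # OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()

captured = {}
# Note: OpenAI's CLIP keeps transformer activations in (sequence, batch, dim) order.
model.visual.transformer.resblocks[10].register_forward_hook(
    lambda m, i, o: captured.update(feat=o)
)

image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]   # hypothetical image files
scores = []
with torch.no_grad():
    for path in image_paths:
        x = preprocess(Image.open(path)).unsqueeze(0).to(device)
        model.encode_image(x)
        act = captured["feat"][1:, 0, 123]           # patch positions, channel 123 (arbitrary)
        scores.append((act.mean().item(), path))

for score, path in sorted(scores, reverse=True):
    print(f"{score:.3f}  {path}")
```

Run over a large image collection, the top-ranked images hint at whether a given feature responds to a concrete object or to a more abstract concept.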
Check out the paper and GitHub repository. All credit for this research goes to the researchers of this project.
Lorenzo Brigato is a postdoctoral researcher at the ARTORG Center, a research institute affiliated with the University of Bern, currently working on AI applications in health and nutrition. He holds a Ph.D. in Computer Science from Sapienza University of Rome, Italy. His Ph.D. thesis focused on the problem of image classification with sample- and label-deficient data distributions.