A Deep Dive into the Code of the Vision Transformer (ViT) Model

Dissecting the HuggingFace ViT Implementation

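The Vision Transformer (ViT) is a milestone in the development of computer vision. It challenged the conventional wisdom that images are best processed with convolutional layers, showing that sequence-based attention can effectively capture complex patterns, context, and semantics in images. By splitting an image into manageable patches and applying self-attention, ViT captures both local and global relationships, which lets it perform well on vision tasks ranging from image classification to object detection. In this article, we walk through how ViT works for classification.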

Introduction

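The core idea of ViT is to treat an image as a sequence of fixed-size patches, each of which is flattened into a 1D vector. A Transformer encoder then processes these patch vectors, allowing the model to capture global context and dependencies across the whole image. Working on patches rather than individual pixels keeps the sequence short (a 224×224 image cut into 16×16 patches yields 14 × 14 = 196 patches), which keeps the cost of self-attention manageable for large images while preserving the ability to model complex spatial interactions.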
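To make the patching step concrete, here is a minimal sketch of cutting an image into non-overlapping patches and flattening each one into a 1D vector. The helper name image_to_patches and the random input tensor are assumptions made for illustration; this is not how the Hugging Face implementation performs the split.

```python
import torch


def image_to_patches(image: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split a (C, H, W) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration only.
    """
    channels, height, width = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # Expose the patch grid: (C, rows, P, cols, P)
    patches = image.reshape(channels, rows, patch_size, cols, patch_size)
    # Bring the grid dimensions to the front: (rows, cols, C, P, P)
    patches = patches.permute(1, 3, 0, 2, 4)
    # One row per patch, each a 1D vector of length C * P * P
    return patches.reshape(rows * cols, channels * patch_size * patch_size)


image = torch.rand(3, 224, 224)  # random stand-in for a real RGB image
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # torch.Size([196, 768])
```

Each of the 196 rows holds 3 × 16 × 16 = 768 raw pixel values. Patches are ordered row by row across the grid, so patches[1] is the patch immediately to the right of the top-left one.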

First, we import the ViT classification model from the Hugging Face Transformers library:

from transformers import ViTForImageClassification
import torch
import numpy as np

# Load the pretrained ViT checkpoint together with its classification head
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

The suffix patch16-224 means that the model takes 224×224 images as input and that each patch is 16 pixels wide and 16 pixels high.
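The effect of these two numbers on tensor shapes can be checked with a short sketch. It assumes a hidden size of 768 and a convolution whose kernel size and stride both equal the patch size, matching the projection layer in the printed architecture. The layer here is randomly initialized and the input is random noise, so only the shapes are meaningful, not the values.

```python
import torch
from torch import nn

image_size, patch_size, hidden_size = 224, 16, 768

patches_per_side = image_size // patch_size  # 224 / 16 = 14
num_patches = patches_per_side ** 2          # 14 * 14 = 196

# Stand-in for the patch projection: one kernel application per patch,
# because the stride equals the kernel size (no overlap between patches).
projection = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

pixel_values = torch.rand(1, 3, image_size, image_size)  # dummy batch of one image
with torch.no_grad():
    feature_map = projection(pixel_values)           # (1, 768, 14, 14)
    tokens = feature_map.flatten(2).transpose(1, 2)  # (1, 196, 768)

print(num_patches)        # 196
print(feature_map.shape)  # torch.Size([1, 768, 14, 14])
print(tokens.shape)       # torch.Size([1, 196, 768])
```

Flattening the 14 × 14 grid and swapping the last two dimensions turns the feature map into a sequence of 196 vectors of size 768, the sequence layout a Transformer encoder expects.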

Here is the model's architecture:

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): PatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0): ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key)…