Vision Transformers vs. CNNs: Which Should You Use?
The computer vision landscape has been transformed (pun intended) by Vision Transformers (ViTs). Since the 2020 ViT paper and the 2021 Swin Transformer, the community has debated: should we abandon CNNs?
For medical imaging, the answer is nuanced. Let me share what I've learned from using Swin Transformers in my recent work on Shape-from-Focus (SfF) — a technique for reconstructing 3D surfaces from multi-focus microscopy images, published in Optics and Lasers in Engineering (2025).
A Quick Recap: What's the Difference?
CNNs
- Process images through local convolution kernels
- Capture local features efficiently
- Strong inductive bias (translation equivariance)
- Excellent for pixel-level tasks with limited data
Vision Transformers
- Use self-attention to model global relationships
- No local bias — every patch can attend to every other patch
- Require more data to train from scratch
- Naturally encode long-range dependencies
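To make the "global relationships" point concrete, here is a toy single-head self-attention over 16 patch embeddings in NumPy. It uses random data and identity Q/K/V projections, so it is purely illustrative, but it shows the key property: the attention matrix is dense, so every patch mixes information from every other patch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 8           # e.g. a 4x4 grid of patch embeddings
x = rng.standard_normal((num_patches, dim))

# Scaled dot-product attention with identity projections (no learned weights)
scores = x @ x.T / np.sqrt(dim)    # (16, 16): every patch scores every other patch
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
out = weights @ x                  # each output is a mixture over ALL patches

# weights is a dense (16, 16) matrix: global connectivity in a single layer,
# whereas a 3x3 convolution would only connect each patch to its 8 neighbours.
```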
Why Swin Transformers for Shape-from-Focus?
Shape-from-Focus requires estimating depth from a stack of images taken at different focus distances. The key insight is that sharp regions in each image correspond to the in-focus depth layer.
Traditional SfF methods compute local focus measures (variance, Laplacian energy) per pixel. These are local by nature — they miss context from neighboring regions.
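The classical per-pixel pipeline can be sketched in a few lines of NumPy on toy data. The `laplacian_energy` helper below is my own minimal focus measure (a squared 4-neighbour discrete Laplacian), not the exact operator from any particular paper, but it shows why these methods are local: each pixel's depth is decided from a tiny neighbourhood, with no wider context.

```python
import numpy as np

def laplacian_energy(img):
    """Per-pixel focus measure: squared response of a 4-neighbour Laplacian."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return lap ** 2

# Toy focal stack: the slice with the highest focus measure wins per pixel.
rng = np.random.default_rng(1)
stack = rng.standard_normal((5, 32, 32))        # 5 focus slices (synthetic data)
focus = np.stack([laplacian_energy(s) for s in stack])   # (5, 32, 32)
depth_map = np.argmax(focus, axis=0)            # (32, 32) index of sharpest slice
```

In textureless regions every slice produces a near-zero Laplacian response, so this argmax is essentially noise, which is exactly where global context helps.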
Swin Transformer solves this with its shifted window attention:
- Divides the image into non-overlapping windows
- Computes self-attention within each window (efficient)
- Shifts windows between layers to capture cross-window context
```python
import torch
import torch.nn as nn

# Simplified Swin Transformer block
# (WindowAttention and MLP are the standard Swin sub-modules, omitted here)
class SwinBlock(nn.Module):
    def __init__(self, dim, num_heads, window_size=7, shift=0):
        super().__init__()
        self.attn = WindowAttention(dim, num_heads, window_size)
        self.shift = shift
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = MLP(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, H, W, C)
        # Cyclic shift so windows straddle the previous layer's window borders
        if self.shift > 0:
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        x = x + self.attn(self.norm1(x))
        # Reverse the cyclic shift to restore spatial alignment
        if self.shift > 0:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = x + self.mlp(self.norm2(x))
        return x
```
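The effect of the cyclic shift is easiest to see on a tiny example. The `window_partition` helper below is my own minimal version of the partitioning step used in typical Swin implementations; on a 4x4 feature map with 2x2 windows, shifting by one pixel makes the first window span pixels that sat in four different windows before the shift.

```python
import torch

# Toy 4x4 "feature map", layout (B, H, W, C) as in the SwinBlock above
x = torch.arange(16.).reshape(1, 4, 4, 1)

shifted = torch.roll(x, shifts=(-1, -1), dims=(1, 2))  # cyclic shift by 1

def window_partition(t, ws):
    """Split (B, H, W, C) into non-overlapping windows: (num_windows, ws*ws, C)."""
    b, h, w, c = t.shape
    t = t.reshape(b, h // ws, ws, w // ws, ws, c)
    return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

print(window_partition(x, 2)[0, :, 0])        # tensor([0., 1., 4., 5.])
print(window_partition(shifted, 2)[0, :, 0])  # tensor([ 5.,  6.,  9., 10.])
```

After the shift, the first window contains pixels 5, 6, 9, 10, which previously belonged to four different windows, so information flows across window borders in the next attention layer.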
Results: Swin vs. CNN on Shape-from-Focus
We compared our Swin-based approach against classical CNN baselines on microscopy image stacks:
| Method | RMSE (μm) ↓ | SSIM ↑ | PSNR (dB) ↑ |
|---|---|---|---|
| Laplacian (classic) | 12.4 | 0.71 | 28.3 |
| CNN-based SfF | 8.1 | 0.84 | 32.6 |
| Swin-SfF (ours) | 5.3 | 0.91 | 36.2 |
The Swin Transformer achieved ~35% lower RMSE compared to CNN baselines — primarily because global context helps disambiguate textureless regions where local focus measures fail.
When to Choose What
| Scenario | Best Choice |
|---|---|
| Small dataset (<1000 images) | CNN |
| Large dataset (>10k images) | ViT / Swin |
| Pixel-level segmentation | CNN (e.g., U-Net) or hybrid |
| Classification, depth, global reasoning | Swin / ViT |
| Edge deployment / fast inference | Lightweight CNN (MobileNet) |
| 3D volumetric data | 3D Swin or CNN |
My Recommendation
For most medical imaging tasks in 2025:
1. Start with a CNN baseline (U-Net, EfficientNet) — fast to train, well-understood
2. Try Swin Transformer if you have enough data and need global context
3. Consider hybrid architectures — CNN encoder + Transformer decoder often beats pure approaches
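To give a rough feel for the hybrid idea in point 3, here is a minimal PyTorch sketch: a small CNN encoder followed by one Transformer encoder layer applied to the flattened feature tokens. `HybridDepthNet` and all its dimensions are toy choices of mine, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class HybridDepthNet(nn.Module):
    """Toy hybrid: CNN encoder for local features, Transformer for global mixing."""
    def __init__(self, in_ch=1, dim=64, num_heads=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.head = nn.Conv2d(dim, 1, 1)  # per-pixel depth at 1/4 resolution

    def forward(self, x):
        f = self.encoder(x)                    # (B, dim, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, dim)
        tokens = self.transformer(tokens)      # global attention over CNN features
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(f)

out = HybridDepthNet()(torch.randn(1, 1, 32, 32))  # (1, 1, 8, 8) depth map
```

The division of labour is the usual argument for hybrids: convolutions supply the local inductive bias that helps with limited data, while the attention layer adds the long-range context that pure CNNs lack.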
📄 Paper: DOI: 10.1016/j.optlaseng.2025.109108
📂 Code: Available on GitHub (see Repositories page)