Vision Transformers vs. CNNs: Which Should You Use for Medical Imaging?

Ramazan Ozgur Dogan · Sat, 01 Mar 2025
vision transformer CNN deep learning medical imaging swin transformer

Vision Transformers vs. CNNs: Which Should You Use?

The computer vision landscape has been transformed (pun intended) by Vision Transformers (ViTs). Since the 2020 ViT paper and the 2021 Swin Transformer, the community has debated: should we abandon CNNs?

For medical imaging, the answer is nuanced. Let me share what I've learned from using Swin Transformers in my recent work on Shape-from-Focus (SfF) — a technique for reconstructing 3D surfaces from multi-focus microscopy images, published in Optics and Lasers in Engineering (2025).

A Quick Recap: What's the Difference?

CNNs

  • Process images through local convolution kernels
  • Capture local features efficiently
  • Strong inductive bias (translation equivariance)
  • Excellent for pixel-level tasks with limited data

Vision Transformers

  • Use self-attention to model global relationships
  • No local bias — every patch can attend to every other patch
  • Require more data to train from scratch
  • Naturally encode long-range dependencies
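To make the contrast concrete, here is a minimal PyTorch sketch (shapes and dimensions are arbitrary illustrations, not from any particular model): a 3×3 convolution mixes only a local neighborhood, while multi-head self-attention produces a full token-to-token interaction map over all patches.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)          # batch, channels, H, W

# CNN: each output pixel only sees a 3x3 neighborhood of the input
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
local_out = conv(x)                      # (1, 64, 32, 32)

# ViT-style: flatten the feature map into a sequence of 32*32 = 1024 tokens
seq = x.flatten(2).transpose(1, 2)       # (1, 1024, 64)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
global_out, weights = attn(seq, seq, seq)

# weights is a (1, 1024, 1024) map: every patch attends to every other patch
print(local_out.shape, global_out.shape, weights.shape)
```

The price of that global attention map is quadratic cost in the number of patches, which is exactly the problem Swin's windowed attention addresses.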

Why Swin Transformers for Shape-from-Focus?

Shape-from-Focus requires estimating depth from a stack of images taken at different focus distances. The key insight is that sharp regions in each image correspond to the in-focus depth layer.

Traditional SfF methods compute local focus measures (variance, Laplacian energy) per pixel. These are local by nature — they miss context from neighboring regions.
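To illustrate that locality, here is a minimal sketch of a classic Laplacian-energy focus measure (a generic textbook formulation, not the exact measure or implementation from the paper): the focus volume is computed per pixel from a small window, and depth is simply the argmax over the stack.

```python
import torch
import torch.nn.functional as F

def laplacian_focus_volume(stack):
    """Per-pixel Laplacian-energy focus measure for a grayscale focus stack.

    stack: (N, H, W) tensor, N images taken at different focus distances.
    Returns an (N, H, W) focus volume; argmax over N gives the depth index.
    """
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)
    lap = F.conv2d(stack.unsqueeze(1), kernel, padding=1)   # (N, 1, H, W)
    energy = lap.pow(2)
    # Aggregate energy over a small 5x5 window: this strictly local support
    # is why such measures fail in textureless regions.
    return F.avg_pool2d(energy, 5, stride=1, padding=2).squeeze(1)

stack = torch.randn(10, 64, 64)                            # synthetic stack
depth_map = laplacian_focus_volume(stack).argmax(dim=0)    # (64, 64) indices
```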

Swin Transformer solves this with its shifted window attention:

  • Divides the image into non-overlapping windows
  • Computes self-attention within each window (efficient)
  • Shifts windows between layers to capture cross-window context

# Simplified Swin Transformer block
# (WindowAttention and MLP are standard sub-modules, defined elsewhere)
import torch
import torch.nn as nn

class SwinBlock(nn.Module):
    def __init__(self, dim, num_heads, window_size=7, shift=0):
        super().__init__()
        self.attn = WindowAttention(dim, num_heads, window_size)
        self.shift = shift
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = MLP(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, H, W, C). Cyclic shift so attention crosses window borders
        if self.shift > 0:
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        x = x + self.attn(self.norm1(x))
        if self.shift > 0:
            # Undo the shift so the spatial layout is restored for later layers
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = x + self.mlp(self.norm2(x))
        return x
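The block above assumes attention is computed per window, which is what makes the cost linear rather than quadratic in image size. The partition/reverse helpers it relies on look roughly like this (a sketch mirroring the layout conventions of the official Swin implementation; the shapes are illustrative):

```python
import torch

def window_partition(x, window_size):
    """Split (B, H, W, C) features into non-overlapping windows.

    Returns (B * num_windows, window_size * window_size, C) token sequences,
    one sequence per window, ready for self-attention.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: stitch windows back into a feature map."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size,
                     window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(2, 56, 56, 96)   # e.g., a Swin-T stage-1 feature map
w = window_partition(x, 7)       # (2 * 8 * 8, 49, 96) window sequences
assert torch.equal(window_reverse(w, 7, 56, 56), x)   # lossless round trip
```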

Results: Swin vs. CNN on Shape-from-Focus

We compared our Swin-based approach against classical CNN baselines on microscopy image stacks:

Method                RMSE (μm) ↓   SSIM ↑   PSNR (dB) ↑
Laplacian (classic)   12.4          0.71     28.3
CNN-based SfF         8.1           0.84     32.6
Swin-SfF (ours)       5.3           0.91     36.2

The Swin Transformer achieved ~35% lower RMSE compared to CNN baselines — primarily because global context helps disambiguate textureless regions where local focus measures fail.

When to Choose What

Scenario                                  Best Choice
Small dataset (<1000 images)              CNN
Large dataset (>10k images)               ViT / Swin
Pixel-level segmentation                  CNN (e.g., U-Net) or hybrid
Classification, depth, global reasoning   Swin / ViT
Edge deployment / fast inference          Lightweight CNN (MobileNet)
3D volumetric data                        3D Swin or CNN

My Recommendation

For most medical imaging tasks in 2025:

  1. Start with a CNN baseline (U-Net, EfficientNet) — fast to train, well-understood
  2. Try Swin Transformer if you have enough data and need global context
  3. Consider hybrid architectures — CNN encoder + Transformer decoder often beats pure approaches
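As a sketch of the hybrid idea in point 3, a minimal version might use a small convolutional stem for local features and a Transformer encoder for global context. This is an illustrative toy (all layer sizes and names are my own choices, not an architecture from the paper):

```python
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    """Toy hybrid: CNN stem (local features) + Transformer (global context)."""
    def __init__(self, in_ch=1, dim=64, num_heads=4, depth=2, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(              # CNN: downsample 4x
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.stem(x)                        # (B, dim, H/4, W/4)
        seq = f.flatten(2).transpose(1, 2)      # (B, H*W/16, dim) tokens
        seq = self.encoder(seq)                 # global attention over tokens
        return self.head(seq.mean(dim=1))       # mean-pooled logits

logits = HybridNet()(torch.randn(2, 1, 64, 64))   # (2, 10)
```

The convolutional stem keeps the token count small (and supplies the local inductive bias), so the Transformer only has to model long-range structure.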


📄 Paper: DOI 10.1016/j.optlaseng.2025.109108

📂 Code: Available on GitHub (see Repositories page)


Academic blog of Asst. Prof. Ramazan Özgür Doğan — AI researcher at Trabzon University. Deep learning, medical imaging, and beyond.
