Vision Transformers vs. CNNs: Which Should You Use?
The computer vision landscape has been transformed (pun intended) by Vision Transformers (ViTs). Since the 2020 ViT paper and the 2021 Swin Transformer, the community has debated: should we abandon CNNs?
For medical imaging, the answer is nuanced. Let me share what I've learned from using Swin Transformers in my recent work on Shape-from-Focus (SfF) — a technique for reconstructing 3D surfaces from multi-focus microscopy images, published in Optics and Lasers in Engineering (2025).
A Quick Recap: What's the Difference?
CNNs
- Process images through local convolution kernels
- Capture local features efficiently
- Strong inductive bias (translation equivariance)
- Excellent for pixel-level tasks with limited data
Vision Transformers
- Use self-attention to model global relationships
- No local bias — every patch can attend to every other patch
- Require more data to train from scratch
- Naturally encode long-range dependencies
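To make the "global relationships" point concrete, here is a toy single-head self-attention over 16 patch embeddings in NumPy. It uses random data and identity Q/K/V projections, so it is purely illustrative, but it shows the key property: the attention matrix is dense, so every patch mixes information from every other patch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 8           # e.g. a 4x4 grid of patch embeddings
x = rng.standard_normal((num_patches, dim))

# Scaled dot-product attention with identity projections (no learned weights)
scores = x @ x.T / np.sqrt(dim)    # (16, 16): every patch scores every other patch
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
out = weights @ x                  # each output is a mixture over ALL patches

# weights is a dense (16, 16) matrix: global connectivity in a single layer,
# whereas a 3x3 convolution would only connect each patch to its 8 neighbours.
```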
Why Swin Transformers for Shape-from-Focus?
Shape-from-Focus requires estimating depth from a stack of images taken at different focus distances. The key insight is that sharp regions in each image correspond to the in-focus depth layer.
Traditional SfF methods compute local focus measures (variance, Laplacian energy) per pixel. These are local by nature — they miss context from neighboring regions.
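The classical per-pixel pipeline can be sketched in a few lines of NumPy on toy data. The `laplacian_energy` helper below is my own minimal focus measure (a squared 4-neighbour discrete Laplacian), not the exact operator from any particular paper, but it shows why these methods are local: each pixel's depth is decided from a tiny neighbourhood, with no wider context.

```python
import numpy as np

def laplacian_energy(img):
    """Per-pixel focus measure: squared response of a 4-neighbour Laplacian."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return lap ** 2

# Toy focal stack: the slice with the highest focus measure wins per pixel.
rng = np.random.default_rng(1)
stack = rng.standard_normal((5, 32, 32))        # 5 focus slices (synthetic data)
focus = np.stack([laplacian_energy(s) for s in stack])   # (5, 32, 32)
depth_map = np.argmax(focus, axis=0)            # (32, 32) index of sharpest slice
```

In textureless regions every slice produces a near-zero Laplacian response, so this argmax is essentially noise, which is exactly where global context helps.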
Swin Transformer solves this with its shifted window attention:
- Divides the image into non-overlapping windows
- Computes self-attention within each window (efficient)
- Shifts windows between layers to capture cross-window context
```python
import torch
import torch.nn as nn

# Simplified Swin Transformer block
# (WindowAttention and MLP are the standard Swin sub-modules, omitted here)
class SwinBlock(nn.Module):
    def __init__(self, dim, num_heads, window_size=7, shift=0):
        super().__init__()
        self.attn = WindowAttention(dim, num_heads, window_size)
        self.shift = shift
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = MLP(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, H, W, C)
        # Cyclic shift so windows straddle the previous layer's window borders
        if self.shift > 0:
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        x = x + self.attn(self.norm1(x))
        # Reverse the cyclic shift to restore spatial alignment
        if self.shift > 0:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = x + self.mlp(self.norm2(x))
        return x
```
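The effect of the cyclic shift is easiest to see on a tiny example. The `window_partition` helper below is my own minimal version of the partitioning step used in typical Swin implementations; on a 4x4 feature map with 2x2 windows, shifting by one pixel makes the first window span pixels that sat in four different windows before the shift.

```python
import torch

# Toy 4x4 "feature map", layout (B, H, W, C) as in the SwinBlock above
x = torch.arange(16.).reshape(1, 4, 4, 1)

shifted = torch.roll(x, shifts=(-1, -1), dims=(1, 2))  # cyclic shift by 1

def window_partition(t, ws):
    """Split (B, H, W, C) into non-overlapping windows: (num_windows, ws*ws, C)."""
    b, h, w, c = t.shape
    t = t.reshape(b, h // ws, ws, w // ws, ws, c)
    return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

print(window_partition(x, 2)[0, :, 0])        # tensor([0., 1., 4., 5.])
print(window_partition(shifted, 2)[0, :, 0])  # tensor([ 5.,  6.,  9., 10.])
```

After the shift, the first window contains pixels 5, 6, 9, 10, which previously belonged to four different windows, so information flows across window borders in the next attention layer.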
Results: Swin vs. CNN on Shape-from-Focus
We compared our Swin-based approach against classical CNN baselines on microscopy image stacks:
| Method | RMSE (μm) ↓ | SSIM ↑ | PSNR (dB) ↑ |
|---|---|---|---|
| Laplacian (classic) | 12.4 | 0.71 | 28.3 |
| CNN-based SfF | 8.1 | 0.84 | 32.6 |
| Swin-SfF (ours) | 5.3 | 0.91 | 36.2 |
The Swin Transformer achieved ~35% lower RMSE compared to CNN baselines — primarily because global context helps disambiguate textureless regions where local focus measures fail.
When to Choose What
| Scenario | Best Choice |
|---|---|
| Small dataset (<1000 images) | CNN |
| Large dataset (>10k images) | ViT / Swin |
| Pixel-level segmentation | CNN (e.g., U-Net) or hybrid |
| Classification, depth, global reasoning | Swin / ViT |
| Edge deployment / fast inference | Lightweight CNN (MobileNet) |
| 3D volumetric data | 3D Swin or CNN |
My Recommendation
For most medical imaging tasks in 2025:
1. Start with a CNN baseline (U-Net, EfficientNet) — fast to train, well-understood
2. Try Swin Transformer if you have enough data and need global context
3. Consider hybrid architectures — CNN encoder + Transformer decoder often beats pure approaches
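To give a rough feel for the hybrid idea in point 3, here is a minimal PyTorch sketch: a small CNN encoder followed by one Transformer encoder layer applied to the flattened feature tokens. `HybridDepthNet` and all its dimensions are toy choices of mine, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class HybridDepthNet(nn.Module):
    """Toy hybrid: CNN encoder for local features, Transformer for global mixing."""
    def __init__(self, in_ch=1, dim=64, num_heads=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.head = nn.Conv2d(dim, 1, 1)  # per-pixel depth at 1/4 resolution

    def forward(self, x):
        f = self.encoder(x)                    # (B, dim, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, dim)
        tokens = self.transformer(tokens)      # global attention over CNN features
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(f)

out = HybridDepthNet()(torch.randn(1, 1, 32, 32))  # (1, 1, 8, 8) depth map
```

The division of labour is the usual argument for hybrids: convolutions supply the local inductive bias that helps with limited data, while the attention layer adds the long-range context that pure CNNs lack.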
📄 Paper: DOI: 10.1016/j.optlaseng.2025.109108
📂 Code: Available on GitHub (see Repositories page)