Jinxin Zhou1 Jiachen Jiang1 Zhihui Zhu1
1The Ohio State University
Qualitative comparison of segmentation results

Qualitative comparison between baseline methods (SCLIP, ClearCLIP) and their counterparts integrated with LHT-CLIP.


Abstract

Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image–text alignment at the expense of visual discriminability; (ii) a subset of attention heads displays consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: spatial-semantic reweighting (SSR), selective head enhancement (SHE), and abnormal token replacement (ATR) to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning.

TL;DR

We analyze CLIP's visual discriminability at the layer, head, and token levels, and propose LHT-CLIP, a training-free, plug-and-play framework with three complementary strategies that consistently boost open-vocabulary semantic segmentation performance across 8 benchmarks and multiple baselines.


Key Findings

Through comprehensive analysis of CLIP's internal representations, we reveal three critical insights:

🔵 Layer-Level: Visual Discriminability vs. Semantic Alignment Trade-off

The final layers primarily strengthen image–text alignment while sacrificing visual discriminability (e.g., last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens.

Layer-wise analysis on VOC (ViT-B/16)

Layer-wise analysis on Context (ViT-B/16)

Layer-wise visual discriminability (blue) and semantic alignment (orange) across layers. Visual discriminability drops sharply in final layers while semantic alignment improves only marginally.

🟠 Token-Level: Abnormal Tokens Disrupt Spatial Coherence

Abnormal tokens emerge in deeper layers, attracting disproportionate attention from nearly all spatial positions. These tokens exhibit sparse, high-magnitude activations that remain consistent across positions, layers, and samples.

Normal Token Activation

Abnormal Token Activation

Comparison of activations: normal tokens show uniform, low-magnitude activations, while abnormal tokens exhibit sparse, high-magnitude spikes.

Inter-token Cosine Similarity

Inter-sample Cosine Similarity

Layer-wise cosine similarity among abnormal tokens across positions and samples. Abnormal tokens exhibit high mutual similarity, confirming they encode consistent, position- and sample-agnostic bias components.

🟢 Head-Level: Discriminative Heads are Sparse and Consistent

A small subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets, while others contribute noise.
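The figures below report head-wise discriminability as AUC values. As a rough illustration of how such a score can be computed, the sketch below treats a head's token-token cosine similarity as a predictor of whether two tokens share a ground-truth class and evaluates it with a rank-based (Mann-Whitney) AUC; the paper's exact scoring protocol may differ.

```python
import numpy as np

def pairwise_auc(feats, labels):
    """AUC of token-token cosine similarity as a predictor of same-class
    membership, via the Mann-Whitney rank statistic.
    feats: (N, D) token features of one head; labels: (N,) class ids."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T
    iu = np.triu_indices(len(labels), k=1)        # all unordered token pairs
    scores = sim[iu]
    same = (labels[:, None] == labels[None, :])[iu]
    pos, neg = scores[same], scores[~same]
    # rank all pair scores; AUC = P(same-class pair outranks cross-class pair)
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    r_pos = ranks[: len(pos)].sum()
    return (r_pos - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))
```

A head whose features cleanly separate classes scores near 1.0; an uninformative head scores near 0.5.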

Head-wise AUC on Context (ViT-B/16)

Head-wise AUC on ADE (ViT-B/16)

Head-wise AUC on COCO-Stuff (ViT-B/16)

Head-wise visual discriminability across multiple datasets using ViT-B/16. Dashed lines denote layer-wise scores. Different heads exhibit varying discriminability, and certain heads (e.g., the 11th head in the 6th layer) consistently show high discriminability across datasets.


Method: LHT-CLIP

LHT-CLIP comprises three complementary strategies, each targeting a different level of analysis. Together they synergistically enhance visual discriminability while preserving semantic alignment.

1. Abnormal Token Replacement (ATR)

Anomalous tokens are identified via Hoyer sparsity thresholding: tokens with sparsity exceeding a threshold τ are replaced with a weighted aggregation of their 8 nearest normal neighbors. This simple strategy is applied only at the penultimate layer.
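A minimal sketch of this step, assuming the standard Hoyer sparsity measure and feature-space cosine similarity for picking the nearest normal neighbors; the threshold τ, the neighbor definition, and the weighting used in the paper may differ.

```python
import numpy as np

def hoyer_sparsity(x, eps=1e-8):
    """Hoyer sparsity of a vector: ~1 for a one-hot spike, ~0 for uniform."""
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum()) + eps
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

def replace_abnormal_tokens(tokens, tau=0.6, k=8):
    """Replace tokens whose Hoyer sparsity exceeds tau with a similarity-
    weighted average of their k most similar normal tokens.
    tokens: (N, D) patch features at the penultimate layer."""
    sparsity = np.array([hoyer_sparsity(t) for t in tokens])
    abnormal = np.where(sparsity > tau)[0]
    normal = np.where(sparsity <= tau)[0]
    out = tokens.copy()
    norms = np.linalg.norm(tokens, axis=1) + 1e-8
    for i in abnormal:
        sims = (tokens[normal] @ tokens[i]) / (norms[normal] * norms[i])
        nearest = np.argsort(sims)[-k:]          # k most similar normal tokens
        w = np.exp(sims[nearest])                # softmax-style weights
        out[i] = (w[:, None] * tokens[normal[nearest]]).sum(0) / w.sum()
    return out
```

Normal tokens pass through untouched, so the operation is a no-op on images without anomalous activations.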

2. Spatial-Semantic Reweighting (SSR)

To counteract the visual discriminability degradation in final layers, SSR upweights the residual pathway and downweights the attention/MLP submodules in the last few transformer layers. This is the first method to explicitly modify inference prior to the final layer, substantially improving spatial coherence without compromising semantic alignment.
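The reweighted forward pass can be sketched as below; the alpha/beta values and the number of reweighted layers are illustrative placeholders, not the paper's tuned settings.

```python
import numpy as np

def forward_with_ssr(x, blocks, n_reweight=3, alpha=1.2, beta=0.5):
    """Run a stack of pre-norm transformer blocks, applying SSR-style
    reweighting to the last n_reweight layers: the residual stream is
    upweighted (alpha > 1) and the attention/MLP branch outputs are
    downweighted (beta < 1). Each block is a pair (attn_fn, mlp_fn)
    acting on token features x of shape (N, D)."""
    n = len(blocks)
    for i, (attn_fn, mlp_fn) in enumerate(blocks):
        a, b = (alpha, beta) if i >= n - n_reweight else (1.0, 1.0)
        x = a * x + b * attn_fn(x)   # residual + attention branch
        x = a * x + b * mlp_fn(x)    # residual + MLP branch
    return x
```

Because only the scalar weights change, this requires no retraining and leaves earlier layers exactly as in the original model.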

3. Selective Head Enhancement (SHE)

Top-k discriminative attention heads are selected based on dataset-agnostic visual discriminability scores. Their features are aggregated to construct soft pseudo-masks via thresholded similarity maps, which refine the final-layer features by suppressing spurious cross-category interactions.
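A sketch of the refinement step, assuming mean aggregation over the selected heads and a row-normalized, thresholded cosine-similarity map as the soft pseudo-mask; `head_scores`, `k`, and `thresh` are assumed inputs and values, and the paper's exact construction may differ.

```python
import numpy as np

def she_refine(final_feats, head_feats, head_scores, k=10, thresh=0.2):
    """Selective head enhancement (sketch).
    final_feats: (N, D) final-layer token features to refine.
    head_feats:  (H, N, D) per-head token features.
    head_scores: (H,) precomputed, dataset-agnostic discriminability scores."""
    top = np.argsort(head_scores)[-k:]                  # top-k discriminative heads
    agg = head_feats[top].mean(0)                       # (N, D) aggregated features
    agg = agg / (np.linalg.norm(agg, axis=1, keepdims=True) + 1e-8)
    sim = agg @ agg.T                                   # (N, N) cosine similarity
    mask = np.where(sim > thresh, sim, 0.0)             # soft pseudo-mask
    mask = mask / (mask.sum(1, keepdims=True) + 1e-8)   # row-normalize
    return mask @ final_feats                           # re-aggregate features
```

Thresholding zeroes out weak cross-category similarities, so each refined token is averaged only over tokens the discriminative heads consider alike.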

ATR illustration

(a) ATR: Abnormal tokens (red) identified by Hoyer sparsity are replaced with their normal neighbors.

SSR illustration

(b) SSR: Residual pathways are upweighted in final layers to restore visual discriminability.

SSR effectiveness on VOC (ViT-L/14)

SSR effectiveness on ADE (ViT-L/14)

(c) SSR: Effect on ViT-L/14. Solid lines = baseline CLIP; dashed lines = after applying SSR. SSR notably improves visual discriminability in the final layers and consistently enhances semantic alignment.


Results

LHT-CLIP consistently improves state-of-the-art training-free methods across 8 benchmarks using ViT-B/16. As a plug-and-play solution, it also outperforms leading weakly supervised methods.

                               --- With background ---   ------- Without background -------
Methods              Training  VOC21   C60    Object     VOC20   City   C59    ADE   Stuff   Avg.
ReCo                 ✓         25.1    19.9   15.7       57.7    21.1   22.3   11.2  14.8    23.5
GroupViT             ✓         52.3    18.7   27.5       79.7    18.5   23.4   10.4  15.3    30.7
TCL                  ✓         51.2    24.3   30.4       77.5    23.1   30.3   14.9  19.6    33.9
CLIP                 ✗         16.2    7.7    5.5        41.8    5.5    9.2    2.1   4.4     11.6
MaskCLIP             ✗         38.8    23.6   20.6       74.9    16.4   26.4   9.8   14.8    28.2
LaVG                 ✗         62.1    31.6   34.2       82.5    26.2   34.7   15.8  23.2    38.8
NACLIP               ✗         58.9    32.2   33.2       79.7    35.5   35.2   17.4  23.3    39.4
SCLIP                ✗         59.7    31.7   33.5       81.5    32.3   34.5   16.5  22.7    39.1
+ LHT-CLIP (Ours)    ✗         64.8    34.8   36.6       86.3    36.1   37.6   18.0  24.9    42.4 (+3.3)
ClearCLIP            ✗         57.0    32.2   32.5       82.3    32.8   35.8   17.3  24.0    39.2
+ LHT-CLIP (Ours)    ✗         63.8    35.2   35.6       85.7    37.8   38.8   19.2  25.8    42.7 (+3.5)
ResCLIP              ✗         60.0    32.7   34.0       85.5    35.6   35.8   17.7  23.8    40.6
+ LHT-CLIP (Ours)    ✗         63.9    35.5   35.2       86.9    38.2   38.2   19.1  25.5    42.8 (+2.2)

Performance comparison (mIoU) on 8 semantic segmentation benchmarks using ViT-B/16.


Ablation Study

Each component contributes complementary improvements. Combined, they yield a +1.7 mIoU average gain.

ATR   SSR   SHE   mIoU   Δ
–     –     –     27.5   –
✓     ✓     –     28.6   +1.1
–     ✓     ✓     28.9   +1.4
✓     –     ✓     28.9   +1.4
✓     ✓     ✓     29.2   +1.7

Combination study of three strategies.

Model         FLOPs (G) ↓   Params (M) ↓   FPS ↑
CLIP          106.10        149.6          13.7
ResCLIP       141.34        149.6          3.0
Baseline      100.70        149.6          13.9
+ LHT-CLIP    102.65        149.6          8.2

Efficiency comparison. LHT-CLIP is 2.7× faster than ResCLIP.


Citation

@article{zhou2025improving,
  title   = {Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation},
  author  = {Zhou, Jinxin and Jiang, Jiachen and Zhu, Zhihui},
  journal = {arXiv preprint arXiv:2510.23894},
  year    = {2025},
}

Acknowledgement

The authors would like to acknowledge support from NSF grants IIS-2312840 and IIS-2402952.

This webpage was co-designed with GitHub Copilot.