✨ TL;DR
ProtoCLIP refines CLIP-style vision-language models for chest X-ray classification by using curated training data and prototype-aligned distillation to reduce co-occurrence bias and improve zero-shot performance. The method achieves 2-10 percentage point AUC improvements over baseline CLIP on unseen chest X-ray datasets without large-scale retraining.
Zero-shot vision-language models like CLIP show promise for chest X-ray classification but suffer from three key limitations: confounding label co-occurrence (where certain pathologies frequently appear together, causing the model to confuse them), long-tail class imbalance (rare pathologies are underrepresented), and transfer instability under domain shift (performance degrades when applied to new datasets from different sources). These issues are particularly problematic in medical imaging where accurate discrimination between co-occurring pathologies is clinically critical, and models must generalize reliably to new hospital systems and imaging protocols.
ProtoCLIP introduces a refinement strategy with two main components. First, it constructs pathology-focused training subsets with carefully curated negative samples, ensuring the model learns to distinguish between frequently co-occurring conditions rather than exploiting their co-occurrence statistics. Second, it employs a representation-preserving distillation objective that uses prototype anchors to guide adaptation. This distillation term stabilizes fine-tuning and preserves the semantic structure learned during pre-training, while the curated negatives sharpen discrimination of clinically relevant co-occurring pathologies. The method is designed to work without large-scale retraining of the base model.
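To make the prototype-anchoring idea concrete, here is a minimal sketch of how such a distillation term could be computed. This is an illustrative reconstruction, not ProtoCLIP's actual implementation: the function names (`class_prototypes`, `prototype_anchor_loss`), the choice of mean frozen-teacher embeddings as prototypes, and the cosine-distance penalty are all assumptions made for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale rows to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_prototypes(teacher_emb, labels, num_classes):
    """Hypothetical prototype construction: the L2-normalized mean of the
    frozen (pre-trained) teacher's embeddings for each pathology class."""
    dim = teacher_emb.shape[1]
    protos = np.zeros((num_classes, dim))
    for c in range(num_classes):
        protos[c] = teacher_emb[labels == c].mean(axis=0)
    return l2_normalize(protos)

def prototype_anchor_loss(student_emb, labels, protos):
    """Assumed anchoring penalty: mean cosine distance between each
    fine-tuned (student) embedding and its class prototype. Adding this
    to the fine-tuning objective discourages the student from drifting
    away from the teacher's semantic structure."""
    s = l2_normalize(student_emb)
    sims = np.sum(s * protos[labels], axis=1)  # per-sample cosine similarity
    return float(np.mean(1.0 - sims))
```

In this sketch, the loss is zero when student embeddings coincide with their class prototypes and grows as they drift, so it acts as a regularizer alongside whatever contrastive or classification loss drives the adaptation itself.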