✨ TL;DR
FUSE is a method that combines multiple imperfect AI verifiers to better judge model outputs without needing any labeled training data. It matches or beats semi-supervised methods across diverse benchmarks by controlling how verifiers depend on each other using spectral algorithms.
Verifying whether large language model outputs are correct is critical for both training and deployment, but obtaining ground truth labels is expensive and time-consuming. In practice, LLM judges and reward models serve as verifiers, but individually they are imperfect and unreliable. While ensembling multiple verifiers could improve performance, existing ensemble methods typically require labeled data to learn how to combine verifiers effectively. This creates a chicken-and-egg problem: you need labels to build better verifiers, but the whole point of automated verification is to avoid manual labeling.
FUSE (Fully Unsupervised Score Ensembling) combines multiple verifiers without any ground truth labels by leveraging spectral algorithms from the ensembling literature. The key innovation is controlling conditional dependencies between verifiers to improve unsupervised ensemble performance. Rather than learning weights from labeled data, FUSE uses the statistical properties of how verifiers agree and disagree with each other to infer which combinations are most reliable. The method works with diverse types of verifiers including LLM judges and reward models, and can be applied at test time to improve verification quality for any generator model.
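The post doesn't include code, but the core spectral idea from the ensembling literature can be sketched. Under the (simplifying) assumption that verifiers are conditionally independent given the true label, the off-diagonal of the covariance matrix of their ±1 votes is rank one, so its leading eigenvector recovers each verifier's balanced accuracy up to scale, with no labels needed. All names and numbers below are hypothetical; FUSE's actual contribution is handling the case where this independence assumption is violated, which this minimal sketch does not model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 conditionally independent verifiers with unknown
# accuracies, voting +1/-1 on 2000 items whose true labels are hidden.
accs = np.array([0.60, 0.65, 0.70, 0.75, 0.80])
n_items = 2000
labels = rng.choice([-1, 1], size=n_items)

# Each verifier agrees with the truth with probability accs[i].
correct = rng.random((len(accs), n_items)) < accs[:, None]
votes = np.where(correct, labels, -labels)  # shape (K, N)

# Spectral step: off-diagonal covariance Q_ij ≈ (2p_i - 1)(2p_j - 1) is
# rank one, so the leading eigenvector of Q estimates each verifier's
# balanced accuracy (up to scale) without any ground truth labels.
Q = np.cov(votes)
eigvals, eigvecs = np.linalg.eigh(Q)     # ascending eigenvalue order
v = eigvecs[:, -1]                       # eigenvector of largest eigenvalue
v = v if v.sum() > 0 else -v             # resolve the sign ambiguity

# Weighted vote: more reliable verifiers get larger weights.
ensemble = np.sign(votes.T @ v)

acc_ensemble = np.mean(ensemble == labels)
acc_best_single = np.mean(votes[-1] == labels)
print(f"best single verifier: {acc_best_single:.3f}")
print(f"spectral ensemble:    {acc_ensemble:.3f}")
```

With these synthetic accuracies, the labelless ensemble typically outperforms even the strongest individual verifier, because agreement statistics alone reveal which verifiers to trust.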