✨ TL;DR
This paper develops a simulation-based Bayesian inference method that corrects for selection bias by embedding the selection mechanism directly into the generative model, enabling accurate parameter estimation in complex models where traditional likelihood-based approaches fail. The approach supports both debiased estimation and explicit testing for the presence of selection bias, without requiring a tractable likelihood.
Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities under study, systematically distorting parameter estimates and uncertainty quantification. It is common in epidemiological studies and surveys, where individuals with certain characteristics are more likely to be sampled. Classical corrections such as inverse-probability weighting or explicit likelihood-based selection models require tractable likelihoods, which severely limits their use in complex models with latent dynamics or high-dimensional structure. Existing simulation-based inference methods enable Bayesian analysis without tractable likelihoods, but they typically assume data are missing at random and therefore fail when selection depends on unobserved outcomes or covariates. This creates a critical gap: the complex stochastic models that would benefit most from simulation-based inference are precisely the settings where selection bias is hardest to address with traditional methods, leaving researchers without practical tools for obtaining unbiased estimates.
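To make the problem concrete, here is a minimal toy sketch (not from the paper) of outcome-dependent selection and the classical inverse-probability-weighting fix. Everything here is a hypothetical illustration: we estimate the mean of a standard normal population when larger values are more likely to be observed, and IPW works only because we assume the selection probability is known in closed form, which is exactly what complex models lack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: standard normal, true mean = 0.
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Selection mechanism: inclusion probability depends on the outcome itself
# (larger values are more likely to be sampled).
p_select = 1.0 / (1.0 + np.exp(-population))
observed = population[rng.random(population.size) < p_select]

# Naive estimate from the selected sample is biased upward.
naive_mean = observed.mean()

# Inverse-probability weighting: weight each observed unit by 1/p.
# This assumes the selection probability is tractable and known --
# the assumption that breaks down in complex simulator-based models.
weights = 1.0 + np.exp(-observed)          # = 1 / p_select for observed units
ipw_mean = np.average(observed, weights=weights)

print(f"naive: {naive_mean:.3f}, IPW: {ipw_mean:.3f}")  # IPW lands near 0
```

The naive mean is noticeably positive, while the weighted (Hájek) estimate recovers the true mean; the entire correction hinges on the closed-form `p_select`.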
The authors develop a bias-aware simulation-based inference framework that explicitly incorporates the selection mechanism into neural posterior estimation. The key innovation is embedding the selection process directly into the generative simulator rather than treating it as a separate correction step. By simulating both the underlying data-generating process and the selection mechanism that determines which observations enter the dataset, the approach recasts selection-bias correction as part of the simulation problem itself. Neural networks are then trained on this selection-aware simulated data to approximate posterior distributions, yielding amortized Bayesian inference without a tractable likelihood. The framework integrates diagnostic tools to detect discrepancies between simulated and observed data distributions and to assess posterior calibration. Importantly, it lets researchers explicitly test for the presence of selection bias by comparing models with and without selection mechanisms, providing both debiased parameter estimates and evidence about whether bias correction is necessary.
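The core idea of embedding selection in the simulator can be sketched in a few lines. This is a hypothetical toy, not the paper's implementation: the model, summary statistics, and prior range are invented, and a crude rejection-style match to simulated summaries stands in for the neural density estimator. Because the simulator already applies the selection step, the training pairs reflect the same bias as the observed data, so inference on the biased summaries recovers the underlying parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(theta, n=200):
    """Hypothetical generative model WITH the selection mechanism embedded:
    draw data given theta, then keep each point with an outcome-dependent
    inclusion probability (self-selection)."""
    x = rng.normal(loc=theta, scale=1.0, size=n)
    p_select = 1.0 / (1.0 + np.exp(-x))     # selection depends on x itself
    return x[rng.random(n) < p_select]

def summarize(x, n=200):
    # Summaries of the *selected* sample, as one would see in real data.
    return np.array([x.mean(), x.std(), len(x) / n])

# Training set for an amortized posterior estimator: pairs (theta_i, s_i)
# drawn from the selection-aware simulator and a uniform prior.
thetas = rng.uniform(-2.0, 2.0, size=5000)
summaries = np.stack([summarize(simulator(t)) for t in thetas])

# Stand-in for neural posterior estimation: keep the parameter draws whose
# simulated summaries best match the "observed" (biased) summary.
theta_true = 0.5
s_obs = summarize(simulator(theta_true))
dist = np.linalg.norm(summaries - s_obs, axis=1)
posterior_draws = thetas[np.argsort(dist)[:100]]
print("posterior mean ~", posterior_draws.mean())
```

The same machinery supports the paper's bias test in spirit: rerunning inference with the selection step removed from `simulator` and comparing fits indicates whether the correction matters.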