✨ TL;DR
This paper proposes CmIR, a causal inference framework that separates multimodal data into stable causal features and spurious environment-specific features to improve robustness in affective computing. The method achieves state-of-the-art performance, especially on out-of-distribution and noisy data.
Current multimodal affective computing models, which predict human sentiment, emotion, and intention from language, acoustic, and visual inputs, tend to learn spurious correlations. These correlations undermine generalization under distribution shifts or noisy modalities: the models fail to distinguish stable causal relationships that transfer across environments from environment-specific patterns that are unreliable.
The paper introduces CmIR (Causal modality-Invariant Representation), a framework that disentangles each modality into two components: causal invariant representations, which maintain stable predictive relationships with labels across environments, and environment-specific spurious representations. The disentanglement, grounded in causal inference theory, is enforced by three constraints: an invariance constraint that keeps the causal representation's predictive relationship stable across environments, a mutual information constraint that preserves label-relevant information, and a reconstruction constraint that ensures the two components jointly retain sufficient information from the raw inputs.
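To make the three constraints concrete, here is a minimal NumPy sketch of how such a training objective could be assembled. Everything here is illustrative, not the paper's actual implementation: the linear encoder/decoder, the toy labels, and the specific surrogates (a REx-style variance penalty for invariance, cross-entropy as a stand-in for the mutual information term, and MSE for reconstruction) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes, not from the paper.
d_in, d_c, d_s = 8, 4, 4            # input, causal, and spurious dims
n_envs, n_per_env = 3, 16           # number of environments, samples each

# Stand-ins for learned networks: a linear encoder whose output is split
# into a causal half and a spurious half, a linear decoder, and a
# classifier that sees only the causal half.
W_enc = rng.normal(size=(d_in, d_c + d_s))
W_dec = rng.normal(size=(d_c + d_s, d_in)) * 0.1
w_clf = rng.normal(size=d_c)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

env_risks = []
recon_loss = 0.0
for e in range(n_envs):
    x = rng.normal(size=(n_per_env, d_in))   # toy per-environment data
    y = (x[:, 0] > 0).astype(float)          # toy binary labels

    h = x @ W_enc
    z_c, z_s = h[:, :d_c], h[:, d_c:]        # disentangled components

    # Mutual-information constraint: cross-entropy of the label given z_c,
    # a common surrogate for maximising I(z_c; y).
    p = sigmoid(z_c @ w_clf)
    ce = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
    env_risks.append(ce)

    # Reconstruction constraint: z_c and z_s together must retain
    # enough information to rebuild the raw input.
    x_hat = np.concatenate([z_c, z_s], axis=1) @ W_dec
    recon_loss += np.mean((x_hat - x) ** 2) / n_envs

# Invariance constraint: penalise the variance of per-environment risks
# (a REx-style surrogate; the paper's exact formulation may differ).
env_risks = np.array(env_risks)
inv_loss = env_risks.var()
total = env_risks.mean() + 1.0 * inv_loss + 1.0 * recon_loss
print(float(total))
```

In practice each modality would get its own encoder and the weights on the invariance and reconstruction terms would be tuned hyperparameters; the point of the sketch is only the shape of the combined objective.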