✨ TL;DR
This paper investigates how well small language models can learn reasoning tasks through reinforcement learning when training data and compute are limited. The study finds that mixing easy and hard problems during training provides up to 5x better sample efficiency than training on easy problems alone.
Fine-tuning large language models typically requires massive amounts of high-quality annotated data and substantial computational resources, especially when using Reinforcement Learning with Verifiable Rewards (RLVR) to improve reasoning capabilities. While prior research has shown benefits from scaling both data and compute for RLVR, these approaches are impractical in many real-world settings where organizations face constraints on both annotated data and compute. There is a critical need to understand how models can be trained effectively under such limits, yet systematic studies of RLVR performance in low-data regimes are lacking.
The researchers conducted a comprehensive empirical study using open-source Small Language Models (SLMs) trained with RLVR across three procedurally generated datasets: number counting problems, graph reasoning tasks, and spatial reasoning challenges. They systematically varied dataset properties, including size, diversity, and complexity, to characterize how these factors affect model performance in low-data settings. Procedural generation allowed precise control over task difficulty and dataset composition, enabling fine-grained analysis of how models trained on tasks of varying complexity generalize to new problems. The study specifically compared training strategies, including single-complexity versus mixed-complexity training, to identify optimal data utilization patterns.
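To make the setup concrete, here is a minimal sketch of how a procedurally generated, verifiable task family might be built with a tunable complexity knob, and how a single-complexity versus a mixed-complexity training set could be composed. The generator, field names, and the length-scaling rule are all hypothetical illustrations, not the paper's actual data pipeline:

```python
import random

def make_counting_problem(complexity, rng):
    """Hypothetical generator for a number-counting task: count occurrences
    of a target character in a random string. The answer is programmatically
    known, so an exact-match check can serve as the verifiable reward."""
    length = 5 * complexity  # assumed rule: harder problems -> longer sequences
    tokens = [rng.choice("abc") for _ in range(length)]
    target = rng.choice("abc")
    return {
        "question": f"How many '{target}' appear in {''.join(tokens)}?",
        "answer": tokens.count(target),  # ground truth for reward checking
        "complexity": complexity,
    }

def build_dataset(size, complexities, seed=0):
    """Sample each problem's complexity uniformly from the given levels.
    Passing one level yields single-complexity training data; passing a
    range yields mixed-complexity data."""
    rng = random.Random(seed)
    return [make_counting_problem(rng.choice(complexities), rng)
            for _ in range(size)]

# Single-complexity baseline vs. the mixed easy-and-hard strategy
easy_only = build_dataset(1000, complexities=[1])
mixed = build_dataset(1000, complexities=[1, 2, 3, 4])
```

Because every example carries its exact answer, reward computation during RLVR reduces to string or integer comparison, which is what makes such tasks "verifiable" and lets the complexity mix be varied independently of everything else.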