Intrinsic Credit Assignment for
Long Horizon Interaction

*Equal contribution · Equal supervision

1Tübingen AI Center · 2University of Tübingen · 3MPI for Intelligent Systems · 4ELLIS Institute Tübingen

Abstract

How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method uses the change in the probability the agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based RL rewards, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, performance continues to improve as we scale test-time interactions beyond the training horizon, and interaction efficiency increases even under Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.

Main contributions overview: ΔBelief Reward enables dense credit assignment, more sample-efficient training, better generalization, and improved test-time interaction scaling.

Main contributions. We propose ΔBelief Reward, a dense reward signal based on the agent's intrinsic belief updates in long-horizon tasks. We find that (1) ΔBelief-RL leads to more sample-efficient training, (2) the trained agent generalizes better to unseen information-seeking tasks, and (3) performance scales better with an increased test-time interaction budget.

Method: Intrinsic Credit Assignment

The ΔBelief-RL framework leverages an agent's internal beliefs as an intrinsic reward signal, enabling dense credit assignment in probabilistic, long-horizon tasks. We consider a multi-turn interaction setting where an agent must uncover a latent target concept $y \in \mathcal{Y}$ through a trajectory $\tau = \{(a_1,o_1), \ldots, (a_N,o_N)\}$ of actions and observations.
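As a concrete illustration of this setting, the record below sketches one way to hold a trajectory together with its elicited beliefs; the class and field names are ours, not the authors' implementation.

```python
# Illustrative container for a multi-turn interaction: a latent target concept y
# and a trajectory of (action, observation) pairs, plus the beliefs elicited
# after each turn. Names are assumptions for exposition only.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trajectory:
    target: str                                                   # latent concept y to uncover
    steps: List[Tuple[str, str]] = field(default_factory=list)    # (a_t, o_t) pairs
    log_beliefs: List[float] = field(default_factory=list)        # log b_t after each turn

    def history(self, t: int) -> str:
        """Flatten the first t turns into a textual history h_t."""
        return "\n".join(f"Q: {a}\nA: {o}" for a, o in self.steps[:t])
```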

Agent Beliefs

We elicit the agent's internal belief by leveraging its underlying token probability distribution. Given an elicitation prompt $e$, we calculate the belief at turn $t$ as the probability the policy assigns to the target concept $y$ conditioned on the history $h_t$:

$b_t = p_\theta(y \mid h_t, e)$
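As a rough sketch of how such a belief can be elicited in practice, the snippet below scores the target string by summing the token log-probabilities the policy assigns to it after the interaction history and an elicitation prompt. The model name, prompt wording, and tokenization handling are assumptions for illustration, not the authors' exact setup.

```python
# Hedged sketch of belief elicitation: log b_t = log p_theta(target | history, elicitation prompt),
# computed by teacher-forcing the target tokens and summing their log-probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"  # placeholder; any causal LM with token-level access works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


@torch.no_grad()
def elicit_log_belief(history: str, target: str,
                      elicitation: str = "Based on the dialogue so far, the answer is") -> float:
    """Return log b_t for the target string given the history and an elicitation prompt."""
    prefix_ids = tok(history + "\n" + elicitation, return_tensors="pt").input_ids
    # Note: splitting prefix/target like this is an approximation, since tokenizers
    # are not guaranteed to concatenate cleanly across the boundary.
    target_ids = tok(" " + target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Position i of the shifted log-probs predicts input token i+1.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_positions = logprobs[:, prefix_ids.shape[1] - 1:, :]
    token_lp = target_positions.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()
```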

ΔBelief Reward: Belief Change Signal

By tracking $b_t$ across the trajectory, we quantify how each interaction resolves uncertainty, shifting the model's internal "world view" towards the correct solution. The per-turn belief change is:

$\Delta\text{Belief}_t = \log \frac{b_t}{b_{t-1}} = \log b_t - \log b_{t-1}$

This dense, turn-level signal reinforces actions that lead to the most informative updates to the agent's internal world view. Using log-probabilities ensures numerical stability during training.
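In code, the signal reduces to a difference of consecutive log-beliefs; a minimal sketch:

```python
# Per-turn DeltaBelief_t = log b_t - log b_{t-1}, computed directly in log space.
def delta_beliefs(log_beliefs: list) -> list:
    """log_beliefs = [log b_0, ..., log b_N]; returns [DeltaBelief_1, ..., DeltaBelief_N]."""
    return [curr - prev for prev, curr in zip(log_beliefs, log_beliefs[1:])]
```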

Belief updates over time showing correlation with task success

Belief updates. Per-turn beliefs about the ground-truth concept for Qwen3 (1.7B and 4B), split by final outcome. Trajectories were generated by DeepSeek-V3.2. Beliefs increase steadily on average, and the rate of growth correlates strongly with the final outcome of the trajectory.

Validating the ΔBelief Measurement

We confirm that optimizing for ΔBelief leads to better task success using a best-of-$n$ sampling intervention. At each turn, we sample $n=8$ candidate questions, simulate the environment's response for each, and select the action that maximizes the immediate belief change:

$a_t \leftarrow \arg\max_{a \in \{a_{t,1},\ldots,a_{t,n}\}} \; \Delta\text{Belief}(a)$

Best-of-8 sampling with ΔBelief shows significant performance improvements

Best-of-8 sampling with ΔBelief. Success rate on the 20 Questions task for baseline models after SFT. Sampling 8 questions at every turn and selecting the one that maximizes ΔBelief leads to a significant rise in performance across all model sizes.
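A minimal sketch of this best-of-$n$ intervention is shown below. The question sampler and environment simulator are passed in as callables because their concrete form (prompting, simulator) is not specified here and is assumed; the belief scorer can be the sketch from the previous section.

```python
# Best-of-n selection: simulate each candidate question and keep the one whose
# answer maximizes the immediate DeltaBelief. Helper callables are assumptions.
from typing import Callable, List, Tuple


def best_of_n_step(history: str,
                   target: str,
                   sample_questions: Callable[[str, int], List[str]],  # policy proposals (assumed helper)
                   simulate_answer: Callable[[str, str], str],         # environment simulator (assumed helper)
                   elicit_log_belief: Callable[[str, str], float],     # e.g. the earlier sketch
                   n: int = 8) -> Tuple[str, str]:
    """Return the (question, answer) pair with the largest immediate belief change."""
    prev_lb = elicit_log_belief(history, target)
    best_q, best_a, best_delta = None, None, float("-inf")
    for q in sample_questions(history, n):                    # n candidate actions a_{t,k}
        a = simulate_answer(history, q)                       # simulated environment response
        new_lb = elicit_log_belief(f"{history}\nQ: {q}\nA: {a}", target)
        if new_lb - prev_lb > best_delta:                     # immediate DeltaBelief of this candidate
            best_q, best_a, best_delta = q, a, new_lb - prev_lb
    return best_q, best_a
```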

Training with Reinforcement Learning

We augment the sparse outcome reward with our dense intrinsic signal at every turn $t$. The per-turn reward is:

$r_t = \underbrace{r^{\mathrm{eog}}}_{\text{trajectory outcome}} + \underbrace{\lambda\, \max(\Delta\text{Belief}_t, 0)}_{\text{intrinsic exploration}} + \underbrace{r_p}_{\text{efficiency penalty}}$

where $\lambda$ scales the belief-based reward. We clip the intrinsic reward at zero, ensuring agents are rewarded for increasing belief in the correct concept without being penalized for temporary decreases in confidence. We use Turn-wise GRPO, computing advantages at the turn level rather than applying the same reward across the entire trajectory.
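The snippet below sketches the per-turn reward and one plausible turn-wise, group-normalized advantage computation in the spirit of GRPO; the exact normalization recipe used for Turn-wise GRPO is an assumption here, not the authors' implementation.

```python
# Sketch of the per-turn reward r_t and turn-level group normalization.
import numpy as np
from typing import List


def turn_reward(delta_belief: float, r_eog: float, r_p: float, lam: float = 1.0) -> float:
    """r_t = trajectory outcome + lam * max(DeltaBelief_t, 0) + efficiency penalty."""
    return r_eog + lam * max(delta_belief, 0.0) + r_p


def turnwise_advantages(group_rewards: List[List[float]]) -> List[np.ndarray]:
    """Normalize rewards at each turn index across the group of sampled trajectories.

    group_rewards[i][t] is the reward of trajectory i at turn t; trajectories may
    differ in length, so each turn is normalized over the trajectories that reach it.
    """
    advantages = [np.zeros(len(r)) for r in group_rewards]
    for t in range(max(len(r) for r in group_rewards)):
        idx = [i for i, r in enumerate(group_rewards) if t < len(r)]
        vals = np.array([group_rewards[i][t] for i in idx])
        mean, std = vals.mean(), vals.std() + 1e-6
        for i in idx:
            advantages[i][t] = (group_rewards[i][t] - mean) / std
    return advantages
```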

Results

ΔBelief-RL Training Improves Interaction-Efficiency

Success rate on the test set. Mean@8 and Pass@8[4] success rates of the SFT Baseline, StarPO[1], and CIA (ours) on a held-out test set. Across both model sizes, CIA consistently outperforms all baselines.

| Method | Mean@8 ± std | Pass@8 |
| --- | --- | --- |
| Baseline (1.7B) | 9.97% ± 1.04% | 32.03% |
| StarPO (1.7B) | 16.54% ± 1.32% | 45.73% |
| CIA (ours, 1.7B) | 24.80% ± 1.10% | 53.10% |
| Baseline (4B) | 13.34% ± 1.05% | 36.87% |
| StarPO (4B) | 24.36% ± 1.18% | 59.12% |
| CIA (ours, 4B) | 33.72% ± 1.26% | 63.97% |
| DeepSeek-V3.2 | 14.35% ± 0.87% | 47.34% |
| Qwen3-235B-A22B-Instr. | 8.83% ± 0.87% | 27.71% |

Our CIA-1.7B model outperforms DeepSeek-V3.2 (670B) by 10.45 percentage points, and our CIA-4B model outperforms it by 19.37 points, despite using over 99% fewer parameters. This supports our hypothesis that specialized training is essential for active information-seeking, where web-scale pretraining alone falls short.

Mean number of questions and fraction of repeated questions per episode during training

Training dynamics. Mean number of questions per episode (top) and mean fraction of repeated questions (bottom) during RL training. Across both Qwen3-1.7B and Qwen3-4B, ΔBelief-RL reduces the number of turns required and suppresses redundant queries more rapidly than standard GRPO (StarPO[1]).

ΔBelief-RL Leads to Faster Belief Updates

Belief update dynamics for 1.7B and 4B models

Belief-update dynamics. Normalized elicited log-probability of the correct concept as a function of the number of interactions for 1.7B models (top) and 4B models (bottom). At the 4B scale, CIA shows the largest and most sustained increase in belief updates, while StarPO remains close to the SFT baseline.

ΔBelief-RL Improves Pass@k Success Rates

Pass@k results for 1.7B and 4B models

Pass@k on Twenty Questions. CIA consistently achieves higher Pass@k[4] than the Baseline across the entire range of $k$ (up to 128), confirming that the gains extend beyond simple policy sharpening to genuinely improved exploration.

Generalization and Applications

Test-Time Interaction Scaling

During training, episodes are limited to 20 turns. We investigate whether information-seeking capabilities generalize to longer interactions by increasing the budget to 50 turns.

Test-time interaction scaling for 1.7B and 4B models

Test-time interaction scaling. Average success rate as a function of the interaction budget (up to 50 turns) for 1.7B models (top) and 4B models (bottom). CIA exhibits larger performance gains and continues to improve as the interaction budget extends beyond the 20-turn training regime. At the 4B scale, the absolute gain in success rate from 20 to 50 turns is twice as large for CIA (+26%) as for StarPO (+13%).

Out-of-Distribution Generalization

We evaluate OOD generalization using Guess My City and Murder Mystery[2], two information-seeking tasks not seen during training.

Out-of-distribution generalization results

OOD generalization. Pass@k[4] on two out-of-distribution benchmarks (Guess My City and Murder Mystery[2]). Dark bars show Pass@1; lighter stacked increments show Pass@8. CIA exhibits robust generalization, consistently surpassing SFT and StarPO[1] baselines across all tested environments.

Practical Applications

We further test generalization on settings that mirror practical applications: User Personalization[3] (multi-turn preference elicitation) and Customer Service[2] (identifying broken functionality through diagnostic questioning).

Customer Service benchmark results
User Personalization benchmark results

Practical applications. Mean scores on the OOD Customer Service[2] (left) and User Personalization[3] (right) benchmarks. Across both model sizes, CIA strongly outperforms the previous SoTA method, StarPO[1]: on Customer Service, CIA beats StarPO by 5.06% (1.7B) and 11.13% (4B), and on User Personalization it offers improvements of up to 15% over existing methods.

Conclusion

We demonstrate that an agent's internal belief shifts can effectively guide learning in long-horizon tasks. By providing fine-grained credit assignment for intermediate actions, ΔBelief-RL significantly enhances learning efficiency in a compute-efficient manner, measuring the contribution of individual actions without a separate critic or reward model.

Trained on the 20 Questions task, our CIA models at the 1.7B–4B scale not only outperform prior state-of-the-art methods for multi-turn training, but also much larger models such as DeepSeek-V3.2. Notably, improved performance generalizes to extended interaction horizons and diverse out-of-distribution applications including customer service and user personalization.

Much like humans evaluate actions internally by judging progress towards their goals, future advances in belief calibration may allow agents to learn increasingly from intrinsic signals. Because ΔBelief-RL is agnostic to the underlying RL algorithm, it offers a versatile framework for continued exploration across diverse domains and architectures.

BibTeX

@article{ica2026,
  title  = {Intrinsic Credit Assignment for Long Horizon Interaction},
  author = {Auzina, Ilze Amanda and Strüber, Joschka and Hernández-Gutiérrez, Sergio and Goel, Shashwat and Prabhu, Ameya and Bethge, Matthias},
  year   = {2026}
}

References

[1] Zihan Wang, Kangrui Wang, Qineng Wang, et al. RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2504.20073, 2025. [arXiv]
[2] Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J Zico Kolter, Jeff Schneider, and Russ Salakhutdinov. Training a Generally Curious Agent. International Conference on Machine Learning (ICML), 2025. [OpenReview]
[3] Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah Goodman. STaR-GATE: Teaching Language Models to Ask Clarifying Questions. First Conference on Language Modeling (COLM), 2024. [OpenReview]
[4] Mark Chen, Jerry Tworek, Heewoo Jun, et al. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021. [arXiv]