Hybrid Belief–Reinforcement Learning for Sample Efficient Coordinated Spatial Exploration Under Uncertainty

Under review

Systems and Algorithms Lab, Imperial College London

HBRL Framework

Overview of the proposed HBRL framework. Phase 1 employs LGCP-based belief inference and PathMI planning to guide information-driven exploration. Phase 2 performs policy optimization using Soft Actor-Critic (SAC), warm-started via dual-channel knowledge transfer: (i) belief state initialization and (ii) replay buffer seeding with LGCP-generated trajectories.
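The replay-buffer seeding channel of the warm start can be illustrated with a minimal sketch. All names here (`ReplayBuffer`, `seed_from_lgcp`, the transition format) are hypothetical placeholders, not the paper's actual implementation: the idea is simply that Phase 1 LGCP trajectories are stored as ordinary SAC transitions, so the agent's first gradient updates already draw on informed exploration behavior.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer storing (state, action, reward, next_state) tuples."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling; clip batch size to the current buffer length.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def seed_from_lgcp(buffer, lgcp_trajectories):
    """Channel (ii): insert Phase-1 LGCP exploration trajectories as demonstrations."""
    for trajectory in lgcp_trajectories:
        for transition in trajectory:
            buffer.add(transition)

# Hypothetical Phase-1 output: two short trajectories of (s, a, r, s') tuples.
demo = [[((0, 0), "E", 1.0, (0, 1)), ((0, 1), "S", 0.5, (1, 1))],
        [((2, 2), "N", 0.8, (1, 2))]]
buf = ReplayBuffer()
seed_from_lgcp(buf, demo)
batch = buf.sample(2)  # SAC updates can sample demonstrations from step one
```

The other channel, belief state initialization, would analogously hand the converged LGCP posterior to the SAC agent as part of its observation, so both channels reuse Phase 1 computation rather than discarding it.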

Learned belief intensity at convergence (Episode 200).

Abstract

Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often suffers from poor sample efficiency when spatial priors are absent. This paper presents a hybrid belief–reinforcement learning (HBRL) framework to address this gap. In the first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI) planner with multi-step lookahead. In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent, warm-started through dual-channel knowledge transfer: belief state initialization supplies spatial uncertainty, and replay buffer seeding provides demonstration trajectories generated during LGCP exploration. A variance-normalized overlap penalty enables coordinated coverage through shared belief state, permitting cooperative sensing in high-uncertainty regions while discouraging redundant coverage in well-explored areas. The framework is evaluated on a multi-UAV wireless service provisioning task. Results show 10.8% higher cumulative reward and 38% faster convergence over baselines, with ablation studies confirming that dual-channel transfer outperforms either channel alone.
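The multi-step lookahead behind the PathMI planner can be sketched in miniature. This is a toy illustration under a simplifying assumption: candidate paths are scored by the summed posterior variance of the cells they visit, a common proxy for information gain, whereas the actual planner scores paths by pathwise mutual information under the LGCP posterior (with staleness-weighted revisitation incentives not shown here). All function names and the grid are hypothetical.

```python
from itertools import product

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def rollout(cell, actions, grid_size):
    """Apply a fixed action sequence, clipping to the grid; return visited cells."""
    path, (r, c) = [], cell
    for a in actions:
        dr, dc = MOVES[a]
        r = min(max(r + dr, 0), grid_size - 1)
        c = min(max(c + dc, 0), grid_size - 1)
        path.append((r, c))
    return path

def plan_path(cell, variance, horizon, grid_size):
    """Enumerate all action sequences of length `horizon` and pick the one whose
    visited cells carry the most posterior variance (an information-gain proxy)."""
    best_score, best_seq = float("-inf"), None
    for seq in product(MOVES, repeat=horizon):
        path = rollout(cell, seq, grid_size)
        score = sum(variance[r][c] for r, c in set(path))  # no double-counting
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq

# Hypothetical 3x3 belief-variance map: the south-east region is least explored,
# so a non-myopic planner should head toward it.
var = [[0.1, 0.1, 0.2],
       [0.1, 0.3, 0.6],
       [0.2, 0.6, 0.9]]
best = plan_path((0, 0), var, horizon=2, grid_size=3)
```

Exhaustive enumeration is exponential in the horizon; the point of the sketch is only the non-myopic scoring, i.e. that a path is evaluated as a whole rather than one greedy step at a time.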


Contributions

  • A hybrid framework combining Log-Gaussian Cox Process (LGCP) spatial modeling with Soft Actor-Critic (SAC) reinforcement learning for coordinated spatial exploration under unknown demand.
  • A dual-channel warm-start mechanism for sample-efficient learning: (i) belief initialization, which provides the RL agent with an informed prior for early policy updates, and (ii) behavioral transfer, which seeds the replay buffer with LGCP-generated exploration trajectories.
  • Uncertainty-driven Pathwise Mutual Information (PathMI) planning for non-myopic trajectory optimization during the exploration phase, extending standard informative path planning (IPP) with staleness-weighted revisitation incentives.
  • A variance-normalized overlap penalty that adapts coordination strength to local belief uncertainty, permitting cooperative sensing in high-uncertainty regions while penalizing redundant coverage.
  • Experimental evaluation shows up to 10.8% higher reward and 38% faster convergence versus baselines.
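The variance-normalized overlap penalty from the contributions above can be sketched as follows. This is an illustrative functional form, not the paper's exact expression: redundant visits to a cell are penalized in inverse proportion to the cell's posterior variance, so overlap is cheap where the belief is still uncertain and expensive where it is already confident.

```python
def overlap_penalty(visit_counts, variance, w3=1.0, eps=1e-6):
    """Illustrative variance-normalized overlap penalty (hypothetical form).

    visit_counts[cell]: number of agents currently sensing the cell.
    variance[cell]:     posterior belief variance at the cell.
    Cells covered by more than one agent incur a penalty that shrinks as
    local uncertainty grows: cooperative sensing of uncertain regions is
    tolerated, redundant coverage of well-explored cells is not.
    """
    penalty = 0.0
    for cell, n in visit_counts.items():
        if n > 1:
            penalty += w3 * (n - 1) / (variance[cell] + eps)
    return penalty

# Two agents overlapping on a confident cell vs. an uncertain cell.
var = {"a": 0.01, "b": 1.0}
print(overlap_penalty({"a": 2}, var))  # large: redundant coverage of explored cell
print(overlap_penalty({"b": 2}, var))  # small: cooperative sensing where uncertain
```

Because the shared belief state supplies the variance map, the same quantity drives both exploration (PathMI) and coordination (the penalty), which is what lets coordination strength adapt as the map becomes better known.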

Instantiation

System Model

Figure 1: Multiple mobile agents performing spatial exploration over a discretized operational area with unknown demand, modeled via a spatial belief process. The illustrated scenario instantiates the agents as UAVs; further details are provided in Section 4.


Exploration Behavior

Converged Policy (Episode 200)

HBRL (Ours)
Pure RL

Mid-Training (Episode 100)

HBRL (Ours)
Pure RL

Experimental Results

Learning Performance

Reward Comparison

Figure 4: Reward comparison between Pure LGCP, Pure RL, Behavior Cloning and HBRL frameworks.

Posterior Variance

Figure 5: Posterior variance comparison between Pure LGCP, Pure RL, and HBRL. Lower values indicate higher confidence in the inferred demand field.

Dual-Channel Transfer Ablation

Transfer Channel Ablation

Figure 6: Comparison of reward and of the number of episodes required to reach the Pure RL reward level for all three transfer-channel scenarios.

Scalability and Coordination

UAV Scaling

Figure 9: Learning performance comparison under varying numbers of UAVs. Increasing the number of agents improves overall reward but exhibits sub-linear scaling due to coordination overhead and redundant coverage.

Overlap Penalty

Figure 10: Comparison of reward under various overlap penalty scenarios.

Ablation Studies

Warm-Start Duration

Figure 7: Effect of LGCP warm-start duration on SAC training. Different warm-start lengths determine the transition point from LGCP exploration (Phase 1) to SAC optimization (Phase 2).

PathMI Horizon

Figure 8: Effect of PathMI planning horizon on final reward.

Temporal Decay

Figure 11: Impact of temporal decay on HBRL performance: (a) reward convergence and (b) belief uncertainty evolution. The dashed line denotes the warm-start transition point.

Weight Sensitivity

Weight Sensitivity

Figure 12: Reward weight sensitivity analysis. (a) Effect of exploration weight $\omega_2$ on final reward with coordination weight fixed at $\omega_3 = 1.0$. The shaded region indicates the optimal range $\omega_2 \in [0.4, 0.6]$. (b) Effect of coordination weight $\omega_3$ on final reward with exploration weight fixed at $\omega_2 = 0.4$. (c) Reward heatmap over the $(\omega_2, \omega_3)$ configuration space with $\omega_1 = 5$. The star indicates the default configuration.

Robustness to Experience Loss

Experience Loss Curves

Figure 13: Training curves for different $p_{\text{loss}}$ values.

Experience Loss Bars

Figure 14: Final Reward and Convergence vs. $p_{\text{loss}}$.

Learned Belief Intensity

Belief Intensity

Figure 17: Learned belief intensity maps at episode 100 and at convergence for HBRL and Pure RL, compared against the ground truth.

BibTeX

@article{rizvi2025hbrl,
  title   = {Hybrid Belief--Reinforcement Learning for Sample Efficient
             Coordinated Spatial Exploration Under Uncertainty},
  author  = {Rizvi, Danish and Boyle, David},
  year    = {2025},
  note    = {TBC}
}

Acknowledgements

This work was supported in part by the Commonwealth Scholarship Commission, U.K.; and in part by the Communications Hub for Empowering Distributed ClouD Computing Applications and Research (CHEDDAR) funded by U.K. Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/Y037421/1 and Grant EP/X040518/1.