Coordinating multiple autonomous agents to explore and serve spatially heterogeneous
demand requires jointly learning unknown spatial patterns and planning trajectories that
maximize task performance. Pure model-based approaches provide structured uncertainty
estimates but lack adaptive policy learning, while deep reinforcement learning often
suffers from poor sample efficiency when spatial priors are absent. This paper presents a
hybrid belief–reinforcement learning (HBRL) framework to address this gap. In the
first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and
execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI)
planner with multi-step lookahead.
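As a concrete illustration of phase one, the sketch below (not the paper's implementation) maintains an independent-cell Gaussian belief over the LGCP log-intensity and runs a depth-limited lookahead that scores candidate paths by a per-cell entropy-reduction proxy for pathwise mutual information; the grid size, horizon, and noise level are placeholder assumptions.

```python
import numpy as np

# Minimal sketch, assuming a grid world and independent-cell Gaussian
# marginals over the LGCP log-intensity. The paper's PathMI planner uses
# pathwise mutual information; here the information gain of visiting a
# cell is approximated by the entropy reduction of its Gaussian marginal.

GRID = 10          # grid side length (assumption)
HORIZON = 3        # lookahead depth (assumption)
OBS_NOISE = 0.25   # effective observation noise variance (assumption)

mu = np.zeros((GRID, GRID))   # posterior mean of log-intensity
var = np.ones((GRID, GRID))   # posterior variance of log-intensity

def update_belief(cell, count):
    """Gaussian pseudo-observation update at a visited cell.
    A real LGCP would use Laplace/variational inference on Poisson
    counts; this conjugate-style update is an illustrative stand-in."""
    i, j = cell
    y = np.log1p(count)                      # crude log-count observation
    k = var[i, j] / (var[i, j] + OBS_NOISE)  # Kalman-style gain
    mu[i, j] += k * (y - mu[i, j])
    var[i, j] *= (1.0 - k)

def info_gain(cell, v):
    """Entropy reduction of one Gaussian marginal (MI proxy)."""
    return 0.5 * np.log(1.0 + v[cell] / OBS_NOISE)

def neighbors(cell):
    i, j = cell
    steps = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    return [(i + di, j + dj) for di, dj in steps
            if 0 <= i + di < GRID and 0 <= j + dj < GRID]

def plan_path(start, horizon=HORIZON):
    """Depth-limited search over paths, scored by summed MI proxy, with
    fantasy variance shrinkage so revisits are not double-counted."""
    best_path, best_score = [], -np.inf
    def recurse(cell, v, path, score):
        nonlocal best_path, best_score
        if len(path) == horizon:
            if score > best_score:
                best_path, best_score = list(path), score
            return
        for nxt in neighbors(cell):
            g = info_gain(nxt, v)
            v2 = v.copy()
            v2[nxt] *= OBS_NOISE / (v2[nxt] + OBS_NOISE)  # posterior shrink
            recurse(nxt, v2, path + [nxt], score + g)
    recurse(start, var.copy(), [], 0.0)
    return best_path
```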
In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent,
warm-started through dual-channel knowledge transfer: belief-state initialization supplies
the learned spatial uncertainty, and replay-buffer seeding provides demonstration
trajectories generated during LGCP exploration.
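The two transfer channels can be sketched as follows (illustrative only; names such as make_observation and ReplayBuffer are assumptions, not the paper's API): the belief mean and variance maps are folded into the SAC observation, and transitions logged under the PathMI planner pre-populate the replay buffer.

```python
import numpy as np

# Channel 1: belief-state initialization. The SAC observation is
# augmented with the LGCP belief, so the policy starts from the spatial
# uncertainty accumulated in phase one.
def make_observation(agent_pos, belief_mu, belief_var):
    """Flatten agent position plus belief mean/variance into one vector."""
    return np.concatenate([np.asarray(agent_pos, dtype=np.float32),
                           belief_mu.ravel().astype(np.float32),
                           belief_var.ravel().astype(np.float32)])

class ReplayBuffer:
    """Tiny FIFO replay buffer used by the off-policy learner (sketch)."""
    def __init__(self, capacity=100_000):
        self.capacity, self.data = capacity, []
    def add(self, obs, act, rew, next_obs, done):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append((obs, act, rew, next_obs, done))
    def sample(self, batch_size, rng=np.random):
        idx = rng.choice(len(self.data), size=batch_size)
        return [self.data[i] for i in idx]

# Channel 2: replay-buffer seeding. exploration_log holds
# (obs, act, rew, next_obs, done) tuples recorded while the PathMI
# planner controlled the agents in phase one.
def seed_from_exploration(buffer, exploration_log):
    for transition in exploration_log:
        buffer.add(*transition)
```

Seeding in this way lets the critics bootstrap from informative, already-visited states rather than from purely random initial experience, which is one plausible mechanism for the faster convergence reported below.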
A variance-normalized overlap penalty enables coordinated coverage through a shared belief
state, permitting cooperative sensing in high-uncertainty regions while discouraging
redundant coverage in well-explored areas.
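One plausible form of such a penalty, assuming pairwise coverage overlap is normalized by the shared belief's posterior variance, is sketched below; the exact expression in the paper may differ.

```python
import numpy as np

# Illustrative sketch of a variance-normalized overlap penalty (assumed
# form). Agents overlapping in a cell are penalized in proportion to how
# well that cell is already explored: low belief variance -> strong
# penalty (redundant coverage); high variance -> weak penalty
# (cooperative sensing is still informative).
def overlap_penalty(coverage, belief_var, eps=1e-6, weight=1.0):
    """coverage: (n_agents, H, W) binary masks of each agent's footprint.
    belief_var: (H, W) posterior variance of the shared LGCP belief."""
    multi = coverage.sum(axis=0)              # agents covering each cell
    overlap = np.clip(multi - 1, 0, None)     # redundant visits per cell
    norm = 1.0 / (belief_var + eps)           # low variance -> large penalty
    return -weight * float((overlap * norm).sum())
```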
The framework is evaluated on a multi-UAV wireless service provisioning task. Results show
a 10.8% higher cumulative reward and 38% faster convergence than baselines, with ablation
studies confirming that dual-channel transfer outperforms either channel alone.