Vectorized Online POMDP Planning (VOPP)
Marcus Hoerger, Muhammad Sudrajat, Hanna Kurniawati
Vectorized Online POMDP Planner (VOPP) is a fully vectorized online POMDP solver that performs planning entirely through batched tensor operations. The Partially Observable Markov Decision Process (POMDP) is a powerful framework for sequential decision-making under uncertainty, capturing the stochastic effects of actions and the noisy, partial information available to autonomous robots. POMDP solving could benefit enormously from massive parallelization on modern multicore hardware, but parallelizing online POMDP solvers has remained difficult: most approaches interleave numerical optimization to find actions with the highest expected total reward and estimation of expected total rewards themselves. This creates dependencies and synchronization bottlenecks during planning that can quickly diminish the gains of parallelization.
VOPP takes a fundamentally different approach. It builds on a recent POMDP formulation — Partially Observable Reference Policy Programming (PORPP) — which introduces analytical value functions. This eliminates the need for numerical optimization during planning, leaving only the estimation of expectations to be performed numerically, which can be parallelized efficiently. VOPP exploits this by implementing all online planning operations as fully vectorized computations over a tensor-based planning data structure. The result is a massively parallel online POMDP solver that fully harnesses the immense data-parallel throughput of modern multicore hardware, such as GPUs. In practice, VOPP computes policies using tens of thousands of parallel simulations with no explicit synchronization between simulations required.
Experimental results on three POMDP benchmark problems indicate that VOPP is at least 20× more efficient in computing near-optimal policies compared to a state-of-the-art parallel online POMDP solver. For some benchmarks, VOPP is more than 100× faster. Furthermore, VOPP outperforms state-of-the-art sequential online solvers while using a planning budget that is 1000× smaller.
Key Idea: Planning as Batched Tensor Computations
Similar to most online POMDP solvers, VOPP performs guided belief space sampling followed by backup operations to construct a belief tree and evaluate different action sequences. However, unlike existing solvers, VOPP represents a belief tree as a collection of tensors $\{B, A, \Psi\}$ with belief node tensor $B$, action node tensor $A$ and action-preference tensor $\Psi$. Using this tensor representation, VOPP iterates the following steps during planning:
- Vectorized forward search: Starting from the currently belief, VOPP performs tens of thousands of batched forward simulations in parallel to expand the belief tree
- Vectorized preference backup: Starting from all leaf nodes, VOPP traverses the belief tree back to the root, updating the action preferences and belief values of all nodes at each depth using batched tensor operations.
Benchmark Problems
We evaluated VOPP on three challenging planning-under-uncertainty problems:
- Navigation in a partially known map — a robot navigates a 13×13 grid with randomly placed obstacles visible only through noisy local sensing. State space size: 169×2124.
- Multi-Agent Rocksample (MARS) — two cooperative rovers on a n×n map sample good rocks while avoiding bad ones. The quality of rocks must be inferred through noisy sensor readings. MARS(50,50) has an action space of 3025 actions and a state space too large for HyP-DESPOT and POMCP to handle.
- CrowdNav — a Stretch 3 mobile robot navigates a 50×40m conference hall populated by 300 people whose behavior (curious or shy) is initially unknown and must be inferred online.
Results
VOPP was compared against HyP-DESPOT (the state-of-the-art parallel online solver), DESPOT, and POMCP on MARS and Navigation, across planning budgets of 0.01s to 1.0s per step.
- On Navigation, VOPP achieves a success rate of 94% with an average of 19.8 steps to reach the goal, compared to HyP-DESPOT's 77% success rate and 26.8 steps at a 1s/step budget.
- On MARS(20,20), VOPP achieves an average total discounted reward of 58.8$\pm$2.1 at 1s/step, compared to 47.9$\pm$1.6 for HyP-DESPOT.
- On MARS(50,50) (3025 actions), VOPP achieves 45.1$\pm$2.0 at 1s/step. HyP-DESPOT, DESPOT, and POMCP could not run this variant (implementation crashed for sizes larger than MARS(36,36)).
- At a planning budget of 0.05s/step, VOPP achieves a better result than HyP-DESPOT running at 1s/step — demonstrating an efficiency gain of at least 20×. (At a 0.01s/step budget, VOPP achieves ~64% of what HyP-DESPOT achieves with 1s/step ).
- Against sequential solvers (DESPOT, POMCP) running at 10s/step, VOPP with 0.01s/step still achieves better results — a 1000× smaller planning budget.