Sunday, May 4, 2026
Today's batch reveals a robotics community intensely focused on making Vision-Language-Action models practical for real deployment. MolmoAct2 sets a new bar as a fully open VLA with flow-matching action experts, adaptive-depth reasoning, and the largest open bimanual dataset (720 hours), while Latent Bridge tackles the inference bottleneck by predicting VLM output deltas to cut backbone calls by 50–75% with minimal performance loss. Meanwhile, Seeing Realism from Simulation addresses the data hunger problem by converting simulated VLA videos into realistic training data via conditional video transfer, improving RDT-1B and π₀ by 5–8%. Together, these three papers outline a complete pipeline: generate cheap sim data (Seeing Realism), train powerful open models (MolmoAct2), and deploy them efficiently (Latent Bridge).
A second major thread is the convergence of classical optimization with learning-based methods. OT-MPC replaces the information-theoretic foundations of MPPI/CEM with optimal transport to avoid mode-averaging in complex cost landscapes, while the NANO filter reframes Bayesian filtering through information geometry for exact natural-gradient updates on robot state estimation. On the manipulation side, PIEGraph fuses analytical spring-mass physics with equivariant GNNs for data-efficient deformable object dynamics, and ShapeGrasp iteratively refines object shape representations through visuo-haptic feedback during grasping. Both demonstrate that hybrid physics-plus-learning approaches outperform either paradigm alone.
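To make the mode-averaging issue concrete, here is a minimal sketch of the standard MPPI-style update (an exponentially weighted average of sampled controls) on a toy one-dimensional cost with two equally good minima. The cost function, sample grid, temperature, and all names are illustrative assumptions; this is not code from the OT-MPC paper and does not implement its optimal-transport update.

```python
import math

def cost(u):
    # Toy bimodal cost (assumption): two equally good minima at u = -1 and u = +1.
    return (u * u - 1.0) ** 2

def mppi_update(samples, temperature):
    # Standard MPPI-style update: softmin-weighted average of the sampled controls.
    costs = [cost(u) for u in samples]
    c_min = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    return sum(w * u for w, u in zip(weights, samples)) / total

# Deterministic, symmetric grid of candidate controls on [-2, 2].
samples = [(i - 100) / 50 for i in range(201)]

u_avg = mppi_update(samples, temperature=0.1)
u_best = min(samples, key=cost)

print("weighted-average control:", u_avg, "cost:", cost(u_avg))
print("best single sample:", u_best, "cost:", cost(u_best))
```

Because the two modes receive equal weight, the averaged control lands near u = 0, between the minima, where this toy cost is close to its local maximum, while the best individual sample sits on one of the minima. This is the failure case that motivates replacing the weighted average with a different update rule.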
Navigation research today emphasizes robustness across environmental conditions and sensor modalities. LTR² introduces the first cross-modal LiDAR-teach/radar-repeat system, validated over 40+ km across 6 months, while DynoSLAM embeds stochastic GNN-based pedestrian prediction directly into the SLAM factor graph. The procedural map generator study (Beyond Specialization) provides compelling evidence that training diversity, not architecture, is the primary determinant of navigation policy generalization: mixed-generator training achieves 91.5% success versus 3.3% for a sparse-only specialist tested on mazes.
Open VLAs, efficient inference, sim-to-real video transfer, and VLM-integrated navigation
Physics-augmented dynamics, visuo-haptic shape completion, desk organization, and mobile grasping
UAV planning, cross-modal teach-and-repeat, dynamic SLAM, and RL navigation generalization
Optimal transport MPC, adaptive aerial manipulation, geometry-aware filtering, and SE(3) derivatives
Monocular depth grounding, open-set segmentation, temporally consistent pose, and indoor scene synthesis
Affective touch, shared autonomy with impedance guidance, tensegrity crutches, and exoskeleton gait
RL generalizability analysis, sim-to-real for aquatic robots, multi-robot AoI optimization, and parallel manipulator kinematics